# Task 3: The impact of dimensionality reduction in classification

The [20 newsgroups](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups_vectorized.html) dataset is a high-dimensional text dataset available through scikit-learn and used primarily for classification problems. It includes 18846 news articles from 20 categories. The number of features is 130107, which can easily trigger the curse of dimensionality for many machine learning algorithms.

Apply PCA with various target dimensionalities (e.g. 50, 100, 500, 1000, 10000, and so on). Compare the performance of LogisticRegression, Random Forest Classifier and Multilayer Perceptron on both the reduced and the original feature spaces.

* Hint 1: Fetch the dataset in [vectorized format](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups_vectorized.html) and convert it by using the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).
* Hint 2: In the vast majority of cases, vectorized text datasets are stored as sparse vectors (namely, most of their components are zero). PCA is not well suited to such datasets, because it centers the data and centering destroys sparsity. Use scikit-learn's [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) instead. TruncatedSVD is similar to PCA, but it does not center the data before computing the singular value decomposition, so it can work with sparse matrices efficiently.
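To make Hint 2 concrete, the following minimal sketch uses a small random sparse matrix as a stand-in for the real document-term matrix (the matrix sizes and density are arbitrary assumptions chosen for illustration). It shows how centering fills in most of the zeros, while TruncatedSVD accepts the sparse input directly.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy sparse matrix standing in for a TF-IDF document-term matrix
# (200 "documents", 1000 "terms", about 1% non-zero entries).
X = sparse.random(200, 1000, density=0.01, format="csr", random_state=0)

# Centering subtracts each column mean, so every zero in a column that has
# at least one non-zero entry becomes non-zero: the result must be dense.
column_means = np.asarray(X.mean(axis=0)).ravel()
X_centered = X.toarray() - column_means

frac_before = X.nnz / (X.shape[0] * X.shape[1])
frac_after = np.count_nonzero(X_centered) / X_centered.size
print("non-zero fraction before centering:", frac_before)
print("non-zero fraction after centering: ", frac_after)

# TruncatedSVD skips the centering step and works on the sparse matrix as is.
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)
print("reduced shape:", X_reduced.shape)
```

On the real dataset the dense centered matrix would need roughly 18846 × 130107 floating-point entries, which is the practical reason to avoid it.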


In [1]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import TruncatedSVD as TSVD
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

In [2]:
def truncated(components):
    tsvd = TSVD(n_components=components, random_state=0)
    return tsvd.fit_transform(X_train_tf), tsvd.transform(X_test_tf)
    
def LogReg(log_reg,Xtrain,Xtest,Ytrain,Ytest):
    log_reg.fit(Xtrain, Ytrain)
    lr_pred = log_reg.predict(Xtest)
    return accuracy_score(Ytest, lr_pred)

def RandFor(rf,Xtrain,Xtest,Ytrain,Ytest):  
    rf.fit(Xtrain, Ytrain)
    rf_pred = rf.predict(Xtest)
    return accuracy_score(Ytest, rf_pred)  

def MLP_clas(mlp,Xtrain,Xtest,Ytrain,Ytest):
    mlp.fit(Xtrain, Ytrain)
    mlp_pred = mlp.predict(Xtest)
    return accuracy_score(Ytest, mlp_pred)
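The three helper functions above share the same body, so a single generic helper would likely be enough. The sketch below is one possible way to write it; `fit_and_score` is a hypothetical name, and the synthetic data from `make_classification` is only there so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Fit `model` on the training split and return its test accuracy."""
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in data (an assumption, not the 20 newsgroups dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Any scikit-learn classifier can be passed to the same helper.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
scores = {name: fit_and_score(m, X_tr, X_te, y_tr, y_te) for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```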

In [3]:
# Fetch each subset once instead of calling the loader separately for data and target
train = fetch_20newsgroups_vectorized(subset='train')
test = fetch_20newsgroups_vectorized(subset='test')
print("Data fetched successfully.")

X_train, Y_train = train.data, train.target
X_test, Y_test = test.data, test.target


tfidf = TfidfTransformer()
X_train_tf=tfidf.fit_transform(X_train)
X_test_tf=tfidf.transform(X_test)
print("Data transformed successfully.")


X_train_tf_50, X_test_tf_50 = truncated(50)
X_train_tf_100, X_test_tf_100 = truncated(100)
X_train_tf_500, X_test_tf_500 = truncated(500)
X_train_tf_1000, X_test_tf_1000 = truncated(1000)
print("Data truncated successfully.")

Data fetched successfully.
Data transformed successfully.
Data truncated successfully.
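When choosing among component counts such as 50, 100, 500 and 1000, it can help to look at how much variance each truncation retains. The sketch below is an illustration on a random sparse matrix (an assumption standing in for `X_train_tf`); it relies on the `explained_variance_ratio_` attribute of a fitted TruncatedSVD.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse stand-in for the TF-IDF training matrix.
X = sparse.random(300, 2000, density=0.02, format="csr", random_state=0)

# Fit once with the largest component count of interest.
svd = TruncatedSVD(n_components=100, random_state=0).fit(X)

# Cumulative share of the variance captured by the first k components.
cumulative = np.cumsum(svd.explained_variance_ratio_)
for k in (10, 50, 100):
    print(f"{k:>3} components retain {cumulative[k - 1]:.1%} of the variance")
```

On random data the retained share grows slowly; real text data usually concentrates more variance in the leading components, but that has to be checked on the actual matrix.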


In [4]:
# Logistic Regression
print("Beginning Logistic Regression models testing...")
log_reg = LogisticRegression()
accuracy_initial_lr = LogReg(log_reg,X_train_tf,X_test_tf,Y_train,Y_test)
accuracy_lr_50 = LogReg(log_reg,X_train_tf_50, X_test_tf_50 ,Y_train,Y_test)
accuracy_lr_100 = LogReg(log_reg,X_train_tf_100, X_test_tf_100 ,Y_train,Y_test)
accuracy_lr_500 = LogReg(log_reg,X_train_tf_500, X_test_tf_500 ,Y_train,Y_test)
accuracy_lr_1000 = LogReg(log_reg,X_train_tf_1000, X_test_tf_1000 ,Y_train,Y_test)
print ("Done")

Beginning Logistic Regression models testing...
Done


In [5]:
# Testing to find best parameters for Random Forest(50 features per entry)
estimators=[100,200,300,400,500]
final_estimator=100
temp=0
for estimator in estimators:
    rf = RandomForestClassifier(n_estimators=estimator)
    accuracy_rf_50 = RandFor(rf, X_train_tf_50, X_test_tf_50 ,Y_train,Y_test)
    print("Accuracy:",accuracy_rf_50, "( number of estimators:",estimator,")")
    if accuracy_rf_50 > temp:
        final_estimator=estimator
        temp = accuracy_rf_50

final_estimator

Accuracy: 0.6453797132235793 ( number of estimators: 100 )
Accuracy: 0.6590546999468933 ( number of estimators: 200 )
Accuracy: 0.6598513011152416 ( number of estimators: 300 )
Accuracy: 0.6633032395114179 ( number of estimators: 400 )
Accuracy: 0.6642326075411578 ( number of estimators: 500 )


500
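The search above picks `n_estimators` by comparing accuracies on the test set, which makes the final test score slightly optimistic. One common alternative is to select the parameter with cross-validation on the training data only. The following sketch illustrates this with `GridSearchCV` on synthetic data; the data and the grid values are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the 20 newsgroups dataset).
X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 3-fold cross-validation on the training split selects n_estimators;
# the test split is not touched during the search.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50, 100]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_n = search.best_params_["n_estimators"]
test_accuracy = search.score(X_te, y_te)
print("selected n_estimators:", best_n)
print("held-out test accuracy:", test_accuracy)
```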

In [6]:
# Random Forest 
print("Beginning Random Forest models testing...")
rf = RandomForestClassifier(n_estimators=final_estimator)

accuracy_initial_rf = RandFor(rf, X_train_tf,X_test_tf,Y_train,Y_test)
accuracy_rf_50 = RandFor(rf, X_train_tf_50, X_test_tf_50 ,Y_train,Y_test)
accuracy_rf_100 = RandFor(rf, X_train_tf_100, X_test_tf_100 ,Y_train,Y_test)
accuracy_rf_500 = RandFor(rf, X_train_tf_500, X_test_tf_500 ,Y_train,Y_test)
accuracy_rf_1000 = RandFor(rf, X_train_tf_1000, X_test_tf_1000 ,Y_train,Y_test)
print ("Done")

Beginning Random Forest models testing...
Done


In [7]:
# Testing to find best parameters for MLP classifier (50 features per entry)
rates = [0.001,0.002,0.01,0.1]
layers = [50,100,200,300,400]
final_parameters = [0,0]
temp = 0
for rate in rates:
    for layer in layers:
      mlp = MLPClassifier(hidden_layer_sizes=(layer,),max_iter=300, learning_rate_init=rate)
      accuracy_mlp_50 = MLP_clas(mlp,X_train_tf_50, X_test_tf_50 ,Y_train,Y_test)
      print("Accuracy:", accuracy_mlp_50,"(layers:",layer,", learning_rate:",rate,")")
      if accuracy_mlp_50 > temp:
        final_parameters = [rate , layer]
        temp = accuracy_mlp_50
final_parameters



Accuracy: 0.7110993096123208 (layers: 50 , learning_rate: 0.001 )
Accuracy: 0.7150823154540626 (layers: 100 , learning_rate: 0.001 )
Accuracy: 0.7247742963356346 (layers: 200 , learning_rate: 0.001 )
Accuracy: 0.7181359532660648 (layers: 300 , learning_rate: 0.001 )
Accuracy: 0.717870419543282 (layers: 400 , learning_rate: 0.001 )
Accuracy: 0.7132235793945831 (layers: 50 , learning_rate: 0.002 )
Accuracy: 0.7099044078597982 (layers: 100 , learning_rate: 0.002 )
Accuracy: 0.7140201805629315 (layers: 200 , learning_rate: 0.002 )
Accuracy: 0.7164099840679766 (layers: 300 , learning_rate: 0.002 )
Accuracy: 0.7168082846521509 (layers: 400 , learning_rate: 0.002 )
Accuracy: 0.6974243228890069 (layers: 50 , learning_rate: 0.01 )
Accuracy: 0.6783058948486458 (layers: 100 , learning_rate: 0.01 )
Accuracy: 0.6804301646309081 (layers: 200 , learning_rate: 0.01 )
Accuracy: 0.6955655868295274 (layers: 300 , learning_rate: 0.01 )
Accuracy: 0.6946362187997875 (layers: 400 , learning_rate: 0.01 )
Accuracy: 0.6795007966011684 (layers: 50 , learning_rate: 0.1 )
Accuracy: 0.6593202336696761 (layers: 100 , learning_rate: 0.1 )
Accuracy: 0.6625066383430696 (layers: 200 , learning_rate: 0.1 )
Accuracy: 0.661975570897504 (layers: 300 , learning_rate: 0.1 )
Accuracy: 0.6575942644715879 (layers: 400 , learning_rate: 0.1 )


[0.001, 200]

In [8]:
final_parameters

[0.001, 200]
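The nested loops over learning rates and layer sizes can also be written with scikit-learn's `ParameterGrid`, scoring each candidate on a validation split carved out of the training data rather than on the test set. The sketch below is a minimal illustration on synthetic data; the grid values and the split size are assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (not the 20 newsgroups dataset).
X, y = make_classification(n_samples=400, n_features=20, n_informative=6, random_state=0)

# Hold out part of the training data for model selection.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# ParameterGrid enumerates every combination of the listed values.
grid = ParameterGrid({
    "learning_rate_init": [0.001, 0.01],
    "hidden_layer_sizes": [(20,), (50,)],
})

best_score, best_params = -1.0, None
for params in grid:
    mlp = MLPClassifier(max_iter=300, random_state=0, **params)
    mlp.fit(X_fit, y_fit)
    score = mlp.score(X_val, y_val)  # accuracy on the validation split
    if score > best_score:
        best_score, best_params = score, params

print("best parameters:", best_params)
print("validation accuracy:", best_score)
```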

In [9]:
# MLP
print("Beginning MLP models testing...")
mlp = MLPClassifier(hidden_layer_sizes=(final_parameters[1],), max_iter= 300, learning_rate_init=final_parameters[0])
accuracy_initial_mlp = MLP_clas(mlp,X_train_tf,X_test_tf,Y_train,Y_test)
accuracy_mlp_50 = MLP_clas(mlp,X_train_tf_50, X_test_tf_50 ,Y_train,Y_test)
accuracy_mlp_100 = MLP_clas(mlp,X_train_tf_100, X_test_tf_100 ,Y_train,Y_test)
accuracy_mlp_500 = MLP_clas(mlp,X_train_tf_500, X_test_tf_500 ,Y_train,Y_test)
accuracy_mlp_1000 = MLP_clas(mlp,X_train_tf_1000, X_test_tf_1000 ,Y_train,Y_test)
print ("Done")

Beginning MLP models testing...
Done


In [10]:
# Print the number of features that gives the best accuracy score for every model.
# The scores on the original data get their own key, so they are not confused with the 50-component scores.
lr_scores={"130107 (original)":accuracy_initial_lr,"50":accuracy_lr_50,"100":accuracy_lr_100,"500":accuracy_lr_500,"1000":accuracy_lr_1000}
rf_scores={"130107 (original)":accuracy_initial_rf,"50":accuracy_rf_50,"100":accuracy_rf_100,"500":accuracy_rf_500,"1000":accuracy_rf_1000}
mlp_scores={"130107 (original)":accuracy_initial_mlp,"50":accuracy_mlp_50,"100":accuracy_mlp_100,"500":accuracy_mlp_500,"1000":accuracy_mlp_1000}

best_lr= max(lr_scores, key=lr_scores.get)
best_rf = max(rf_scores, key=rf_scores.get)
best_mlp = max(mlp_scores, key=mlp_scores.get)

print("Number of features for best score - Logistic Regression is", best_lr,"with accuracy equal to", lr_scores[best_lr])
print("Number of features for best score - Random Forest is", best_rf,"with accuracy equal to", rf_scores[best_rf])
print("Number of features for best score - MultiLayer Perceptron is", best_mlp,"with accuracy equal to", mlp_scores[best_mlp])

Number of features for best score - Logistic Regression is 130107 (original) with accuracy equal to 0.8274030801911842
Number of features for best score - Random Forest is 130107 (original) with accuracy equal to 0.7879713223579394
Number of features for best score - MultiLayer Perceptron is 130107 (original) with accuracy equal to 0.846388741370154


# Deliverables

You will provide your solutions in this notebook. Just place your code below each task. You must also provide comments on the obtained results.

Then:
1. Rename the notebook (i.e. the `.ipynb` file) by giving it your last and first names.
2. Go to _File » Download as » HTML (.html)_. Download the generated HTML file.
3. Compress both the `.ipynb` and `.html` files into a single `.zip` file. **Ensure that the `.zip` archive includes only these two files.**
4. Rename the `.zip` file and give it your last and first names. Finally, upload it to the e-learning platform.
