# Task 3: The impact of dimensionality reduction in classification

The [20newsgroup](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups_vectorized.html) dataset is a high-dimensional text dataset available through scikit-learn and used primarily in classification problems. It includes 18846 news articles from 20 categories. The number of features is 130107, which is large enough to trigger the curse of dimensionality for many machine learning algorithms.

Apply PCA for various sizes of the input space (e.g. 50, 100, 500, 1000, 10000, and so on). Compare the performance of LogisticRegression, Random Forest Classifier and Multilayer Perceptron on both the reduced and the original dimensional spaces.

* Hint 1: Fetch the dataset in [vectorized format](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups_vectorized.html) and convert it by using the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).
* Hint 2: In the vast majority of cases, vectorized text datasets are stored as sparse vectors (namely, most of their components are zero). PCA centers the data first, which would turn a sparse matrix into a dense one, so it is not practical for such datasets. Use scikit-learn's [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) instead. TruncatedSVD is similar to PCA; however, it does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently.
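
To illustrate Hint 2, here is a minimal sketch on a small random sparse matrix (a hypothetical stand-in for the TF-IDF matrix, not the real dataset). It shows that TruncatedSVD accepts sparse input directly, and that centering the columns, as PCA would, fills in most of the zero entries.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for a TF-IDF matrix: 200 documents, 1000 features, 1% non-zero.
X = sparse.random(200, 1000, density=0.01, format="csr", random_state=0)

# TruncatedSVD works on the sparse matrix as-is (no centering step).
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# Centering subtracts the column means, so zeros become non-zero values.
col_means = np.asarray(X.mean(axis=0))
X_centered = X.toarray() - col_means

frac_before = X.nnz / (X.shape[0] * X.shape[1])
frac_after = np.count_nonzero(X_centered) / X_centered.size
print("non-zero fraction before centering:", frac_before)
print("non-zero fraction after centering:", frac_after)
```

On the real 11314 x 130107 matrix the centered version would have to be stored densely, which is why the sparse-friendly decomposition is used throughout this notebook.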


https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date.
This module contains two loaders. The first one, <b>sklearn.datasets.fetch_20newsgroups</b>, returns a list of the raw texts that can be fed to text feature extractors such as <b>CountVectorizer</b> with custom parameters so as to extract feature vectors. The second one, <b>sklearn.datasets.fetch_20newsgroups_vectorized</b>, returns ready-to-use features, i.e., it is not necessary to use a feature extractor.
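
As a small illustration of the feature-extraction step, the sketch below uses a hypothetical three-document corpus (not the newsgroups data) to show that counting words with `CountVectorizer` and then applying `TfidfTransformer` gives the same matrix as `TfidfVectorizer` in one step, which is the route taken in the cells that follow.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the hockey team won the game",
    "the rocket reached orbit",
    "graphics cards render the game",
]

# Two steps: raw term counts, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both with default settings.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print("identical results:", same)
```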

In [None]:
from sklearn.datasets import fetch_20newsgroups # import 20newsgroups dataset, a list of the raw texts
print(type(fetch_20newsgroups)) # <class 'function'>
fetch_20newsgroups

In [None]:
# Stripping headers, footers and quotes gives a lower but more realistic F-score:
# newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories)
newsgroups_train = fetch_20newsgroups(subset='train') # store features train dataset
print(type(newsgroups_train))
print(len(newsgroups_train.data))
newsgroups_train     #11314 articles/samples of the train dataset

In [None]:
newsgroups_test = fetch_20newsgroups(subset='test') # store features test dataset
print(type(newsgroups_test))
print(len(newsgroups_test.data))    
newsgroups_test   # 7532 articles/samples of the test dataset

In [None]:
print(type(newsgroups_train.target_names))   
print(newsgroups_train.target_names) # map numbers to categories  (20 classes-topics of the train dataset)
print(newsgroups_train.target.shape) # (11314,)
print(newsgroups_train.target[:10])  # 10 articles of the train dataset with their corresponding category   
print("\n11314 articles of the train dataset with their corresponding category")
print(list(newsgroups_train.target[:11314]))  # 11314 articles of the train dataset with their corresponding category

In [None]:
# Converting text to vectors of numerical values suitable for statistical analysis
from sklearn.feature_extraction.text import TfidfVectorizer # tf-idf weights a word's frequency in a document by how rare the word is across all documents
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(newsgroups_train.data)
print(type(X_train)) # scipy.sparse.csr.csr_matrix
print(X_train.shape) # samples-articles (rows): 11314, distinct words/features (columns): 130107
y_train = newsgroups_train.target  # 11314 articles of the train dataset with their corresponding category
print(type(y_train)) # <class 'numpy.ndarray'>
print(y_train.shape) # (11314,)
print("\nX_train=",X_train)
print("\ny_train=",y_train)

In [None]:
X_test = tfidf.transform(newsgroups_test.data)
print(type(X_test)) # <class 'scipy.sparse.csr.csr_matrix'>
print(X_test.shape) # (7532 posts, 130107 distinct words)
y_test = newsgroups_test.target
print(type(y_test)) # <class 'numpy.ndarray'>
print(y_test.shape) # (7532,)
print("\nX_test=",X_test)
print("\ny_test=",y_test)

In [None]:
print(len(tfidf.vocabulary_))
tfidf.vocabulary_ # mapping of each word to its column (feature) index

In [None]:
# The extracted TF-IDF vectors are very sparse
print(X_train.nnz) # number of stored non-zero entries
print(round(X_train.nnz / float(X_train.shape[0]))) # avg of 158 non-zero components per sample (post) in the 130107-dimensional space

In [None]:
import numpy as np
np.asarray(np.unique(y_train, return_counts=True)).T # [Class:#News] (20 categories with their corresponding # of articles)

### Ready to run our models

In [None]:
# https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py
import numpy as np, pandas as pd, os, sys, scipy.io, logging
from optparse import OptionParser
from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline, Pipeline
import warnings
warnings.filterwarnings("ignore")

## Truncated Singular Value Decomposition (TSVD)

In [None]:
from sklearn.decomposition import TruncatedSVD as TSVD    # Dimensionality reduction using truncated SVD

## Dimensional reduction to 50, 100, 500, 1000, 10000, 20000, 50000, 100000 (TSVD)

In [None]:
%%time
scores = []
for c in [50,100]: #,500,1000,10000, 20000, 50000, 100000
    print("\nFor TSVD({}):".format(c))
    # run many diff Truncated Singular Value Decomposition (TSVD)
    tsvd = TSVD(n_components=c, random_state=0)  # dimensional reduction: project the 130,107 columns onto c components
    X_train_tsvd = tsvd.fit_transform(X_train)
    X_test_tsvd = tsvd.transform(X_test)         # perform dimensionality reduction to the X_test dataset
    print('Shape of X train is',X_train.shape, '\nReduced shape of X_train is',X_train_tsvd.shape) # original vs reduced dimensional space on X_train
    print('\nThe shape of y_train is: ',y_train.shape)
    print('\nShape of X_test is',X_test.shape, '\nReduced shape of X_test is',X_test_tsvd.shape) # original vs reduced dimensional space on X_test   
    print('\nThe shape of y_test is: ',y_test.shape)
    print('\nTotal variance of X_train is',sum(tsvd.explained_variance_))  # total variance of the training data after projection onto the c components
    print('Total percentage variance of X_train is',sum(tsvd.explained_variance_ratio_))  # fraction of the original variance retained by the c components
    print('\nVariance of the training samples=',tsvd.explained_variance_)     # variance of the training samples along each component
    print('\nPercentage variance of the training samples=',tsvd.explained_variance_ratio_)  # fraction of variance explained by each of the selected components
    scores.append(tsvd.explained_variance_ratio_.sum()) # store all scores
print('\nFraction of variance retained on the train subset for each tested c is',scores)

#MemoryError: Unable to allocate 19.4 GiB for an array with shape (130107, 20010) and data type float64
#MemoryError: Unable to allocate 48.5 GiB for an array with shape (130107, 50010) and data type float64

### Run a simple classifier

### Logistic Regression
**Regularization**:
- prevention of overfitting - (according to Muller and Guido ML book)
- L1 - assumes only a few features are important
- L2 - does not assume only a few features are important - used by default in scikit-learn LogisticRegression
               
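**'C'**:
- parameter to control the strength of regularization; it is the inverse of the regularization strength, so a smaller C means stronger regularization
- lower C => log_reg adjusts to the majority of data points.
- higher C => log_reg tries to classify each individual data point correctly (weaker regularization, higher risk of overfitting).

C is inversely related to the alpha parameter used by other linear models: a large alpha corresponds to a small C.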
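
A minimal sketch of this effect, assuming a small synthetic dataset from `make_classification` (not the newsgroups data): as C grows, the L2 penalty weakens and the learned coefficients are allowed to become larger.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, used only to illustrate the role of C.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # The L2 norm of the coefficients shrinks under strong regularization.
    norms[C] = np.linalg.norm(clf.coef_)
    print("C={:>6}: ||coef|| = {:.3f}, train accuracy = {:.3f}".format(
        C, norms[C], clf.score(X, y)))
```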

Defaults of Logistic Regression (penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

In [None]:
%%time

### Logistic Regression without dimensional reduction
log_reg = LogisticRegression() # penalty='l2', C=1, solver='lbfgs', max_iter=100
log_reg.fit(X_train, y_train)
lr_pred = log_reg.predict(X_test)  #predicted values for x_test
print('The accuracy score for logistic regression is: ',accuracy_score(y_test, lr_pred))   #y_true,Y_predicted
print('\nThe predicted values of the X_test is: ',lr_pred)
print('\nThe shape of the predicted values for the X_test is: ',lr_pred.shape)

In [None]:
print('Accuracy on the training subset: {:.3f}'.format(log_reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg.score(X_test, y_test)))

In [None]:
report=classification_report(y_test, lr_pred)
print(report)

<b>Logistic Regression with dimensional reduction to 50,100,500,1000, 10000 columns</b>

In [None]:
%%time
#from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
classifier = LogisticRegression()
scores = []
for c in [50, 100]: #, 500, 1000, 10000, 20000, 50000, 100000
    print("For TSVD({}):".format(c))
    # run many diff Truncated Singular Value Decomposition (TSVD)
    tsvd = TSVD(n_components=c, random_state=0)                    #dimensional reduction 
    X_train_tsvd = tsvd.fit_transform(X_train)
    X_test_tsvd = tsvd.transform(X_test)
    ### Logistic Regression with dimensional reduction     
    log_reg = LogisticRegression() # penalty='l2', C=1, solver='lbfgs', max_iter=100
    log_reg.fit(X_train_tsvd, y_train)
    lr_pred = log_reg.predict(X_test_tsvd)   #predicted values for x_test
    # accuracy_score(y_test, lr_pred)
    print('Accuracy on the training subset: {:.3f}'.format(log_reg.score(X_train_tsvd, y_train)))
    print('Accuracy on the test subset: {:.3f}'.format(log_reg.score(X_test_tsvd, y_test)))
    scores.append(log_reg.score(X_test_tsvd, y_test)) # store all scores
    #print(confusion_matrix(y_test, lr_pred))
    report=classification_report(y_test, lr_pred)
    print(report)
    
print('\nAccuracy on the test subset for each tested c is',scores)

In [None]:
#Storing the statistics
print('Logistic regression with dimensional reduction')
# c: (accuracy, average precision, average recall, average F1-score) on the test subset
recorded = {
    50: (0.67, 0.67, 0.66, 0.65),
    100: (0.72, 0.72, 0.71, 0.71),
    500: (0.79, 0.79, 0.78, 0.78),
    1000: (0.80, 0.81, 0.79, 0.79),
    10000: (0.83, 0.83, 0.82, 0.82),
}
for c, (acc, prec, rec, f1) in recorded.items():
    print('\nFor TSVD({}):'.format(c))
    print('Accuracy on the test subset: {:.2f}'.format(acc))
    print('Average precision between true and predicted values for the test subset: {:.2f}'.format(prec))
    print('Average recall between true and predicted values for the test subset: {:.2f}'.format(rec))
    print('Average F1-score between true and predicted values for the test subset: {:.2f}'.format(f1))

Comparing the averages of Precision, Recall, F1-score and Accuracy, it can be concluded that the metrics improve as the number of dimensions increases. Moreover, the statistics of logistic regression on the original features and on the reduction to 10,000 dimensions are similar, so there is no significant loss of information. Logistic regression on the 10,000-dimensional reduction is therefore a good representation of the full model.
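
The averaged metrics compared here can also be collected programmatically instead of being copied by hand from the printed report. The sketch below uses hypothetical toy labels and assumes the macro average is the one of interest; `classification_report` also provides a weighted average under the key `'weighted avg'`.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# output_dict=True returns the report as nested dictionaries.
report = classification_report(y_true, y_pred, output_dict=True)

summary = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": report["macro avg"]["precision"],
    "recall": report["macro avg"]["recall"],
    "f1": report["macro avg"]["f1-score"],
}
for name, value in summary.items():
    print("{}: {:.2f}".format(name, value))
```

Storing one such `summary` per value of c would give a table that can be compared across classifiers directly.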

### Random Forest

In [None]:
%%time
### Random Forest with no dimensional reduction
rf = RandomForestClassifier() #defaults: n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)    #predicted values for x_test
print('Accuracy between the true and predicted values for test subject is: ',accuracy_score(y_test, rf_pred))

In [None]:
print('Accuracy on the training subset: {:.3f}'.format(rf.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(rf.score(X_test, y_test)))

In [None]:
report=classification_report(y_test, rf_pred)
print(report)

<b>Random Forest with dimensional reduction to 50,100,500,1000, 10000 columns</b>

In [None]:
%%time
# run with Truncated Singular Value Decomposition (TSVD)
classifier = RandomForestClassifier()
scores = []
for c in [50,100]: #, 500, 1000, 10000, 20000, 50000, 100000
    print("For TSVD({}):".format(c))
    # run many diff Truncated Singular Value Decomposition (TSVD)
    tsvd = TSVD(n_components=c, random_state=0)                    #dimensional reduction 
    X_train_tsvd = tsvd.fit_transform(X_train)
    X_test_tsvd = tsvd.transform(X_test)
    ### Random Forest with dimensional reduction     
    rf = RandomForestClassifier()  # defaults: n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1
    rf.fit(X_train_tsvd, y_train)
    rf_pred = rf.predict(X_test_tsvd) #predicted values for x_test
    # accuracy_score(y_test, lr_pred)
    print('Accuracy on the training subset: {:.3f}'.format(rf.score(X_train_tsvd, y_train)))
    print('Accuracy on the test subset: {:.3f}'.format(rf.score(X_test_tsvd, y_test)))
    scores.append(rf.score(X_test_tsvd, y_test)) # store all scores
    report=classification_report(y_test, rf_pred)
    print(report)   
print('\nAccuracy on the test subset for each tested c is',scores)

In [None]:
#Storing the statistics
print('Random Forest with dimensional reduction')
# c: (accuracy, average precision, average recall, average F1-score) on the test subset
recorded = {
    50: (0.65, 0.65, 0.64, 0.64),
    100: (0.67, 0.68, 0.66, 0.66),
    500: (0.67, 0.68, 0.66, 0.66),
    1000: (0.67, 0.68, 0.66, 0.66),
    10000: (0.58, 0.59, 0.57, 0.56),
}
for c, (acc, prec, rec, f1) in recorded.items():
    print('\nFor TSVD({}):'.format(c))
    print('Accuracy on the test subset: {:.2f}'.format(acc))
    print('Average precision between true and predicted values for the test subset: {:.2f}'.format(prec))
    print('Average recall between true and predicted values for the test subset: {:.2f}'.format(rec))
    print('Average F1-score between true and predicted values for the test subset: {:.2f}'.format(f1))

The interesting point in this case is that when the dimensions are reduced to 100, 500 and 1,000, the score is still good enough (accuracy of 0.67, hence just a 10% loss compared to the original dimensions), considering the massive reduction in the number of features.

### Multi-layer Perceptron

In [None]:
%%time
# https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
# mlp = MLPClassifier(solver='adam', activation='relu',alpha=1e-4,hidden_layer_sizes=(100,), random_state=1,max_iter=200,verbose=10,learning_rate_init=.001)
mlp = MLPClassifier(random_state=1, hidden_layer_sizes=(50,))
mlp.fit(X_train, y_train)
mlp_pred = mlp.predict(X_test)
print(accuracy_score(y_test, mlp_pred))
print (mlp.n_layers_)
print (mlp.n_iter_)
print (mlp.loss_)

In [None]:
print('Accuracy on the training subset: {:.3f}'.format(mlp.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(mlp.score(X_test, y_test)))

In [None]:
report=classification_report(y_test, mlp_pred)
print(report)

In [None]:
%%time
# mlp = MLPClassifier(solver='adam', activation='relu',alpha=1e-4,hidden_layer_sizes=(100,), random_state=1,max_iter=200,verbose=10,learning_rate_init=.001)
classifier = MLPClassifier()
scores = []
for c in [50, 100]: #, 500, 1000, 10000,  20000, 50000, 100000
    print("For TSVD({}):".format(c))
    # run many diff Truncated Singular Value Decomposition (TSVD)
    tsvd = TSVD(n_components=c, random_state=0)   #dimensional reduction 
    X_train_tsvd = tsvd.fit_transform(X_train)
    X_test_tsvd = tsvd.transform(X_test)
    
    mlp = MLPClassifier(random_state=1, hidden_layer_sizes=(50,))
    mlp.fit(X_train_tsvd, y_train)
    mlp_pred = mlp.predict(X_test_tsvd)
    print(accuracy_score(y_test, mlp_pred))
    print('Accuracy on the training subset: {:.3f}'.format(mlp.score(X_train_tsvd, y_train)))
    print('Accuracy on the test subset: {:.3f}'.format(mlp.score(X_test_tsvd, y_test)))
    scores.append(mlp.score(X_test_tsvd, y_test)) # store all scores
    report=classification_report(y_test,mlp_pred)
    print(report)
print('\nAccuracy on the test subset for each tested c is',scores)

In [None]:
#Storing the statistics
print('Multi-layer Perceptron with dimensional reduction')
# c: (accuracy, average precision, average recall, average F1-score) on the test subset
recorded = {
    50: (0.71, 0.71, 0.70, 0.70),
    100: (0.75, 0.74, 0.74, 0.74),
    500: (0.78, 0.78, 0.78, 0.78),
    1000: (0.80, 0.80, 0.80, 0.80),
    10000: (0.84, 0.84, 0.83, 0.83),
}
for c, (acc, prec, rec, f1) in recorded.items():
    print('\nFor TSVD({}):'.format(c))
    print('Accuracy on the test subset: {:.2f}'.format(acc))
    print('Average precision between true and predicted values for the test subset: {:.2f}'.format(prec))
    print('Average recall between true and predicted values for the test subset: {:.2f}'.format(rec))
    print('Average F1-score between true and predicted values for the test subset: {:.2f}'.format(f1))

Comparing the averages of Precision, Recall, F1-score and Accuracy, it can be concluded that the metrics improve as the number of dimensions increases. With 10,000 features, the accuracy gets close to that obtained on the original dataset.

<b>These final metrics are better than the previous ones, so the best model so far is the Multi-layer Perceptron with dimensional reduction.</b>

In [None]:
%%time
classifier = MLPClassifier()
scores = []
for c in [50, 100]: #, 500, 1000, 10000,  20000, 50000, 100000
    print("For TSVD({}):".format(c))
    # run many diff Truncated Singular Value Decomposition (TSVD)
    tsvd = TSVD(n_components=c, random_state=0)   #dimensional reduction 
    X_train_tsvd = tsvd.fit_transform(X_train)
    X_test_tsvd = tsvd.transform(X_test)
    
    # run multi-submodels for same TSVD(c)
    pipe = Pipeline([('classifier', classifier)])
    params = {
    'classifier__hidden_layer_sizes': [(100,), (100, 10)], # alternative architectures to try; (80, 15) is another option
    # 'classifier__activation': ['relu', 'tanh', 'logistic'],
    'classifier__learning_rate_init': [0.001, 0.01] # 0.0001,                     
    # 'classifier__solver': ['sgd', 'adam']
    }
    grid = GridSearchCV(pipe, params, cv=2, verbose=1, n_jobs=8)
    print(grid.fit(X_train_tsvd, y_train))
    grid_preds = grid.predict(X_test_tsvd)
    # print(grid_preds)
    print()
    print("accuracy_score:", accuracy_score(y_test, grid_preds))
    # print(grid.best_estimator_), print("")
    print("best_params_:", grid.best_params_)
    # print(pipe.steps)
    means = grid.cv_results_['mean_test_score']
    for mean, tried_params in zip(means, grid.cv_results_['params']):  # separate name so the params grid above is not shadowed
        print('%0.3f for %r' % (mean, tried_params))
        scores.append((mean, tried_params)) # store all scores
    report=classification_report(y_test, grid_preds)
    print(report)    

In [None]:
#Storing the statistics
print('Multi-layer Perceptron with dimensional reduction,pipeline and grid search')
# c: (accuracy, average precision, average recall, average F1-score) on the test subset
recorded = {
    50: (0.71, 0.71, 0.70, 0.70),
    100: (0.75, 0.74, 0.74, 0.74),
    500: (0.79, 0.78, 0.78, 0.78),
    1000: (0.80, 0.79, 0.79, 0.79),
    10000: (0.73, 0.76, 0.72, 0.73),
}
for c, (acc, prec, rec, f1) in recorded.items():
    print('\nFor TSVD({}):'.format(c))
    print('Accuracy on the test subset: {:.2f}'.format(acc))
    print('Average precision between true and predicted values for the test subset: {:.2f}'.format(prec))
    print('Average recall between true and predicted values for the test subset: {:.2f}'.format(rec))
    print('Average F1-score between true and predicted values for the test subset: {:.2f}'.format(f1))

The Multi-layer Perceptron with dimensional reduction, pipeline and grid search does not produce better results. It seems that a single hidden layer of 50 units is a good enough setting for this data set.

<b>Comparing all models, the best model under dimensional reduction is the Multi-layer Perceptron with one hidden layer of 50 units.</b>
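
In the grid search above, the pipeline contains only the classifier and the reduction is fitted outside it. One possible alternative, sketched here on a small synthetic sparse dataset (an assumption for illustration, not the newsgroups data), is to place `TruncatedSVD` inside the `Pipeline`, so that the reduction is refitted on each cross-validation training fold and `n_components` can be searched together with the classifier settings.

```python
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Small synthetic 3-class problem stored as a sparse matrix.
X_dense, y = make_classification(
    n_samples=200, n_features=60, n_informative=10, n_classes=3, random_state=0
)
X = sparse.csr_matrix(X_dense)

# Reduction and classifier live in one estimator.
pipe = Pipeline([
    ("tsvd", TruncatedSVD(random_state=0)),
    ("classifier", MLPClassifier(max_iter=300, random_state=1)),
])

# Step names prefix the parameter names: <step>__<parameter>.
params = {
    "tsvd__n_components": [5, 20],
    "classifier__hidden_layer_sizes": [(50,), (100,)],
}

grid = GridSearchCV(pipe, params, cv=2)
grid.fit(X, y)
print("best_params_:", grid.best_params_)
print("best cross-validated accuracy: {:.3f}".format(grid.best_score_))
```

The trade-off is cost: the decomposition is recomputed for every fold and parameter combination, which may be prohibitive for large values of `n_components` on the full dataset.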

The above results with reduction to 10,000 components were only obtained after making more memory available (by enlarging the pagefile on Windows). Without that extra memory, running with 10,000 or more components runs out of memory and the above computations cannot be completed.
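
The memory figures can be estimated in advance. The sketch below assumes that the array shapes reported in the MemoryError messages, (130107, 20010) and (130107, 50010), mean one dense float64 array of shape (n_features, n_components + 10) is allocated; the 10 extra columns are inferred from those shapes, and this counts a single array only, not the total memory used.

```python
N_FEATURES = 130107       # number of TF-IDF features in this notebook
EXTRA_COLUMNS = 10        # assumption: inferred from the shapes in the MemoryError messages
BYTES_PER_FLOAT64 = 8


def dense_gib(n_components, n_features=N_FEATURES):
    """Size in GiB of one float64 array of shape (n_features, n_components + EXTRA_COLUMNS)."""
    n_columns = n_components + EXTRA_COLUMNS
    return n_features * n_columns * BYTES_PER_FLOAT64 / 2**30


for c in [1000, 10000, 20000, 50000]:
    print("n_components={:>6}: {:.1f} GiB".format(c, dense_gib(c)))
```

For 20,000 and 50,000 components this reproduces the 19.4 GiB and 48.5 GiB reported in the error messages.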