# Dimensionality Reduction


## The Curse of Dimensionality and Reasons for Dimensionality Reduction



## Dimensionality Reduction Using PCA (Principal Component Analysis)

To practice PCA, we first retrieve the familiar 20 newsgroups documents as shown below.



In [1]:
from sklearn.datasets import fetch_20newsgroups

# Create a list of topics you want to select from the 20 topics
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

# Load the training dataset
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)

In [2]:
# Check the size of the data
print(f"Number of documents: {len(newsgroups_train.data)}")

Number of documents: 2034


In [3]:
# Print the document at index 5 (the 6th document, since indexing starts at 0)
print(newsgroups_train.data[5])

From: Nanci Ann Miller <nm0w+@andrew.cmu.edu>
Subject: Re: Genocide is Caused by Atheism
Organization: Sponsored account, School of Computer Science, Carnegie Mellon, Pittsburgh, PA
Lines: 27
NNTP-Posting-Host: andrew.cmu.edu
In-Reply-To: <1993Apr5.020504.19326@ultb.isc.rit.edu>

snm6394@ultb.isc.rit.edu (S.N. Mozumder ) writes:
> More horrible deaths resulted from atheism than anything else.

There are definitely quite a few horrible deaths as the result of both
atheists AND theists.  I'm sure Bobby can list quite a few for the atheist
side but fails to recognize that the theists are equally proficient at
genocide.  Perhaps, since I'm a bit weak on history, somone here would like
to give a list of wars caused/led by theists?  I can think of a few (Hitler
claimed to be a Christian for example) but a more complete list would
probably be more effective in showing Bobby just how absurd his statement
is.

> Peace,

On a side note, I notice you always sign your posts "Peace".  Perhaps you
s

In [4]:
# Get the category index of the document at index 5
category_index = newsgroups_train.target[5]

In [5]:
# Get the category name using the index
category_name = newsgroups_train.target_names[category_index]

In [6]:
# Print the category name
print(f"Category: {category_name}")

Category: alt.atheism


### Remove parts of the email that may provide hints for classification, leaving only the content for pure classification


In [7]:

# Load the training dataset
newsgroups_train = fetch_20newsgroups(subset='train',
# Remove parts that provide hints (headers, footers, quotes) to classify purely based on the content
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
# Load the test dataset
newsgroups_test = fetch_20newsgroups(subset='test', 
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

As before, the documents are preprocessed with tokenization, stopword removal, and stemming, and are then converted into TF-IDF-weighted Bag of Words (BOW) feature vectors for document classification.

In [8]:
import nltk
nltk.download('stopwords')

X_train = newsgroups_train.data   # Training dataset documents
y_train = newsgroups_train.target # Training dataset labels

X_test = newsgroups_test.data     # Test dataset documents
y_test = newsgroups_test.target   # Test dataset labels

from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

# Raw string avoids an invalid-escape warning for \w; keeps tokens of 3+ word characters or apostrophes
RegTok = RegexpTokenizer(r"[\w']{3,}")
english_stops = set(stopwords.words('english')) # Load English stopwords
stemmer = PorterStemmer() # Create the stemmer once and reuse it for every token

def tokenizer(text):
    tokens = RegTok.tokenize(text.lower()) # Lowercase, then split into tokens
    # Exclude stopwords (the regex above already drops tokens shorter than 3 characters)
    words = [word for word in tokens if word not in english_stops]
    # Apply Porter Stemmer
    features = [stemmer.stem(word) for word in words]
    return features

tfidf = TfidfVectorizer(tokenizer=tokenizer)
X_train_tfidf = tfidf.fit_transform(X_train) # Fit on the training set and transform it
X_test_tfidf = tfidf.transform(X_test) # Transform the test set with the fitted vocabulary

[nltk_data] Downloading package stopwords to /home/minjoo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


To have a baseline for comparison, we first measure the classification performance before dimensionality reduction, using Scikit-learn's LogisticRegression class as shown below.

In [9]:
from sklearn.linear_model import LogisticRegression 

LR_clf = LogisticRegression()  # Declare the classifier
LR_clf.fit(X_train_tfidf, y_train)  # Train the classifier using the train data
print('#Train set score: {:.3f}'.format(LR_clf.score(X_train_tfidf, y_train)))
print('#Test set score: {:.3f}'.format(LR_clf.score(X_test_tfidf, y_test)))

#Train set score: 0.962
#Test set score: 0.761


1: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

- Scikit-learn provides the PCA class to support Principal Component Analysis.
- Parameters:
    - n_components: Specifies the number of dimensions (principal components) to keep after reduction.
    - svd_solver: The default is auto, which automatically selects the solver considering the original and target dimensions. If you don’t want to deal with this, you can leave it as the default.
- Attributes (available after fitting):
    - explained_variance_: The variance explained by each new axis.
    - explained_variance_ratio_: The ratio of the variance explained by each new axis to the total variance before reduction. -> If the new axes explain all the original variance, the sum of explained_variance_ratio_ will be 1.
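
The attributes listed above can be inspected on a small example. The following is a minimal sketch on made-up random data (the array `X_toy` and the name `pca_full` are illustrative, not part of the newsgroups workflow); it keeps every component, so the ratios should sum to 1.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 6 samples, 3 features
rng = np.random.RandomState(0)
X_toy = rng.rand(6, 3)

# Keep all 3 components so that no variance is discarded
pca_full = PCA(n_components=3, svd_solver='full')
pca_full.fit(X_toy)

print('Explained variance per axis:', pca_full.explained_variance_)
print('Explained variance ratio per axis:', pca_full.explained_variance_ratio_)
print('Sum of ratios: {:.3f}'.format(pca_full.explained_variance_ratio_.sum()))
```

The axes are ordered from the largest explained variance to the smallest, which is why truncating to the first few components discards the least informative directions first.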

- In the following example, the dimensionality is reduced from 20,085 to 2,000. The TF-IDF matrix has shape (2034, 20085): 2,034 is the number of documents, and 20,085 is the number of unique words (features, tokens).
    - In other words, the matrix is a two-dimensional array in which each of the 2,034 documents has a weight (TF-IDF value) for each of the 20,085 unique words.
    - Each value represents how important a particular word is in a particular document.
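
To make the documents-by-words layout concrete, here is a small sketch on three made-up sentences (the corpus `toy_docs` is an assumption for illustration, and the default tokenizer is used instead of the custom one above). Each row is a document and each column is a unique word.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not taken from the newsgroups data
toy_docs = [
    "the rocket reached orbit",
    "the graphics card renders images",
    "orbit images from the rocket",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)

print('Matrix shape (documents, unique words):', toy_matrix.shape)
print('Is sparse:', sparse.issparse(toy_matrix))
print('Vocabulary:', sorted(toy_tfidf.vocabulary_))
print(toy_matrix.toarray().round(2))  # Dense view is fine for a tiny matrix
```

Most entries are zero because each document contains only a few of the words, which is why the vectorizer returns a sparse matrix.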

In [10]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2000, random_state=7)
# Depending on the installed version, Scikit-learn's PCA may not accept sparse matrices (older releases raise an error for sparse input).
# In that case, you cannot directly pass the matrix produced by CountVectorizer or TfidfVectorizer as an argument.
# Therefore, as shown below, first convert it to a dense array with the `toarray()` method, and then pass it to `transform()` or `fit_transform()`.
X_train_pca = pca.fit_transform(X_train_tfidf.toarray()) 
X_test_pca = pca.transform(X_test_tfidf.toarray())

print('Original tfidf matrix shape:', X_train_tfidf.shape)  # (number of documents, number of features before reduction)
print('PCA Converted matrix shape:', X_train_pca.shape)
# After reduction, the sum of 'explained_variance_ratio_' is printed to see how much of the original variance is explained.
print('Sum of explained variance ratio: {:.3f}'.format(pca.explained_variance_ratio_.sum()))

Original tfidf matrix shape: (2034, 20085)
PCA Converted matrix shape: (2034, 2000)
Sum of explained variance ratio: 1.000


- As we can see from the results, the original dimensionality is 20,085, which is the number of words in the feature vector, and the reduced dimensionality is 2,000, as intended.
- Because the original number of dimensions is large and the matrix must be converted to a dense array of 2,034 x 20,085 values, the computation is intensive and takes a considerable amount of time.
- In return, the dimensionality was reduced to about 1/10 while the explained variance is still almost 100%, meaning there is minimal information loss. This is partly a consequence of the data shape: with only 2,034 training documents, the centered matrix has rank at most 2,033, so 2,000 components can capture nearly all of the variance.
- Let's examine the performance change by classifying the documents using the reduced data.
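
One reason the explained variance stays near 100% is that there are far fewer documents than features: after centering, the data has rank at most n_samples - 1, so that many components can already explain all of the variance. The sketch below illustrates this on made-up random data (the sizes 30 and 200 and the variable names are assumptions chosen only to keep the example fast).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(7)
n_samples, n_features = 30, 200  # Fewer samples than features, like 2,034 vs. 20,085
X_wide = rng.rand(n_samples, n_features)

# After centering, the rank is at most n_samples - 1 = 29
pca_wide = PCA(n_components=n_samples - 1, svd_solver='full')
X_wide_pca = pca_wide.fit_transform(X_wide)

print('Reduced shape:', X_wide_pca.shape)
print('Sum of explained variance ratio: {:.3f}'.format(
    pca_wide.explained_variance_ratio_.sum()))
```

Here 200 features are reduced to 29 without losing variance, purely because of the sample count, so a ratio near 1 does not by itself mean the original features were highly redundant.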

In [11]:
LR_clf.fit(X_train_pca, y_train)
print('#Train set score: {:.3f}'.format(LR_clf.score(X_train_pca, y_train)))
print('#Test set score: {:.3f}'.format(LR_clf.score(X_test_pca, y_test)))

#Train set score: 0.962
#Test set score: 0.760


- From the results above, we can confirm that the performance is practically the same as before dimensionality reduction (test accuracy of 0.760 versus 0.761).
- PCA builds each new axis as a linear combination of all the original features, chosen to retain as much variance as possible, so it can behave differently from feature selection, which simply discards features.

So, how does it compare with feature selection using Lasso (L1) regularization? To explore this, let's first train an L1-penalized logistic regression, referred to below as Lasso regression.

In [12]:
lasso_clf = LogisticRegression(penalty='l1', solver='liblinear', C=1)
lasso_clf.fit(X_train_tfidf, y_train)

# {:.3f} is used in Python's string formatting to display numbers up to 3 decimal places
print('#Train set score: {:.3f}'.format(lasso_clf.score(X_train_tfidf, y_train)))
print('#Test set score: {:.3f}'.format(lasso_clf.score(X_test_tfidf, y_test)))

import numpy as np
# Print the number of non-zero coefficients (features that were used)
print('#Used features count: {}'.format(np.sum(lasso_clf.coef_ != 0)), 'out of', X_train_tfidf.shape[1])

#Train set score: 0.790
#Test set score: 0.718
#Used features count: 321 out of 20085


In [13]:
value = 3.14159
print('{:.3f}'.format(value))

3.142


- From the results above, we can see that 321 coefficients remain non-zero out of 20,085 features (the count is taken over the coefficients of all four classes, so the number of distinct features used is at most 321), and the accuracy on the test set has dropped to 0.718, compared with 0.761 when all features are used.
- For comparison with PCA, let's set the target dimension to 321, the same as in Lasso regression, and train the model with the transformed data as shown below.
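
For a multiclass model, the coefficient matrix has one row per class, so counting all non-zero entries can count the same feature more than once. The sketch below shows both counts on synthetic data; the dataset from `make_classification`, the value `C=0.1`, and the explicit `OneVsRestClassifier` wrapper are assumptions made for illustration and are not the notebook's exact setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 4-class data (hypothetical stand-in for the TF-IDF matrix)
X_syn, y_syn = make_classification(n_samples=300, n_features=50,
                                   n_informative=8, n_classes=4,
                                   n_clusters_per_class=1, random_state=7)

# One L1-penalized binary classifier per class
ovr_clf = OneVsRestClassifier(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
ovr_clf.fit(X_syn, y_syn)

# Stack the per-class coefficient rows into a (n_classes, n_features) matrix
coef_matrix = np.vstack([est.coef_ for est in ovr_clf.estimators_])

nonzero_coefs = int(np.sum(coef_matrix != 0))                   # Counted per class
used_features = int(np.sum(np.any(coef_matrix != 0, axis=0)))   # Distinct features

print('#Non-zero coefficients:', nonzero_coefs)
print('#Distinct features used:', used_features, 'out of', X_syn.shape[1])
```

The distinct-feature count is the one to match when choosing a target dimension for a like-for-like comparison with PCA.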

In [14]:
pca = PCA(n_components=321, random_state=7)

X_train_pca = pca.fit_transform(X_train_tfidf.toarray())
X_test_pca = pca.transform(X_test_tfidf.toarray())
print('PCA Converted X shape:', X_train_pca.shape)
print('Sum of explained variance ratio: {:.3f}'.format(pca.explained_variance_ratio_.sum()))

LR_clf.fit(X_train_pca, y_train)
print('#Train set score: {:.3f}'.format(LR_clf.score(X_train_pca, y_train)))
print('#Test set score: {:.3f}'.format(LR_clf.score(X_test_pca, y_test)))

PCA Converted X shape: (2034, 321)
Sum of explained variance ratio: 0.437
#Train set score: 0.874
#Test set score: 0.752


- Despite the explained variance ratio dropping significantly to 43.7%, the accuracy on the test set decreased by only about one percentage point, from 76.1% to 75.2%.
- Therefore, with the same number of dimensions as in Lasso regression (321), the classifier performs noticeably better: 75.2% versus 71.8% on the test set.
- Now, let's be a little more ambitious and reduce the dimensionality to 100 as shown below.

In [15]:
pca = PCA(n_components=100, random_state=7)

X_train_pca = pca.fit_transform(X_train_tfidf.toarray())
X_test_pca = pca.transform(X_test_tfidf.toarray())
print('PCA Converted X shape:', X_train_pca.shape)
print('Sum of explained variance ratio: {:.3f}'.format(pca.explained_variance_ratio_.sum()))

LR_clf.fit(X_train_pca, y_train)
print('#Train set score: {:.3f}'.format(LR_clf.score(X_train_pca, y_train)))
print('#Test set score: {:.3f}'.format(LR_clf.score(X_test_pca, y_test)))

PCA Converted X shape: (2034, 100)
Sum of explained variance ratio: 0.211
#Train set score: 0.808
#Test set score: 0.738


- The accuracy on the test set is 73.8%, which is still better than the performance of the Lasso regression with 321 features.
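
Instead of trying target dimensions one at a time, the cumulative explained variance can suggest a value for n_components. This is a minimal sketch on made-up data; the array `X_demo` and the 90% threshold `target` are assumptions for illustration, and as the results above show, explained variance is only a rough guide to classification accuracy.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(7)
# Hypothetical data: 100 samples, 40 features with decreasing spread per feature
X_demo = rng.randn(100, 40) * np.linspace(5.0, 0.1, 40)

# n_components is left unset, so all components are kept
pca_all = PCA(svd_solver='full').fit(X_demo)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)

target = 0.90  # Assumed threshold for the share of variance to retain
# Index of the first component at which the cumulative ratio reaches the target
k = int(np.searchsorted(cumulative, target)) + 1

print('Components needed for {:.0%} of the variance: {}'.format(target, k))
print('Cumulative ratio at k: {:.3f}'.format(cumulative[k - 1]))
```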

## Dimensionality Reduction and Meaning Extraction Using LSA


### Dimensionality Reduction and Performance Using LSA


Why do we use .toarray() when performing PCA, but not when applying LSA?
- Depending on the installed version, Scikit-learn's PCA may not handle sparse matrices directly, meaning matrices generated by CountVectorizer or TfidfVectorizer cannot be used as is. In that case, you first need to convert the sparse matrix into a standard numpy array with the toarray() method.
- On the other hand, TruncatedSVD, which is used for LSA, can process sparse matrices directly.
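
The difference can be sketched on a small random sparse matrix (the matrix `X_sp` and the component count of 5 are assumptions for illustration). TruncatedSVD accepts the sparse input as is because it does not center the data, whereas PCA centers each feature first; applying TruncatedSVD to manually centered data should therefore give the same singular values as PCA.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD

# Hypothetical sparse matrix: 50 rows, 30 columns, about 10% non-zero entries
X_sp = sparse.random(50, 30, density=0.1, format='csr', random_state=7)

# TruncatedSVD works on the sparse matrix directly, without toarray()
svd_demo = TruncatedSVD(n_components=5, algorithm='arpack', random_state=7)
X_sp_lsa = svd_demo.fit_transform(X_sp)
print('LSA shape:', X_sp_lsa.shape)

# PCA centers the data; centering a sparse matrix makes it dense
X_dense = X_sp.toarray()
X_centered = X_dense - X_dense.mean(axis=0)

svd_centered = TruncatedSVD(n_components=5, algorithm='arpack',
                            random_state=7).fit(X_centered)
pca_demo = PCA(n_components=5, svd_solver='full').fit(X_dense)

same = np.allclose(svd_centered.singular_values_,
                   pca_demo.singular_values_, atol=1e-6)
print('Same singular values after centering:', same)
```

Skipping the centering step is what lets TruncatedSVD keep the input sparse, which saves both memory and time on large TF-IDF matrices.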

In [16]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2000, random_state=7)  # Specify the number of components to compress
X_train_lsa = svd.fit_transform(X_train_tfidf)
X_test_lsa = svd.transform(X_test_tfidf)

print('LSA Converted X shape:', X_train_lsa.shape)
# Print the sum of the explained variance ratio to see how much variance is explained by the selected components
print('Sum of explained variance ratio: {:.3f}'.format(svd.explained_variance_ratio_.sum()))

LR_clf.fit(X_train_lsa, y_train)  # Train the classifier on the LSA-transformed data
print('#Train set score: {:.3f}'.format(LR_clf.score(X_train_lsa, y_train)))
print('#Test set score: {:.3f}'.format(LR_clf.score(X_test_lsa, y_test)))

LSA Converted X shape: (2034, 2000)
Sum of explained variance ratio: 1.000
#Train set score: 0.962
#Test set score: 0.761


In [17]:
svd = TruncatedSVD(n_components=100, random_state=1)  # Specify the number of components to compress
X_train_lsa = svd.fit_transform(X_train_tfidf)
X_test_lsa = svd.transform(X_test_tfidf)

print('LSA Converted X shape:', X_train_lsa.shape)
# Print the sum of the explained variance ratio to see how much variance is explained by the selected components
print('Sum of explained variance ratio: {:.3f}'.format(svd.explained_variance_ratio_.sum()))

LR_clf.fit(X_train_lsa, y_train)  # Train the classifier on the LSA-transformed data
print('#Train set score: {:.3f}'.format(LR_clf.score(X_train_lsa, y_train)))
print('#Test set score: {:.3f}'.format(LR_clf.score(X_test_lsa, y_test)))

LSA Converted X shape: (2034, 100)
Sum of explained variance ratio: 0.209
#Train set score: 0.811
#Test set score: 0.743
