# PCA to speed up a classification problem

Shenyue Jia | [jiashenyue.info](https://jiashenyue.info/)

## Task

- Perform PCA to speed up a classification algorithm on a high-dimensional dataset
- Use the [MNIST digits dataset](https://en.wikipedia.org/wiki/MNIST_database), which contains 28x28-pixel images of handwritten digits 0-9.
- Use PCA to lower the dimensions in this dataset while retaining 95% of the variance

## Load data

In [14]:
# load libraries
# data loading
from keras.datasets import mnist
# machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [15]:
# get training and testing data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

print(f'Dimension of X_train = {X_train.shape}')

Dimension of X_train = (60000, 28, 28)


In [16]:
# reshape data
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

print(f'After reshaping, X_train has {X_train.shape[0]} rows, {X_train.shape[1]} columns.')

After reshaping, X_train has 60000 rows, 784 columns.
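
The reshape above keeps the first axis (one row per image) and flattens the remaining axes into a single feature axis. As a minimal sketch of the same call on a toy array (the 2x3 "images" here are a made-up stand-in for the 28x28 digits):

```python
import numpy as np

# Two tiny fake "images" of 2x3 pixels each (hypothetical stand-in for 28x28 digits)
toy_images = np.arange(12).reshape(2, 2, 3)

# Keep the sample axis, let -1 infer the flattened feature length (2 * 3 = 6)
toy_flat = toy_images.reshape(toy_images.shape[0], -1)

print(toy_flat.shape)  # (2, 6)
print(toy_flat[0])     # [0 1 2 3 4 5]
```

Each row of the flattened array holds the pixels of one image in row-major order, which is the layout scikit-learn estimators expect.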


## Prepare data

Create a preprocessing pipeline that will:

- Scale the data
- Apply PCA

In [17]:
# create a scaler
scaler = StandardScaler()
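
Some MNIST pixels, especially near the image border, are likely to have the same value in every training image. The sketch below, on made-up toy data, suggests how `StandardScaler` treats such a zero-variance column: it keeps a scale of 1 for it rather than dividing by zero. The names `toy_X` and `toy_scaler` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): column 0 varies, column 1 is constant,
# similar to a border pixel that is always 0
toy_X = np.array([[0.0, 0.0],
                  [2.0, 0.0],
                  [4.0, 0.0]])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_X)

print(toy_scaler.mean_)   # per-column means
print(toy_scaler.scale_)  # per-column standard deviations (1.0 for the constant column)
print(toy_scaled)         # varying column standardized, constant column all zeros
```

After scaling, the varying column has mean 0 and unit variance, while the constant column stays at 0 and therefore contributes nothing to the variance that PCA analyzes.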

In [18]:
# Create a PCA object that will retain 95% of the variance when transforming
pca = PCA(n_components = .95)
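
When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained-variance ratio reaches that fraction. A rough sketch of that selection on synthetic data (the data, the seed, and the `*_demo` names are assumptions for illustration, not part of the MNIST workflow):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data (assumption): 500 samples, 10 features with decreasing spread
X_demo = rng.normal(size=(500, 10)) * np.array([10, 8, 6, 4, 2, 1, 1, 1, 1, 1])

# Fit a full PCA and accumulate the explained-variance ratios
full_pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.95
k_manual = int(np.searchsorted(cumulative, 0.95) + 1)

# Let scikit-learn pick the count from a float n_components
auto_pca_demo = PCA(n_components=0.95).fit(X_demo)

print(k_manual, auto_pca_demo.n_components_)
```

Both numbers should agree, and `auto_pca_demo.n_components_` is the attribute to inspect when you want to know how many components were actually kept.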

## KNN with PCA

Check how many dimensions PCA removes.

In [19]:
# Combine the scaler and the PCA in a pipeline.
scaler_pca = make_pipeline(scaler, pca)

# Transform the training data and check the shape of the new features after applying PCA
X_train_pca = scaler_pca.fit_transform(X_train)
print(f'The total number of columns has decreased from {X_train.shape[1]} to {X_train_pca.shape[1]}.')

The total number of columns has decreased from 784 to 331.


In [20]:
# Create and fit a KNN model WITH PCA.
knn_pca = make_pipeline(scaler, pca, KNeighborsClassifier())
knn_pca.fit(X_train, y_train)
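
Calling `fit` on a pipeline fits each step in order on the training data, and `predict` later pushes new data through the already-fitted transformers before the classifier sees it. The sketch below checks that equivalence by hand. It uses the small 8x8 `load_digits` dataset bundled with scikit-learn as a stand-in for MNIST (an assumption made to keep it quick), and all `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in dataset (assumption): 8x8 digits instead of 28x28 MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Pipeline version: scale -> PCA -> KNN in one object
pipe_demo = make_pipeline(StandardScaler(), PCA(n_components=0.95), KNeighborsClassifier())
pipe_demo.fit(X_tr, y_tr)
pipe_preds = pipe_demo.predict(X_te)

# Manual version: fit every step on the training data only,
# then reuse the fitted transformers on the test data
scaler_demo = StandardScaler().fit(X_tr)
pca_demo = PCA(n_components=0.95).fit(scaler_demo.transform(X_tr))
knn_demo = KNeighborsClassifier().fit(pca_demo.transform(scaler_demo.transform(X_tr)), y_tr)
manual_preds = knn_demo.predict(pca_demo.transform(scaler_demo.transform(X_te)))

print(np.array_equal(pipe_preds, manual_preds))
```

The pipeline form is shorter and makes it harder to fit the scaler or PCA on test data by accident.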

## KNN without PCA

In [21]:
# Create and fit a KNN model WITHOUT PCA.
knn_nopca = make_pipeline(scaler, KNeighborsClassifier())
knn_nopca.fit(X_train, y_train)

## Apply KNN with and without PCA to predict

In [22]:
%%time
preds_pca = knn_pca.predict(X_test)

CPU times: user 32.2 s, sys: 919 ms, total: 33.1 s
Wall time: 14.5 s


In [23]:
# print the classification report for the test data
print(classification_report(y_test, preds_pca))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       980
           1       0.96      0.99      0.98      1135
           2       0.96      0.94      0.95      1032
           3       0.94      0.96      0.95      1010
           4       0.95      0.94      0.95       982
           5       0.94      0.93      0.93       892
           6       0.96      0.97      0.97       958
           7       0.94      0.93      0.93      1028
           8       0.96      0.91      0.93       974
           9       0.93      0.92      0.92      1009

    accuracy                           0.95     10000
   macro avg       0.95      0.95      0.95     10000
weighted avg       0.95      0.95      0.95     10000



In [24]:
%%time
preds_nopca = knn_nopca.predict(X_test)

CPU times: user 1min 5s, sys: 1.68 s, total: 1min 7s
Wall time: 35.3 s


In [25]:
# print the classification report for the test data
print(classification_report(y_test, preds_nopca))

              precision    recall  f1-score   support

           0       0.95      0.98      0.97       980
           1       0.96      0.99      0.97      1135
           2       0.96      0.93      0.94      1032
           3       0.92      0.95      0.94      1010
           4       0.94      0.94      0.94       982
           5       0.93      0.92      0.93       892
           6       0.96      0.97      0.97       958
           7       0.94      0.92      0.93      1028
           8       0.96      0.90      0.93       974
           9       0.92      0.92      0.92      1009

    accuracy                           0.94     10000
   macro avg       0.94      0.94      0.94     10000
weighted avg       0.94      0.94      0.94     10000



## Summary

- Adding PCA as a preprocessing step for KNN classification reduced the wall time needed to predict the 10,000 test images from `35.3 s` to `14.5 s`, roughly 2.4 times faster
- Adding PCA also slightly increased overall accuracy (from `0.94` to `0.95`)
  - Every per-class precision, recall, and F1-score was equal to or higher than without PCA, by at most `0.02`
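
The `%%time` magic only works in IPython and Jupyter, and a single run can be noisy. As a rough sketch of how the same comparison could be timed in a plain script, the example below uses `time.perf_counter` and keeps the fastest of several runs. The helper `time_predict` is hypothetical, and the small 8x8 `load_digits` dataset stands in for MNIST, so the timings it prints say nothing about the figures above.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def time_predict(model, X, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` predict calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        model.predict(X)
        best = min(best, time.perf_counter() - start)
    return best


# Small stand-in dataset (assumption): 8x8 digits bundled with scikit-learn
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_small, y_small, random_state=0)

with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95), KNeighborsClassifier())
without_pca = make_pipeline(StandardScaler(), KNeighborsClassifier())
with_pca.fit(X_tr, y_tr)
without_pca.fit(X_tr, y_tr)

t_pca = time_predict(with_pca, X_te)
t_nopca = time_predict(without_pca, X_te)
print(f"with PCA: {t_pca:.4f} s, without PCA: {t_nopca:.4f} s")
```

On a dataset this small the gap may be negligible or even reversed, since the PCA transform has its own cost; the benefit tends to show up as the number of samples and features grows.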