# PCA Practice

# Learning Objectives

- Demonstrate the ability to perform PCA to complete the assignment.

# Task

Your task is to use PCA to speed up a classification algorithm on a high-dimensional dataset. You will fit one model on the scaled original data and a second model on the same data after a PCA transformation, then compare their prediction times and evaluation scores.

We will use the MNIST digits dataset, which scikit-learn can download from OpenML with `fetch_openml`. The dataset contains 70,000 28x28 pixel images of handwritten digits 0-9, each flattened to 784 features. Your task is to classify each image by the digit it shows.

https://en.wikipedia.org/wiki/MNIST_database 

Use PCA to lower the dimensions in this dataset while retaining 95% of the variance. You can do this when instantiating the PCA by giving the `n_components=` argument a float between 0 and 1.

`pca = PCA(n_components = .95)`
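To see what a float `n_components` does, here is a minimal sketch. It assumes scikit-learn's small bundled `load_digits` dataset (8x8 images) as a quick stand-in for MNIST, and the `*_demo` names are illustrative only. With a float, PCA keeps the smallest number of components whose cumulative explained variance ratio passes the threshold.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# stand-in data (assumption): 8x8 digit images bundled with scikit-learn
X_demo, _ = load_digits(return_X_y=True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# a float n_components is treated as a variance threshold
pca_demo = PCA(n_components=0.95)
X_demo_reduced = pca_demo.fit_transform(X_demo_scaled)

# cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print('original features:', X_demo_scaled.shape[1])
print('components kept:', pca_demo.n_components_)
print('variance retained:', round(float(cumulative[-1]), 4))
```

The number of components kept depends on the data, so the count for MNIST will differ from the count printed here.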

# Import libraries

In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Load data

In [2]:
from sklearn.datasets import fetch_openml
# as_frame=True returns pandas objects, which the checks below rely on
mnist = fetch_openml('mnist_784', as_frame=True)

# check
mnist.data.shape

(70000, 784)

# Inspect data

In [7]:
# access X features using mnist.data
# access y target using mnist.target

# check for duplicates
mnist.data.duplicated().sum()

0

In [9]:
# check for missing values
print(mnist.data.isna().sum().sum())
print(mnist.target.isna().sum())

0
0


In [10]:
# check class balance for target
mnist.target.value_counts()

1    7877
7    7293
3    7141
2    6990
9    6958
0    6903
6    6876
8    6825
4    6824
5    6313
Name: class, dtype: int64

# Preprocessing

In [16]:
# make scaler and pca pipeline for model 1
# always scale data before PCA: PCA finds directions of maximum variance,
# so features on larger scales would otherwise dominate the components

# instantiate scaler
scaler = StandardScaler()

# instantiate PCA for .95 variance explained
pca = PCA(n_components = .95)

# make pipeline with scaler and pca
model1_pipe = make_pipeline(scaler, pca)

In [17]:
# validate model with train/test split
X_train, X_test, y_train, y_test = train_test_split(mnist.data, 
                                                    mnist.target, 
                                                    random_state = 42)
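The scaler-plus-PCA pipeline can also be run on its own to check how many dimensions survive the reduction. The sketch below assumes the small bundled `load_digits` dataset as a stand-in for MNIST, and names such as `prep_pipe` and `n_kept` are hypothetical. It fits the preprocessing steps on the training rows only and then reuses them on the test rows.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data (assumption): the small digits set bundled with scikit-learn
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_small, y_small, random_state=42)

# same structure as the preprocessing pipeline: scale, then reduce
prep_pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95))

# learn the scaling and the components from the training rows only
X_tr_reduced = prep_pipe.fit_transform(X_tr)
# reuse the fitted steps on the test rows
X_te_reduced = prep_pipe.transform(X_te)

# make_pipeline names each step after its lowercased class name
n_kept = prep_pipe.named_steps['pca'].n_components_
print(X_tr.shape, '->', X_tr_reduced.shape)
print('components kept:', n_kept)
```

Fitting on the training split alone keeps information from the test rows out of the scaling and the components.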

# Modeling

## Model 1: KNN with PCA

In [23]:
# model 1: knn with pca
model1 = KNeighborsClassifier()

# build the full pipeline: scaler, pca, then knn
# (flat pipeline, so re-running this cell does not nest pipelines)
model1_pipe = make_pipeline(scaler, pca, model1)

# fit on training data
model1_pipe.fit(X_train, y_train)

In [24]:
%%time

# make predictions and time it
model1_test_preds = model1_pipe.predict(X_test)

CPU times: total: 38 s
Wall time: 3.8 s


## Model 2: KNN without PCA

In [25]:
# model 2: knn without pca
model2 = KNeighborsClassifier()

# make model2_pipe
model2_pipe = make_pipeline(StandardScaler(), model2)

# fit on training data
model2_pipe.fit(X_train, y_train)

In [26]:
%%time

# make predictions and time it
model2_test_preds = model2_pipe.predict(X_test)

CPU times: total: 1min 28s
Wall time: 8.23 s
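`%%time` is an IPython cell magic, so it only works inside a notebook. As a rough sketch of the same comparison in a plain script, the example below measures wall-clock prediction time with `time.perf_counter`. It assumes the small bundled `load_digits` dataset as a stand-in for MNIST, and `time_predict` is a hypothetical helper written for this example.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def time_predict(fitted_model, X):
    """Hypothetical helper: return (predictions, wall-clock seconds)."""
    start = time.perf_counter()
    preds = fitted_model.predict(X)
    return preds, time.perf_counter() - start


# stand-in data (assumption): the small digits set bundled with scikit-learn
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_small, y_small, random_state=42)

knn_with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                             KNeighborsClassifier())
knn_without_pca = make_pipeline(StandardScaler(), KNeighborsClassifier())

for name, pipe in [('with PCA', knn_with_pca), ('without PCA', knn_without_pca)]:
    pipe.fit(X_tr, y_tr)
    preds, seconds = time_predict(pipe, X_te)
    print(f'KNN {name}: {seconds:.4f} s for {len(preds)} predictions')
```

On a dataset this small the gap between the two pipelines may be negligible or even reversed; the speedup from PCA matters most with many rows and many features, as in MNIST.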


# Evaluation

In [29]:
# model 1
print('Model 1: KNN with PCA')
print('Test set metrics:')
print(classification_report(y_test, model1_test_preds))

Model 1: KNN with PCA
Test set metrics:
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      1714
           1       0.96      0.99      0.97      1977
           2       0.95      0.94      0.94      1761
           3       0.94      0.94      0.94      1806
           4       0.94      0.94      0.94      1587
           5       0.95      0.93      0.94      1607
           6       0.96      0.98      0.97      1761
           7       0.94      0.93      0.94      1878
           8       0.97      0.90      0.93      1657
           9       0.91      0.93      0.92      1752

    accuracy                           0.95     17500
   macro avg       0.95      0.95      0.95     17500
weighted avg       0.95      0.95      0.95     17500



In [31]:
# model 2
print('Model 2: KNN without PCA')
print('Test set metrics:')
print(classification_report(y_test, model2_test_preds))

Model 2: KNN without PCA
Test set metrics:
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      1714
           1       0.95      0.99      0.97      1977
           2       0.95      0.93      0.94      1761
           3       0.93      0.94      0.94      1806
           4       0.94      0.93      0.94      1587
           5       0.94      0.93      0.94      1607
           6       0.96      0.97      0.97      1761
           7       0.94      0.93      0.93      1878
           8       0.97      0.89      0.93      1657
           9       0.90      0.92      0.91      1752

    accuracy                           0.94     17500
   macro avg       0.94      0.94      0.94     17500
weighted avg       0.94      0.94      0.94     17500



# Questions

1. Which model performed the best on the test set?

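The model with PCA performed slightly better than the model without PCA on the test set. Every metric in its classification report was either equal to or 0.01 (one percentage point) higher than the corresponding metric for the other model, and overall accuracy was 0.95 versus 0.94.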

2. Which model was the fastest at making predictions?

The model with PCA was faster at making predictions. Its total CPU time was 38 seconds (wall time 3.8 seconds), compared with 1 minute 28 seconds (wall time 8.23 seconds) for the model without PCA, so prediction took less than half the time.
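For question 1, note that `classification_report` rounds to two decimals by default, so a reported gap of 0.01 can hide a smaller real difference. The sketch below shows one way to look closer, using `accuracy_score` and the `digits` argument of `classification_report`. It assumes the small bundled `load_digits` dataset as a stand-in for MNIST, and all variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data (assumption): the small digits set bundled with scikit-learn
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_small, y_small, random_state=42)

pipe_with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                              KNeighborsClassifier())
pipe_without_pca = make_pipeline(StandardScaler(), KNeighborsClassifier())

# fit() returns the pipeline itself, so predict() can be chained
preds_with = pipe_with_pca.fit(X_tr, y_tr).predict(X_te)
preds_without = pipe_without_pca.fit(X_tr, y_tr).predict(X_te)

# unrounded accuracy for each model and the difference between them
acc_with = accuracy_score(y_te, preds_with)
acc_without = accuracy_score(y_te, preds_without)
print(f'with PCA:    {acc_with:.4f}')
print(f'without PCA: {acc_without:.4f}')
print(f'difference:  {acc_with - acc_without:+.4f}')

# the same report as before, but with four decimals instead of two
report_with = classification_report(y_te, preds_with, digits=4)
print(report_with)
```

The scores printed here describe the stand-in dataset only and say nothing about the MNIST results reported above.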