
# PCA Exercise
- Michael Vincent
- 8/21/22


## Imports

In [13]:
# Imports
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

## Load and prepare the data

In [14]:
# Load the data
mnist = fetch_openml('mnist_784')

In [15]:
# Get the features and target
X = mnist.data
y = mnist.target

In [16]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [17]:
# Scale the data

# Fit the scaler on the training data only, then transform both splits.
# (Updated: an earlier version called .fit_transform() on the test data.)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
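
A minimal sketch of why the scaler is fit on the training split only, using small synthetic arrays (the data and all `*_demo` names are illustrative assumptions, not part of the notebook). The test split is transformed with the training mean and standard deviation, so its scaled mean is generally close to zero but not exactly zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only, not MNIST)
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Learn the mean and standard deviation from the training split only
demo_scaler = StandardScaler().fit(train_demo)
train_demo_scaled = demo_scaler.transform(train_demo)
test_demo_scaled = demo_scaler.transform(test_demo)

# Training columns are centered at zero (up to floating-point error);
# test columns are only approximately centered
print(train_demo_scaled.mean(axis=0))
print(test_demo_scaled.mean(axis=0))
```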

## KNN models

In [18]:
# Instantiate two k-nearest neighbors models: one
# that uses PCA and one that doesn't

# Code has been updated so that there isn't an unnecessary
# pipeline within a pipeline.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pca_pipe = make_pipeline(StandardScaler(), PCA(n_components = 0.95), KNeighborsClassifier())
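
Passing a float between 0 and 1 as `n_components` asks PCA to keep just enough components to explain that share of the variance. A rough sketch of how to inspect the result, assuming the much smaller built-in `load_digits` dataset as a stand-in for MNIST (the `*_digits` and `pca_demo` names are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small built-in 8x8 digits dataset (64 features), used here instead of MNIST.
# No train/test split: this only inspects what PCA keeps.
X_digits, _ = load_digits(return_X_y=True)
X_digits_scaled = StandardScaler().fit_transform(X_digits)

# A float in (0, 1) requests enough components to reach that variance share
pca_demo = PCA(n_components=0.95)
X_digits_reduced = pca_demo.fit_transform(X_digits_scaled)

# Number of original features versus number of retained components
print(X_digits.shape[1], '->', pca_demo.n_components_)
# Cumulative share of variance explained by the retained components
print(pca_demo.explained_variance_ratio_.sum())
```

The same attributes are available on the fitted pipeline step, for example through `knn_pca_pipe.named_steps['pca']` once the pipeline has been fit.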

In [19]:
# Train the model that doesn't use PCA
knn_pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kneighborsclassifier', KNeighborsClassifier())])

In [20]:
# Train the model that uses PCA
knn_pca_pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pca', PCA(n_components=0.95)),
                ('kneighborsclassifier', KNeighborsClassifier())])

In [21]:
%%time
# Make predictions using the model without PCA
preds_no_pca = knn_pipe.predict(X_test)

CPU times: user 1min 43s, sys: 1.72 s, total: 1min 44s
Wall time: 1min 3s


In [22]:
%%time
# Make predictions using the model with PCA
preds_pca = knn_pca_pipe.predict(X_test)

CPU times: user 54.1 s, sys: 1.12 s, total: 55.3 s
Wall time: 36.3 s


> The model using PCA made its predictions in about 58% of the wall time needed by the model without PCA (36.3 s versus 1 min 3 s), a reduction of roughly 42%.
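
The size of the speedup can be checked directly from the wall times reported by `%%time` above. The variable names below are illustrative:

```python
# Wall times reported by %%time in the two prediction cells above
wall_no_pca = 63.0   # 1min 3s, in seconds
wall_pca = 36.3      # seconds

ratio = wall_pca / wall_no_pca      # fraction of the original time
reduction = 1 - ratio               # fraction of time saved
speedup = wall_no_pca / wall_pca    # how many times faster

print(f"ratio={ratio:.2f}, reduction={reduction:.1%}, speedup={speedup:.2f}x")
# → ratio=0.58, reduction=42.4%, speedup=1.74x
```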

In [23]:
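# Evaluate the models
# classification_report expects the true labels first, then the predictions.
# The reports recorded below were generated with the two arguments reversed,
# so their precision and recall columns are swapped.
print('Metrics for the model without PCA')
print(classification_report(y_test, preds_no_pca))
print()
print('Metrics for the model with PCA')
print(classification_report(y_test, preds_pca))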

Metrics for the model without PCA
              precision    recall  f1-score   support

           0       0.98      0.96      0.97      1748
           1       0.99      0.95      0.97      2066
           2       0.93      0.95      0.94      1737
           3       0.94      0.93      0.94      1824
           4       0.93      0.94      0.94      1576
           5       0.93      0.94      0.94      1591
           6       0.97      0.96      0.97      1780
           7       0.93      0.94      0.93      1864
           8       0.89      0.97      0.93      1525
           9       0.92      0.90      0.91      1789

    accuracy                           0.94     17500
   macro avg       0.94      0.94      0.94     17500
weighted avg       0.94      0.94      0.94     17500


Metrics for the model with PCA
              precision    recall  f1-score   support

           0       0.98      0.96      0.97      1748
           1       0.99      0.96      0.97      2048
           2
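
scikit-learn's metric functions, including `classification_report`, take the true labels first and the predictions second. A small sketch with made-up labels (all `*_demo` names are illustrative) suggests what happens when the two are swapped: precision and recall trade places, while accuracy and F1 are unaffected.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1]

# Correct order: (y_true, y_pred)
p_correct = precision_score(y_true_demo, y_pred_demo)
r_correct = recall_score(y_true_demo, y_pred_demo)

# Swapped order: (y_pred, y_true)
p_swapped = precision_score(y_pred_demo, y_true_demo)
r_swapped = recall_score(y_pred_demo, y_true_demo)

print(p_correct, r_correct)   # precision 2/3, recall 1.0
print(p_swapped, r_swapped)   # precision 1.0, recall 2/3
```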

## Question Responses

a. The classification reports above show that the model using PCA performed slightly better on precision, recall, and accuracy.

b. The model using PCA made its predictions in roughly 58% of the wall time required by the model without PCA (36.3 s versus 1 min 3 s).

