**PCA Exercise**

Student: Michael Menjares

- Your task is to perform PCA to speed up a classification algorithm on a high-dimensional dataset. 
- You will fit a model on the original scaled data, and a different one on data after transformation using a PCA model. 
- You will compare the computation time and the evaluation scores.

- We will use the MNIST digits dataset, which sklearn can download from OpenML using `fetch_openml`.
- This dataset has 28x28 pixel images of handwritten digits 0-9. Your task is to classify these to determine which digits they are.

- Use PCA to lower the dimensions in this dataset while retaining 95% of the variance. You can do this when instantiating the PCA by giving the `n_components=` argument a float between 0 and 1.

- You can access the X features data using mnist.data.

- You can access the y target using mnist.target.
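As a minimal sketch of how a float `n_components` behaves, the snippet below fits PCA on scaled data and reports how many components were kept. It assumes sklearn's small bundled `load_digits` dataset (8x8 images, 64 features) as a stand-in for MNIST so that it runs quickly without a download; the variable names are illustrative only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits (8x8 images, 64 features) stands in for MNIST
# so the sketch runs quickly and needs no download.
X, y = load_digits(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float between 0 and 1 tells PCA to keep the fewest components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

retained = pca.explained_variance_ratio_.sum()
print("original features:", X.shape[1])
print("components kept:", pca.n_components_)
print("variance retained:", round(retained, 4))
```

The fitted `n_components_` attribute holds the number of components PCA selected, and the sum of `explained_variance_ratio_` confirms that at least 95% of the variance was retained.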

In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.metrics import silhouette_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [64]:
# load the dataset
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
# view the shape of the dataset
mnist.data.shape

(70000, 784)

In [65]:
mnist.data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Columns: 784 entries, pixel1 to pixel784
dtypes: float64(784)
memory usage: 418.7 MB


In [66]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, random_state=42)

In [67]:
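# Give each pipeline its own estimator instances so that fitting one
# model never overwrites the fitted state of the other.
# PCA(n_components=0.95) keeps enough components to explain 95% of the variance.
pcamodel = make_pipeline(StandardScaler(), PCA(n_components=0.95), KNeighborsClassifier())

In [68]:
nopcamodel = make_pipeline(StandardScaler(), KNeighborsClassifier())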

In [69]:
pcamodel.fit(X_train,y_train)

In [70]:
%%time
preds_pca = pcamodel.predict(X_test)

CPU times: total: 1min 4s
Wall time: 9.94 s


In [71]:
print('PCA Training accuracy:', pcamodel.score(X_train, y_train))
print('PCA Testing accuracy:', pcamodel.score(X_test, y_test))

PCA Training accuracy: 0.9650285714285715
PCA Testing accuracy: 0.9477714285714286


In [72]:
nopcamodel.fit(X_train,y_train)

In [73]:
%%time
preds_no_pca = nopcamodel.predict(X_test)

CPU times: total: 2min 26s
Wall time: 21 s


In [74]:
print('NO PCA Training accuracy:', nopcamodel.score(X_train, y_train))
print('NO PCA Testing accuracy:', nopcamodel.score(X_test, y_test))

NO PCA Training accuracy: 0.9625904761904762
NO PCA Testing accuracy: 0.9442285714285714


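a. Which model performed the best on the test set?
    The PCA model, by a narrow margin: 0.9478 test accuracy versus 0.9442 without PCA.

b. Which model was the fastest at making predictions?
    The PCA model: predictions took 9.94 s of wall time versus 21 s without PCA, roughly twice as fast.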
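
The comparison above can be reproduced at a small scale with a sketch like the following. It assumes the bundled `load_digits` dataset as a stand-in for MNIST and uses a hypothetical helper, `time_predict`, in place of the `%%time` cell magic so that it also runs outside a notebook.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def time_predict(model, X):
    """Hypothetical helper: return (predictions, wall-clock seconds) for model.predict(X)."""
    start = time.perf_counter()
    preds = model.predict(X)
    return preds, time.perf_counter() - start


# Assumption: the small bundled digits dataset stands in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each pipeline owns its own scaler and classifier instances.
models = {
    "PCA": make_pipeline(StandardScaler(), PCA(n_components=0.95), KNeighborsClassifier()),
    "NO PCA": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    _, seconds = time_predict(model, X_test)
    accuracy = model.score(X_test, y_test)
    results[name] = {"seconds": seconds, "test_accuracy": accuracy}
    print(f"{name}: predict {seconds:.4f} s, test accuracy {accuracy:.4f}")
```

Timings vary by machine, and on a dataset this small the gap between the two models may be negligible or even reversed; the speedup from PCA is more likely to show on high-dimensional data such as the 784-feature MNIST set.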