## MSDS 422 Multi-Class Models: PCA and Random Forests
### Assignment 5 (with PCA)
### John Moderwell
### Introduction:
In previous assignments, data from the Boston Housing Study was used to train, test and evaluate regression models as well as decision tree/random forest models. In this assignment, the MNIST dataset will be used for benchmark testing alternative modeling approaches. This will involve random forest classification and principal component analysis (PCA).

### Data Ingestion

In [45]:
# seed value for random number generators to obtain reproducible results
RANDOM_SEED = 1

# import base packages 
import numpy as np

In [46]:
import pandas as pd

In [47]:
import time

In [48]:
#import relevant Scikit Learn packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV
from sklearn.decomposition import PCA, FactorAnalysis

In [85]:
#set working directory
import os
os.chdir('C:\\Users\\R\\Desktop\\MSDS 422\\Assignment 5')

In [50]:
from six.moves import urllib
from sklearn.datasets import fetch_mldata

from scipy.io import loadmat
mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_path = "./mnist-original.mat"
response = urllib.request.urlopen(mnist_alternative_url)
with open(mnist_path, "wb") as f:
  content = response.read()
  f.write(content)
  mnist_raw = loadmat(mnist_path)
  mnist = {
  "data": mnist_raw["data"].T,
  "target": mnist_raw["label"][0],
  "COL_NAMES": ["label", "data"],
  "DESCR": "mldata.org dataset: mnist-original",}

In [51]:
#examine dataset
print(mnist)

#create exaplanatory and response variables
X, y = mnist['data'], mnist['target']

{'data': array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'target': array([0., 0., 0., ..., 9., 9., 9.]), 'COL_NAMES': ['label', 'data'], 'DESCR': 'mldata.org dataset: mnist-original'}


In [52]:
#take a closer look at structure of data
print('\n Structure of explanatory variable:', X.shape)
print('\n Structure of response:', y.shape)


 Structure of explanatory variable: (70000, 784)

 Structure of response: (70000,)


### Principal Component Analysis (PCA) 
Identify optimal number of components for model fitting

In [84]:
#PCA
print('')
print('----- Principal Component Analysis -----')
print('')
#See how many components should be included in PCA reduced dataset
#See how long it takes 
t0=time.time()
pca_data = X
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(pca_data)
t1=time.time()
print("PCA on full 70000 dataset took {:.2f}s".format(t1 - t0))
print('PCA number of components:',pca.n_components_)


----- Principal Component Analysis -----

PCA on full 70000 dataset took 6.66s
PCA number of components: 154


In [79]:
#examine explained variance
pca_explained_variance = pca.explained_variance_ratio_

explained_variance = pd.DataFrame(pca_explained_variance, columns=['Explained Variance'])

explained_variance.describe().round(7)

Unnamed: 0,Explained Variance
count,154.0
mean,0.006171
std,0.013253
min,0.000449
25%,0.000752
50%,0.001579
75%,0.004761
max,0.097461


### Random Forest Classifier with PCA (154 variables)
Assess classification performance using 154 variables and evaluating with F1-score

In [80]:
# Use components and find train and test sets
principalDf = pd.DataFrame(data = mnist_X_reduced)
# Only the explanatory variables were used for PCA, therefore, y tables remain the same
x_train = principalDf[0:59999]
x_test = principalDf[60000:69999]
y_train = y[0:59999,]
y_test = y[60000:69999,]

In [82]:
# Fit Random Forest model
#See how long it takes to evaluate reduced dataset using using RF model
t0=time.time()
rf = RandomForestClassifier(random_state = 9999, n_estimators=10, bootstrap=True
, max_features='sqrt')
model = rf.fit(x_train, y_train)
# Calculate predictions
y_predict = model.predict(x_test)
# Calculate F1 score
# Ideal value is 1
f1 = f1_score(y_test, y_predict, average='weighted')
t1=time.time()
print("Random Forest classification on reduced dataset (154 components): {:.2f}s".format(t1 - t0))
print('F1 Score:',f1)

Random Forest classification on reduced dataset (154 components): 8.53s
F1 Score: 0.8980761888146362


### Conclusion
Performing a principal component analysis on the data had a significant effect on the time it took to evaluate data using a random forest model. The total time it took to identify the principal components (154) and then build, fit and evaluate the model was approximately 15.13 seconds. The F1 score was .89807. On the other hand, building, fitting and evaluating the RF model on the entire dataset took 315.94 seconds (approximately 21x longer). The F1 score .97158. As we can see, there is a trade off between efficiency and accuracy. While a PCA reduced dataset takes less time to evaluate, it results in a less accurate score.