## MSDS 422 Multi-Class Models: PCA and Random Forests

### Assignment 5 (No PCA)
### John Moderwell

## Introduction:
In previous assignments, data from the Boston Housing Study was used to train, test, and evaluate regression models as well as decision tree and random forest models. In this assignment, the MNIST dataset is used to benchmark alternative modeling approaches: random forest classification with and without principal component analysis (PCA). This notebook covers the random forest fitted on all variables, without PCA.

## Data Ingestion

In [65]:
# seed value for random number generators to obtain reproducible results
RANDOM_SEED = 1

# import base packages 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle
import time
import seaborn as sns

In [66]:
#import relevant Scikit Learn packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV
from sklearn.decomposition import PCA

In [67]:
#set working directory
import os
os.chdir('C:\\Users\\R\\Desktop\\MSDS 422\\Assignment 5')


In [68]:
import urllib.request
from scipy.io import loadmat

# download the MNIST .mat file from an alternative source
mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_path = "./mnist-original.mat"
response = urllib.request.urlopen(mnist_alternative_url)
with open(mnist_path, "wb") as f:
    content = response.read()
    f.write(content)

# load the file only after it has been fully written and closed
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}

In [69]:
#examine dataset
print(mnist)

#create explanatory and response variables
X, y = mnist['data'], mnist['target']

{'data': array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'target': array([0., 0., 0., ..., 9., 9., 9.]), 'COL_NAMES': ['label', 'data'], 'DESCR': 'mldata.org dataset: mnist-original'}


In [70]:
#take a closer look at structure of data
print('\n Structure of explanatory variable:', X.shape)
print('\n Structure of response:', y.shape)


 Structure of explanatory variable: (70000, 784)

 Structure of response: (70000,)
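
Each of the 784 columns holds one pixel of a 28 x 28 grayscale image (784 = 28 * 28), so a row can be reshaped back into an image for inspection. The sketch below illustrates this on a synthetic stand-in array rather than the real MNIST data; the names `fake_X` and `row_to_image` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np

def row_to_image(row, side=28):
    """Reshape one flattened row of pixel values into a side x side image."""
    row = np.asarray(row)
    if row.size != side * side:
        raise ValueError(f"expected {side * side} values, got {row.size}")
    return row.reshape(side, side)

# Synthetic stand-in for X: 3 "images" of 784 uint8 pixel values each
rng = np.random.default_rng(1)
fake_X = rng.integers(0, 256, size=(3, 784), dtype=np.uint8)

image = row_to_image(fake_X[0])
print(image.shape)  # (28, 28)
```

With the real data, the same call on `X[i]` could be passed to `plt.imshow` to view the digit.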


In [71]:
#train on the first 60,000 observations and test on the remaining 10,000
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [72]:
#observe frequency distribution for training set (first 60,000 observations)
#note: the slice end is exclusive, so y[:60000] covers observations 0 through 59,999
mnist_y_0_59999_df = pd.DataFrame({'label': y[:60000]})
print('\nFrequency distribution for 60,000 observations (for model building)')
print(mnist_y_0_59999_df['label'].value_counts(ascending = True))


Frequency distribution for 60,000 observations (for model building)
5.0    5421
4.0    5842
8.0    5851
6.0    5918
0.0    5923
9.0    5949
2.0    5958
3.0    6131
7.0    6265
1.0    6742
Name: label, dtype: int64


In [73]:
#observe frequency distribution for test set (last 10,000 observations)
#note: the slice end is exclusive, so y[60000:70000] covers observations 60,000 through 69,999
mnist_y_60000_69999_df = pd.DataFrame({'label': y[60000:70000]})
print('\nFrequency distribution for last 10,000 observations (holdout sample)')
print(mnist_y_60000_69999_df['label'].value_counts(ascending = True))


Frequency distribution for last 10,000 observations (holdout sample)
5.0     892
6.0     958
8.0     974
0.0     980
4.0     982
9.0    1009
3.0    1010
7.0    1028
2.0    1032
1.0    1135
Name: label, dtype: int64
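
Raw counts are easier to compare across the training and holdout sets when expressed as shares of each set. A minimal sketch of that conversion follows, using a small made-up label vector (`toy_labels` is hypothetical) and the `normalize=True` option of pandas `value_counts`.

```python
import pandas as pd

# Made-up labels standing in for y; not MNIST data
toy_labels = pd.Series([0., 0., 1., 1., 1., 2., 2., 2., 2., 2.], name='label')

# normalize=True returns each label's share of all observations
shares = toy_labels.value_counts(normalize=True).sort_index()
print(shares)
```

Applied to the two data frames above, similar shares in both sets would suggest the holdout sample has roughly the same class mix as the training data.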


### Random Forest Classifier (all variables)
Assess classification performance using all 784 explanatory variables, evaluated on the holdout sample with the weighted F1 score.

In [74]:
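#shuffle the training set so observations are no longer ordered by label
#the same index is applied to X_train and y_train to keep rows and labels paired
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]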
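
The shuffle above applies a single permutation to both arrays, which is what keeps each row matched to its label. The sketch below checks that property on toy arrays (`toy_X` and `toy_y` are hypothetical). It also seeds the generator with `RANDOM_SEED`; that is an assumption about how the shuffle could be made repeatable, not something the cell above does.

```python
import numpy as np

RANDOM_SEED = 1  # same value the notebook defines at the top

# Toy data: row i is filled with the value i, and its label is also i
toy_X = np.repeat(np.arange(6).reshape(-1, 1), 4, axis=1)
toy_y = np.arange(6)

# A seeded generator gives the same permutation on every run
rng = np.random.RandomState(RANDOM_SEED)
idx = rng.permutation(len(toy_y))

shuffled_X, shuffled_y = toy_X[idx], toy_y[idx]

# Every shuffled row still sits next to its original label
print(np.array_equal(shuffled_X[:, 0], shuffled_y))  # True
```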

In [76]:
#Fit Random Forest Classifier on the training set and then evaluate on the test set
#Time the whole run: fitting, prediction and scoring
t0 = time.time()
clf = RandomForestClassifier(n_estimators=784, random_state=9999, bootstrap=True)
model = clf.fit(X_train, y_train)

# Calculate predictions on the holdout sample
y_predict = model.predict(X_test)

# Calculate weighted F1 score
# Ideal value is 1
f1 = f1_score(y_test, y_predict, average='weighted')
t1 = time.time()

print("Random Forest classification on full dataset took {:.2f}s".format(t1 - t0))
print('F1 Score:', f1)

Random Forest classification on full dataset took 315.51s
F1 Score: 0.9722867467081924
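
With `average='weighted'`, scikit-learn computes an F1 score for each class and then averages them, weighting each class by its support (the number of true observations in that class). The sketch below reproduces that calculation by hand on made-up labels and predictions; `y_true` and `y_pred` here are hypothetical and unrelated to the MNIST results above.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels and predictions (hypothetical, not MNIST results)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

# Support = number of true observations per class
labels, support = np.unique(y_true, return_counts=True)

# average=None returns one F1 score per class, in the order of `labels`
per_class_f1 = f1_score(y_true, y_pred, average=None, labels=labels)

# Support-weighted mean of the per-class scores
manual_weighted = float(np.sum(per_class_f1 * support) / support.sum())
sklearn_weighted = float(f1_score(y_true, y_pred, average='weighted'))

print(manual_weighted, sklearn_weighted)
```

Because the ten MNIST classes are close to balanced, the weighted score should sit near the unweighted (`average='macro'`) score for this dataset.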
