# Assignment 5: Principal Component Analysis - Oscar Hernandez

## Purpose 

This exercise involves benchmark testing of alternative modeling approaches. In particular, we will be focusing on developing a multiclass classifier that can correctly label a number image (0-9). The MNIST data set will be used to train each model which consists of 70000 observations and 784 features. At the end, we will make a recommendation of whether using PCA is helpful as a preliminary step of building a multiclass classifier.  

#### This report will cover the following: 
* Build a random forest classifier using the MNIST train and test data provided 
* Build a random forest classifier after using PCA on the entire MNIST data set 
* Build a random forest classifier after PCA using a different train/test method
* Calculate the time it took to execute certain parts of the model building process and evaluate using F1 scores 
* Provide management recommendation 

### Section 1 - Random Forest Classifier

#### Section 1 will involve training a random forest classifier using the 60000 observations in the train data set. We wil also evaluate how long it took to train the model and validate it on the test data set using a couple different methods. 

In [75]:
#Load all the necessary modules 
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.metrics import f1_score
import time
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA
import numpy as np
from sklearn.utils import shuffle

In [13]:
#Import the MNIST train data set which is 60,000 instances
#Display the shape of the newly created object
train = pd.read_csv("mnist_train.csv")
train.shape

(60000, 785)

In [14]:
#Import the MNIST test data set which is 10,000 instances
#Display the shape of the newly created object
test = pd.read_csv("mnist_test.csv")
test.shape

(10000, 785)

In [18]:
#Next, split the labels from train and test data set
X_train = train.drop("label", axis =1)
y_train = train["label"].copy()
X_test = test.drop("label", axis =1)
y_test = test["label"].copy()

In [28]:
#Utilize RandomForestClassifier to fit model using train data set 
forest_clf = RandomForestClassifier(n_estimators= 100, bootstrap=True,
                                    random_state = 1)

In [29]:
#Fit a RF model using train data 
#Also, calculate how long it took to train the model 
rf_start_time =time.time()
forest_clf.fit(X_train, y_train)
rf_elapsed_time = time.time() - rf_start_time
print (rf_elapsed_time)

60.29916071891785


In [31]:
#Since the model is created, now we can evalute it on the 
#test data set using a couple different ways 
#First, we will use the model to predict new instances 
y_pred_forest = forest_clf.predict(X_test)

In [43]:
#Now using the f1_score function, we will compute the F1 score
#for each class and output it as an array 
f1_score(y_test, y_pred_forest, average= None)

array([ 0.98132256,  0.99118166,  0.96621622,  0.96153846,  0.97305541,
        0.9689441 ,  0.98123045,  0.96477495,  0.95777549,  0.95072175])

In [37]:
#Alternatively, we can output the F1 scores using another function
#that also outputs the precision and recall for each class 
print (classification_report(y_test, y_pred_forest))

             precision    recall  f1-score   support

          0       0.97      0.99      0.98       980
          1       0.99      0.99      0.99      1135
          2       0.96      0.97      0.97      1032
          3       0.96      0.97      0.96      1010
          4       0.97      0.97      0.97       982
          5       0.98      0.96      0.97       892
          6       0.98      0.98      0.98       958
          7       0.97      0.96      0.96      1028
          8       0.96      0.95      0.96       974
          9       0.95      0.95      0.95      1009

avg / total       0.97      0.97      0.97     10000



In [46]:
#Lastly, we will use the cross_val_score function as another way
#to validate the model; this uses K-Fold cross-validation 
cross_val_score(forest_clf, X_test, y_test, cv =5, scoring="accuracy")

array([ 0.92361458,  0.92661008,  0.9475    ,  0.96596597,  0.96643287])

##### Key Takeaways
* This was a straightforward process that involved using the train and test data that was already split for us
* It took 60 seconds to train the random forest algorithm using the train data 
* The average F1 score for all the classes was 97% which is very high 

### Section 2 - PCA 

#### This section will focus on how PCA is used to lower the dimensions of the entire data set 

In [49]:
#Combine the train and test data set into one dataframe using concat
df = pd.concat([train, test])

In [52]:
#Initialize the PCA object and ensure that the principal components
#generated explain 95% of the variability 
#Also, we will time how long it took to execute the code 
pca_start_time = time.time()
pca = PCA(n_components=0.95, random_state =7)
df_reduced = pca.fit_transform(df)
pca_elapsed_time = time.time() - pca_start_time
print (pca_elapsed_time)

39.20349931716919


In [54]:
#Print out how the reduced dataframe to see if PCA code worked
df_reduced.shape

(70000, 154)

##### Key Takeaways
* We combined both the train and test data sets to create one data frame
* It took 39 seconds to the complete the PCA process using the new data frame 
* 95% of the variation can be explained by 154 components

### Section 3 - Random Forest Classifier w/ PCA 

#### This section is identical to Section 1 except that PCA will be incorporated

In [96]:
#Split the reduced data frame from Section 2 into a train and test set 
train_reduced, test_reduced = df[:60000], df[60000:] 

In [59]:
#Next, split the labels from reduced train and test data set
X_train_reduced = train_reduced.drop("label", axis =1)
y_train_reduced = train_reduced["label"].copy()
X_test_reduced = test_reduced.drop("label", axis =1)
y_test_reduced = test_reduced["label"].copy()

In [60]:
#Utilize RandomForestClassifier to fit model using reduced train data set 
pca_forest_clf = RandomForestClassifier(n_estimators= 100, bootstrap=True, 
                                        random_state = 2)

In [61]:
#Fit a RF model using reduced training data 
#Also, calculate how long it took to train the model 
pca_rf_start_time =time.time()
pca_forest_clf.fit(X_train_reduced, y_train_reduced)
pca_rf_elapsed_time = time.time() - pca_rf_start_time
print (pca_rf_elapsed_time)

55.023298025131226


In [62]:
#Since the model is created, now we can evalute it on the 
#reduced test data set  
#First, we will use the model to predict new instances 
y_pred_pca_forest = pca_forest_clf.predict(X_test_reduced)

In [63]:
#We will now output the F1 scores using classification_report function
#that also outputs the precision and recall for each class 
print (classification_report(y_test_reduced, y_pred_pca_forest))

             precision    recall  f1-score   support

          0       0.97      0.99      0.98       980
          1       0.99      0.99      0.99      1135
          2       0.96      0.97      0.96      1032
          3       0.96      0.97      0.96      1010
          4       0.97      0.97      0.97       982
          5       0.97      0.96      0.97       892
          6       0.98      0.98      0.98       958
          7       0.97      0.96      0.97      1028
          8       0.96      0.96      0.96       974
          9       0.96      0.95      0.95      1009

avg / total       0.97      0.97      0.97     10000



### Section 4 - Evaluation 

#### Interestingly enough, the test performance (as measured by F1 scores) was identical for the original 784-variable model versus the 95-percent-PCA model. There were small differences between classes but the average at the end was the same. The original 784-variable model took 60 seconds to make while the 95-percent-PCA model took 95 seconds to make, which included the PCA process. The actual training of the PCA model was 55 seconds which was less than the original model. 

### Section 5 - Fix the Issue 

#### We believe that data leakage is the flaw in this experiment. First, we are provided with the train and test data separately. However, during the process there wasn't any instructions on shuffling the train or test set which may negatively affect any cross-validation done on folds (some folds may be missing some digits). Another issue contributing to data leakage is that for Step 2, PCA was completed on the entire set which should only have been done on the train set excluding the labels. As such, the labels were used when completing PCA which shouldn't have been done. The test set excluding the labels should have been transformed separately.   

In [77]:
#We are going to take the combined data set and reshuffle it 
df2 = shuffle(df)

In [78]:
#This will split into train and test sets 
train_2, test_2 = df2[:60000], df2[60000:] 

In [79]:
#Next, split the labels from new train and test data set
X_train_2 = train_2.drop("label", axis =1)
y_train_2 = train_2["label"].copy()
X_test_2 = test_2.drop("label", axis =1)
y_test_2 = test_2["label"].copy()

In [80]:
#Initialize the PCA object and ensure that the principal components
#generated explain 95% of the variability 
#Also, we will time how long it took to execute the code 
new_start_time = time.time()
pca_2 = PCA(n_components=0.95, random_state = 10).fit(X_train_2)
X_train_reduced2 = pca_2.transform(X_train_2)
new_elapsed_time = time.time() - new_start_time
print (new_elapsed_time)

47.312687158584595


In [83]:
#Also, need to apply PCA to test set for validation later  
#We will time this as well 
new2_start_time = time.time()
X_test_reduced2 = pca_2.transform(X_test_2)
new2_elapsed_time = time.time() - new2_start_time
print(new2_elapsed_time)

0.14291596412658691


In [84]:
#Print out how the new reduced train set to see if PCA code worked
X_train_reduced2.shape 

(60000, 154)

In [85]:
#Utilize RandomForestClassifier to fit model using new reduced train data set 
pca2_forest_clf = RandomForestClassifier(n_estimators= 100, bootstrap=True, 
                                         random_state = 30)

In [86]:
#Fit a RF model using new reduced train data 
#Also, calculate how long it took to train the model 
pca2_rf_start_time =time.time()
pca2_forest_clf.fit(X_train_reduced2, y_train_2)
pca2_rf_elapsed_time = time.time() - pca2_rf_start_time
print (pca2_rf_elapsed_time)

115.92287611961365


In [87]:
#Since the model is created, now we can evalute it on the 
#new reduced test data set  
#First, we will use the model to predict new instances 
y_pred_pca2_forest = pca2_forest_clf.predict(X_test_reduced2)

In [89]:
#We will now output the F1 scores using classification_report function
#that also outputs the precision and recall for each class 
print (classification_report(y_test_2, y_pred_pca2_forest))

             precision    recall  f1-score   support

          0       0.96      0.98      0.97       948
          1       0.97      0.98      0.98      1087
          2       0.94      0.94      0.94      1012
          3       0.92      0.92      0.92      1033
          4       0.95      0.95      0.95       975
          5       0.94      0.93      0.94       906
          6       0.96      0.98      0.97       984
          7       0.95      0.95      0.95      1064
          8       0.91      0.92      0.92       982
          9       0.94      0.91      0.93      1009

avg / total       0.95      0.95      0.95     10000



##### Key Takeaways
* The average F1 score for all the classes in this model was 95%, which is less than the model developed without PCA (97%)
* The total development time for this section was 163 seconds 
* The time to train this RF model was 115 seconds which was almost 2X the time it took to train the model without PCA 

### Management Problem

When buildling ML models, predictive accuracy may come at the cost of development time. During this exercise, we timed various parts of the development process. Therefore, when considering implementing models for computer vision, I would recommend using PCA as a preliminary to machine learning classification. Although the development time was longer when utilizing PCA during this exercise, the accuracy (as measured by the F1 score) was very close (95% to 97%) to the non-PCA model. In particular when it comes to random forest algorithms, they don't tend to perform well on very high dimensional data therefore using PCA would be helpful. I would imagine that for larger data sets than the MNIST data set, using PCA would lead to better results versus not using it.   