<a href="https://colab.research.google.com/github/mrutherfoord/portfolio/blob/master/PCA_and_Random_Forest_with_MNIST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PCA and Random Forest with MNIST

For this project I use the MNIST dataset to compare Random Forests with and without Principal Component Analysis (PCA), to see what happens to the accuracy when I reduce the dimensionality of the data. The data contains 70,000 28x28-pixel images of handwritten digits (0-9) and is divided into a training set of 42,000 instances and a test set of 28,000 instances. Each instance consists of 784 variables, each representing one pixel of the 28x28 image.
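
As a small illustration of that layout, the sketch below reshapes one flat 784-value row into a 28x28 grid. The row is synthetic (just the values 0 to 783), and the row-major ordering (pixel `k` sits at row `k // 28`, column `k % 28`) is an assumption about how the CSV columns are arranged.

```python
import numpy as np

# Synthetic stand-in for one row of pixel columns (not real MNIST data)
flat_row = np.arange(784)

# Assumption: pixels are stored row by row, so a plain reshape recovers the grid
image = flat_row.reshape(28, 28)

# Under that assumption, pixel k lands at row k // 28, column k % 28
k = 100
print(image.shape)
print(image[k // 28, k % 28] == k)
```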

In [0]:
# importing relevant modules
import pandas as pd
import numpy as np

# model building modules
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

# metrics
from sklearn.metrics import accuracy_score

from time import time

import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

# Part 1

In part one, I created a random forest model with 10 estimators and submitted its predictions to Kaggle.com for an accuracy score.

In [0]:
# importing data and separating the training set into X and y dataframes
mnist = pd.read_csv('train.csv')
X = mnist.drop(['label'], axis=1)
y = mnist['label']

# importing test data as dataframe
test = pd.read_csv('test.csv')

In [0]:
# combining data for part 2
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
mnist_combined = pd.concat([X, test], ignore_index=True)
print('Shape of combined data: {}'.format(mnist_combined.shape))
print('-' * 50)

#converting dataframes to array
X_train = X.values
print('Shape of training data (explanatory variables): {}'.format(X_train.shape))
print('-' * 50)

# target stays 1-D, the shape scikit-learn expects for y
y_train = y.values
print('Shape of training data (target): {}'.format(y_train.shape))
print('-' * 50)

X_test = test.values
print('Shape of testing data: {}'.format(X_test.shape))

Shape of combined data: (70000, 784)
--------------------------------------------------
Shape of training data (explanatory variables): (42000, 784)
--------------------------------------------------
Shape of training data (target): (42000,)
--------------------------------------------------
Shape of testing data: (28000, 784)


As stated before, I used 10 estimators for the Random Forest. I also set max_features to 'sqrt'.

In [0]:
# instantiate Classifier
rf_clf = RandomForestClassifier(n_estimators=10, max_features='sqrt', random_state=42)

In [0]:
%%time
#fitting random forest model with timing
rf_clf.fit(X_train, y_train)

Wall time: 5.79 s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [0]:
# predictions for Random Forest
rf_clf.predict(X_test)

array([2, 0, 9, ..., 3, 9, 2], dtype=int64)

In [0]:
# Saving predictions as dataframe
# predicting on the array form, since the model was fit on an array
rf_predictions = pd.DataFrame(rf_clf.predict(X_test), columns=['Label'])
rf_predictions.index += 1

In [0]:
# Exporting dataframe to .csv for Kaggle submission
rf_predictions.to_csv('predictions.csv', index_label='ImageID')

# First Kaggle Submission

Score: .93814
User ID: mrutherfoord

For only 10 estimators, the model was fairly accurate, although there is room for improvement. It also took 5.79 seconds to train the model.

# Parts 2 and 3

For part 2, I used PCA on the entire dataset, preserving 95% of the variance, then fit a new random forest on the reduced dataset. I had assumed that scaling was included in scikit-learn's PCA, but that is not the case: PCA centers each feature but does not scale it to unit variance. I chose to use StandardScaler.
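
The sketch below illustrates that behavior on two synthetic features with very different scales (the data and variable names are made up for this example). It checks that PCA subtracts the per-feature mean, and shows how the first component changes once the features are standardized.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two synthetic, weakly related features on very different scales
small = rng.normal(loc=5.0, scale=1.0, size=500)
large = 50.0 * small + rng.normal(loc=0.0, scale=1000.0, size=500)
X_demo = np.column_stack([small, large])

pca_raw = PCA(n_components=1).fit(X_demo)
# PCA stores the per-feature mean it subtracts before projecting
print(np.allclose(pca_raw.mean_, X_demo.mean(axis=0)))
# Without scaling, the first component points almost entirely along `large`
print(np.abs(pca_raw.components_[0]))

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_scaled = PCA(n_components=1).fit(X_demo_scaled)
# After standardizing, both features carry equal weight in the component
print(np.abs(pca_scaled.components_[0]))
```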

In [0]:
scaler = StandardScaler()

In [0]:
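# Note: each fit_transform call refits the scaler, so the combined, training,
# and test arrays are each standardized with their own means and variances
mnist_combined_scaled = scaler.fit_transform(mnist_combined)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)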
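
A common alternative, sketched here on small synthetic arrays rather than the notebook's data, is to fit the scaler on the training set only and reuse those statistics for the test set. The array names below are placeholders, not variables from this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the training and test pixel arrays
train_demo = rng.integers(0, 256, size=(200, 5)).astype(float)
test_demo = rng.integers(0, 256, size=(50, 5)).astype(float)

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learn mean/std here only
test_scaled = scaler_demo.transform(test_demo)        # reuse the training statistics

# The test set is standardized with the training means, so its own
# column means are generally not exactly zero
print(np.allclose(scaler_demo.mean_, train_demo.mean(axis=0)))
print(test_scaled.mean(axis=0))
```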

In [0]:
# Instantiate pca with 95% variance
pca = PCA(n_components=.95, random_state=42)

In [0]:
%%time
# fitting pca to entire dataset
pca.fit(mnist_combined_scaled)

Wall time: 15.3 s


PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)

In [0]:
print('Number of components left: {}'.format(pca.n_components_))

Number of components left: 332
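
To show where a count like this comes from, the sketch below reproduces the selection by hand on scikit-learn's small bundled digits dataset (8x8 images, used here only as a stand-in for MNIST). It accumulates the explained variance ratio and picks the smallest number of components whose total exceeds 95%.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small 8x8 digits dataset bundled with scikit-learn (a stand-in for MNIST)
X_digits, _ = load_digits(return_X_y=True)

# Fit all components, then accumulate the share of variance each one explains
pca_full = PCA().fit(X_digits)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 95%
k = int(np.argmax(cumulative > 0.95)) + 1

# Letting PCA choose via a fractional n_components should agree
pca_95 = PCA(n_components=0.95).fit(X_digits)
print(k, pca_95.n_components_)
```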


In [0]:
# Mapping PCA to training and testing sets
pca_X_train = pca.transform(X_train_scaled)
pca_X_test = pca.transform(X_test_scaled)

In [0]:
# Checking shape of reduced data
print('Shape of reduced dataset: {}'.format(pca_X_train.shape))

Shape of reduced dataset: (42000, 332)


In [0]:
%%time
# Timing and fitting new random forest model with reduced data
rf_clf.fit(pca_X_train, y_train)

Wall time: 21.2 s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

It is interesting that fitting took almost 4 times as long (21.2 s versus 5.79 s). Researching this on Stack Overflow, I found that this can happen with PCA and random forests for a few reasons. First, the reduction in work is not as large as it looks, since I am using the square root for max_features: sqrt(784) is 28, while sqrt(332) is about 18.22, so each split still considers about 18 candidate features instead of 28. Second, reducing the number of features can make it more difficult for the algorithm to find good splits, which may require more iterations, and thus more time.
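
The arithmetic behind the first point can be checked directly. This sketch assumes the forest truncates the square root to an integer when choosing how many features to consider per split.

```python
import math

n_pixels = 784       # original feature count
n_components = 332   # components kept by PCA in the output above

# Assumption: the square root is truncated to an integer per split
candidates_before = int(math.sqrt(n_pixels))
candidates_after = int(math.sqrt(n_components))
print(candidates_before, candidates_after)

# Features shrink far more than the per-split candidate count does
feature_reduction = 1 - n_components / n_pixels
candidate_reduction = 1 - candidates_after / candidates_before
print('Feature reduction:   {:.0%}'.format(feature_reduction))
print('Candidate reduction: {:.0%}'.format(candidate_reduction))
```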

In [0]:
# New random forest predictions
rf_clf.predict(pca_X_test)

array([2, 0, 4, ..., 3, 9, 2], dtype=int64)

In [0]:
# Creating dataframe of predictions
pca_rf_predictions = pd.DataFrame(rf_clf.predict(pca_X_test), columns=['Label'])
pca_rf_predictions.index += 1

In [0]:
# Exporting .csv for Kaggle submission
pca_rf_predictions.to_csv('pca_predictions.csv', index_label='ImageID')

In [0]:
pca_rf_predictions.head(5)

| | Label |
|---|---|
| 1 | 2 |
| 2 | 0 |
| 3 | 4 |
| 4 | 2 |
| 5 | 3 |


# Second Kaggle Submission

Score: .87028

The accuracy dropped after the feature reduction. This can happen because features with lower variance can still be highly correlated with the target values. For example, if I were building a model for detecting the flu and had body mass and body temperature as features, body mass would have much more variability, but body temperature would be more correlated with the target. (Flu example found on StackExchange.)
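
The flu example can be made concrete with invented numbers. Everything in this sketch is hypothetical: the data is synthetic, the features are left unscaled to exaggerate the effect, and a single-split decision tree stands in for the random forest to keep it small.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: flu status shifts temperature but not body mass
has_flu = rng.integers(0, 2, size=n)
body_mass = rng.normal(70.0, 15.0, size=n)                       # high variance, uninformative
body_temp = 37.0 + 1.5 * has_flu + rng.normal(0.0, 0.3, size=n)  # low variance, informative
X_flu = np.column_stack([body_mass, body_temp])

# Keeping one unscaled component retains the high-variance direction (mass)
X_one_component = PCA(n_components=1).fit_transform(X_flu)

stump = DecisionTreeClassifier(max_depth=1, random_state=0)
acc_pca = cross_val_score(stump, X_one_component, has_flu, cv=5).mean()
acc_temp = cross_val_score(stump, X_flu[:, [1]], has_flu, cv=5).mean()
print('One PCA component: {:.2f}'.format(acc_pca))
print('Temperature alone: {:.2f}'.format(acc_temp))
```

In this toy setup the retained component is essentially body mass, so the classifier trained on it loses most of the signal that temperature carries.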

# Part 4

The problem with part 2 is that I fit the PCA on both the training and testing sets, when it should be fitted only to the training set. Failing to do so can lead to data leakage and an overly optimistic model, because the transformation has now 'seen' data outside the training set that it shouldn't know about.

For this part, I instead fit the PCA only to the training data, then applied it to the testing data. I still used the scaled data and also added 10-fold cross-validation.

In [0]:
%%time
pca_X_train2 = pca.fit_transform(X_train_scaled)

Wall time: 9.7 s


In [0]:
%%time
pca_X_test2 = pca.transform(X_test_scaled)

Wall time: 725 ms


In [0]:
%%time
rf_clf.fit(pca_X_train2, y_train)

Wall time: 18.4 s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

This model took slightly less time than the previous one.

In [0]:
cv_score = cross_val_score(rf_clf, pca_X_train2, y_train, cv=10)
print('Cross Validated Score: {}'.format(cv_score))

Cross Validated Score: [0.86471707 0.86941009 0.86815802 0.86574625 0.87119048 0.87330317
 0.86496785 0.87228973 0.87416587 0.87416587]


It doesn't appear that using only the training set for PCA changes the accuracy much in this case, but fitting PCA on the entire set can create an overly optimistic model that doesn't generalize well.
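
The same concern applies inside cross-validation: if the scaler and PCA are fit once on all training rows, each validation fold has already influenced the transformation. One way to avoid that, sketched here on scikit-learn's bundled digits dataset rather than this notebook's data, is to wrap the steps in a pipeline so they are refit on the training portion of every fold.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small bundled dataset used as a stand-in for MNIST
X_digits, y_digits = load_digits(return_X_y=True)

# Scaling, PCA, and the forest are refit together on each training fold
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, random_state=42),
    RandomForestClassifier(n_estimators=10, max_features='sqrt', random_state=42),
)

fold_scores = cross_val_score(pipeline, X_digits, y_digits, cv=5)
print('Mean cross-validated accuracy: {:.3f}'.format(fold_scores.mean()))
```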

In [0]:
rf_clf.predict(pca_X_test2)

array([2, 0, 2, ..., 3, 9, 2], dtype=int64)

In [0]:
pca_rf_predictions2 = pd.DataFrame(rf_clf.predict(pca_X_test2), columns=['Label'])
pca_rf_predictions2.index += 1

In [0]:
pca_rf_predictions2.to_csv('pca_predictions2.csv', index_label='ImageID')

In [0]:
pca_rf_predictions2.head(5)

| | Label |
|---|---|
| 1 | 2 |
| 2 | 0 |
| 3 | 2 |
| 4 | 4 |
| 5 | 3 |


# Third Kaggle Submission

Score: .86314

This model is slightly less accurate than the one in part two (.86314 versus .87028), which is to be expected since PCA was fit only on the training data.

To see how accuracy improved with more trees, I fit a random forest with 100 estimators on the original, unreduced data.

In [0]:
# instantiate Classifier - 100 
rf_clf_100 = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)

In [0]:
%%time
rf_clf_100.fit(X_train, y_train)

Wall time: 1min 3s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [0]:
rf_clf_100.predict(X_test)

array([2, 0, 9, ..., 3, 9, 2], dtype=int64)

In [0]:
# predicting on the array form, since the model was fit on an array
rf_predictions_100 = pd.DataFrame(rf_clf_100.predict(X_test), columns=['Label'])
rf_predictions_100.index += 1

In [0]:
rf_predictions_100.to_csv('predictions_100.csv', index_label='ImageID')

# Fourth Kaggle Submission

Score: .96585

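Going from 10 to 100 trees raised the score from .93814 to .96585, at the cost of training time rising from 5.79 seconds to about a minute. More trees clearly help this model, although the gains usually level off as the forest grows. Performing a grid search, and potentially stacking with other algorithms, could help improve this model further.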
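
A grid search over the number of trees might look like the sketch below. It runs on scikit-learn's bundled digits dataset to stay fast, and the grid values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small bundled dataset used as a stand-in for MNIST
X_digits, y_digits = load_digits(return_X_y=True)

# Hypothetical grid: only the number of trees is varied here
param_grid = {'n_estimators': [10, 50, 100]}

search = GridSearchCV(
    RandomForestClassifier(max_features='sqrt', random_state=42),
    param_grid,
    cv=3,
)
search.fit(X_digits, y_digits)

# Mean cross-validated accuracy for each candidate forest size
results = search.cv_results_
for n_trees, score in zip(results['param_n_estimators'], results['mean_test_score']):
    print(n_trees, round(float(score), 4))
print('Best setting:', search.best_params_)
```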

# Conclusion

I might recommend PCA for certain algorithms, but I would not recommend using it in conjunction with random forests for this particular dataset. Training time suffered: the random forest took roughly three to four times longer to fit on the reduced data. Also, as mentioned earlier, discarding low-variance components compresses the data, but directions with low variability may still be highly correlated with the target, so removing them can reduce prediction accuracy. PCA can be a good choice, but it depends on the data, and I didn't find it helpful in this case.