# Digit Recognition with PCA and Random Forests

This notebook uses principal component analysis (PCA) together with a random forest to recognize hand-drawn digits from 0 to 9. A successful model could be used to read handwritten addresses for mail sorting or to identify forms by name.

The major design flaw in the proposed experiment is leakage between the training and test sets: the instructions tell us to combine the two before running PCA, so the components would be chosen using information from the test data. Instead, PCA should be fit on the training set alone to determine which components carry the most information, and the test set should then be projected onto those components.
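One way to avoid this leakage is to put PCA inside a scikit-learn `Pipeline`, so that the components are learned only when `fit` is called on the training data. The sketch below is illustrative rather than the notebook's actual model: the synthetic arrays, the `_demo` names, the 95% variance threshold, and the forest size are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 200 training and 50 test "images" of 64 pixels each.
rng = np.random.default_rng(0)
X_train_demo = rng.integers(0, 256, size=(200, 64)).astype(float)
y_train_demo = rng.integers(0, 10, size=200)
X_test_demo = rng.integers(0, 256, size=(50, 64)).astype(float)

# n_components=0.95 keeps enough components to explain 95% of the variance
# (an illustrative threshold, not a value taken from the notebook).
pca_rf = Pipeline([
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# fit() learns the principal components from the training rows only.
pca_rf.fit(X_train_demo, y_train_demo)

# predict() projects the test rows onto those same components;
# PCA is never refit on the test data.
demo_predictions = pca_rf.predict(X_test_demo)
n_components_kept = pca_rf.named_steps["pca"].n_components_
```

Because the projection is part of the pipeline, the same object can be passed to a cross-validated search, in which case PCA is refit on the training portion of each fold.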

In [20]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier


## Ingestion

In [4]:
train = pd.read_csv('https://raw.githubusercontent.com/jhancuch/pca-random-forest-digit-recognizer/main/data/train.csv')

In [5]:
test = pd.read_csv('https://raw.githubusercontent.com/jhancuch/pca-random-forest-digit-recognizer/main/data/test.csv')

In [6]:
train.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
test.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
train.columns

Index(['label', 'pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5',
       'pixel6', 'pixel7', 'pixel8',
       ...
       'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779',
       'pixel780', 'pixel781', 'pixel782', 'pixel783'],
      dtype='object', length=785)

In [10]:
test.columns

Index(['pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5', 'pixel6',
       'pixel7', 'pixel8', 'pixel9',
       ...
       'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779',
       'pixel780', 'pixel781', 'pixel782', 'pixel783'],
      dtype='object', length=784)

## EDA

In [12]:
train['label'].value_counts()

1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64

## Model Development

We will develop three models: a random forest, a random forest with PCA, and K-Means clustering.
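K-Means is unsupervised, so it needs an extra step before its output can be compared with the two forests. One common approach, sketched here as an assumption about how it might be done rather than as this notebook's method, is to map each cluster to the most frequent training label among its members. The blob data and every variable name below are hypothetical stand-ins for the digit data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): three well-separated 2-D blobs
# instead of ten 784-pixel digit classes.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
y_demo = np.repeat(np.arange(3), 50)
X_demo = centers[y_demo] + rng.normal(scale=0.5, size=(150, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_demo)

# Map each cluster id to the most common true label among its members.
cluster_to_label = {
    c: np.bincount(y_demo[cluster_ids == c]).argmax()
    for c in np.unique(cluster_ids)
}

# Classify by assigning each row to its nearest centroid, then looking up the label.
kmeans_predictions = np.array([cluster_to_label[c] for c in kmeans.predict(X_demo)])
kmeans_accuracy = (kmeans_predictions == y_demo).mean()
```

On real digit images the clusters will not line up this cleanly with the classes, so the accuracy of such a mapping should be expected to fall well below that of a supervised model.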

In [17]:
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]

### Random Forest

In [27]:
rf = Pipeline([("scaler", StandardScaler()), ("model", RandomForestClassifier())])

In [None]:
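%%time
# The cell magic %%time times the whole cell. The line magic %time on its own line
# times an empty statement, which is why the output recorded below is in microseconds.
rf.fit(X_train, y_train)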

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.53 µs


In [None]:
rf_predictions = rf.predict(test)

In [22]:
rf_search_pipe = Pipeline([("scaler", StandardScaler()), ("model", RandomForestClassifier())])

In [25]:
rf_params = {'model__criterion': ['gini', 'entropy'],
             'model__class_weight': ['balanced', 'balanced_subsample', None],
             'model__n_estimators': [100, 500, 1000]}

In [26]:
rf_search = GridSearchCV(estimator=rf_search_pipe, param_grid=rf_params, verbose=3, scoring='accuracy', n_jobs=3)
rf_search.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


KeyboardInterrupt: 
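The exhaustive grid above needs 90 fits and was interrupted before finishing. A cheaper alternative, which the already-imported `RandomizedSearchCV` makes possible, is to sample a fixed number of candidates from the same parameter space. The sketch below is a minimal illustration under stated assumptions: the synthetic data, the reduced `n_estimators` values, `n_iter=4`, and `cv=3` are placeholders chosen so the example runs quickly, not tuned settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 120 rows, 16 features, 3 classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 16))
y_demo = rng.integers(0, 3, size=120)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Same parameter names as the grid search, with smaller forests for the demo.
param_distributions = {
    "model__criterion": ["gini", "entropy"],
    "model__class_weight": ["balanced", "balanced_subsample", None],
    "model__n_estimators": [10, 20, 30],
}

# n_iter caps the number of sampled candidates: 4 candidates x 3 folds = 12 fits
# instead of fitting every combination in the grid.
random_search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X_demo, y_demo)

n_candidates_tried = len(random_search.cv_results_["params"])
```

The trade-off is that a random sample may miss the best combination, so `n_iter` is a budget to adjust against the available compute time.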

In [None]:
print("Best Score: {0}".format(rf_search.best_score_))
print("Params of Best Score: {0}".format(rf_search.best_params_))

In [None]:
rf_pipe = Pipeline([("scaler", StandardScaler()), ("model", RandomForestClassifier())])


In [None]:
%%time
# Fit the pipeline defined in the previous cell and predict on the test set.
rf_pipe.fit(X_train, y_train)
rf_predictions = rf_pipe.predict(test)

In [None]:


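from sklearn.ensemble import GradientBoostingClassifier

# loss='exponential' only supports binary targets, so the default loss is used for the ten digit classes.
gbt = Pipeline([("scaler", StandardScaler()), ("model", GradientBoostingClassifier(learning_rate=0.1, max_depth=4, max_features='sqrt', n_estimators=200))])

# No validation split (X_validation) is defined in this notebook, so predict on the test set instead.
gbt.fit(X_train, y_train)
gbt_predictions = gbt.predict(test)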

In [None]:

             
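# gbt_params is not defined anywhere in this notebook; this small grid is a placeholder assumption.
gbt_params = {'model__learning_rate': [0.05, 0.1], 'model__max_depth': [3, 4]}
# Plain 'roc_auc' scoring does not support multiclass targets, so accuracy is used as in the random forest search.
gbt_search = GridSearchCV(estimator=gbt, param_grid=gbt_params, verbose=1, scoring='accuracy', n_jobs=3)
gbt_search.fit(X_train, y_train)

print("Best parameter (CV score=%0.3f):" % gbt_search.best_score_)
print(gbt_search.best_params_)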
