# Digit Recognition with PCA and Random Forests

The purpose of this notebook is to use PCA to develop a random forest model that recognizes hand-drawn digits from 0 to 9. A successful model could be used to read handwritten addresses for mail sorting or to identify forms by their names.

The major design flaw in the proposed experiment is leakage between the training and test sets: the instructions tell us to combine the two before conducting principal component analysis. Instead, PCA should be fit on the training set alone to determine which components carry the most information, and the fitted transformation should then be applied to the test set.
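A minimal sketch of the leakage-free alternative described above. It assumes scikit-learn's small built-in `load_digits` data (8x8 images) as a stand-in for the competition files, and the names `leak_free` and `holdout_accuracy` are illustrative. Placing the scaler and PCA inside a `Pipeline` means both are fit only on the rows passed to `fit`.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits stands in for the notebook's train.csv.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaler and PCA are fit on the training rows only when fit() is called;
# the held-out rows are merely transformed at score/predict time.
leak_free = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("model", RandomForestClassifier(random_state=0)),
])
leak_free.fit(X_tr, y_tr)

holdout_accuracy = leak_free.score(X_te, y_te)
print(f"components kept: {leak_free.named_steps['pca'].n_components_}")
print(f"hold-out accuracy: {holdout_accuracy:.3f}")
```

Because the held-out rows never influence the scaler or the principal components, the hold-out accuracy is a fairer estimate of performance on unseen digits.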

In [27]:
import pandas as pd
import numpy as np

import os

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score


## Ingestion

In [11]:
train = pd.read_csv('https://raw.githubusercontent.com/jhancuch/pca-random-forest-digit-recognizer/main/data/train.csv')

In [12]:
test = pd.read_csv('https://raw.githubusercontent.com/jhancuch/pca-random-forest-digit-recognizer/main/data/test.csv')

In [13]:
train.head()

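| | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |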


In [8]:
test.head()

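| | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |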


In [9]:
train.columns

Index(['label', 'pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5',
       'pixel6', 'pixel7', 'pixel8',
       ...
       'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779',
       'pixel780', 'pixel781', 'pixel782', 'pixel783'],
      dtype='object', length=785)

In [10]:
test.columns

Index(['pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5', 'pixel6',
       'pixel7', 'pixel8', 'pixel9',
       ...
       'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779',
       'pixel780', 'pixel781', 'pixel782', 'pixel783'],
      dtype='object', length=784)

In [37]:
len(train)

42000

In [36]:
len(test)

28000

## EDA

In [12]:
train['label'].value_counts()

1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64

## Model Development

We will develop three models: a random forest, a random forest with PCA, and K-Means clustering.

In [14]:
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]

In [15]:
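# Write submission files into ./submissions, creating the folder if it is missing
os.makedirs('submissions', exist_ok=True)
os.chdir('submissions')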

### Random Forest

In [8]:
rf = Pipeline([("scaler", StandardScaler()), ("model", RandomForestClassifier())])

In [11]:
%%time
rf.fit(X_train, y_train)
rf_predictions = rf.predict(test)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.34 µs


In [12]:
rf_predictions

array([2, 0, 9, ..., 3, 9, 2])

In [40]:
# ImageId is 1-based, one row per test image
rf_sub = pd.DataFrame({'ImageId': range(1, len(rf_predictions) + 1), 'Label': rf_predictions})
rf_sub.to_csv('rf submission.csv', index = False)

### Random Forest with PCA

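The assignment instructions require combining the train and test sets before running PCA and then splitting them apart again, which is the source of the leakage noted in the introduction. The values must also be standardized first because PCA is sensitive to feature scale.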

In [56]:
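# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
combined = pd.concat([X_train, test], ignore_index=True)
combined_scaled = StandardScaler().fit_transform(combined)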

In [57]:
%%time
pca = PCA(n_components=0.95)
pca.fit(combined_scaled)
combined_reduced = pca.transform(combined_scaled)

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 23.8 µs


In [58]:
pca.n_components_

332
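Passing a float such as `n_components=0.95` asks PCA to keep the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The sketch below reproduces that selection by hand; it assumes scikit-learn's built-in `load_digits` data as a stand-in for the notebook's pixels, and `k_manual` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits (64 features) stands in for the 784-pixel data.
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with every component and accumulate the variance ratios.
full_pca = PCA().fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest component count whose cumulative ratio exceeds 0.95.
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let PCA choose the count itself for comparison.
auto_pca = PCA(n_components=0.95).fit(X_scaled)
print(k_manual, auto_pca.n_components_)
```

Plotting `cumulative` against the component index is a common way to judge whether 0.95 is a sensible threshold or whether far fewer components would do.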

In [64]:
X_train_pca = combined_reduced[:len(X_train), :]

In [66]:
test_pca = combined_reduced[len(X_train):, :]

In [67]:
rf_pca = RandomForestClassifier()

In [68]:
%%time
rf_pca.fit(X_train_pca, y_train)
rf_pca_predictions = rf_pca.predict(test_pca)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.63 µs


In [69]:
rf_pca_predictions

array([2, 0, 9, ..., 3, 9, 2])

In [70]:
# ImageId is 1-based, one row per test image
rf_pca_sub = pd.DataFrame({'ImageId': range(1, len(rf_pca_predictions) + 1), 'Label': rf_pca_predictions})
rf_pca_sub.to_csv('rf pca submission.csv', index = False)

### K-Means Clustering

In [22]:
scaler = StandardScaler()

In [23]:
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(test)

In [24]:
potential_k = list(range(10, 41, 1))
n_k = []
score = []

In [29]:
for k in potential_k:
    kmeans = MiniBatchKMeans(n_clusters=k)
    preds = kmeans.fit_predict(X_train_scaled)
    score.append(silhouette_score(X_train_scaled, preds))
    n_k.append(k)
    

In [30]:
pd.DataFrame({'Number of K': n_k, 'Silhouette Score': score})  

| | Number of K | Silhouette Score |
|---|---|---|
| 0 | 10 | 0.005672 |
| 1 | 10 | 0.015827 |
| 2 | 11 | -0.013604 |
| 3 | 12 | -0.021152 |
| 4 | 13 | -0.045937 |
| 5 | 14 | -0.020939 |
| 6 | 15 | -0.007578 |
| 7 | 16 | -0.022533 |
| 8 | 17 | -0.048663 |
| 9 | 18 | -0.01139 |
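The silhouette score compares every point with every other point, so scoring all 42,000 rows for each candidate k is expensive. A hedged sketch of a cheaper sweep follows: `silhouette_score` accepts `sample_size` and `random_state` to score a random subsample. The built-in `load_digits` data, the range of k, and the sample size of 500 are all assumptions chosen for illustration.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits stands in for the scaled training pixels.
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Build the result list in the same cell as the loop so that re-running
# the loop cannot leave stale entries from an earlier run.
results = []
for k in range(8, 13):
    km = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0)
    labels = km.fit_predict(X_scaled)
    # Score a random subsample of 500 points instead of the full data.
    score = silhouette_score(X_scaled, labels, sample_size=500, random_state=0)
    results.append((k, score))

for k, score in results:
    print(k, round(score, 4))
```

Scores near zero, like those in the table above, suggest heavily overlapping clusters, so the choice of k is weakly supported either way.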


In [34]:
%%time
kmeans = MiniBatchKMeans(n_clusters=10).fit(X_train_scaled)
# predict() returns arbitrary cluster indices (0-9), not digit labels
kmean_predictions = kmeans.predict(test_scaled)

CPU times: user 1.65 s, sys: 78.8 ms, total: 1.73 s
Wall time: 342 ms


In [35]:
kmean_predictions

array([7, 4, 1, ..., 2, 3, 7], dtype=int32)

In [36]:
# ImageId is 1-based, one row per test image
kmeans_sub = pd.DataFrame({'ImageId': range(1, len(kmean_predictions) + 1), 'Label': kmean_predictions})
kmeans_sub.to_csv('kmeans submission.csv', index = False)
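K-Means is unsupervised, so the integers returned by `kmeans.predict` are arbitrary cluster indices rather than digit labels; cluster 7 need not contain sevens. One possible way to turn them into digit predictions, sketched here and not part of the original notebook, is to map each cluster to the most common training label among its members. The built-in `load_digits` data and the names `cluster_to_digit` and `mapped_accuracy` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits stands in for the competition data.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

scaler = StandardScaler().fit(X_tr)
km = MiniBatchKMeans(n_clusters=10, n_init=3, random_state=0)
km.fit(scaler.transform(X_tr))

# Assign each training row to its nearest centroid.
train_clusters = km.predict(scaler.transform(X_tr))

# Map every cluster index to the majority digit among its training members.
# An empty cluster (unlikely) keeps the default of 0.
cluster_to_digit = np.zeros(km.n_clusters, dtype=int)
for c in range(km.n_clusters):
    members = y_tr[train_clusters == c]
    if members.size:
        cluster_to_digit[c] = np.bincount(members, minlength=10).argmax()

# Translate held-out cluster indices into digit predictions.
digit_predictions = cluster_to_digit[km.predict(scaler.transform(X_te))]
mapped_accuracy = (digit_predictions == y_te).mean()
print(f"accuracy after majority mapping: {mapped_accuracy:.3f}")
```

Several clusters can map to the same digit, which leaves some digits never predicted; using more clusters than classes is one common way to soften that.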