# Dimensionality Reduction on the MNIST Dataset

This notebook discusses the importance of dimensionality reduction as a preprocessing step. 

Before going further, we should define dimensionality and explain why it matters. In simple terms, it is the number of columns (features) in the dataset, and it has significant downstream effects on the models trained on that data. At the extreme, the “curse of dimensionality” describes how some methods stop working well in high-dimensional spaces; for example, distances between points become less informative. Even in relatively low-dimensional problems, a dataset with more dimensions generally requires a model with more parameters, and that in turn requires more rows to learn those parameters reliably. If the number of rows is fixed, adding dimensions that carry no new information can hurt the accuracy of the final model.
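
To make the distance problem concrete, here is a minimal sketch on uniform random points (not MNIST). The helper name `distance_contrast` and the point counts are illustrative assumptions. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest neighbour of one query.

    Hypothetical helper for illustration: points are drawn uniformly from
    the unit hypercube, which is an assumption and not the MNIST data.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of dimensions grows.
for d in (2, 10, 100, 784):
    print(d, round(distance_contrast(1000, d), 3))
```

When the contrast is small, "nearest" and "farthest" neighbours are almost equally far away, which is one reason distance-based models such as KNN can struggle in high dimensions.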

### Dataset used for this activity:
The MNIST dataset is composed of 28x28 pixel images of handwritten digits from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

The training data set (train.csv) has 785 columns. The first column, called "label", is the digit that was drawn by the user. The remaining 784 columns contain the pixel-values of the associated image.
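
As a small illustration of this layout, the sketch below splits one row into its label and a 28x28 image. The row is made up for the example, and `row_to_image` is a hypothetical helper. It assumes the row-major pixel order described on the Kaggle data page, where pixel k sits at row k // 28 and column k % 28.

```python
import numpy as np

def row_to_image(row):
    """Split one 785-value row into (label, 28x28 pixel array).

    Hypothetical helper; assumes the label comes first, followed by
    784 pixel values in row-major order.
    """
    row = np.asarray(row)
    return int(row[0]), row[1:].reshape(28, 28)

# Made-up row for illustration: label 7 and a single dark pixel.
row = np.zeros(785, dtype=np.int64)
row[0] = 7
row[1 + 5 * 28 + 3] = 255   # pixel at image row 5, column 3

label, image = row_to_image(row)
print(label, image.shape, image[5, 3])
```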

References: 
https://www.kaggle.com/c/digit-recognizer/data
http://www.eggie5.com/69-dimensionality-reduction-using-pca

In [None]:
import pandas as pd
import numpy as np
import time

from sklearn.model_selection import train_test_split

train = pd.read_csv('data/mnist_train.csv')

# Separate labels from the data
y = train['label']
# Drop the label feature
X = train.drop("label",axis=1)

# Split the data into training and test sets in an 80:20 ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Train data shape : " + str(X_train.shape))
print("Test data shape : " + str(X_test.shape))
X_train.head()

Train data shape : (33600, 784)
Test data shape : (8400, 784)


Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
34941,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
24433,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
24432,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8832,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
30291,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Applying KNN classifier on MNIST dataset without applying Dimensionality Reduction

Let’s take a look at how long it takes to fit a KNN classifier on the raw 784-dimensional data and predict the test set:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)
start = time.time()
clf.fit(X_train, y_train)
# Use a separate name so the label Series `y` is not overwritten
y_pred = clf.predict(X_test)

# Calculate error in prediction
errors = (y_test != y_pred).sum()
total = X_test.shape[0]
error_rate_without_dr = (errors/float(total)) * 100
print("Error rate without dimensionality reduction: %d/%d * 100 = %f" % (errors,total,error_rate_without_dr))

# Elapsed wall-clock time in seconds, covering both fit and predict
end = time.time()
duration_without_dr = end-start
print("Time taken to train a KNN Classifier without DR: %d" %duration_without_dr)
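
The cell above wraps fitting, prediction, and the error calculation in a single timer. For a nearest-neighbour model, most of the work usually happens at prediction time, so it can help to time the two phases separately. The sketch below shows one way to do that. `timed_fit_predict` and `BruteForce1NN` are hypothetical names, and the tiny NumPy classifier is only a stand-in so the example runs on its own; any object with `fit` and `predict`, such as `clf`, could be passed instead.

```python
import time
import numpy as np

class BruteForce1NN:
    """Tiny stand-in 1-nearest-neighbour classifier (illustration only)."""

    def fit(self, X, y):
        # "Training" just stores the data.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared Euclidean distance from every query to every stored sample.
        d2 = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=-1)
        return self.y_[d2.argmin(axis=1)]

def timed_fit_predict(clf, X_train, y_train, X_test, y_test):
    """Return fit time, predict time (seconds) and error rate (percent)."""
    t0 = time.perf_counter()
    clf.fit(X_train, y_train)
    t1 = time.perf_counter()
    y_pred = clf.predict(X_test)
    t2 = time.perf_counter()
    error_rate = float(np.mean(np.asarray(y_test) != y_pred)) * 100
    return {"fit_s": t1 - t0, "predict_s": t2 - t1, "error_rate_pct": error_rate}

# Toy data (an assumption, not MNIST): two well-separated clusters.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
y_toy = np.array([0] * 20 + [1] * 20)
result = timed_fit_predict(BruteForce1NN(), X_toy[::2], y_toy[::2],
                           X_toy[1::2], y_toy[1::2])
print(result)
```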

### Applying PCA transform

PCA produces a set of linear combinations of the original features (the principal components), which we use to transform the original dataset into a lower-dimensional one. Here we keep the first 150 components.
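
The notebook does not say how the number of components was chosen. One common approach, sketched here with NumPy on synthetic data, is to keep the smallest number of components whose cumulative explained variance reaches a target. The function name `components_for_variance` and the 95% threshold are assumptions for illustration. With scikit-learn, the same information is available from the fitted model's `explained_variance_ratio_` attribute.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold` (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    s = np.linalg.svd(X_centered, compute_uv=False)
    variances = s ** 2
    cumulative = np.cumsum(variances) / variances.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

# Synthetic data (an assumption): 20 features driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

k_demo = components_for_variance(X_demo, threshold=0.95)
print(k_demo)
```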

In [52]:
from sklearn.decomposition import PCA
pca = PCA(n_components=150)
pca.fit(X_train)

PCA(copy=True, iterated_power='auto', n_components=150, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [53]:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

Let’s take a look at how long it takes to fit the KNN classifier and predict the test set after applying PCA to the MNIST dataset:

In [54]:
start = time.time()
clf.fit(X_train_pca, y_train)
# Use a separate name so the label Series `y` is not overwritten
y_pred = clf.predict(X_test_pca)

errors = (y_test != y_pred).sum()
total = X_test_pca.shape[0]
error_rate_with_pca = (errors/float(total)) * 100
print("Error rate with PCA: %d/%d * 100 = %f" % (errors, total, error_rate_with_pca))

# Elapsed wall-clock time in seconds, covering both fit and predict
end = time.time()
duration_with_pca = end-start
print("Time taken to train a KNN Classifier with PCA: %d" %duration_with_pca)

Error rate with PCA: 248/8400 * 100 = 2.952381
Time taken to train a KNN Classifier with PCA: 77


### Applying SVD transform

Singular Value Decomposition (SVD) is a matrix factorization technique that factors a matrix M into three matrices U, Σ, and V, such that M = UΣVᵀ. This is closely related to PCA: SVD factors the data matrix itself, whereas PCA is defined through the eigendecomposition of the covariance matrix. In practice, SVD is typically used under the hood to find the principal components of a matrix. Note that scikit-learn's TruncatedSVD, used below, does not center the data before factorizing, so its components are not guaranteed to match those of PCA.
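
The relationship between the two routes can be checked numerically. The sketch below uses small synthetic data (an assumption, not MNIST) to compare the eigendecomposition of the covariance matrix with the SVD of the centered data matrix. It also shows that an SVD of the uncentered data gives different directions. That case is relevant because scikit-learn's `TruncatedSVD` does not center its input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5

# Toy data: columns with distinct scales, randomly rotated, with a nonzero mean.
scales = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
Q, _ = np.linalg.qr(rng.normal(size=(n_features, n_features)))
X = (rng.normal(size=(n_samples, n_features)) * scales) @ Q.T + 10.0
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (n_samples - 1)
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order
order = np.argsort(eigvals)[::-1]               # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = s ** 2 / (n_samples - 1)

variances_match = np.allclose(eigvals, svd_variances)
# Directions agree up to sign, so compare absolute inner products.
directions_match = np.allclose(np.abs(eigvecs.T @ Vt.T),
                               np.eye(n_features), atol=1e-6)

# SVD without centering: the mean dominates, so directions differ.
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)
raw_directions_match = np.allclose(np.abs(eigvecs.T @ Vt_raw.T),
                                   np.eye(n_features), atol=1e-6)

print(variances_match, directions_match, raw_directions_match)
```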

In [55]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=150)
svd.fit(X_train)

X_train_svd = svd.transform(X_train)
X_test_svd = svd.transform(X_test)

start = time.time()
clf.fit(X_train_svd, y_train)
# Use a separate name so the label Series `y` is not overwritten
y_pred = clf.predict(X_test_svd)

errors = (y_test != y_pred).sum()
total = X_test_svd.shape[0]
error_rate_with_svd = (errors/float(total)) * 100
print("Error rate with SVD: %d/%d * 100 = %f" % (errors, total, error_rate_with_svd))

# Elapsed wall-clock time in seconds, covering both fit and predict
end = time.time()
duration_with_svd = end-start
print("Time taken to train a KNN Classifier with SVD: %d" %duration_with_svd)

Error rate with SVD: 247/8400 * 100 = 2.940476
Time taken to train a KNN Classifier with SVD: 75


### Applying Random Projections

The sklearn.random_projection module implements a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes. 
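
To show why a random projection can work at all, here is a minimal NumPy sketch of a dense Gaussian random projection. This is a swapped-in variant for illustration: the notebook itself uses `SparseRandomProjection`, and the function name `gaussian_random_projection` and the random toy data are assumptions. The sketch checks how well pairwise distances survive a projection from 784 to 150 dimensions.

```python
import numpy as np

def gaussian_random_projection(X, n_components, seed=0):
    """Project X onto n_components random directions (hypothetical helper).

    Entries are drawn from N(0, 1/n_components) so that squared
    distances are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=1.0 / np.sqrt(n_components),
                   size=(X.shape[1], n_components))
    return X @ R

def pairwise_distances(A):
    """Euclidean distance between every pair of rows in A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy data (an assumption): 50 random points with 784 features.
rng = np.random.default_rng(1)
X_demo = rng.random((50, 784))
X_proj = gaussian_random_projection(X_demo, 150)

# Ratio of projected to original distance for each distinct pair.
upper = np.triu_indices(len(X_demo), k=1)
ratio = pairwise_distances(X_proj)[upper] / pairwise_distances(X_demo)[upper]
print(X_proj.shape, ratio.min(), ratio.mean(), ratio.max())
```

Ratios close to 1 mean the projection kept the distances roughly intact. The spread around 1 is the "additional variance" that is traded for the smaller representation.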



In [56]:
from sklearn import random_projection

rp = random_projection.SparseRandomProjection(n_components=150, random_state=19)
rp.fit(X_train)

X_train_rp = rp.transform(X_train)
X_test_rp = rp.transform(X_test)

start = time.time()
clf.fit(X_train_rp, y_train)
# Use a separate name so the label Series `y` is not overwritten
y_pred = clf.predict(X_test_rp)

errors = (y_test != y_pred).sum()
total = X_test_rp.shape[0]
error_rate_with_rp = (errors/float(total)) * 100
print("Error rate with Random Projection: %d/%d * 100 = %f" % (errors, total, error_rate_with_rp))

# Elapsed wall-clock time in seconds, covering both fit and predict
end = time.time()
duration_with_rp = end-start
print("Time taken to train a KNN Classifier with Random Projection: %d" %duration_with_rp)

Error rate with Random Projection: 373/8400 * 100 = 4.440476
Time taken to train a KNN Classifier with Random Projection: 126


### Results

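In the recorded runs above, PCA and SVD with 150 components gave nearly identical results: error rates of 2.95% and 2.94%, with fit-plus-predict times of 77 and 75 seconds. Sparse random projection to the same 150 dimensions was less accurate (4.44% error) and slower (126 seconds). The output of the baseline cell without dimensionality reduction is not recorded here, so run that cell and compare `error_rate_without_dr` and `duration_without_dr` against these numbers. In general, fewer dimensions mean less work per distance computation, which is why reduced dimensionality tends to make KNN faster.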
