# Ramia_Assignment5

**Introduction:** For this assignment we will develop a classifier that predicts which of the 10 digits each image in the MNIST dataset shows. We will use two separate approaches for this study: (1) a random forest classifier trained on the full set of features, and (2) principal component analysis (PCA) followed by a random forest classifier trained on the principal components. We will compare test set performance across the two modeling approaches and evaluate the time each one requires.

The results of this study show that the random forest classifier trained on the full feature set outperformed the classifier trained on the decomposed data in both training time and per-class F1 score. This may be characteristic of the dataset, the model, and the training algorithm rather than of the approach itself. In future studies I recommend trying both approaches to test how they perform in those particular circumstances.

Load the MNIST dataset and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing).

In [2]:
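import numpy as np

try:
    # scikit-learn >= 0.20: download MNIST from OpenML
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1)
    # OpenML returns the labels as strings; cast them to integers
    mnist.target = mnist.target.astype(np.int64)
except ImportError:
    # Older scikit-learn releases: fall back to the legacy loader
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')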

In [5]:
X_train = mnist['data'][:60000]
y_train = mnist['target'][:60000]

X_test = mnist['data'][60000:]
y_test = mnist['target'][60000:]

Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set.

In [6]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(max_features='sqrt',
                                 bootstrap=True,
                                 n_estimators=10,
                                 random_state=1)

In [32]:
import time

t0 = time.time()
rnd_clf.fit(X_train, y_train)
t1 = time.time()
td = t1 - t0

In [33]:
print("Training took {:.2f}s".format(td))

Training took 3.65s
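The start/stop timing pattern used above recurs for every timed step in this notebook. As a minimal sketch, it could be wrapped in a small helper. The name `timed` is hypothetical, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds).

    `timed` is a hypothetical helper, not part of scikit-learn.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch runs without the MNIST download
total, seconds = timed(sum, range(1_000_000))
print("Summing took {:.4f}s".format(seconds))
```

With the objects defined earlier, `_, td = timed(rnd_clf.fit, X_train, y_train)` would take the same measurement as the cell above.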


In [30]:
from sklearn.metrics import f1_score
y_pred = rnd_clf.predict(X_test)
f1_score(y_test, y_pred, average=None)

array([0.96455317, 0.98989011, 0.94696608, 0.9261811 , 0.94613821,
       0.92997199, 0.96620908, 0.9546798 , 0.92909281, 0.92607393])
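Passing `average=None` returns one F1 score per digit rather than a single number. To make that concrete, here is a small sketch that computes per-class F1 by hand on a toy label set. The function name `per_class_f1` and the toy labels are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes):
    """F1 = 2 * TP / (2 * TP + FP + FN), computed separately for each class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # correctly predicted as c
        fp = np.sum((y_pred == c) & (y_true != c))  # wrongly predicted as c
        fn = np.sum((y_pred != c) & (y_true == c))  # true c that was missed
        denom = 2 * tp + fp + fn
        # Convention used here: a class with no true or predicted samples scores 0
        scores[c] = 2 * tp / denom if denom > 0 else 0.0
    return scores

# Toy labels for three classes (illustrative only)
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_scores = per_class_f1(toy_true, toy_pred, n_classes=3)
print(toy_scores)
```

Each entry of the result plays the same role as one entry of the array printed above: position 0 is the score for digit 0, position 1 for digit 1, and so on.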

Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.

In [34]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
t2 = time.time()
X_train_reduced = pca.fit_transform(X_train)
t3 = time.time()
td2 = t3 - t2

In [35]:
print("Decomposition took {:.2f}s".format(td2))

Decomposition took 5.12s
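A fractional `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The sketch below reproduces that selection rule with plain NumPy on synthetic data, so it runs without the MNIST download. `components_for_variance` and the synthetic matrix are assumptions made for illustration, and the code is not scikit-learn's implementation.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Return (count, cumulative_ratios), where count is the smallest number
    of principal components whose cumulative explained variance ratio
    exceeds `threshold`."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    # argmax returns the first index where the condition is True
    count = int(np.argmax(cumulative > threshold)) + 1
    return count, cumulative

rng = np.random.RandomState(1)
# Synthetic data: 3 strong latent directions spread over 20 features, plus small noise
latent = rng.normal(size=(500, 3)) * np.array([10.0, 5.0, 2.0])
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 20))

n_kept, cumulative = components_for_variance(X_demo, threshold=0.95)
print("Components kept:", n_kept, "out of", X_demo.shape[1])
```

On the fitted object from the cell above, `pca.n_components_` reports how many components were actually kept, and `pca.explained_variance_ratio_` gives the per-component ratios.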


Train a new Random Forest classifier on the reduced dataset and see how long it takes.

In [36]:
rnd_clf2 = RandomForestClassifier(max_features='sqrt',
                                  bootstrap=True,
                                  n_estimators=10,
                                  random_state=1)
t4 = time.time()
rnd_clf2.fit(X_train_reduced, y_train)
t5 = time.time()
td3 = t5 - t4

In [37]:
print("Training took {:.2f}s".format(td3))
print("Total took {:.2f}s".format(td2 + td3))

Training took 6.98s
Total took 12.10s


Training actually takes nearly twice as long now (6.98s versus 3.65s). Including the decomposition, the total of 12.10s is more than three times as long as training a random forest classifier on the raw data alone.
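These ratios can be checked directly from the timings printed above. The variable names below are illustrative; the numbers are copied from the cell outputs.

```python
# Timings in seconds, copied from the outputs above
raw_fit = 3.65       # random forest on all 784 features
pca_fit = 5.12       # PCA fit_transform on the training set
reduced_fit = 6.98   # random forest on the PCA-reduced features

fit_ratio = reduced_fit / raw_fit
total_ratio = (pca_fit + reduced_fit) / raw_fit

print("Fit on reduced data vs raw fit: {:.2f}x".format(fit_ratio))    # 1.91x
print("PCA + fit vs raw fit:           {:.2f}x".format(total_ratio))  # 3.32x
```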

Dimensionality reduction does not always lead to faster training time. It depends on the dataset, the model and the training algorithm.

Next, evaluate the new random forest classifier on the test set to compare to the previous classifier.

In [38]:
X_test_reduced = pca.transform(X_test)

y_pred = rnd_clf2.predict(X_test_reduced)
f1_score(y_test, y_pred, average=None)

array([0.92398815, 0.96858639, 0.88676541, 0.8647619 , 0.87227723,
       0.84306987, 0.9199157 , 0.9044335 , 0.84677856, 0.86036961])

It is common for performance to drop slightly when reducing dimensionality, because some useful signal is lost in the process. Here, however, the drop is substantial: every digit's F1 score falls, and the macro average goes from about 0.948 to about 0.889. So PCA did not help in this case: it slowed down training and reduced performance.
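To quantify the drop, the two per-class F1 arrays printed above can be compared directly. This is a small sketch using the reported values; the variable names are illustrative.

```python
import numpy as np

# Per-class F1 scores copied from the two evaluation outputs (digits 0-9)
f1_full = np.array([0.96455317, 0.98989011, 0.94696608, 0.9261811, 0.94613821,
                    0.92997199, 0.96620908, 0.9546798, 0.92909281, 0.92607393])
f1_pca = np.array([0.92398815, 0.96858639, 0.88676541, 0.8647619, 0.87227723,
                   0.84306987, 0.9199157, 0.9044335, 0.84677856, 0.86036961])

# Positive values mean the PCA-based model scored lower on that digit
drop = f1_full - f1_pca

print("Macro F1, full features: {:.3f}".format(f1_full.mean()))
print("Macro F1, PCA features:  {:.3f}".format(f1_pca.mean()))
print("Largest drop:  digit {} ({:.3f})".format(drop.argmax(), drop.max()))
print("Smallest drop: digit {} ({:.3f})".format(drop.argmin(), drop.min()))
```

With these numbers, digit 5 loses the most and digit 1 the least, while every digit scores lower after PCA.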

**Conclusion:** Although PCA did not help in this particular case, future studies should still try both approaches to assess how they perform on a new dataset, model, and training algorithm. Unsupervised learning is an iterative process that requires exploring the results. The performance of this particular application of PCA should not be taken as representative of all cases; the merits of PCA are worth evaluating for each task at hand.