**Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%.
Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next evaluate the classifier on the test set: how does it compare to the previous classifier?**

### Load the MNIST dataset and split it into a training set and a test set

In [1]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

from sklearn.model_selection import train_test_split

# train_test_split shuffles by default, so this is a random 60,000/10,000 split
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=10000)
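
The exercise statement asks for the first 60,000 instances as the training set and the remaining 10,000 as the test set. A minimal sketch of such an ordered split follows. `ordered_split` is a hypothetical helper, not part of scikit-learn, and the data is a tiny stand-in so the cell runs without downloading MNIST. It relies only on slicing, so it should work on NumPy arrays as well as on plain lists.

```python
def ordered_split(data, target, n_train):
    """Split without shuffling: first n_train rows for training, the rest for testing."""
    return data[:n_train], data[n_train:], target[:n_train], target[n_train:]

# Tiny stand-in data (hypothetical values, not MNIST).
data = [[0], [1], [2], [3], [4], [5], [6]]
target = ["a", "b", "c", "d", "e", "f", "g"]

X_tr, X_te, y_tr, y_te = ordered_split(data, target, n_train=5)
print(len(X_tr), "training rows,", len(X_te), "test rows")
```

With the arrays loaded above, the equivalent call would be `ordered_split(mnist.data, mnist.target, 60000)`.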

### Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set

In [11]:
from sklearn.ensemble import RandomForestClassifier
from datetime import datetime

rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1)

then = datetime.now()

rfc.fit(X_train, y_train)

now = datetime.now()
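# total_seconds() covers the whole interval; .seconds is only one component of the timedelta
runtime = int((now - then).total_seconds())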
print("Total runtime without PCA:", runtime, "seconds.")

Total runtime without PCA: 42 seconds.
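
For measuring elapsed time, `time.perf_counter()` from the standard library is a common alternative to subtracting `datetime` objects, since it is a monotonic, high-resolution clock. The sketch below wraps it in a hypothetical helper, `time_call`, and times a stand-in workload rather than the classifier.

```python
import time

def time_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()          # monotonic, high-resolution clock
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch runs without MNIST.
total, seconds = time_call(sum, range(1_000_000))
print(f"Total runtime: {seconds:.2f} seconds.")
```

Applied to the cell above, the call would look like `_, seconds = time_call(rfc.fit, X_train, y_train)`.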


In [13]:
from sklearn.metrics import accuracy_score
y_pred = rfc.predict(X_test)
result = accuracy_score(y_test, y_pred)
print("Accuracy on test set without PCA:", result*100, "%.")

Accuracy on test set without PCA: 97.06 %.


### Use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%

In [14]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
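
Passing a float between 0 and 1 as `n_components` asks PCA to keep as many leading components as are needed to explain that share of the variance. The selection rule can be sketched in plain Python. The ratios below are made-up values, not the MNIST ones, and `components_for_variance` is a hypothetical helper that assumes the cumulative ratio must exceed the target. After fitting, the number of components actually kept is available as `pca.n_components_`.

```python
from itertools import accumulate

def components_for_variance(ratios, target=0.95):
    """Smallest number of leading components whose cumulative
    explained variance ratio exceeds `target`."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative > target:
            return k
    return len(ratios)  # target not reached: keep everything

# Hypothetical explained variance ratios, sorted in decreasing order.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125]
n_kept = components_for_variance(ratios, target=0.95)
print("Components kept:", n_kept)
```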

### Train a new Random Forest classifier on the reduced dataset and see how long it takes. Next evaluate the classifier on the test set

In [19]:
rfc_pca = RandomForestClassifier(n_estimators=100, n_jobs=-1)

then = datetime.now()

rfc_pca.fit(X_train_pca, y_train)

now = datetime.now()
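# total_seconds() covers the whole interval; .seconds is only one component of the timedelta
runtime = int((now - then).total_seconds())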
print("Total runtime with PCA:", runtime, "seconds.")

Total runtime with PCA: 96 seconds.


In [21]:
X_test_pca = pca.transform(X_test)
y_pred = rfc_pca.predict(X_test_pca)
result = accuracy_score(y_test, y_pred)
print("Accuracy on test set with PCA:", result*100, "%.")

Accuracy on test set with PCA: 94.78999999999999 %.


Training time increased by about 129% (from 42 to 96 seconds), while accuracy on the test set dropped by 2.27 percentage points (from 97.06% to 94.79%). This means that for the Random Forest classifier **specifically**, dimensionality reduction was not a good idea here: accuracy dropped noticeably and training became much slower.
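
The comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures printed by the cells above. `relative_change_percent` is a hypothetical helper, and the code distinguishes a drop in percentage points from a relative drop.

```python
def relative_change_percent(before, after):
    """Relative change from `before` to `after`, in percent of `before`."""
    return (after - before) / before * 100

# Figures taken from the printed outputs above.
time_without_pca, time_with_pca = 42, 96        # seconds
acc_without_pca, acc_with_pca = 97.06, 94.79    # percent

time_increase = relative_change_percent(time_without_pca, time_with_pca)
accuracy_drop_points = acc_without_pca - acc_with_pca
accuracy_drop_relative = -relative_change_percent(acc_without_pca, acc_with_pca)

print(f"Training time increase: {time_increase:.1f}%")
print(f"Accuracy drop: {accuracy_drop_points:.2f} percentage points "
      f"({accuracy_drop_relative:.2f}% relative)")
```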

The conclusion is that dimensionality reduction is not always a good idea and should always be tested rather than applied blindly. For some algorithms, such as Softmax Regression, PCA can substantially reduce training time, but in other cases, as here, it makes the model worse in both training time and accuracy.