# Assignment 10: Dimensionality Reduction

Dataset(s) needed: MNIST ("Modified National Institute of Standards and Technology") dataset.

In [1]:
#Load the MNIST dataset
from sklearn.datasets import fetch_openml
import numpy as np
mnist = fetch_openml('mnist_784')

In [2]:
mnist.data.shape

(70000, 784)

In [3]:
mnist.target.shape

(70000,)

<h3> Q.1. Split the data into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing).
</h3>

In [4]:
# #TODO
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(
#     mnist.data, 
#     mnist.target, 
#     test_size=10000, 
#     random_state=99
# )

In [8]:
X_train = mnist.data[:60000]
X_test = mnist.data[60000:]
y_train = mnist.target[:60000]
y_test = mnist.target[60000:]

In [9]:
for subset in [X_train, X_test, y_train, y_test]:
    print(subset.shape)

(60000, 784)
(10000, 784)
(60000,)
(10000,)


<h3> Q.2. Train a Logistic Regression classifier on the dataset and see how long it takes.</h3>

In [10]:
from sklearn.linear_model import LogisticRegression
import time

In [12]:
log_clf = LogisticRegression(solver='lbfgs', verbose=1, max_iter=500, multi_class='ovr')
start_time = time.time()
#TODO: Train the classifier
log_clf.fit(X_train, y_train)
end_time = time.time()

print("Training took {:.2f}s".format(end_time - start_time))



Training took 235.49s


[Parallel(n_jobs=None)]: Done  10 out of  10 | elapsed:  3.9min finished


<h3> Q.3. Evaluate the resulting model on the test set.</h3>

In [13]:
from sklearn.metrics import classification_report

In [14]:
print(classification_report(y_test, log_clf.predict(X_test)))

              precision    recall  f1-score   support

           0       0.95      0.98      0.97       980
           1       0.97      0.98      0.97      1135
           2       0.93      0.88      0.91      1032
           3       0.90      0.91      0.91      1010
           4       0.93      0.92      0.93       982
           5       0.91      0.85      0.88       892
           6       0.94      0.95      0.94       958
           7       0.92      0.92      0.92      1028
           8       0.84      0.88      0.86       974
           9       0.90      0.89      0.90      1009

    accuracy                           0.92     10000
   macro avg       0.92      0.92      0.92     10000
weighted avg       0.92      0.92      0.92     10000



<h3> Q.4. Use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.</h3>

In [17]:
from sklearn.decomposition import PCA

n = 155
pca = PCA(n_components=n)
pca.fit(X_train)
print(sum(pca.explained_variance_ratio_))

0.9502650943304869


<h3> Q.5. Train a new Logistic Regression classifier on the reduced dataset and see how long it takes. Was training much faster? Explain your results.
</h3>

In [18]:
X_train_pca = pca.transform(X_train)

log_clf = LogisticRegression(
    solver='lbfgs', 
    verbose=1, 
    max_iter=300, 
    multi_class='auto'
)
start_time = time.time()
#TODO: Train the classifier
log_clf.fit(X_train_pca, y_train)
end_time = time.time()

print("Training took {:.2f}s".format(end_time - start_time))

Training took 27.29s


[Parallel(n_jobs=None)]: Done   1 out of   1 | elapsed:   27.2s finished


<h3> Q.6. Evaluate the new classifier on the test set: how does it compare to the previous classifier? Discuss the speed / accuracy trade-off and in which case you'd prefer a very slight drop in model performance for a x-time speedup in training.
</h3>

In [19]:
print(classification_report(y_test, log_clf.predict(pca.transform(X_test))))

              precision    recall  f1-score   support

           0       0.95      0.98      0.96       980
           1       0.95      0.98      0.97      1135
           2       0.93      0.91      0.92      1032
           3       0.91      0.90      0.91      1010
           4       0.92      0.93      0.92       982
           5       0.91      0.86      0.89       892
           6       0.94      0.95      0.94       958
           7       0.93      0.93      0.93      1028
           8       0.88      0.87      0.88       974
           9       0.89      0.91      0.90      1009

    accuracy                           0.92     10000
   macro avg       0.92      0.92      0.92     10000
weighted avg       0.92      0.92      0.92     10000



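The speedup from dimensionality reduction was considerable here: ~27 seconds vs. ~235 seconds, or roughly 8.6x faster. Quite the improvement! One caveat is that the two runs also used different `multi_class` and `max_iter` settings, so part of the difference may come from those settings rather than from PCA alone.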

Meanwhile, the tradeoff in performance was pretty small: overall accuracy stayed at 0.92. Some classes saw no change in F1 score (such as digit 9). Others saw minimal decreases; digits 0 and 4, for example, each dropped by only .01. A few classes even improved slightly, such as digit 8, which rose from .86 to .88.

In this case, the tradeoff almost definitely seems worth it for any practical application, unless we knew for some reason that we would only be fitting this model a single time rather than updating it as we collect new data. In that case, we would probably tolerate the slower fitting time in return for any small gain in performance.
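The arithmetic behind this comparison can be checked with a short sketch. The timings and F1 scores below are copied by hand from the cell outputs above; the variable names are illustrative and not part of the original notebook.

```python
# Numbers copied by hand from the cell outputs above.
full_time_s = 235.49   # training time on all 784 features
pca_time_s = 27.29     # training time on 155 principal components

speedup = full_time_s / pca_time_s
print(f"Speedup: {speedup:.1f}x")

# Per-digit F1 scores from the two classification reports (digits 0-9).
f1_full = [0.97, 0.97, 0.91, 0.91, 0.93, 0.88, 0.94, 0.92, 0.86, 0.90]
f1_pca = [0.96, 0.97, 0.92, 0.91, 0.92, 0.89, 0.94, 0.93, 0.88, 0.90]

# Positive delta means the PCA model scored higher on that digit.
f1_delta = [round(p - f, 2) for f, p in zip(f1_full, f1_pca)]
for digit, delta in enumerate(f1_delta):
    print(f"digit {digit}: {delta:+.2f}")
```

Because the reports round to two decimals, these deltas are only accurate to about .01.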

<h3> Q.7. Create a new text cell in your Notebook: Complete a 50-100 word summary 
    (or short description of your thinking in applying this week's learning to the solution) 
     of your experience in this assignment. Include:
<br>
What was your incoming experience with this model, if any?
<br>
What steps you took and what obstacles you encountered.
<br>
How you link this exercise to real-world machine learning problem-solving. (What steps were missing? What else do you need to learn?)
<br>
This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.
</h3>

I have some experience using PCA, primarily with text data. It's one of the most popular ways of visualizing high-dimensional text data in two dimensions. What was interesting in this assignment was the question of how to systematically select the number of components n. In visualization, the answer is easy: it's usually 2, or sometimes 3. Here, however, I had to do a guided search over all possible n such that 1 <= n <= 784 and select the first n where the explained variance ratio is over 95%.

I wonder if there is a clever way to perform that search systematically. The first option that occurs to me is to do a binary search. This is possible because, technically, our results are sorted: increasing n can never decrease the cumulative explained variance ratio. So start with n = 784/2 = 392, see if we are too high or too low, then split the search space in half again, etc.
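One way to sketch this idea without refitting PCA at every step: a single fit with all components already yields every per-component ratio, so the cumulative sum gives the whole curve and the binary search can run over that array. The helper name `components_for_variance` and the toy ratios below are made up for illustration; in the notebook the input would be `explained_variance_ratio_` from a PCA fitted with all components.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target=0.95):
    """Return the smallest n whose cumulative explained variance reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # cumulative is non-decreasing, so searchsorted can binary-search it
    # for the first index where cumulative >= target.
    idx = int(np.searchsorted(cumulative, target, side="left"))
    if idx >= len(cumulative):
        raise ValueError("target is not reached even with all components")
    return idx + 1  # convert 0-based index to a component count

# Toy ratios standing in for pca.explained_variance_ratio_ (made-up values).
toy_ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
n_for_90 = components_for_variance(toy_ratios, target=0.9)
print(n_for_90)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (for example `PCA(n_components=0.95)`), in which case it selects the number of components needed to explain that fraction of the variance, so the manual search may not be needed at all.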