# Final Exam Review (Adelaide)

*Contact TA: Emaad Ahmed Manzoor (emaad@cmu.edu)*

Recommended videos: https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A

## Dataset

   - We will be using the Fashion-MNIST dataset. Take a look at what the data looks like [here](https://github.com/zalandoresearch/fashion-mnist).
   - Download the data from [here](https://www.dropbox.com/s/qgsk90f22tvsqjp/fashionmnist.zip?dl=0).
   - There are 60,000 training images and 10,000 test images.
   - 10 classes: t-shirt, trouser, pullover, etc. labeled as 0 to 9.

## Q0. Load the data into variables: Xtrain, ytrain, Xtest and ytest.

   - Each .csv files contains a header that describes each column.
   - Reading the files takes <1 minute.
   - Print the shape of each variable.

In [2]:
import numpy as np

In [5]:
%%time
training_data = np.loadtxt("fashion-mnist_train.csv", skiprows=1, delimiter=',')
test_data = np.loadtxt("fashion-mnist_test.csv", skiprows=1, delimiter=',')

Wall time: 1min 27s


In [6]:
Xtrain = training_data[:,1:]
ytrain = training_data[:,0]
Xtest = test_data[:,1:]
ytest = test_data[:,0]

In [7]:
print(Xtrain.shape)
print(ytrain.shape)
print(Xtest.shape)
print(ytest.shape)

(60000, 784)
(60000,)
(10000, 784)
(10000,)


## Q1. Unsupervised image labeling.

   - Cluster the images into 10 clusters via [K-Means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
   - Label each cluster with the most popular image class it contains.
   - Print the class label assigned to each cluster.
   - Report the overall accuracy on the test data.

Hint: You may need to convert the K-Means cluster labels to integers using the function `labels_.astype(int)`.

Clustering takes around 1 minute.

In [8]:
%%time
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=10, n_jobs=-1)
clf.fit(Xtrain)

Wall time: 1min 29s


In [9]:
cluster_labels = []
image_clusters = clf.labels_
for cidx in range(10):
    true_labels = ytrain[image_clusters == cidx].astype(int)
    most_popular_label = np.bincount(true_labels).argmax()
    cluster_labels.append(most_popular_label)
    print("Cluster " + str(cidx) + " labeled with class " + str(most_popular_label))

Cluster 0 labeled with class 8
Cluster 1 labeled with class 5
Cluster 2 labeled with class 6
Cluster 3 labeled with class 4
Cluster 4 labeled with class 0
Cluster 5 labeled with class 8
Cluster 6 labeled with class 7
Cluster 7 labeled with class 1
Cluster 8 labeled with class 9
Cluster 9 labeled with class 9


In [None]:
ypred_cluster = clf.predict(Xtest)
ypred = [cluster_labels[y] for y in ypred_cluster] # convert it to classes

In [None]:
print("Accuracy: " + str(np.mean(ypred==ytest)))

## Q2. Dimensionality reduction + unsupervised image labeling.

Do this for $k = 100$ and $k = 500$.

   - Reduce the dimensionality to $k$ dimensions via [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
   - Repeat Q1 on the reduced-dimensionality data.
   - Report the overall accuracy on the test data.
   
This takes ~1 minute.

In [22]:
%%time
from sklearn.decomposition import PCA

for k in [100, 500]:
    print("Performing PCA...")
    pca = PCA(n_components=k)
    Xtrain_transformed = pca.fit_transform(Xtrain)
    
    print("K-Means...")
    clf = KMeans(n_clusters=10, n_jobs=-1)
    clf.fit(Xtrain_transformed)
    
    cluster_labels = []
    image_clusters = clf.labels_
    for cidx in range(10):
        true_labels = ytrain[image_clusters == cidx].astype(int)
        most_popular_label = np.bincount(true_labels).argmax()
        cluster_labels.append(most_popular_label)

    ypred_cluster = clf.predict(pca.fit_transform(Xtest))
    ypred = [cluster_labels[y] for y in ypred_cluster] # convert it to classes
    
    print("Accuracy with k=" + str(k) + ": " + str(np.mean(ypred==ytest)))

Performing PCA...
K-Means...
Accuracy with k=100: 0.4807
Performing PCA...
K-Means...
Accuracy with k=500: 0.5019
CPU times: user 55.4 s, sys: 7.38 s, total: 1min 2s
Wall time: 46.8 s


## Q3. Supervised image labeling.

   - Train a [logistic regression classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression).
   - Set the classifier parameters `multi_class='multinomial'` and `solver='lbfgs'`.
   - Select the parameter `C` with the best mean accuracy score in [3-fold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold).
   - Try `C = 1.0`, `C = 100.0`.
   - Report the overall accuracy on the test data with the best `C` trained on the entire training data.

Each fold should take around 1 minute.

In [23]:
%%time
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

for C in [1.0, 100.0]:
    print("C = " + str(C))
    scores = []
    for train_idx, val_idx in KFold(n_splits=3).split(Xtrain):
        print("\tFold...")
        train_images = Xtrain[train_idx,:]
        train_labels = ytrain[train_idx]
        val_images = Xtrain[val_idx,:]
        val_labels = ytrain[val_idx]

        clf = LogisticRegression(C=C, n_jobs=-1, multi_class='multinomial', solver='lbfgs')
        clf.fit(train_images, train_labels)
        
        ypred = clf.predict(val_images)
        accuracy = np.mean(ypred==val_labels)
        scores.append(accuracy)
    
    print(np.mean(scores))

C = 1.0
	Fold...
	Fold...
	Fold...
0.85005
C = 100.0
	Fold...
	Fold...
	Fold...
0.8494
CPU times: user 3.35 s, sys: 1.52 s, total: 4.87 s
Wall time: 3min 56s


In [24]:
best_C = 1.0
clf = LogisticRegression(C=best_C, n_jobs=-1, multi_class='multinomial', solver='lbfgs')
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
accuracy = np.mean(ypred == ytest)
print("Accuracy: " + str(accuracy))

Accuracy: 0.8519
