# Final Exam Review (Adelaide) [50 points]

*Contact TA: Emaad Ahmed Manzoor (emaad@cmu.edu)*

Recommended videos: https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A

## Dataset

   - We will be using the Fashion-MNIST dataset. Take a look at what the data looks like [here](https://github.com/zalandoresearch/fashion-mnist).
   - Download the data from [here](https://www.dropbox.com/s/qgsk90f22tvsqjp/fashionmnist.zip?dl=0).
   - There are 60,000 training images and 10,000 test images.
   - 10 classes: t-shirt, trouser, pullover, etc. labeled as 0 to 9.

## Q0. Load the data into variables: Xtrain, ytrain, Xtest and ytest. [5 points]

   - Each .csv files contains a header that describes each column.
   - Reading the files takes <1 minute.
   - Print the shape of each variable.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
train_df = pd.read_csv("fashion-mnist_train.csv", sep=",",  thousands=",")
test_df = pd.read_csv("fashion-mnist_test.csv", sep=",",  thousands=",")

In [3]:
Xtrain = train_df.loc[:, train_df.columns != 'label']
Xtest = test_df.loc[:, test_df.columns != 'label']
ytrain = train_df.loc[:, 'label']
ytest = test_df.loc[:,  'label']

print("Xtrain:",Xtrain.shape,"  Xtest:",Xtest.shape,"  ytest:",ytest.shape,"  ytrain:",ytrain.shape)


Xtrain: (60000, 784)   Xtest: (10000, 784)   ytest: (10000,)   ytrain: (60000,)


## Q1. Unsupervised image labeling. [10 points]

   - Cluster the images into 10 clusters via [K-Means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
   - Label each cluster with the most popular image class it contains.
   - Print the class label assigned to each cluster.
   - Report the overall accuracy on the test data.

Hint: You may need to convert the K-Means cluster labels to integers using the function `labels_.astype(int)`.

Clustering takes around 1 minute.

In [4]:
%%time
from sklearn.cluster import KMeans
model = KMeans(n_clusters=10, n_jobs=-1)
clf = model.fit(Xtrain)

CPU times: user 7 s, sys: 380 ms, total: 7.38 s
Wall time: 20 s


In [5]:
cluster_labels = []
image_clusters = clf.labels_
for cidx in range(10):
    true_labels = ytrain[image_clusters == cidx].astype(int)
    most_popular_label = np.bincount(true_labels).argmax()
    cluster_labels.append(most_popular_label)
    print("Cluster " + str(cidx) + " labeled with class " + str(most_popular_label))

Cluster 0 labeled with class 1
Cluster 1 labeled with class 6
Cluster 2 labeled with class 8
Cluster 3 labeled with class 9
Cluster 4 labeled with class 0
Cluster 5 labeled with class 8
Cluster 6 labeled with class 4
Cluster 7 labeled with class 9
Cluster 8 labeled with class 7
Cluster 9 labeled with class 5


In [16]:
from sklearn.metrics import accuracy_score

ypred_cluster = clf.predict(Xtest)
ypred = [cluster_labels[y] for y in ypred_cluster] # convert it to classes
print("Accuracy: " + str(np.mean(ypred==ytest)))


accuracy_score(ytest, ypred)

Accuracy: 0.552


0.55200000000000005

## Q2. Dimensionality reduction + unsupervised image labeling. [15 points]

Do this for $k = 100$ and $k = 500$.

   - Reduce the dimensionality to $k$ dimensions via [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
   - Repeat Q1 on the reduced-dimensionality data.
   - Report the overall accuracy on the test data.
   
This takes ~1 minute.

In [22]:
%%time

from sklearn.decomposition import PCA

def decomp(k, X, y, Xtest, ytest):
    p = PCA(k)
    X_transformed = p.fit_transform(X)
    clf = KMeans(n_clusters=10, n_jobs=-1)
    clf.fit(X_transformed)
    
    cluster_labels = []
    image_clusters = clf.labels_
    for cidx in range(10):
        true_labels = ytrain[image_clusters == cidx].astype(int)
        most_popular_label = np.bincount(true_labels).argmax()
        cluster_labels.append(most_popular_label)
    
    ypred_cluster = clf.predict(p.transform(Xtest))
    ypred = [cluster_labels[y] for y in ypred_cluster]
    print(accuracy_score(ytest, ypred))
    

decomp(100, Xtrain, ytrain, Xtest, ytest)
decomp(500, Xtrain, ytrain, Xtest, ytest)

0.5813
0.5521
CPU times: user 1min 37s, sys: 1min 15s, total: 2min 53s
Wall time: 28.6 s


## Q3. Supervised image labeling. [20 points]

   - Train a [logistic regression classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression).
   - Set the classifier parameters `multi_class='multinomial'` and `solver='lbfgs'`.
   - Select the parameter `C` with the best mean accuracy score in [3-fold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold).
   - Try `C = 1.0`, `C = 100.0`.
   - Report the overall accuracy on the test data with the best `C` trained on the entire training data.

Each fold should take around 1 minute.

In [29]:
%%time

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_validation(classifier, training_matrix, label_vector, n_folds):
    k_fold = KFold(n_folds)
    fold_scores = []
    for k, (train, val) in enumerate(k_fold.split(training_matrix, label_vector)):
        classifier.fit(training_matrix, label_vector)
        ypred = classifier.predict(training_matrix)
        yval = label_vector
        accuracy = np.sum(ypred==yval)/len(ypred)
        #score = f1_score(y_true=yval, y_pred=ypred)
        fold_scores.append(accuracy)        
        print(('Fold %d: F1_Score: %f') % (k, accuracy))
    mean_score = np.mean(fold_scores)
    print("Mean k-Fold score: " + str(mean_score))
    return mean_score

l = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=1.0)

cross_validation(l, Xtrain, ytrain, 3)

l2 = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=100.0)


cross_validation(l2, Xtrain, ytrain, 3)


Fold 0: F1_Score: 0.862317
Fold 1: F1_Score: 0.862317
Fold 2: F1_Score: 0.862317
Mean k-Fold score: 0.862316666667
Fold 0: F1_Score: 0.862500
Fold 1: F1_Score: 0.862500
Fold 2: F1_Score: 0.862500
Mean k-Fold score: 0.8625
CPU times: user 2min 57s, sys: 708 ms, total: 2min 58s
Wall time: 2min 58s
