<a href="https://colab.research.google.com/github/jeyakumar-nanc/MachineLearning/blob/main/Experiments_With_Various_Classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Note to grader:** Each question consists of parts, e.g. Q1(i), Q1(ii), etc. Each part must be graded on a 0-4 scale, following the standard NJIT convention (A:4, B+: 3.5, B:3, C+: 2.5, C: 2, D:1, F:0).
The total score must be rescaled to 100. This should apply to all future assignments so that Canvas assigns the same weight to every assignment.



In [None]:
# Grader's area
import numpy as np
G = np.zeros([10,10])  # grade matrix, filled in below as G[question, part]
maxScore = 0


# **Assignment 4**

The goal of this assignment is to run some experiments with scikit-learn on a fairly sizeable and interesting image data set. This is the MNIST data set, which consists of 70,000 images of handwritten digits, each having 28x28 pixels. By today's standards this may seem relatively tiny, but only a few years ago it was quite challenging computationally, and it motivated the development of several ML algorithms and models that are now state-of-the-art solutions for much bigger data sets.

The assignment is experimental. We will test whether a combination of PCA and kNN can yield good results on the MNIST data set.

Note: There are fewer difficult Python parts in this assignment. You can get things done by just repeating things from the class notebooks. But your participation and interaction via Canvas is always appreciated!

## Preparation Steps

In [None]:
# Import all necessary python packages
import numpy as np
#import os
#import pandas as pd
import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
#from sklearn.linear_model import LogisticRegression

In [None]:
# we load the data set directly from scikit learn 
# 
# note: this operation may take a few seconds. If for any reason it fails we 
# can revert back to loading from local storage. 

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
y = y.astype(int)
X = ((X / 255.) - .5) * 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=123, stratify=y)


## <font color = 'blue'> Question 1. Inspecting the Dataset </font>

**(i)** How many data points are in the training and test sets? <br>
**(ii)** How many attributes does the data set have?

Explain how you found the answer to the first two questions.

[**Hint**: Use the 'shape' attribute of numpy arrays.]

**(iii)** How many different labels does this data set have? Can you demonstrate how to read that number from the vector of labels *y_train*? <br>
**(iv)** How does the number of attributes relate to the size of the images? <br>
**(v)** What is the role of line 12 in the above code?





*(Please insert cells below for your answers. Clearly identify the part of the question you answer)*

In [None]:
print(f"(i) Data points in training set - {X_train.shape[0]}")
print(f"(i) Data points in testing set - {X_test.shape[0]}")
print(f"(ii) Attributes in dataset - {X_test.shape[1]}")
print(f"(iii) Number of different labels in y_train  - {len(np.unique(y_train))}")
print(np.unique(y_train))

print(f"(iv) Size of the image is 28x28, which gives the number of attributes as 784")
print(f"(v) To normalize and transform the datapoints")
# (v) in detail: line 12, X = ((X / 255.) - .5) * 2, rescales every pixel value from [0, 255] to [-1, 1]

(i) Data points in training set - 60000
(i) Data points in testing set - 10000
(ii) Attributes in dataset - 784
(iii) Number of different labels in y_train  - 10
[0 1 2 3 4 5 6 7 8 9]
(iv) Size of the image is 28x28, which gives the number of attributes as 784
(v) To normalize and transform the datapoints
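
The effect of line 12 can be checked on a handful of pixel values. This is a minimal sketch that uses a hypothetical array `pixels` rather than the MNIST data itself:

```python
import numpy as np

# Hypothetical pixel intensities covering the raw 0..255 range
pixels = np.array([0.0, 63.75, 127.5, 255.0])

# Same arithmetic as line 12 of the loading cell
scaled = ((pixels / 255.) - .5) * 2

print(scaled)  # all values land in the interval [-1, 1]
```

Dividing by 255 maps the intensities to [0, 1], subtracting 0.5 centers them around zero, and multiplying by 2 stretches the range to [-1, 1].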


In [None]:
# For grader use only

# in this case, make an exception and assign 0-2 points for each subquestion

# insert grade here  
# G[1,1] = 
# G[1,2] =
# G[1,3] = 
# G[1,4] = 
# G[1,5] =  


maxScore = maxScore + 10


##  <font color = 'blue'> Question 2. PCA on MNIST </font>

Because the number of attributes of the MNIST data set may be too big to apply kNN on it (due to the 'curse of dimensionality'), we want to compress the images down to a smaller number of 'fake' attributes. 

Use scikit-learn to output the data sets *X_train_transformed* and *X_test_transformed*, each with $l$ attributes. A reasonable choice of $l$ is 10, equal to the number of labels, but you can try slightly smaller or bigger values as well.


**Hint**: Take a look at [this notebook](https://colab.research.google.com/drive/1DG5PjWejo8F7AhozHxj8329SuMtXZ874?usp=drive_fs) we used in the lecture, and imitate what we did there. Be careful though, to use only the scikit-learn demonstration, not the exhaustive PCA steps we did before it.

**Note**: This computation can take a while. If problems are encountered we can try the same experiment on a downsized data set. 

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

In [None]:
print(X_train_std.shape)
print(X_test_std.shape)

(60000, 784)
(10000, 784)


In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
#pca = PCA(0.95)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
print(X_train_pca.shape)
print(X_test_pca.shape)
pca.explained_variance_ratio_

(60000, 10)
(10000, 10)


array([0.05694422, 0.04067321, 0.03768097, 0.02906179, 0.02546531,
       0.02205956, 0.01922801, 0.0175268 , 0.01540025, 0.01408934])
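
To judge how much information the 10 components keep, the ratios printed above can be summed. The sketch below simply copies those numbers into a hypothetical array `ratios`; on the live notebook one would apply `np.cumsum` to `pca.explained_variance_ratio_` directly:

```python
import numpy as np

# Ratios copied from the output of pca.explained_variance_ratio_ above
ratios = np.array([0.05694422, 0.04067321, 0.03768097, 0.02906179, 0.02546531,
                   0.02205956, 0.01922801, 0.0175268, 0.01540025, 0.01408934])

# Running total of explained variance after 1, 2, ..., 10 components
cumulative = np.cumsum(ratios)
total_retained = cumulative[-1]
print(f"Variance retained by 10 components: {total_retained:.4f}")
```

The total is roughly 28% of the variance of the standardized data, which is one reason to try larger values of $l$ as well.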

In [None]:
# for grader use
maxScore = maxScore +4 


# insert grade here (out of 4)
# G[2,1] =



## <font color = 'blue'> Question 3. kNN on MNIST attributes from PCA </font>


Having calculated the *transformed* MNIST data set, we can now apply a kNN approach to the MNIST classification problem. Here are the steps:

(i) Fit a $k$-NN classifier on the transformed data set. Here $k$ is a hyperparameter, and you can experiment with it. Be aware though, that larger $k$ can take more time to run.

(ii) Apply the classifier on the transformed test set. What is the classification accuracy? 

(iii) A theoretical question: if we skipped all the above steps and we just assigned a **random** label to each test point, what would the classification accuracy be on average?  Does your result (ii) beat the random expectation? 

(iv) Experiment with different settings of $k$, and other hyperparameters that are described in the scikit-learn manual of the kNN classifier. Report your findings in a separate cell. Also for **participation points**: report your best result on Canvas! 

[**Hint**: Imitate the steps from the classroom notebook]


In [None]:
#(i) Fit a  k -NN classifier on the transformed data set. Here  k  is a hyperparameter, 
#and you can experiment with it. Be aware though, that larger  k  can take more time to fit.
from sklearn.neighbors import KNeighborsClassifier


knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_pca, y_train) 

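#(ii) Apply the classifier on the transformed test set. What is the classification accuracy?
# Note: the score below is computed on the training data; the test-set accuracy
# asked for in (ii) is knn.score(X_test_pca, y_test).
print(f"Training accuracy with PCA - {knn.score(X_train_pca,y_train)}")

Training accuracy with PCA - 0.9487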
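
Part (ii) asks for accuracy on the test set, which generally differs from the accuracy on the data the classifier was fit on. A minimal sketch of the distinction, assuming scikit-learn's small bundled `load_digits` data as a stand-in for MNIST (all `*_demo` and split names are hypothetical):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8x8 digit images bundled with scikit-learn (not MNIST)
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123, stratify=y_demo)

# Fit the scaler and PCA on the training split only, then reuse them on the test split
sc_demo = StandardScaler().fit(X_tr)
pca_demo = PCA(n_components=10).fit(sc_demo.transform(X_tr))
X_tr_pca = pca_demo.transform(sc_demo.transform(X_tr))
X_te_pca = pca_demo.transform(sc_demo.transform(X_te))

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_tr_pca, y_tr)
train_acc = knn_demo.score(X_tr_pca, y_tr)  # optimistic: these points were seen during fit
test_acc = knn_demo.score(X_te_pca, y_te)   # estimate on held-out points
print(f"train accuracy = {train_acc:.4f}, test accuracy = {test_acc:.4f}")
```

In the notebook itself, the corresponding call would be `knn.score(X_test_pca, y_test)`.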


In [None]:
#(iii) A theoretical question: if we skipped all the above steps and we just assigned a random label to each test point, 
#what would the classification accuracy be on average? Does your result (ii) beat the random expectation?

print("picking only 10000 datapoints and apply PCA with 10 components with k=3 knn..")
X_train_pca_new = pca.fit_transform(X_train_std[0:10000])
X_test_pca_new = pca.transform(X_test_std[0:10000])
y_train_new = y_train[0:10000]
print(X_train_pca_new.shape)
print(X_test_pca_new.shape)


knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_pca_new, y_train_new)
# Scored on the same 10000 training points the classifier was fit on
print(f"Accuracy with PCA for 10000 datapoints - {knn.score(X_train_pca_new,y_train_new)}")

picking only 10000 datapoints and apply PCA with 10 components with k=3 knn..
(10000, 10)
(10000, 10)
Accuracy with PCA for 10000 datapoints - 0.9316


In [None]:
import numpy as np
print("picking only 10000 datapoints and randomly assigning y labels and applying knn with k=3 components..")
sc = StandardScaler()
X_train_std_new = sc.fit_transform(X_train[0:10000])
X_test_std_new = sc.transform(X_test[0:10000])
                              
y_train_new = y_train[0:10000]

# np.random.permutation returns a shuffled copy, so y_train itself is left untouched
# (shuffling a ravel() view in place could also scramble the original labels)
y_train_shuffle = np.random.permutation(y_train_new)

knn.fit(X_train_std_new, y_train_shuffle)
print(f"Accuracy with shuffled y labels on 10000 datapoints - {knn.score(X_train_std_new,y_train_shuffle)}")

picking only 10000 datapoints and randomly assigning y labels and applying knn with k=3 components..
Accuracy with shuffled y labels on 10000 datapoints - 0.434


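Based on the above two experiments, the accuracy obtained with randomly shuffled labels (0.434) is lower than the accuracy obtained with the PCA-kNN combination on 10000 data points (0.9316). For the theoretical question in (iii): with 10 labels, assigning a uniformly random label to each test point is correct with probability 1/10, so the expected accuracy is 10%. The shuffled-label score is above 10% only because it is measured on the training points themselves, and each point counts as one of its own 3 nearest neighbors.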
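
The random-guessing baseline of part (iii) can also be simulated directly. This is a small sketch with hypothetical labels, not the MNIST labels:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_classes = 100_000, 10

# Hypothetical true labels and independent, uniformly random guesses
true_labels = rng.integers(0, n_classes, size=n_points)
random_guess = rng.integers(0, n_classes, size=n_points)

# Fraction of points where the random guess happens to match the true label
random_acc = np.mean(true_labels == random_guess)
print(f"accuracy of random guessing: {random_acc:.3f}")
```

Because each guess matches the true label with probability 1/10 regardless of how the true labels are distributed, the simulated accuracy stays close to 0.1.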

In [None]:
#(iv) Experiment with different settings of  k , and other hyperparameters that are described in the scikit-learn manual of the kNN classifier.
pca = PCA(n_components=20)
#pca = PCA(0.95)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
print(X_train_pca.shape)
print(X_test_pca.shape)

for n_neighbors in range(3, 16):
  knn = KNeighborsClassifier(n_neighbors=n_neighbors)
  knn.fit(X_train_pca, y_train)
  # Note: scored on the training set; use X_test_pca, y_test for test accuracy
  print(f" K = {n_neighbors}, accuracy - {knn.score(X_train_pca,y_train)}")

(60000, 20)
(10000, 20)
 K = 3, accuracy - 0.8316
 K = 4, accuracy - 0.8250833333333333
 K = 5, accuracy - 0.8223
 K = 6, accuracy - 0.81965
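
Instead of looping over $k$ by hand, several hyperparameters can be compared with cross-validation. The following is a sketch under assumptions: it uses the small bundled `load_digits` data as a stand-in for MNIST, and the grid values and step names (`scale`, `pca`, `knn`) are illustrative choices, not part of the assignment:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8x8 digit images bundled with scikit-learn (not MNIST)
X_demo, y_demo = load_digits(return_X_y=True)

# Scaling, PCA and kNN chained so each CV fold refits all three on its own training part
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid over two documented KNeighborsClassifier parameters
param_grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, f"cv accuracy = {search.best_score_:.4f}")
```

Cross-validated scores avoid the optimism of scoring on the training set, at the cost of fitting the pipeline once per fold and per grid point, which may be slow on the full MNIST data.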


In [None]:
# for grader use
maxScore = maxScore +12

# insert grade here (each item out of 4)
# G[3,1] =
# G[3,2] = 
# G[3,3] =
# G[3,4] = 

In [None]:
# for grader use

# Total Grade Calculation
rawScore = np.sum(G)
score = rawScore*100/maxScore

NameError: ignored