# CS512 - Machine Learning - 2019
## Homework 1
100 pts


## Goal

The goal of this homework is three-fold:

*   Introduction to the machine learning experimental set up 
*   Gain experience with Decision trees and k-NN approaches
*   Gain experience with the Scikit library

## Dataset
**MNIST** is a collection of 28x28 grayscale images of digits (0-9); hence each pixel is a gray-level from 0-255. 

**Download the data from Keras. You must use a 20% of the training data for validation** (no need for cross-validation as you have plenty of data) and **use the official test data (10,000 samples) only for testing.**

## Task 
Build 2 classifiers (decision tree, k-NN) with the scikit library function calls to classify digits in the MNIST dataset.

## Software: You may find the necessary function references here:
http://scikit-learn.org/stable/supervised_learning.html

## Submission: 
Fill this notebook and submit this document with a link to #your Colab notebook 
(make sure to include the link obtained from the #share link on top right)


##1) Initialize

*   First make a copy of the notebook given to you as a starter.

*   Make sure you choose Connect form upper right.


## 2) Load training dataset

*  Read from Keras library.



In [0]:
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 

# Read data 

from sklearn import datasets

from keras.datasets import mnist
import numpy as np 

(trainData_2D, trainLabels), (testData_2D, testLabels) = mnist.load_data()

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz


##3) Understanding the dataset

There are alot of functions that can be used to know more about this dataset

- What is the shape of the training set (num of samples X number of attributes) ***[shape function can be used]***

- Display attribute names ***[columns function can be used]***

- Display the first 5 rows from training dataset ***[head or sample functions can be used]***

..

In [0]:
# Reshape training to 1D
train_sz = trainData_2D.shape
trainData = np.reshape(trainData_2D, (train_sz[0], train_sz[1]*train_sz[2])).astype('float32')

# Reshape testing to 1D
test_sz = testData_2D.shape
testData = np.reshape(testData_2D, (test_sz[0], test_sz[1]*test_sz[2])).astype('float32')

# Normalize
from sklearn.preprocessing import normalize
trainData = normalize(trainData)
testData = normalize(testData)

# Check the dimentionality and type of training data
print('Data Dimensionality of the training set: ', trainData.shape)

dataset = pd.read_csv('/content/mnist_train.csv')
print(dataset.head(5)) # print first 5 rows in your dataset

counter = 0

# Display attribute names
for col in dataset.columns:
  if counter > 1:
    print("Attribute name number ", counter, ": ", col)
  counter = counter + 1


Data Dimensionality of the training set:  (60000, 784)
   label  1x1  1x2  1x3  1x4  1x5  ...  28x23  28x24  28x25  28x26  28x27  28x28
0      5    0    0    0    0    0  ...      0      0      0      0      0      0
1      0    0    0    0    0    0  ...      0      0      0      0      0      0
2      4    0    0    0    0    0  ...      0      0      0      0      0      0
3      1    0    0    0    0    0  ...      0      0      0      0      0      0
4      9    0    0    0    0    0  ...      0      0      0      0      0      0

[5 rows x 785 columns]
Attribute name number  2 :  1x2
Attribute name number  3 :  1x3
Attribute name number  4 :  1x4
Attribute name number  5 :  1x5
Attribute name number  6 :  1x6
Attribute name number  7 :  1x7
Attribute name number  8 :  1x8
Attribute name number  9 :  1x9
Attribute name number  10 :  1x10
Attribute name number  11 :  1x11
Attribute name number  12 :  1x12
Attribute name number  13 :  1x13
Attribute name number  14 :  1x14
Attribute

##4) Shuffle and Split TRAINING data as train (also called development) (80%) and validation (20%) 

In [0]:

from sklearn.utils import shuffle
import random

from sklearn.model_selection import train_test_split

# Shuffle the training data

#random.shuffle(trainData)

trainData, trainLabels = shuffle(trainData, trainLabels)


# Split 80-20

(train_x, val_x, train_y, val_y) = train_test_split(trainData, trainLabels, test_size=0.2)





##5) Train decision tree and k-NN  classifiers on development data and do model selection using the validation data

* Train a decision tree classifier with different values of "min_samples_split" which is the minimum number of samples required to split an internal node:  min_samples_split = [default = 2, 3, 5]. 

* Train a k-NN classifier (k=1 and k=5 and rest of the parameters set to default). 

In [0]:
# Train decision tree classifier (leave the best parameter version)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

GivenNum = [2, 3, 5]
accuracies_tree = []
val_results = []

for num in GivenNum:
  dt = DecisionTreeClassifier(min_samples_split = num, criterion = 'entropy')
  dt.fit(train_x, train_y)
  score = dt.score(val_x, val_y)
  accuracies_tree.append(score)
  print("For num = %d, validation accuracy = %.5f%%" % (num, score * 100))
  
  
# Train k-NN classifier (leave the best parameter version)

from sklearn.neighbors import KNeighborsClassifier

accuracies_knn = []

kVals = [1, 5]

# for k = 1

for k in kVals:
  model = KNeighborsClassifier(n_neighbors=k)
  model.fit(train_x, train_y)

  score = model.score(val_x, val_y)
  print("For k = %d, validation accuracy = %.5f%%" % (k, score * 100))

  accuracies_knn.append(score)


For num = 2, validation accuracy = 87.84167%
For num = 3, validation accuracy = 87.63333%
For num = 5, validation accuracy = 87.89167%
For k = 1, validation accuracy = 97.40833%
For k = 5, validation accuracy = 97.50000%


## 7) Test your CHOSEN classifier on Test set

- Load test data
- Apply same pre-processing as training data (probably none)
- Predict the labels of testing data **using the best chosen SINGLE model out of the models (2 approaches, 5 param. setting) that you have tried from step 6 (you have selected your model according to your validation results)** and report the accuracy. 

In [0]:
# Find the index of k value with the highest validation accuracy
i = np.argmax(accuracies_knn)
print("best k = %d with %.5f%% validation accuracy" % (kVals[i], accuracies_knn[i] * 100))

# Note: Pre-processing for the test data was done at step 3.

# Predict the labels of the test data
predictions = model.predict(testData)

# Train KNN with the best k value using [full] training data 
model = KNeighborsClassifier(n_neighbors=kVals[i])
model.fit(trainData, trainLabels)

# Calculate the accuracy given the true labels and prediction of test data
from sklearn.metrics import accuracy_score
TestAccuracy = accuracy_score(testLabels, predictions)

#Reporting the accuracy
print("Testing Accuracy = %.5f%%" % (TestAccuracy * 100))

# Print the confusion matrix of the testing data 

from sklearn.metrics import confusion_matrix

results = confusion_matrix(testLabels, predictions)
print(results)


best k = 5 with 97.50000% validation accuracy
Testing Accuracy = 97.30000%
[[ 975    1    0    0    0    0    3    1    0    0]
 [   0 1131    2    0    0    0    2    0    0    0]
 [  10    2 1006    1    1    0    0    8    4    0]
 [   3    1    3  975    1    9    0    6    9    3]
 [   3    2    0    0  943    0    6    1    2   25]
 [   6    0    0    7    1  860    7    2    6    3]
 [   4    2    0    0    2    2  948    0    0    0]
 [   2   12    6    0    0    0    0  992    0   16]
 [   6    3    3    9    3    5    4    3  934    4]
 [   9    8    2    6    4    3    1    6    4  966]]


##8) Notebook & Report 

**Notebook: As training and testing takes a long time, I will just look at your notebook results; so make sure each cell is run and  outputs are there.**

**Report:** Write an **at most one page summary** of your approach to this problem at the end of your notebook; this should be like an abstract of a paper or the executive summary (you aim for clarity and passing on information, not going to details about known facts such as what dec. trees are or what MNIST is, assuming they are known to people in your research area). 

**Must include statements such as:**

 ( Include the problem definition: 1-2 lines )
 
  (Talk about train/val/test sets, size and how split. )
 
  (Talk about any preprocessing you do.)
  
 ( Give the validation accuracies for different approaches, parameters **in a table** and state which one you selected)
 
 ( State  what your test results are with the chosen method, parameters: e.g. "We have obtained the best results with the ….. classifier      (parameters=....) , giving a digit classification accuracy of …% on test data….""

  (Comment on the speed of the algorithms and anything else that you deem important/interesting (e.g. confusion matrix)).

*You will get full points from here as long as you have a good (enough) summary of your work, regardless of your best performance or what you have decided to talk about in the last few lines.*

# **Report**

  In this task, we have the problem of classifying images from the MNIST dataset with correct labels. The images represent handwriting forms of digits and we are trying to predict the spesific digit by using decision tree and k-NN classifiers. Both the training and testing datasets are fetched from Keras.

  The images are in 28x28 format, representing digits from 0 to 9; each pixel is a gray-level from 0-255.

  The training dataset goes by the name "mnist_train.csv", has 60,000 samples and has a size of 104 MB. On the the other hand, the test dataset is called "mnist_test.csv", has 10,000 samples and has a size of 17.4 MB. 

  The 20% of the training data was used for validation and splitted in that fashion. Also, no cross-validation was used since we our data points are ample.

  For pre-processing, normalization was done and re-shaping of data from 2D to 1D was made.
  
  Validation accuracies for different approaches and parameters could be found in the table below:

+-------------+------------------+-       
| *k-NN Classifier *                  |       
|k-values     |  Validation Accuracy|           
| 1 -----------> 97.40833%         |        
| 5 -----------> 97.50000%         |        
+-------------+-------------------+-       

*-*-*-*-

+-------------+------------------+-       
| *Decision Tree Classifier *    |       
|min_sample_split  |  Validation Accuracy| 

| 2 -----------> 87.84167%   |  

| 3 -----------> 87.63333%   |

| 5 -----------> 87.89167%   | 

+-------------+-------------------+-  
                                           


  As a result, k-NN classifier with the parameter k = 5 was chosen.

  Moving on, while testing the chosen classifier (k-NN, with the parameter k = 5) with the test data, resulted in producing a digit classification accuracy of 97.30000% on the test data.

  Another observation was that that k-NN classifier takes more time than the decision tree to compute.

  Additionaly, you can see the confusion matrix at task 7 to have a better understanding of the results in this homework. Looking at the matrix, it could be clearly seen thatthe majority of the tests resulted on the diagonal of the matrix which tells us the chosen classifier has done a good job.

Thank you.


##9) Submission

Please submit your **"share link" INLINE in Sucourse submissions**

You should get your "share link" as **share with anyone in edit mode** (just in case)

Please **also submit a 1-page (at most) report in hardcopy** (what you write to step 8) with your name and Hw number etc. to facilitate grading. So step 8 is so that your notebook is complete. The hardcopy is so that grading is easy.


## Questions? 

You should ask all your Google Colab related questions to Discussions and feel free to answer/share your answer regarding Colab. 

You can also ask/answer about which functions to use and what libraries... 

However you should **not ask** about the core parts, that is what is validation/test, which one shd. have higher performance, what are your scores etc.
