# 5. Classification and Cross-Validation

**Instructions:**
* Go through the notebook and complete the **tasks**.
* Make sure you understand the examples given!
* When a question allows a free-form answer (e.g., ``what do you observe?``), create a new markdown cell below and answer the question in the notebook.
* **Save your notebooks when you are done!**

In the previous lab, we loaded the iris dataset for flower classification and performed a simple exploratory data analysis: we visualized the features against the class labels in order to understand characteristics of the data (e.g., that some classes are easier to separate from others based on some features).

If you don't remember much about this, please revisit the corresponding lab (Lab 2) before moving on.

In this lab, we will go through the process of training a classifier on one part of a dataset (the training set) and evaluating its performance on unseen data (the test set).

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Run the cell below to load our data. Notice the last line, where we add some random Gaussian noise to our data to make the task more challenging (data in real life usually contains some form of noise).

In [4]:
%matplotlib inline


from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()

#view a description of the dataset (uncomment next line to do so)
#print(iris.DESCR)

#X is a (samples x features) matrix, y holds the target class labels
X=iris.data 
y=iris.target 


#we add some random noise to our data to make the task more challenging
X=X+np.random.normal(0,0.4,X.shape)
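
Because the noise is random, every run of the cell above produces a slightly different dataset, and so slightly different results later on. If repeatable numbers are wanted, one option is a seeded generator. The sketch below assumes NumPy's `default_rng` interface; the seed value `0` and the stand-in matrix `X_clean` are placeholders for illustration, not part of the lab.

```python
import numpy as np

# Hypothetical seed value; any fixed integer gives repeatable noise.
rng = np.random.default_rng(0)

# Stand-in for the iris feature matrix (150 samples, 4 features).
X_clean = np.zeros((150, 4))

# Same noise parameters as the lab cell: mean 0, standard deviation 0.4.
X_noisy = X_clean + rng.normal(0, 0.4, X_clean.shape)
print(X_noisy.shape)
```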

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many data samples do we have?  Print the value below using ``shape`` on X appropriately.

In [5]:
print(X.shape[0])


150


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many features do we have?  Print the value below using ``shape`` on X appropriately.

In [6]:
print(X.shape[1])

4


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many classes do we have?  Print the value below using ``np.unique`` appropriately.

In [7]:
print(len(np.unique(y)))

3

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many samples do we have that belong to class 1?  Use the ``np.where`` function appropriately on y to print this in the cell below.

In [8]:
#np.where returns a tuple of index arrays; the number of indices is the number of samples
print(np.where(y == 1)[0].shape[0])

50
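
To count the samples of every class at once, `np.unique` can also return counts. The following is a minimal sketch on a stand-in label vector `y_demo` (an assumption that mirrors the layout of `iris.target`, 50 samples per class); it also shows how a count can be read off the result of `np.where`.

```python
import numpy as np

# Stand-in labels with the same layout as iris.target: 50 samples per class.
y_demo = np.repeat([0, 1, 2], 50)

# np.where returns a tuple of index arrays; the count is the length of the first one.
n_class1 = np.where(y_demo == 1)[0].shape[0]

# np.unique with return_counts=True gives the count of every class in one call.
classes, counts = np.unique(y_demo, return_counts=True)

print(n_class1)
print(dict(zip(classes.tolist(), counts.tolist())))
```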

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Assume we want to generate a list of shuffled indices of our data.  Use the function ``numpy.random.permutation`` to do that.  In the cell below, you can already see how to create a list of indices that is **not** shuffled.

In [9]:
L=list(range(X.shape[0]))
print(L)
#Enter code here

np.random.permutation(L)


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]


array([ 82, 124,  30, 128,  23,  97,  20,  67,  87,  36,  98,  48, 133,
        12,  53,  71,   6, 125, 122, 110, 141,  65,  59,  89,  38,  47,
        62, 114,  50, 102,  14, 144,  15,  70,  33,  66,  18, 123, 107,
        92,  96,  26,  43,   3, 111,  79,  44,  81,  21,  84, 120,  41,
       140, 117, 119,  49,   9,  56,   0, 118,  99,  39,  69, 127,  52,
        25,  83,  34,  35, 112, 105,  88,  94, 137,  54, 108, 104,   7,
       139,  85,  51,  60, 126,   2, 149,  11,  32,  61,  78,  10,  63,
        27, 147,  28, 116, 145,   5,  24,  37,  55, 106,  90, 136, 100,
        57,  64, 121, 142,  17,  68, 148, 134,  95,  16,  45,  72,  22,
       103, 138, 129, 143, 146,  31, 113,  73,  46, 115, 130,  40,   8,
        13,  74,  75,  76,  42,  19,  77,   1,  91, 101,  86,  29,  93,
        80, 132, 109, 135,  58, 131,   4])
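
Shuffled indices are useful because the same permutation can be applied to both the features and the labels, so each sample stays paired with its label. A small sketch, using toy arrays `X_demo` and `y_demo` that are assumptions for illustration only:

```python
import numpy as np

# Toy data: row i of X_demo is [2*i, 2*i + 1], so rows are easy to recognise.
X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([0, 0, 1, 1, 2, 2])

# Passing an integer n permutes the indices 0..n-1.
perm = np.random.permutation(X_demo.shape[0])

# Indexing both arrays with the same permutation keeps samples and labels aligned.
X_shuffled = X_demo[perm]
y_shuffled = y_demo[perm]

print(perm)
print(y_shuffled)
```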

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Here is an example of using the k-NN classifier.  We split our data into training and test sets (holding out 20% of the data for testing), fit the classifier on the training data, and predict on the test data.  Go through the code and make sure you understand it.  Then do the same for the next cell, which prints the confusion matrix and the overall accuracy.  (documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

**Note: for this lab, we use the Euclidean distance along with 10 neighbours.**

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#define knn classifier with 10 neighbors, using the Euclidean distance
knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')
#fit the classifier on the training data
knn.fit(X_train,y_train)
#predict labels for the test data
y_pred=knn.predict(X_test)
#print values
print(y_test) # true values
print(y_pred) # predicted values

[1 0 2 1 2 2 0 2 1 2 1 0 0 1 1 2 1 1 2 0 1 1 1 0 2 0 2 0 1 0]
[1 0 2 1 2 1 0 2 1 2 1 0 0 1 2 2 1 1 2 0 1 2 2 0 2 0 2 0 1 0]


In [35]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score
print(confusion_matrix(y_test,y_pred))
print('overall accuracy: %s' % accuracy_score(y_test,y_pred))

# we can also generate the class-relative precision and recall
print('class precision: %s' % precision_score(y_test,y_pred,average=None))
print('class recall: %s' % recall_score(y_test,y_pred,average=None))


[[9 0 0]
 [0 9 3]
 [0 1 8]]
overall accuracy: 0.8666666666666667
class precision: [1.         0.9        0.72727273]
class recall: [1.         0.75       0.88888889]
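
Note that `train_test_split` shuffles the data randomly, so the test set, and with it the confusion matrix, changes from run to run, and the classes need not be equally represented (here the test set holds 9, 12 and 9 samples of the three classes). As a hedged sketch, the `random_state` and `stratify` parameters of `train_test_split` can make the split repeatable and class-balanced; the seed `0` is an arbitrary choice and the noise-free iris data is used only to keep the example self-contained.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_demo, y_demo = iris.data, iris.target

# random_state fixes the shuffle; stratify keeps the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# Number of test samples per class.
print(np.bincount(y_te))
```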


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Write your **own** functions that return the confusion matrix given the true and predicted labels, as well as the accuracy.  To do so, fill in the code in the next two cells.

In [36]:
#create a matrix with entries equal to zero, and subsequently build the confusion matrix
#the method should return the confusion matrix in a numpy array
def myConfMat(y_test,y_pred,classno):
    # initialize the confusion matrix to zeros
    C = np.zeros((classno, classno), dtype=int)
    #loop through all results and update the confusion matrix:
    #rows index the true class, columns index the predicted class
    for true_label, pred_label in zip(y_test, y_pred):
        C[true_label, pred_label] += 1
    return C

#note: len(np.unique(y))  indicates the dimensions of the confusion matrix (why?)
print(myConfMat(y_test,y_pred,len(np.unique(y))))

[[9 0 0]
 [0 9 3]
 [0 1 8]]


In [38]:
#use the numpy function where to return the accuracy given the true/predicted labels.  i.e., #correct/#total
def myAccuracy(y_test,y_pred):
    # np.where returns the indices at which the prediction matches the true label
    correct = np.where(y_test == y_pred)[0]
    # accuracy is the fraction of correctly classified samples
    accuracy = correct.shape[0] / len(y_test)
    return accuracy


print(myAccuracy(y_test,y_pred))

0.8666666666666667
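
The accuracy can also be read directly off a confusion matrix: the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the sum of all entries (not the ratio of off-diagonal to diagonal entries). A short sketch, using the confusion matrix printed earlier in this notebook as a hard-coded example:

```python
import numpy as np

# Confusion matrix from the k-NN example: rows are true classes, columns are predictions.
C = np.array([[9, 0, 0],
              [0, 9, 3],
              [0, 1, 8]])

# Correct predictions lie on the diagonal: 9 + 9 + 8 = 26 out of 30 samples.
accuracy = np.trace(C) / C.sum()
print(accuracy)
```

This gives 26/30, which agrees with the value reported by `accuracy_score`.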


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Write your own cross-validation function.  In this case, we are using a fixed distance (euclidean) and a fixed number of neighbours (10) so we do **not** need to create a validation set.

Your function (see cell below) first shuffles the indices of our data samples and splits them into bins according to the number of folds (here: 5-fold).

Then, you should loop through all folds. For each fold, split the data into training and testing by selecting the appropriate bins (see slides on cross-validation), train on the training data, and save the accuracy on the test data for that fold (see list accuracy_fold).  This is the list that your function should return in the end.  Remember that the ``extend`` function extends a list with more values.

The final print call at the end of the cell should print the list of accuracies, with five values, one for each fold.

In [None]:
def myCrossVal(X,y,foldK):
    accuracy_fold=[] #list to store accuracies folds
    
    
    #TASK: use the function np.random.permutation to generate a list of shuffled indices in the range (0,number of data)
    #(you did this already in a task above)
    #indices=#...
    #print(indices)
    
    #TASK: use the function np.array_split to split the indices to k different bins:
    #uncomment line below
    #bins=
    #print(bins)
    
    
    #loop through folds
    for i in range(0,foldK):
        foldTrain=[] # list to save current indices for training
        foldTest=[]  # list to save current indices for testing
        #TASK: take bin i for testing, rest for training.  Can use the function extend to add indices to foldTrain and foldTest
        #train kNN classifier
        #test on test data
        #append the new accuracy to your accuracy_fold list.  You can use accuracy_score or your myAccuracy function.
    return accuracy_fold
    
accuracy_fold=myCrossVal(X,y,5)
print(accuracy_fold)
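
The steps described in this task can be sketched as follows. This is one possible way to fill in the skeleton, not a reference solution: the function name `cross_val_sketch` is hypothetical, it uses `np.concatenate` instead of `extend` to gather the training indices, and it runs on the noise-free iris data so that the block is self-contained.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def cross_val_sketch(X, y, foldK):
    """Hypothetical helper: k-fold cross-validation with a 10-NN classifier."""
    accuracy_fold = []
    # Shuffle the sample indices, then split them into foldK (nearly) equal bins.
    indices = np.random.permutation(X.shape[0])
    bins = np.array_split(indices, foldK)
    for i in range(foldK):
        # Bin i is the test fold; all remaining bins form the training set.
        foldTest = bins[i]
        foldTrain = np.concatenate([bins[j] for j in range(foldK) if j != i])
        knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
        knn.fit(X[foldTrain], y[foldTrain])
        fold_pred = knn.predict(X[foldTest])
        accuracy_fold.append(accuracy_score(y[foldTest], fold_pred))
    return accuracy_fold

iris = datasets.load_iris()
scores = cross_val_sketch(iris.data, iris.target, 5)
print(scores)
```

Every sample ends up in exactly one test fold, so each of the five accuracies is measured on data the classifier did not see during training.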

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Print the average accuracy and standard deviation of your results over the 5 folds. (functions ``mean`` and ``std``)

In [None]:
#######################################
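
As a minimal sketch of this step, the per-fold accuracies can be summarised with `np.mean` and `np.std`. The list `accuracy_demo` below contains made-up values that stand in for the output of `myCrossVal`; they are not results from this lab.

```python
import numpy as np

# Hypothetical per-fold accuracies, standing in for the list returned by myCrossVal.
accuracy_demo = [0.9, 1.0, 0.8, 0.9, 0.9]

mean_acc = np.mean(accuracy_demo)
# np.std computes the population standard deviation by default (ddof=0).
std_acc = np.std(accuracy_demo)

print('mean accuracy: %.3f, std: %.3f' % (mean_acc, std_acc))
```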

<hr>
<span style="color:rgb(170,0,0)">**Optional task:**</span> Write your own functions to calculate class-relative precision and recall. Compare these to the sklearn functions ``precision_score`` and ``recall_score`` that were used above on the original y_test and y_pred values (from the beginning of this tutorial).

In [None]:
#hint: you can use the output from your myConfMat function above

def myPrecision(y_test,y_pred):
    classes = np.unique(y_pred)
    precision = np.zeros(classes.shape) # placeholder: fill in the precision of each class
    return precision

def myRecall(y_test,y_pred):
    classes = np.unique(y_pred)
    recall = np.zeros(classes.shape) # placeholder: fill in the recall of each class
    return recall

print('classes:      %s' % np.unique(y_pred) )    
print('my precision: %s' % myPrecision(y_test,y_pred))
print('my recall:    %s' % myRecall(y_test,y_pred))
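
One way to approach this, following the hint, is to compute both measures from the confusion matrix: with true classes in rows and predictions in columns, precision divides the diagonal by the column sums and recall divides it by the row sums. The sketch below hard-codes the confusion matrix printed earlier in this notebook and assumes every class is predicted at least once (otherwise a column sum is zero).

```python
import numpy as np

# Confusion matrix from the k-NN example: rows are true classes, columns are predictions.
C = np.array([[9, 0, 0],
              [0, 9, 3],
              [0, 1, 8]])

correct = np.diag(C)

# Column sums count how often each class was predicted.
precision = correct / C.sum(axis=0)
# Row sums count how many samples truly belong to each class.
recall = correct / C.sum(axis=1)

print('precision: %s' % precision)
print('recall:    %s' % recall)
```

For this matrix the values are 9/9, 9/10, 8/11 for precision and 9/9, 9/12, 8/9 for recall, which agrees with the `precision_score` and `recall_score` output shown earlier.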

