# Classification and Cross-Validation

**Instructions:**
* Go through the notebook and complete the **tasks**.
* Make sure you understand the examples given!
* When a question allows a free-form answer (e.g., ``what do you observe?``), create a new markdown cell below and answer the question in the notebook.
* **Save your notebooks when you are done!**

In the previous lab, we loaded the iris dataset for flower classification and performed a simple exploratory data analysis: we visualized the features grouped by class label to understand characteristics of the data (e.g., that some classes are easier to separate from others based on certain features).

If you don't remember much about this, please revisit the corresponding lab (Lab 2) before moving on.

In this lab, we will go through the process of training a classifier on one part of a dataset (the training set) and evaluating its performance on unseen data (the test set).

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Run the cell below to load our data. Notice the last line, where we add some random Gaussian noise to our data to make the task more challenging (data in real life usually contains some form of noise).

In [2]:
%matplotlib inline
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()

#view a description of the dataset (comment out the next line to hide it)
print(iris.DESCR)

#Set X to the (samples x features) data matrix and y to the targets
X=iris.data 
y=iris.target 


#we add some random noise to our data to make the task more challenging
X=X+np.random.normal(0,0.4,X.shape)


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :
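The noise added in the last line of the cell above changes on every run, so accuracies later in the notebook will vary between runs. A minimal sketch of how the same noise could be made repeatable, assuming a seeded NumPy generator (the seed value `0` and the names `rng`, `X_noisy` and `noise` are illustrative and not part of the lab code):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# A seeded generator makes the noise (and everything computed from it) repeatable.
# The seed value 0 is an arbitrary choice for this sketch.
rng = np.random.default_rng(0)
X_noisy = iris.data + rng.normal(loc=0.0, scale=0.4, size=iris.data.shape)

# The noise standard deviation (0.4) is comparable to the SD of sepal width
# listed in the summary statistics (0.43), so the task does get harder.
noise = X_noisy - iris.data
print(X_noisy.shape)
print(round(noise.std(), 2))
```

Re-creating the generator with the same seed reproduces exactly the same noisy matrix.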

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many data samples do we have?  Print the value below using ``shape`` on X appropriately.

In [19]:
n = X.shape[0]
n

150

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many features do we have?  Print the value below using ``shape`` on X appropriately.

In [16]:
number_of_features = X.shape[1]
number_of_features

4

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many classes do we have?  Print the value below using ``np.unique`` appropriately.

In [25]:
len(np.unique(y))

3

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> How many samples do we have that belong to class 1?  Use the ``np.where`` function appropriately on y to print this in the cell below.

In [87]:
len(y[np.where(y==1)])

50
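
The counting tasks above can also be cross-checked in one step. The sketch below is one possible way to do it: it reloads the labels so that it is self-contained, and the variable names `idx_class1`, `classes` and `counts` are illustrative.

```python
import numpy as np
from sklearn import datasets

y = datasets.load_iris().target

# np.where returns a tuple of index arrays; entry 0 holds the positions where y == 1
idx_class1 = np.where(y == 1)[0]
print(len(idx_class1))

# np.unique with return_counts=True gives the number of samples of every class at once
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```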

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Assume we want to generate a list of shuffled indices of our data.  Use the function ``numpy.random.permutation`` to do that.  In the cell below, you can already see how to create a list of indices that is **not** shuffled.

In [85]:
L=list(range(X.shape[0]))
print(L)
#Enter code here
shuffled = np.random.permutation(L)
shuffled

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]


array([135,   9, 124, 125,  21, 137, 133,  64,  82,  62,  13, 114,   2,
       144,  76,  87,  48,  43,  73, 139,  33,  89,  20, 110,  31,  44,
       103,  67, 108,  32, 105,  75,  70, 121,   1, 117, 123, 136,  96,
        27,  55,   8, 102,   0,  14, 101,  16,  45,  68, 112,  86,  56,
       141, 128,  47,  19,  17,  57,  12, 145,  26, 127,  84,  71,  94,
        29,  60,   7,  49,  58,  15,  91, 115,  37, 113,  41,  95,  46,
        38,  72,  97,  36,  11,  81,  63,  39,  24,   4,  66,  22,  25,
        18, 119, 120,  77,  88, 143,  90, 132,  85, 107, 129,  61, 100,
        23,  83,  34,  98,  74,  79, 131, 109,  30, 147,  10, 148, 142,
        65, 118,   5,  54,  80, 122,  52, 116,  92,  42, 134,  28, 140,
       138,  53, 146, 111,  69, 130,  78,  51,  50, 126, 106, 104,  99,
        59, 149,  93,  40,  35,   6,   3])
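
Shuffled indices are useful because the same permutation can reorder the samples and their labels together, keeping each row paired with its label. A small sketch on toy data (the arrays `X_toy` and `y_toy` and the seed are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for repeatability only
X_toy = np.arange(10).reshape(5, 2)     # 5 samples, 2 features
y_toy = np.array([0, 0, 1, 1, 2])       # one label per sample

perm = rng.permutation(len(X_toy))      # shuffled copy of the indices 0..4

# indexing both arrays with the same permutation keeps rows and labels aligned
X_shuf, y_shuf = X_toy[perm], y_toy[perm]
print(perm)
print(X_shuf)
print(y_shuf)
```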

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Here is an example of using the k-NN classifier.  We split our data into training and test sets (holding out 20% of the samples for testing), fit on the training data, and evaluate on the test data.  Go through the code and make sure you understand it.  Then do the same for the next cell, which prints the confusion matrix and the total accuracy.  (documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

**note: for this lab, we use the euclidean distance along with 10 neighbours**
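
To make the idea behind k-NN concrete, here is a rough from-scratch sketch of the prediction step: compute Euclidean distances to all training points, take the k closest, and vote. It illustrates the principle and is not how scikit-learn implements it; the function name `knn_predict` and the toy data are made up, and labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=10):
    """Predict labels by majority vote among the k nearest training points."""
    # pairwise Euclidean distances, shape (n_test, n_train)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # indices of the k closest training points for every test point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # majority vote; argmax on the counts breaks ties toward the smaller label
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

# toy data: two well-separated clusters
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_tr = np.array([0, 0, 0, 1, 1, 1])
X_te = np.array([[0.05, 0.1], [5.0, 5.1]])

toy_pred = knn_predict(X_tr, y_tr, X_te, k=3)
print(toy_pred)  # [0 1]
```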

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#define knn classifier, with 10 neighbors and the Euclidean distance
knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')
#fit the classifier on the training data
knn.fit(X_train,y_train)
#predict labels for the test data
y_pred=knn.predict(X_test)
#print values
print(y_test) # true values
print(y_pred) # predicted values

[2 0 2 0 2 2 0 0 2 2 2 1 0 0 2 1 1 0 2 1 2 0 0 2 1 0 0 1 0 0]
[2 0 2 0 2 1 0 0 2 2 1 1 0 0 1 1 1 0 2 1 2 0 0 2 1 0 0 1 0 0]

In [170]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score
print(confusion_matrix(y_test,y_pred))
print('overall accuracy: %s' % accuracy_score(y_test,y_pred))

# we can also generate the class-relative precision and recall 
print('class precision: %s' % precision_score(y_test,y_pred,average=None))
print('class recall: %s' %recall_score(y_test,y_pred,average=None))


[[ 6  0  0]
 [ 0 12  0]
 [ 0  1 11]]
overall accuracy: 0.9666666666666667
class precision: [1.         0.92307692 1.        ]
class recall: [1.         1.         0.91666667]
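
The three metrics printed above can all be read off the confusion matrix. As a worked check, the sketch below hard-codes the matrix from the output above and recomputes them with NumPy (the variable names are illustrative):

```python
import numpy as np

# confusion matrix from the output above: rows are true classes, columns are predicted classes
C = np.array([[6, 0, 0],
              [0, 12, 0],
              [0, 1, 11]])

accuracy = np.trace(C) / C.sum()          # correct predictions / all predictions
precision = np.diag(C) / C.sum(axis=0)    # per class: TP / (TP + FP), using column sums
recall = np.diag(C) / C.sum(axis=1)       # per class: TP / (TP + FN), using row sums

print(accuracy)
print(precision)
print(recall)
```

For class 1, for example, 13 samples were predicted as class 1 and 12 of them were correct, giving a precision of 12/13 ≈ 0.923.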


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Write your **own** functions that return the confusion matrix given the true and predicted labels, as well as the accuracy.  To do so, fill in the code in the next two cells.

In [217]:
#create a matrix with entries equal to zero, and subsequently build the confusion matrix
#the method should return the confusion matrix in a numpy array
def myConfMat(y_test,y_pred,classno):
    # initialize the (classno x classno) confusion matrix to zeros
    C = np.zeros((classno, classno), dtype=int)
    #loop through all results and update the confusion matrix
    #rows index the true class, columns the predicted class
    for i in range(len(y_pred)):
        C[y_test[i], y_pred[i]] += 1
    return C

#note: len(np.unique(y))  indicates the dimensions of the confusion matrix (why?)
print(myConfMat(y_test,y_pred,len(np.unique(y))))

[[ 6  0  0]
 [ 0 12  0]
 [ 0  1 11]]


In [260]:
#use the numpy function where to return the accuracy given the true/predicted labels, i.e., #correct/#total
def myAccuracy(y_test,y_pred):
    # np.where returns the indices at which the true and predicted labels agree
    correct = np.where(y_test==y_pred)[0]
    return len(correct)/len(y_test)
    
print(myAccuracy(y_test,y_pred))

0.9


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Write your own cross-validation function.  In this case, we are using a fixed distance (euclidean) and a fixed number of neighbours (10) so we do **not** need to create a validation set.

Your function (see cell below) first splits the shuffled indices of our data into bins, one bin per fold (here: 5-fold).

Then, you should loop through all folds, split the data into training and test sets by selecting the appropriate bins (see slides on cross-validation), train on the training data, and save the test accuracy of each fold (see the list accuracy_fold).  This is the list that your function should return in the end.  Remember that the ``extend`` function extends a list with more values.

The final print call at the end of the cell should print the list of accuracies, with five values, one for each fold.

In [523]:
def myCrossVal(X,y,foldK):
    accuracy_fold=[] #list to store the accuracy of each fold
    #TASK: use the function np.random.permutation to generate a list of shuffled indices
    #in the range (0,number of data)
    #(you did this already in a task above)
    indices = np.random.permutation(X.shape[0])

    #TASK: use the function array_split to split the indices to k different bins
    #(kept as a list of arrays, since bins can differ in size when foldK
    #does not divide the number of samples evenly)
    bins=np.array_split(indices, foldK)

    #loop through folds
    for i in range(0,foldK):
        foldTrain=[] # list to save current indices for training
        foldTest=[]  # list to save current indices for testing
        #TASK: take bin i for testing, rest for training.
        #Can use the function extend to add indices to foldTrain and foldTest
        foldTest.extend(bins[i])
        for j in range(0, foldK):
            if j!=i:
                foldTrain.extend(bins[j])
        #train kNN classifier
        knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')
        knn.fit(X[foldTrain],y[foldTrain])
        #test on test data
        y_pred=knn.predict(X[foldTest])
        #append the new accuracy to your accuracy_fold list.
        #You can use accuracy_score or your myAccuracy function.
        accuracy_fold.append(myAccuracy(y[foldTest], y_pred))
    return accuracy_fold
    
accuracy_fold=myCrossVal(X,y,5)
print(accuracy_fold)

[0.9, 0.9, 0.9, 0.8333333333333334, 0.9666666666666667]
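
For comparison, scikit-learn ships helpers that perform the same procedure. The sketch below uses `KFold` and `cross_val_score` with the same classifier settings. It runs on the noise-free iris data with an arbitrary `random_state`, so its numbers are not expected to match the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_demo, y_demo = iris.data, iris.target   # noise-free data, for illustration only

knn_demo = KNeighborsClassifier(n_neighbors=10, metric='euclidean')

# shuffle=True plays the role of the random permutation of indices;
# random_state=0 is an arbitrary seed that makes the folds repeatable
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# one accuracy value per fold, like the list returned by a hand-written loop
scores = cross_val_score(knn_demo, X_demo, y_demo, cv=folds, scoring='accuracy')
print(scores)
```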


<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Print the average accuracy and standard deviation of your results over the 5 folds. (functions ``mean`` and ``std``)

In [439]:
print("mean accuracy: %s" % np.mean(accuracy_fold))
print("standard deviation: %s" % np.std(accuracy_fold))

mean accuracy: 0.9
standard deviation: 0.07601169500660919
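
One detail worth knowing: by default ``np.std`` divides by the number of values (population standard deviation). With only five folds, the sample standard deviation (``ddof=1``, dividing by n - 1) is noticeably larger. A short sketch, using five fold accuracies as example input:

```python
import numpy as np

# five example fold accuracies (any list of per-fold accuracies works here)
acc = np.array([0.9, 0.9, 0.9, 0.8333333333333334, 0.9666666666666667])

mean_acc = np.mean(acc)
std_population = np.std(acc)         # divides by n (NumPy default, ddof=0)
std_sample = np.std(acc, ddof=1)     # divides by n - 1

print("mean accuracy: %s" % mean_acc)
print("population std: %s" % std_population)
print("sample std: %s" % std_sample)
```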


<hr>
<span style="color:rgb(170,0,0)">**Optional task:**</span> Write your own functions to calculate class-relative precision and recall. Compare these to the sklearn functions ``precision_score`` and ``recall_score`` that were used above on the original y_test and y_pred values (from the beginning of this tutorial).

In [525]:
#hint: you can use the output from your myConfMat function above

def myPrecision(y_test,y_pred):
    # number of classes, taken from the full label vector y so that classes
    # missing from y_pred still get an entry
    classno = len(np.unique(y))
    # get confusion matrix (rows: true class, columns: predicted class)
    conf_matrix = np.asarray(myConfMat(y_test,y_pred,classno))
    precision = np.zeros(classno)
    for i in range(classno):
        # true positives + false positives: all samples predicted as class i (column i)
        tp_plus_fp = conf_matrix[:,i].sum()
        # precision = true positives / (true positives + false positives)
        precision[i] = conf_matrix[i,i]/tp_plus_fp
    return precision

def myRecall(y_test,y_pred):
    # number of classes, taken from the full label vector y
    classno = len(np.unique(y))
    # get confusion matrix (rows: true class, columns: predicted class)
    conf_matrix = np.asarray(myConfMat(y_test,y_pred,classno))
    recall = np.zeros(classno)
    for i in range(classno):
        # true positives + false negatives: all samples that truly belong to class i (row i)
        tp_plus_fn = conf_matrix[i,:].sum()
        # recall = true positives / (true positives + false negatives)
        recall[i] = conf_matrix[i,i]/tp_plus_fn
    return recall

print("homemade functions:")
print('classes:      %s ' % np.unique(y_pred) )    
print('my precision: %s' % myPrecision(y_test,y_pred))
print('my recall:    %s \n\n' % myRecall(y_test,y_pred))

print("functions from sklearn:")
print('class precision: %s' % precision_score(y_test,y_pred,average=None))
print('class recall: %s' %recall_score(y_test,y_pred,average=None))

homemade functions:
classes:      [0 1 2] 
my precision: [1.         0.76923077 1.        ]
my recall:    [1.   1.   0.75] 


functions from sklearn:
class precision: [1.         0.76923077 1.        ]
class recall: [1.   1.   0.75]
