# Model Selection

## Answers

In this lab, we work through using cross-validation to estimate the performance of a classifier.

**Task 1:**
In the cell below, we load the iris dataset and print the indices produced by a built-in cross-validation generator from scikit-learn. These indices can then be used to split our data into a training set and a testing set. Note that while this example uses StratifiedShuffleSplit, scikit-learn provides many other cross-validation generators. You can read more about this <a href="https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html">here</a>.


In [1]:
from sklearn import datasets
from sklearn.model_selection import StratifiedShuffleSplit
iris = datasets.load_iris()

#Set X to a (samples x features) matrix and y to the targets
#use only the first 10 datapoints for this example
X=iris.data[0:10]
y=iris.target[0:10]

#we use 10 splits, with the test size being 0.2
cvsplt = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

for train_index, test_index in cvsplt.split(X, y):
    print("train indices:", train_index, "test indices:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]


train indices: [9 8 4 3 6 2 7 1] test indices: [5 0]
train indices: [1 6 8 3 0 2 4 5] test indices: [9 7]
train indices: [8 7 3 5 1 9 4 6] test indices: [0 2]
train indices: [5 0 9 8 6 2 3 1] test indices: [4 7]
train indices: [5 4 6 0 1 2 7 3] test indices: [8 9]
train indices: [8 6 2 0 7 3 1 5] test indices: [4 9]
train indices: [7 8 5 3 9 6 4 2] test indices: [1 0]
train indices: [4 8 5 1 6 7 9 3] test indices: [2 0]
train indices: [8 9 0 4 6 1 5 7] test indices: [3 2]
train indices: [6 2 4 1 7 8 9 5] test indices: [3 0]
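
Other cross-validation generators expose the same `split` interface. As a minimal sketch, the cell below swaps in `KFold` (our choice for illustration, not part of the original lab) on the same 10 datapoints. Unlike StratifiedShuffleSplit, whose test sets can overlap between splits, the test sets of `KFold` partition the data, so every datapoint is used for testing exactly once.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X = iris.data[0:10]
y = iris.target[0:10]

# 5 folds over 10 datapoints: each test fold holds 2 datapoints
kf = KFold(n_splits=5, shuffle=True, random_state=0)

test_seen = []  # collects every index that was used for testing
for train_index, test_index in kf.split(X):
    print("train indices:", train_index, "test indices:", test_index)
    test_seen.extend(test_index.tolist())

# every datapoint appears in exactly one test fold
print(sorted(test_seen) == list(range(len(X))))
```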


**Task 2:**
In the cell below, we have the sample code that was used to train a k-NN classifier on the iris dataset. Your task is to use the code above to change the classification process we show below, in order to perform cross-validation on the data and obtain an estimate of the performance of the classifier with 10 neighbours on the unseen data.


In [2]:
# model answer

from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

%matplotlib inline
iris = datasets.load_iris()
#Set X to a (samples x features) matrix and y to the targets
X=iris.data
y=iris.target


#we use 10 splits, with the test size being 0.2
cvsplt = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

for train_index, test_index in cvsplt.split(X, y):
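    #print("train indices:", train_index, "test indices:", test_index)
    #select the training and testing data for this split
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #define a fresh knn classifier with 10 neighbours and the Euclidean distance
    knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')
    #fit on the training data only
    knn.fit(X_train,y_train)
    #predict the test labels and print the accuracy for this split
    y_pred=knn.predict(X_test)
    print(metrics.accuracy_score(y_test,y_pred))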

# original single-split sample code, kept for reference:
# #split to train and test
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# #define knn classifier, with 10 neighbours and use the Euclidean distance
# knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')
# #define training and testing data, fit the classifier
# knn.fit(X_train,y_train)
# #predict values for test data based on training data
# y_pred=knn.predict(X_test)
# #print values
# print(y_test) # true values
# print(y_pred) # predicted values


1.0
1.0
0.9
0.9666666666666667
0.9666666666666667
0.9
1.0
0.9333333333333333
0.9666666666666667
1.0
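
The explicit loop above can also be written more compactly. The following is a minimal sketch, assuming the helper `cross_val_score` from scikit-learn (not used in the original lab) and the same splitter settings. It fits a fresh copy of the classifier on each split and returns one accuracy value per split, which we then summarise.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# same splitter settings as in the loop above
cvsplt = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')

# for a classifier, the default score is the accuracy
scores = cross_val_score(knn, X, y, cv=cvsplt)
print(scores)
print('mean accuracy: %.3f (std. %.3f)' % (scores.mean(), scores.std()))
```

Reporting the mean together with the spread gives a single estimate of how the classifier is likely to perform on unseen data.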


<hr>


**Task 3:**  Now you are going to attempt to write your own cross-validation function to evaluate the k-NN classifier above (i.e. not using StratifiedShuffleSplit). 

Again, we are using a fixed distance (euclidean) and a fixed number of neighbours (10) so we do **not** need to create a validation set.

Your function (see cell below) first shuffles the indices of our datapoints and splits them into bins, one bin per fold (here: 5 folds).

Then, you should loop through all folds. For each fold, split the data into training and testing sets by selecting the appropriate bins (see slides on cross-validation), train on the training data, and save the test accuracy for that fold in the list accuracy_fold. This is the list that your function should return at the end. Remember that the extend function adds all the values of another sequence to the end of a list.

**Note**: you might want to refer to the additional helper notebook on indexing that is provided with this topic. 


In [3]:
def myCrossVal(X,y,foldK):
    '''
    This function performs cross validation on the sklearn 
    KNeighborsClassifier algorithm.
    
    inputs:
        X - a data matrix of size (samples, features)
        y - a label array of size (samples,)
        foldK - number of folds
        
    outputs:
        accuracy_fold - a list of foldK accuracy values
    
    '''
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn import metrics
    
    accuracy_fold=[] #list to store the accuracy of each fold
    
    
    #TASK: use the function np.random.permutation to generate a list of shuffled indices in the range (0,number of data)
    #fix the random seed so that the shuffling is reproducible
    np.random.seed(0)
    
    # Create a random permutation of the indices 0 to len(X)-1.
    indices = np.random.permutation(np.arange(0,len(X),1))
    
    #TASK: use the function array_split to split the indices to k different bins:
    bins = np.array_split(indices, foldK)
    
    #loop through folds
    for i in range(0,foldK):
        foldTrain=[] # list to save current indices for training
        foldTest=[]  # list to save current indices for testing
        #TASK: take bin i for testing, rest for training. 
        # Can use the function extend to add indices to foldTrain and foldTest
        foldTest = bins[i]
        
        for j in range(0,foldK):
            if j!=i:
                foldTrain.extend( bins[j] )
        
        #train kNN classifier
        knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')
        knn.fit(X[foldTrain,:],y[foldTrain])
        #test on test data
        y_pred=knn.predict(X[foldTest,:])
        #append the new accuracy to your accuracy_fold list.  
        
        accuracy_fold.append( metrics.accuracy_score(y[foldTest],y_pred) )
        
    return accuracy_fold
    
accuracy_fold=myCrossVal(X,y,5)

print(accuracy_fold)

[1.0, 0.8666666666666667, 1.0, 0.9666666666666667, 0.9666666666666667]
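
The fold construction used inside myCrossVal can be checked in isolation. The sketch below only needs NumPy and assumes 150 datapoints (the size of the iris dataset) and 5 folds. It confirms that, for every fold, the training and testing indices do not overlap and together cover all datapoints.

```python
import numpy as np

n_samples, foldK = 150, 5  # assumed: iris has 150 datapoints, 5 folds

np.random.seed(0)
indices = np.random.permutation(np.arange(0, n_samples, 1))
bins = np.array_split(indices, foldK)

for i in range(foldK):
    foldTest = bins[i]
    # all bins except bin i form the training indices
    foldTrain = np.concatenate([bins[j] for j in range(foldK) if j != i])
    # no index may be used for both training and testing
    assert len(np.intersect1d(foldTrain, foldTest)) == 0
    # together, the two sets must cover every datapoint
    assert len(foldTrain) + len(foldTest) == n_samples

print([len(b) for b in bins])  # [30, 30, 30, 30, 30]
```

Note that these bins are not stratified: the shuffling mixes the classes, but nothing guarantees equal class proportions in every fold.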


**Task 4:** Print the average accuracy and standard deviation of your results over the 5 folds (use the NumPy functions ``mean`` and ``std``).

In [4]:
print('average accuracy: %.3f (std. %.3f)' % (np.mean(accuracy_fold),np.std(accuracy_fold)))


average accuracy: 0.960 (std. 0.049)
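
One detail worth knowing: by default ``np.std`` divides by the number of values (the population standard deviation). With only 5 folds, the sample standard deviation (``ddof=1``) is noticeably larger. A minimal sketch, using the fold accuracies printed above:

```python
import numpy as np

# fold accuracies copied from the output of myCrossVal above
accuracy_fold = [1.0, 0.8666666666666667, 1.0, 0.9666666666666667, 0.9666666666666667]

mean_acc = np.mean(accuracy_fold)
std_pop = np.std(accuracy_fold)             # divides by n (default, ddof=0)
std_sample = np.std(accuracy_fold, ddof=1)  # divides by n-1

print('average accuracy: %.3f' % mean_acc)    # 0.960
print('population std:   %.3f' % std_pop)     # 0.049
print('sample std:       %.3f' % std_sample)  # 0.055
```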


### Nested cross-validation
Performing **nested** cross-validation is slightly more elaborate, but necessary when we want to learn the best values for our parameters (e.g. here, finding the best number of neighbours for k-NN) and still estimate performance on unseen data. The parameters are chosen in an inner cross-validation loop that only sees the training data of each outer fold, so the outer test data never influences the choice. You can follow the examples on scikit-learn <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html">here</a> and <a href="https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html">here</a> to learn how to do this.
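
As a rough sketch of the idea (not a model answer), the cell below wraps a k-NN classifier in `GridSearchCV` for the inner loop and passes it to `cross_val_score` for the outer loop. The candidate values for `n_neighbors`, the use of `StratifiedKFold`, and the fold counts are all assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# candidate numbers of neighbours (illustrative values, not from the lab)
param_grid = {'n_neighbors': [1, 3, 5, 10, 15]}

# inner loop selects n_neighbors, outer loop estimates performance
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# for each outer training set, GridSearchCV picks the best n_neighbors
# using only that training set, then refits on it
search = GridSearchCV(KNeighborsClassifier(metric='euclidean'),
                      param_grid, cv=inner_cv)

# one accuracy per outer fold, each measured on data unseen by the search
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print('nested CV accuracy: %.3f (std. %.3f)'
      % (nested_scores.mean(), nested_scores.std()))
```

The selected number of neighbours may differ between outer folds. The nested score evaluates the whole procedure (parameter search plus training), not one fixed value of `n_neighbors`.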
