# 6. Nested Cross-Validation

**Instructions:**
* Go through the notebook and complete the **tasks**.
* Make sure you understand the examples given!
* When a question allows a free-form answer (e.g., ``what do you observe?``), create a new markdown cell below and answer the question in the notebook.
* **Save your notebooks when you are done!**

In the previous lab, we looked at cross-validation when the parameters of our classifier (e.g., k-NN) were known.
In this lab, you will extend the cross-validation code to find the best parameters for each fold by using a validation set. Please have a look at the relevant lecture slides, which demonstrate how to apply nested cross-validation, to remind yourself of the procedure.

**Note:** You can always copy the code into a separate notebook (or a plain-text .py file that you can run with python from the command line) if you want. After you are done, you can copy the code back into this notebook.

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Run the cell below to load our data. Note that besides adding noise, we also initialize the numpy random seed, so that we always get the same results regardless of how many times we run the code. Otherwise, this piece of code is the same as in the previous lab.

In [None]:
%matplotlib inline


from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

#import k-nn classifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()

#view a description of the dataset (uncomment next line to do so)
#print(iris.DESCR)

#Set X equal to the features, y equal to the targets

X=iris.data 
y=iris.target 


mySeed=1234567
#initialize random seed generator 
np.random.seed(mySeed)

#we add some random noise to our data to make the task more challenging
X=X+np.random.normal(0,0.5,X.shape)

<hr>
<span style="color:rgb(170,0,0)">**Task:**</span> Your task is now to write your own nested cross-validation function.

You can assume that we want to run 5-fold cross-validation, and evaluate the number of neighbours (from 1 to 10 inclusive), along with the 'euclidean' and 'manhattan' distances.

Your function should split the data (using indexes) into appropriate bins, similarly to how this was done in the previous lab. 
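
The binning step can be sketched on a toy example. This is only an illustration under assumed sizes (`n_samples`, `n_bins` and `demo_bins` are hypothetical names, not part of the lab code); it uses `np.random.permutation` to shuffle the indices and `np.array_split` to cut them into bins.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration
n_samples = 12
n_bins = 4

np.random.seed(0)

# Shuffled indices 0 .. n_samples-1
demo_indices = np.random.permutation(n_samples)

# Split the shuffled indices into n_bins (nearly) equal parts
demo_bins = np.array_split(demo_indices, n_bins)

for b, idx in enumerate(demo_bins):
    print('bin', b, idx)
```

Every index ends up in exactly one bin, so the bins can later be assigned to the training, validation and test sets without overlap.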

For each fold, the testing set should consist of indices in one bin, the validation set should consist of indices in another bin, and the rest of the bins can be assigned to your training set.
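
One possible way to assign the bins for a single fold is sketched below. It assumes that the validation bin is the one following the test bin and that it wraps around to bin 0 for the last fold (the modulo is an assumption made for this sketch). The indices are left unshuffled so the result is easy to read, and all names are hypothetical.

```python
import numpy as np

n_bins = 5  # hypothetical number of folds
demo_bins = np.array_split(np.arange(20), n_bins)  # unshuffled, for readability

i = 4                        # current fold: bin i is the test bin
val_bin = (i + 1) % n_bins   # next bin, wrapping around to 0 after the last one

fold_test = demo_bins[i].tolist()
fold_val = demo_bins[val_bin].tolist()
fold_train = []
for j in range(n_bins):
    # every bin that is neither test nor validation goes to training
    if j != i and j != val_bin:
        fold_train.extend(demo_bins[j].tolist())

print('train', fold_train)
print('val  ', fold_val)
print('test ', fold_test)
```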

Subsequently, we loop through all different parameters (one for loop for neighbours, one for loop for distances), train on the training set and test on the validation set.
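
As a rough sketch of this parameter search, the snippet below runs the two loops on a single, fixed train/validation split of the (noise-free) iris data. The split sizes and the variable names are assumptions made for illustration; in your function the indices come from the bins of the current fold.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris_demo = datasets.load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

np.random.seed(0)
idx = np.random.permutation(len(y_demo))
train_idx, val_idx = idx[:100], idx[100:125]  # hypothetical fixed split

best_acc, best_nn, best_dist = -1.0, None, None
for dist in ['euclidean', 'manhattan']:
    for nn in range(1, 11):
        # train on the training indices with the current parameters
        knn = KNeighborsClassifier(n_neighbors=nn, metric=dist)
        knn.fit(X_demo[train_idx], y_demo[train_idx])
        # score on the validation indices
        acc = accuracy_score(y_demo[val_idx], knn.predict(X_demo[val_idx]))
        if acc > best_acc:  # strict '>' keeps the first of any tied settings
            best_acc, best_nn, best_dist = acc, nn, dist

print('best NN', best_nn, 'best dist', best_dist, 'validation accuracy', best_acc)
```

With a strict comparison, ties are resolved in favour of the setting tried first; a different tie-breaking rule may select different parameters.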

Once we are done, we have the best performing set of parameters on our validation set.  We subsequently merge the training set with the validation set, and train on that set using best parameters.

Finally, we evaluate on our test set, and proceed to the next fold.
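
The retraining and testing step for one fold might look as follows. This is a minimal sketch: the split and the "best" parameters are simply assumed here (they would normally come from the validation loop), and all variable names are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris_demo = datasets.load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

np.random.seed(0)
idx = np.random.permutation(len(y_demo))
train_idx, val_idx, test_idx = idx[:90], idx[90:120], idx[120:]  # hypothetical split

# assumed to have been selected on the validation set
best_nn, best_dist = 5, 'euclidean'

# merge training and validation indices, then retrain with the best parameters
train_val_idx = np.concatenate([train_idx, val_idx])
knn = KNeighborsClassifier(n_neighbors=best_nn, metric=best_dist)
knn.fit(X_demo[train_val_idx], y_demo[train_val_idx])

# evaluate once on the held-out test indices
y_pred = knn.predict(X_demo[test_idx])
test_acc = accuracy_score(y_demo[test_idx], y_pred)
print('test accuracy', test_acc)
```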

Your function should return the accuracies on the test set (with best parameters) over all five folds, e.g. ``[0.80000000000000004, 0.8666666666666667, 0.80000000000000004, 0.96666666666666667, 0.73333333333333328]``
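
Once you have such a list, a common way to summarise it is the mean and standard deviation over the folds. The sketch below uses the example values above; `accuracy_fold_example` is a hypothetical name, and your own numbers will likely differ.

```python
import numpy as np

# Example per-fold accuracies, copied from the example above
accuracy_fold_example = [0.80000000000000004, 0.8666666666666667, 0.80000000000000004,
                         0.96666666666666667, 0.73333333333333328]

mean_acc = np.mean(accuracy_fold_example)
std_acc = np.std(accuracy_fold_example)
print('mean accuracy: %.3f, std: %.3f' % (mean_acc, std_acc))
```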

The code below is commented so that you can work through developing the function step by step. If you prefer, you can start working on this code in a different cell or IDE and then copy it here.

In [None]:
# nested cross validation function
# X - data / features
# y - outputs
# foldK - number of folds
# nns - list of number of neighbours parameter for validation
# dists - list of distances for validation
# mySeed - random seed
# returns: list of test accuracies, one per fold

def myNestedCrossVal(X,y,foldK,nns,dists,mySeed):
    np.random.seed(mySeed)
    accuracy_fold=[]
    
    #TASK: use the function np.random.permutation to generate a list of shuffled indices in the range (0, number of data points)
    #(you did this already in a task above)
    #indices=#...
    #print(indices)
    
    #TASK: use the function array_split to split the indices to foldK different bins (here, 5)
    #uncomment line below
    #bins=
    #print(bins)
    
    #no need to worry about this, just checking that everything is OK
    assert(foldK==len(bins))
    
    #loop through folds
    for i in range(0,foldK):
        foldTrain=[] # list to save current indices for training
        foldTest=[]  # list to save current indices for testing
        foldVal=[]    # list to save current indices for validation

        #loop through all bins, take bin i for testing, the next bin for validation, and the rest for training
        for j in range(0,len(bins)):
            pass  # placeholder so the loop body is valid Python; replace with your code
            #insert code here

            
        #print('** Train', len(foldTrain), foldTrain)
        #print('** Val', len(foldVal), foldVal)
        #print('** Test', len(foldTest), foldTest)
        
        
        #no need to worry about this, just checking that everything is OK
        # the three index sets must not share any element
        assert np.intersect1d(foldTest,foldVal).size == 0
        assert np.intersect1d(foldTrain,foldTest).size == 0
        assert np.intersect1d(foldTrain,foldVal).size == 0
       
        bestDistance='' #save the best distance metric here
        bestNN=-1 #save the best number of neighbours here
        bestAccuracy=-10 #save the best attained accuracy here (in terms of validation)
        
        
        # loop through all parameters (one for loop for distances, one for loop for nn)
        # train the classifier on current number of neighbours/distance
        # obtain results on validation set
        # save parameters if results are the best we had
        
        
        #print('** End of val for this fold, best NN', bestNN, 'best Dist', bestDistance)
        
        
        #evaluate on test data:
        #extend your training set by including the validation set
        #train k-NN classifier on new training set and test on test set
        #get performance on fold, save result in accuracy_fold array
       
        

        #print('==== Final Cross-val on test on this fold with NN', bestNN, 'dist', bestDistance, ' accuracy ',accuracy_score(y[foldTest],y_pred))

    return accuracy_fold
    
#call your nested cross-validation function:
 
accuracy_fold=myNestedCrossVal(X,y,5,list(range(1,11)),['euclidean','manhattan'],mySeed)
print(accuracy_fold)