# Model Selection using cross-validation


## Instructions:
* Go through the notebook and complete the tasks.
* Make sure you understand the examples given. If you need help, refer to the documentation links provided or ask on the discussion forum.
* When a question allows a free-form answer (e.g. what do you observe?), create a new markdown cell below and answer the question in the notebook. 
* Save your notebooks when you are done. 

In this lab, we will work through using cross-validation to estimate how well a classifier performs on unseen data.

**Task 1:**
In the cell below, we load the iris dataset and print the indices generated by one of scikit-learn's built-in cross-validation splitters. These indices can then be used to split our data into a training set and a test set. Note that while this example uses StratifiedShuffleSplit, there are many other cross-validation generators in scikit-learn. You can read more about them <a href="https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html">here</a>.


In [4]:
from sklearn import datasets
from sklearn.model_selection import StratifiedShuffleSplit

# Load iris dataset
iris = datasets.load_iris()

# Set X to the samples-by-features matrix and y to the targets
# Use only the first 10 datapoints for this example (all of them belong to class 0)
X = iris.data[0:10]
y = iris.target[0:10]

# Use StratifiedShuffleSplit for cross-validation
cvsplt = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

for train_index, test_index in cvsplt.split(X, y):
    print("train indices:", train_index, "test indices:", test_index)



train indices: [9 8 4 3 6 2 7 1] test indices: [5 0]
train indices: [1 6 8 3 0 2 4 5] test indices: [9 7]
train indices: [8 7 3 5 1 9 4 6] test indices: [0 2]
train indices: [5 0 9 8 6 2 3 1] test indices: [4 7]
train indices: [5 4 6 0 1 2 7 3] test indices: [8 9]
train indices: [8 6 2 0 7 3 1 5] test indices: [4 9]
train indices: [7 8 5 3 9 6 4 2] test indices: [1 0]
train indices: [4 8 5 1 6 7 9 3] test indices: [2 0]
train indices: [8 9 0 4 6 1 5 7] test indices: [3 2]
train indices: [6 2 4 1 7 8 9 5] test indices: [3 0]
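
For comparison, here is a minimal sketch of one of the other generators, `StratifiedKFold`. With `StratifiedShuffleSplit` the test sets of different splits may overlap (index 0 appears in several test sets above). The test sets of a k-fold splitter are disjoint and together cover every sample exactly once. The choice of 5 folds, the seed, and the use of the full dataset are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 5 folds and random_state=0 are illustrative choices
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

test_folds = []
for train_index, test_index in skf.split(X, y):
    test_folds.append(test_index)
    # Stratification keeps the class proportions equal in every fold
    print("train size:", len(train_index),
          "test size:", len(test_index),
          "test class counts:", np.bincount(y[test_index]))

# Check that the test folds partition the dataset
all_test = np.sort(np.concatenate(test_folds))
print("every sample tested exactly once:",
      np.array_equal(all_test, np.arange(len(y))))
```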


**Task 2:**
In the cell below, we have the sample code that was used to train a k-NN classifier on the iris dataset. Your task is to modify this classification process, using the code above, so that it performs cross-validation on the data and gives an estimate of how well the classifier with 10 neighbours performs on unseen data.


In [5]:
from sklearn import datasets
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load iris dataset
iris = datasets.load_iris()

# Set X to the samples-by-features matrix and y to the targets
X = iris.data
y = iris.target

# Use StratifiedShuffleSplit for cross-validation
cvsplt = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

accuracies = []

# Train and evaluate the model using cross-validation
for train_index, test_index in cvsplt.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Define k-NN classifier with 10 neighbors
    knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')

    # Fit the classifier
    knn.fit(X_train, y_train)

    # Predict values for test data
    y_pred = knn.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

print("Accuracies for each fold:", accuracies)



Accuracies for each fold: [1.0, 1.0, 0.9, 0.9666666666666667, 0.9666666666666667, 0.9, 1.0, 0.9333333333333333, 0.9666666666666667, 1.0]
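
The explicit loop above can also be written more compactly with scikit-learn's `cross_val_score` helper, which accepts a splitter through its `cv` argument. The sketch below assumes the same splitter settings and classifier as in the cell above. Because `random_state` is fixed, it should evaluate the classifier on the same splits as the manual loop.

```python
from sklearn import datasets
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Same splitter and classifier settings as in the manual loop
cvsplt = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')

# cross_val_score clones the classifier, then fits and scores it on each split
scores = cross_val_score(knn, X, y, cv=cvsplt, scoring='accuracy')

print("Accuracies for each split:", scores)
print("Mean accuracy:", scores.mean())
```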


<hr>


**Task 3:** Now you are going to write your own cross-validation function to evaluate the k-NN classifier above (i.e. without using StratifiedShuffleSplit).

Again, we are using a fixed distance metric (Euclidean) and a fixed number of neighbours (10). Since no hyperparameters are being tuned, we do **not** need to create a validation set.

Your function (see cell below) first shuffles the indices of the data and splits them into bins, one bin per fold (here: 5 folds).

Then, you should loop through all folds. For each fold, use the current bin as the test set and all remaining bins as the training set (see slides on cross-validation), train on the training data, and save the test accuracy for that fold (see list accuracy_fold). This is the list that your function should return in the end. Remember that the extend method adds all the elements of another sequence to the end of a list.
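
As a small illustration of how `extend` differs from `append` when collecting indices from several bins, consider the sketch below. The toy array of 10 indices and the variable names are hypothetical and serve only this demonstration.

```python
import numpy as np

# Toy example: 10 indices split into 5 bins of 2 indices each
bins = np.array_split(np.arange(10), 5)

extended = []   # built with extend: one flat list of indices
appended = []   # built with append: a list of arrays

# Treat bin 0 as the test bin and collect all other bins for training
for j in range(5):
    if j != 0:
        extended.extend(bins[j])   # adds each index individually
        appended.append(bins[j])   # adds the whole array as one element

print("length with extend:", len(extended))
print("length with append:", len(appended))
```

The flat list produced by `extend` can be used directly to index a NumPy array, which is why the function below relies on it.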

In [6]:
def myCrossVal(X, y, foldK):
    accuracy_fold = []  # list to store accuracies for each fold
    
    # Shuffle indices (no seed is set, so the folds differ between runs)
    indices = np.random.permutation(len(y))
    
    # Split indices into k different bins
    bins = np.array_split(indices, foldK)

    # Loop through folds
    for i in range(foldK):
        foldTrain = []  # list to save current indices for training
        foldTest = bins[i]  # current bin for testing

        # Collect training indices (all other bins)
        for j in range(foldK):
            if j != i:
                foldTrain.extend(bins[j])
        
        # Select the training and test data using the collected indices
        X_train = X[foldTrain]
        X_test = X[foldTest]
        y_train = y[foldTrain]
        y_test = y[foldTest]

        # Train k-NN classifier
        knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
        knn.fit(X_train, y_train)

        # Test on test data
        y_pred = knn.predict(X_test)

        # Append accuracy to the list
        accuracy = accuracy_score(y_test, y_pred)
        accuracy_fold.append(accuracy)

    return accuracy_fold

# Perform cross-validation and get accuracies
accuracy_fold = myCrossVal(X, y, 5)
print("Accuracies for each fold:", accuracy_fold)


Accuracies for each fold: [1.0, 1.0, 0.9333333333333333, 0.9333333333333333, 1.0]


**Task 4:** Print the average accuracy and standard deviation of your results over the 5 folds (hint: use the NumPy functions ``mean`` and ``std``).

In [7]:
# Calculate average accuracy and standard deviation
average_accuracy = np.mean(accuracy_fold)
std_deviation = np.std(accuracy_fold)

print("Average Accuracy:", average_accuracy)
print("Standard Deviation of Accuracy:", std_deviation)


Average Accuracy: 0.9733333333333334
Standard Deviation of Accuracy: 0.03265986323710903
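
Note that `np.std` computes the population standard deviation by default (`ddof=0`). With only 5 folds, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below compares the two, using the fold accuracies copied from the Task 3 output above. Which of the two to report is a judgment call, so treat this as an illustration rather than a required step.

```python
import numpy as np

# Fold accuracies copied from the Task 3 output above
accuracy_fold = [1.0, 1.0, 0.9333333333333333, 0.9333333333333333, 1.0]

mean_acc = np.mean(accuracy_fold)
std_pop = np.std(accuracy_fold)             # divides by n (ddof=0, the default)
std_sample = np.std(accuracy_fold, ddof=1)  # divides by n - 1

print(f"Accuracy: {mean_acc:.3f} +/- {std_pop:.3f} (population std)")
print(f"Accuracy: {mean_acc:.3f} +/- {std_sample:.3f} (sample std)")
```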


### Nested cross-validation
Performing **nested** cross-validation is slightly more elaborate, but it is necessary when we want to learn the best values for our hyperparameters (e.g. here, finding the best number of neighbours for k-NN) and still obtain an unbiased estimate of performance on unseen data. You can follow the examples on scikit-learn <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html">here</a> and <a href="https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html">here</a> to learn how to do this.
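
As a rough sketch of one common pattern, an inner `GridSearchCV` can select the number of neighbours on each training portion, while an outer `cross_val_score` estimates the performance of the whole selection procedure. The candidate values for `n_neighbors`, the fold counts, and the seeds below are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Candidate numbers of neighbours (illustrative choice)
param_grid = {"n_neighbors": [1, 3, 5, 10, 15]}

# Inner loop selects the hyperparameter, outer loop estimates performance
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    KNeighborsClassifier(metric="euclidean"),
    param_grid,
    cv=inner_cv,
    scoring="accuracy",
)

# Each outer training set is passed to the grid search, which runs the
# inner cross-validation; the outer test fold is never used for selection
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print("Nested CV accuracies:", nested_scores)
print("Mean nested accuracy:", nested_scores.mean())
```

The outer test folds play no part in choosing `n_neighbors`, so the mean outer score estimates the performance of the tuned classifier on unseen data.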
