# Exercise 1 | TKO_2096 Application of Data Analysis 2021

#### Nested cross-validation for k-nearest neighbors
- Use Python 3 to program a nested cross-validation for the k-nearest neighbors (kNN) method so that the number of neighbors k is automatically selected from the range 1 to 10. In other words, the base learning algorithm is kNN, but the actual learning algorithm, whose prediction performance is evaluated with nested CV, is kNN with automatic CV-based model selection (see the lectures and the pseudocode presented in them for more on this interpretation).
- As the kNN implementation, you can use sklearn (http://scikit-learn.org/stable/modules/neighbors.html), but you can also use your own kNN implementation if you want more control over what happens in the learning process. The CV implementation should be easy to modify, since the forthcoming exercises involve different problem-dependent CV variations.
- Use the nested CV implementation on the iris data and report the resulting classification accuracy. Hint: you can use the nested CV example in the sklearn documentation (https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) as a starting point and compare your nested CV implementation with it, but do NOT use the ready-made CV implementations of sklearn, as the idea of the exercise is to learn to split the data on your own. The other exercises need more sophisticated data splitting schemes that are not necessarily available in libraries.
- Return your solution for each exercise BOTH as a Jupyter Notebook file and as a PDF file made from it.
- Return the report to the course page on **Monday 1st of February** at the latest.  
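
The exercise asks for the data to be split without ready-made CV helpers. A minimal sketch of one way to produce shuffled k-fold index splits by hand is shown below. The function name `kfold_indices` and its parameters are hypothetical choices for illustration, and it uses only the standard library.

```python
import random


def kfold_indices(n_samples, n_splits, seed=None):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [
        n_samples // n_splits + (i < n_samples % n_splits)
        for i in range(n_splits)
    ]

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    for train, test in kfold_indices(10, 3, seed=44):
        print(len(train), len(test))
```

Because the function returns plain index lists, they can be used directly to index NumPy arrays (for example `X[train]`), and a problem-dependent variation only needs to change how `indices` is grouped into folds.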

## Import libraries

In [1]:
# Library imports
import numpy as np

from sklearn.datasets import load_iris

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

from sklearn.neighbors import KNeighborsClassifier

## Results of the nested cross-validation

In [10]:
# Load the dataset
iris = load_iris()

X = iris.data
y = iris.target

# Parameters for cv and knn
folds = 5
n_neighbors = [1,2,3,4,5,6,7,8,9,10]


# K-fold split generators
inner_cv = KFold(n_splits=folds, shuffle=True, random_state=44)
outer_cv = KFold(n_splits=folds, shuffle=True, random_state=44)

outer_scores = []

print("Selecting the best k value:")
# Split the data into training and test sets for the outer cross-validation
for train_out, test_out in outer_cv.split(X):
    X_train, X_test = X[train_out], X[test_out]
    y_train, y_test = y[train_out], y[test_out]
    
    scores = [] # Mean score for each value of k (10 values)
    
    # For every value of k...
    for k in n_neighbors:
        inner_scores = []
        
        # ...split the outer training set further into inner training and validation sets...
        for train_in, validation in inner_cv.split(X_train):
            X_train_inner, X_val = X_train[train_in], X_train[validation]
            y_train_inner, y_val = y_train[train_in], y_train[validation]

            # ...and do classification
            knn = KNeighborsClassifier(k)
            knn.fit(X_train_inner, y_train_inner)

            # Score the model
            score = knn.score(X_val, y_val)
            inner_scores.append(score) # save score
        
        # Mean validation score over the inner folds for this value of k
        scores.append(np.mean(inner_scores))
            
    # Save this round's best mean score and the k that produced it
    best_score = max(scores)

    best_k_for_this_round = n_neighbors[scores.index(best_score)]
    
    print(f"Best k: {best_k_for_this_round}, with score: {round(best_score,3)}")
    
    # Evaluate the selected k. Note that cross_val_score re-splits only the
    # outer test fold (X_test, y_test) here; the model is never fit on X_train.
    knn = KNeighborsClassifier(best_k_for_this_round)
    outer_score = cross_val_score(knn, X_test, y_test, cv=outer_cv)
    outer_scores.append(np.mean(outer_score))


print()
print("Evaluating prediction performance of the selected model:")

for i, j in enumerate(outer_scores):
    print(f"Run #{i}: {round(j,3)}")

print()
print("Mean score for outer cv was:", round(np.mean(outer_scores),3))

Selecting the best k value:
Best k: 5, with score: 0.975
Best k: 8, with score: 0.967
Best k: 9, with score: 0.967
Best k: 8, with score: 0.975
Best k: 3, with score: 0.983

Evaluating prediction performance of the selected model:
Run #0: 0.967
Run #1: 0.967
Run #2: 0.933
Run #3: 0.8
Run #4: 0.833

Mean score for outer cv was: 0.9
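
The cell above scores each selected k by running `cross_val_score` on the outer test fold alone. In the usual formulation of nested CV, the model with the selected k is instead fit on the whole outer training fold and scored once on the outer test fold. The following is a minimal sketch of that variant under some assumptions: it uses only the standard library, a small hand-written kNN with squared Euclidean distance, and hypothetical helper names (`kfold_indices`, `knn_predict`, `accuracy`, `nested_cv`). It is not the code that produced the numbers above.

```python
import random
from collections import Counter


def kfold_indices(n_samples, n_splits, seed=None):
    """Yield (train, test) lists of shuffled sample indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        test = indices[fold::n_splits]  # every n_splits-th shuffled index
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k training points closest to x."""
    distances = [
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(X, y, train, test, k):
    """Use the train indices as the kNN reference set and score on the test indices."""
    X_tr = [X[i] for i in train]
    y_tr = [y[i] for i in train]
    hits = sum(knn_predict(X_tr, y_tr, X[i], k) == y[i] for i in test)
    return hits / len(test)


def nested_cv(X, y, k_values, n_splits=5, seed=44):
    """Return one (selected k, outer test accuracy) pair per outer fold."""
    results = []
    for train_out, test_out in kfold_indices(len(X), n_splits, seed):
        # Inner CV: mean validation accuracy per k, using outer-training samples only
        mean_inner = {}
        for k in k_values:
            fold_scores = []
            for tr_pos, val_pos in kfold_indices(len(train_out), n_splits, seed):
                tr = [train_out[p] for p in tr_pos]
                val = [train_out[p] for p in val_pos]
                fold_scores.append(accuracy(X, y, tr, val, k))
            mean_inner[k] = sum(fold_scores) / len(fold_scores)

        # Ties go to the first (smallest) k in k_values
        best_k = max(k_values, key=lambda k: mean_inner[k])

        # Outer step: train on the full outer training fold, test on the outer test fold
        results.append((best_k, accuracy(X, y, train_out, test_out, best_k)))
    return results
```

With the iris arrays from the cell above, a call such as `nested_cv(X.tolist(), y.tolist(), list(range(1, 11)))` would return the selected k and the test accuracy for each outer fold; averaging the accuracies gives the nested CV estimate.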


## OLD VERSION


In [9]:
###### THIS IS THE OLD VERSION ######

# Load the dataset
iris = load_iris()

X = iris.data
y = iris.target


# Do some definitions
folds = 5
n_neighbors = [1,2,3,4,5,6,7,8,9,10]

inner_cv = KFold(n_splits=folds, shuffle=True, random_state=44)
outer_cv = KFold(n_splits=folds, shuffle=True, random_state=44)

outer_scores = []


# Split the data into training and test sets for the outer cross-validation
for train_i, test_i in outer_cv.split(X):
    X_train, X_test = X[train_i], X[test_i]
    y_train, y_test = y[train_i], y[test_i]
    
    scores = []
    

    # Split the training set further into inner training and validation sets
    for train_inner, valid in inner_cv.split(X_train):
        X_train_inner, X_val = X_train[train_inner], X_train[valid]
        y_train_inner, y_val = y_train[train_inner], y_train[valid]
        
        inner_scores = []
        
        # Get prediction scores for every value of k
        for k in n_neighbors:
            knn = KNeighborsClassifier(k)
            knn.fit(X_train_inner, y_train_inner)
            
            score = knn.score(X_val, y_val)
            inner_scores.append((k, score))
            
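        # For this inner fold, keep the largest (k, score) tuple. Tuples compare
        # by k first, so this is always the entry for the largest k.
        scores.append(max(inner_scores))
    
    # Take k from the largest tuple across the inner folds
    best_k = max(scores)[0]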
    
    # And use it in a cross-validation run on the outer test fold only
    knn = KNeighborsClassifier(best_k)
    outer_score = cross_val_score(knn, X_test, y_test, cv=outer_cv)
    outer_scores.append(np.mean(outer_score))
    
print("Accuracy:", round(np.mean(outer_scores),2))

Accuracy: 0.85
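
The old version stores `(k, score)` tuples and calls `max` on them. Python compares tuples element by element, so `max` looks at k before it looks at the score. The short sketch below illustrates the difference between that and selecting by score with a `key` function. The accuracy values are hypothetical and chosen only for illustration.

```python
# Hypothetical validation accuracies for three values of k
inner_scores = [(1, 0.97), (5, 0.99), (10, 0.90)]

# Plain max compares the first tuple element (k) first
by_tuple = max(inner_scores)

# A key function makes max compare the accuracy only
by_score = max(inner_scores, key=lambda pair: pair[1])

print(by_tuple)  # (10, 0.9)
print(by_score)  # (5, 0.99)
```

Storing the pairs as `(score, k)` would be another way to make a plain `max` select by score, with ties then going to the larger k.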
