# 1. Nested cross-validation exercise
## Nested cross-validation for feature selection with nearest neighbors
- Use Python 3 to program both a hyper-parameter selection method based on 5-fold cross-validation and a nested 5-fold cross-validation for estimating the prediction performance of models inferred with this automatic selection approach. Use the base learning algorithm provided in the exercise, namely the "use_ith_feature" method, so that the value of the hyper-parameter i is automatically selected from the range of alternative values from 1 to 100. The provided base learning algorithm "use_ith_feature" is a 1-nearest neighbor method that uses only the ith feature of the data given to it. The 5-fold CV based hyper-parameter selection procedure is supposed to select the best feature, i.e. the value of i, based on the C-index evaluated on predictions obtained with 5-fold cross-validation. A ready-made implementation of the C-index is also provided in the exercise. In nested 5-fold cross-validation, a C-index value is further evaluated on the predictions obtained from an outer 5-fold cross-validation. During each round of this outer 5-fold CV, the whole feature selection process based on the inner 5-fold CV is carried out separately, and the selected feature is used to predict the test data points held out during that round of the outer CV. Accordingly, the actual learning algorithm whose prediction performance is evaluated with nested CV is the one that automatically selects the single best feature with 5-fold cross-validation based model selection (see the lectures and the pseudocode presented in them for more on this interpretation).
- Compare the C-index produced by nested 5-fold CV with the result of ordinary 5-fold CV with the best value of i, i.e. the feature providing the highest 5-fold CV C-index, and show the C-index difference between the two methods.
- Use the provided implementation of the "use_ith_feature" learning algorithm and C-index functions in your exercise.

In summary, to complete this exercise, implement the following steps:
_______________________________________________________________
#### 1. Use 5-fold cross-validation to determine the optimal i-parameter for the data (X_train.csv, y_prediction.csv) from the set of possible values of i, i.e. {1,...,100}. When you have found the optimal i, save the corresponding C-index (call it 5_fold_c_index) for this parameter.
#### 2. Similarly, use nested cross-validation (5-fold CV in both the outer and inner folds) to estimate the C-index (call it n_5_fold_c_index) of the method that selects the best feature with the 5-fold approach.
#### 3. Return both this notebook and a PDF file made from it on the exercise submission page.
_______________________________________________________________

Remember to use the provided learning algorithm use_ith_feature and C-index functions in your implementation! 
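
The steps above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the reference solution: it runs on synthetic data instead of the exercise CSV files, it computes one C-index over the pooled held-out predictions (one possible reading of "C-index evaluated with predictions obtained with 5-fold cross-validation"), and the helper names `cv_predictions`, `select_feature` and `nested_cv_cindex` are hypothetical. The two provided functions are re-stated in compact form only so that the block runs on its own.

```python
import numpy as np
from sklearn.model_selection import KFold


def cindex(y, yp):
    """Pairwise concordance; same logic as the provided C-index function."""
    n, h_num = 0, 0.0
    for a in range(len(y)):
        for b in range(a + 1, len(y)):
            if y[a] != y[b]:
                n += 1
                if (yp[a] < yp[b] and y[a] < y[b]) or (yp[a] > yp[b] and y[a] > y[b]):
                    h_num += 1
                elif yp[a] == yp[b]:
                    h_num += 0.5
    return h_num / n


def use_ith_feature(X_train, y_train, X_test, i):
    """1-nearest neighbor on feature i only (compact restatement, 1-D y)."""
    nearest = np.abs(X_test[:, [i]] - X_train[:, i]).argmin(axis=1)
    return y_train[nearest]


def cv_predictions(X, y, i, n_splits=5, seed=0):
    """Pool the held-out predictions made with feature i over all folds."""
    preds = np.empty(len(y), dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in kf.split(X):
        preds[te] = use_ith_feature(X[tr], y[tr], X[te], i)
    return preds


def select_feature(X, y):
    """Step 1: return (best feature index, its pooled 5-fold C-index)."""
    scores = [cindex(y, cv_predictions(X, y, i)) for i in range(X.shape[1])]
    best = int(np.argmax(scores))
    return best, scores[best]


def nested_cv_cindex(X, y, n_splits=5, seed=1):
    """Step 2: redo the whole selection inside every outer training set."""
    preds = np.empty(len(y), dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in kf.split(X):
        best_i, _ = select_feature(X[tr], y[tr])
        preds[te] = use_ith_feature(X[tr], y[tr], X[te], best_i)
    return cindex(y, preds)


# Synthetic stand-in data (assumption): 30 samples, 10 features,
# with the output driven mainly by column 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 10))
y_demo = X_demo[:, 3] + 0.1 * rng.normal(size=30)

best_i, plain_cv = select_feature(X_demo, y_demo)
nested = nested_cv_cindex(X_demo, y_demo)
print(f"selected feature: {best_i}")
print(f"5-fold C-index: {plain_cv:.3f}, nested C-index: {nested:.3f}")
print(f"difference: {plain_cv - nested:.3f}")
```

Pooling the predictions before scoring avoids computing a C-index on very small validation folds; averaging per-fold C-index values is another common choice and can give different numbers on small data sets.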

## Import libraries

In [10]:
#In this cell import all libraries you need. For example: 
import numpy as np
from sklearn.model_selection import KFold
import pandas as pd

## Provided functions 

In [11]:
"""
C-index function: 
- INPUTS: 
'y' an array of the true output values
'yp' an array of predicted output values
- OUTPUT: 
The c-index value
"""
def cindex(y, yp):
    n = 0
    h_num = 0 
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            # named npred (not np) so the local does not shadow the numpy alias
            npred = yp[j]
            if (t != nt): 
                n = n + 1
                if (p < npred and t < nt) or (p > npred and t > nt): 
                    h_num += 1
                elif (p == npred):
                    h_num += 0.5
    return h_num/n

"""
Self-contained 1-nearest neighbor using only a single feature
- INPUTS: 
'X_train' a numpy matrix of the X-features of the train data points
'y_train' a numpy matrix of the output values of the train data points
'X_test' a numpy matrix of the X-features of the test data points
'i' the index of the feature to be used with 1-nearest neighbor
- OUTPUT: 
'y_predictions' a list of the output value predictions
"""
def use_ith_feature(X_train, y_train, X_test, i):
    y_predictions = []
    for test_ind in range(0, X_test.shape[0]):
        diff = X_test[test_ind, i] - X_train[:, i]
        distances = np.sqrt(diff * diff)
        sort_inds = np.array(np.argsort(distances), dtype=int)
        y_predictions.append(y_train[sort_inds[0]])
    return y_predictions
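
For reference, the quantity computed by the provided `cindex` function can be written as follows, where $P$ is the set of comparable pairs (pairs whose true outputs differ) and tied predictions count as half:

$$
C = \frac{1}{|P|} \sum_{(a,b) \in P} \Big( \mathbb{1}\big[(\hat{y}_a - \hat{y}_b)(y_a - y_b) > 0\big] + \tfrac{1}{2}\,\mathbb{1}\big[\hat{y}_a = \hat{y}_b\big] \Big), \qquad P = \{(a,b) : a < b,\ y_a \neq y_b\}
$$

As a small worked example, take $y = (1, 2, 3, 4)$ and $\hat{y} = (1, 3, 2, 4)$. All 6 pairs are comparable, 5 of them are ordered the same way by $y$ and $\hat{y}$, and only the pair of the second and third points is reversed, so $C = 5/6 \approx 0.83$. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.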


## Your implementation here

In [12]:
# reading the data from csv files
X = pd.read_csv('X_train.csv', header=None)
y = pd.read_csv('y_prediction.csv', header=None)
X_train = X.values
y_train = y.values

# number of the folds for C-V
n_folds = 5
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

# variables for results
best_i = None
best_c_index = 0.0

# 5-fold C-V over every feature (i is the 0-based column index)
for i in range(X_train.shape[1]):
    total_c_index = 0.0
    
    for train_index, val_index in kf.split(X_train):
        X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
        y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]
        
        #  1-NN with the current i
        y_pred_fold = use_ith_feature(X_train_fold, y_train_fold, X_val_fold, i)
        
        # c-index for the current fold
        c_index_fold = cindex(y_val_fold, y_pred_fold)
        total_c_index += c_index_fold
    
    # average c-index among folds for the current i
    avg_c_index = total_c_index / n_folds
    
    # updating the best_i and best_c_index 
    if avg_c_index > best_c_index:
        best_i = i
        best_c_index = avg_c_index

print(f"Optimal i parameter: {best_i}")
print(f"5-fold C-index for the optimal i: {best_c_index}")


Optimal i parameter: 45
5-fold C-index for the optimal i: 0.74


In [13]:
X = pd.read_csv('X_train.csv', header=None)
y = pd.read_csv('y_prediction.csv', header=None)
X_train = X.values
y_train = y.values

# number of folds for outer and inner C-V
outer_n_folds = 5
inner_n_folds = 5

# in order to save results
outer_total_c_index = 0.0

# This is the outer C-V
outer_kf = KFold(n_splits=outer_n_folds, shuffle=True, random_state=42)
for outer_train_index, outer_test_index in outer_kf.split(X_train):
    X_outer_train, X_outer_test = X_train[outer_train_index], X_train[outer_test_index]
    y_outer_train, y_outer_test = y_train[outer_train_index], y_train[outer_test_index]
    
    # variables for inner C-V
    inner_best_i = None
    inner_best_c_index = 0.0
    
    # This is the inner C-V
    inner_kf = KFold(n_splits=inner_n_folds, shuffle=True, random_state=42)
    for i in range(X_outer_train.shape[1]):
        inner_total_c_index = 0.0
        
        for inner_train_index, inner_val_index in inner_kf.split(X_outer_train):
            X_inner_train, X_inner_val = X_outer_train[inner_train_index], X_outer_train[inner_val_index]
            y_inner_train, y_inner_val = y_outer_train[inner_train_index], y_outer_train[inner_val_index]
            
            # 1-NN with the current  i
            y_pred_inner = use_ith_feature(X_inner_train, y_inner_train, X_inner_val, i)
            
            # c-index for the current fold
            c_index_inner = cindex(y_inner_val, y_pred_inner)
            inner_total_c_index += c_index_inner
        
        # average c-index in inner folds
        avg_c_index_inner = inner_total_c_index / inner_n_folds
        
        # updating the inner_best_i and inner_best_c_index 
        if avg_c_index_inner > inner_best_c_index:
            inner_best_i = i
            inner_best_c_index = avg_c_index_inner
    
    # use the best i from the inner C-V to predict on the outer test set
    y_pred_outer = use_ith_feature(X_outer_train, y_outer_train, X_outer_test, inner_best_i)
    
    # c-index for the outer fold using the best_i
    c_index_outer = cindex(y_outer_test, y_pred_outer)
    outer_total_c_index += c_index_outer

#  average c-index in outer folds
avg_c_index_outer = outer_total_c_index / outer_n_folds

print(f"Nested 5-fold C-index for the method that selects the best feature: {avg_c_index_outer}")


Nested 5-fold C-index for the method that selects the best feature: 0.4666666666666667


In this exercise, the difference between the C-index from ordinary 5-fold cross-validation (0.74) and the one from nested cross-validation (about 0.467) was roughly 0.273. At first I expected nested cross-validation to give the higher performance, which did not happen at all, so I suspected that something was wrong or unexpected; for example, the fact that the labeled data has only 30 observations might be influencing the results. After some searching, I came to the conclusion that nested cross-validation gives a more reliable estimate of model performance, not necessarily a better one: the ordinary 5-fold score of the selected feature is the same score that was used to pick it among all candidates, so it is optimistically biased, whereas in nested cross-validation the outer test points play no part in the selection.
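
The optimism of the ordinary cross-validation score can be illustrated with a small simulation. The sketch below rests on assumptions that are not part of the exercise: the data are pure noise (30 samples, 100 features unrelated to the output), the C-index is computed on pooled predictions, and `cindex_vec` and `cv_cindex` are hypothetical helper names. Since no feature carries any signal, the scores should scatter around 0.5, yet the best of the 100 candidates can be expected to land noticeably higher, which is exactly the kind of bias that nested cross-validation is meant to remove.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(30, 100))  # 100 features with no relation to y
y_noise = rng.normal(size=30)


def cindex_vec(y, yp):
    """Vectorised C-index: concordant pairs count 1, tied predictions 0.5."""
    dy = np.sign(y[:, None] - y[None, :])
    dp = np.sign(yp[:, None] - yp[None, :])
    comparable = np.triu(dy != 0, k=1)  # each pair once, true outputs differ
    concordant = (dy == dp) & comparable
    tied = (dp == 0) & comparable
    return (concordant.sum() + 0.5 * tied.sum()) / comparable.sum()


def cv_cindex(X, y, i, seed=0):
    """Pooled 5-fold C-index of 1-nearest neighbor using feature i only."""
    preds = np.empty(len(y))
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    for tr, te in kf.split(X):
        nearest = np.abs(X[te][:, [i]] - X[tr][:, i]).argmin(axis=1)
        preds[te] = y[tr][nearest]
    return cindex_vec(y, preds)


scores = np.array([cv_cindex(X_noise, y_noise, i) for i in range(100)])
print(f"mean C-index over all features: {scores.mean():.3f}")
print(f"C-index of the best feature:    {scores.max():.3f}")
```

The gap between the two printed values is the part of the "best feature" score that comes from selection alone; it tends to grow with the number of candidate features and shrink with the number of observations.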

Also, in this assignment it was a bit hard to understand that the purpose of hyper-parameter tuning here is feature selection. It also does not make sense to me why only one of the 100 features is selected as the best one. This is probably not realistic feature selection, just an assignment meant to introduce the idea in the simplest possible way.