# 1. Nested cross-validation exercise
## Nested cross-validation for feature selection with nearest neighbors
- Use Python 3 to implement both a hyper-parameter selection method based on leave-one-out cross-validation (LOOCV) and a nested LOOCV that estimates the prediction performance of models inferred with this automatic selection approach. Use the base learning algorithm provided in the exercise, the "use_ith_feature" method, so that the value of the hyper-parameter i is selected automatically from the range 1 to 100. "use_ith_feature" is a 1-nearest neighbor learner that uses only the ith feature of the data given to it. The LOOCV-based hyper-parameter selection procedure should select the best feature, i.e. the value of i, based on the C-index evaluated on predictions obtained with LOOCV. A ready-made implementation of the C-index is also provided in the exercise. In nested LOOCV, a C-index value is evaluated on the predictions obtained from an outer LOOCV. During each round of this outer LOOCV, the whole feature selection process based on the inner LOOCV is run separately, and the selected feature is used to predict the test data point held out during that round. Accordingly, the learning algorithm whose prediction performance is evaluated with nested CV is the one that automatically selects the single best feature with LOOCV-based model selection (see the lectures and the pseudocode presented in them for more on this interpretation).
- Note that the hold-out set in LOOCV contains only a single datum, whereas the C-index requires at least two data points. The solution in this exercise is to "pool" the predictions of all rounds of a single LOOCV computation into one array, whose length equals the number of data points used in that computation, and then compute the C-index on that array and the corresponding true outputs. This pooling approach has its weaknesses, since the C-index computed from pooled LOOCV outputs may sometimes be a heavily biased estimator of the true C-index. This has been considered in detail in our previous research (and in other groups' work, as seen in its references), which is available here:
http://dx.doi.org/10.1177/0962280218795190
where AUC, a special case of the C-index, is considered. The study goes quite deep into the problem of AUC estimation with CV, and you can read it if you are interested in the research carried out in our laboratory. The EMLM course does not go that far, and this year's exercise unfortunately still uses this non-optimal pooling approach.
- Compare the C-index produced by nested leave-one-out CV with the result of ordinary leave-one-out CV with the best value of i, i.e. the feature providing the highest LOOCV C-index, and show the C-index difference between the two methods.
- Use the provided implementation of the "use_ith_feature" learning algorithm and C-index functions in your exercise.
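
The pooling approach described in the note above can be illustrated with a small, self-contained sketch. It runs on made-up toy data, and the helper names `pairwise_cindex` and `pooled_loocv_predictions` are hypothetical: they mirror the behaviour described for the provided C-index and "use_ith_feature" functions but are not the provided implementations. The sketch also assumes one data point per row of `X`.

```python
import numpy as np

def pairwise_cindex(y, yp):
    """C-index over all pairs with different true outputs; tied predictions count 0.5."""
    y = np.asarray(y, dtype=float)
    yp = np.asarray(yp, dtype=float)
    concordant, comparable = 0.0, 0
    for a in range(len(y)):
        for b in range(a + 1, len(y)):
            if y[a] == y[b]:
                continue  # pairs with equal true outputs are not comparable
            comparable += 1
            if yp[a] == yp[b]:
                concordant += 0.5
            elif (yp[a] < yp[b]) == (y[a] < y[b]):
                concordant += 1.0
    return concordant / comparable

def pooled_loocv_predictions(X, y, feature):
    """Hold out each point once, predict it with 1-NN on one feature, pool the results."""
    n = X.shape[0]
    pooled = np.empty(n)
    for held_out in range(n):
        train = np.delete(np.arange(n), held_out)
        dist = np.abs(X[train, feature] - X[held_out, feature])
        pooled[held_out] = y[train[np.argmin(dist)]]
    return pooled

# Toy data (made up): 6 data points as rows, 2 features as columns.
X = np.array([[0.10, 0.9], [0.25, 0.1], [0.30, 0.8],
              [0.50, 0.2], [0.55, 0.7], [0.90, 0.3]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# One prediction per LOOCV round, pooled into a single array of length 6,
# so that the C-index can be computed over pairs of held-out points.
pooled = pooled_loocv_predictions(X, y, feature=0)
score = pairwise_cindex(y, pooled)
print(pooled, score)
```

Each LOOCV round contributes exactly one entry to `pooled`, which is why a single C-index can be computed even though every individual hold-out set contains only one point.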

In summary, complete this exercise by implementing the following steps:
_______________________________________________________________
#### 1. Use leave-one-out cross-validation to determine the optimal i-parameter for the data (X_alternative.csv, y_alternative.csv) from the set of possible values of i, i.e. {1,...,100}. When you have found the optimal i, save the corresponding C-index (call it loo_c_index) for this parameter.
#### 2. Similarly, use nested leave-one-out cross-validation (leave-one-out in both the outer and inner folds) to estimate the C-index (call it nloo_c_index) of the method that selects the best feature with the leave-one-out approach. 
#### 3. Return both this notebook and a PDF file made from it on the exercise submission page. 
_______________________________________________________________

Remember to use the provided learning algorithm use_ith_feature and C-index functions in your implementation! 

## Import libraries

In [639]:
#In this cell import all libraries you need. For example: 
import numpy as np
import pandas as pd

## Provided functions 

In [640]:
"""
C-index function: 
- INPUTS: 
'y' an array of the true output values
'yp' an array of predicted output values
- OUTPUT: 
The c-index value
"""
def cindex(y, yp):
    n = 0
    h_num = 0 
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            pj = yp[j]  # renamed from `np` to avoid shadowing the numpy alias
            if (t != nt): 
                n = n + 1
                if (p < pj and t < nt) or (p > pj and t > nt): 
                    h_num += 1
                elif (p == pj):
                    h_num += 0.5
    return h_num/n

"""
Self-contained 1-nearest neighbor using only a single feature
- INPUTS: 
'X_train' a numpy matrix of the X-features of the train data points
'y_train' a numpy matrix of the output values of the train data points
'X_test' a numpy matrix of the X-features of the test data points
'i' the index of the feature to be used with 1-nearest neighbor
- OUTPUT: 
'y_predictions' a list of the output value predictions
"""
def use_ith_feature(X_train, y_train, X_test, i):
    y_predictions = []
    for test_ind in range(0, X_test.shape[0]):
        diff = X_test[test_ind, i] - X_train[:, i]
        distances = np.sqrt(diff * diff)
        sort_inds = np.array(np.argsort(distances), dtype=int)
        y_predictions.append(y_train[sort_inds[0]])
    return y_predictions


In [641]:
#load data; after the transpose, rows are features and columns are data points
x_df = pd.read_csv('X_alternative.csv', header=None).T
display(x_df)

y_df = pd.read_csv('Y_alternative.csv', header=None).T
display(y_df)

#Check for missing values
print('Count of missing values in dataFrame x: ', x_df.isna().sum().sum())
print('Count of missing values in dataFrame y: ', y_df.isna().sum().sum())

#The features appear to be on the same scale and no values are missing, so no preprocessing is needed

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.619350 | 0.129018 | 0.131995 | 0.081591 | 0.536646 | 0.856475 | 0.269039 | 0.542290 | 0.953448 | 0.394755 | ... | 0.684800 | 0.240149 | 0.534512 | 0.743666 | 0.613363 | 0.044856 | 0.252338 | 0.481110 | 0.554211 | 0.624228 |
| 1 | 0.894394 | 0.420882 | 0.409956 | 0.546907 | 0.716692 | 0.403165 | 0.510235 | 0.487369 | 0.016227 | 0.082877 | ... | 0.580510 | 0.136212 | 0.198161 | 0.020115 | 0.800232 | 0.116786 | 0.043594 | 0.170237 | 0.310986 | 0.530988 |
| 2 | 0.523794 | 0.405804 | 0.522332 | 0.575482 | 0.918109 | 0.769205 | 0.984211 | 0.347833 | 0.441513 | 0.452526 | ... | 0.179802 | 0.446898 | 0.039641 | 0.245948 | 0.310818 | 0.045215 | 0.832020 | 0.401422 | 0.410235 | 0.807586 |
| 3 | 0.385701 | 0.403238 | 0.712307 | 0.709464 | 0.193943 | 0.851972 | 0.469135 | 0.059363 | 0.231504 | 0.755357 | ... | 0.686894 | 0.937015 | 0.564522 | 0.193605 | 0.019359 | 0.075442 | 0.335750 | 0.319551 | 0.430338 | 0.108209 |
| 4 | 0.409570 | 0.318821 | 0.486487 | 0.785272 | 0.398906 | 0.520888 | 0.937629 | 0.588577 | 0.572203 | 0.245983 | ... | 0.526579 | 0.733458 | 0.512536 | 0.764717 | 0.966871 | 0.169473 | 0.837364 | 0.594323 | 0.602016 | 0.353801 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.949962 | 0.564682 | 0.553774 | 0.758504 | 0.988201 | 0.099114 | 0.244342 | 0.577558 | 0.414522 | 0.562557 | ... | 0.940847 | 0.857497 | 0.814041 | 0.525641 | 0.349517 | 0.893160 | 0.888424 | 0.071549 | 0.321960 | 0.495570 |
| 96 | 0.051626 | 0.989836 | 0.197861 | 0.212656 | 0.283483 | 0.701698 | 0.248090 | 0.028091 | 0.459789 | 0.810378 | ... | 0.274311 | 0.482525 | 0.956788 | 0.665012 | 0.573106 | 0.913248 | 0.917017 | 0.431100 | 0.205266 | 0.764825 |
| 97 | 0.494949 | 0.965460 | 0.077357 | 0.804820 | 0.744440 | 0.097722 | 0.973724 | 0.271387 | 0.168571 | 0.660916 | ... | 0.659134 | 0.704446 | 0.657918 | 0.069332 | 0.224993 | 0.318439 | 0.624500 | 0.270147 | 0.289065 | 0.877994 |
| 98 | 0.416414 | 0.749550 | 0.016394 | 0.296416 | 0.889423 | 0.607366 | 0.928859 | 0.563848 | 0.383554 | 0.555825 | ... | 0.445784 | 0.867884 | 0.332692 | 0.934119 | 0.594126 | 0.642161 | 0.999535 | 0.473911 | 0.231834 | 0.687030 |


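| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.363947 | 0.654349 | 0.370083 | 0.833793 | 0.198503 | 0.198334 | 0.308125 | 0.411723 | 0.724286 | 0.115418 | ... | 0.132256 | 0.913207 | 0.485352 | 0.448907 | 0.917852 | 0.074258 | 0.073575 | 0.141819 | 0.572679 | 0.900572 |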


Count of missing values in dataFrame x:  0
Count of missing values in dataFrame y:  0


In [642]:
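#make test/train splits
def split(x_df, y_df, col):
    #x_df and y_df hold one data point per column, so `col` is the column label of the held-out point

    #take the held-out column of x_df as the test set, shape (1, n_features)
    xt = x_df[col]
    x_test = np.array([xt])

    #drop the held-out column from the training set, shape (n_train, n_features)
    xtr = x_df.drop([col], axis=1)
    x_train = np.array(xtr).T

    #drop the held-out point's label from the training labels, shape (n_train, 1)
    yt = y_df.drop([col], axis=1)
    y_train = np.array(yt).T

    return x_test, x_train, y_train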

## Leave-one-out implementation

In [643]:
#Loop through the data points (columns of x_df). Use the selected point for testing and the rest for training
#Goal: choose i 

#dataframe to store the C-index of each feature
cindex_df = pd.DataFrame(columns=['c_index'], index=range(0,x_df.shape[0]))

#true labels for the C-index computation
y_test = np.array(y_df)

In [644]:
#loop through the features (rows of x_df)
for i in range(0, x_df.shape[0]):
    
    #loop through the data points (columns of x_df)
    y_predictions = []
    for c in range(0, x_df.shape[1]):

        x_test, x_train, y_train = split(x_df, y_df, c)
    
        #predict the held-out point using only feature i
        y_pred = use_ith_feature(x_train, y_train, x_test, i)
        #pool the prediction with those of the other LOOCV rounds
        y_predictions = np.append(y_predictions, y_pred[0][0])
    
    #true labels and pooled predictions for the cindex function
    y = y_test[0]
    yp = y_predictions
    
    #store the C-index of feature i in the dataframe
    cind = cindex(y, yp)
    cindex_df.at[i,'c_index'] = cind

#print something to indicate we are done
print('Done')

Done


In [645]:
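#display all C-indices
display(cindex_df)

best = cindex_df[cindex_df.c_index.max() == cindex_df['c_index']]

#C-index of the best feature, named as requested in the task description
loo_c_index = cindex_df['c_index'].max()

#feature with the maximum C-index
print('The best attribute for regression is ')
display(best)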

| | c_index |
|---|---|
| 0 | 0.37931 |
| 1 | 0.532184 |
| 2 | 0.356322 |
| 3 | 0.585057 |
| 4 | 0.55977 |
| ... | ... |
| 95 | 0.506897 |
| 96 | 0.47931 |
| 97 | 0.597701 |
| 98 | 0.385057 |


The best attribute for regression is 


| | c_index |
|---|---|
| 76 | 0.662069 |
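
The double loop above rebuilds a train/test split for every held-out point. As a hedged alternative, the pooled LOOCV predictions of all features can be computed from a pairwise distance matrix whose diagonal is set to infinity, so that no point can be its own nearest neighbour. This is a sketch on synthetic data, not the notebook's method: the function name `loocv_predictions_all_features` is hypothetical, and it assumes a samples × features NumPy array (the transpose of `x_df` as used in this notebook).

```python
import numpy as np

def loocv_predictions_all_features(X, y):
    """Return an (n_features, n_samples) array of pooled 1-NN LOOCV predictions.

    X has shape (n_samples, n_features); y has shape (n_samples,).
    """
    n_samples, n_features = X.shape
    preds = np.empty((n_features, n_samples))
    for f in range(n_features):
        col = X[:, f]
        # Pairwise absolute differences on feature f, shape (n_samples, n_samples).
        dist = np.abs(col[:, None] - col[None, :])
        # Excluding the diagonal is the "leave one out": a point may not
        # be chosen as its own nearest neighbour.
        np.fill_diagonal(dist, np.inf)
        preds[f] = y[np.argmin(dist, axis=1)]
    return preds

# Synthetic stand-in data with an arbitrary seed (not the exercise data).
rng = np.random.default_rng(0)
X_toy = rng.random((30, 5))
y_toy = rng.random(30)

preds = loocv_predictions_all_features(X_toy, y_toy)
print(preds.shape)
```

Row `f` of `preds` plays the same role as the `y_predictions` array built for feature `f` in the loop above, so it could be passed to the C-index function together with the true labels.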


## Nested-leave-one-out implementation

In [646]:
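#column labels 0..n-2 for the inner dataframes (one data point is held out in each outer round)
#computed from x_df because x_train_outer is not defined until the outer loop has run
names = list(range(0, x_df.shape[1] - 1))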

cindex_df_inner = pd.DataFrame(columns=['c_index'], index=range(0,x_df.shape[0])) #initialize array to store inner loops cindex values

y_test_outer = np.array(y_df)
y_predictions_outer = []

cindex_df_outer = pd.DataFrame(columns=['c_index'], index=range(0,x_df.shape[1]))

In [647]:
#outer split 
for c in range(0, x_df.shape[1]):
    print('Outer split: ', c)
    
    #call split function
    x_test_outer, x_train_outer, y_train_outer = split(x_df, y_df, c)
    
    #dataframes for inner split 
    x_df_inner = pd.DataFrame((x_train_outer.T))
    x_df_inner.columns = names
    y_df_inner = pd.DataFrame((y_train_outer.T))
    y_df_inner.columns = names
    
    y_test_inner = np.array(y_df_inner) 
    
    #loop through the attributes 
    for i in range(0, x_df.shape[0]):
        
        y_predictions_inner = []
        
        #inner split
        for d in range(0, x_df_inner.shape[1]):
            x_test_inner, x_train_inner, y_train_inner = split(x_df_inner, y_df_inner, d)
            
            #predict the held-out inner point using only feature i
            y_pred = use_ith_feature(x_train_inner, y_train_inner, x_test_inner, i)
            #pool the prediction with those of the other inner LOOCV rounds
            y_predictions_inner = np.append(y_predictions_inner, y_pred[0][0])
        
        #true labels and pooled predictions for the cindex function
        y = y_test_inner[0]
        yp = y_predictions_inner

        #store the inner C-index of feature i in the dataframe
        cind = cindex(y, yp)
        cindex_df_inner.at[i,'c_index'] = cind
    
    #feature with the maximum inner C-index
    best = np.argmax(cindex_df_inner.values)
    
    #predict the outer held-out point using the selected feature
    y_pred = use_ith_feature(x_train_outer, y_train_outer, x_test_outer, best)
    
    #pool the prediction with those of the other outer LOOCV rounds
    y_predictions_outer = np.append(y_predictions_outer, y_pred[0][0])
    
print('Done')

Outer split:  0
Outer split:  1
Outer split:  2
Outer split:  3
Outer split:  4
Outer split:  5
Outer split:  6
Outer split:  7
Outer split:  8
Outer split:  9
Outer split:  10
Outer split:  11
Outer split:  12
Outer split:  13
Outer split:  14
Outer split:  15
Outer split:  16
Outer split:  17
Outer split:  18
Outer split:  19
Outer split:  20
Outer split:  21
Outer split:  22
Outer split:  23
Outer split:  24
Outer split:  25
Outer split:  26
Outer split:  27
Outer split:  28
Outer split:  29
Done


In [648]:
#true labels and pooled outer predictions for the cindex function
y = y_test_outer[0]
yp = y_predictions_outer

#C-index of the nested leave-one-out cross-validation
nloo_c_index = cindex(y, yp)

#The actual c-index of the model 
print('C-index of the model  is ', nloo_c_index)

C-index of the model  is  0.5149425287356322
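
The structure of the nested procedure can also be written as a few small functions, which makes it easier to see that feature selection is repeated inside every outer round without access to the held-out point. The sketch below is an illustration under stated assumptions, not the notebook's implementation: all function names are hypothetical, the data is synthetic noise generated with an arbitrary seed, the arrays are laid out as samples × features, and the C-index is a vectorized re-implementation of the behaviour described for the provided function.

```python
import numpy as np

def cindex_pairs(y, yp):
    """Vectorized C-index: pairs with equal true outputs are skipped, tied predictions count 0.5."""
    y = np.asarray(y, dtype=float)
    yp = np.asarray(yp, dtype=float)
    a, b = np.triu_indices(len(y), k=1)       # all index pairs with a < b
    dy, dp = y[a] - y[b], yp[a] - yp[b]
    comparable = dy != 0
    dy, dp = dy[comparable], dp[comparable]
    concordant = np.sum(np.sign(dy) == np.sign(dp)) + 0.5 * np.sum(dp == 0)
    return concordant / len(dy)

def loocv_preds(X, y, f):
    """Pooled 1-NN LOOCV predictions using only feature f."""
    col = X[:, f]
    dist = np.abs(col[:, None] - col[None, :])
    np.fill_diagonal(dist, np.inf)            # a point cannot be its own neighbour
    return y[np.argmin(dist, axis=1)]

def select_feature(X, y):
    """Inner LOOCV: return the feature with the highest pooled C-index and that C-index."""
    scores = [cindex_pairs(y, loocv_preds(X, y, f)) for f in range(X.shape[1])]
    best = int(np.argmax(scores))
    return best, scores[best]

def nested_loocv_cindex(X, y):
    """Outer LOOCV around the whole selection procedure."""
    n = X.shape[0]
    outer_preds = np.empty(n)
    for k in range(n):
        train = np.delete(np.arange(n), k)
        # The selection step only sees the training part of this outer round.
        best, _ = select_feature(X[train], y[train])
        dist = np.abs(X[train, best] - X[k, best])
        outer_preds[k] = y[train[np.argmin(dist)]]
    return cindex_pairs(y, outer_preds)

# Synthetic noise (not the exercise data): no feature is truly informative.
rng = np.random.default_rng(1)
X_noise = rng.random((20, 15))
y_noise = rng.random(20)

_, loo_score = select_feature(X_noise, y_noise)
nested_score = nested_loocv_cindex(X_noise, y_noise)
print(loo_score, nested_score)
```

On pure noise, any C-index clearly above 0.5 reported by `select_feature` reflects only the maximisation over features, which is the optimism that the nested estimate is meant to remove.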


## Conclusion 

This illustrates that evaluating a model on the same cross-validation predictions that were used to select its hyper-parameter gives an over-optimistic performance estimate.
Ordinary leave-one-out cross-validation with the best feature gave a C-index of <strong>0.66</strong>, whereas nested leave-one-out cross-validation gave <strong>0.51</strong>, a difference of about 0.15.
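
The exercise also asks for the difference between the two estimates. A minimal sketch, assuming the two values are stored under the names suggested in the task description (`loo_c_index` and `nloo_c_index`); the numbers are copied from the cell outputs above rather than recomputed:

```python
# Values copied from the outputs above (not recomputed here).
loo_c_index = 0.662069              # best feature, ordinary LOOCV
nloo_c_index = 0.5149425287356322   # nested LOOCV

# Optimism of the ordinary LOOCV estimate relative to the nested one.
difference = loo_c_index - nloo_c_index
print(f"C-index difference (LOOCV - nested LOOCV): {difference:.3f}")
```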