# 1. Nested cross-validation exercise
## Nested cross-validation for feature selection with nearest neighbors
- Use Python 3 to implement both a hyper-parameter selection method based on leave-one-out cross-validation (LOOCV) and a nested LOOCV that estimates the prediction performance of models inferred with this automatic selection approach. Use the base learning algorithm provided in the exercise, the "use_ith_feature" method, so that the value of the hyper-parameter i is selected automatically from the 100 alternative values (1 to 100). The provided "use_ith_feature" is a 1-nearest neighbor learner that uses only the ith feature of the data given to it. The LOOCV-based hyper-parameter selection procedure should select the best feature, i.e. the value of i, based on the C-index evaluated on predictions obtained with LOOCV. A ready-made implementation of the C-index is also provided in the exercise. In nested LOOCV, a C-index value is evaluated on the predictions obtained from an outer LOOCV. During each round of the outer LOOCV, the whole feature selection process based on the inner LOOCV is run separately on the training data of that round, and the selected feature is used to predict the test data point held out during that round. Accordingly, the learning algorithm whose prediction performance is evaluated with nested CV is the one that automatically selects the single best feature with LOOCV-based model selection (see the lectures and the pseudocode presented in them for more on this interpretation).
- Note that the hold-out set in LOOCV contains only a single datum, whereas the C-index requires at least two data points. The solution in this exercise is to "pool" the predictions of all LOOCV rounds of a single LOOCV computation into one array, whose length equals the number of data points used in that computation, and then compute the C-index on that array and the corresponding true outputs. This pooling approach has its weaknesses, however, since the C-index computed from pooled LOOCV outputs may sometimes be a heavily biased estimator of the true C-index. This has been considered in detail in our previous research (and in that of other groups, as seen in the references), which is available here:
http://dx.doi.org/10.1177/0962280218795190
where AUC, a special case of the C-index, is considered. The study goes quite deep into the problem of AUC estimation with CV, and you can read it if you are interested in the research carried out in our laboratory. The EMLM course does not go that far, and this year's exercise unfortunately still uses this non-optimal pooling approach.
- Compare the C-index produced by nested LOOCV with the result of ordinary LOOCV using the best value of i, i.e. the feature providing the highest LOOCV C-index, and show the C-index difference between the two methods.
- Use the provided implementation of the "use_ith_feature" learning algorithm and C-index functions in your exercise.
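
The pooling idea described above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions, not the exercise solution: `cindex_pairs`, `one_nn_single_feature`, and `pooled_loocv_cindex` are hypothetical names, the nearest-neighbor helper is a simplified stand-in for the provided "use_ith_feature", and the data are randomly generated rather than read from the exercise files.

```python
import numpy as np


def cindex_pairs(y_true, y_pred):
    """C-index over all pairs with different true values; tied predictions count 0.5."""
    concordant, comparable = 0.0, 0
    for a in range(len(y_true)):
        for b in range(a + 1, len(y_true)):
            if y_true[a] == y_true[b]:
                continue  # pairs with equal true outputs are not comparable
            comparable += 1
            if y_pred[a] == y_pred[b]:
                concordant += 0.5
            elif (y_pred[a] < y_pred[b]) == (y_true[a] < y_true[b]):
                concordant += 1.0
    return concordant / comparable


def one_nn_single_feature(X_train, y_train, x_test, i):
    """1-nearest neighbor prediction using only column i (zero-based)."""
    nearest = np.argmin(np.abs(X_train[:, i] - x_test[i]))
    return y_train[nearest]


def pooled_loocv_cindex(X, y, i):
    """Pool the single prediction of every LOOCV round, then compute one C-index."""
    n = len(y)
    pooled_pred = np.empty(n)
    for held_out in range(n):
        train = np.arange(n) != held_out  # boolean mask: all points except one
        pooled_pred[held_out] = one_nn_single_feature(
            X[train], y[train], X[held_out], i
        )
    # A single C-index over all n pooled predictions, not one per round.
    return cindex_pairs(y, pooled_pred)


# Synthetic data (assumption): the output depends on feature 1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X[:, 1] + 0.1 * rng.normal(size=20)

scores = [pooled_loocv_cindex(X, y, i) for i in range(X.shape[1])]
print(scores)
```

Each LOOCV round contributes exactly one entry to `pooled_pred`, so the C-index is computed once per feature over all pooled predictions.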

In summary, to complete this exercise, implement the following steps:
_______________________________________________________________
#### 1. Use leave-one-out cross-validation to determine the optimal i-parameter for the data (X_alternative.csv, y_alternative.csv) from the set of possible values of i, i.e. {1,...,100}. When you have found the optimal i, save the corresponding C-index (call it loo_c_index) for this parameter.
#### 2. Similarly, use nested leave-one-out cross-validation (leave-one-out in both the outer and inner folds) to estimate the C-index (call it nloo_c_index) of the method that selects the best feature with the leave-one-out approach.
#### 3. Return both this notebook and a PDF file made from it on the exercise submission page.
_______________________________________________________________

Remember to use the provided learning algorithm use_ith_feature and C-index functions in your implementation! 

## Import libraries

In [81]:
#In this cell import all libraries you need. For example: 
import numpy as np
from sklearn.model_selection import LeaveOneOut

## Provided functions 

In [82]:
"""
C-index function: 
- INPUTS: 
'y' an array of the true output values
'yp' an array of predicted output values
- OUTPUT: 
The c-index value
"""
def cindex(y, yp):
    n = 0
    h_num = 0 
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            p_j = yp[j]  # named p_j rather than np so the numpy alias is not shadowed
            if (t != nt): 
                n = n + 1
                if (p < p_j and t < nt) or (p > p_j and t > nt): 
                    h_num += 1
                elif (p == p_j):
                    h_num += 0.5
    return h_num/n

"""
Self-contained 1-nearest neighbor using only a single feature
- INPUTS: 
'X_train' a numpy matrix of the X-features of the train data points
'y_train' a numpy matrix of the output values of the train data points
'X_test' a numpy matrix of the X-features of the test data points
'i' the zero-based column index of the feature to be used with 1-nearest neighbor
- OUTPUT: 
'y_predictions' a list of the output value predictions
"""
def use_ith_feature(X_train, y_train, X_test, i):
    y_predictions = []
    for test_ind in range(0, X_test.shape[0]):
        diff = X_test[test_ind, i] - X_train[:, i]
        distances = np.sqrt(diff * diff)
        sort_inds = np.array(np.argsort(distances), dtype=int)
        y_predictions.append(y_train[sort_inds[0]])
    return y_predictions


## Your implementation here

### Prepare the data.

In [83]:
# Load data from csv-files into numpy arrays
data_X = np.genfromtxt('X_alternative.csv', delimiter=',')
data_y = np.genfromtxt('y_alternative.csv', delimiter=',')

# Check for NaN values
print(np.any(np.isnan(data_X)))
print(np.any(np.isnan(data_y)))

False
False


### 1.

In [84]:
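# This first attempt gives a wrong answer: pred_pool[:][i] below returns row i
# (one sample's predictions for all features), not column i (one feature's
# predictions for all samples), and the final loop runs over samples instead
# of features. The corrected implementation is in the next cell.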

def get_optimal_i(data_X, data_y):
    # Scikit-learn leave-one-out
    loo = LeaveOneOut()
    splits = loo.split(data_X)

    pred_pool = []
    true_y_pool = []

    c_indices = []

    # Loop over all of the splits and calculate the C-index for each split
    for (train_index, test_index) in splits:
        # Predictions for this split
        pred_this_split = []
        for i in range(100):
            # Predictions using ith feature
            pred_i = use_ith_feature(data_X[train_index], data_y[train_index], data_X[test_index], i)
            # Reduce extra dimension and append
            pred_this_split.append(pred_i[0])

        # Add predictions to the pool
        pred_pool.append(pred_this_split)
        true_y_pool.append(data_y[test_index][0])
    
    # Calculate C-index for each prediction array
    for i in range(len(true_y_pool)):
        c_ind = cindex(true_y_pool, pred_pool[:][i])
        c_indices.append(c_ind)

    # Get index of highest C-index value
    opt_i = np.argmax(c_indices)
    best_c_index = c_indices[opt_i]
    return opt_i, best_c_index
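
The indexing problem in the cell above can be reproduced on a small example. The sketch below uses a hypothetical 3-by-4 pool of made-up numbers (rows are held-out samples, columns are candidate features) to show that `[:][i]` on a list of lists returns a row, while column selection needs a NumPy array.

```python
import numpy as np

# Hypothetical pool: 3 held-out samples (rows) x 4 candidate features (columns).
pred_pool = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]

# [:] only copies the outer list, so [:][1] is simply row 1.
row_via_slice = pred_pool[:][1]

# Converting to an array allows true column selection:
# the predictions made with feature 1 for every sample.
column = np.asarray(pred_pool)[:, 1]

print(row_via_slice)  # [20, 21, 22, 23]
print(column)         # [11 21 31]
```

With column selection, each C-index would be computed from one feature's predictions across all samples, which is what the next cell achieves by looping over features first.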

In [85]:
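def get_optimal_i2(data_X, data_y):
    # Returns the zero-based index of the feature with the highest pooled
    # LOOCV C-index, together with that C-index.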
    c_index_for_each_i = []

    for i in range(100):

        # Scikit-learn leave-one-out
        loo = LeaveOneOut()
        splits = loo.split(data_X)

        # Predictions for this i
        pred_this_i = []
        # True labels for this i
        y_true_this_i = []

        # Loop over all of the splits
        for (train_index, test_index) in splits:
            # Prediction using ith feature
            pred_i = use_ith_feature(data_X[train_index], data_y[train_index], data_X[test_index], i)
            # Reduce extra dimension and append prediction and true label
            pred_this_i.append(pred_i[0])
            y_true_this_i.append(data_y[test_index][0])

        # Calculate C-index for this i
        c_ind_this_i = cindex(y_true_this_i, pred_this_i)
        c_index_for_each_i.append(c_ind_this_i)


    optimal_i = np.argmax(c_index_for_each_i)
    c_index_for_optimal_i = c_index_for_each_i[optimal_i]
    return optimal_i, c_index_for_optimal_i

In [86]:
optimal_i, loo_c_index = get_optimal_i2(data_X, data_y)
print("Optimal i: ", optimal_i)
print("C-index for optimal i: ", loo_c_index)

Optimal i:  76
C-index for optimal i:  0.6620689655172414


### 2.

In [87]:
# Scikit-learn leave-one-out
outer_loo = LeaveOneOut()
outer_splits = outer_loo.split(data_X)

outer_pred_pool = []
outer_true_y_pool = []

for (train_index, test_index) in outer_splits:
    opt_i = get_optimal_i2(data_X[train_index], data_y[train_index])[0]
    # Prediction with the optimal i
    pred = use_ith_feature(data_X[train_index], data_y[train_index], data_X[test_index], opt_i)

    # Add prediction and true value to the pools
    outer_pred_pool.append(pred[0])
    # Index with [0] so that a scalar, not a one-element array, is pooled
    outer_true_y_pool.append(data_y[test_index][0])

# Calculate C-index from the pools
nloo_c_index = cindex(outer_true_y_pool, outer_pred_pool)
nloo_c_index

0.5149425287356322

In [88]:
print("C-index for the first method is: {}.".format(loo_c_index))
print("C-index for the second method is: {}.".format(nloo_c_index))

C-index for the first method is: 0.6620689655172414.
C-index for the second method is: 0.5149425287356322.


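The nested LOOCV C-index (about 0.51) is close to 0.5, the value expected from random guessing, so the feature-selecting learning algorithm performs little better than chance on this data. The ordinary LOOCV C-index of the selected feature (about 0.66) is roughly 0.15 higher. This gap is consistent with selection bias: the same LOOCV predictions were used both to choose the feature and to evaluate it, which makes the first estimate optimistic.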
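
The difference between the two estimates, which the exercise asks to show, can be computed directly from the values printed above. In this small sketch the two C-index values are copied from the cell outputs, and `optimism` is a hypothetical variable name.

```python
# Values copied from the outputs of the cells above.
loo_c_index = 0.6620689655172414   # ordinary LOOCV, best single feature
nloo_c_index = 0.5149425287356322  # nested LOOCV

# How much higher the ordinary LOOCV estimate is than the nested one.
optimism = loo_c_index - nloo_c_index
print(f"C-index difference (LOOCV - nested LOOCV): {optimism:.4f}")
# prints: C-index difference (LOOCV - nested LOOCV): 0.1471
```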