# Exercise: Cross-Validation with Symmetric Pair-Input Data



## Task 1 

1. Implement the modified leave-one-out cross-validation scheme that is described in the lecture notes.

2. Estimate and report the generalisation performance of the K-nearest neighbor classifier in predicting the functional similarity of proteins. Use both the unmodified and the modified leave-one-out cross-validation.

3. Discuss your results. In particular, answer the following questions:
 - Why do the two cross-validation schemes produce notably different estimates?
 - For which types of pairs (A, B, or C) are these schemes appropriate and why?

In [1]:
def cindex(true_labels, predictions):
    n = 0
    h_sum = 0

    for i in range (len(true_labels)):
        t = true_labels[i]
        p = predictions[i]
        j= i+1
        for j in range(len(true_labels)):
            nt = true_labels[j]
            np = predictions[j]
            if t != nt:
                n = n + 1
                if (p < np and t < nt) or (p > np and t > nt):
                    h_sum = h_sum + 1
                elif p == np:
                    h_sum = h_sum + 0.5
    cindx = h_sum/n
    return cindx

In [2]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Load datasets
pairs = np.genfromtxt('pairs.data', delimiter=',', dtype=('U50', 'U50'))
features = np.genfromtxt('features.data', delimiter=',')
labels = np.genfromtxt('labels.data', delimiter=',')

# prediction lists to calculate c-index performance
predictions=[]
predictions2=[]

# Loop for each pair
for x in range(len(pairs)):  
    
    # test set
    test_pair = pairs[x, :]
    
    # Pairs member 1
    test_pair_1 = test_pair[0]
    
    # Pairs member 2
    test_pair_2 = test_pair[1]
    
    # delete rows used for testing to create trainset (Unmodified leave-one-out cross-validation)
    pairs_trainset = np.delete(pairs, (x), axis=0)
    features_trainset = np.delete(features, (x), axis=0)
    labels_trainset = np.delete(labels, (x), axis=0)
      
    # KNN prediction with k=1 and euclidean distance
    neigh1 = KNeighborsClassifier(n_neighbors=1)
    neigh1.fit(features_trainset, labels_trainset)
    
    prediction = neigh1.predict([features[x]])
    predictions.append(prediction)
    
 
    # lists for modified dataset
    modified_pairs = []
    modified_del_index = []
    
    #loop through training set pairs
    for n in range(len(pairs_trainset)):
        test_pair2 = pairs_trainset[n]
        
        # Filter pairs in training set containing test pair instances
        if test_pair2[0] != test_pair_1 and test_pair2[0] != test_pair_1 and test_pair2[1] != test_pair_2 and test_pair2[1] != test_pair_2:
            modified_pairs.append(test_pair2)
        else:
            modified_del_index.append(n)
    
    # Deleting rows from trainsets containing test pair instances
    features_modified = np.delete(features_trainset, (modified_del_index), axis=0)
    labels_modified = np.delete(labels_trainset, (modified_del_index), axis=0)
    
    # KNN prediction modified
    neigh2 = KNeighborsClassifier(n_neighbors=1)
    neigh2.fit(features_modified, labels_modified)
    
    prediction2 = neigh2.predict([features[x]])
    predictions2.append(prediction2)
    
    
# Unmodified c-index calculation

C_index =cindex(labels, predictions)
print("C-index using unmodified leave-one-out cross-validation:", C_index)


# Modified c-index calculation
C_index2 = cindex(labels, predictions2)
print("C-index performance using modified leave-one-out cross-validation:", C_index2)



C-index using unmodified leave-one-out cross-validation: 0.7617702448210922
C-index performance using modified leave-one-out cross-validation: 0.6313559322033898


1) 
The two cross-validation schemes produce notably different estimates because the data is symmetric pair-input data. This means pairs that have same members have depencies. This way is propably information leak beteen the test set and training set. Using modified leave-one-out corss-validation this issue fixed by removing the shared members from  the training set.
This is also why the unmodified scheme gives "better" perfomance, but it's more optimisitic than the more realistic of modified result.

2)
Modified leave-one-out cross-validation works better for case C because of no unknown members instance are present in known instances, so the cross-validation is faster because there are less rows in training set than in unmodified leave-one-out crossvalidation scheme. 

Unmodified leave-one-out cross-validation works better for case A because both members of unknown instance are present in   n instances, so there are dependencies between training and test sets, which leads to better generalisation performance.
