# TKO_7092 Evaluation of Machine Learning Methods 2024

---

Student name: Arvin Jalali

Student number: 2310744

Student email: arvin.a.jalali@utu.fi

---

## Exercise 4

Complete the tasks given to you in the letter below. In your submission, explain clearly, precisely, and comprehensively why the cross-validation described in the letter failed, how cross-validation should be performed in the given scenario and why  your cross-validation will give a reliable estimate of the generalisation performance. Then implement the correct cross-validation for the scenario and report its results.

Remember to follow all the general exercise guidelines that are stated in Moodle. Full points (2p) will be given for a submission that demonstrates a deep understanding of cross-validation on pair-input data and implements the requested cross-validation correctly (incl. reporting the results). Partial points (1p) will be given if there are small error(s) but the overall approach is correct. No points will be given if there are significant error(s).

The deadline of this exercise is **Wednesday 21 February 2024 at 11:59 PM**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.

---


Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. Currently I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have a list of over 100.000 potential drug molecules, but their affinities still need to be verified in the lab. Obviously I do not have the resources to measure all the possible drug-target pairs, so I need to prioritise. I have decided to do this with the use of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had made in the lab, which comprise of all the 77 target proteins of interest but only 59 different drug molecules. Then I performed a leave-one-out cross-validation with this same data to estimate the generalisation performance of the model. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of the remaining drug molecules. The problem is: when I selected the highest predicted affinities and tried to verify them in the lab, I found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get a reliable estimate. Also, implement the correct leave-one-out cross-validation and report its results. I need to know whether I am wasting my lab resources by using my model.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist

---

#### Answer the questions about cross-validation on pair-input data

# Why did the estimation described in the letter fail?
# How should leave-one-out cross-validation be performed in the given scenario and why?
# Remember to provide comprehensive and precise arguments.
Pair-input data require special care because observations share objects: in drug-target interaction data, the same drug and the same target are paired in various combinations, so the observations are not independent. Performance evaluation methods must take these dependencies into account. That is why the estimation described in the letter fails: regular leave-one-out cross-validation removes only the test pair from the training set, so other pairs that contain the same drug remain available for training, and the resulting score does not reflect the performance on drugs the model has never seen.

There are four types of pair-input observations, and each type induces different dependencies that affect the estimation of out-of-sample prediction performance.

In type A, both members are in-sample objects. The performance on type A observations can be appropriately estimated by regular leave-one-out cross-validation. There are no restrictions on the objects that can appear in the training set.

In type B, the first member is an out-of-sample object and the second member is an in-sample object. To estimate the performance for type B, in each fold the observations that share the first pair member with the test observation must not be used for training. In other words, the first pair member of the test observation is not allowed to appear in the training set.

In type C, the first member is an in-sample object and the second member is an out-of-sample object. To estimate the performance for type C, in each fold the observations that share the second pair member with the test observation must not be used for training. In other words, the second pair member of the test observation is not allowed to appear in the training set.

In type D, both members are out-of-sample objects. To estimate the performance for type D, in each fold the observations that share either pair member with the test observation must not be used for training. In other words, neither pair member of the test observation is allowed to appear in the training set.

The four types of out-of-sample observations differ in the nature and extent of their dependencies. Out-of-sample prediction performance must therefore be estimated separately for each type.
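The restrictions for the four types can be summarised as a rule that decides which observations may stay in the training set of a fold. The following is a minimal sketch of that rule, not the code used later in this notebook; the function name `training_mask`, its `setting` argument and the toy pairs are assumptions made for illustration.

```python
import numpy as np

def training_mask(pairs, test_index, setting):
    """Return a boolean mask of the observations allowed for training in one fold.

    pairs      : array-like of shape (n, 2); column 0 holds the first pair
                 member, column 1 the second pair member.
    test_index : index of the left-out test observation.
    setting    : "A", "B", "C" or "D", as defined in the text above.
    """
    pairs = np.asarray(pairs)
    first, second = pairs[:, 0], pairs[:, 1]
    shares_first = first == first[test_index]
    shares_second = second == second[test_index]

    if setting == "A":
        # Regular leave-one-out: only the test observation itself is excluded.
        excluded = np.zeros(len(pairs), dtype=bool)
        excluded[test_index] = True
    elif setting == "B":
        # The first member of the test pair must not appear in training.
        excluded = shares_first
    elif setting == "C":
        # The second member of the test pair must not appear in training.
        excluded = shares_second
    elif setting == "D":
        # Neither member of the test pair may appear in training.
        excluded = shares_first | shares_second
    else:
        raise ValueError("setting must be one of 'A', 'B', 'C', 'D'")
    return ~excluded

# Hypothetical toy pairs, ordered as (first member, second member).
toy_pairs = np.array([["D1", "T1"], ["D1", "T2"], ["D2", "T1"],
                      ["D2", "T2"], ["D3", "T3"]])
for setting in "ABCD":
    print(setting, training_mask(toy_pairs, 0, setting).astype(int))
```

In the toy data, the fold that tests the pair (D1, T1) keeps four training observations in setting A, three in settings B and C, and only two in setting D, which shows how the training set shrinks as more dependencies are removed.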

In this exercise, the letter states that the measurements "comprise of all the 77 target proteins of interest but only 59 different drug molecules". I conclude that there are drug molecules beyond these 59, and each of them appears in a pair-input observation as an out-of-sample object when its affinity is predicted. With the pairs ordered as (D, T), as in `pairs.data`, these are type B observations. In my opinion, types C and D cannot occur in this exercise, because the letter says that all 77 targets have already been used. Therefore, leave-one-out cross-validation should be implemented for type A (the regular one) and, most importantly, separately for type B. If no separate cross-validation is implemented for type B (and for C and D, if required), the dependencies caused by shared objects among observations make the C-index unreliable as a performance metric: its value looks very good, but it does not reflect the real performance. By performing a separate cross-validation for types B, C and D, we modify the training set in each fold by removing the observations affected by the dependencies due to shared objects in pair-input data.

#### Import libraries

In [83]:
# Import the libraries you need.
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

#### Write utility functions

In [84]:
# Write the utility functions you need in your analysis.

"""
C-index function: 
- INPUTS: 
'y' an array of the true output values
'yp' an array of predicted output values
- OUTPUT: 
The c-index value
"""
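def cindex(y, yp):
    # Flatten so that (n, 1) arrays and plain lists are handled alike.
    y = np.ravel(y)
    yp = np.ravel(yp)
    n = 0       # number of comparable pairs (pairs with different true values)
    h_num = 0   # concordant pairs, with tied predictions counted as 0.5
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            n_p = yp[j]  # not named `np`, which would shadow numpy
            if (t != nt): 
                n = n + 1
                if (p < n_p and t < nt) or (p > n_p and t > nt): 
                    h_num += 1
                elif (p == n_p):
                    h_num += 0.5
    return h_num/n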
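As a sanity check of the C-index definition, here is a small worked example. It is a vectorised sketch written for illustration, not the function used in the analysis; the name `cindex_vectorized` and the toy values are assumptions.

```python
import numpy as np

def cindex_vectorized(y_true, y_pred):
    """C-index over all pairs of observations with different true values."""
    y_true = np.ravel(y_true).astype(float)
    y_pred = np.ravel(y_pred).astype(float)
    i, j = np.triu_indices(len(y_true), k=1)   # every unordered pair once
    dt = y_true[i] - y_true[j]
    dp = y_pred[i] - y_pred[j]
    comparable = dt != 0                       # pairs with tied true values are skipped
    if not comparable.any():
        raise ValueError("C-index is undefined without comparable pairs")
    # Concordant: predictions ordered the same way as the true values.
    concordant = comparable & (np.sign(dt) == np.sign(dp))
    # Tied predictions on a comparable pair count as half.
    tied = comparable & (dp == 0)
    return (concordant.sum() + 0.5 * tied.sum()) / comparable.sum()

# Toy values: of the 6 comparable pairs, 4 are concordant, 1 is discordant
# and 1 has tied predictions, so the C-index is (4 + 0.5) / 6 = 0.75.
y_toy = [1, 2, 3, 4]
yp_toy = [0.1, 0.4, 0.3, 0.4]
print(cindex_vectorized(y_toy, yp_toy))
```

A value of 0.5 corresponds to a random ranking and 1.0 to a perfect ranking, which is the scale on which the cross-validation results below should be read.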

#### Load datasets

In [85]:
# Read the data files (input.data, output.data, pairs.data).

# Read input data
input_df = pd.read_csv('input.data', header=None, delimiter=' ')
output_df = pd.read_csv('output.data', header=None, delimiter=' ')
print("Here are the dimensions of input.data and output.data respectively")
print(input_df.shape)
print(output_df.shape)

df = pd.read_csv('pairs.data', header=None, delimiter=' ', quoting=3)
print("Here is the dimension of pairs.data")
print(df.shape)
print(df)
D_collection = set()
T_collection = set()

for index, row in df.iterrows():
    D_item = row[0].strip('"')
    T_item = row[1].strip('"')
    D_collection.add(D_item)
    T_collection.add(T_item)

print()
print("Collection of drug items:", D_collection)
print()
print("Collection of target items:", T_collection)

print()
print("number of drug items:", len(D_collection))
print("number of target items:", len(T_collection))

Here are the dimensions of input.data and output.data respectively
(400, 67)
(400, 1)
Here is the dimension of pairs.data
(400, 2)
         0      1
0    "D40"   "T2"
1    "D31"  "T64"
2     "D6"  "T58"
3    "D56"  "T49"
4    "D20"  "T28"
..     ...    ...
395  "D30"  "T27"
396  "D53"  "T11"
397  "D29"  "T27"
398  "D53"  "T50"
399   "D4"  "T15"

[400 rows x 2 columns]

Collection of drug items: {'D13', 'D31', 'D43', 'D29', 'D2', 'D37', 'D4', 'D25', 'D35', 'D11', 'D23', 'D21', 'D36', 'D40', 'D12', 'D47', 'D32', 'D46', 'D7', 'D48', 'D15', 'D50', 'D10', 'D14', 'D55', 'D54', 'D56', 'D42', 'D9', 'D44', 'D5', 'D24', 'D34', 'D53', 'D27', 'D57', 'D28', 'D20', 'D8', 'D26', 'D45', 'D38', 'D6', 'D17', 'D59', 'D1', 'D16', 'D18', 'D41', 'D3', 'D22', 'D30', 'D19', 'D51', 'D39', 'D58', 'D52', 'D49', 'D33'}

Collection of target items: {'T71', 'T67', 'T30', 'T59', 'T35', 'T76', 'T40', 'T45', 'T57', 'T27', 'T47', 'T13', 'T63', 'T22', 'T37', 'T52', 'T18', 'T2', 'T36', 'T28', 'T53', 'T64', 'T68', 'T1', '

#### Implement and run cross-validation

In [86]:
# implementing leave one out cross validation for type B and comparing the C_index with type A
X = input_df.values
y = output_df.values
df_array = df.values

# A function which gets a (drug, target) pair and returns the type of the given pair-input.
# The types follow the definitions above with the pair ordered as (D, T):
# B = out-of-sample drug, C = out-of-sample target.
# Such a function is not really required for this assignment.
def check_availability(drug_item, target_item):
    
    if drug_item in D_collection and target_item in T_collection:
        return "A"
    elif drug_item not in D_collection and target_item in T_collection:
        return "B"
    elif drug_item in D_collection and target_item not in T_collection:
        return "C"
    else:
        return "D"
        
# result = check_availability('D310', 'T150')


# This is a function which returns the indexes of the observations that should be removed from the training set,
# particularly for TYPE B pair-input leave-one-out cross-validation:
# all observations that share the drug (first pair member) with the test observation.
def find_matching_indexes(index, data):
    drug_id = data[index][0]  # Drug identifier (first pair member) of the test observation
    matching_indexes = [i for i, row in enumerate(data) if row[0] == drug_id]
    return matching_indexes
    

def pair_data_LOO_CV(X, y, n_neighbors):
    # Leave-one-out CV for type B: every observation that shares the drug of the
    # test pair (including the test pair itself) is removed from the training set.
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    y_preds = []
    for i in range(len(X)):
        indices_to_remove = find_matching_indexes(i, df_array)  # List of indexes to remove
        X_train = np.delete(X, indices_to_remove, axis=0)
        y_train = np.delete(y, indices_to_remove)
        X_test = X[i].reshape(1, -1)

        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        y_preds.append(y_pred)

    # The C-index is computed once over the pooled out-of-sample predictions.
    y_preds = np.array(y_preds)
    c_index = cindex(y, y_preds)
    
    return c_index
    
type_B_C_index = pair_data_LOO_CV(X, y, 10)
print("C_index for type B pair-input data with KNN, K=10 is equal to:", type_B_C_index)

def LOO_CV(X, y, n_neighbors):
    # Regular leave-one-out CV (type A): only the test pair itself is removed.
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    y_preds = []
    for i in range(len(X)):
        X_train = np.delete(X, i, axis=0)
        y_train = np.delete(y, i)
        X_test = X[i].reshape(1, -1)

        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        y_preds.append(y_pred)

    # The C-index is computed once over the pooled out-of-sample predictions.
    y_preds = np.array(y_preds)
    c_index = cindex(y, y_preds)
    
    return c_index
    
type_A_C_index = LOO_CV(X, y, 10)
print("C_index for type A pair-input data with KNN, K=10 is equal to:", type_A_C_index)


C_index for type B pair-input data with KNN, K=10 is equal to: 0.51968671679198
C_index for type A pair-input data with KNN, K=10 is equal to: 0.8300062656641604
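
The type B procedure above refits the model once per observation, although all pairs of the same drug use the same training set. As a hedged alternative, the same predictions can presumably be obtained with one fit per drug by treating the drug identifier as a group in scikit-learn's `LeaveOneGroupOut`. The sketch below runs on synthetic stand-in data; the function name `leave_drug_out_predictions` and the `*_demo` variables are assumptions for illustration, not part of the analysis above.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

def leave_drug_out_predictions(X, y, drug_ids, n_neighbors=10):
    """Predict every pair with a model that was trained without the pair's drug."""
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    # Each group (drug) is held out once; all of its pairs are predicted together.
    return cross_val_predict(knn, X, np.ravel(y),
                             groups=drug_ids, cv=LeaveOneGroupOut())

# Synthetic stand-in data: 6 hypothetical drugs with 5 pairs each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = rng.normal(size=30)
drugs_demo = np.repeat(np.arange(6), 5)

preds_demo = leave_drug_out_predictions(X_demo, y_demo, drugs_demo, n_neighbors=3)
print(preds_demo.shape)
```

With the notebook's variables, the call would presumably be `leave_drug_out_predictions(X, y, df_array[:, 0])`, and the pooled predictions could then be passed to `cindex` together with `y`.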
