# TKO_7092 Evaluation of Machine Learning Methods 2024

---

Student name: Lauri Maila

Student number: 2209361

Student email: lkmail@utu.fi

---

## Exercise 4

Complete the tasks given to you in the letter below. In your submission, explain clearly, precisely, and comprehensively why the cross-validation described in the letter failed, how cross-validation should be performed in the given scenario, and why your cross-validation will give a reliable estimate of the generalisation performance. Then implement the correct cross-validation for the scenario and report its results.

Remember to follow all the general exercise guidelines that are stated in Moodle. Full points (2p) will be given for a submission that demonstrates a deep understanding of cross-validation on pair-input data and implements the requested cross-validation correctly (incl. reporting the results). Partial points (1p) will be given if there are small error(s) but the overall approach is correct. No points will be given if there are significant error(s).

The deadline of this exercise is **Wednesday 21 February 2024 at 11:59 PM**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.

---


Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. Currently I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have a list of over 100,000 potential drug molecules, but their affinities still need to be verified in the lab. Obviously I do not have the resources to measure all the possible drug-target pairs, so I need to prioritise. I have decided to do this with the use of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had made in the lab, which comprise all the 77 target proteins of interest but only 59 different drug molecules. Then I performed a leave-one-out cross-validation with this same data to estimate the generalisation performance of the model. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of the remaining drug molecules. The problem is: when I selected the highest predicted affinities and tried to verify them in the lab, I found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get a reliable estimate. Also, implement the correct leave-one-out cross-validation and report its results. I need to know whether I am wasting my lab resources by using my model.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist

---

#### Answer the questions about cross-validation on pair-input data

In [1]:
# Why did the estimation described in the letter fail?
# How should leave-one-out cross-validation be performed in the given scenario and why?
# Remember to provide comprehensive and precise arguments.
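
The letter states that the 400 measured pairs are built from only 59 drugs and 77 targets, so almost every pair shares its drug with several other pairs. Plain leave-one-out removes only the test pair and keeps those same-drug pairs in the training set, whereas the model is later applied to drugs it has never seen. The sketch below illustrates that gap on hypothetical toy data, using a brute-force kNN and a compact C-index written only for this example. In the toy set the affinity depends on the drug alone and is unrelated to the features, so nothing can be learned about an unseen drug, yet plain leave-one-out still reports a perfect score.

```python
import numpy as np

def cindex(y, yp):
    """Share of pairs with different true values that are ranked correctly; tied predictions count as half."""
    hits, total = 0.0, 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue
            total += 1
            if yp[i] == yp[j]:
                hits += 0.5
            elif (yp[i] < yp[j]) == (y[i] < y[j]):
                hits += 1
    return hits / total

def loo_knn(X, y, k, exclude):
    """Leave-one-out predictions from a brute-force kNN mean.
    exclude(i) returns a boolean mask of rows removed from training for test row i."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        train = np.flatnonzero(~exclude(i))
        dist = np.linalg.norm(X[train] - X[i], axis=1)
        nearest = train[np.argsort(dist, kind="stable")[:k]]
        preds[i] = y[nearest].mean()
    return preds

# Hypothetical toy data: 4 drugs x 3 targets, affinity set by the drug alone
drug_pos = np.array([0.0, 10.0, 25.0, 45.0])   # drug feature
drug_effect = np.array([1.0, 4.0, 2.0, 3.0])   # affinity, unrelated to the feature
drugs = np.repeat(np.arange(4), 3)
targets = np.tile(np.arange(3), 4)
X = np.column_stack([drug_pos[drugs], targets.astype(float)])
y = drug_effect[drugs]

idx = np.arange(len(y))
naive = loo_knn(X, y, 1, lambda i: idx == i)             # only the test pair removed
by_drug = loo_knn(X, y, 1, lambda i: drugs == drugs[i])  # every pair of the test drug removed

print("plain LOO C-index:     ", cindex(y, naive))
print("leave-drug-out C-index:", cindex(y, by_drug))
```

In the plain version each prediction is copied from another pair of the same drug, which is exactly the information that will be missing for the unmeasured drug molecules.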

#### Import libraries

In [2]:
# Import the libraries you need.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

#### Write utility functions

In [3]:
# I'm using the same C-index function that was provided in the first exercise
### Function for calculating C-index ###
# y: array containing true label values.
# yp: array containing the predicted label values.
# Pairs with equal true values are skipped; tied predictions count as half.
def cindex(y, yp):
    n = 0
    h_num = 0
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            # Prediction for the j-th item (the name np is reserved for NumPy)
            pn = yp[j]
            if (t != nt):
                n = n + 1
                if (p < pn and t < nt) or (p > pn and t > nt):
                    h_num += 1
                elif (p == pn):
                    h_num += 0.5
    return h_num/n
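
A small worked example may make the C-index easier to read. The sketch below reproduces the same pairwise logic in a standalone function (so that the example runs on its own) and scores three hypothetical predictions: two of the three pairs are ordered correctly, and constant predictions are counted as ties.

```python
def cindex(y, yp):
    """C-index: share of pairs with different true values that are ranked correctly."""
    n = 0
    h_num = 0.0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] != y[j]:
                n += 1
                if (yp[i] < yp[j] and y[i] < y[j]) or (yp[i] > yp[j] and y[i] > y[j]):
                    h_num += 1
                elif yp[i] == yp[j]:
                    h_num += 0.5
    return h_num / n

# Hypothetical values: the pair (2.0, 3.0) is predicted in the wrong order
y_true = [1.0, 2.0, 3.0]
y_pred = [0.1, 0.9, 0.5]
score = cindex(y_true, y_pred)           # 2 of 3 pairs concordant
constant = cindex(y_true, [0.0, 0.0, 0.0])  # every pair is a tie

print(score, constant)
```

A constant predictor lands at 0.5, which is the reference level for a model that carries no ranking information.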


### Function for leave-one-out cross-validation (LOOCV) on pair-input data ###
# The model is applied to drugs it has never seen, paired with known targets.
# Each test pair is therefore predicted without any training pair that shares
# its drug; pairs that share only its target stay in the training set.
def pairwise_cv(k, df_pairs, df_input, df_output):

    # Number of input rows
    n = df_input.shape[0]

    # Plain arrays avoid index-label issues when rows are selected by mask
    X = df_input.to_numpy()
    y = df_output.to_numpy().ravel()
    drugs = df_pairs["D"].to_numpy()

    # Store actual values and predictions for C-index calculation
    true_values = []
    predictions = []

    # Loop over all data points (leave-one-out cross-validation)
    for i in range(n):

        # Use the i-th pair as test set
        X_test = X[[i]]

        # Remove every pair that shares the test pair's drug
        # (this also removes the test pair itself)
        train_mask = drugs != drugs[i]

        # Fit the kNN model
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(X[train_mask], y[train_mask])

        # Predict the left-out pair
        y_pred = knn.predict(X_test)[0]

        # Store the prediction and true value as scalars
        predictions.append(y_pred)
        true_values.append(y[i])

    # At the end, calculate the C-index using provided function
    c_index = cindex(true_values, predictions)
    return c_index
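
As a cross-check on a hand-written loop, scikit-learn's `LeaveOneGroupOut` can produce drug-wise held-out predictions when the drug identifier is used as the group label. For a given test pair, the training set is then all pairs of other drugs, which matches leaving that pair out together with every pair that shares its drug, while needing only one model fit per drug. A minimal sketch, assuming random toy data and a hypothetical helper name `leave_drug_out_predict`:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

def leave_drug_out_predict(X, y, drug_ids, k):
    """Predict each pair from a kNN model fitted without any pair of the same drug."""
    model = KNeighborsRegressor(n_neighbors=k)
    return cross_val_predict(model, X, y, groups=drug_ids, cv=LeaveOneGroupOut())

# Hypothetical toy data: 6 drugs x 5 targets with random features and affinities
rng = np.random.default_rng(0)
drug_ids = np.repeat(np.arange(6), 5)
X = rng.random((30, 4))
y = rng.random(30)

preds = leave_drug_out_predict(X, y, drug_ids, k=3)
print(preds.shape)
```

The returned array is aligned with the rows of `X`, so it can be passed to a C-index function together with the true affinities.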

#### Load datasets

In [14]:
# Read the data files (input.data, output.data, pairs.data).
df_input = pd.read_csv("input.data", delimiter=" ", header=None)
df_output = pd.read_csv("output.data", delimiter=" ", header=None)
df_pairs = pd.read_csv("pairs.data", delimiter=" ", header=None, names=["D", "T"])
print(df_input.head())
print(df_input.shape)
print(df_output.head())
print(df_output.shape)
print(df_pairs.head())
print(df_pairs.shape)

         0         1         2         3         4         5         6   \
0  0.759222  0.709585  0.253151  0.421082  0.727780  0.404487  0.709027   
1  0.034584  0.304720  0.688257  0.296396  0.151878  0.830755  0.270656   
2  0.737867  0.236079  0.905987  0.163612  0.801455  0.789823  0.393999   
3  0.406913  0.607740  0.235365  0.888679  0.150347  0.598991  0.130108   
4  0.697707  0.432565  0.650329  0.886065  0.328660  0.576926  0.523100   

         7         8         9   ...        57        58        59        60  \
0  0.242963  0.407292  0.379971  ...  0.838616  0.165050  0.515334  0.332678   
1  0.705392  0.186120  0.085594  ...  0.472762  0.730013  0.639373  0.445218   
2  0.522067  0.411352  0.781861  ...  0.595468  0.582292  0.836193  0.281514   
3  0.465818  0.799953  0.906878  ...  0.453880  0.311799  0.534668  0.563793   
4  0.080463  0.131349  0.913496  ...  0.583892  0.444141  0.249423  0.110690   

         61        62        63        64        65        66  
0  0
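
Before running the cross-validation, it can help to confirm that the loaded pairs match the letter (400 pairs, 59 drugs, 77 targets) and to see how many pairs share a drug or a target with other pairs. The sketch below shows such a check on a small hypothetical stand-in for `pairs.data`; the column names `D` and `T` follow the ones used when loading the file.

```python
import pandas as pd

# Hypothetical stand-in for pairs.data: one row per measured pair
pairs = pd.DataFrame({"D": ["D1", "D1", "D2", "D2", "D3"],
                      "T": ["T1", "T2", "T1", "T3", "T2"]})

n_pairs = len(pairs)
n_drugs = pairs["D"].nunique()
n_targets = pairs["T"].nunique()

# For every pair, count the other pairs that share its drug or its target
same_drug = pairs["D"].map(pairs["D"].value_counts()) - 1
same_target = pairs["T"].map(pairs["T"].value_counts()) - 1

print(n_pairs, n_drugs, n_targets)
print("share of pairs with a same-drug partner:  ", (same_drug > 0).mean())
print("share of pairs with a same-target partner:", (same_target > 0).mean())
```

On the real data, a high share of pairs with a same-drug partner indicates how much a plain leave-one-out split would rely on information that is unavailable for new drugs.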

#### Implement and run cross-validation

In [None]:
# Implement and run the requested cross-validation. Report and interpret its results.
# Using k=10 for kNN as stated in the letter
k = 10

# Leave-one-out CV in which every pair that shares the test pair's drug
# is excluded from the training set
c_index = pairwise_cv(k, df_pairs, df_input, df_output)

# A C-index near 0.5 means the ranking is no better than random
print(f"C-index (leave-one-out, test drug excluded from training): {c_index:.3f}")