# TKO_7092 Evaluation of Machine Learning Methods 2024

---

Student name: Konsta Nyman

Student number: 523834

Student email: kokany@utu.fi

---

## Exercise 4

Complete the tasks given to you in the letter below. In your submission, explain clearly, precisely, and comprehensively why the cross-validation described in the letter failed, how cross-validation should be performed in the given scenario and why  your cross-validation will give a reliable estimate of the generalisation performance. Then implement the correct cross-validation for the scenario and report its results.

Remember to follow all the general exercise guidelines that are stated in Moodle. Full points (2p) will be given for a submission that demonstrates a deep understanding of cross-validation on pair-input data and implements the requested cross-validation correctly (incl. reporting the results). Partial points (1p) will be given if there are small error(s) but the overall approach is correct. No points will be given if there are significant error(s).

The deadline of this exercise is **Wednesday 21 February 2024 at 11:59 PM**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.

---


Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. Currently I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have a list of over 100,000 potential drug molecules, but their affinities still need to be verified in the lab. Obviously I do not have the resources to measure all the possible drug-target pairs, so I need to prioritise. I have decided to do this with the use of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had made in the lab, which comprise all the 77 target proteins of interest but only 59 different drug molecules. Then I performed a leave-one-out cross-validation with this same data to estimate the generalisation performance of the model. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of the remaining drug molecules. The problem is: when I selected the highest predicted affinities and tried to verify them in the lab, I found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get a reliable estimate. Also, implement the correct leave-one-out cross-validation and report its results. I need to know whether I am wasting my lab resources by using my model.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist

---

#### Answer the questions about cross-validation on pair-input data

In [1]:
# Why did the estimation described in the letter fail?

# The estimation described in the letter failed because the observations in pair-input data are not independent: pairs that share a drug or a target are dependent on each other.
# For example, observation 3 ("D6" "T58") shares its drug with observation 38 ("D6" "T41"). In ordinary leave-one-out cross-validation, when one of these pairs is the test set, the
# other one stays in the training set, so the model is tested on a drug it has already seen. The independence assumption between the training and test sets does not hold, and the
# estimate is optimistically biased for the actual use case, where predictions are made for drugs that have no measurements at all.


# How should leave-one-out cross-validation be performed in the given scenario and why?

# There are usually four types of observations in pair-input data. This problem only contains observations of Type A and Type B, since all target proteins are included in the sample. We can call the
# drugs the 1st member of the pair and the target proteins the 2nd member. Type A observations are pairs where both members are included in the sample, even if the exact pair
# isn't in the sample. Performance is expected to be better on Type A observations. Performance is expected to be worse on Type B observations, which are the pairs where only the 2nd
# member is part of the sample. In this case they are all pairs that contain any drug other than the 59 included in the sample, since all the proteins are in-sample data. With
# over 100 000 potential drugs, Type B pairs form the vast majority of all pairs. The original estimate is reliable for Type A pairs, since ordinary leave-one-out cross-validation on this data
# tests only Type A pairs, but we need to change the evaluation method to get a reliable estimate of the generalisation performance on Type B pairs.

# When performing leave-one-out cross-validation to estimate the performance on Type B observations, we must remove from the training set every pair that shares the 1st member (the drug)
# with the test pair. The drug of the test pair is then unseen during training, just like the unmeasured drugs the model will be used on, so the estimate accounts for the dependencies in
# the data and is reliable for Type B pairs.
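
The training-set rule described above can be illustrated with a small sketch. The toy `pairs` table and the helper `training_rows_for` are hypothetical and only stand in for `pairs.data`; the column names `drug` and `target` are assumptions made for readability.

```python
import pandas as pd

# Toy stand-in for pairs.data (hypothetical values): one row per measured pair.
pairs = pd.DataFrame(
    [["D1", "T1"], ["D1", "T2"], ["D2", "T1"], ["D2", "T3"], ["D3", "T2"]],
    columns=["drug", "target"],
)

def training_rows_for(pairs, test_row):
    """Return the row labels that may be used for training when `test_row`
    is held out and its drug has to be unseen (Type B setting)."""
    test_drug = pairs.loc[test_row, "drug"]
    mask = pairs["drug"] != test_drug  # drop every pair sharing the test drug
    return pairs.index[mask].tolist()

# Holding out row 0 ("D1", "T1") also removes row 1, which shares drug D1.
print(training_rows_for(pairs, 0))  # [2, 3, 4]
```

Pairs that share only the target with the test pair (row 2 in the toy table) stay in the training set, because all targets are in-sample in this scenario.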

#### Import libraries

In [2]:
# Import the libraries you need.
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsRegressor

#### Write utility functions

In [3]:
# Write the utility functions you need in your analysis.

### Append unique values to the list ###
# items: list to be appended to
# value: value to be appended to the list
def append_unique(items, value):
    if value not in items:
        items.append(value)

### Calculating C-index ###
# y: array containing true values.
# yp: array containing the predicted values.
def cindex(y, yp):
    n = 0      # number of comparable pairs (pairs with different true values)
    h_num = 0  # concordance score summed over the comparable pairs
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            t_j = y[j]
            p_j = yp[j]
            if (t != t_j):
                n = n + 1
                if (p < p_j and t < t_j) or (p > p_j and t > t_j):
                    h_num += 1
                elif (p == p_j):
                    h_num += 0.5
    return h_num/n

# form training set for test object
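
The same C-index definition can also be sketched in vectorised form, which avoids the double Python loop. The function name `cindex_vectorized` is hypothetical and not part of the original analysis; it assumes numeric inputs and raises an error when no pair of true values differs.

```python
import numpy as np

def cindex_vectorized(y_true, y_pred):
    """C-index over all pairs (i, j), i < j, whose true values differ.
    Concordant pairs count 1, tied predictions count 0.5."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    i, j = np.triu_indices(len(y_true), k=1)  # all index pairs with i < j
    dt = y_true[i] - y_true[j]
    dp = y_pred[i] - y_pred[j]
    comparable = dt != 0
    if not comparable.any():
        raise ValueError("C-index is undefined when all true values are equal")
    concordant = (dt * dp > 0)[comparable].sum()
    tied = (dp == 0)[comparable].sum()
    return (concordant + 0.5 * tied) / comparable.sum()

print(cindex_vectorized([1, 2, 3], [0.1, 0.2, 0.3]))  # perfect ranking: 1.0
print(cindex_vectorized([1, 2, 3], [5, 5, 5]))        # constant predictions: 0.5
```

A constant predictor scores 0.5, which is the reference point for judging the cross-validation results.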

#### Load datasets

In [4]:
# Read input.data

input_df = pd.read_csv('input.data',
                       sep=' ',
                       header=None)
print(f"Shape of input_df: {input_df.shape}")


Shape of input_df: (400, 67)


In [5]:
# Read output.data

output_df = pd.read_csv('output.data',
                       header=None)
print(f"Shape of output_df: {output_df.shape}")

Shape of output_df: (400, 1)


In [6]:
# Read pairs.data

pairs_df = pd.read_csv('pairs.data',
                       sep=' ',
                       header=None)
print(f"Shape of pairs_df: {pairs_df.shape}")


Shape of pairs_df: (400, 2)


#### Implement and run cross-validation

In [7]:
# Implement and run the requested cross-validation. Report and interpret its results.

# create leave-one-out cross-validator
loo = LeaveOneOut()

# create knn model with 10 neighbors
knn_10 = KNeighborsRegressor(n_neighbors=10)

# create list to store predictions
y_predictions = []

# loo split
for train_index, test_index in loo.split(input_df):
    X_train, X_test = input_df.iloc[train_index], input_df.iloc[test_index]
    y_train, y_test = output_df.iloc[train_index], output_df.iloc[test_index]

    # exclude all training pairs that share the 1st member (the drug) with the test pair
    test_drug = pairs_df.iloc[test_index[0], 0]
    keep = (pairs_df.iloc[train_index, 0] != test_drug).to_numpy()
    X_train = X_train[keep]
    y_train = y_train[keep]

    # fit k-NN model
    knn_10.fit(X_train, y_train)

    # make predictions
    preds = knn_10.predict(X_test)
    y_predictions.append(preds) #append predicted value to list

# calculate c-index
c_index_type_b = cindex(output_df.to_numpy(), y_predictions)

# print result
print(f"C-index for Type B observations: {c_index_type_b}")

C-index for Type B observations: 0.51968671679198
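
For a model with fixed hyperparameters such as this k-NN regressor, the modified leave-one-out scheme should give the same predictions as leaving out one whole drug at a time, because in both cases each test pair is predicted from all pairs of the other drugs. The sketch below checks this on synthetic data with scikit-learn's `LeaveOneGroupOut`; the arrays `X`, `y` and `drugs` are made up for illustration and are not the exercise data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): 60 pairs, 5 features, 6 drugs.
rng = np.random.default_rng(0)
n_pairs = 60
X = rng.normal(size=(n_pairs, 5))
y = rng.normal(size=n_pairs)
drugs = rng.integers(0, 6, size=n_pairs)

# (a) Modified leave-one-out: test one pair, train on pairs of other drugs.
preds_loo = np.empty(n_pairs)
for i in range(n_pairs):
    train = drugs != drugs[i]
    model = KNeighborsRegressor(n_neighbors=10).fit(X[train], y[train])
    preds_loo[i] = model.predict(X[i:i + 1])[0]

# (b) Leave-one-drug-out: test all pairs of one drug at once.
preds_logo = np.empty(n_pairs)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=drugs):
    model = KNeighborsRegressor(n_neighbors=10).fit(X[train_idx], y[train_idx])
    preds_logo[test_idx] = model.predict(X[test_idx])

# True when both schemes produce the same predictions.
print(np.allclose(preds_loo, preds_logo))
```

The grouped variant fits one model per drug instead of one per pair, which would mean 59 fits instead of 400 on the exercise data.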


In [8]:
# Original leave-one-out cross-validation for Type A 

# create list to store predictions
y_predictions = []

# loo split
for train_index, test_index in loo.split(input_df):
    X_train, X_test = input_df.iloc[train_index], input_df.iloc[test_index]
    y_train, y_test = output_df.iloc[train_index], output_df.iloc[test_index]

    # fit k-NN model
    knn_10.fit(X_train, y_train)

    # make predictions
    preds = knn_10.predict(X_test)
    y_predictions.append(preds) #append predicted value to list

# calculate c-index
c_index_type_a = cindex(output_df.to_numpy(), y_predictions)

# print result
print(f"C-index for Type A observations: {c_index_type_a}")

C-index for Type A observations: 0.8300062656641604


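The estimated performance on new drugs (Type B) is very poor: the C-index is 0.5197, barely better than the 0.5 that random predictions would give. This is the reliable estimate for any drug other than the 59 measured in the sample, so the model in its current form is not useful for prioritising the unmeasured drugs, and lab resources spent on verifying its top predictions would likely be wasted. With the original leave-one-out cross-validation (Type A) I got 0.8300, which is lower than the score above 90% that you reported.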