# TKO_7092 Evaluation of Machine Learning Methods 2023

## Exercise 4

Complete the tasks given to you in the letter below. There are cells at the end of this notebook to which you are expected to write your code. Insert markdown cells as needed to describe your solution. Remember to follow all the general exercise guidelines stated in Moodle.

The deadline of this exercise is **Wednesday 22 February 2023 at 23:59**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.


---

Student name: Juuso Pyykkönen

Student number: 522060

Student email: jhpyyk@utu.fi

---


Dear Data Scientist,

I have a task for you that concerns drug molecules and their targets. I have spent a lot of time in a laboratory to measure how strongly potential drug molecules bind to putative target molecules. I do not have enough resources to measure all possible drug-target pairs, so I would like to first predict their affinities and then measure only the most promising ones. I have already managed to create a model which I believe is good for this purpose. Its details are below.

- algorithm: K-nearest neighbours regressor
- parameters: K=20
- training data: all the pairs for which I have measured the affinity

The data I used to create the model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

I am not able to evaluate how well the model will perform if I use it to predict the affinities of new drug-target pairs. I need you to evaluate the model for me. There are three distinct situations in which I want to use this model in the future.

1. Because I have only measured the affinities for some of the possible pairs of the currently known drugs and targets, I want to use the model to predict the affinities for the remaining pairs.
2. I am confident that I will discover new potential drug molecules in the future, so I will want to use the model to predict their affinities to the currently known targets.
3. Because new putative target molecules, too, will likely be identified in the future, I will also want to use the model to predict the affinities between the drug molecules I will discover and the target molecules somebody else will discover in the future.

Please evaluate the generalisation performance of the model in these three situations. I need to get evaluation results from leave-one-out cross-validation with C-index. Also, because I'm worried that unreliable evaluation results will mislead me to waste precious resources, please explain why I can trust your results.


Yours sincerely, \
Bio Scientist

---

#### Import libraries

In [76]:
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsRegressor

#### Load datasets

In [77]:
data = pd.read_csv('input.data', sep=' ', header=None)
data['output'] = pd.read_csv('output.data', header=None)
data[['drug', 'target']] = pd.read_csv('pairs.data', sep=' ', header=None)

# To reference feature columns
feature_cols = list(range(0, 2500))
# To reference pair columns
pair_cols = ['drug', 'target']

print('NaN value count: ', data.isna().sum().sum())

data

NaN value count:  0


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2493,2494,2495,2496,2497,2498,2499,output,drug,target
0,6.53771,7.04273,7.30593,7.04480,7.32600,7.15379,6.46464,7.33308,6.25152,7.29930,...,8.28991,8.27096,7.65185,8.13150,8.13992,7.36155,7.98930,10000.0,D23,T194
1,4.26878,4.05945,4.40541,4.73575,4.25489,4.61444,4.72028,4.71408,5.43478,4.75449,...,6.61113,6.97087,7.23425,6.57285,8.38097,6.80756,7.12181,10000.0,D9,T270
2,7.24802,5.96468,7.02855,6.52784,7.38776,7.43236,6.06098,7.68345,6.91821,8.41192,...,7.20765,7.60826,6.05150,7.23766,6.75104,5.72958,6.73456,10000.0,D3,T47
3,3.00092,3.33087,3.57794,3.31246,3.43355,3.35872,3.32773,3.29331,5.89109,3.39740,...,3.01052,2.79974,2.93089,2.81599,2.74684,2.93389,2.76753,10000.0,D49,T222
4,4.34096,3.79832,5.67286,4.20168,4.74336,4.97859,3.56746,4.55088,4.30942,3.99160,...,2.87061,3.12170,2.92398,3.26003,2.70133,2.87879,2.64117,270.0,D37,T28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,8.57018,9.04255,9.63228,8.77320,9.78987,9.97577,9.68539,9.17266,8.66047,9.56224,...,5.55991,5.27942,5.06684,5.48311,5.35253,5.07481,5.57504,10000.0,D33,T426
1496,7.36589,8.12633,8.04439,8.31312,8.11686,7.33734,8.11067,8.19454,12.42720,8.77493,...,4.69769,4.69417,5.12240,4.82360,5.03664,4.87108,4.61108,10000.0,D55,T267
1497,6.11318,9.20208,6.96815,8.02065,7.02480,6.78667,7.61725,6.58085,6.47229,6.60451,...,7.02489,7.53969,8.30141,6.94445,6.71488,8.12290,7.16530,10000.0,D39,T372
1498,6.45690,7.89646,6.76667,7.62162,6.61818,6.30901,7.58741,6.70339,9.15385,8.68000,...,5.83832,6.15672,6.44876,5.87900,6.79348,12.50000,6.73077,10000.0,D34,T300


#### Write functions

In [78]:
def cindex(y, yp):
    n = 0
    h_num = 0 
    for i in range(0, len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            np = yp[j]
            if (t != nt): 
                n = n + 1
                if (p < np and t < nt) or (p > np and t > nt): 
                    h_num += 1
                elif (p == np):
                    h_num += 0.5
    return h_num/n

def type_b_loo(pairs):
    train_index = []
    test_index = []

    for test_i in range(len(pairs)):
        test_index.append([test_i])
        # Do not add indices with the same drug name
        train_index.append(np.argwhere(pairs[:, 0] != pairs[test_i, 0]).flatten())

    return zip(train_index, test_index)


def type_d_loo(pairs):
    train_index = []
    test_index = []

    for test_i in range(len(pairs)):
        test_index.append([test_i])
        train = []

        for i, pair in enumerate(pairs):
            # Add index only when the drug and target names are different from
            # test index drug and target name
            if (pair[0] != pairs[test_i, 0] and pair[1] != pairs[test_i, 1]):
                train.append(i)
        
        train_index.append(train)

    return zip(train_index, test_index)

    
def loocv(X, y, loo, kneigh):
    predictions = []
    true_y = []

    for (train_index, test_index) in loo:
        kneigh.fit(X[train_index], y[train_index])
        predictions.append(kneigh.predict(X[test_index]))
        true_y.append(y[test_index])

    c_ind = cindex(predictions, true_y)

    return c_ind

#### Run cross-validations

##### Case 1

Type A pair-input observations: use standard leave-one-out cross validation.

In [79]:
kneigh = KNeighborsRegressor(n_neighbors=20)

loo = LeaveOneOut()
split = loo.split(data[feature_cols])
c_ind_1 = loocv(data[feature_cols].to_numpy(), data['output'].to_numpy(), split, kneigh)
print(f'C-index for predicting the affinities for the remaining pairs is {c_ind_1}')

C-index for predicting the affinities for the remaining pairs is 0.6350869098722096


#### Case 2

Type B pair-input observations: do not use pairs with the same drug name in training.

In [80]:
loo_b = type_b_loo(data[pair_cols].to_numpy())
c_ind_2 = loocv(data[feature_cols].to_numpy(), data['output'].to_numpy(), loo_b, kneigh)
print(f'C-index for predicting affinities for new drug molecules {c_ind_2}')

C-index for predicting affinities for new drug molecules 0.597533635287411


#### Case 3

Type D pair-input observations: do not use pairs with the same drug or target name in training.

In [81]:
loo_d = type_d_loo(data[pair_cols].to_numpy())
c_ind_3 = loocv(data[feature_cols].to_numpy(), data['output'].to_numpy(), loo_d, kneigh)
print(f'C-index for predicting affinities for new drug and target molecules {c_ind_3}')

C-index for predicting affinities for new drug and target molecules 0.5858317544828793


#### Interpret results

For case 1 the generalisation performance is the best of the three, because it has been trained with the most accurate data. The performance decreases when generalisation increases. The resulting C-indices are in expected range and are decreasing a little with more generalisation and therefore can be trusted.

The C-indices are not very good though, and a different model should be considered.