# TKO_7092 Evaluation of Machine Learning Methods 2023

## Exercise 4

Complete the tasks given to you in the letter below. There are cells at the end of this notebook to which you are expected to write your code. Insert markdown cells as needed to describe your solution. Remember to follow all the general exercise guidelines stated in Moodle.

The deadline of this exercise is **Wednesday 22 February 2023 at 23:59**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.


---

Student name: Peppi-Lotta Saari

Student number: 517334

Student email: plsaar@utu.fi

---


Dear Data Scientist,

I have a task for you that concerns drug molecules and their targets. I have spent a lot of time in a laboratory to measure how strongly potential drug molecules bind to putative target molecules. I do not have enough resources to measure all possible drug-target pairs, so I would like to first predict their affinities and then measure only the most promising ones. I have already managed to create a model which I believe is good for this purpose. Its details are below.

- algorithm: K-nearest neighbours regressor
- parameters: K=20
- training data: all the pairs for which I have measured the affinity

The data I used to create the model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

I am not able to evaluate how well the model will perform if I use it to predict the affinities of new drug-target pairs. I need you to evaluate the model for me. There are three distinct situations in which I want to use this model in the future.

1. Because I have only measured the affinities for some of the possible pairs of the currently known drugs and targets, I want to use the model to predict the affinities for the remaining pairs.
2. I am confident that I will discover new potential drug molecules in the future, so I will want to use the model to predict their affinities to the currently known targets.
3. Because new putative target molecules, too, will likely be identified in the future, I will also want to use the model to predict the affinities between the drug molecules I will discover and the target molecules somebody else will discover in the future.

Please evaluate the generalisation performance of the model in these three situations. I need to get evaluation results from leave-one-out cross-validation with C-index. Also, because I'm worried that unreliable evaluation results will mislead me to waste precious resources, please explain why I can trust your results.


Yours sincerely, \
Bio Scientist

---

#### Import libraries

In [71]:
# Import the libraries you need.
import numpy as np 
import pandas as pd 
from tqdm import tqdm
from sklearn.neighbors import KNeighborsRegressor

#### Load datasets

In [72]:
# Read the data files (input.data, output.data, pairs.data).
# pd.read_table already returns a DataFrame, so no extra wrapping is needed.
features = pd.read_table('input.data', header=None, sep=r"\s+")
labels = pd.read_table('output.data', header=None)
pairs = pd.read_table('pairs.data', header=None, sep=r"\s+")

display(features.head())
display(labels.head())
display(pairs.head())

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 2490 | 2491 | 2492 | 2493 | 2494 | 2495 | 2496 | 2497 | 2498 | 2499 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.53771 | 7.04273 | 7.30593 | 7.0448 | 7.326 | 7.15379 | 6.46464 | 7.33308 | 6.25152 | 7.2993 | ... | 8.56873 | 7.90797 | 8.70878 | 8.28991 | 8.27096 | 7.65185 | 8.1315 | 8.13992 | 7.36155 | 7.9893 |
| 1 | 4.26878 | 4.05945 | 4.40541 | 4.73575 | 4.25489 | 4.61444 | 4.72028 | 4.71408 | 5.43478 | 4.75449 | ... | 7.55949 | 7.61247 | 6.60946 | 6.61113 | 6.97087 | 7.23425 | 6.57285 | 8.38097 | 6.80756 | 7.12181 |
| 2 | 7.24802 | 5.96468 | 7.02855 | 6.52784 | 7.38776 | 7.43236 | 6.06098 | 7.68345 | 6.91821 | 8.41192 | ... | 6.68409 | 6.10721 | 7.84371 | 7.20765 | 7.60826 | 6.0515 | 7.23766 | 6.75104 | 5.72958 | 6.73456 |
| 3 | 3.00092 | 3.33087 | 3.57794 | 3.31246 | 3.43355 | 3.35872 | 3.32773 | 3.29331 | 5.89109 | 3.3974 | ... | 2.8702 | 5.68182 | 2.57248 | 3.01052 | 2.79974 | 2.93089 | 2.81599 | 2.74684 | 2.93389 | 2.76753 |
| 4 | 4.34096 | 3.79832 | 5.67286 | 4.20168 | 4.74336 | 4.97859 | 3.56746 | 4.55088 | 4.30942 | 3.9916 | ... | 2.72576 | 2.80786 | 3.01114 | 2.87061 | 3.1217 | 2.92398 | 3.26003 | 2.70133 | 2.87879 | 2.64117 |


| | 0 |
|---|---|
| 0 | 10000.0 |
| 1 | 10000.0 |
| 2 | 10000.0 |
| 3 | 10000.0 |
| 4 | 270.0 |


| | 0 | 1 |
|---|---|---|
| 0 | D23 | T194 |
| 1 | D9 | T270 |
| 2 | D3 | T47 |
| 3 | D49 | T222 |
| 4 | D37 | T28 |


#### Write functions

In [73]:
# Write the functions you need to perform the requested cross-validations.

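# Taken from the first week's assignment.
def cindex(true_labels, pred_labels):
    n = 0      # number of comparable pairs (pairs whose true labels differ)
    h_num = 0  # concordant pairs count as 1, tied predictions as 0.5
    for i in range(0, len(true_labels)):
        t = true_labels[i]
        p = pred_labels[i]
        for j in range(i+1, len(true_labels)):
            nt = true_labels[j]
            n_p = pred_labels[j]  # not named `np`, so the numpy alias is not shadowed
            if (t != nt):
                n = n + 1
                # the pair is concordant if predictions are ordered like the true labels
                if (p < n_p and t < nt) or (p > n_p and t > nt):
                    h_num += 1
                elif (p == n_p):
                    h_num += 0.5
    cindx = h_num/n
    return cindx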
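
As a sanity check for the C-index above, the same quantity can be computed in a vectorised way and verified by hand on a tiny example. This is only a sketch: the name `cindex_vectorized` and the toy values are made up for illustration and are not part of the exercise data.

```python
import numpy as np

def cindex_vectorized(true_labels, pred_labels):
    """C-index over all pairs i < j whose true labels differ (illustrative sketch)."""
    t = np.asarray(true_labels, dtype=float)
    p = np.asarray(pred_labels, dtype=float)
    # Indices of every unordered pair (i, j) with i < j
    i, j = np.triu_indices(len(t), k=1)
    dt = t[i] - t[j]
    dp = p[i] - p[j]
    # Only pairs with different true labels are comparable
    comparable = dt != 0
    dt, dp = dt[comparable], dp[comparable]
    concordant = np.sum(dt * dp > 0)  # predictions ordered like the true labels
    tied = np.sum(dp == 0)            # tied predictions count as half
    return float((concordant + 0.5 * tied) / len(dt))

# Hypothetical toy example with 5 comparable pairs:
# 3 concordant, 1 tied prediction, 1 discordant -> (3 + 0.5) / 5
toy_true = [1.0, 2.0, 3.0, 3.0]
toy_pred = [0.1, 0.5, 0.5, 0.2]
print(cindex_vectorized(toy_true, toy_pred))
```

Like the loop version, this sketch divides by zero when all true labels are equal, because no pair is comparable in that case.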

In [74]:
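# Fit a 20-nearest-neighbours regressor on the training set and predict the single test sample.
def make_predictions(test_features, train_features, test_labels, train_labels):
    # test_labels is not used here; the parameter is kept so that the output
    # of the split functions can be passed through unchanged.
    model = KNeighborsRegressor(n_neighbors=20)

    # Train the model using the training sets
    model.fit(train_features, train_labels)

    # Predict the output for the test sample
    p = model.predict(test_features)

    # train_labels is a one-column DataFrame, so p has shape (1, 1)
    return p[0][0]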

In [75]:
def loo_split(sample):
    #separate a row from features to use as test set
    test_features = features.iloc[[sample]]

    #drop the testing row from the training set
    train_features = features.drop([sample], axis=0)

    #separate a row from labels to use as test labels
    test_labels = labels.iloc[[sample]]

    #drop the testing row from the training labels set
    train_labels = labels.drop([sample], axis=0)
        
    # print(test_features, train_features, test_labels, train_labels)
    return test_features, train_features, test_labels, train_labels

In [76]:
def loo_cindex():
    
    predictions = []
    
    for sample in tqdm(range(0, features.shape[0])):
        
        test_features, train_features, test_labels, train_labels = loo_split(sample)
        
        prediction = make_predictions(test_features, train_features, test_labels, train_labels)
        
        predictions.append(prediction)
        
    return cindex(labels[0], predictions)

In [77]:
def loo_pair1_split(sample):
    
    # get the indices of all pairs that contain the same drug (column 0 of pairs) as the test pair
    replicas = pairs.loc[sample][0]
    replica_inds = pairs[pairs[0] == replicas].index
    
    # test sets
    test_features = features.iloc[[sample]]
    test_labels = labels.iloc[[sample]]
    
    #train sets
    train_features = features.drop(index=replica_inds)
    train_labels = labels.drop(index=replica_inds)
    
    return  test_features, train_features, test_labels, train_labels

In [78]:
def loo_pair1_cindex():
    
    predictions = []
    
    for sample in tqdm(range(0, features.shape[0])):
        
        test_features, train_features, test_labels, train_labels = loo_pair1_split(sample)
        
        prediction = make_predictions(test_features, train_features, test_labels, train_labels)
        
        predictions.append(prediction)
        
    return cindex(labels[0], predictions)

In [79]:
def loo_both_pairs_split(sample):
    # drug (column 0) and target (column 1) of the test pair
    replicas_pair1 = pairs.loc[sample][0]
    replicas_pair2 = pairs.loc[sample][1]
    
    # indices of all pairs that share the drug or the target with the test pair
    replica_inds = pairs[(pairs[0] == replicas_pair1) | (pairs[1] == replicas_pair2)].index
    
    # test sets
    test_features = features.iloc[[sample]]
    test_labels = labels.iloc[[sample]]
    
    #train sets
    train_features = features.drop(index=replica_inds)
    train_labels = labels.drop(index=replica_inds)
    
    return  test_features, train_features, test_labels, train_labels
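
The three split functions above differ only in which rows are removed from the training set. The sketch below illustrates this on a made-up table of five pairs; the arrays `drugs` and `targets`, the function `training_mask` and the setting names are hypothetical and only mirror the logic of the splits.

```python
import numpy as np

# Hypothetical toy data: five drug-target pairs (not taken from pairs.data)
drugs = np.array(["D1", "D1", "D2", "D2", "D3"])
targets = np.array(["T1", "T2", "T1", "T3", "T2"])

def training_mask(test_idx, setting):
    """Boolean mask of the rows that stay in the training set for one test pair."""
    same_drug = drugs == drugs[test_idx]
    same_target = targets == targets[test_idx]
    if setting == "pair":      # simple leave-one-out: drop only the test pair
        excluded = np.arange(len(drugs)) == test_idx
    elif setting == "drug":    # drop every pair that contains the test drug
        excluded = same_drug
    elif setting == "both":    # drop every pair sharing the drug or the target
        excluded = same_drug | same_target
    else:
        raise ValueError(f"unknown setting: {setting}")
    return ~excluded

# Training rows left when the test pair is row 0, i.e. (D1, T1)
for setting in ("pair", "drug", "both"):
    print(setting, np.flatnonzero(training_mask(0, setting)).tolist())
```

For the test pair (D1, T1), the training set shrinks from four rows to three and then to two as the exclusion rule becomes stricter.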

In [80]:
def loo_both_pairs_cindex():
    
    predictions = []
    
    for sample in tqdm(range(0, features.shape[0])):
        
        test_features, train_features, test_labels, train_labels = loo_both_pairs_split(sample)
        
        prediction = make_predictions(test_features, train_features, test_labels, train_labels)
        
        predictions.append(prediction)
        
    return cindex(labels[0], predictions)
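
The three cross-validation loops above are identical apart from the split they call, so they could be folded into one function that takes the exclusion rule as an argument. The following is a minimal sketch under stated assumptions: it runs on synthetic data, all names (`loo_predictions`, `exclude_pair`, `exclude_drug`, `exclude_drug_or_target`, `knn_predict`) are hypothetical, and `knn_predict` is a simple NumPy stand-in for `KNeighborsRegressor`, not the scikit-learn implementation.

```python
import numpy as np

def knn_predict(train_X, train_y, test_x, k):
    """Stand-in regressor: mean label of the k nearest training rows (Euclidean)."""
    dist = np.linalg.norm(train_X - test_x, axis=1)
    nearest = np.argsort(dist)[:k]
    return train_y[nearest].mean()

def exclude_pair(i, drugs, targets):
    return np.arange(len(drugs)) == i                      # only the test pair

def exclude_drug(i, drugs, targets):
    return drugs == drugs[i]                               # all pairs with the test drug

def exclude_drug_or_target(i, drugs, targets):
    return (drugs == drugs[i]) | (targets == targets[i])   # drug or target shared

def loo_predictions(X, y, drugs, targets, exclude_fn, k):
    """Leave-one-out predictions where exclude_fn decides which rows leave the training set."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = ~exclude_fn(i, drugs, targets)
        preds[i] = knn_predict(X[keep], y[keep], X[i], k)
    return preds

# Synthetic demo data: 3 drugs x 3 targets, 4 random features per pair
rng = np.random.default_rng(0)
demo_drugs = np.repeat(["D1", "D2", "D3"], 3)
demo_targets = np.tile(["T1", "T2", "T3"], 3)
demo_X = rng.normal(size=(9, 4))
demo_y = rng.normal(size=9)

for rule in (exclude_pair, exclude_drug, exclude_drug_or_target):
    preds = loo_predictions(demo_X, demo_y, demo_drugs, demo_targets, rule, k=2)
    print(rule.__name__, preds.shape)
```

Passing the rule as a function keeps the loop in one place, so a change to the model or the scoring only has to be made once.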

#### Run cross-validations

In [81]:
# Run the requested cross-validations and print the results.
# Because I have only measured the affinities for some of the possible pairs of the currently known drugs and targets, I want to use the model to predict the affinities for the remaining pairs.

# Simple cross-validation

print("C-index for simple leave-one-out is ", loo_cindex())

100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [03:12<00:00,  7.78it/s]


C-index for simple leave-one-out is  0.7753800149970941


In [82]:
# I am confident that I will discover new potential drug molecules in the future, so I will want to use the model to predict their affinities to the currently known targets.

# Cross-validation where no pair containing the test pair's drug is used for training

print("C-index for leave-drug-out cross-validation is ", loo_pair1_cindex())

100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [03:11<00:00,  7.81it/s]


C-index for leave-drug-out cross-validation is  0.7004486294526061


In [83]:
# Because new putative target molecules, too, will likely be identified in the future, I will also want to use the model to predict the affinities between the drug molecules I will discover and the target molecules somebody else will discover in the future.

# Cross-validation where no pair sharing the test pair's drug or target is used for training
print("C-index for leave-drug-and-target-out cross-validation is ", loo_both_pairs_cindex())

100%|██████████████████████████████████████████████████████████████████████████████| 1500/1500 [03:04<00:00,  8.13it/s]


C-index for leave-drug-and-target-out cross-validation is  0.6763313587770585


#### Interpret results

###### Interpret the results you obtained and explain why they can be trusted.
Simple leave-one-out tells us how well the model predicts the affinity of a pair formed from a drug and a target that we already know (situation 1). This cross-validation gives the best result (C-index ≈ 0.775) because both molecules of the test pair also occur in the training data and we have the most training data. Some of the training samples are connected to the test sample via the drug or the target, so the constraint of independence is not fulfilled. This matches situation 1, where the pairs to be predicted also consist of known drugs and targets, but it makes the estimate too optimistic for the other two situations.<br>


In the second leave-one-out we drop from the training set every row whose drug is the same as the test sample's. This way we can simulate how the model predicts affinities between unknown drugs and known targets (situation 2). The result (C-index ≈ 0.700) is somewhat worse than in simple leave-one-out, because we have less training data and the remaining samples are less closely connected to the test sample. Some of the training samples are still connected to the test sample, but only via the shared target.<br>


The last leave-one-out gives the worst result (C-index ≈ 0.676). Here we remove from the training set every row that shares its drug or its target with the test sample, so the test sample is independent of the training data and the constraint of independence is fulfilled. This simulates predicting affinities between a new drug and a new target (situation 3).
