# TKO_2096 Applications of Data Analysis 2021
## Exercise 4

Complete the tasks given to you in the letter below. There are cells at the end of this notebook to which you are expected to write your code. Insert markdown cells as needed to describe your solution.

The deadline of this exercise is **28.2.2021, 23:59**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.


---

Student name: Aleksi Laakso

Student number: 518416 and PeppiID: 117551

Student email: almlaa@utu.fi

---


Dear Data Scientist,

I have a task for you that concerns drug molecules and their targets. I have spent a lot of time in a laboratory to measure how strongly potential drug molecules bind to putative target molecules. I do not have enough resources to measure all possible drug-target pairs, so I would like to first predict their affinities and then measure only the most promising ones. I have already managed to create a model which I believe is good for this purpose. Its details are below.

- algorithm: K-nearest neighbours regressor
- parameters: K=20
- training data: full data set

The full data set is available as the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

I am not able to evaluate how well my model will perform when I use it to predict the affinities of new drug-target pairs. I need you to evaluate the model for me. There are three distinct situations in which I want to use this model in the future.

1. I did not have the resources to measure the affinities of all the known drug-target pairs in the laboratory, so I want to use the model to predict the affinities of the remaining pairs.
2. I am confident that I will discover new potential drug molecules in the future, so I will want to use the model to predict their affinities to the currently known targets.
3. Because new putative target molecules, too, will likely be identified in the future, I will also want to use the model to predict the affinities between the drug molecules I will discover and the target molecules somebody else will discover in the future.

I need to get evaluation results from leave-one-out cross-validation with C-index. Please evaluate the generalisation performance of my model in the three situations and explain why your cross-validation methods are suitable for them.


Yours sincerely, \
Bio Scientist


PS. Follow all the general exercise guidelines stated in Moodle.

---
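
For orientation, the model described in the letter predicts the affinity of a pair as the mean affinity of its K closest training pairs. The pure-Python sketch below illustrates that idea on made-up one-dimensional data. The function name `knn_predict` and the toy values are assumptions for illustration only; the notebook itself relies on scikit-learn's `KNeighborsRegressor`.

```python
import math


def knn_predict(train_features, train_affinities, query, k):
    """Predict by averaging the affinities of the k nearest training rows."""
    # Pair each Euclidean distance to the query with the row's affinity.
    distances = [
        (math.dist(features, query), affinity)
        for features, affinity in zip(train_features, train_affinities)
    ]
    distances.sort(key=lambda item: item[0])
    nearest = distances[:k]
    return sum(affinity for _, affinity in nearest) / len(nearest)


# Toy data (hypothetical): one feature per pair, one affinity per pair.
toy_features = [[0.0], [1.0], [2.0], [10.0]]
toy_affinities = [1.0, 3.0, 5.0, 100.0]

# The two nearest rows to 0.9 are 1.0 and 0.0, so the prediction is (3 + 1) / 2.
toy_prediction = knn_predict(toy_features, toy_affinities, [0.9], k=2)
print(toy_prediction)
```

Because the prediction depends only on the nearest training rows, the choice of which rows are allowed into the training set directly shapes the evaluation result.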

#### Import libraries

In [1]:
# Import the libraries you need.
import numpy as np
import pandas as pd
from sklearn import neighbors

#### Load datasets

In [2]:
# Read the data files (input.data, output.data, pairs.data).
input = pd.read_csv('input.data', sep=" ", header=None)
output = pd.read_csv('output.data', sep=" ", header=None)
pairs = pd.read_csv('pairs.data', sep=" ", header=None)

# Change input and output to numpy for easier handling
input = input.to_numpy()
output = output.to_numpy()
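
Because the i-th row of each file must describe the same pair, a quick row-count check right after loading can catch a misaligned file early. The helper below is a hypothetical sketch (the name `check_row_alignment` is not part of the exercise) and is shown on toy lists rather than the real data.

```python
def check_row_alignment(**tables):
    """Return the common row count, or raise ValueError if the counts differ."""
    lengths = {name: len(rows) for name, rows in tables.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Row counts differ: {lengths}")
    return next(iter(lengths.values()))


# Toy stand-ins for the three files: two pairs, described consistently.
toy_rows = check_row_alignment(
    features=[[0.1, 0.2], [0.3, 0.4]],
    affinities=[[5.0], [7.0]],
    pairs=[["D1", "T1"], ["D2", "T1"]],
)
print(toy_rows)
```

With the notebook's variables, the call would presumably look like `check_row_alignment(features=input, affinities=output, pairs=pairs)`, since `len` works on both NumPy arrays and DataFrames.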

#### Write functions

In [3]:
# Write the functions you need to perform the requested cross-validations.

# C-index (concordance index), reused from exercises 2 and 3.
# true_labels: array of the true output values
# pred_labels: array of the predicted output values
# Returns the share of comparable pairs (pairs whose true values differ) that
# the predictions rank in the same order; tied predictions count as 0.5.
def cindex(true_labels, pred_labels):
    n = 0      # concordant pairs, with 0.5 added for each tied prediction
    pairs = 0  # comparable pairs
    for i in range(len(true_labels)):
        # true value p1 and predicted value p2 of the i-th sample
        p1 = true_labels[i]
        p2 = pred_labels[i]
        for j in range(i + 1, len(true_labels)):
            # true value y1 and predicted value y2 of the j-th sample
            y1 = true_labels[j]
            y2 = pred_labels[j]

            # only pairs with different true values are comparable
            if p1 != y1:
                pairs = pairs + 1
                if (p1 < y1 and p2 < y2) or (p1 > y1 and p2 > y2):
                    n = n + 1
                elif p2 == y2:
                    n += 0.5
    return n / pairs

# regular loocv for 1): leave one pair out at a time
def loocv(input, output):
    k = 20

    # init empty list for predictions and K-Nearest Neighbor Regressor
    predictions = []
    neigh = neighbors.KNeighborsRegressor(n_neighbors = k, metric = 'euclidean')

    # use the i-th row as the test set
    for i in range(len(input)):

        # train sets: every row except the i-th
        X_Train = np.append(input[:i], input[i+1:], axis = 0)
        Y_Train = np.append(output[:i], output[i+1:], axis = 0)

        # test set
        X_Test = input[i]

        neigh.fit(X_Train, Y_Train)
        # store the prediction as a scalar instead of a nested array
        predictions.append(np.ravel(neigh.predict([X_Test]))[0])

    # flatten the true values so that cindex compares scalars
    return cindex(np.ravel(output), predictions)

# Builds a drug x target lookup table: the cell for drug d and target t holds
# the row index of that pair in the input and output data.
# e.g. if the first row of pairs is (D23, T194), the cell for D23 and T194 is 0
# The cell is NaN if the pair does not occur in the data.
def matrix(pairs):

    # Get all unique drugs and targets
    drugs = pairs.iloc[:,0].unique()
    targets = pairs.iloc[:,1].unique()

    # create a pandas dataframe with a row for every drug and a column for every target
    indices = pd.DataFrame(index=drugs, columns=targets)

    # loop through pairs and save the row index to the matrix
    # (.loc avoids chained assignment, which may silently write to a copy)
    for i in range(len(pairs)):
        indices.loc[pairs.iloc[i,0], pairs.iloc[i,1]] = i

    return indices

# loocv for 2): leave one drug out, i.e. all pairs of one drug form the test set
def loocv_2 (input, output, matrix):
    k = 20

    # init empty lists for predictions and true values, and K-Nearest Neighbor Regressor
    true_values = []
    predictions = []
    neigh = neighbors.KNeighborsRegressor(n_neighbors = k, metric = 'euclidean')

    # loop through all drugs
    for i in range(matrix.shape[0]):

        # data rows in which the i-th drug is paired with some target
        indices = matrix.iloc[i,:].dropna().astype(int).to_numpy()

        # train sets: every pair that does not contain the i-th drug
        X_Train = np.delete(input, indices, axis = 0)
        Y_Train = np.delete(output, indices, axis = 0)

        # test sets: every pair of the i-th drug
        X_Test = input[indices]
        Y_Test = output[indices]

        neigh.fit(X_Train, Y_Train)
        pred = neigh.predict(X_Test)
        # store predictions and true values as flat lists of scalars
        predictions.extend(np.ravel(pred))
        true_values.extend(np.ravel(Y_Test))

    return cindex(true_values, predictions)

# loocv for 3): leave one pair out and also remove from the training set
# every pair that shares its drug or its target
def loocv_3(input, output, matrix):
    k = 20

    # init empty lists for predictions and true values and K-Nearest Neighbor Regressor
    true_values = []
    predictions = []
    neigh = neighbors.KNeighborsRegressor(n_neighbors = k, metric = 'euclidean')

    # loop through all drugs
    for i in range(matrix.shape[0]):

        # take the i-th drug
        indices_Drugs = matrix.iloc[i,:]

        # loop through targets
        for j in range(matrix.shape[1]):

            # skip if the i-th drug and j-th target do not form a pair in the data
            # (.iloc selects by position; the Series is labelled by target name)
            row = indices_Drugs.iloc[j]
            if pd.isna(row):
                continue
            row = int(row)

            # test sets
            X_Test = input[row]
            Y_Test = output[row]

            # data rows in which the j-th target is paired with some drug
            indices_Targets = matrix.iloc[:,j].dropna()

            # unique indices of rows that contain the i-th drug or the j-th target
            indices = np.unique(np.concatenate((
                indices_Drugs.dropna().astype(int).to_numpy(),
                indices_Targets.astype(int).to_numpy())))

            # train sets
            X_Train = np.delete(input, indices, axis = 0)
            Y_Train = np.delete(output, indices, axis = 0)

            neigh.fit(X_Train, Y_Train)
            pred = neigh.predict([X_Test])
            # store the prediction and the true value as scalars
            predictions.append(np.ravel(pred)[0])
            true_values.append(np.ravel(Y_Test)[0])

    return cindex(true_values, predictions)
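
A small worked example can make the C-index easier to trust. The sketch below restates the same pairwise definition in plain Python and evaluates three predictions whose scores are known in advance. The name `toy_c_index` and all values are illustrative assumptions, not part of the exercise data.

```python
def toy_c_index(true_values, predicted_values):
    """Share of comparable pairs ranked in the same order by the predictions."""
    concordant = 0.0
    comparable = 0
    n = len(true_values)
    for i in range(n):
        for j in range(i + 1, n):
            # Pairs with equal true values carry no ranking information.
            if true_values[i] == true_values[j]:
                continue
            comparable += 1
            true_diff = true_values[i] - true_values[j]
            pred_diff = predicted_values[i] - predicted_values[j]
            if pred_diff == 0:
                concordant += 0.5  # tied prediction counts as half
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("C-index is undefined when all true values are equal")
    return concordant / comparable


toy_true = [1.0, 2.0, 3.0, 4.0]
perfect_score = toy_c_index(toy_true, [0.1, 0.2, 0.3, 0.4])   # same order
reversed_score = toy_c_index(toy_true, [4.0, 3.0, 2.0, 1.0])  # opposite order
constant_score = toy_c_index(toy_true, [5.0, 5.0, 5.0, 5.0])  # all tied
print(perfect_score, reversed_score, constant_score)  # 1.0 0.0 0.5
```

A score of 0.5 therefore corresponds to predictions that carry no ranking information, which is the baseline against which the cross-validation results should be read.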

#### Run cross-validations

In [4]:
# Run the requested cross-validations and print the results.
# Store the result under a new name so that the matrix() function is not overwritten
# and the cell can be re-run.
%time index_matrix = matrix(pairs)

Wall time: 163 ms


In [5]:
%%time
#1)
c_1 = loocv(input, output)
print('c-index for c_1 aka type A:', c_1)

c-index for c_1 aka type A: 0.7753800149970941
Wall time: 40.8 s


In [6]:
%%time
#2)
c_2 = loocv_2(input, output, index_matrix)
print('c-index for c_2 aka type B or C:', c_2)

c-index for c_2 aka type B or C: 0.7003789608003829
Wall time: 2.47 s


In [7]:
%%time
#3)
c_3 = loocv_3(input, output, index_matrix)
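
The lookup built in this step answers one question: which data rows contain a given drug or a given target. As a rough alternative sketch, the same information can be kept in two plain dictionaries. The function name `group_rows` and the toy identifiers are hypothetical and not taken from `pairs.data`.

```python
from collections import defaultdict


def group_rows(pairs):
    """Map each drug and each target to the list of row indices it occurs in."""
    rows_by_drug = defaultdict(list)
    rows_by_target = defaultdict(list)
    for row, (drug, target) in enumerate(pairs):
        rows_by_drug[drug].append(row)
        rows_by_target[target].append(row)
    return dict(rows_by_drug), dict(rows_by_target)


# Toy pairs (hypothetical identifiers): row 0 is (D1, T1), and so on.
toy_pairs = [("D1", "T1"), ("D1", "T2"), ("D2", "T1")]
rows_by_drug, rows_by_target = group_rows(toy_pairs)
print(rows_by_drug)    # {'D1': [0, 1], 'D2': [2]}
print(rows_by_target)  # {'T1': [0, 2], 'T2': [1]}
```

Dictionaries avoid the NaN cells of a dense drug-by-target table, at the cost of losing the direct (drug, target) to row lookup that the table provides.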
print('c-index for c_3 aka type D:', c_3)

c-index for c_3 aka type D: 0.6763313587770585
Wall time: 53.7 s


#### Interpret results

In [8]:
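# Interpret the results you obtained and explain why your cross-validation methods work.

# The C-index decreases as the test pair shares less with the training data:
# c_1 (0.775) > c_2 (0.700) > c_3 (0.676). This is the expected order, because each
# setting removes more related pairs from the training set than the previous one.
# I used Euclidean distance as the KNN metric because it was used in the previous exercises.

# 1) Regular LOOCV suits case 1: only the test pair itself is left out, so its drug and
#    its target may both appear in other training pairs, as they do for the unmeasured
#    pairs of known molecules.
# 2) loocv_2 suits case 2: all pairs of one drug are held out together, so the model
#    never sees the test drug during training, while its targets stay in the training set.
# 3) loocv_3 suits case 3: for each test pair, every training pair that shares its drug
#    or its target is removed, so both molecules are unseen by the model.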
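
The difference between the three settings comes down to which rows remain in the training set for a given test pair. The sketch below makes that explicit on toy data. The function name `training_rows`, the `setting` argument and the identifiers are assumptions for illustration, not the notebook's own code.

```python
def training_rows(pairs, test_row, setting):
    """Return the row indices allowed in training for one test pair.

    setting 1: drop only the test pair (known drug, known target)
    setting 2: also drop every pair with the test drug (new drug)
    setting 3: also drop every pair with the test target (new drug and target)
    """
    test_drug, test_target = pairs[test_row]
    kept = []
    for row, (drug, target) in enumerate(pairs):
        if row == test_row:
            continue
        if setting >= 2 and drug == test_drug:
            continue
        if setting == 3 and target == test_target:
            continue
        kept.append(row)
    return kept


# Toy pairs (hypothetical identifiers); the test pair is row 0, (D1, T1).
toy_pairs = [("D1", "T1"), ("D1", "T2"), ("D2", "T1"), ("D2", "T2"), ("D3", "T3")]
kept_1 = training_rows(toy_pairs, test_row=0, setting=1)
kept_2 = training_rows(toy_pairs, test_row=0, setting=2)
kept_3 = training_rows(toy_pairs, test_row=0, setting=3)
print(kept_1, kept_2, kept_3)  # [1, 2, 3, 4] [2, 3, 4] [3, 4]
```

The training set shrinks from one setting to the next, which is consistent with the C-index falling from c_1 to c_3.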
