# TKO_2096 Applications of Data Analysis 2021
## Exercise 4

Complete the tasks given to you in the letter below. There are cells at the end of this notebook to which you are expected to write your code. Insert markdown cells as needed to describe your solution.


Dear Data Scientist,

I have a task for you that concerns drug molecules and their targets. I have spent a lot of time in a laboratory measuring how strongly potential drug molecules bind to putative target molecules. I do not have enough resources to measure all possible drug-target pairs, so I would like to first predict their affinities and then measure only the most promising ones. I have already managed to create a model which I believe is good for this purpose. Its details are below.

- algorithm: K-nearest neighbours regressor
- parameters: K=20
- training data: full data set

The full data set is available as the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

I am not able to evaluate how well my model will perform when I use it to predict the affinities of new drug-target pairs. I need you to evaluate the model for me. There are three distinct situations in which I want to use this model in the future.

1. I did not have the resources to measure the affinities of all the known drug-target pairs in the laboratory, so I want to use the model to predict the affinities of the remaining pairs.
2. I am confident that I will discover new potential drug molecules in the future, so I will want to use the model to predict their affinities to the currently known targets.
3. Because new putative target molecules, too, will likely be identified in the future, I will also want to use the model to predict the affinities between the drug molecules I will discover and the target molecules somebody else will discover in the future.

I need to get evaluation results from leave-one-out cross-validation with C-index. Please evaluate the generalisation performance of my model in the three situations and explain why your cross-validation methods are suitable for them.


Yours sincerely, \
Bio Scientist


PS. Follow all the general exercise guidelines stated in Moodle.

---

In [1]:
## 
# This is not a working solution, but it's what I got for now
##

#### Import libraries

In [2]:
# Import the libraries you need.
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsRegressor
#from sklearn.preprocessing import OneHotEncoder
import time

#### Load datasets

In [3]:
# Read the data files (input.data, output.data, pairs.data).

X_in = np.genfromtxt('input.data',delimiter=" ")
y_in = np.genfromtxt('output.data',delimiter=" ")
pairs_in = pd.read_csv('pairs.data',sep=" ", header=None).to_numpy()
#pairs_in = np.array([[j[1:] for j in a] for a in pairs]).astype(int)

print("Rows and columns")
print("Input X:",X_in.shape)
print("Output y:",y_in.shape)
print("Pairs:",pairs_in.shape)

Rows and columns
Input X: (1500, 2500)
Output y: (1500,)
Pairs: (1500, 2)


In [11]:
# Optionally shorten the data set for debugging
a = len(X_in)
#a = 200  # uncomment to debug on a subset; keep commented out for the complete set

X = X_in[:a]
y = y_in[:a]
pairs = pairs_in[:a]

#### Write functions

In [12]:
# Write the functions you need to perform the requested cross-validations.

def LOO(X):
    # Generator that yields (train_index, test_index) splits
    # Regular leave-one-out, as in the earlier exercises
    
    indices = np.arange(len(X)) # one split per sample
    
    for test_index in indices:
        
        train_index = np.delete(indices, test_index) # all but 'test_index'

        yield train_index, test_index
        
        
        
        
def LeaveNthMemberOut(pairs,n):
    # Generator that yields (train_index, test_index) splits
    # Every pair that shares its nth member (1 = first column, 2 = second column)
    # with the test pair is left out of the training set
    
    indices = np.arange(len(pairs)) # one split per pair
    n = n-1 # convert to a zero-based column index
    
    for ind in indices:
        test_index = ind

        train_index = indices[np.logical_not(indices==test_index)] # remove test set from training set
        
        bad_members_idx = indices[pairs[:,n] == pairs[ind][n]] # pairs where item n is the same   
        
        # remove pairs where the nth item is the same
        train_index = np.setdiff1d(train_index, bad_members_idx, assume_unique = True)
        
        yield train_index, test_index
        
        
        
        
def LeaveBothMembersOut(pairs):
    # Generator that yields (train_index, test_index) splits
    # Every pair that shares either member with the test pair is left out of the training set
    
    indices = np.arange(len(pairs)) # one split per pair
    
    for ind in indices:
        test_index = ind

        train_index = indices[np.logical_not(indices==test_index)] # remove test set from training set
        
        first_member_indexes = indices[pairs[:,0] == pairs[ind][0]] # pairs where item 1 is the same
        second_member_indexes = indices[pairs[:,1] == pairs[ind][1]] # pairs where item 2 is the same
        
        # remove pairs where the 1st member is the same
        train_index = np.setdiff1d(train_index,first_member_indexes, assume_unique = True)
        # remove pairs where the 2nd member is the same
        train_index = np.setdiff1d(train_index,second_member_indexes, assume_unique = True) 
        
        yield train_index, test_index
    
    
    
    
def LOOCV(X, y, loo, k):
    # Fits a KNN regressor on each training split and predicts the left-out sample
    
    pred = [] # predicted outputs, in test-index order
    
    for train_index, test_index in loo:
        
        # Progress printout after every 100th sample
        if (test_index+1) % 100 == 0:
            print(test_index+1,"/", len(X))
            
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        
        knr = KNeighborsRegressor(n_neighbors=k)
        knr.fit(X_train, y_train)
        
        # predict() expects a 2D array, so the single test row is reshaped
        pred.append(knr.predict(X_test.reshape(1,-1))[0])
        
    return np.array(pred)




def cindex(true_labels, pred_labels):
    """Returns general C-index between true labels and predicted labels"""  
    
    N = 0 # Concordant pairs (pairs with tied predictions count as 0.5)
    T = 0 # Total number of pairs with unequal true outputs
    
    # Stack the true and predicted labels as two columns
    data = np.column_stack((true_labels.reshape(-1,1), pred_labels))
    
    for i in range(len(data)): # First item of a pair
        for j in range(i+1, len(data)): # Second item of a pair (j > i, so no self-pairs)
            
            if (data[i][0] > data[j][0]) and (data[i][1] > data[j][1]):
                N = N + 1
            elif (data[j][0] > data[i][0]) and (data[j][1] > data[i][1]):
                N = N + 1
            elif (data[i][0] != data[j][0]) and (data[i][1] == data[j][1]):
                N = N + 0.5
                
            if data[i][0] != data[j][0]:
                T = T + 1

    
    return N/T
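
The double loop above visits every pair in Python, which is slow for larger data sets. As a rough sketch only, the same pairwise counting could be written with NumPy arrays. The name `cindex_vectorized` is hypothetical and not part of the exercise; the function assumes 1D inputs and keeps all pairs in memory at once.

```python
import numpy as np


def cindex_vectorized(true_labels, pred_labels):
    """C-index computed from all pairwise differences (hypothetical helper)."""
    y = np.asarray(true_labels, dtype=float).ravel()
    p = np.asarray(pred_labels, dtype=float).ravel()

    # Indices of every unordered pair (i, j) with i < j
    i, j = np.triu_indices(len(y), k=1)

    dy = y[i] - y[j]  # differences in true outputs
    dp = p[i] - p[j]  # differences in predictions

    comparable = dy != 0  # only pairs with unequal true outputs count
    concordant = np.sum(dy * dp > 0)  # same ordering in truth and prediction
    tied_pred = np.sum(comparable & (dp == 0))  # tied predictions count as 0.5

    n_comparable = np.sum(comparable)
    if n_comparable == 0:
        raise ValueError("C-index is undefined when all true labels are equal")
    return (concordant + 0.5 * tied_pred) / n_comparable


print(cindex_vectorized([1, 2, 3, 4], [1, 3, 2, 4]))
```

In the toy call, one of the six pairs is ordered the wrong way round, so five of the six comparable pairs are concordant.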

#### Run cross-validations

In [13]:
# Run the requested cross-validations and print the results.

# Situation 1: a new pair made of an already known drug and an already known target
# Both members (drug and target) of the test pair also appear in training pairs
# Type A observation
# Only the test pair itself has to be left out, exactly as in regular leave-one-out
# Regular leave-one-out can be used

s = time.time()

loo = LOO(pairs) # Pairwise LOO generator

k = 20
predictions = LOOCV(X,y,loo,k) # params: X, y, loo generator, k

e = time.time()
print(f"Predictions took {time.strftime('%H:%M:%S', time.gmtime(round(e-s,2)))}!")

c1 = cindex(y, predictions) # Calculation of C-index
print("\nC-index =",c1,"\n")  

e = time.time()
print(f"All done! Took {time.strftime('%H:%M:%S', time.gmtime(round(e-s,2)))}!")

100 / 1500
200 / 1500
300 / 1500
400 / 1500
500 / 1500
600 / 1500
700 / 1500
800 / 1500
900 / 1500
1000 / 1500
1100 / 1500
1200 / 1500
1300 / 1500
1400 / 1500
1500 / 1500
Predictions took 00:00:32!

C-index = 0.7753800149970941 

All done! Took 00:00:44!
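
As a cross-check of the regular leave-one-out loop, the sketch below compares scikit-learn's `LeaveOneOut` splitter with an explicit loop. The data are synthetic stand-ins (an assumption, not the exercise files), and `k_toy` is smaller than K=20 only because the toy set is tiny.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not input.data / output.data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 5))
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=30)

k_toy = 5  # hypothetical value, chosen to fit the small toy set

# scikit-learn's leave-one-out: one prediction per sample
sk_pred = cross_val_predict(
    KNeighborsRegressor(n_neighbors=k_toy), X_toy, y_toy, cv=LeaveOneOut()
)

# The same procedure written as an explicit loop
manual_pred = np.array([
    KNeighborsRegressor(n_neighbors=k_toy)
    .fit(np.delete(X_toy, i, axis=0), np.delete(y_toy, i))
    .predict(X_toy[i:i + 1])[0]
    for i in range(len(X_toy))
])

print(np.allclose(sk_pred, manual_pred))
```

This equivalence only covers situation 1. The other two situations need splits that also drop pairs sharing a member with the test pair.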


In [14]:
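# Situation 2: a new drug paired with an already known target
# Only one member of the test pair may appear in the training set
# Observation types B and C
# Regular LOO would be too optimistic, because training pairs would share the unseen member
# All pairs sharing the unseen member must be left out of the training set

s = time.time()

# NOTE: n = 2 leaves out the pairs that share the test pair's second member (target),
# which simulates an unseen target. For a new drug, as in the letter, n = 1 would be
# needed, assuming the first column of pairs.data holds the drug identifiers.
n = 2 # for shared second member, targets 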
loo = LeaveNthMemberOut(pairs, n) # Pairwise LOO generator

k = 20
predictions = LOOCV(X,y,loo,k) # params: X, y, loo generator, k

e = time.time()
print(f"Predictions took {time.strftime('%H:%M:%S', time.gmtime(round(e-s,2)))}!")

c2 = cindex(y, predictions) # Calculation of C-index
print("\nC-index =",c2,"\n")  

e = time.time()
print(f"All done! Took {time.strftime('%H:%M:%S', time.gmtime(round(e-s,2)))}!")

100 / 1500
200 / 1500
300 / 1500
400 / 1500
500 / 1500
600 / 1500
700 / 1500
800 / 1500
900 / 1500
1000 / 1500
1100 / 1500
1200 / 1500
1300 / 1500
1400 / 1500
1500 / 1500
Predictions took 00:00:32!

C-index = 0.765166223904048 

All done! Took 00:00:44!
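
To make the effect of the left-out column concrete, here is a minimal sketch on made-up identifiers. The toy pairs, the helper name `train_indices_without_member`, and the column order (drug first, target second) are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy identifiers: column 0 = drug, column 1 = target (assumed order)
toy_pairs = np.array([
    ["D1", "T1"],
    ["D1", "T2"],
    ["D2", "T1"],
    ["D2", "T2"],
    ["D3", "T1"],
])


def train_indices_without_member(pairs, test_index, column):
    """Indices of the pairs that do not share the given member with the test pair."""
    keep = pairs[:, column] != pairs[test_index, column]
    return np.flatnonzero(keep)


# Test pair 0 is (D1, T1)
unseen_drug = train_indices_without_member(toy_pairs, 0, column=0)
unseen_target = train_indices_without_member(toy_pairs, 0, column=1)

print(unseen_drug)    # training pairs when D1 must be unseen
print(unseen_target)  # training pairs when T1 must be unseen
```

The test pair drops out automatically, because it always shares the chosen member with itself. Which column to leave out depends on which member is meant to be new.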


In [15]:
# Situation 3: both the drug and the target are new
# Neither member of the test pair is found in the training set
# Type D observation
# Cannot use regular LOO or LeavePairOut; every pair sharing either member must be left out

s = time.time()

loo = LeaveBothMembersOut(pairs) # Pairwise LOO generator

k = 20
predictions = LOOCV(X,y,loo,k) # params: X, y, loo generator, k

e = time.time()
print(f"Predictions took {time.strftime('%H:%M:%S', time.gmtime(round(e-s,2)))}!")

c3 = cindex(y, predictions) # Calculation of C-index
print("\nC-index =",c3,"\n")  

e = time.time()
print(f"All done! Took {time.strftime('%H:%M:%S', time.gmtime(round(e-s,2)))}!")

100 / 1500
200 / 1500
300 / 1500
400 / 1500
500 / 1500
600 / 1500
700 / 1500
800 / 1500
900 / 1500
1000 / 1500
1100 / 1500
1200 / 1500
1300 / 1500
1400 / 1500
1500 / 1500
Predictions took 00:00:32!

C-index = 0.6763313587770585 

All done! Took 00:00:44!
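
Leaving out both members can shrink the training set considerably, which matters because a K-nearest-neighbours model needs at least K training samples. The sketch below counts the remaining training pairs per split on made-up identifiers. The toy pairs and the helper name `train_indices_without_either_member` are hypothetical.

```python
import numpy as np

# Hypothetical toy identifiers: column 0 = drug, column 1 = target (assumed order)
toy_pairs = np.array([
    ["D1", "T1"],
    ["D1", "T2"],
    ["D2", "T1"],
    ["D2", "T2"],
    ["D3", "T1"],
])


def train_indices_without_either_member(pairs, test_index):
    """Indices of the pairs that share neither drug nor target with the test pair."""
    same_drug = pairs[:, 0] == pairs[test_index, 0]
    same_target = pairs[:, 1] == pairs[test_index, 1]
    return np.flatnonzero(~(same_drug | same_target))


# Number of training pairs left for each possible test pair
train_sizes = [
    len(train_indices_without_either_member(toy_pairs, i))
    for i in range(len(toy_pairs))
]

print(train_indices_without_either_member(toy_pairs, 0))
print(train_sizes, "smallest training set:", min(train_sizes))
```

Running the same count on the real `pairs` array before the cross-validation would show whether every split keeps at least K=20 training pairs.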


#### Interpret results

In [18]:
# Interpret the results you obtained and explain why your cross-validation methods work.
print("C-index result for situation 1:",c1)
print("C-index result for situation 2:",c2)
print("C-index result for situation 3:",c3)

# Explanations for the methods are within the specific functions

# The C-index drops as the model knows less and less about the pair it is trying to predict.
# Situation 1 is the best, since only the test pair itself was left out of the training set
# Situation 2 is close to situation 1, since one member of each test pair was still seen in training
# Situation 3 had neither member in training, so it is noticeably worse (about 0.1 lower) than the other two cases

C-index result for situation 1: 0.7753800149970941
C-index result for situation 2: 0.765166223904048
C-index result for situation 3: 0.6763313587770585
