# Outline
#### 1.1 Read residues and neighbors
#### 1.2 First 6 neighbors selected into subsequent analysis
#### 2.1 For each neighbor, find the most similar neighbor in each sample
#### 2.2 Change columns' name
#### 2.3 Meaning of column names 
#### 3.1 Functions for calculating the sum of root of the sum of squared difference 
#### 3.2 Calculate the sum of the distances from one residue to its closest residue in each sample
#### 4.1 Clustering 
#### 5.1 Sort by total distance of each sample (lowest, first) and return the first 6 clusters
#### 5.2 Sort by total distance of each sample (lowest, first) and return the first 6 unique clusters
#### 6.1 Generate features according to clustering importance
#### 7.1 Prediction results

### Load packages

In [1]:
import pandas as pd
import math
import numpy as np
from sklearn import preprocessing
from sklearn.cluster import KMeans

### 1.1 Read residues and neighbors
For each K residue, up to 24 neighbors have been recorded and loaded (72 columns, 3 per neighbor).

Note: each K residue is labeled either positive or negative.

In [94]:
Twenty_near_neigh = pd.read_csv("/home/yuan/Documents/PTM_k_NN24.csv").iloc[:,2:]

In [95]:
Twenty_near_neigh.iloc[4:9,:]

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,63,64,65,66,67,68,69,70,71,72
4,GLY,4.55808,144.017527,GLY,5.496251,84.293718,GLY,6.017002,71.365163,GLY,...,83.865922,THR,10.005927,125.062849,ILE,10.012168,128.109903,ILE,10.14118,121.9774
5,SER,5.382942,100.72826,ALA,6.947727,96.593505,ALA,7.038957,158.572218,VAL,...,80.530945,ALA,12.060864,62.235592,ARG,12.207237,118.44805,GLU,12.569549,161.641172
6,GLN,4.664383,112.660623,GLY,4.964091,117.424241,GLY,6.261041,74.416412,SER,...,81.51216,VAL,14.468603,54.906272,ASP,14.470224,35.269542,MET,14.688015,53.904125
7,ILE,6.373498,105.584257,GLY,6.587568,72.583166,GLU,7.253496,75.290607,LYS,...,81.015173,LYS,13.945731,98.127057,VAL,14.183865,60.444774,LEU,14.241316,81.618881
8,ALA,5.967033,140.713607,ALA,6.222193,111.021387,VAL,7.174093,73.843222,GLU,...,100.043765,LEU,12.333867,70.209676,ASP,12.464868,135.410558,PHE,12.721365,61.386859


Dimensions of the dataset

2072: 2072 residues (pos: neg ~ 1:1)

72: 24 neighbors x 3 (neighbor type, distance to the residue, and angle to the residue)

In [96]:
Twenty_near_neigh.shape

(2072, 72)

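#### 1.2 First 6 neighbors selected into subsequent analysis

The slice `0:54` below keeps the first 18 neighbors (18 x 3 columns), which matches the 18-neighbor run reported in 7.1; keeping only 6 neighbors corresponds to `0:18`. The variable name `Six_near_neigh` is the same in either case.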

In [159]:
Six_near_neigh = Twenty_near_neigh.iloc[:,0:54]
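
Each neighbor occupies three consecutive columns (type, distance, angle), so keeping the first k neighbors means keeping the first 3*k columns. A minimal sketch on a plain Python list, where `first_k_neighbors` is a hypothetical helper and the values are the first three neighbors of sample 4 shown earlier:

```python
def first_k_neighbors(row, k):
    """Group a flat [type, dist, angle, type, dist, angle, ...] row into
    (type, dist, angle) tuples and keep only the first k neighbors."""
    kept = row[:3 * k]
    return [tuple(kept[p:p + 3]) for p in range(0, len(kept), 3)]

# First three neighbors of sample 4, copied from the output in 1.1.
row = ["GLY", 4.55808, 144.017527,
       "GLY", 5.496251, 84.293718,
       "GLY", 6.017002, 71.365163]

two = first_k_neighbors(row, 2)
print(two)
```

With a DataFrame the same selection is `df.iloc[:, :3 * k]`, so `0:54` corresponds to k = 18.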

In [160]:
Six_near_neigh.tail()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,45,46,47,48,49,50,51,52,53,54
2067,PRO,4.842628,121.738105,SER,5.040065,156.044333,GLY,5.857273,79.111261,GLN,...,124.280897,ASP,10.951507,64.92268,PRO,11.223716,122.534488,THR,11.23608,142.826211
2068,VAL,5.43181,120.683681,VAL,5.687536,148.820974,ALA,5.717814,95.484556,LEU,...,62.0617,CYS,9.396902,151.776982,LEU,9.648069,140.285784,LEU,9.763929,145.657441
2069,THR,4.937804,119.508384,VAL,5.045129,110.80823,THR,5.075596,137.567445,GLU,...,92.274315,LYS,10.683572,123.711805,THR,10.751302,95.190825,ASP,11.167239,100.976467
2070,CYS,4.527412,114.5659,VAL,5.200363,123.479881,ASP,5.430008,129.284858,VAL,...,97.408204,TYR,9.732845,97.563322,SER,9.744364,115.958755,VAL,10.009726,74.638843
2071,GLY,4.683273,119.941672,GLY,4.966696,102.169358,ASP,5.366877,153.427777,MET,...,119.416556,GLY,8.071721,82.349489,GLU,8.189664,128.067917,LEU,8.525135,134.060982


### 2.1 For each neighbor, find the most similar neighbor in each sample
##### Three nested for loops

Loop through each neighbor (1-6; 1-18 in the run shown below)
    
    Loop through each sample
        
        For each residue, find one similar neighbor in each sample
            
            case 1: No matching residue type, fill type: 21, distance: 999, angle: 999
            
            case 2: Only one residue matching type, fill it directly
            
            case 3: More than one residue matching type, pick one with least square difference
            
##### One example of case 3:

For example, we search sample 6 for the residue closest to the 4th neighbor of sample 4:

sample 4: GLY d1 a1, GLY d2 a2, GLY d3 a3, **GLY d4 a4**, GLY d5 a5, PHE d6 a6

sample 6: ALA d7 a7, GLY d8 a8, GLU d9 a9, GLY d10 a10, VAL d11 a11, ILE d12 a12

Sample 6 contains two GLY residues.

The candidate that attains min(√((d4-d8)^2 + (a4-a8)^2), √((d4-d10)^2 + (a4-a10)^2)) is returned.
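
The case 3 rule can be sketched in a few lines. The helper name `closest_candidate` and the numeric values are made up for illustration; only the selection rule comes from the text above.

```python
import math

def closest_candidate(query, candidates):
    """Return the (distance, angle) candidate with the smallest
    Euclidean difference to the query (distance, angle) pair."""
    return min(candidates,
               key=lambda c: math.hypot(c[0] - query[0], c[1] - query[1]))

# Hypothetical numbers: the query GLY and the two GLY candidates in another sample.
query = (6.0, 70.0)          # (d4, a4)
candidates = [(5.0, 117.0),  # (d8, a8)
              (6.3, 74.0)]   # (d10, a10)

best = closest_candidate(query, candidates)
print(best)
```

In the data shown earlier, distances are roughly 4 to 15 while angles go up to about 160, so without rescaling the angle term tends to dominate this comparison.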


In [None]:
%%time
near_neigh_each_protein = []
for i in range(18):                      # i-th neighbor slot (18 neighbors = 54 columns)
    for j in range(2072):                # j-th sample, which holds the query residue
        # Query residue: type, distance and angle of neighbor i in sample j
        res_name = Six_near_neigh.iloc[j, 3*i]
        res_dis = Six_near_neigh.iloc[j, 3*i+1]
        res_angle = Six_near_neigh.iloc[j, 3*i+2]
        one = [i, j, res_name, res_dis, res_angle]
        for k in range(2072):            # k-th sample, searched for a match
            one.append(k)
            one_protein = Six_near_neigh.iloc[k, :]
            # Positions of the type columns in sample k that equal the query type
            match_pos = np.flatnonzero((one_protein == res_name).to_numpy())
            if len(match_pos) == 0:
                # Case 1: no residue of this type, fill placeholders
                one.extend([21, 999, 999])
            elif len(match_pos) == 1:
                # Case 2: a single match, copy its (type, distance, angle)
                p = match_pos[0]
                one.extend(one_protein.iloc[p:p+3].tolist())
            else:
                # Case 3: several matches, keep the one closest in (distance, angle)
                compare = [math.sqrt((one_protein.iloc[p+1] - res_dis)**2 +
                                     (one_protein.iloc[p+2] - res_angle)**2)
                           for p in match_pos]
                p = match_pos[compare.index(min(compare))]
                one.extend(one_protein.iloc[p:p+3].tolist())
        near_neigh_each_protein.append(one)

In [None]:
near_df = pd.DataFrame(near_neigh_each_protein)

#### 2.2 Change columns' name

In [163]:
near_df.columns =\
    ["order inside sample","current sample","current type","current dis","current angle"]+ \
    [str(x) + str(y) for x in range(0,2072) for y in ["th sample"," type"," distance"," angle"]]

Save it to file

In [135]:
near_df.to_csv("/home/yuan/Documents/PTM_k_NN20_test.csv")

##### 2.3 Meaning of column names 

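order inside sample: i-th neighbor of a sample, i = 0, 1, ..., 17 in this run (0 to 5 when only 6 neighbors are kept)

current sample: j-th sample, j = 0, 1, 2, ..., 2071

current type: residue type of the i-th neighbor

current dis, current angle: distance and angle of the i-th neighbor

The remaining columns come in blocks of four per sample k: "kth sample" (the index k), "k type", "k distance", and "k angle" describe the residue in sample k that is closest to the current residue (type 21 and value 999 when there is no match).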
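
The column layout amounts to five leading columns followed by one block of four columns per sample. A small sketch, where `block_columns` is a hypothetical helper that returns the positional indices of the block for sample k:

```python
N_LEADING = 5   # order, current sample, current type, current dis, current angle
BLOCK = 4       # "<k>th sample", "<k> type", "<k> distance", "<k> angle"

def block_columns(k):
    """Positional column indices of the block describing sample k."""
    start = N_LEADING + BLOCK * k
    return {"sample": start, "type": start + 1,
            "distance": start + 2, "angle": start + 3}

print(block_columns(0))
print(block_columns(2071))
```

This is where the `7+4*i` offset used for the distance column in 3.1 and 4.1 comes from.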


In [164]:
near_df.tail()

Unnamed: 0,order inside sample,current sample,current type,current dis,current angle,0th sample,0 type,0 distance,0 angle,1th sample,...,2069 distance,2069 angle,2070th sample,2070 type,2070 distance,2070 angle,2071th sample,2071 type,2071 distance,2071 angle
37291,17,2067,THR,11.23608,142.826211,0,THR,10.013707,96.714191,1,...,5.075596,137.567445,2070,21,999.0,999.0,2071,21,999.0,999.0
37292,17,2068,LEU,9.763929,145.657441,0,LEU,10.734056,82.374686,1,...,10.599439,92.274315,2070,21,999.0,999.0,2071,LEU,8.525135,134.060982
37293,17,2069,ASP,11.167239,100.976467,0,ASP,6.092159,82.986463,1,...,11.167239,100.976467,2070,ASP,5.430008,129.284858,2071,ASP,5.366877,153.427777
37294,17,2070,VAL,10.009726,74.638843,0,VAL,9.214063,75.950254,1,...,8.257221,70.966788,2070,VAL,10.009726,74.638843,2071,VAL,5.889296,93.607127
37295,17,2071,LEU,8.525135,134.060982,0,LEU,10.734056,82.374686,1,...,10.599439,92.274315,2070,21,999.0,999.0,2071,LEU,8.525135,134.060982


### 3.1 Functions for calculating the sum of root of the sum of squared difference 
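#### The second function only wraps the first one for multiprocessing

The first function takes two residues as input, [d1, a1] and [d2, a2].

The values are stacked as [[d1, d2], [a1, a2]] and passed to `preprocessing.normalize` with `axis=0`, which rescales each column, that is each residue's (distance, angle) pair, to unit L2 length.

    A placeholder pair [999, 999] becomes roughly [0.71, 0.71].

Scaling [d1, d2] and [a1, a2] separately, so that [20, 999] becomes roughly [0.02, 1.00], would require `axis=1` instead.

The function returns √((d1 - d2)^2 + (a1 - a2)^2), computed on the normalized values.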

In [165]:
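def norm_distance(ang_dis_1, ang_dis_2):
    # Each argument is a (distance, angle) pair.
    d1, a1 = ang_dis_1
    d2, a2 = ang_dis_2
    # axis=0 rescales each column of [[d1, d2], [a1, a2]],
    # i.e. each residue's (distance, angle) pair, to unit L2 length.
    norm_array = preprocessing.normalize([[d1, d2], [a1, a2]], axis=0)
    # Euclidean distance between the two normalized pairs
    total_dis = math.sqrt((norm_array[0][0] - norm_array[0][1])**2 +
                          (norm_array[1][0] - norm_array[1][1])**2)
    return total_dis

from multiprocessing import Pool
def wrap_nd(i):
    # Distance between the current residue of row j and its match in sample i.
    # `j` is a global set by the loop in 3.2; columns 7+4*i and 8+4*i hold
    # the matched distance and angle for sample i.
    return norm_distance(near_df.iloc[j, 3:5], near_df.iloc[j, (7+4*i):(7+4*i+2)])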
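
To make the scaling step concrete, the sketch below re-implements column-wise and row-wise L2 normalization in plain Python for a 2x2 input. It only illustrates what `axis=0` and `axis=1` mean for `preprocessing.normalize`; it is not a replacement for it, and the helper names and example values are assumptions.

```python
import math

def l2_normalize_columns(m):
    """Scale each column of a 2x2 list to unit L2 length (like axis=0).
    A zero norm is replaced by 1 to avoid division by zero."""
    norms = [math.hypot(m[0][c], m[1][c]) or 1.0 for c in range(2)]
    return [[m[r][c] / norms[c] for c in range(2)] for r in range(2)]

def l2_normalize_rows(m):
    """Scale each row of a 2x2 list to unit L2 length (like axis=1)."""
    return [[v / (math.hypot(*row) or 1.0) for v in row] for row in m]

# Rows are [d1, d2] and [a1, a2]; the second residue is the 999 placeholder.
m = [[20.0, 999.0],
     [120.0, 999.0]]

by_col = l2_normalize_columns(m)   # each residue's (d, a) pair becomes a unit vector
by_row = l2_normalize_rows(m)      # [d1, d2] and [a1, a2] are scaled separately

print([round(v, 3) for v in (by_col[0][1], by_col[1][1])])  # [0.707, 0.707]
print([round(v, 3) for v in by_row[0]])                      # [0.02, 1.0]
```

With column-wise scaling only the ratio of distance to angle of each residue survives, so two residues with the same ratio have distance 0 regardless of their magnitudes.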

### 3.2 Calculate the sum of the distances from one residue to its closest residue in each sample

For each residue, we have its own distance and angle, denoted dis and ang.

We also have its closest residue in each sample: d1 a1, d2 a2, ..., d2071 a2071, d2072 a2072.

sum = norm_distance(dis ang, d1 a1) + norm_distance(dis ang, d2 a2) + ... + norm_distance(dis ang, d2072 a2072)

In [166]:
del near_neigh_each_protein

In [135]:
# near_df.iloc[48302:48305,2] = '21'
# near_df.iloc[48302:48305,3:5] = 0
# near_df.iloc[49338:49340,2] = '21'
# near_df.iloc[49338:49340,3:5] = 0

In [167]:
%%time
total_dis = list()
for j in range(2072*18):
    one_res = 0
    with Pool(processes=8) as pool:
        for i in pool.imap_unordered(wrap_nd, range(2072)):
            one_res=i+one_res
    total_dis.append(one_res)

CPU times: user 44min 47s, sys: 3h 17min 35s, total: 4h 2min 23s
Wall time: 21h 21min 31s
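
The run time above is likely dominated by creating a process pool for every row and by element-wise `iloc` access. As a possible alternative, the same sum can be computed with NumPy array operations. This is a sketch under the assumption that `axis=0` normalization of the 2x2 input is equivalent to scaling each (distance, angle) pair to unit length and that no pair is (0, 0); `total_normalized_distance` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def total_normalized_distance(query, matches):
    """Sum of distances between the unit vector of `query` (shape (2,))
    and the unit vectors of each row of `matches` (shape (n, 2))."""
    q = np.asarray(query, dtype=float)
    m = np.asarray(matches, dtype=float)
    q_unit = q / np.linalg.norm(q)
    m_unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return float(np.linalg.norm(m_unit - q_unit, axis=1).sum())

# Hypothetical (distance, angle) pairs; the last one is the 999 placeholder.
query = [5.0, 120.0]
matches = [[5.0, 120.0], [10.0, 60.0], [999.0, 999.0]]
total = total_normalized_distance(query, matches)
print(round(total, 4))
```

For row `j` of `near_df`, `query` would correspond to `near_df.iloc[j, 3:5]` and `matches` to the distance columns `7::4` and angle columns `8::4` stacked side by side.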


In [169]:
len(total_dis)

37296

total_dis_np = np.array(total_dis)
sort_index = np.argsort(total_dis_np)

In [170]:
near_df_copy = near_df.copy(deep=True)
near_df_copy["total_dis"] = total_dis

In [171]:
near_df_copy.iloc[0:5,[0,1,2,3,4,5,8291,8292,8293]]

Unnamed: 0,order inside sample,current sample,current type,current dis,current angle,0th sample,2071 distance,2071 angle,total_dis
0,0,0,LYS,4.632685,119.969047,0,999.0,999.0,585.728745
1,0,1,ILE,5.740031,130.301395,0,7.714107,92.158072,613.255853
2,0,2,GLY,5.4157,90.022171,0,8.071721,82.349489,449.459475
3,0,3,GLN,7.995939,75.280062,0,999.0,999.0,801.51797
4,0,4,GLY,4.55808,144.017527,0,4.683273,119.941672,491.473987


### 4.1 Clustering
##### Some residue types occur in only a few rows, so the number of clusters per type is capped at 3

In [172]:
cluster_name = []
cluster_ave_dis = []
# Positions of every "<k> distance" (7+4k) and "<k> angle" (8+4k) column
col_index = [(7+4*i) for i in range(2072)] + [(7+4*i+1) for i in range(2072)]
col_index.sort()
cluster_near_df = pd.DataFrame()
for res in near_df["current type"].unique():
    # Cluster all residues of one type on their matched distances and angles
    one_res = near_df_copy[near_df_copy["current type"]==res].copy(deep=True)
    kmeans = KMeans(n_clusters=3, random_state=0).fit(one_res.iloc[:,col_index])
    one_res["Kmeans_cluster"] = kmeans.labels_
    cluster_near_df = pd.concat([cluster_near_df,one_res])
    for i in range(3):
        # Mean total_dis of cluster i, labeled e.g. "LEU2"
        cluster_name.append(res+str(i))
        in_cluster = one_res[one_res["Kmeans_cluster"]==i]["total_dis"]
        average_dis = sum(in_cluster)/len(in_cluster)
        cluster_ave_dis.append(average_dis)
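
The bookkeeping in the loop above, one average `total_dis` per (residue type, cluster) label, can be illustrated without KMeans. In this sketch the cluster labels and distances are invented and `average_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict

def average_by_cluster(rows):
    """rows: iterable of (residue_type, cluster_label, total_dis).
    Returns {"<type><label>": mean total_dis}, e.g. {"LEU2": 410.0}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for res_type, label, total in rows:
        key = f"{res_type}{label}"
        sums[key] += total
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Invented values for illustration only.
rows = [("LEU", 2, 400.0), ("LEU", 2, 420.0), ("LEU", 0, 600.0), ("GLY", 1, 500.0)]
averages = average_by_cluster(rows)
print(averages)
```

In pandas, the same per-cluster mean could be obtained with `groupby(["current type", "Kmeans_cluster"])["total_dis"].mean()`.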

In [173]:
cluster_near_df = cluster_near_df.sort_index()

In [174]:
cluster_near_df_sort = cluster_near_df.copy(deep=True)

In [175]:
cluster_near_df_sort.iloc[0:5,[0,1,2,3,4,5,8291,8292,8293,8294]]

Unnamed: 0,order inside sample,current sample,current type,current dis,current angle,0th sample,2071 distance,2071 angle,total_dis,Kmeans_cluster
0,0,0,LYS,4.632685,119.969047,0,999.0,999.0,585.728745,0
1,0,1,ILE,5.740031,130.301395,0,7.714107,92.158072,613.255853,1
2,0,2,GLY,5.4157,90.022171,0,8.071721,82.349489,449.459475,0
3,0,3,GLN,7.995939,75.280062,0,999.0,999.0,801.51797,1
4,0,4,GLY,4.55808,144.017527,0,4.683273,119.941672,491.473987,2


### 5.1 Sort by total distance of each sample (lowest, first) and return the first 6 clusters

In [176]:
cluster_near_df_sort.sort_values(by=["total_dis"]).loc[:,["current type","Kmeans_cluster"]].iloc[0:18,]

Unnamed: 0,current type,Kmeans_cluster
24438,TRP,0
31938,TRP,0
30620,LEU,2
25440,LEU,2
34228,LEU,2
14995,LEU,2
31527,LEU,2
25404,LEU,2
27343,LEU,2
20726,LEU,2


### 5.2 Sort by total distance of each sample (lowest, first) and return the first 6 unique clusters

In [177]:
cluster_near_df_sort.sort_values(by=["total_dis"]).loc[:,["current type","Kmeans_cluster"]].drop_duplicates().iloc[0:24,]

Unnamed: 0,current type,Kmeans_cluster
24438,TRP,0
30620,LEU,2
19481,LEU,0
20494,LEU,1
19494,CYS,0
22115,ALA,0
31284,ALA,1
33504,ALA,2
24405,GLY,0
31975,GLY,2


In [178]:
temp = cluster_near_df_sort.sort_values(by=["total_dis"]).loc[:,["current type","Kmeans_cluster"]].drop_duplicates().iloc[0:24,]
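
The combination of `sort_values` and `drop_duplicates` keeps, for each (type, cluster) pair, only its best-ranked row. A plain-Python sketch of the same idea, with invented rows and a hypothetical helper `top_unique_clusters`:

```python
def top_unique_clusters(rows, n):
    """rows: iterable of (residue_type, cluster_label, total_dis).
    Sort by total_dis ascending and keep the first n distinct
    (residue_type, cluster_label) pairs in that order."""
    seen = []
    for res_type, label, _ in sorted(rows, key=lambda r: r[2]):
        if len(seen) == n:
            break
        if (res_type, label) not in seen:
            seen.append((res_type, label))
    return seen

# Invented values for illustration only.
rows = [("LEU", 2, 310.0), ("TRP", 0, 300.0), ("TRP", 0, 305.0),
        ("LEU", 0, 330.0), ("LEU", 2, 320.0), ("CYS", 0, 350.0)]
top = top_unique_clusters(rows, 3)
print(top)
```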

### 6.1 Generate features according to clustering importance

In [187]:
final_feature = []
important_cluster = [str(temp.iloc[i,0]) + str(temp.iloc[i,1]) for i in range(24)]
for i in range(2072):
    one_protein = []
    one_data = cluster_near_df[cluster_near_df["current sample"]==i].copy(deep=True)
    one_data["cluster"] = one_data["current type"] + one_data["Kmeans_cluster"].astype('str')
    for key_cluster in important_cluster:
        if one_data[one_data["cluster"]==key_cluster].shape[0] == 0:
            one_protein.extend(["unknown",999,999])
        else:
            one_protein.extend(list(one_data[one_data["cluster"]==key_cluster].sample(1).iloc[:,2:5].values[0]))
    final_feature.append(one_protein)
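
The feature construction can be sketched for a single sample as follows. The data and the helper `features_for_sample` are hypothetical; the rule is the one in the cell above: one (type, distance, angle) triple per important cluster, or the placeholder when the sample has no residue in that cluster.

```python
import random

PLACEHOLDER = ["unknown", 999, 999]

def features_for_sample(residues, important_clusters, rng):
    """residues: list of (cluster_key, type, dist, angle) for one sample.
    For each important cluster, emit one member's (type, dist, angle),
    chosen at random, or the placeholder when the sample has no member."""
    features = []
    for key in important_clusters:
        members = [r[1:] for r in residues if r[0] == key]
        features.extend(rng.choice(members) if members else PLACEHOLDER)
    return features

rng = random.Random(0)  # fixed seed so the random pick is repeatable
residues = [("LEU2", "LEU", 6.4, 88.7), ("LYS0", "LYS", 4.6, 120.0)]
feats = features_for_sample(residues, ["TRP0", "LEU2", "LYS0"], rng)
print(feats)
```

The notebook cell calls `.sample(1)` without a `random_state`, so repeated runs can pick different members whenever a sample has several residues in the same cluster.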

#### The final features for training and prediction are shown below

In [188]:
pd.DataFrame(final_feature).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,62,63,64,65,66,67,68,69,70,71
0,unknown,999.0,999.0,LEU,10.734056,82.374686,unknown,999.0,999.0,unknown,...,82.986463,LYS,11.193563,50.741661,unknown,999.0,999.0,LYS,4.632685,119.969047
1,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0,unknown,...,999.0,LYS,9.703662,61.074339,unknown,999.0,999.0,LYS,11.183988,121.029133
2,unknown,999.0,999.0,unknown,999.0,999.0,LEU,5.426839,109.033123,unknown,...,999.0,LYS,9.999034,63.564215,unknown,999.0,999.0,unknown,999.0,999.0
3,unknown,999.0,999.0,unknown,999.0,999.0,LEU,13.228336,102.52607,LEU,...,999.0,LYS,11.74087,62.400684,unknown,999.0,999.0,unknown,999.0,999.0
4,unknown,999.0,999.0,LEU,6.409839,88.749214,LEU,7.637948,127.114833,unknown,...,99.936539,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0


In [189]:
pd.DataFrame(final_feature).to_csv("/home/yuan/Documents/PTM_k_NN_order_samples_18_24.csv")

### 7.1 Prediction results

##### 6 neighbors - three clusters in each residue type

|Classifier|Accuracy|ROC|Precision|Recall|
|---|---|---|---|---|
|decision_tree|0.5439|0.5664|0.5356|0.6949|
|naive_gaussian|0.5381|0.5264|0.5236|0.8514|
|K_neighbor|0.5183|0.5314|0.5191|0.5264|
|bagging_knn|0.5145|0.5330|0.5036|0.5160|
|bagging_tree|0.5304|0.5756|0.5354|0.7123|
|random_forest|0.5449|0.5810|0.5469|0.7037|
|adaboost|0.5391|0.5437|0.5291|0.7007|
|gradient_boost|0.5362|0.5383|0.5260|0.7182|
|svm|0.5063|0.5008|0.5037|0.6990|


##### 12 neighbors - three clusters in each residue type

|Classifier|Accuracy|ROC|Precision|Recall|
|---|---|---|---|---|
|decision_tree|0.5444|0.5631|0.5396|0.5508|
|naive_gaussian|0.5038|0.5192|0.5014|0.7421|
|K_neighbor|0.5429|0.5376|0.5436|0.5530|
|bagging_knn|0.5357|0.5433|0.5318|0.5562|
|bagging_tree|0.5400|0.5693|0.5422|0.5867|
|random_forest|0.5516|0.5909|0.5423|0.5982|
|adaboost|0.5319|0.5319|0.5305|0.4854|
|gradient_boost|0.5044|0.5200|0.5068|0.4304|
|svm|0.5589|0.5590|0.5441|0.7054|

##### 18 neighbors - three clusters in each residue type

|Classifier|Accuracy|ROC|Precision|Recall|
|---|---|---|---|---|
|decision_tree|0.5575|0.5584|0.5655|0.5155|
|naive_gaussian|0.5381|0.5525|0.5331|0.6197|
|K_neighbor|0.5396|0.5408|0.5392|0.5531|
|bagging_knn|0.5294|0.5791|0.5486|0.5578|
|bagging_tree|0.5657|0.5861|0.5384|0.6293|
|random_forest|0.5637|0.6262|0.5720|0.6157|
|adaboost|0.5425|0.5612|0.5433|0.5327|
|gradient_boost|0.5275|0.5540|0.5283|0.5192|
|svm|0.5763|0.6104|0.5417|0.9961|
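
For reading the results above, the three count-based metrics relate to the confusion matrix as sketched below. The counts are invented for illustration and are not taken from the experiments; they show how a recall close to 1 can coexist with an accuracy close to 0.5 on balanced data, a pattern similar to the `svm` result for 18 neighbors.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical counts on a balanced set of 200 positives and 200 negatives:
# a classifier that labels nearly everything as positive.
acc, prec, rec = classification_metrics(tp=198, fp=180, tn=20, fn=2)
print(round(acc, 4), round(prec, 4), round(rec, 4))  # 0.545 0.5238 0.99
```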