# Outline
#### 1.1 Read residues and neighbors
#### 1.2 First 6 neighbors selected into subsequent analysis
#### 2.1 For neighbor, find a most similar neighbor in each sample 
#### 2.2 Change columns' name
#### 2.3 Meaning of column names 
#### 3.1 Functions for calculating the sum of root of the sum of squared difference 
#### 3.2 The below is to calcuate the summation of the distance of one residue to the closest residue in each sample 
#### 4.1 Clustering 
#### 5.1 Sort by total distance of each sample (lowest, first) and return the first 6 clusters
#### 5.2 Sort by total distance of each sample (lowest, first) and return the first 6  unique clusters
#### 6.1 Generate features accroding to clustering importance
#### 7.1 Prediction results

### Load packges

In [2]:
import pandas as pd
import math
import numpy as np
from sklearn import preprocessing
from sklearn.cluster import KMeans

### 1.1 Read residues and neighbors
For each K residue, up to 20 neighbors have been recorded and loaded 

Note: These K residues are positive or negative

In [8]:
Twenty_near_neigh = pd.read_csv("/home/yuan/Documents/PTM_k_NN24.csv").iloc[:,2:]

In [9]:
Twenty_near_neigh.iloc[4:9,0:18]

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
4,GLY,4.55808,144.017527,GLY,5.496251,84.293718,GLY,6.017002,71.365163,GLY,6.027032,125.744274,GLY,6.125764,133.60281,PHE,6.233976,86.603417
5,SER,5.382942,100.72826,ALA,6.947727,96.593505,ALA,7.038957,158.572218,VAL,7.055596,69.321587,PRO,7.472816,174.507082,LYS,8.033167,104.791887
6,GLN,4.664383,112.660623,GLY,4.964091,117.424241,GLY,6.261041,74.416412,SER,6.696716,103.0122,ALA,8.110972,37.400155,VAL,8.139168,34.274699
7,ILE,6.373498,105.584257,GLY,6.587568,72.583166,GLU,7.253496,75.290607,LYS,7.614024,106.538132,LEU,8.455212,49.066449,VAL,9.477051,69.560937
8,ALA,5.967033,140.713607,ALA,6.222193,111.021387,VAL,7.174093,73.843222,GLU,7.506627,71.196988,ALA,7.596534,139.716105,ILE,7.876493,107.489917


Dimensions of the dataset

2072: 2072 residues (pos: neg ~ 1:1)

60: 20 neighbors x 3 (neighbor type, distance to the residue, and angle to the residue)

In [10]:
Twenty_near_neigh.shape

(2072, 72)

#### 1.2 First 6 neighbors selected into subsequent analysis

In [11]:
Six_near_neigh = Twenty_near_neigh.iloc[:,0:72]

### 2.1 For neighbor, find a most similar neighbor in each sample 
##### Three layers for loops nested 

Loop through each neighbor (1-6)
    
    Loop through each sample
        
        For each residue, find one similar neighbor in each sample
            
            case 1: No matching residue type, fill type: 21, distance: 999, angle: 999
            
            case 2: Only one residue matching type, fill it directly
            
            case 3: More than one residue matching type, pick one with least square difference
            
##### One example in case 3:

For example, we are searching for a closest residue of 4th residue of sample 4 in sample 6: 

sample 4: GLY d1 a1,	GLY d2	a2,	GLY d3 	a3,	**GLY	d4 	a4**,	GLY d5 a5, PHE d6 a6 

sample 6: ALA d7 a7, GLY d8 a8, GLU d9 a9, GLY d10 a10, VAL d11 a11, ILE d12 a12

Two GLY residues in sample 6.

min(√((d4-d8)^2 + (a4-a8)^2)  , √((d4-d10)^2 + (a4-a10)^2) ) will return


In [12]:
%%time
near_neigh_each_protein = []    
for i in range(24):   
    for j in range(2072):
        one = []
        one.append(i)
        one.append(j)
        res_name = Six_near_neigh.iloc[j,3*i]
        res_dis = Six_near_neigh.iloc[j,3*i+1]
        res_angle = Six_near_neigh.iloc[j,3*i+2]
        one.append(res_name)
        one.append(res_dis)
        one.append(res_angle)
        for k in range(2072):
            one.append(k)
            one_protein = Six_near_neigh.iloc[k,:]
            if len(one_protein[one_protein==res_name].index) == 0:
                one.append(21)
                one.append(999)
                one.append(999)
            elif len(one_protein[one_protein==res_name].index) == 1:
                for index in one_protein[one_protein==res_name].index:
                    index = int(index)
                    one.append(one_protein[index-1])
                    one.append(one_protein[index])
                    one.append(one_protein[index+1])
            else:
                candidate_index = []
                compare = []
                compare_candidate = []
                for index in one_protein[one_protein==res_name].index:
                    index = int(index)
                    candidate_index.append(index)
                    compare.append(math.sqrt((one_protein[index]-res_dis)**2 + (one_protein[index+1]-res_angle)**2))
                    compare_candidate.append(one_protein[index-1])
                    compare_candidate.append(one_protein[index])
                    compare_candidate.append(one_protein[index+1])
                min_index = compare.index(min(compare))
                one.append(compare_candidate[min_index*3])
                one.append(compare_candidate[min_index*3+1])
                one.append(compare_candidate[min_index*3+2])
        near_neigh_each_protein.append(one)

CPU times: user 1d 10h 4min 45s, sys: 2min 47s, total: 1d 10h 7min 33s
Wall time: 1d 10h 7min 34s


In [13]:
near_df = pd.DataFrame(near_neigh_each_protein)

2.2 Change columns' name

In [14]:
near_df.columns =\
    ["order inside sample","current sample","current type","current dis","current angle"]+ \
    [str(x) + str(y) for x in range(0,2072) for y in ["th sample"," type"," distance"," angle"]]

Save it to file

In [135]:
near_df.to_csv("/home/yuan/Documents/PTM_k_NN20_test.csv")

##### 2.3 Meaning of column names 

order inside sample: i-th neighbor of a sample, i = 0,1,2,3,4,5

current sample: j-th sample, j = 0,1,2,3......,2071

current type: residue type of i-th neighbor

Then, the closest residue's information of one residue in each sample


In [362]:
near_df

Unnamed: 0,order inside sample,current sample,current type,current dis,current angle,0th sample,0 type,0 distance,0 angle,1th sample,...,2069 distance,2069 angle,2070th sample,2070 type,2070 distance,2070 angle,2071th sample,2071 type,2071 distance,2071 angle
0,0,0,LYS,4.632685,119.969047,0,LYS,4.632685,119.969047,1,...,999.000000,999.000000,2070,21,999.000000,999.000000,2071,21,999.000000,999.000000
1,0,1,ILE,5.740031,130.301395,0,ILE,7.797483,68.099175,1,...,999.000000,999.000000,2070,21,999.000000,999.000000,2071,21,999.000000,999.000000
2,0,2,GLY,5.415700,90.022171,0,GLY,7.562873,88.182333,1,...,999.000000,999.000000,2070,21,999.000000,999.000000,2071,GLY,4.966696,102.169358
3,0,3,GLN,7.995939,75.280062,0,21,999.000000,999.000000,1,...,8.122120,75.637151,2070,21,999.000000,999.000000,2071,21,999.000000,999.000000
4,0,4,GLY,4.558080,144.017527,0,GLY,7.562873,88.182333,1,...,999.000000,999.000000,2070,21,999.000000,999.000000,2071,GLY,4.683273,119.941672
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12427,5,2067,THR,6.882423,71.300467,0,THR,8.190358,25.060192,1,...,4.937804,119.508384,2070,21,999.000000,999.000000,2071,21,999.000000,999.000000
12428,5,2068,LYS,6.915732,168.183051,0,LYS,4.632685,119.969047,1,...,999.000000,999.000000,2070,21,999.000000,999.000000,2071,21,999.000000,999.000000
12429,5,2069,GLN,8.122120,75.637151,0,21,999.000000,999.000000,1,...,8.122120,75.637151,2070,21,999.000000,999.000000,2071,21,999.000000,999.000000
12430,5,2070,ALA,6.970045,69.479082,0,21,999.000000,999.000000,1,...,999.000000,999.000000,2070,ALA,6.970045,69.479082,2071,ALA,5.640251,85.161425


### 3.1 Functions for calculating the sum of root of the sum of squared difference 
#### second function is just wrap up the first function for purpose of multi-processing

The first function takes two residues as input [d1, a1] [d2, a2]

Normalize [d1, d2] [a1, a2] into the unit ([0,1])

    [20,999] will be normalized into something like [0.01,0.99]
   
The function will return √((d1 - d2)^2 + (a1 - a2)^2)

In [15]:
def norm_distance(ang_dis_1,ang_dis_2):
    # import math
    # import numpy as np
    # from sklearn import preprocessing
    norm_array = preprocessing.normalize([[ang_dis_1[0],ang_dis_2[0]],[ang_dis_1[1],ang_dis_2[1]]],axis=0)
    total_dis = math.sqrt((norm_array[0][0] - norm_array[0][1])**2+\
                (norm_array[1][0]- norm_array[1][1])**2)
    return(total_dis)

from multiprocessing import Pool
def wrap_nd(i):
    return(norm_distance(near_df.iloc[j,3:5],near_df.iloc[j,(7+4*i):(7+4*i+1+1)]))

### 3.2 The below is to calcuate the summation of the distance of one residue to the closest residue in each sample 

For each residue, we have its own distance and angle, donating as dis and ang.

We have its closest residue in this sample. d1 a1, d2 a2, ...., d2071 a2071, d2072 a2072

sum = norm_distance(dis ang, d1 a1) + norm_distance(dis ang, d2 a2) + ... + norm_distance(dis ang, d2072 a2072)

In [16]:
%%time
total_dis = list()
for j in range(49704):
    one_res = 0
    with Pool(processes=8) as pool:
        for i in pool.imap_unordered(wrap_nd, range(2071)):
            one_res=i+one_res
    total_dis.append(one_res)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

total_dis_np = np.array(total_dis)
sort_index = np.argsort(total_dis_np)

In [385]:
near_df_copy = near_df.copy(deep=True)
near_df_copy["total_dis"] = total_dis

In [392]:
near_df_copy.iloc[0:5,[0,1,2,3,4,5,8291,8292,8293]]

Unnamed: 0,order inside sample,current sample,current type,current dis,current angle,0th sample,2071 distance,2071 angle,total_dis
0,0,0,LYS,4.632685,119.969047,0,999.0,999.0,1090.186672
1,0,1,ILE,5.740031,130.301395,0,999.0,999.0,1089.139844
2,0,2,GLY,5.4157,90.022171,0,4.966696,102.169358,886.660775
3,0,3,GLN,7.995939,75.280062,0,999.0,999.0,1117.089191
4,0,4,GLY,4.55808,144.017527,0,4.683273,119.941672,935.628631


### 4.1 Clustering 
##### some residues have few samples, not enough clusters (max = 3)

In [284]:
cluster_name = []
cluster_ave_dis = []
col_index = [(7+4*i) for i in range(2071)] + [(7+4*i+1) for i in range(2071)]
col_index.sort()
cluster_near_df = pd.DataFrame()
for res in near_df["current type"].unique():
    one_res = near_df_copy[near_df_copy["current type"]==res].copy(deep=True)
    kmeans = KMeans(n_clusters=3, random_state=0).fit(one_res.iloc[:,col_index])
    one_res["Kmeans_cluster"] = kmeans.labels_
    cluster_near_df = pd.concat([cluster_near_df,one_res])
    for i in range(3):
        cluster_name.append(res+str(i))
        average_dis = sum(one_res[one_res["Kmeans_cluster"]==i]["total_dis"])/\
                        len(one_res[one_res["Kmeans_cluster"]==i]["total_dis"])
        cluster_ave_dis.append(average_dis)

In [373]:
cluster_near_df = cluster_near_df.sort_index()

In [374]:
cluster_near_df_sort = cluster_near_df.copy(deep=True)

In [393]:
cluster_near_df_sort.iloc[0:5,[0,1,2,3,4,5,8291,8292,8293,8294]]

Unnamed: 0,order inside protein,current protein,current type,current dis,current angle,0th protein,2071 distance,2071 angle,total_dis,Kmeans_cluster
0,0,0,LYS,4.632685,119.969047,0,999.0,999.0,3739.116168,1
1,0,1,ILE,5.740031,130.301395,0,999.0,999.0,3747.842381,0
2,0,2,GLY,5.4157,90.022171,0,4.966696,102.169358,3646.288395,2
3,0,3,GLN,7.995939,75.280062,0,999.0,999.0,3844.015506,1
4,0,4,GLY,4.55808,144.017527,0,4.683273,119.941672,3631.922711,0


### 5.1 Sort by total distance of each sample (lowest, first) and return the first 6 clusters

In [394]:
cluster_near_df_sort.sort_values(by=["total_dis"]).loc[:,["current type","Kmeans_cluster"]].iloc[0:6,]

Unnamed: 0,current type,Kmeans_cluster
807,GLU,0
2096,GLU,0
6288,GLU,0
152,GLU,0
3798,GLU,0
2567,GLU,0


### 5.2 Sort by total distance of each sample (lowest, first) and return the first 6  unique clusters

In [351]:
cluster_near_df_sort.sort_values(by=["total_dis"]).loc[:,["current type","Kmeans_cluster"]].drop_duplicates().iloc[0:6,]

Unnamed: 0,current type,Kmeans_cluster
807,GLU,0
1763,GLY,0
1720,GLU,2
4526,ALA,0
1987,GLY,2
304,LEU,1


### 6.1 Generate features accroding to clustering importance

In [400]:
final_feature = []
important_cluster = ['GLU0', 'GLY0', 'GLU2', 'ALA0', 'GLY2', 'LEU1']
for i in range(2072):
    one_protein = []
    one_data = cluster_near_df[cluster_near_df["current protein"]==i].copy(deep=True)
    one_data["cluster"] = one_data["current type"] + one_data["Kmeans_cluster"].astype('str')
    for key_cluster in important_cluster:
        if one_data[one_data["cluster"]==key_cluster].shape[0] == 0:
            one_protein.extend(["unknown",999,999])
        else:
            one_protein.extend(list(one_data[one_data["cluster"]==key_cluster].sample(1).iloc[:,2:5].values[0]))
    final_feature.append(one_protein)

#### The below is final features for training and predicting

In [399]:
pd.DataFrame(final_feature).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0,GLY,7.562873,88.182333,unknown,999.0,999.0
1,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0,ALA,6.028596,114.233626,unknown,999.0,999.0,unknown,999.0,999.0
2,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0,ALA,8.645829,103.517937,GLY,5.4157,90.022171,LEU,5.426839,109.033123
3,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0,unknown,999.0,999.0
4,unknown,999.0,999.0,GLY,6.125764,133.60281,unknown,999.0,999.0,unknown,999.0,999.0,GLY,5.496251,84.293718,unknown,999.0,999.0


In [401]:
pd.DataFrame(final_feature).to_csv("/home/yuan/Documents/PTM_k_NN20_order_samples.csv")

### 7.1 Prediction results

|Classifier|Accuracy|ROC|Precision|Recall|
|---|---|---|---|---|
|decision_tree |accuracy: 0.5439 | roc: 0.5664 | precision: 0.5356 | recall: 0.6949|
|naive_gaussian |accuracy: 0.5381 |roc: 0.5264 |precision: 0.5236| recall: 0.8514|
|K_neighbor |accuracy: 0.5183 |roc: 0.5314 |precision: 0.5191 |recall: 0.5264|
|bagging_knn |accuracy: 0.5145 |roc: 0.5330| precision: 0.5036 |recall: 0.5160|
|bagging_tree |accuracy: 0.5304| roc: 0.5756| precision: 0.5354| recall: 0.7123|
|random_forest |accuracy: 0.5449| roc: 0.5810|precision: 0.5469 |recall: 0.7037|
|adaboost |accuracy: 0.5391 |roc: 0.5437 |precision: 0.5291 |recall: 0.7007|
|gradient_boost |accuracy: 0.5362 |roc: 0.5383 |precision: 0.5260 |recall: 0.7182|
|svm |accuracy: 0.5063| roc: 0.5008 |precision: 0.5037| recall: 0.6990|
