# Task 1. Predicting whether a protein is an enzyme or a non-enzyme

> author: Shizhenkun   
> email: zhenkun.shi@tib.cas.cn   
> date: 2021-05-26  

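## Task overview
Given a protein sequence, this task predicts whether the protein is an enzyme or a non-enzyme. The dataset used is Sprot (Swiss-Prot): models learn from the records in the dataset and then predict, for newly given protein sequences, whether each one is an enzyme or a non-enzyme.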


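## Data statistics
- Data source: Sprot, with 219,227 enzyme records and 226,539 non-enzyme records.
- All records are sorted by time, with ~90% used as the training set and ~10% as the test set; the corresponding cutoff date is December 14, 2009.
- Records dated on or before December 14, 2009 form the training set and later records form the test set. The split statistics are as follows: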





| Items        | Enzyme  | Non-enzyme | Total                              |
| ------------ | ------- | ---------- | ---------------------------------- |
| Training set | 198,692 | 208,391    | 407,083 (407,083/453,212 = 89.82%) |
| Test set     | 20,535  | 25,594     | 46,129 (46,129/453,212 = 10.18%)   |
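
The ~90% cutoff can be located by sorting the record dates and reading off the date at the 90% position. The following is a minimal sketch with made-up dates; the helper name `find_cutoff` and the sample data are illustrative assumptions and are not taken from the notebook.

```python
import datetime

def find_cutoff(dates, train_fraction=0.9):
    """Return the date found at the train_fraction position of the sorted dates."""
    ordered = sorted(dates)
    # Index of the last record that still belongs to the training portion
    idx = max(int(len(ordered) * train_fraction) - 1, 0)
    return ordered[idx]

# Ten hypothetical integration dates, one per year
dates = [datetime.date(2000 + i, 1, 1) for i in range(10)]
cutoff = find_cutoff(dates)

# Same comparison rule as the notebook: <= cutoff is training, > cutoff is test
train_dates = [d for d in dates if d <= cutoff]
test_dates = [d for d in dates if d > cutoff]
print(cutoff, len(train_dates), len(test_dates))
```

When many records share the same date, the split by date can only approximate the requested fraction, which is one possible reason the real split is 89.82% rather than exactly 90%.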






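## Methods

- Homology alignment: build an alignment database from the training set, align the test set against it, take the best hit for each query, and use that hit's enzyme/non-enzyme attribute as the prediction for the test sequence.
- Traditional machine learning methods
- Deep learning methods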



## Results

### Homology alignment

| Methods            | Accuracy                         | Precision                        | Recall                          | F1 |
| ------------------ | -------------------------------- | -------------------------------- | ------------------------------- | -- |
| Homology alignment | 0.6258987956138774 (27855/44504) | 0.8310707998925918 (27855/33517) | 0.753123314758224 (33517/44504) |    |

### Machine learning + one-hot


| baselineName | accuracy | precision (PPV) | NPV | recall | f1 | auroc | auprc | confusion matrix |
| ----------| -----------|---------------| --------- | --------|---------|-----------|---------- |------------------------------------------|
| lr 		|	0.609676 |	0.560377 	 |	0.680254 |0.715023 |0.628324 |	0.661484 |	0.568861 |	 tp: 14683 fp: 11519 fn: 5852 tn: 12450	|
| xg 		|	0.665446 |	0.643591 	 |	0.682740 |0.616168 |0.629581 |	0.742674 |	0.684483 |	 tp: 12653 fp: 7007 fn: 7882 tn: 16962	|
| dt 		|	0.602418 |	0.570965 	 |	0.628129 |0.556562 |0.563671 |	0.599133 |	0.522388 |	 tp: 11429 fp: 8588 fn: 9106 tn: 15381	|
| rf 		|	0.666479 |	0.624934 	 |	0.710044 |0.693255 |0.657324 |	0.745132 |	0.689971 |	 tp: 14236 fp: 8544 fn: 6299 tn: 15425	|
| gbdt  	| 	0.624079 |	0.569351 	 |	0.712026 |0.760604 |0.651226 |	0.695638 |	0.628716 |	 tp: 15619 fp: 11814 fn: 4916 tn: 12155	|
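
The threshold-based columns in the table above (accuracy, PPV, NPV, recall, f1) follow directly from the confusion-matrix counts in the last column. Below is a small sketch of that arithmetic using the `lr` row; the function name `metrics_from_confusion` is an illustrative assumption. AUROC and AUPRC depend on the ranked prediction scores and cannot be recovered from a single confusion matrix.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Derive threshold-based metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # positive predictive value
    recall = tp / (tp + fn)      # true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "npv": tn / (tn + fn),   # negative predictive value
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts copied from the "lr" row of the table
lr = metrics_from_confusion(tp=14683, fp=11519, fn=5852, tn=12450)
for name, value in lr.items():
    print(f"{name}: {value:.6f}")
```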


### Machine learning + UniRep

| baselineName | accuracy | precision (PPV) | NPV | recall | f1 | auroc | auprc | confusion matrix |
| ----------| -----------|---------------| --------- | --------|---------|-----------|---------- |----------------------------------------|
|lr 	    |0.826757 	 |	0.849132 	 |	0.811034 |0.759484 |0.801810 |	0.901500 |	0.892720 |	 tp: 15596 fp: 2771 fn: 4939 tn: 21198|
|xg 	    |0.860260	 |	0.883725 	 |	0.843327 |0.802776 |0.841308 |	0.934751 |	0.927646 |	 tp: 16485 fp: 2169 fn: 4050 tn: 21800|
|dt 	    |0.772290 	 |	0.767612 	 |	0.775916 |0.726418 |0.746447 |	0.769004 |	0.683843 |	 tp: 14917 fp: 4516 fn: 5618 tn: 19453|
|rf 	    |	0.851070 |	0.903031 	 |	0.818172 |0.758705 |0.824600 |	0.932207 |	0.922548 |	 tp: 15580 fp: 1673 fn: 4955 tn: 22296|
|gbdt 		|0.822825 	 |0.864764 		 |0.796054 	 |0.730217 |0.791815 |	0.898601 |	0.889249 |	 tp: 14995 fp: 2345 fn: 5540 tn: 21624|

## 1. Import required packages and define common functions

In [1]:
import numpy as np
import pandas as pd
import random
import time
import gzip
import re
import datetime
import sys
import os
from tqdm import tqdm

from functools import reduce
import matplotlib.pyplot as plt

sys.path.append("../../tools/")
import commontools
import funclib
from pyecharts.globals import CurrentConfig, NotebookType, OnlineHostType
CurrentConfig.ONLINE_HOST = OnlineHostType.NOTEBOOK_HOST
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.faker import Faker
from pyecharts.globals import ThemeType

%load_ext autoreload
%autoreload 2

## 2. Load data

In [69]:
# Read the file; the column names are assumed to be defined as funclib.table_head
enzyme_noemzyme = pd.read_csv('./data/emzyme_noemzyme_data.tsv', sep='\t', names=funclib.table_head)
enzyme_noemzyme = enzyme_noemzyme.reset_index(drop=True)
enzyme_noemzyme.date_integraged = pd.to_datetime(enzyme_noemzyme['date_integraged'])
enzyme_noemzyme.head(2)

Unnamed: 0,id,name,isemzyme,isMultiFunctional,functionCounts,ec_number,ec_specific_level,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength
0,P02802,MT1_MOUSE,False,False,1,-,0,1986-07-21,1986-07-21,2021-04-07,MDPNCSCSTGGSCTCTSSCACKNCKCTSCKKSCCSCCPVGCSKCAQ...,61
1,P61852,SODC_DROSI,True,False,1,1.15.1.1,4,1986-07-21,2007-01-23,2020-12-02,MVVKAVCVINGDAKGTVFFEQESSGTPVKVSGEVCGLAKGLHGFHV...,153


## 3. Precomputed UniRep results

In [None]:
unirep = np.load('./data/emzyme_noemzyme_unirep.npy', allow_pickle=True)

## 3. Split into training and test sets

In [None]:
thres = datetime.datetime(2009, 12, 14, 0, 0)

# Split the data at the cutoff date
test = enzyme_noemzyme[enzyme_noemzyme.date_integraged > thres ].sort_values(by='date_integraged')
train = enzyme_noemzyme[enzyme_noemzyme.date_integraged <= thres ].sort_values(by='date_integraged')

# train.to_csv('./data/train.tsv', sep='\t', columns=['id', 'isemzyme','seq'], index=0)
# test.to_csv('./data/test.tsv', sep='\t', columns=['id', 'isemzyme','seq'], index=0)

funclib.table2fasta(train, './data/train.fasta')
funclib.table2fasta(test, './data/test.fasta')
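
`funclib.table2fasta` is a project helper whose source is not shown in this notebook. The sketch below shows what such a helper might do, assuming each record is written with its `id` as the FASTA header and its `seq` as the body. The function name `table_to_fasta` and the record layout are assumptions for illustration only.

```python
import io

def table_to_fasta(records, handle):
    """Write (id, seq) pairs to an open text handle in FASTA format."""
    for seq_id, seq in records:
        handle.write(f">{seq_id}\n{seq}\n")

# Two made-up records; real ones would come from the `id` and `seq` columns
records = [("P00001", "MKVLA"), ("P00002", "MDPNCS")]
buffer = io.StringIO()
table_to_fasta(records, buffer)
fasta_text = buffer.getvalue()
print(fasta_text)
```

With a DataFrame, the same function could be fed `zip(df['id'], df['seq'])` and a file handle opened for writing.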

In [72]:
train

Unnamed: 0,id,name,isemzyme,isMultiFunctional,functionCounts,ec_number,ec_specific_level,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength
0,P02802,MT1_MOUSE,False,False,1,-,0,1986-07-21,1986-07-21,2021-04-07,MDPNCSCSTGGSCTCTSSCACKNCKCTSCKKSCCSCCPVGCSKCAQ...,61
2301,P02561,TPM4_HORSE,False,False,1,-,0,1986-07-21,2007-01-23,2021-04-07,MAGLNSLEAVKRKIQALQQQADEAEDRAQGLQRELDGERERREKAE...,248
2302,P02635,SCPB_PENSP,False,False,1,-,0,1986-07-21,1986-07-21,2019-12-11,AYSWDNRVKYIVRYMYDIDNDGFLDKNDFECLAVRVTLIEGRGEFS...,192
2303,P00590,CUTI1_FUSVN,True,False,1,3.1.1.74,4,1986-07-21,1986-07-21,2021-04-07,MKFFALTTLLAATASALPTSNPAQELEARQLGRTTRDDLINGNSAS...,230
2304,P0AG66,RS17_SHIFL,False,False,1,-,0,1986-07-21,2007-01-23,2021-04-07,MTDKIRTLQGRVVSDKMEKSIVVAIERFVKHPIYGKFIKRTTKLHV...,84
...,...,...,...,...,...,...,...,...,...,...,...,...
400811,Q2J6D7,KU_FRACC,False,False,1,-,0,2009-11-24,2006-03-07,2020-12-02,MRATWKGVISFGLVSIPVRLYSATQERDVAFHQVRRSDGSRIRYRR...,396
400812,Q9FLU8,BGL32_ARATH,True,False,1,3.2.1.21,4,2009-11-24,2009-11-24,2020-12-02,MAIKLIALVITICVASWDSAQGRSLRFSTTPLNRYSFPPHFDFGVA...,534
400813,Q7SXM0,HDAC8_DANRE,True,False,1,3.5.1.98,4,2009-11-24,2003-10-01,2021-04-07,MSEKSDSNDDKSRTRSVVYVYSPEYIQTCDSLSKVPNRASMVHSLI...,378
400805,O48779,BGL33_ARATH,True,False,1,3.2.1.21,4,2009-11-24,1998-06-01,2020-12-02,MATATLTLFLGLLALTSTILSFNADARPQPSDEDLGTIIGPHQTSF...,614


## 4. Binary classification
### 4.1 Homology alignment

In [33]:
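def getblast(train, test):
    """Align the test sequences against the training set with DIAMOND and return the best hit per query."""
    # Write both sets to temporary FASTA files
    funclib.table2fasta(train, '/tmp/train.fasta')
    funclib.table2fasta(test, '/tmp/test.fasta')

    # Build the DIAMOND database, then run blastp keeping only the top hit (-k 1)
    cmd1 = r'diamond makedb --in /tmp/train.fasta -d /tmp/train.dmnd'
    cmd2 = r'diamond blastp -d /tmp/train.dmnd  -q  /tmp/test.fasta -o /tmp/test_fasta_results.tsv -b5 -c1 -k 1'
    # Caution: this deletes every .fasta, .dmnd and .tsv file under /tmp
    cmd3 = r'rm -rf /tmp/*.fasta /tmp/*.dmnd /tmp/*.tsv'
    print(cmd1)
    os.system(cmd1)
    print(cmd2)
    os.system(cmd2)
    # Default DIAMOND tabular output: 12 BLAST-style columns
    res_data = pd.read_csv('/tmp/test_fasta_results.tsv', sep='\t', names=['id', 'sseqid', 'pident', 'length','mismatch','gapopen','qstart','qend','sstart','send','evalue','bitscore'])
    os.system(cmd3)
    return res_data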

In [32]:
res_data=getblast(train,test)

Write finished
Write finished
diamond makedb --in /tmp/train.fasta -d /tmp/train.dmnd
diamond blastp -d /tmp/train.dmnd  -q  /tmp/test.fasta -o /tmp/test_fasta_results.tsv -b5 -c1 -k 1
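
`os.system` does not raise when a command fails, so a failed `diamond` run would only surface later as a missing results file. One hedged alternative is to build the same two commands as argument lists and run them with `subprocess.run(..., check=True)` inside a private temporary directory, which also avoids the broad `rm -rf /tmp/*` cleanup. The helper names `diamond_commands` and `run_diamond` are assumptions; the flags mirror the ones used above.

```python
import os
import subprocess
import tempfile

def diamond_commands(workdir):
    """Return the makedb and blastp commands as argument lists."""
    train_fa = os.path.join(workdir, "train.fasta")
    test_fa = os.path.join(workdir, "test.fasta")
    db = os.path.join(workdir, "train.dmnd")
    out = os.path.join(workdir, "test_fasta_results.tsv")
    makedb = ["diamond", "makedb", "--in", train_fa, "-d", db]
    blastp = ["diamond", "blastp", "-d", db, "-q", test_fa,
              "-o", out, "-b5", "-c1", "-k", "1"]
    return makedb, blastp

def run_diamond(workdir):
    """Run both commands; raises CalledProcessError if either one fails."""
    for cmd in diamond_commands(workdir):
        subprocess.run(cmd, check=True)

with tempfile.TemporaryDirectory() as workdir:
    makedb, blastp = diamond_commands(workdir)
    print(" ".join(makedb))
    print(" ".join(blastp))
    # run_diamond(workdir)  # needs DIAMOND on PATH and the FASTA files in workdir
```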


In [38]:
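# Match each test query with its best hit
data_match = pd.merge(test, res_data, on=['id'], how='inner')

# Attach the EC number of each hit, iterating in data_match order so rows stay aligned
counter = 0
resArray = []
for sseqid in tqdm(data_match['sseqid']):
    counter += 1
    if counter % 1000 == 0:
        print(counter)  # progress report every 1,000 records
    mn = train[train['id'] == sseqid]['ec_number'].values
    resArray.append(mn[0])
data_match['sresults_ec'] = resArray
data_match.head(3)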

1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000


Unnamed: 0,id,name,isemzyme,isMultiFunctional,functionCounts,ec_number,ec_specific_level,date_integraged,date_sequence_update,date_annotation_update,...,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore,sresults_ec
0,A9WBQ9,ACCD_CHLAA,True,False,1,2.1.3.15,4,2009-12-15,2008-02-05,2020-12-02,...,281,138,4,3,281,5,283,6.09e-89,270.0,2.1.3.15
1,A7HZV8,NUOK_CAMHC,True,False,1,7.1.1.-,3,2009-12-15,2007-09-11,2020-12-02,...,100,60,1,1,99,1,100,8.58e-21,82.0,7.1.1.-
2,Q2J0E4,NUOK_RHOP2,True,False,1,7.1.1.-,3,2009-12-15,2006-03-07,2020-12-02,...,52,26,0,10,61,8,59,3.97e-08,49.7,7.1.1.-
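
Filtering the training table once per hit scans the whole table on every iteration. A possibly faster way to attach the EC number of each hit is to build an id-to-EC lookup once and map the `sseqid` column through it. The sketch below uses toy tables (`train_toy`, `hits_toy` are made-up names and data) and assumes that ids are unique in the training table.

```python
import pandas as pd

# Toy stand-ins for the training table and the merged hit table
train_toy = pd.DataFrame({"id": ["A", "B", "C"],
                          "ec_number": ["1.1.1.1", "-", "3.2.1.21"]})
hits_toy = pd.DataFrame({"id": ["Q1", "Q2", "Q3"],
                         "sseqid": ["C", "A", "C"]})

# Build the id -> EC number lookup once, then map every hit through it
ec_by_id = train_toy.set_index("id")["ec_number"]
hits_toy["sresults_ec"] = hits_toy["sseqid"].map(ec_by_id)
print(hits_toy)
```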


In [39]:
# Compute metrics
data_match['iscorrect'] = data_match['ec_number'] == data_match['sresults_ec']  # whether the EC numbers agree
correct = sum(data_match['iscorrect'])
find = len(data_match)
total = len(test)
print('Total query records are: {0}'.format(total))
print('Matched records are: {0}'.format(find))
print('Accuracy: {0}({1}/{2})'.format(correct/total, correct, total))
print('Precision: {0}({1}/{2})'.format(correct/find, correct, find))
# "Recall" here is the fraction of queries that received a hit
print('Recall: {0}({1}/{2})'.format(find/total, find, total))

Total query records are: 44504
Matched records are: 33517
Accuracy: 0.6258987956138774(27855/44504)
Precision: 0.8310707998925918(27855/33517)
Recall: 0.753123314758224(33517/44504)
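
In this cell the value printed as recall is `find/total`, the share of queries that received any hit, so it behaves like a coverage figure. Under these definitions, accuracy is the product of precision and coverage. The sketch below checks that relationship with the counts printed above and shows how an F1-style harmonic mean could be formed from the two ratios; the variable names `coverage` and `f1_style` are assumptions, and this score is not reported in the original results.

```python
correct, find, total = 27855, 33517, 44504

precision = correct / find   # correct hits among queries that received a hit
coverage = find / total      # fraction of queries that received any hit
accuracy = correct / total   # correct hits among all queries

# accuracy factorizes into precision times coverage
assert abs(accuracy - precision * coverage) < 1e-12

# Harmonic mean of the two ratios, an F1-style score under these definitions
f1_style = 2 * precision * coverage / (precision + coverage)
print(f"precision={precision:.4f} coverage={coverage:.4f} "
      f"accuracy={accuracy:.4f} f1_style={f1_style:.4f}")
```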


In [80]:
sprot_exp.seqlength.describe()

count    529270.000000
mean        328.990479
std         195.433691
min          60.000000
25%         177.000000
50%         293.000000
75%         435.000000
max        1000.000000
Name: seqlength, dtype: float64

### 4.2 Machine learning methods

In [40]:
trainset = train[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)
testset = test[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)

MAX_SEQ_LENGTH = 1000  # maximum sequence length
# Truncate longer sequences and right-pad shorter ones with 'X'
trainset.seq = trainset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
testset.seq = testset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
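
The slice-then-`ljust` idiom brings every sequence to the same length. A tiny sketch with a short maximum length makes the two cases visible; `MAX_LEN = 8` and the function name `pad_or_truncate` are illustrative assumptions.

```python
MAX_LEN = 8  # small illustrative value; the notebook uses 1000

def pad_or_truncate(seq, max_len=MAX_LEN, pad_char="X"):
    """Cut seq to max_len residues, then right-pad with pad_char."""
    return seq[:max_len].ljust(max_len, pad_char)

short = pad_or_truncate("MDPNC")
long_ = pad_or_truncate("MVVKAVCVINGDAK")
print(short)  # MDPNCXXX
print(long_)  # MVVKAVCV
```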

In [41]:
f_train = funclib.dna_onehot(trainset)  # encode the training set
f_test = funclib.dna_onehot(testset)  # encode the test set

train_full = pd.concat([trainset, f_train], axis=1, join='inner')  # join the training set with its encoding
test_full = pd.concat([testset, f_test], axis=1, join='inner')  # join the test set with its encoding

X_train = train_full.iloc[:,4:]
X_test = test_full.iloc[:,4:]
Y_train = train_full.isemzyme.astype('int')
Y_test = test_full.isemzyme.astype('int')

X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train)
Y_test = np.array(Y_test)
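
The implementation of `funclib.dna_onehot` is not shown here. As a rough sketch of what one-hot encoding a padded protein sequence could look like, the example below assumes an alphabet of the 20 standard amino acids plus `X` for padding or unknown residues. The alphabet, the names `ALPHABET`, `INDEX` and `onehot`, and the flattening into a single vector are all assumptions.

```python
import numpy as np

# 20 standard amino acids plus 'X' for padding/unknown (assumed alphabet)
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def onehot(seq):
    """Return a flat 0/1 vector of length len(seq) * len(ALPHABET)."""
    mat = np.zeros((len(seq), len(ALPHABET)), dtype=np.int8)
    for pos, aa in enumerate(seq):
        # Residues outside the alphabet are folded into 'X'
        mat[pos, INDEX.get(aa, INDEX["X"])] = 1
    return mat.ravel()

vec = onehot("MKX")
print(vec.shape, int(vec.sum()))
```

Under this assumed alphabet, a sequence padded to 1000 residues would yield 21,000 mostly zero features, so the feature matrix is large and sparse.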

In [43]:
methods=['lr', 'xg', 'dt', 'rf', 'gbdt']
print('baslineName', '\t', 'accuracy','\t', 'precision(PPV) \t NPV \t\t', 'recall','\t', 'f1', '\t\t', 'auroc','\t\t', 'auprc', '\t\t confusion Matrix')
for method in methods:
    funclib.evaluate(method, X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.609676 	0.560377 		0.680254 	0.715023 	0.628324 	0.661484 	0.568861 	 tp: 14683 fp: 11519 fn: 5852 tn: 12450
xg 		0.665446 	0.643591 		0.682740 	0.616168 	0.629581 	0.742674 	0.684483 	 tp: 12653 fp: 7007 fn: 7882 tn: 16962
dt 		0.602418 	0.570965 		0.628129 	0.556562 	0.563671 	0.599133 	0.522388 	 tp: 11429 fp: 8588 fn: 9106 tn: 15381
rf 		0.666479 	0.624934 		0.710044 	0.693255 	0.657324 	0.745132 	0.689971 	 tp: 14236 fp: 8544 fn: 6299 tn: 15425
gbdt 		0.624079 	0.569351 		0.712026 	0.760604 	0.651226 	0.695638 	0.628716 	 tp: 15619 fp: 11814 fn: 4916 tn: 12155
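
`funclib.evaluate` is a project helper whose source is not part of this notebook. The sketch below illustrates how one such baseline (logistic regression) might be trained and scored with scikit-learn to produce the same columns. The function name `evaluate_lr`, the hyperparameters, and the synthetic data are assumptions and do not reproduce the numbers above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate_lr(X_train, Y_train, X_test, Y_test):
    """Fit logistic regression and return the metrics named in the table header."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    tn, fp, fn, tp = confusion_matrix(Y_test, pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(Y_test, pred),
        "precision": precision_score(Y_test, pred, zero_division=0),
        "npv": tn / (tn + fn) if (tn + fn) else 0.0,
        "recall": recall_score(Y_test, pred),
        "f1": f1_score(Y_test, pred),
        "auroc": roc_auc_score(Y_test, prob),
        "auprc": average_precision_score(Y_test, prob),
        "confusion": {"tp": int(tp), "fp": int(fp), "fn": int(fn), "tn": int(tn)},
    }

# Synthetic, well-separated two-class data standing in for real features
rng = np.random.default_rng(0)
neg = rng.normal(-3, 1, (100, 5))
pos = rng.normal(3, 1, (100, 5))
X_tr = np.vstack([neg[:75], pos[:75]])
y_tr = np.array([0] * 75 + [1] * 75)
X_te = np.vstack([neg[75:], pos[75:]])
y_te = np.array([0] * 25 + [1] * 25)

result = evaluate_lr(X_tr, y_tr, X_te, y_te)
print(result)
```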


## 5. UniRep features

In [3]:
uni_data = np.load('./data/emzyme_noemzyme_unirep.npy', allow_pickle=True)
uni_data = pd.DataFrame([i for k in uni_data for i in k])
uni_data[7] = pd.to_datetime(uni_data[7])
uni_data.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1902,1903,1904,1905,1906,1907,1908,1909,1910,1911
0,P02802,MT1_MOUSE,False,False,1,-,0,1986-07-21,1986-07-21,2021-04-07,...,-0.002212,0.119389,-0.019647,-0.016905,-0.532317,0.008659,-0.011296,0.001098,0.794472,0.210124
1,P61852,SODC_DROSI,True,False,1,1.15.1.1,4,1986-07-21,2007-01-23,2020-12-02,...,0.189573,-0.039781,-0.457674,0.018009,-0.097954,-0.030856,0.060694,0.01537,0.012495,0.00018
2,P00925,ENO2_YEAST,True,False,1,4.2.1.11,4,1986-07-21,2007-01-23,2021-04-07,...,-0.009811,0.000229,-0.227972,0.185412,-0.106373,0.060093,0.007852,0.061177,-0.071324,-0.030171


In [4]:
t_thres = datetime.datetime(2009, 12, 14, 0, 0)

# Training set
train = uni_data[uni_data[7] <= t_thres].sort_values(by=7)
# Test set
test = uni_data[uni_data[7] > t_thres].sort_values(by=7)

In [5]:
X_train = train.iloc[:,12:-1]
X_test = test.iloc[:,12:-1]

Y_train = train.iloc[:,2:3].astype('int')
Y_test = test.iloc[:,2:3].astype('int')

X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train).flatten()
Y_test = np.array(Y_test).flatten()
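
The slice `12:-1` follows ordinary Python slicing rules, so it stops before the last column. Whether the last column of the `.npy` table is a feature or something else is not shown in this notebook, so it may be worth checking. The toy table below (made-up data with three metadata columns and four feature columns) illustrates the difference between the two slices.

```python
import pandas as pd

# 3 metadata columns followed by 4 feature columns, labeled 0..6 as in uni_data
toy = pd.DataFrame([["P1", True, "MKV", 0.1, 0.2, 0.3, 0.4],
                    ["P2", False, "MDP", 0.5, 0.6, 0.7, 0.8]])

with_stop = toy.iloc[:, 3:-1]  # stops before the final column
to_end = toy.iloc[:, 3:]       # runs through the final column
print(list(with_stop.columns), list(to_end.columns))
```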

In [None]:
funclib.run_baseline(X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.826757 	0.849132 		0.811034 	0.759484 	0.801810 	0.901500 	0.892720 	 tp: 15596 fp: 2771 fn: 4939 tn: 21198
xg 		0.860260 	0.883725 		0.843327 	0.802776 	0.841308 	0.934751 	0.927646 	 tp: 16485 fp: 2169 fn: 4050 tn: 21800
