# Task 1. Predicting Whether a Protein Is an Enzyme or a Non-enzyme

> author: Shizhenkun   
> email: zhenkun.shi@tib.cas.cn   
> date: 2021-05-26  

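## Task Overview
Given a protein sequence, this task predicts whether the protein is an enzyme or a non-enzyme. The dataset is Sprot (Swiss-Prot): models learn from the existing records and then predict, for newly supplied protein sequences, whether each one is an enzyme.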


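## Dataset Statistics
- Data source: Sprot, with 219,227 enzyme records and 226,539 non-enzyme records.
- All records are sorted by date so that roughly 90% form the training set and roughly 10% form the test set; the corresponding cutoff date is December 14, 2009.
- Records dated before December 14, 2009 form the training set and records dated after it form the test set. The resulting split is summarized below: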





| Split        | Enzyme  | Non-enzyme | Total                              |
| ------------ | ------- | ---------- | ---------------------------------- |
| Training set | 198,692 | 208,391    | 407,083 (407,083/453,212 = 89.82%) |
| Test set     | 20,535  | 25,594     | 46,129 (46,129/453,212 = 10.18%)   |
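
The date-based split described above can be sketched as follows. This is only an illustration under stated assumptions: the column name `date` and the helper `split_by_date` are hypothetical (the source does not show how the split was implemented), and records dated exactly on the cutoff day are assigned to the training set here.

```python
import pandas as pd


def split_by_date(df, cutoff, date_col="date"):
    """Split records into train (on or before cutoff) and test (after cutoff)."""
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(cutoff)
    train = df[dates <= cutoff].reset_index(drop=True)
    test = df[dates > cutoff].reset_index(drop=True)
    return train, test


# Toy records; the real data set is not reproduced here.
records = pd.DataFrame({
    "id": ["P1", "P2", "P3", "P4"],
    "date": ["2001-03-01", "2009-12-14", "2010-01-05", "2015-07-20"],
})
train_df, test_df = split_by_date(records, "2009-12-14")
print(len(train_df), len(test_df))
```

Splitting by date rather than at random keeps every test record newer than the training data, which mimics predicting labels for newly deposited sequences.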






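## Methods

- Homology alignment: build an alignment database from the training set, align each test sequence against it, keep the best hit, and use that hit's enzyme/non-enzyme label as the prediction for the test sequence.
- Traditional machine learning methods
- Deep learning methods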



## Results

### Baselines

| Method             | Accuracy                         | Precision                        | Recall                          | F1  |
| ------------------ | -------------------------------- | -------------------------------- | ------------------------------- | --- |
| Homology alignment | 0.7273298792516638 (33551/46129) | 0.9637491741590785 (33551/34813) | 0.754687940341217 (34813/46129) |     |
| DeepEC             | 0.7685620759175356 (35453/46129) | 0.5372317856709904 (35453/65992) | 1.43059680461315 (65992/46129)  |     |
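
Each cell in the table pairs a ratio with the counts behind it. As the evaluation cells in this notebook show, accuracy is computed as correct/total, precision as correct/matched, and recall as matched/total, where "matched" is the number of rows the method returned. A minimal sketch of that arithmetic follows; the function name `coverage_metrics` is illustrative and not part of `funclib`.

```python
def coverage_metrics(correct, matched, total):
    """Ratios as defined in this notebook's evaluation cells."""
    return {
        "accuracy": correct / total,    # correct predictions over all queries
        "precision": correct / matched, # correct predictions over returned rows
        "recall": matched / total,      # returned rows over all queries
    }


# Counts taken from the homology alignment row of the table.
homology = coverage_metrics(correct=33551, matched=34813, total=46129)
print(homology)
```

With these definitions, recall measures how much of the test set a method covers rather than the fraction of true enzymes it recovers, so a method that returns more rows than there are queries produces a value above 1.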


### Machine learning + one-hot


| Baseline | Accuracy | Precision (PPV) | NPV      | Recall   | F1       | AUROC    | AUPRC    | Confusion matrix                       |
| -------- | -------- | --------------- | -------- | -------- | -------- | -------- | -------- | -------------------------------------- |
| lr       | 0.609676 | 0.560377        | 0.680254 | 0.715023 | 0.628324 | 0.661484 | 0.568861 | tp: 14683 fp: 11519 fn: 5852 tn: 12450 |
| xg       | 0.665446 | 0.643591        | 0.682740 | 0.616168 | 0.629581 | 0.742674 | 0.684483 | tp: 12653 fp: 7007 fn: 7882 tn: 16962  |
| dt       | 0.602418 | 0.570965        | 0.628129 | 0.556562 | 0.563671 | 0.599133 | 0.522388 | tp: 11429 fp: 8588 fn: 9106 tn: 15381  |
| rf       | 0.666479 | 0.624934        | 0.710044 | 0.693255 | 0.657324 | 0.745132 | 0.689971 | tp: 14236 fp: 8544 fn: 6299 tn: 15425  |
| gbdt     | 0.624079 | 0.569351        | 0.712026 | 0.760604 | 0.651226 | 0.695638 | 0.628716 | tp: 15619 fp: 11814 fn: 4916 tn: 12155 |


### Machine learning + UniRep

| Baseline | Accuracy | Precision (PPV) | NPV      | Recall   | F1       | AUROC    | AUPRC    | Confusion matrix                      |
| -------- | -------- | --------------- | -------- | -------- | -------- | -------- | -------- | ------------------------------------- |
| lr       | 0.826757 | 0.849132        | 0.811034 | 0.759484 | 0.801810 | 0.901500 | 0.892720 | tp: 15596 fp: 2771 fn: 4939 tn: 21198 |
| xg       | 0.860260 | 0.883725        | 0.843327 | 0.802776 | 0.841308 | 0.934751 | 0.927646 | tp: 16485 fp: 2169 fn: 4050 tn: 21800 |
| dt       | 0.772290 | 0.767612        | 0.775916 | 0.726418 | 0.746447 | 0.769004 | 0.683843 | tp: 14917 fp: 4516 fn: 5618 tn: 19453 |
| rf       | 0.851070 | 0.903031        | 0.818172 | 0.758705 | 0.824600 | 0.932207 | 0.922548 | tp: 15580 fp: 1673 fn: 4955 tn: 22296 |
| gbdt     | 0.822825 | 0.864764        | 0.796054 | 0.730217 | 0.791815 | 0.898601 | 0.889249 | tp: 14995 fp: 2345 fn: 5540 tn: 21624 |
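
The threshold-based columns in these tables (accuracy, PPV, NPV, recall, F1) follow directly from the confusion-matrix counts in the last column. The sketch below recomputes them for the `lr` row of the UniRep table; the function name `metrics_from_confusion` is illustrative and not part of `funclib`. AUROC and AUPRC need the predicted scores and cannot be derived from the counts alone.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Derive threshold-based metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp)  # positive predictive value
    recall = tp / (tp + fn)     # sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "npv": tn / (tn + fn),  # negative predictive value
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from the lr row of the UniRep table.
lr_unirep = metrics_from_confusion(tp=15596, fp=2771, fn=4939, tn=21198)
for name, value in lr_unirep.items():
    print(f"{name}: {value:.6f}")
```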


## 1. Import Packages and Define Common Functions

In [1]:
import numpy as np
import pandas as pd
import random
import time
import gzip
import re
import datetime
import sys
import os
from tqdm import tqdm

from functools import reduce
import matplotlib.pyplot as plt

sys.path.append("../../tools/")
import commontools
import funclib
from pyecharts.globals import CurrentConfig, OnlineHostType
CurrentConfig.ONLINE_HOST = OnlineHostType.NOTEBOOK_HOST
from pyecharts.globals import CurrentConfig, NotebookType
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.faker import Faker
from pyecharts.globals import ThemeType

%load_ext autoreload
%autoreload 2

## 2. Load Data

In [7]:
train = pd.read_hdf('./data/train.h5',key='data')
test = pd.read_hdf('./data/test.h5',key='data')
head = funclib.table_head + ['f'+str(i) for i in range(1, 1901) ]
train.columns = head
test.columns = head

## 3. Homology Alignment

In [4]:
res_data=funclib.getblast(train,test)

# Join the alignment results with the test set
data_match = pd.merge(test,res_data, on=['id'], how='inner')


Write finished
Write finished
diamond makedb --in /tmp/train.fasta -d /tmp/train.dmnd
diamond blastp -d /tmp/train.dmnd  -q  /tmp/test.fasta -o /tmp/test_fasta_results.tsv -b5 -c1 -k 1
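
`funclib.getblast` is a project helper whose internals are not shown; the logged commands and the `sseqid` column used in the next cell suggest that it returns the DIAMOND hits as a DataFrame. Assuming the results file uses DIAMOND's default tabular layout (BLAST outfmt 6, 12 standard columns), such a file could be loaded roughly as follows. The helper `read_diamond_tsv` is hypothetical, and renaming `qseqid` to `id` is an assumption made so that the result can be merged on `id`.

```python
import io

import pandas as pd

# Standard column order of BLAST tabular output (outfmt 6), DIAMOND's default.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_diamond_tsv(source):
    """Read a headerless DIAMOND/BLAST tabular file into a DataFrame."""
    hits = pd.read_csv(source, sep="\t", names=BLAST6_COLUMNS)
    # Assumption: expose the query id as `id` so it can be merged with the test table.
    return hits.rename(columns={"qseqid": "id"})


# One made-up hit line standing in for /tmp/test_fasta_results.tsv.
sample = io.StringIO("Q1\tT9\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410.0\n")
hits = read_diamond_tsv(sample)
print(hits[["id", "sseqid", "pident", "evalue"]])
```

Because the command uses `-k 1`, each query contributes at most one row, and queries without any hit are absent from the file.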


In [30]:
# Look up the enzyme/non-enzyme label of each best hit in the training set
resArray =[]
for i in tqdm(range(len(res_data))):
    mn = train[train['id']== res_data['sseqid'][i]]['isemzyme'].values
    resArray.append(mn)
data_match['sresults_isemzyme']=np.array(resArray) 

# Compute metrics
data_match['iscorrect'] = data_match[['isemzyme', 'sresults_isemzyme']].apply(lambda x: x['isemzyme'] == x['sresults_isemzyme'], axis=1) # check whether the true and transferred labels agree
correct = sum(data_match['iscorrect'])
find  = len(data_match)
total = len(test)
print('Total query records are: {0}'.format(total))
print('Matched records are: {0}'.format(find))
print('Accuracy: {0}({1}/{2})'.format(correct/total, correct, total))
print('Precision: {0}({1}/{2})'.format(correct/find, correct, find))
print('Recall: {0}({1}/{2})'.format(find/total, find, total))

100%|##########| 34813/34813 [15:30<00:00, 37.39it/s]


Total query records are: 46129
Matched records are: 34813
Accuracy: 0.7273298792516638(33551/46129)
Precision: 0.9637491741590785(33551/34813)
Recall: 0.754687940341217(34813/46129)
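
The loop in this cell filters the whole training table once per hit, which is why it needs about 15 minutes for 34,813 hits. A lookup with `Series.map` is a common faster alternative. The sketch below uses toy data, assumes that every training `id` is unique, and has not been run against the real data set.

```python
import pandas as pd

# Toy stand-ins for the notebook's `train` and `data_match` tables.
train = pd.DataFrame({"id": ["T1", "T2", "T3"],
                      "isemzyme": [True, False, True]})
data_match = pd.DataFrame({"id": ["Q1", "Q2"],
                           "isemzyme": [True, True],
                           "sseqid": ["T3", "T2"]})

# Build an id -> label lookup once, then map every best hit through it.
label_by_id = train.set_index("id")["isemzyme"]
data_match["sresults_isemzyme"] = data_match["sseqid"].map(label_by_id)

# Vectorized comparison replaces the row-wise apply.
data_match["iscorrect"] = data_match["isemzyme"] == data_match["sresults_isemzyme"]
print(data_match)
```

Mapping on the `sseqid` column of `data_match` itself also avoids relying on `res_data` and `data_match` sharing the same row order.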


## 4. Prediction with Machine Learning Methods
### 4.1 One-hot + machine learning

In [31]:
trainset = train[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)
testset = test[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)

MAX_SEQ_LENGTH = 1500 # maximum sequence length: longer sequences are truncated, shorter ones are padded with 'X'
trainset.seq = trainset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
testset.seq = testset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
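
The two `map` calls above bring every sequence to exactly `MAX_SEQ_LENGTH` characters: slicing truncates long sequences and `str.ljust` pads short ones with `'X'`. A small self-contained illustration, using a shorter length and made-up sequences for readability (the helper name `pad_or_truncate` is illustrative):

```python
MAX_SEQ_LENGTH = 8  # small value for illustration; the notebook uses 1500


def pad_or_truncate(seq, max_len=MAX_SEQ_LENGTH, pad_char="X"):
    """Cut sequences longer than max_len and right-pad shorter ones."""
    return seq[:max_len].ljust(max_len, pad_char)


short_seq = pad_or_truncate("MKT")
long_seq = pad_or_truncate("MKTAYIAKQRQ")
print(short_seq)  # MKTXXXXX
print(long_seq)   # MKTAYIAK
```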

In [32]:
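f_train = funclib.dna_onehot(trainset) # one-hot encode the training set
f_test = funclib.dna_onehot(testset) # one-hot encode the test set

train_full = pd.concat([trainset, f_train], axis=1, join='inner' ) # join metadata and features for the training set
test_full = pd.concat([testset, f_test], axis=1, join='inner' )    # join metadata and features for the test set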

X_train = train_full.iloc[:,4:]
X_test = test_full.iloc[:,4:]
Y_train = train_full.isemzyme.astype('int')
Y_test = test_full.isemzyme.astype('int')

X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train)
Y_test = np.array(Y_test)
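
The encoder `funclib.dna_onehot` is a project helper whose implementation is not shown. As a rough idea of what a one-hot encoding of a padded protein sequence can look like, here is a minimal sketch. The alphabet (20 standard amino acids plus `X`), the mapping of unknown letters to `X`, and the flattening into a single vector are all assumptions of this sketch, not a description of `funclib`.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
ALPHABET = AMINO_ACIDS + "X"          # 'X' marks padding or unknown residues
CHAR_TO_INDEX = {c: i for i, c in enumerate(ALPHABET)}


def onehot_encode(seq, max_len):
    """Return a flat one-hot vector of length max_len * len(ALPHABET)."""
    padded = seq[:max_len].ljust(max_len, "X")
    # Letters outside the alphabet fall back to the 'X' column.
    idx = np.array([CHAR_TO_INDEX.get(c, CHAR_TO_INDEX["X"]) for c in padded])
    matrix = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    matrix[np.arange(max_len), idx] = 1.0
    return matrix.ravel()


vec = onehot_encode("ACD", max_len=5)
print(vec.shape)  # (105,)
```

Under these assumptions a sequence padded to 1500 residues becomes a sparse vector with 1500 × 21 = 31,500 entries, which treats every position independently.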

In [33]:
funclib.run_baseline(X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.599081 	0.538479 		0.681049 	0.695447 	0.606979 	0.649262 	0.540892 	 tp: 14281 fp: 12240 fn: 6254 tn: 13354
xg 		0.662880 	0.626024 		0.690533 	0.602824 	0.614205 	0.729195 	0.653786 	 tp: 12379 fp: 7395 fn: 8156 tn: 18199
dt 		0.601010 	0.551674 		0.640856 	0.553689 	0.552680 	0.596333 	0.504138 	 tp: 11370 fp: 9240 fn: 9165 tn: 16354
rf 		0.654989 	0.600400 		0.709312 	0.672705 	0.634499 	0.732298 	0.662396 	 tp: 13814 fp: 9194 fn: 6721 tn: 16400
gbdt 		0.610440 	0.546365 		0.706411 	0.735963 	0.627147 	0.684524 	0.602722 	 tp: 15113 fp: 12548 fn: 5422 tn: 13046
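
`funclib.run_baseline` is a project helper whose code is not shown here. The sketch below shows one way such a comparison could be written with scikit-learn, reporting the same columns as the output above. It is an assumption-laden illustration: the function name `run_baselines_sketch`, the hyperparameters, and the synthetic data are made up, and the `xg` and `gbdt` models are left out to keep it short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.tree import DecisionTreeClassifier


def run_baselines_sketch(X_train, y_train, X_test, y_test, seed=0):
    """Fit a few classifiers and collect the metrics reported in the tables."""
    models = {
        "lr": LogisticRegression(max_iter=1000),
        "dt": DecisionTreeClassifier(random_state=seed),
        "rf": RandomForestClassifier(n_estimators=50, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        prob = model.predict_proba(X_test)[:, 1]  # score for class 1 (enzyme)
        tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred, zero_division=0),
            "npv": tn / (tn + fn) if (tn + fn) else 0.0,
            "recall": recall_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
            "auroc": roc_auc_score(y_test, prob),
            "auprc": average_precision_score(y_test, prob),
            "confusion": {"tp": int(tp), "fp": int(fp),
                          "fn": int(fn), "tn": int(tn)},
        }
    return results


# Synthetic, easily separable data; not the notebook's features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
results = run_baselines_sketch(X[:150], y[:150], X[150:], y[150:])
for name, row in results.items():
    print(name, round(row["accuracy"], 3), row["confusion"])
```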


## 5. UniRep + Machine Learning

In [43]:
X_train = train.iloc[:,12:]
X_test = test.iloc[:,12:]
Y_train = train.iloc[:,2].astype('int')
Y_test = test.iloc[:,2].astype('int')
X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train).flatten()
Y_test = np.array(Y_test).flatten()
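
The positional slices above (`iloc[:,12:]` for features, `iloc[:,2]` for the label) depend on the exact column order of `funclib.table_head`, which is not shown. Since the feature columns were named `f1` to `f1900` when the data was loaded, selecting them by name is a less fragile option. A minimal sketch on a toy table (the helper name `split_features_labels` is illustrative):

```python
import re

import pandas as pd


def split_features_labels(df, label_col="isemzyme"):
    """Select columns named f<number> as features and return (X, y) arrays."""
    feature_cols = [c for c in df.columns if re.fullmatch(r"f\d+", str(c))]
    X = df[feature_cols].to_numpy()
    y = df[label_col].astype(int).to_numpy()
    return X, y


# Toy table with three feature columns instead of 1900.
toy = pd.DataFrame({
    "id": ["P1", "P2"],
    "isemzyme": [True, False],
    "f1": [0.1, 0.4],
    "f2": [0.2, 0.5],
    "f3": [0.3, 0.6],
})
X_toy, y_toy = split_features_labels(toy)
print(X_toy.shape, y_toy)
```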

In [44]:
funclib.run_baseline(X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.806066 	0.804573 		0.807113 	0.745410 	0.773862 	0.881648 	0.857795 	 tp: 15307 fp: 3718 fn: 5228 tn: 21876
xg 		0.838952 	0.837158 		0.840258 	0.792355 	0.814140 	0.915790 	0.896587 	 tp: 16271 fp: 3165 fn: 4264 tn: 22429
dt 		0.756899 	0.734821 		0.773571 	0.710202 	0.722302 	0.752284 	0.650879 	 tp: 14584 fp: 5263 fn: 5951 tn: 20331
rf 		0.832578 	0.857677 		0.816648 	0.748040 	0.799116 	0.914840 	0.893545 	 tp: 15361 fp: 2549 fn: 5174 tn: 23045
gbdt 		0.803377 	0.819520 		0.793103 	0.715997 	0.764269 	0.882290 	0.856353 	 tp: 14703 fp: 3238 fn: 5832 tn: 22356


## 6. DeepEC

In [8]:
funclib.table2fasta(test, './data/deepec.fasta')
! python ../../baselines/deepec/deepec.py -i ./data/deepec.fasta -o ./data/deepec/

Write finished
2021-06-10 15:53:37.877576: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-10 16:15:35.268850: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-06-10 16:15:35.359579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:81:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.8GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-06-10 16:15:35.360343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: 
pciBusID: 0000:c1:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.8GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-06-10 16:15:35.361071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 2 with properties: 
pciBusID: 0000:c2:00
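
DeepEC takes a FASTA file as input, which is why the test table is first written out with `funclib.table2fasta`. That helper's code is not shown; the sketch below is a hypothetical equivalent that writes `(id, sequence)` pairs in FASTA format. The function name `write_fasta`, the line width, and the example sequences are made up for illustration.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (id, sequence) pairs in FASTA format, wrapping lines at `width`."""
    for seq_id, seq in records:
        handle.write(f">{seq_id}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


# Made-up sequences; an in-memory buffer stands in for ./data/deepec.fasta.
buffer = io.StringIO()
write_fasta([("B0VRF9", "MKTAYIAKQR"), ("C6E8M5", "MSEQ")], buffer, width=5)
fasta_text = buffer.getvalue()
print(fasta_text)
```

For a DataFrame with `id` and `seq` columns, the records could be supplied as `zip(test["id"], test["seq"])`.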

In [30]:
res_deepec = pd.read_csv('./data/deepec/log_files/Enzyme_prediction.txt', header=0, sep='\t')
ec_dict = dict(zip(test.id, test.isemzyme))  # map each sequence id to its true enzyme label
res_deepec['isemzyme'] = res_deepec.apply(lambda x: ec_dict.get(x['Query ID']), axis=1)
res_deepec['isemzyme_pred'] = res_deepec.apply(lambda x: (x['Predicted class']=='Enzyme'), axis=1)
res_deepec['iscorrect'] = res_deepec.apply(lambda x: (x['isemzyme_pred']==x['isemzyme']), axis=1)
res_deepec.head(3)

Unnamed: 0,Query ID,Predicted class,DNN activity,isemzyme,isemzyme_pred,iscorrect
0,B0VRF9,Enzyme,0.999659,True,True,True
1,C6E8M5,Enzyme,0.999995,True,True,True
2,Q3SWM0,Enzyme,0.99975,True,True,True


In [31]:
correct = sum(res_deepec['iscorrect'])
find  = len(res_deepec)
total = len(test)
print('Total query records are: {0}'.format(total))
print('Matched records are: {0}'.format(find))
print('Accuracy: {0}({1}/{2})'.format(correct/total, correct, total))
print('Precision: {0}({1}/{2})'.format(correct/find, correct, find))
print('Recall: {0}({1}/{2})'.format(find/total, find, total))

Total query records are: 46129
Matched records are: 65992
Accuracy: 0.7685620759175356(35453/46129)
Precision: 0.5372317856709904(35453/65992)
Recall: 1.43059680461315(65992/46129)
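
A recall above 1 means that `res_deepec` contains more rows (65,992) than there are test queries (46,129). One possible explanation, which the source does not confirm, is that some `Query ID` values appear more than once in the DeepEC output. The sketch below shows how such duplicates could be detected and collapsed to one row per query before computing metrics. It uses toy data, the `Non-enzyme` label string is a placeholder, and keeping the row with the highest `DNN activity` is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for the DeepEC prediction table.
res = pd.DataFrame({
    "Query ID": ["A", "A", "B", "C"],
    "Predicted class": ["Enzyme", "Non-enzyme", "Enzyme", "Non-enzyme"],
    "DNN activity": [0.91, 0.40, 0.99, 0.05],
})

# Count rows whose Query ID already appeared earlier in the table.
n_duplicated = int(res["Query ID"].duplicated().sum())

# Keep one row per query: here, the one with the highest DNN activity.
deduped = (res.sort_values("DNN activity", ascending=False)
              .drop_duplicates(subset="Query ID", keep="first")
              .sort_values("Query ID")
              .reset_index(drop=True))
print(n_duplicated, len(deduped))
```

After such a step, the number of matched rows can no longer exceed the number of distinct queries, so the matched/total ratio stays at or below 1 as long as every query comes from the test set.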


## 7. Ensemble Model