# Task 2. Predicting whether an enzyme is monofunctional or multifunctional

> author: Shizhenkun   
> email: zhenkun.shi@tib.cas.cn   
> date: 2021-04-28  
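## Task overview
Given an enzyme sequence, this task predicts whether the enzyme is monofunctional or multifunctional. The dataset used is Sport: models are trained on this data and then used to classify newly given enzymes as monofunctional or multifunctional.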



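## Dataset statistics
- The data are split at the time point identified in Task 1.
- The cutoff is 14 December 2009: records before it form the training set and records after it form the test set. The resulting statistics are: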



| Items        | Monofunctional enzymes | Multifunctional enzymes | Total                              |
| ------------ | ---------------------- | ----------------------- | ---------------------------------- |
| Training set | 185,453                | 13,239                  | 198,692 (198,692/219,227 = 90.63%) |
| Test set     | 18,868                 | 1,667                   | 20,535 (20,535/219,227 = 9.37%)    |
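The time-based split described above can be sketched as follows. This is a minimal illustration, not the author's code: it assumes the split is made on the `date_integraged` column that appears in the notebook output, and that records dated exactly on the cutoff go to the training set. The toy IDs are hypothetical.

```python
import pandas as pd

# Cutoff date taken from the text; how the boundary day itself is treated is an assumption.
CUTOFF = pd.Timestamp("2009-12-14")

def split_by_date(df, date_col="date_integraged", cutoff=CUTOFF):
    """Return (train, test): rows dated on or before the cutoff vs. after it."""
    dates = pd.to_datetime(df[date_col])
    train = df[dates <= cutoff].reset_index(drop=True)
    test = df[dates > cutoff].reset_index(drop=True)
    return train, test

# Toy records with hypothetical IDs, only to show the behaviour.
toy = pd.DataFrame({
    "id": ["P1", "P2", "P3", "P4"],
    "date_integraged": ["2005-01-01", "2009-12-14", "2009-12-15", "2015-06-30"],
})
toy_train, toy_test = split_by_date(toy)
print(len(toy_train), len(toy_test))
```

Splitting by date rather than at random keeps every test record newer than the training data, which is closer to how the model would be used on newly annotated proteins.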



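## Dataset construction

* Using the protein annotations and the enzyme/non-enzyme datasets defined earlier, the enzyme records are divided into two classes.
* A protein with exactly one EC number is labeled a "monofunctional enzyme"; a protein with more than one EC number is labeled a "multifunctional enzyme".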
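The labeling rule above can be sketched in a few lines. The helper names are hypothetical, and the comma separator between EC numbers is an assumption; the source does not show how multiple EC numbers are stored in the `ec_number` field.

```python
def count_ec_numbers(ec_field, sep=","):
    """Count the EC numbers in an annotation string such as '1.1.1.1, 2.7.7.7'.

    The separator is an assumption; adjust it to the actual annotation format.
    """
    return len([ec for ec in ec_field.split(sep) if ec.strip()])

def is_multifunctional(ec_field, sep=","):
    """True when the record carries more than one EC number."""
    return count_ec_numbers(ec_field, sep) > 1

print(is_multifunctional("2.1.3.15"))          # one EC number
print(is_multifunctional("1.1.1.1, 2.7.7.7"))  # two EC numbers
```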



## Experimental results

### Homology alignment

| Methods            | Accuracy                         | Precision                        | Recall                           | F1 |
| ------------------ | -------------------------------- | -------------------------------- | -------------------------------- | -- |
| Homology alignment | 0.6112490869247627 (12552/20535) | 0.7082722040401761 (12552/17722) | 0.8630143657170685 (17722/20535) |    |
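The three ratios in this table can be reproduced from the counts shown in parentheses. In this notebook, accuracy is correct hits over all queries, precision is correct hits over queries that returned a hit, and "recall" is the share of queries that returned any hit. The F1 line below is an assumption: the source leaves that cell empty, and the sketch only applies the usual harmonic mean to the precision and recall as defined here.

```python
def alignment_metrics(correct, matched, total):
    """Ratios as computed in the notebook, plus an assumed harmonic-mean F1."""
    accuracy = correct / total    # correct hits over all queries
    precision = correct / matched # correct hits over queries with a hit
    recall = matched / total      # queries with a hit over all queries
    f1 = 2 * precision * recall / (precision + recall)  # not reported in the source
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

m = alignment_metrics(correct=12552, matched=17722, total=20535)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```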



### Machine learning + one-hot



|baselineName| accuracy  |precision(PPV) | NPV  |recall |f1  | auroc  | auprc  | confusion matrix|
| ------| ----------|-----------| ---------- | ----------|-----------|-----------|---------- |------------------------------------------|
|lr |0.910640 |0.271739 |0.922299 |0.059988 |0.098280 |0.551935 |0.124858 | tp: 100 fp: 268 fn: 1567 tn: 18600|
|xg |0.918578 |0.490909 |0.924383 |0.080984 |0.139032 |0.626453 |0.193860 | tp: 135 fp: 140 fn: 1532 tn: 18728|
|dt |0.850889 |0.168331 |0.928765 |0.212358 |0.187798 |0.559768 |0.099686 | tp: 354 fp: 1749 fn: 1313 tn: 17119|
|<span style='color:red; font-size:16px;'> rf </span> |0.924032 |0.957265 |0.923842 |0.067187 |0.125561 |0.668621 |0.269894 | tp: 112 fp: 5 fn: 1555 tn: 18863|
|gbdt |0.917507 |0.398496 |0.920890 |0.031794 |0.058889 |0.601392 |0.161334 | tp: 53 fp: 80 fn: 1614 tn: 18788|
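The threshold-based columns of this table follow directly from the confusion-matrix counts, with the multifunctional class as the positive class (tp + fn = 1,667 in every row). A minimal sketch using the standard definitions and the `rf` row as input:

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),       # positive predictive value (PPV)
        "npv": tn / (tn + fn),             # negative predictive value
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Counts copied from the rf row of the table.
rf = metrics_from_confusion(tp=112, fp=5, fn=1555, tn=18863)
for name, value in rf.items():
    print(f"{name}: {value:.6f}")
```

The `rf` row shows why accuracy alone is misleading here: only 112 of the 1,667 multifunctional enzymes are recovered, yet accuracy stays above 0.92 because the negative class dominates the test set.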



### Machine learning + UniRep


|baselineName| accuracy  |precision(PPV) | NPV  |recall |f1  | auroc  | auprc  | confusion matrix    |
| ----------| -----------|---------------| --------- | --------|---------|-----------|---------- |------------------------------------------|
|lr     |0.906306  |0.269300      |0.924066 |0.089982 |0.134892 |0.658586 |0.162518 | tp: 150 fp: 407 fn: 1517 tn: 18461|
|xg     |0.922669  |0.613181      |0.928019 |0.128374 |0.212302 |0.730793 |0.294830 | tp: 214 fp: 135 fn: 1453 tn: 18733|
|dt     |0.870611  |0.205707      |0.929932 |0.207558 |0.206629 |0.568375 |0.107026 | tp: 346 fp: 1336 fn: 1321 tn: 17532|
|rf     |0.923789  |0.947368      |0.923657 |0.064787 |0.121280 |0.693704 |0.297006 | tp: 108 fp: 6 fn: 1559 tn: 18862|
|gbdt     |0.918578  |0.474747      |0.920728 |0.028194 |0.053228 |0.649263 |0.171179 | tp: 47 fp: 52 fn: 1620 tn: 18816|

## 1. Import required packages

In [2]:
import numpy as np
import pandas as pd
import random
import os
import datetime
import sys
from tqdm import tqdm

sys.path.append("../../tools/")
import funclib



%load_ext autoreload
%autoreload 2

## 2. Load data

In [3]:
train = pd.read_hdf('./data/train.h5',key='data')
test = pd.read_hdf('./data/test.h5',key='data')
head = funclib.table_head + ['f'+str(i) for i in range(1, 1901) ]
train.columns = head
test.columns = head

## 3. Homology alignment

In [4]:
res_data=funclib.getblast(train,test)

Write finished
Write finished
diamond makedb --in /tmp/train.fasta -d /tmp/train.dmnd
diamond blastp -d /tmp/train.dmnd  -q  /tmp/test.fasta -o /tmp/test_fasta_results.tsv -b5 -c1 -k 1
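The internals of `funclib.getblast` are not shown, but the log above suggests three steps: write FASTA files, run the two DIAMOND commands, and read the tabular result. The sketch below is a hypothetical reconstruction of those steps using only the standard library. The helper names are invented for illustration, the command-line flags are copied from the log, and the column list is DIAMOND's default 12-column tabular layout.

```python
import csv
import io

# Default column order of DIAMOND's tabular output (BLAST outfmt 6).
BLAST_TAB_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                     "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def to_fasta(records):
    """Format (id, sequence) pairs as FASTA text."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in records)

def diamond_commands(train_fasta, test_fasta, db_path, out_path):
    """Build the two command lines from the log as argument lists."""
    makedb = ["diamond", "makedb", "--in", train_fasta, "-d", db_path]
    blastp = ["diamond", "blastp", "-d", db_path, "-q", test_fasta,
              "-o", out_path, "-b5", "-c1", "-k", "1"]
    return makedb, blastp

def parse_hits(tsv_text):
    """Parse tab-separated DIAMOND output into a list of dicts (values kept as strings)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(zip(BLAST_TAB_COLUMNS, row)) for row in reader if row]

# Hypothetical result line, only to show the parsing.
example = "Q_TEST\tS_TRAIN\t55.5\t256\t107\t1\t44\t298\t32\t287\t8.66e-103\t304.0\n"
hits = parse_hits(example)
print(hits[0]["sseqid"], hits[0]["bitscore"])
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` on a machine where DIAMOND is installed. With `-k 1`, DIAMOND reports at most one hit per query, which is why the result can be joined one-to-one with the test set.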


In [5]:
# Join the test records with their alignment hits
data_match = pd.merge(test,res_data, on=['id'], how='inner')

# Attach the EC number of each hit
resArray =[]
for i in tqdm(range(len(res_data))):
    mn = train[train['id']== res_data['sseqid'][i]]['ec_number'].values
    resArray.append(mn)
data_match['sresults_ec']=np.array(resArray) 
data_match.head(3)



100%|██████████| 17722/17722 [04:10<00:00, 70.80it/s]


Unnamed: 0,id,name,isemzyme,isMultiFunctional,functionCounts,ec_number,ec_specific_level,date_integraged,date_sequence_update,date_annotation_update,...,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore,sresults_ec
0,B0VRF9,ACCD_ACIBS,True,False,1,2.1.3.15,4,2009-12-15,2008-04-08,2020-12-02,...,256,107,1,44,298,32,287,8.660000000000001e-103,304.0,2.1.3.15
1,A0LEQ5,NUOK1_SYNFM,True,False,1,7.1.1.-,3,2009-12-15,2006-12-12,2020-12-02,...,98,51,0,4,101,5,102,3.8900000000000005e-23,87.4,7.1.1.-
2,Q0DIT2,BGL19_ORYSJ,True,False,1,3.2.1.21,4,2009-12-15,2006-10-17,2020-12-02,...,504,206,6,31,508,13,510,3.66e-182,523.0,3.2.1.21
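The row-by-row lookup in cell 5 took about four minutes for 17,722 hits, and it relies on `data_match` and `res_data` having the same row order. One possible alternative, sketched here on toy data with hypothetical IDs, builds an ID-to-EC lookup once and applies it with `Series.map`. It assumes each training ID maps to a single `ec_number` value.

```python
import pandas as pd

def attach_hit_ec(hits, train, hit_col="sseqid"):
    """Add an 'sresults_ec' column by mapping each hit ID to its training EC number."""
    # One lookup Series indexed by training ID; duplicates are dropped so the index is unique.
    id_to_ec = train.drop_duplicates("id").set_index("id")["ec_number"]
    out = hits.copy()
    out["sresults_ec"] = out[hit_col].map(id_to_ec)
    return out

# Toy frames with hypothetical IDs.
toy_train = pd.DataFrame({"id": ["T1", "T2"], "ec_number": ["2.1.3.15", "3.2.1.21"]})
toy_hits = pd.DataFrame({"id": ["Q1", "Q2", "Q3"], "sseqid": ["T2", "T1", "T2"]})
result = attach_hit_ec(toy_hits, toy_train)
print(result)
```

Because the mapping is keyed on the hit ID in each row, the result does not depend on how the rows are ordered after the merge.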


In [6]:
# Compute metrics
data_match['iscorrect'] = data_match[['ec_number', 'sresults_ec']].apply(lambda x: x['ec_number'] == x['sresults_ec'], axis=1) # check whether the EC numbers match
correct = sum(data_match['iscorrect'])
find  = len(data_match)
total = len(test)
print('Total query records are: {0}'.format(total))
print('Matched records are: {0}'.format(find))
print('Accuracy: {0}({1}/{2})'.format(correct/total, correct, total))
print('Precision: {0}({1}/{2})'.format(correct/find, correct, find))
print('Recall: {0}({1}/{2})'.format(find/total, find, total))

Total query records are: 20535
Matched records are: 17722
Accuracy: 0.6112490869247627(12552/20535)
Precision: 0.7082722040401761(12552/17722)
Recall: 0.8630143657170685(17722/20535)


## 4. Prediction with machine learning methods
### 4.1 One-hot + machine learning

In [7]:
trainset = train[['id', 'isMultiFunctional','seq', 'seqlength']].reset_index(drop=True)
testset = test[['id', 'isMultiFunctional','seq', 'seqlength']].reset_index(drop=True)

MAX_SEQ_LENGTH = 1500 # maximum sequence length; longer sequences are truncated, shorter ones padded with 'X'
trainset.seq = trainset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
testset.seq = testset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))

In [8]:
f_train = funclib.dna_onehot(trainset) # encode the training set
f_test = funclib.dna_onehot(testset) # encode the test set

train_full = pd.concat([trainset, f_train], axis=1, join='inner' ) # join training metadata and features
test_full = pd.concat([testset, f_test], axis=1, join='inner' )    # join test metadata and features

X_train = train_full.iloc[:,4:]
X_test = test_full.iloc[:,4:]
Y_train = train_full.isMultiFunctional.astype('int')
Y_test = test_full.isMultiFunctional.astype('int')

X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train)
Y_test = np.array(Y_test)
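The implementation of `funclib.dna_onehot` is not shown, so the following is only a hypothetical sketch of what padding followed by one-hot encoding of a protein sequence could look like. The 21-letter alphabet (20 standard amino acids plus `X`) and the flattening into a single vector are assumptions.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus 'X' for padding or unknown residues.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def pad_sequence(seq, max_len, pad_char="X"):
    """Truncate to max_len, then right-pad with pad_char (same expression as in cell 7)."""
    return seq[:max_len].ljust(max_len, pad_char)

def onehot_encode(seq, max_len):
    """Return a flat 0/1 vector of length max_len * len(ALPHABET)."""
    padded = pad_sequence(seq, max_len)
    encoded = np.zeros((max_len, len(ALPHABET)), dtype=np.uint8)
    for pos, ch in enumerate(padded):
        # Characters outside the alphabet fall back to the 'X' column.
        encoded[pos, CHAR_TO_INDEX.get(ch, CHAR_TO_INDEX["X"])] = 1
    return encoded.reshape(-1)

vec = onehot_encode("MKV", max_len=5)
print(pad_sequence("MKV", 5), vec.shape, int(vec.sum()))
```

With `MAX_SEQ_LENGTH = 1500`, an encoding of this kind yields 1500 × 21 = 31,500 mostly zero features per sequence, far more than the 1,900 feature columns loaded in section 2.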


In [9]:
funclib.run_baseline(X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.910640 	0.271739 		0.922299 	0.059988 	0.098280 	0.551935 	0.124858 	 tp: 100 fp: 268 fn: 1567 tn: 18600
xg 		0.918578 	0.490909 		0.924383 	0.080984 	0.139032 	0.626453 	0.193860 	 tp: 135 fp: 140 fn: 1532 tn: 18728
dt 		0.851668 	0.173378 		0.929386 	0.219556 	0.193753 	0.563515 	0.101421 	 tp: 366 fp: 1745 fn: 1301 tn: 17123
rf 		0.924032 	0.957265 		0.923842 	0.067187 	0.125561 	0.668622 	0.269895 	 tp: 112 fp: 5 fn: 1555 tn: 18863
gbdt 		0.917507 	0.398496 		0.920890 	0.031794 	0.058889 	0.601392 	0.161334 	 tp: 53 fp: 80 fn: 1614 tn: 18788


### 4.2 UniRep + machine learning

In [10]:
X_train = train.iloc[:,12:]
X_test = test.iloc[:,12:]

Y_train = train.iloc[:,3].astype('int')
Y_test = test.iloc[:,3].astype('int')

X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train).flatten()
Y_test = np.array(Y_test).flatten()

In [11]:
funclib.run_baseline(X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.906306 	0.268468 		0.924024 	0.089382 	0.134113 	0.658508 	0.161993 	 tp: 149 fp: 406 fn: 1518 tn: 18462
xg 		0.922669 	0.613181 		0.928019 	0.128374 	0.212302 	0.730793 	0.294830 	 tp: 214 fp: 135 fn: 1453 tn: 18733
dt 		0.867835 	0.204738 		0.930498 	0.217756 	0.211047 	0.571513 	0.108084 	 tp: 363 fp: 1410 fn: 1304 tn: 17458
rf 		0.923789 	0.947368 		0.923657 	0.064787 	0.121280 	0.693704 	0.297006 	 tp: 108 fp: 6 fn: 1559 tn: 18862
gbdt 		0.918578 	0.474747 		0.920728 	0.028194 	0.053228 	0.649263 	0.171179 	 tp: 47 fp: 52 fn: 1620 tn: 18816
