# Task 2. Predicting whether an enzyme is monofunctional or multifunctional

> author: Shizhenkun   
> email: zhenkun.shi@tib.cas.cn   
> date: 2021-04-28  

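## Task overview
Given an enzyme sequence, this task predicts whether the enzyme is monofunctional or multifunctional. The dataset is Sprot (UniProtKB/Swiss-Prot). Labels are derived from the annotations: an entry with exactly one EC number is treated as a monofunctional enzyme, and an entry with more than one EC number is treated as a multifunctional enzyme.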
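The labelling rule can be sketched as follows. This is a minimal illustration, assuming EC numbers are stored as a comma-separated string, as in the `ec_number` column shown later in this notebook. The helper name `count_ec_numbers` is hypothetical and not part of the original code.

```python
import pandas as pd

def count_ec_numbers(ec_field: str) -> int:
    """Count the EC numbers in a comma-separated annotation string (hypothetical helper)."""
    return len([ec for ec in ec_field.split(",") if ec.strip()])

# Toy records; IDs and EC strings are copied from rows displayed in this notebook
toy = pd.DataFrame({
    "id": ["P60175", "B4F119", "P00502"],
    "ec_number": ["5.3.1.1, 4.2.3.3", "2.1.1.199", "2.5.1.18, 1.11.1.-, 5.3.3.-"],
})
toy["functionCounts"] = toy["ec_number"].map(count_ec_numbers)
# More than one EC number means the entry is labelled multifunctional
toy["isMultiFunctional"] = toy["functionCounts"] > 1
print(toy)
```

Incomplete EC numbers such as `1.11.1.-` are counted like any other entry in this sketch; whether the original preprocessing treats them differently is not stated here.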


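## Dataset statistics
- Data source: Sprot, with 270,236 enzyme records in total, of which 253,107 are monofunctional and 17,129 are multifunctional.
- All records are sorted by date and split at the cutoff date fixed in Task 1, 9 February 2010.
- Records dated before 9 February 2010 form the training set and later records form the test set. The resulting statistics are: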


| Split        | Monofunctional | Multifunctional | Total                           |
| ------------ | -------------- | --------------- | ------------------------------: |
| Training set | 230448         | 15323           | 245771 (245771/270236 ≈ 90.95%) |
| Test set     | 22659          | 1806            | 24465 (24465/270236 ≈ 9.05%)    |
| Total        | 253107         | 17129           | 270236                          |
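The totals and proportions follow directly from the per-class counts. The short check below recomputes them; the variable names are illustrative only.

```python
# Per-class counts taken from the table above
train_counts = {"mono": 230448, "multi": 15323}
test_counts = {"mono": 22659, "multi": 1806}

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())
grand_total = train_total + test_total

# Share of each split in the full dataset, in percent
train_share = round(100 * train_total / grand_total, 2)
test_share = round(100 * test_total / grand_total, 2)

# Share of multifunctional enzymes overall, which shows the class imbalance
multi_share = round(100 * (train_counts["multi"] + test_counts["multi"]) / grand_total, 2)

print(train_total, train_share)
print(test_total, test_share)
print(grand_total, multi_share)
```

Because multifunctional enzymes are a small minority of the records, overall accuracy alone can look high even when few multifunctional enzymes are recognised.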


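## Methods

- Homology alignment: build a reference database from the training set, align each test sequence against it, and keep the best hit. The (monofunctional/multifunctional) label of the best hit is used as the prediction for the test sequence.
- Traditional machine learning methods
- Deep learning methods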
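The homology-alignment baseline amounts to transferring the label of the best hit to the query. The sketch below illustrates this on toy data. All IDs and labels are made up, and the column name `predicted_isMultiFunctional` is an assumption, not a name used in the notebook.

```python
import pandas as pd

# Toy training labels indexed by sequence ID (illustrative values only)
train_labels = pd.Series({"T1": False, "T2": True, "T3": False}, name="isMultiFunctional")

# Toy best hits: one row per query that found a hit (query Q3 found none)
hits = pd.DataFrame({"id": ["Q1", "Q2"], "sseqid": ["T2", "T1"]})

queries = pd.DataFrame({"id": ["Q1", "Q2", "Q3"]})

# A left join keeps queries without a hit; their prediction stays missing (NaN)
pred = queries.merge(hits, on="id", how="left")
# Transfer the label of the best-hit training sequence to the query
pred["predicted_isMultiFunctional"] = pred["sseqid"].map(train_labels)
print(pred)
```

Keeping unmatched queries visible, rather than dropping them with an inner join, makes it explicit that the method yields no prediction for them.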


## Results

### 1. No filtering by sequence length

| Method             | Accuracy                        | Precision                       | Recall                          |
| ------------------ | ------------------------------- | ------------------------------- | ------------------------------- |
| Homology alignment | 0.6314171758937392(17256/27329) | 0.7272727272727273(17256/23727) | 0.7595109699342543(41126/54148) |
| LR.                | 32                              |                                 |                                 |




In [None]:
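Note: multifunctional enzymes carry multiple EC numbers; check the linked entries for the corresponding functional regions (different blocks).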

In [93]:
sprot[sprot.isemzyme & sprot.isMultiFunctional]

Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength
6,P60175,True,True,2,"5.3.1.1, 4.2.3.3",1986-07-21,2007-01-23,2020-12-02,MAPSRKFFVGGNWKMNGRKQSLGELIGTLNAAKVPADTEVVCAPPT...,249
9,P00941,True,True,2,"5.3.1.1, 4.2.3.3",1986-07-21,1986-07-21,2020-12-02,APRKFFVGGNWKMNGDKKSLGELIQTLNAAKVPFTGEIVCAPPEAY...,247
11,P60174,True,True,2,"5.3.1.1, 4.2.3.3",1986-07-21,2020-10-07,2021-04-07,MAPSRKFFVGGNWKMNGRKQSLGELIGTLNAAKVPADTEVVCAPPT...,249
16,P00939,True,True,2,"5.3.1.1, 4.2.3.3",1986-07-21,2020-10-07,2021-04-07,MAPSRKFFVGGNWKMNGRKKNLGELITTLNAAKVPADTEVVCAPPT...,249
27,P00502,True,True,3,"2.5.1.18, 1.11.1.-, 5.3.3.-",1986-07-21,2007-01-23,2021-04-07,MSGKPVLHYFNARGRMECIRWLLAAAGVEFDEKFIQSPEDLEKLKK...,222
...,...,...,...,...,...,...,...,...,...,...
564556,F1M3J4,True,True,3,"7.6.2.-, 7.6.2.2, 7.6.2.3",2021-04-07,2015-07-22,2021-04-07,MLPVHTEVKPNPLQDANLCSRLFFWWLNPLFKAGHKRRLEEDDMFS...,1325
564563,A0A1X9ISH5,True,True,2,"4.2.3.183, 4.2.3.131",2021-04-07,2019-05-08,2021-04-07,MSTLKLIPFSTSIDKQFSGRTSILGGKCCLQIDGPKTTKKQSKILV...,782
564584,O62140,True,True,2,"1.3.3.-, 1.3.3.6",2021-04-07,1998-08-01,2021-04-07,MVHLNKTIQEGDNPDLTAERLTATFDTHAMAAQIYGGEMRARRRRE...,674
564604,A0A0Y0GRS3,True,True,2,"1.14.14.65, 1.14.14.60",2021-04-07,2016-04-13,2021-04-07,MDSFSLLAALFFISAATWFISSRRRRNLPPGPFPYPIVGNMLQLGA...,494


## 1. Import required packages

In [55]:
import numpy as np
import pandas as pd
import random
import time
import gzip
import re
import datetime
import sys
from functools import reduce

import matplotlib.pyplot as plt
from Bio import SeqIO

# Make the project-local helper module importable
sys.path.append("../../tools/")
import commontools

## 2. Load data

In [74]:
table_head = [  'id', 
                'isemzyme',
                'isMultiFunctional', 
                'functionCounts', 
                'ec_number', 
                'date_integraged',
                'date_sequence_update',
                'date_annotation_update',
                'seq', 
                'seqlength'
            ]

# Load the data and convert the date columns to datetime
sprot = pd.read_csv('../../data/sprot_full.tsv', sep='\t', names=table_head)  # read the file
sprot.date_integraged = pd.to_datetime(sprot['date_integraged'])
sprot.date_sequence_update = pd.to_datetime(sprot['date_sequence_update'])
sprot.date_annotation_update = pd.to_datetime(sprot['date_annotation_update'])

sprot.head(2)

Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength
0,P02802,False,False,1,-,1986-07-21,1986-07-21,2021-04-07,MDPNCSCSTGGSCTCTSSCACKNCKCTSCKKSCCSCCPVGCSKCAQ...,61
1,P02732,False,False,1,-,1986-07-21,1986-07-21,2019-12-11,AATAATAATAATAATAATAATAATAATAATA,31


## 3. Split into training and test sets

In [75]:
thres = datetime.datetime(2010, 2, 10, 0, 0)
# Training set
train = sprot[sprot.isemzyme & (sprot.date_integraged <= thres) ].sort_values(by='date_integraged')
# Test set
test = sprot[sprot.isemzyme & (sprot.date_integraged > thres) ].sort_values(by='date_integraged')

train.to_csv('./data/train.tsv', sep='\t', columns=['id', 'isemzyme','seq'], index=0)
test.to_csv('./data/test.tsv', sep='\t', columns=['id', 'isemzyme','seq'], index=0)

commontools.table2fasta(train, './data/train.fasta')
commontools.table2fasta(test, './data/test.fasta')

Write finished
Write finished
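`commontools.table2fasta` is a project-local helper whose source is not shown here. The sketch below is only a guess at what such a helper might do, assuming it writes one FASTA record per row with `id` as the header and `seq` as the sequence. The name `table2fasta_sketch` is hypothetical.

```python
import os
import tempfile

import pandas as pd

def table2fasta_sketch(table: pd.DataFrame, file_out: str) -> None:
    """Write one FASTA record per row (assumed behaviour, not the original helper)."""
    with open(file_out, "w") as handle:
        for seq_id, seq in zip(table["id"], table["seq"]):
            handle.write(f">{seq_id}\n{seq}\n")
    print("Write finished")

# Toy usage with made-up sequences, written to a temporary directory
toy = pd.DataFrame({"id": ["A1", "B2"], "seq": ["MKT", "AAT"]})
out_path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
table2fasta_sketch(toy, out_path)
```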


## 4. Prediction
### 4.1 Homology alignment

In [76]:
! diamond makedb --in ./data/train.fasta -d ./data/train.dmnd     # build the database
! diamond blastp -d ./data/train.dmnd  -q ./data/test.fasta -o ./data/test_fasta_results.tsv -b5 -c1 -k 1   # generate the alignment file

diamond v2.0.8.146 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

#CPU threads: 32
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: ./data/train.fasta
Opening the database file...  [0.043s]
Loading sequences...  [0.312s]
Masking sequences...  [0.248s]
Writing sequences...  [0.068s]
Hashing sequences...  [0.024s]
Loading sequences...  [0s]
Writing trailer...  [0s]
Closing the input file...  [0.003s]
Closing the database file...  [0.937s]
Database hash = 78d92e5ff6c43a419777788ac7605af0
Processed 245771 sequences, 98222973 letters.
Total time = 1.64s
diamond v2.0.8.146 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

#CPU threads: 32
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: ./data
#Target sequences to report alignments for: 

In [60]:
# Read the alignment results
res_data = pd.read_csv('./data/test_fasta_results.tsv', sep='\t', names=['id', 'sseqid', 'pident', 'length','mismatch','gapopen','qstart','qend','sstart','send','evalue','bitscore'])

# Join the test records with their alignment hits
data_match = pd.merge(test,res_data, on=['id'], how='inner')

# Add the multifunctional label of each hit, looked up in the training set
counter =0
resArray =[]
for i in range(len(res_data)):
    counter+=1
    mn = train[train['id']== res_data['sseqid'][i]]['isMultiFunctional'].values
    resArray.append(mn)
    if counter %1000 ==0:
        print(counter)  # progress indicator
data_match['sresults_isMultiFunctional']=np.array(resArray) 
data_match.head(3)




1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000


Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength,...,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore,sresults_isMultiFunctional
0,B4F119,True,False,1,2.1.1.199,2009-11-03,2008-09-23,2021-04-07,MTTNNFSHTSVLLDEAVNGLNIKPSGIYIDGTFGRGGHSRLILSQL...,315,...,315,68,1,1,315,1,314,4.4600000000000004e-175,489.0,False
1,C0H4W3,True,False,1,3.6.4.-,2009-11-03,2009-05-05,2021-04-07,MNEIKSESLLQTRPFKLGIEDIQNLGSSYFIENNEKLKKYNNEISS...,2082,...,1276,556,27,651,1921,903,1811,9.08e-112,401.0,False
2,A4IT43,True,False,1,1.2.1.10,2009-11-03,2009-11-03,2020-12-02,MAKVKVAILGSGNIGTDLMIKLERSSHLELTAMIGIDPESDGLKKA...,295,...,307,105,7,1,284,1,303,3.4e-108,318.0,False


In [77]:
# Compute metrics
# Check whether the predicted label matches the true label
data_match['iscorrect'] = data_match['isMultiFunctional'] == data_match['sresults_isMultiFunctional']
correct = sum(data_match['iscorrect'])
find  = len(data_match)  # test records with an alignment hit
total = len(test)        # all test records
print('Total query records are: {0}'.format(total))
print('Matched records are: {0}'.format(find))
print('Accuracy: {0}({1}/{2})'.format(correct/total, correct, total))
print('Precision: {0}({1}/{2})'.format(correct/find, correct, find))
print('Recall: {0}({1}/{2})'.format(find/total, find, total))

Total query records are: 24465
Matched records are: 23727
Accuracy: 0.9056202738606172(22156/24465)
Precision: 0.9337885109790534(22156/23727)
Recall: 0.9698344573881055(23727/24465)
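The three figures above are ratios of the same three counts. As computed in this notebook, "Recall" is the fraction of test records that obtained an alignment hit, so the sketch below calls it `coverage`. The function name `alignment_metrics` and that key name are illustrative choices, not part of the original code.

```python
def alignment_metrics(correct: int, matched: int, total: int) -> dict:
    """Ratios used in this notebook, from counts of correct, matched, and all queries."""
    return {
        "accuracy": correct / total,    # correct predictions over all test records
        "precision": correct / matched, # correct predictions over records with a hit
        "coverage": matched / total,    # records with a hit over all test records
    }

# Counts reported in the output above
m = alignment_metrics(correct=22156, matched=23727, total=24465)
print(m)

# With these definitions, accuracy is the product of precision and coverage
product = m["precision"] * m["coverage"]
print(abs(m["accuracy"] - product) < 1e-12)
```

Under these definitions, unmatched test records count as errors in accuracy but are excluded from precision.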


In [69]:
data_match

Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength,...,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore,sresults_isMultiFunctional,iscorrect
0,B4F119,True,False,1,2.1.1.199,2009-11-03,2008-09-23,2021-04-07,MTTNNFSHTSVLLDEAVNGLNIKPSGIYIDGTFGRGGHSRLILSQL...,315,...,68,1,1,315,1,314,4.460000e-175,489.0,False,True
1,C0H4W3,True,False,1,3.6.4.-,2009-11-03,2009-05-05,2021-04-07,MNEIKSESLLQTRPFKLGIEDIQNLGSSYFIENNEKLKKYNNEISS...,2082,...,556,27,651,1921,903,1811,9.080000e-112,401.0,False,True
2,A4IT43,True,False,1,1.2.1.10,2009-11-03,2009-11-03,2020-12-02,MAKVKVAILGSGNIGTDLMIKLERSSHLELTAMIGIDPESDGLKKA...,295,...,105,7,1,284,1,303,3.400000e-108,318.0,False,True
3,C1AHZ3,True,True,2,"1.2.1.87, 1.2.1.10",2009-11-03,2009-05-26,2021-04-07,MPSKAKVAIVGSGNISTDLLYKLLRSEWLEPRWMVGIDPESDGLAR...,303,...,124,3,4,301,5,314,7.980000e-109,320.0,False,False
4,B4J340,True,False,1,3.1.-.-,2009-11-03,2008-09-23,2020-12-02,MARIQTVLGSITPNLLGRTLTHEHVAMDFEHFYKPPPADFQSELEQ...,350,...,54,0,1,348,1,348,1.200000e-229,630.0,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23722,A0A2L0P0K3,True,False,1,6.3.2.-,2021-04-07,2018-04-25,2021-04-07,MAGEGAEIAIIGSGCRFPGNASSPSKLWALLQNPTSVASKVPALGG...,2979,...,1842,69,8,2963,11,3037,0.000000e+00,1520.0,False,True
23723,U3LW50,True,True,2,"4.2.3.173, 4.2.3.92",2021-04-07,2013-12-11,2021-04-07,MATSAVVNCLGGVRPHTIRYEPNMWTHTFSNFSIDEQVQGEYAEEI...,555,...,314,5,1,539,1,531,6.160000e-140,418.0,False,False
23724,P0DUG4,True,False,1,3.1.1.4,2021-04-07,2021-04-07,2021-04-07,NVYQYRKMLQCAMPNGGPFECCQTHDNCYGEAEKLKACTSTHSSPYFK,48,...,15,3,1,48,1,65,4.690000e-10,52.4,False,True
23725,Q17577,True,False,1,6.2.1.-,2021-04-07,1996-11-01,2021-04-07,MIFHGEQLENHDKPVHEVLLERFKVHQEKDPDNVAFVTAENEDDSL...,540,...,297,11,44,528,53,536,1.770000e-77,256.0,False,True


In [70]:
data_match[data_match.isMultiFunctional & data_match.sresults_isMultiFunctional]

Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength,...,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore,sresults_isMultiFunctional,iscorrect
211,Q55EU7,True,True,2,"2.5.1.10, 2.5.1.1",2009-11-03,2005-05-24,2020-12-02,MGSHYKIDLENEREEFIKVYSILKNDVFDELPKMKLASNAIDYIKE...,393,...,202,10,13,393,5,342,1.330000e-65,214.0,True,True
236,C3K2F6,True,True,2,"2.1.1.-, 2.1.1.35",2009-11-03,2009-06-16,2021-04-07,MTFDAARYTAQLQDKVTRLRDLLAPFDAPEPQVFDSPLQNFRLRAE...,359,...,13,0,1,359,5,363,6.520000e-260,707.0,True,True
239,Q7XZQ6,True,True,2,"1.14.11.9, 1.14.20.6",2009-11-03,2003-10-01,2020-12-02,MEVERVQAISKMSRCMDTIPSEYIRSESEQPAVTTMQGVVLQVPVI...,337,...,76,2,1,337,13,348,1.070000e-196,546.0,True,True
256,A1RP91,True,True,2,"2.1.1.-, 2.1.1.35",2009-11-03,2007-02-06,2021-04-07,MNLAAMDPQTYDAQLEHKRIKLEQAFAQFETPSVEVFASEPANYRM...,365,...,0,0,1,364,1,364,1.290000e-266,724.0,True,True
274,B8CH56,True,True,2,"2.1.1.-, 2.1.1.35",2009-11-03,2009-11-03,2020-12-02,MNLAAMDPLTYDVQLEAKRVKLKQLFTDFDTPELESFSSETAHYRM...,365,...,24,0,1,365,1,365,7.540000e-252,687.0,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23645,F1M3J4,True,True,3,"7.6.2.-, 7.6.2.2, 7.6.2.3",2021-04-07,2015-07-22,2021-04-07,MLPVHTEVKPNPLQDANLCSRLFFWWLNPLFKAGHKRRLEEDDMFS...,1325,...,177,0,1,1325,1,1325,0.000000e+00,2271.0,True,True
23699,Q9VUK8,True,True,2,"6.1.1.14, 2.7.7.-",2021-04-07,2000-05-01,2021-04-07,MSLQLLKALPHLRSATTAVRTQIARTTWSEHIATKVFFSTTTTKPT...,765,...,235,2,67,765,33,728,0.000000e+00,953.0,True,True
23707,A0A0G2K1Q8,True,True,2,"7.6.2.1, 7.6.2.2",2021-04-07,2015-07-22,2021-04-07,MVVLRQLRLLLWKNYTLKKRKVLVTVLELFLPLLFSGILIWLRLKI...,1704,...,69,0,1,1704,1,1704,0.000000e+00,3215.0,True,True
23711,Q7PRG3,True,True,2,"2.6.1.63, 2.6.1.44",2021-04-07,2007-01-09,2021-04-07,MKFTPPPASLRNPLIIPEKIMMGPGPSNCSKRVLTAMTNTVLSNFH...,396,...,201,0,6,381,32,407,2.680000e-127,375.0,True,True


In [64]:
data_match[data_match.isMultiFunctional]

Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength,...,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore,sresults_isMultiFunctional,iscorrect
3,C1AHZ3,True,True,2,"1.2.1.87, 1.2.1.10",2009-11-03,2009-05-26,2021-04-07,MPSKAKVAIVGSGNISTDLLYKLLRSEWLEPRWMVGIDPESDGLAR...,303,...,124,3,4,301,5,314,7.980000e-109,320.0,False,False
9,A1KPM0,True,True,2,"1.2.1.87, 1.2.1.10",2009-11-03,2007-02-06,2021-04-07,MPSKAKVAIVGSGNISTDLLYKLLRSEWLEPRWMVGIDPESDGLAR...,303,...,124,3,4,301,5,314,7.980000e-109,320.0,False,False
10,Q7TTR4,True,True,2,"1.2.1.87, 1.2.1.10",2009-11-03,2003-10-01,2021-04-07,MPSKAKVAIVGSGNISTDLLYKLLRSEWLEPRWMVGIDPESDGLAR...,303,...,124,3,4,301,5,314,7.980000e-109,320.0,False,False
135,A5U8K9,True,True,2,"1.2.1.87, 1.2.1.10",2009-11-03,2007-07-10,2021-02-10,MPSKAKVAIVGSGNISTDLLYKLLRSEWLEPRWMVGIDPESDGLAR...,303,...,124,3,4,301,5,314,7.980000e-109,320.0,False,False
211,Q55EU7,True,True,2,"2.5.1.10, 2.5.1.1",2009-11-03,2005-05-24,2020-12-02,MGSHYKIDLENEREEFIKVYSILKNDVFDELPKMKLASNAIDYIKE...,393,...,202,10,13,393,5,342,1.330000e-65,214.0,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23708,U3LVL5,True,True,2,"4.2.3.75, 4.2.3.100",2021-04-07,2013-12-11,2021-04-07,MEIHSSVVPAITNVKSLDEIRRSAKFHPTVWGDYFLAYNSDNTEIT...,549,...,286,8,18,549,19,554,1.980000e-160,470.0,False,False
23711,Q7PRG3,True,True,2,"2.6.1.63, 2.6.1.44",2021-04-07,2007-01-09,2021-04-07,MKFTPPPASLRNPLIIPEKIMMGPGPSNCSKRVLTAMTNTVLSNFH...,396,...,201,0,6,381,32,407,2.680000e-127,375.0,True,True
23712,Q0IG34,True,True,2,"2.6.1.63, 2.6.1.44",2021-04-07,2006-10-03,2021-04-07,MKFTPPPSSLRGPLVIPDKIMMGPGPSNCSKRVLAALNNTCLSNFH...,400,...,204,0,6,381,32,407,1.630000e-124,368.0,True,True
23723,U3LW50,True,True,2,"4.2.3.173, 4.2.3.92",2021-04-07,2013-12-11,2021-04-07,MATSAVVNCLGGVRPHTIRYEPNMWTHTFSNFSIDEQVQGEYAEEI...,555,...,314,5,1,539,1,531,6.160000e-140,418.0,False,False
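The filtered views above correspond to cells of a confusion matrix over the true and predicted labels. A minimal sketch on toy data is given below; the label values are made up and only the column names follow the notebook.

```python
import pandas as pd

# Toy true and predicted labels (illustrative values only)
toy = pd.DataFrame({
    "isMultiFunctional":          [False, False, True, True, False, True],
    "sresults_isMultiFunctional": [False, False, False, True, True, True],
})

# Full confusion matrix: rows are true labels, columns are predicted labels
confusion = pd.crosstab(
    toy["isMultiFunctional"],
    toy["sresults_isMultiFunctional"],
    rownames=["actual"],
    colnames=["predicted"],
)
print(confusion)

actual = toy["isMultiFunctional"]
predicted = toy["sresults_isMultiFunctional"]

# Same boolean masks as the filtered views in the notebook
tp = int((actual & predicted).sum())   # multifunctional, predicted multifunctional
fn = int((actual & ~predicted).sum())  # multifunctional, predicted monofunctional

# Recall restricted to the multifunctional class
multi_recall = tp / (tp + fn)
print(tp, fn, multi_recall)
```

Applied to `data_match`, the same masks would show how many multifunctional test enzymes are recovered by the best-hit transfer, which the overall accuracy does not reveal on its own.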


In [73]:
22156/27329

0.810713893666069

In [72]:
test

Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength
505585,B4F119,True,False,1,2.1.1.199,2009-11-03,2008-09-23,2021-04-07,MTTNNFSHTSVLLDEAVNGLNIKPSGIYIDGTFGRGGHSRLILSQL...,315
506901,C0H4W3,True,False,1,3.6.4.-,2009-11-03,2009-05-05,2021-04-07,MNEIKSESLLQTRPFKLGIEDIQNLGSSYFIENNEKLKKYNNEISS...,2082
506899,A4IT43,True,False,1,1.2.1.10,2009-11-03,2009-11-03,2020-12-02,MAKVKVAILGSGNIGTDLMIKLERSSHLELTAMIGIDPESDGLKKA...,295
506895,C1AHZ3,True,True,2,"1.2.1.87, 1.2.1.10",2009-11-03,2009-05-26,2021-04-07,MPSKAKVAIVGSGNISTDLLYKLLRSEWLEPRWMVGIDPESDGLAR...,303
506885,B4J340,True,False,1,3.1.-.-,2009-11-03,2008-09-23,2020-12-02,MARIQTVLGSITPNLLGRTLTHEHVAMDFEHFYKPPPADFQSELEQ...,350
...,...,...,...,...,...,...,...,...,...,...
564419,Q17577,True,False,1,6.2.1.-,2021-04-07,1996-11-01,2021-04-07,MIFHGEQLENHDKPVHEVLLERFKVHQEKDPDNVAFVTAENEDDSL...,540
564426,P0DUH5,True,False,1,3.5.4.-,2021-04-07,2021-04-07,2021-04-07,MYEAARVTDPIDHTSALAGFLVGAVLGIALIAAVAFATFTCGFGVA...,1427
564430,A0A2L0P0L5,True,False,1,1.-.-.-,2021-04-07,2018-04-25,2021-04-07,MTPTTLPTSQRAVQQDGNGKLHVANNAAIPSLLPGHVLVKTYAVAL...,368
564401,O17679,True,True,2,"2.1.1.-, 2.1.1.366",2021-04-07,2002-10-01,2021-04-07,MERSRTGGSSTYSMSTANVPIITISDDDDELTIWEEPRKSPISSNS...,708
