# Task1. 预测酶是单功能酶还是多功能酶

> author: Shizhenkun   
> email: zhenkun.shi@tib.cas.cn   
> date: 2021-04-28  

## 任务简介
该任务通过给定酶序列预测该酶是单功能酶还是多功能酶。本任务所使用的数据集为Sport，对数据集的数据中进行学习，数据中只有1个EC号被认为是单功能酶，有多个EC号的被认为是多功能酶。


## 数据统计
- 数据源Sprot，共有酶数据270,236条，其中单功能酶253,107条，多功能酶17,129条。
- 将数据集中的所有数据按照时间排序，～90%作为训练集，～10%作为测试集，找到对应时间节点为2009年10月13日。
- 以2009年10月13日为时间节点，之前的数据为训练集，之后的数据为测试集，具体数据集统计如下： 


|     Items    | 单功能酶  |  多功能酶   |合计                            |
| ------------ | --------| ---------- |-------------------------------|
| 训练集        | 227669  | 15238      | 242907（242907/270236≈89.89%） |
| 测试集        | 25438   | 1891       | 27329（27329/270236≈10.11%）    |


## 实验方法

- 同源比对：使用训练集建立比对库，然后将测试集与之比对，取最优的比对结果，比对结果的（酶/非酶）属性当作测试集的测试结果
- 传统机器学习方法
- 深度学习方法


## 实验结果

### 1. 未对序列长度进行过滤

|Methods   | Accuracy                        |             Precision           |           Recall               |
| ---------| ------------------------------- | ------------------------------- |--------------------------------|
| 同源比对  |  0.6314171758937392(17256/27329) | 0.7272727272727273(17256/23727) |0.7595109699342543(41126/54148)|
| LR.      |  32                             |                                  |                             |




In [57]:
Accuracy: 0.6314171758937392(17256/27329)
Pricision: 0.7272727272727273(17256/23727)
Recall: 0.8681986168538914(23727/27329)

  Accuracy: 0.6314171758937392(17256/27329)


TypeError: 'float' object is not callable

## 1. 导入必要的包

In [55]:
import numpy as np
import pandas as pd
import random
import time
import gzip
import re
from Bio import SeqIO
import datetime
import sys

from functools import reduce


import matplotlib.pyplot as plt

import random
import sys
sys.path.append("../../tools/")
import commontools

## 2. 加载数据

In [3]:
table_head = [  'id', 
                'isemzyme',
                'isMultiFunctional', 
                'functionCounts', 
                'ec_number', 
                'date_integraged',
                'date_sequence_update',
                'date_annotation_update',
                'seq', 
                'seqlength'
            ]

#加载数据并转换时间格式
sprot = pd.read_csv('../../data/sprot_full.tsv', sep='\t',names=table_head) #读入文件
sprot.date_integraged = pd.to_datetime(sprot['date_integraged'])
sprot.date_sequence_update = pd.to_datetime(sprot['date_sequence_update'])
sprot.date_annotation_update = pd.to_datetime(sprot['date_annotation_update'])

sprot.head(2)

Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength
0,P02802,False,False,1,-,1986-07-21,1986-07-21,2021-04-07,MDPNCSCSTGGSCTCTSSCACKNCKCTSCKKSCCSCCPVGCSKCAQ...,61
1,P02732,False,False,1,-,1986-07-21,1986-07-21,2019-12-11,AATAATAATAATAATAATAATAATAATAATA,31


## 3. 划分训练集、测试集

In [54]:
thres = datetime.datetime(2009, 10, 13, 0, 0)
#训练集
train = sprot[sprot.isemzyme & (sprot.date_integraged <= thres) ].sort_values(by='date_integraged')
#测试集
test = sprot[sprot.isemzyme & (sprot.date_integraged > thres) ].sort_values(by='date_integraged')

train.to_csv('./data/train.tsv', sep='\t', columns=['id', 'isemzyme','seq'], index=0)
test.to_csv('./data/test.tsv', sep='\t', columns=['id', 'isemzyme','seq'], index=0)

commontools.table2fasta(train, './data/train.fasta')
commontools.table2fasta(test, './data/test.fasta')

Write finished
Write finished


## 4. 预测
### 4.1 同源比对

In [None]:
! diamond makedb --in ./data/train.fasta -d ./data/train.dmnd     #建库
! diamond blastp -d ./data/train.dmnd  -q ./data/test.fasta -o ./data/test_fasta_results.tsv -b5 -c1 -k 1   #生成比对文件



In [None]:
#读入比对结果
res_data = pd.read_csv('./data/test_fasta_results.tsv', sep='\t', names=['id', 'sseqid', 'pident', 'length','mismatch','gapopen','qstart','qend','sstart','send','evalue','bitscore'])

#匹配查询结果
data_match = pd.merge(test,res_data, on=['id'], how='inner')

# 添加查询结果的EC号
counter =0
resArray =[]
for i in range(len(res_data)):
    counter+=1
    mn = train[train['id']== res_data['sseqid'][i]]['isMultiFunctional'].values
    resArray.append(mn)
    if counter %1000 ==0:
        print(counter)
data_match['sresults_isMultiFunctional']=np.array(resArray) 
data_match.head(3)




1000


In [None]:
# 计算指标
data_match['iscorrect'] = data_match[['ec_number', 'sresults_ec']].apply(lambda x: x['ec_number'] == x['sresults_ec'], axis=1) #判断EC号是否一致
correct = sum(data_match['iscorrect'])
find  = len(data_match)
total = len(test)
print('Total query records are: {0}'.format(total))
print('Matched records are: {0}'.format(find))
print('Accuracy: {0}({1}/{2})'.format(correct/total, correct, total))
print('Pricision: {0}({1}/{2})'.format(correct/find, correct, find))
print('Recall: {0}({1}/{2})'.format(find/total, find, total))

In [59]:
data_match

Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength,...,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore,sresults_ec
0,B4F119,True,False,1,2.1.1.199,2009-11-03,2008-09-23,2021-04-07,MTTNNFSHTSVLLDEAVNGLNIKPSGIYIDGTFGRGGHSRLILSQL...,315,...,315,68,1,1,315,1,314,4.460000e-175,489.0,2.1.1.199
1,C0H4W3,True,False,1,3.6.4.-,2009-11-03,2009-05-05,2021-04-07,MNEIKSESLLQTRPFKLGIEDIQNLGSSYFIENNEKLKKYNNEISS...,2082,...,1276,556,27,651,1921,903,1811,9.080000e-112,401.0,3.6.4.-
2,A4IT43,True,False,1,1.2.1.10,2009-11-03,2009-11-03,2020-12-02,MAKVKVAILGSGNIGTDLMIKLERSSHLELTAMIGIDPESDGLKKA...,295,...,307,105,7,1,284,1,303,3.400000e-108,318.0,1.2.1.10
3,C1AHZ3,True,True,2,"1.2.1.87, 1.2.1.10",2009-11-03,2009-05-26,2021-04-07,MPSKAKVAIVGSGNISTDLLYKLLRSEWLEPRWMVGIDPESDGLAR...,303,...,310,124,3,4,301,5,314,7.980000e-109,320.0,1.2.1.10
4,B4J340,True,False,1,3.1.-.-,2009-11-03,2008-09-23,2020-12-02,MARIQTVLGSITPNLLGRTLTHEHVAMDFEHFYKPPPADFQSELEQ...,350,...,348,54,0,1,348,1,348,1.200000e-229,630.0,3.1.-.-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23722,A0A2L0P0K3,True,False,1,6.3.2.-,2021-04-07,2018-04-25,2021-04-07,MAGEGAEIAIIGSGCRFPGNASSPSKLWALLQNPTSVASKVPALGG...,2979,...,3092,1842,69,8,2963,11,3037,0.000000e+00,1520.0,2.3.1.161
23723,U3LW50,True,True,2,"4.2.3.173, 4.2.3.92",2021-04-07,2013-12-11,2021-04-07,MATSAVVNCLGGVRPHTIRYEPNMWTHTFSNFSIDEQVQGEYAEEI...,555,...,540,314,5,1,539,1,531,6.160000e-140,418.0,4.2.3.61
23724,P0DUG4,True,False,1,3.1.1.4,2021-04-07,2021-04-07,2021-04-07,NVYQYRKMLQCAMPNGGPFECCQTHDNCYGEAEKLKACTSTHSSPYFK,48,...,70,15,3,1,48,1,65,4.690000e-10,52.4,3.1.1.4
23725,Q17577,True,False,1,6.2.1.-,2021-04-07,1996-11-01,2021-04-07,MIFHGEQLENHDKPVHEVLLERFKVHQEKDPDNVAFVTAENEDDSL...,540,...,502,297,11,44,528,53,536,1.770000e-77,256.0,6.2.1.-
