# Task 1. Predicting whether a protein is an enzyme or a non-enzyme

> author: Shizhenkun   
> email: zhenkun.shi@tib.cas.cn   
> date: 2021-04-27  

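## Task overview
Given a protein sequence, this task predicts whether the protein is an enzyme or a non-enzyme. The dataset is Sprot (Swiss-Prot): entries annotated with an EC number are treated as enzymes, and entries without an EC number are treated as non-enzymes.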
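
The labelling rule can be sketched as follows. This is a minimal illustration, assuming that a missing EC number is stored as `-` (as in the sample rows printed later in this notebook); the helper name `is_enzyme` is hypothetical and not part of the original code.

```python
def is_enzyme(ec_number: str) -> bool:
    """Return True when the entry carries at least one EC number.

    Assumption: entries without an EC number store '-' (or an empty string).
    """
    return ec_number.strip() not in ("", "-")


# Toy examples (not real Sprot records)
examples = ["-", "1.1.1.1", "2.7.11.1, 3.1.3.16"]
labels = [is_enzyme(ec) for ec in examples]
print(labels)
```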


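## Data statistics
- Data source: Sprot, 564,638 entries in total; 270,236 have an EC number and 294,402 do not.
- All entries are sorted by date; about 90% are used as the training set and about 10% as the test set. The corresponding cutoff date is 2010-02-09.
- With 2010-02-10 as the cutoff, entries before it form the training set and entries after it form the test set. The resulting statistics are: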





| Items        | Enzyme | Non-enzyme | Total                            |
| ------------ | ------ | ---------- | -------------------------------- |
| Training set | 245771 | 264719     | 510490 (510490/564638 ≈ 90.41%)  |
| Test set     | 24465  | 29683      | 54148 (54148/564638 ≈ 9.59%)     |
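
A cutoff date of this kind can be located by sorting the integration dates and reading the value at the 90% position. The sketch below uses made-up dates; the function name `find_cutoff` and the toy data are assumptions, not the notebook's own code.

```python
import datetime


def find_cutoff(dates, train_fraction=0.9):
    """Return the date at the train_fraction position of the sorted dates."""
    ordered = sorted(dates)
    # Index of the last record that should still fall in the training set
    idx = max(int(len(ordered) * train_fraction) - 1, 0)
    return ordered[idx]


# Ten toy integration dates, one per year
toy_dates = [datetime.date(2000 + i, 1, 1) for i in range(10)]
cutoff = find_cutoff(toy_dates)
train_dates = [d for d in toy_dates if d <= cutoff]
test_dates = [d for d in toy_dates if d > cutoff]
print(cutoff, len(train_dates), len(test_dates))
```

Because many real entries share the same integration date, a date-based cut can only approximate the target ratio, which is consistent with the 90.41% / 9.59% split in the table.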


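## Methods

- Homology alignment: build an alignment database from the training set, align the test set against it, take the best hit, and use the hit's enzyme/non-enzyme label as the prediction for the test sequence.
- Traditional machine learning methods
- Deep learning methods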

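CUTOFF -< aad Some proteins are enzymes but have no EC number, so these labels need to be re-determined (to improve data quality). EC numbers obtained from different data sources may differ (there are examples of this), so a joint comparison across multiple data sources is needed, which requires a golden standard.

Ground truth is needed.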

## Results

### 1. No sequence length filtering

| Methods | Accuracy | Precision | Recall | F1 | Group |
| ------- | -------- | --------- | ------ | -- | ----- |
| Homology alignment | 0.6243628573539189 (33808/54148) | 0.8220590380781014 (33808/41126) | 0.7595109699342543 (41126/54148) |  | NO LENGTH FILTERING |
| lr | 0.619377 | 0.562195 | 0.712160 | 0.628354 | NO LENGTH FILTERING |
| xg | 0.678363 | 0.647462 | 0.632536 | 0.639912 | NO LENGTH FILTERING |
| dt | 0.640725 | 0.605286 | 0.588759 | 0.596909 | NO LENGTH FILTERING |
| rf | 0.696905 | 0.654373 | 0.697650 | 0.675319 | NO LENGTH FILTERING |
| gbdt | 0.633024 | 0.571013 | 0.754956 | 0.650226 | NO LENGTH FILTERING |

ECPred command:
java -jar ECPred.jar weighted /home/shizhenkun/codebase/uniprot/data/sprot_with_ec_query.fasta /home/shizhenkun/codebase/uniprot/deeppred/ECPred/ /home/shizhenkun/codebase/uniprot/temp/ /home/shizhenkun/codebase/uniprot/results/ecpred/sprot_with_ec_query_ecpred_results.tsv


Sequence length 400


| baselineName | accuracy | precision(PPV) | NPV | recall | f1 | auroc | auprc | confusion matrix |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| lr | 0.622017 | 0.565314 | 0.695735 | 0.707214 | 0.628352 | 0.679701 | 0.574374 | tp: 17302 fp: 13304 fn: 7163 tn: 16379 |
| xg | 0.677532 | 0.647106 | 0.701404 | 0.629675 | 0.638271 | 0.755010 | 0.687809 | tp: 15405 fp: 8401 fn: 9060 tn: 21282 |
| dt | 0.639174 | 0.603897 | 0.666656 | 0.585285 | 0.594445 | 0.634438 | 0.540833 | tp: 14319 fp: 9392 fn: 10146 tn: 20291 |
| rf | 0.696997 | 0.654997 | 0.735775 | 0.695933 | 0.674845 | 0.790091 | 0.739525 | tp: 17026 fp: 8968 fn: 7439 tn: 20715 |


Sequence length 700

| baselineName | accuracy | precision(PPV) | NPV | recall | f1 | auroc | auprc | confusion matrix |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| lr | 0.622017 | 0.565314 | 0.695735 | 0.707214 | 0.628352 | 0.679701 | 0.574374 | tp: 17302 fp: 13304 fn: 7163 tn: 16379 |
| xg | 0.677532 | 0.647106 | 0.701404 | 0.629675 | 0.638271 | 0.755010 | 0.687809 | tp: 15405 fp: 8401 fn: 9060 tn: 21282 |
| dt | 0.639174 | 0.603897 | 0.666656 | 0.585285 | 0.594445 | 0.634438 | 0.540833 | tp: 14319 fp: 9392 fn: 10146 tn: 20291 |
| rf | 0.696997 | 0.654997 | 0.735775 | 0.695933 | 0.674845 | 0.790091 | 0.739525 | tp: 17026 fp: 8968 fn: 7439 tn: 20715 |
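
The metric columns can be recomputed from the confusion counts in the last column. The sketch below does this for the `lr` row using the standard definitions; the function name `metrics_from_confusion` is hypothetical, and `auroc`/`auprc` are omitted because they need the predicted scores, which are not listed here.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "npv": tn / (tn + fn),
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Confusion counts of the lr row
lr = metrics_from_confusion(tp=17302, fp=13304, fn=7163, tn=16379)
for name, value in lr.items():
    print(f"{name}: {value:.6f}")
```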

## 1. Import required packages and define common functions

In [36]:
import numpy as np
import pandas as pd
import random
import time
import gzip
import re
from Bio import SeqIO
import datetime
import sys

from functools import reduce
import matplotlib.pyplot as plt

sys.path.append("../../tools/")
import commontools
import funclib

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.faker import Faker
from pyecharts.globals import ThemeType

# Thres >=100 <=600|500

## 2. Load data

In [13]:
table_head = [  'id', 
                'isemzyme',
                'isMultiFunctional', 
                'functionCounts', 
                'ec_number', 
                'date_integraged',
                'date_sequence_update',
                'date_annotation_update',
                'seq', 
                'seqlength'
            ]

# Load the data and convert the date columns to datetime
sprot = pd.read_csv('../../data/sprot_full.tsv', sep='\t',names=table_head)  # read the file
sprot.date_integraged = pd.to_datetime(sprot['date_integraged'])
sprot.date_sequence_update = pd.to_datetime(sprot['date_sequence_update'])
sprot.date_annotation_update = pd.to_datetime(sprot['date_annotation_update'])

sprot.head(2)

Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength
0,P02802,False,False,1,-,1986-07-21,1986-07-21,2021-04-07,MDPNCSCSTGGSCTCTSSCACKNCKCTSCKKSCCSCCPVGCSKCAQ...,61
1,P02732,False,False,1,-,1986-07-21,1986-07-21,2019-12-11,AATAATAATAATAATAATAATAATAATAATA,31


In [14]:
sprot_length = sprot.seqlength.value_counts(ascending=True)
sprot_length = pd.DataFrame(sprot_length)
sprot_length['length'] = sprot_length.index
sprot_length = sprot_length.rename(columns={'seqlength':'count', 'length':'length'})
sprot_length=sprot_length.sort_values(by='length', ascending=True)
sprot_length=sprot_length.loc[:,['length','count']]
sprot_length.head(3)

Unnamed: 0,length,count
2,2,2
3,3,5
4,4,22
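
An equivalent, more direct way to build the same length/count table is `groupby(...).size()`, which does not rely on how `value_counts` names its result. A small sketch on made-up data (the toy DataFrame is an assumption standing in for `sprot`):

```python
import pandas as pd

# Toy stand-in for the `sprot` table; only the `seqlength` column matters here
toy = pd.DataFrame({"seqlength": [61, 31, 61, 134, 74, 74, 74]})

length_counts = (
    toy.groupby("seqlength")
       .size()                                   # number of sequences per length
       .reset_index(name="count")                # back to a two-column table
       .rename(columns={"seqlength": "length"})
       .sort_values("length")
       .reset_index(drop=True)
)
print(length_counts)
```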


In [37]:
bar = (
    Bar(init_opts=opts.InitOpts(width="1700px",
                                height="750px",
                                page_title="Sequence length distribution",
                                theme=ThemeType.CHALK))
    .add_xaxis(list(sprot_length.length))
    .add_yaxis("Sequence length", list(sprot_length['count']))
#     .add_yaxis("Merchant B", [15, 6, 45, 20, 35, 66])
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Bar-DataZoom (slider, horizontal)"),
        datazoom_opts=opts.DataZoomOpts()
    )
)

# Load JavaScript
bar.load_javascript()
bar.render_notebook()


In [39]:
sprot_length

Unnamed: 0,length,count
2,2,2
3,3,5
4,4,22
5,5,41
6,6,35
...,...,...
14507,14507,1
18141,18141,1
18562,18562,1
34350,34350,1


## 3. Split into training and test sets

In [4]:
thres = datetime.datetime(2010, 2, 10, 0, 0)

# Training set
train = sprot[sprot.date_integraged <= thres ].sort_values(by='date_integraged')
# Test set
test = sprot[sprot.date_integraged > thres ].sort_values(by='date_integraged')

# train.to_csv('./data/train.tsv', sep='\t', columns=['id', 'isemzyme','seq'], index=0)
# test.to_csv('./data/test.tsv', sep='\t', columns=['id', 'isemzyme','seq'], index=0)

# table2fasta(train, './data/train.fasta')
# table2fasta(test, './data/test.fasta')
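
`table2fasta` is referenced in the commented-out lines but its definition is not shown in this notebook. A hypothetical minimal version, assuming it writes one FASTA record per row from the `id` and `seq` columns, might look like this:

```python
import os
import tempfile

import pandas as pd


def table2fasta(table, file_out):
    """Write each row of `table` as a FASTA record: a '>id' line, then the sequence.

    Hypothetical stand-in for the helper referenced in this notebook.
    """
    with open(file_out, "w") as handle:
        for seq_id, seq in zip(table["id"], table["seq"]):
            handle.write(f">{seq_id}\n{seq}\n")


# Toy table (not real Sprot records)
toy = pd.DataFrame({"id": ["P1", "P2"], "seq": ["MKV", "AATA"]})
out_path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
table2fasta(toy, out_path)

with open(out_path) as handle:
    fasta_text = handle.read()
print(fasta_text)
```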

## 4. Binary classification
### 4.1 Homology alignment

In [5]:
! diamond makedb --in ./data/train.fasta -d ./data/train.dmnd     # build the database
! diamond blastp -d ./data/train.dmnd  -q ./data/test.fasta -o ./data/test_fasta_results.tsv -b5 -c1 -k 1   # generate the alignment file

diamond v2.0.8.146 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

#CPU threads: 32
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: ./data/train.fasta
Opening the database file...  [0.04s]
Loading sequences...  [0.594s]
Masking sequences...  [0.448s]
Writing sequences...  [0.108s]
Hashing sequences...  [0.048s]
Loading sequences...  [0s]
Writing trailer...  [0.002s]
Closing the input file...  [0.004s]
Closing the database file...  [1.761s]
Database hash = c2598be544ca9c047fa6890d99402377
Processed 510490 sequences, 180136482 letters.
Total time = 3.008s
diamond v2.0.8.146 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

#CPU threads: 32
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: ./data
#Target sequences to report alignments 

In [6]:
# Read the alignment results
res_data = pd.read_csv('./data/test_fasta_results.tsv', sep='\t', names=['id', 'sseqid', 'pident', 'length','mismatch','gapopen','qstart','qend','sstart','send','evalue','bitscore'])

# Merge the hits onto the test set
data_match = pd.merge(test,res_data, on=['id'], how='inner')

In [7]:
# Add the EC number of each best hit
counter =0
resArray =[]
for i in range(len(res_data)):
    counter+=1
    mn = train[train['id']== res_data['sseqid'][i]]['ec_number'].values
    resArray.append(mn)
    if counter %1000 ==0:
        print(counter)
data_match['sresults_ec']=np.array(resArray) 
data_match.head(3)

1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000


Unnamed: 0,id,isemzyme,isMultiFunctional,functionCounts,ec_number,date_integraged,date_sequence_update,date_annotation_update,seq,seqlength,...,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore,sresults_ec
0,B3M1H7,False,False,1,-,2010-03-02,2008-09-02,2020-12-02,MESLSQLVKSTLPNYLSNLPIPDSVGGWFKLSFKDWLALIPPTVVV...,134,...,133,23,0,1,133,1,133,4.33e-88,254.0,-
1,Q9UA93,False,False,1,-,2010-03-02,2000-05-01,2020-04-22,VLIIAVLFLAACQLTTAETSSRGKQKHRALRSTDKNSRMTKRCTPA...,74,...,68,24,0,1,68,6,73,6.29e-24,88.6,-
2,Q9TVR4,False,False,1,-,2010-03-02,2000-05-01,2020-04-22,VLIIAVLFLTACQLTTAETSSRGKQKHRALRSTDKNSRMTKRCTPA...,74,...,68,23,0,1,68,6,73,7.67e-25,90.9,-
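
The row-by-row lookup scans `train` once per hit. A sketch of the same lookup with `Series.map` is shown below on toy data; it assumes `id` is unique in the training table, and the toy frames are made up. Because the mapping is applied to the `sseqid` column of the merged table itself, each EC number stays aligned with its own row.

```python
import pandas as pd

# Toy stand-ins for `train` and `data_match`
toy_train = pd.DataFrame({"id": ["T1", "T2", "T3"],
                          "ec_number": ["1.1.1.1", "-", "2.7.11.1"]})
toy_match = pd.DataFrame({"id": ["Q1", "Q2", "Q3"],
                          "ec_number": ["1.1.1.1", "-", "3.4.21.4"],
                          "sseqid": ["T1", "T2", "T3"]})

# Build an id -> EC number lookup once, then map every hit through it
id_to_ec = toy_train.set_index("id")["ec_number"]
toy_match["sresults_ec"] = toy_match["sseqid"].map(id_to_ec)
toy_match["iscorrect"] = toy_match["ec_number"] == toy_match["sresults_ec"]
print(toy_match)
```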


In [19]:
# Compute metrics
data_match['iscorrect'] = data_match[['ec_number', 'sresults_ec']].apply(lambda x: x['ec_number'] == x['sresults_ec'], axis=1)  # check whether the EC numbers agree
correct = sum(data_match['iscorrect'])
find  = len(data_match)
total = len(test)
print('Total query records are: {0}'.format(total))
print('Matched records are: {0}'.format(find))
print('Accuracy: {0}({1}/{2})'.format(correct/total, correct, total))
# Precision here: correct predictions among the queries that have a hit
print('Precision: {0}({1}/{2})'.format(correct/find, correct, find))
# Recall here: fraction of test queries that have a hit
print('Recall: {0}({1}/{2})'.format(find/total, find, total))

Total query records are: 54148
Matched records are: 41128
Accuracy: 0.624325921548349(33806/54148)
Precision: 0.8219704337677495(33806/41128)
Recall: 0.7595479057398242(41128/54148)


In [4]:
sprot.seqlength.describe()

count    564638.000000
mean        360.442643
std         336.460236
min           2.000000
25%         169.000000
50%         294.000000
75%         449.000000
max       35213.000000
Name: seqlength, dtype: float64

### 4.2 Machine learning methods

In [5]:
trainset = train[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)
testset = test[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)

MAX_SEQ_LENGTH = 400  # maximum sequence length: longer sequences are truncated, shorter ones are padded with 'X'
trainset.seq = trainset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
testset.seq = testset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
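
The slice-then-`ljust` idiom used here fixes every sequence to the same length. A tiny standalone illustration (the helper name `pad_or_truncate` is hypothetical):

```python
def pad_or_truncate(seq: str, max_len: int, pad_char: str = "X") -> str:
    """Cut `seq` to `max_len` residues, or right-pad it with `pad_char`."""
    return seq[:max_len].ljust(max_len, pad_char)


short = pad_or_truncate("MKV", 5)        # shorter than 5: padded on the right
long_ = pad_or_truncate("MKVLAAGT", 5)   # longer than 5: truncated
print(short, long_)
```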

In [22]:
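f_train = funclib.dna_onehot(trainset)  # encode the training set
f_test = funclib.dna_onehot(testset)  # encode the test set

train_full = pd.concat([trainset, f_train], axis=1, join='inner' )  # join the features onto the training set
test_full = pd.concat([testset, f_test], axis=1, join='inner' )     # join the features onto the test set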
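
The encoder `dna_onehot` lives in `funclib` and is not shown in this notebook. The following is a hypothetical sketch of a flattened one-hot encoding for fixed-length protein sequences, assuming an alphabet of the 20 standard amino acids plus `X`; the real implementation may use a different alphabet or column layout.

```python
import numpy as np
import pandas as pd

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 standard amino acids + padding/unknown 'X'
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def onehot_sequences(seqs, seq_len):
    """Encode sequences as a 0/1 table of shape (n, seq_len * len(ALPHABET))."""
    n_chars = len(ALPHABET)
    encoded = np.zeros((len(seqs), seq_len * n_chars), dtype=np.uint8)
    for row, seq in enumerate(seqs):
        for pos, ch in enumerate(seq[:seq_len]):
            # Characters outside the alphabet are treated as 'X'
            col = pos * n_chars + CHAR_TO_INDEX.get(ch, CHAR_TO_INDEX["X"])
            encoded[row, col] = 1
    return pd.DataFrame(encoded)


features = onehot_sequences(["MKVXX", "AATAX"], seq_len=5)
print(features.shape)
```

Under this assumed layout each position contributes 21 columns, so the feature width grows linearly with `MAX_SEQ_LENGTH`.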

In [23]:
X_train = train_full.iloc[:,4:]
X_test = test_full.iloc[:,4:]
Y_train = train_full.isemzyme.astype('int')
Y_test = test_full.isemzyme.astype('int')

X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train)
Y_test = np.array(Y_test)

In [14]:
methods=['lr', 'xg', 'dt', 'rf', 'gbdt']
print('baslineName', '\t', 'accuracy','\t', 'precision(PPV) \t NPV \t\t', 'recall','\t', 'f1', '\t\t', 'auroc','\t\t', 'auprc', '\t\t confusion Matrix')
for method in methods:
    funclib.evaluate(method, X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.619377 	0.562195 		0.695902 	0.712160 	0.628354 	0.676836 	0.571420 	 tp: 17423 fp: 13568 fn: 7042 tn: 16115
xg 		0.678363 	0.647462 		0.702780 	0.632536 	0.639912 	0.755800 	0.689642 	 tp: 15475 fp: 8426 fn: 8990 tn: 21257
dt 		0.640725 	0.605286 		0.668512 	0.588759 	0.596909 	0.636148 	0.542188 	 tp: 14404 fp: 9393 fn: 10061 tn: 20290
rf 		0.696905 	0.654373 		0.736433 	0.697650 	0.675319 	0.790402 	0.740078 	 tp: 17068 fp: 9015 fn: 7397 tn: 20668
gbdt 		0.633024 	0.571013 		0.725025 	0.754956 	0.650226 	0.703792 	0.623460 	 tp: 18470 fp: 13876 fn: 5995 tn: 15807


In [24]:
# Length filter: 400
methods=['lr', 'xg', 'dt', 'rf', 'gbdt']
print('baslineName', '\t', 'accuracy','\t', 'precision(PPV) \t NPV \t\t', 'recall','\t', 'f1', '\t\t', 'auroc','\t\t', 'auprc', '\t\t confusion Matrix')
for method in methods:
    funclib.evaluate(method, X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.620171 	0.560333 		0.708661 	0.739873 	0.637707 	0.672616 	0.564063 	 tp: 18101 fp: 14203 fn: 6364 tn: 15480
xg 		0.675962 	0.644537 		0.700857 	0.630574 	0.637479 	0.752133 	0.683886 	 tp: 15427 fp: 8508 fn: 9038 tn: 21175
dt 		0.638251 	0.601897 		0.667042 	0.588759 	0.595256 	0.634066 	0.540324 	 tp: 14404 fp: 9527 fn: 10061 tn: 20156
rf 		0.693987 	0.640435 		0.751795 	0.735827 	0.684825 	0.784831 	0.731240 	 tp: 18002 fp: 10107 fn: 6463 tn: 19576
gbdt 		0.633615 	0.569193 		0.737548 	0.777723 	0.657316 	0.703348 	0.619026 	 tp: 19027 fp: 14401 fn: 5438 tn: 15282


In [None]:
# Keep training sequences of 50 to 500 aa; the test set is unchanged

In [25]:
trainset = train[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)
testset = test[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)

# MAX_SEQ_LENGTH = 400  # maximum sequence length
# trainset.seq = trainset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
# testset.seq = testset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))

In [27]:
trainset = trainset[trainset.seqlength>=50]

In [28]:
MAX_SEQ_LENGTH = 500  # maximum sequence length
trainset.seq = trainset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
testset.seq = testset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [29]:
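f_train = funclib.dna_onehot(trainset)  # encode the training set
f_test = funclib.dna_onehot(testset)  # encode the test set
train_full = pd.concat([trainset, f_train], axis=1, join='inner' )  # join the features onto the training set
test_full = pd.concat([testset, f_test], axis=1, join='inner' )     # join the features onto the test set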

X_train = train_full.iloc[:,4:]
X_test = test_full.iloc[:,4:]
Y_train = train_full.isemzyme.astype('int')
Y_test = test_full.isemzyme.astype('int')

X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train)
Y_test = np.array(Y_test)

In [None]:
methods=['lr', 'xg', 'dt', 'rf', 'gbdt']
print('baslineName', '\t', 'accuracy','\t', 'precision(PPV) \t NPV \t\t', 'recall','\t', 'f1', '\t\t', 'auroc','\t\t', 'auprc', '\t\t confusion Matrix')
for method in methods:
    funclib.evaluate(method, X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.493555 	0.408692 		0.529776 	0.270591 	0.325603 	0.455357 	0.419394 	 tp: 6620 fp: 9578 fn: 17845 tn: 20105
xg 		0.545154 	0.495353 		0.569229 	0.357286 	0.415141 	0.526119 	0.495626 	 tp: 8741 fp: 8905 fn: 15724 tn: 20778
dt 		0.502512 	0.450279 		0.546877 	0.457715 	0.453967 	0.499028 	0.450720 	 tp: 11198 fp: 13671 fn: 13267 tn: 16012
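
One thing worth checking for this run: `trainset` was filtered without resetting its index before `pd.concat(..., axis=1, join='inner')`. Whether that matters depends on the index returned by `dna_onehot`, which is not shown here. The toy example below only illustrates the general pandas behaviour under the assumption that the feature frame carries a fresh 0..n-1 index: an inner join on the index then drops rows and pairs the remaining features with the wrong records.

```python
import pandas as pd

meta = pd.DataFrame({"id": ["a", "b", "c", "d"], "seqlength": [10, 60, 80, 20]})
kept = meta[meta.seqlength >= 50]              # index is now [1, 2]

# Hypothetical encoder output with a fresh 0..n-1 index:
# row 0 belongs to "b", row 1 belongs to "c"
features = pd.DataFrame({"f0": [0.1, 0.2]})

# Index intersection is just [1]: "b" gets the features that belong to "c"
misaligned = pd.concat([kept, features], axis=1, join="inner")

# Resetting the index first keeps both rows and pairs them correctly
aligned = pd.concat([kept.reset_index(drop=True), features], axis=1, join="inner")
print(misaligned)
print(aligned)
```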


In [9]:
trainset = train[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)
testset = test[['id', 'isemzyme','seq', 'seqlength']].reset_index(drop=True)

# trainset = trainset[trainset.seqlength>=50]
# testset = testset[testset.seqlength>=50]

MAX_SEQ_LENGTH = 700  # maximum sequence length
trainset.seq = trainset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
testset.seq = testset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))



In [None]:
f_train = funclib.dna_onehot(trainset)  # encode the training set
f_test = funclib.dna_onehot(testset)  # encode the test set
train_full = pd.concat([trainset, f_train], axis=1, join='inner' )  # join the features onto the training set
test_full = pd.concat([testset, f_test], axis=1, join='inner' )     # join the features onto the test set

X_train = train_full.iloc[:,4:]
X_test = test_full.iloc[:,4:]
Y_train = train_full.isemzyme.astype('int')
Y_test = test_full.isemzyme.astype('int')

methods=['lr', 'xg', 'dt', 'rf', 'gbdt']
print('baslineName', '\t', 'accuracy','\t', 'precision(PPV) \t NPV \t\t', 'recall','\t', 'f1', '\t\t', 'auroc','\t\t', 'auprc', '\t\t confusion Matrix')
for method in methods:
    funclib.evaluate(method, X_train, Y_train, X_test, Y_test)

baslineName 	 accuracy 	 precision(PPV) 	 NPV 		 recall 	 f1 		 auroc 		 auprc 		 confusion Matrix
lr 		0.622017 	0.565314 		0.695735 	0.707214 	0.628352 	0.679701 	0.574374 	 tp: 17302 fp: 13304 fn: 7163 tn: 16379
xg 		0.677532 	0.647106 		0.701404 	0.629675 	0.638271 	0.755010 	0.687809 	 tp: 15405 fp: 8401 fn: 9060 tn: 21282
dt 		0.639174 	0.603897 		0.666656 	0.585285 	0.594445 	0.634438 	0.540833 	 tp: 14319 fp: 9392 fn: 10146 tn: 20291
rf 		0.696997 	0.654997 		0.735775 	0.695933 	0.674845 	0.790091 	0.739525 	 tp: 17026 fp: 8968 fn: 7439 tn: 20715
