# Task 3. Enzyme EC Number Prediction

> author: Shizhenkun   
> email: zhenkun.shi@tib.cas.cn   
> date: 2021-06-03  

## Task Overview
Given an enzyme's protein sequence, predict its reaction class (EC number).


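## Dataset Statistics
- The data are split at the time node identified in Task 1.
- The time node is 14 December 2009: records before it form the training set and records after it form the test set. The resulting statistics are: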


| Split        | Count   | Share of total (204,321)       |
| ------------ | ------- | ------------------------------ |
| Training set | 185,453 | 185,453/204,321 = 90.8%        |
| Test set     | 18,868  | 18,868/204,321 = 9.2%          |
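
A minimal sketch of the time-based split described above, in plain Python. The record layout and the field name `date_integrated` are assumptions made for illustration (the notebook tables carry a `date_integraged` column, but the text does not state which date field drives the split), and placing the cutoff date itself on the test side is also an assumption.

```python
from datetime import date

CUTOFF = date(2009, 12, 14)  # time node stated in the text

def split_by_date(records, cutoff=CUTOFF):
    """Return (train, test): records dated before the cutoff vs. on or after it."""
    train = [r for r in records if r["date_integrated"] < cutoff]
    test = [r for r in records if r["date_integrated"] >= cutoff]
    return train, test

# Hypothetical toy records, not taken from the dataset.
records = [
    {"id": "A", "date_integrated": date(1986, 7, 21)},
    {"id": "B", "date_integrated": date(2009, 11, 24)},
    {"id": "C", "date_integrated": date(2019, 1, 16)},
]
train_toy, test_toy = split_by_date(records)
train_ids = [r["id"] for r in train_toy]
test_ids = [r["id"] for r in test_toy]
print(train_ids, test_ids)
```

Splitting by date rather than at random keeps every test protein newer than every training protein, which mimics annotating sequences that did not exist when the model was built.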


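## Dataset Construction

* Using the protein annotations and the enzyme / non-enzyme datasets defined earlier, the enzyme records are classified.
* An enzyme with exactly one EC number is defined as a "single-function enzyme"; an enzyme with more than one EC number is defined as a "multifunctional enzyme".
* Number of distinct EC numbers: 4,676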
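
The single-function / multifunctional rule above can be sketched as follows. The comma-separated annotation format and the helper names are assumptions for illustration; the source does not show how several EC numbers are stored for one protein.

```python
def count_ec_numbers(ec_field):
    """Count EC numbers in a comma-separated annotation string (assumed format)."""
    return len([ec for ec in ec_field.split(",") if ec.strip()])

def is_multifunctional(ec_field):
    """True when the protein carries more than one EC number."""
    return count_ec_numbers(ec_field) > 1

# Hypothetical annotation strings.
single = "3.2.1.21"
multi = "2.7.11.1, 3.6.1.1"
print(count_ec_numbers(single), is_multifunctional(single))
print(count_ec_numbers(multi), is_multifunctional(multi))
```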


## Experimental Results

### Baselines

|Methods   | Accuracy                        |             Precision           |           Recall               |F1   |
| ---------| ------------------------------- | ------------------------------- |--------------------------------|-----|
| Sequence alignment (homology)  |  0.6345134619461522(11972/18868) | 0.7470360663921128(11972/16026) |0.84937460250159(16026/18868)   |      |
| DeepEC<sup style='color:red'>[DeepEC, 2019]</sup>   |  0.5264468942124232(9933/18868)| 0.7691056910569106(9933/12915) |0.6844922620309519(12915/18868)   |      |
| ECPred<sup style='color:red'>[ECPred, 2018]</sup>   |  0.4903540385838457(9252/18868)|   |   |      |





[1. DeepEC, 2019](https://bitbucket.org/kaistsystemsbiology/deepec/src/master/)

<span style='color:red'>Ryu, Jae Yong, Hyun Uk Kim, and Sang Yup Lee. "Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers." Proceedings of the National Academy of Sciences 116.28 (2019): 13996-14001.</span>

[2. ECPred, 2018](https://ecpred.kansil.org/)

<span style='color:red'>Dalkiran, A., Rifaioglu, A. S., Martin, M. J., Cetin-Atalay, R., Atalay, V., & Doğan, T. (2018). ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC bioinformatics, 19(1), 334. </span>

[3. EzyPred, 2007](http://www.csbio.sjtu.edu.cn/bioinf/EzyPred/#) 

<span style='color:red'>Shen, Hong-Bin, and Kuo-Chen Chou. "EzyPred: a top–down approach for predicting enzyme functional classes and subclasses." Biochemical and biophysical research communications 364.1 (2007): 53-59. </span>

[4. SVM-Prot, 2016](http://bidd.group/cgi-bin/svmprot/) (very slow)

<span style='color:red'>Li, Ying Hong, et al. "SVM-Prot 2016: a web-server for machine learning prediction of protein functional families from sequence irrespective of similarity." PloS one 11.8 (2016): e0155290. </span>




### Machine learning + one-hot


| Baseline | Accuracy | Precision (PPV) | NPV | Recall | F1 | AUROC | AUPRC | Confusion matrix |
| -------- | -------- | --------------- | --- | ------ | -- | ----- | ----- | ---------------- |





### Machine learning + UniRep

| Baseline | Accuracy | Precision (PPV) | NPV | Recall | F1 | AUROC | AUPRC | Confusion matrix |
| -------- | -------- | --------------- | --- | ------ | -- | ----- | ----- | ---------------- |



## 1. Import the required packages

In [1]:
import numpy as np
import pandas as pd
import random
import sys
from tqdm import tqdm
sys.path.append("../../tools/")
import commontools
import funclib

from xgboost.sklearn import XGBClassifier


from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import hamming_loss
from sklearn.metrics import jaccard_score
from sklearn.metrics import hinge_loss

%load_ext autoreload


## 2. Load the data

In [2]:
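# Load the pre-built training and test tables
train = pd.read_hdf('./data/train.h5',key='data')
test = pd.read_hdf('./data/test.h5',key='data')
# Column names: metadata columns from funclib, 1,900 feature columns f1..f1900,
# then the integer EC label and the EC occurrence count
head = funclib.table_head + ['f'+str(i) for i in range(1, 1901) ]
head = head + ['ec_label','ec_appears']
train.columns = head
test.columns = head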

In [173]:
ec_list = list(set(list(train.ec_number) + list(test.ec_number)))
ec_list.sort()
ec_dict = {k:v for k, v in zip(ec_list, range(len(ec_list)))}
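
A small sketch of the label encoding built in the cell above, on hypothetical EC strings. The names `ec_to_index` and `index_to_ec` are illustrative and not part of the notebook. Note that sorting EC strings is lexicographic, not numeric, which is fine for producing stable integer labels.

```python
# Hypothetical EC annotations for a handful of records.
train_ecs = ["3.2.1.21", "2.7.11.1", "3.2.1.21"]
test_ecs = ["2.7.11.1", "5.5.1.26"]

# Sorted union of labels, so the mapping is reproducible between runs.
ec_list = sorted(set(train_ecs) | set(test_ecs))
ec_to_index = {ec: i for i, ec in enumerate(ec_list)}
index_to_ec = {i: ec for ec, i in ec_to_index.items()}

encoded_test = [ec_to_index[ec] for ec in test_ecs]
print(ec_to_index)
print(encoded_test)
```

Keeping the inverse mapping makes it possible to turn predicted integer labels back into EC numbers.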

## 3. Sequence alignment (homology search)

In [17]:
res_data=funclib.getblast(train,test)

Write finished
Write finished
diamond makedb --in /tmp/train.fasta -d /tmp/train.dmnd
diamond blastp -d /tmp/train.dmnd  -q  /tmp/test.fasta -o /tmp/test_fasta_results.tsv -b5 -c1 -k 1
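
The `diamond blastp` call above writes a tab-separated file; by default DIAMOND uses the 12-column BLAST tabular layout. The sketch below parses such a file with the standard library and keeps the highest-scoring hit per query. It is not the implementation of `funclib.getblast`, which is not shown in the notebook; the function name `read_best_hits` and the two sample rows are hypothetical.

```python
import csv
import io

# Column order of the default BLAST tabular format.
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def read_best_hits(handle):
    """Map each query id to its highest-bitscore hit (a dict of column values)."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        hit = dict(zip(BLAST_TAB_COLUMNS, row))
        hit["bitscore"] = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best

# Hypothetical two-line result file; a query with no hit simply has no row.
sample = (
    "Q1\tT7\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-120\t380.0\n"
    "Q2\tT3\t41.0\t150\t80\t4\t5\t150\t3\t149\t2e-30\t110.5\n"
)
best_hits = read_best_hits(io.StringIO(sample))
print({q: h["sseqid"] for q, h in best_hits.items()})
```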


In [32]:
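# Extract the test-set ids and their true EC numbers
test_results_df=test[['id', 'ec_number']]

# Attach an EC number to each alignment result: the EC of the matched training sequence
id_ec_dict = dict(zip(train.id, train.ec_number))
res_data['ec_pred'] = res_data.sseqid.apply(lambda x : id_ec_dict.get(x))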


In [38]:
pred_ec_dict = {v: k for v,k in zip( res_data.id, res_data.ec_pred)} 
with pd.option_context('mode.chained_assignment', None):
    test_results_df['ec_pred'] = test_results_df.id.apply(lambda x: pred_ec_dict.get(x))
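
The two cells above transfer the EC number of the best training hit to each test query. A toy version with plain dictionaries, using hypothetical ids, shows what happens to a query that has no hit: both lookups fall through to `None`, so the query ends up without a prediction.

```python
# Hypothetical lookups: training id -> EC number, and query id -> best training hit.
train_ec = {"T7": "3.2.1.21", "T3": "6.4.1.2"}
best_hit = {"Q1": "T7", "Q2": "T3"}
test_ids = ["Q1", "Q2", "Q3"]  # Q3 has no alignment hit

# dict.get returns None for a missing key, so a query without a hit gets no EC.
predicted = {q: train_ec.get(best_hit.get(q)) for q in test_ids}
unpredicted = [q for q, ec in predicted.items() if ec is None]
print(predicted)
print(unpredicted)
```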

In [41]:
# Compute metrics
tp = len(test_results_df[test_results_df.ec_number == test_results_df.ec_pred])
# fp = len(test_results_df[(test_results_df.isemzyme ==False) & (test_results_df.pred)])
# tn = len(test_results_df[(test_results_df.isemzyme ==False) & (test_results_df.pred ==False)])
# fn = len(test_results_df[(test_results_df.isemzyme ) & (test_results_df.pred == False)])
# print('baslineName', '\t', 'accuracy','\t', 'precision(PPV) \t NPV \t\t', 'recall','\t', 'f1', '\t\t', 'auroc','\t\t', 'auprc', '\t\t confusion Matrix')
# funclib.caculateMetrix_1('Sequence alignment\t',tp, fp, tn,fn)

In [43]:
tp/len(test)

0.6345134619461522

In [125]:
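# Encode EC numbers as integer labels. Queries without an alignment hit have
# ec_pred = None; they are mapped to -1 so that they count as wrong predictions.
y_true = np.array(test_results_df['ec_number'].apply(lambda x: ec_dict.get(x)))
y_pred = np.array(test_results_df['ec_pred'].apply(lambda x: ec_dict.get(x, -1))).astype('int')

kappa = cohen_kappa_score(y_true, y_pred)
ham_distance = hamming_loss(y_true, y_pred)
jaccard_per_class = jaccard_score(y_true, y_pred, average=None)
# Note: the summed per-class Jaccard scores are divided by the number of test
# records, not by the number of classes, so this value is not the macro average.
print('kappa: \t\t hamming_loss \t jaccard_score')
print('{0:.6f} \t {1:.6f} \t {2:.6f}'.format(kappa, ham_distance, sum(jaccard_per_class)/len(test)))

kappa: 		 hamming_loss 	 jaccard_score
0.632965 	 0.365487 	 0.055759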
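
To make the three scores above concrete, here is a plain-Python sketch of what they measure for single-label multiclass predictions. It is a re-derivation on toy labels, not the scikit-learn implementation. For single-label data the Hamming loss equals 1 minus accuracy, which agrees with the recorded values (0.365487 versus an accuracy of 0.634513).

```python
from collections import Counter

def hamming_loss_single_label(y_true, y_pred):
    """Fraction of records whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

def macro_jaccard(y_true, y_pred):
    """Mean over classes of |true AND pred| / |true OR pred|."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        true_idx = {i for i, t in enumerate(y_true) if t == c}
        pred_idx = {i for i, p in enumerate(y_pred) if p == c}
        scores.append(len(true_idx & pred_idx) / len(true_idx | pred_idx))
    return sum(scores) / len(scores)

# Toy labels: one of six records is misclassified.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]
print(hamming_loss_single_label(y_true_toy, y_pred_toy))
print(cohen_kappa(y_true_toy, y_pred_toy))
print(macro_jaccard(y_true_toy, y_pred_toy))
```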


In [186]:
y_pred

array([None, None, None, ..., None, None, None], dtype=object)

In [51]:
# Match the query results: build an id -> EC lookup over both sets
id_map_ec = pd.concat([train[['id', 'ec_number']], test[['id', 'ec_number']]], ignore_index=True)
id_ec_dict = dict(zip(id_map_ec.id, id_map_ec.ec_number))
res_data['is_ec_match']=res_data.apply(lambda x: (id_ec_dict.get(x['id'])== id_ec_dict.get(x['sseqid'])), axis=1)

# Print the alignment results
funclib.evaluateBlast(res_data,train,test, 'ec_number')

Total query records are: 18868
Matched records are: 16026
Accuracy: 0.6345134619461522(11972/18868)
Pricision: 0.7470360663921128(11972/16026)
Recall: 0.84937460250159(16026/18868)
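
The figures printed above are consistent with the definitions sketched below, where "recall" is the share of queries that received any prediction (coverage) rather than the usual per-class recall. The function name `coverage_metrics` is hypothetical; the counts are the ones reported in the output above.

```python
def coverage_metrics(correct, found, total):
    """Accuracy, precision and coverage-style recall from three counts.

    correct: queries whose predicted EC equals the true EC
    found:   queries that received any prediction
    total:   all test queries
    """
    return {
        "accuracy": correct / total,
        "precision": correct / found,
        "recall": found / total,
    }

m = coverage_metrics(correct=11972, found=16026, total=18868)
print(m)
```

Under these definitions accuracy equals precision times recall, so a method that abstains on hard queries gains precision but loses the same factor in coverage.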


## 4. DeepEC

In [126]:
funclib.table2fasta(test, './data/deepec.fasta')
! python ../../baselines/deepec/deepec.py -i ./data/deepec.fasta -o ./data/deepec/

Write finished
2021-06-21 13:00:14.415741: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-21 13:08:46.504939: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-06-21 13:08:46.595470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:81:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.8GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-06-21 13:08:46.596240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: 
pciBusID: 0000:c1:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.8GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-06-21 13:08:46.596975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 2 with properties: 
pciBusID: 0000:c2:00

In [20]:
# Read the DeepEC predictions
deepec_results = pd.read_csv('./data/deepec/DeepEC_Result.txt', sep='\t',names=['id', 'ec_number'], header=0)
# Strip the 'EC:' prefix from each predicted EC number
deepec_results.ec_number=deepec_results.apply(lambda x: x['ec_number'].replace('EC:',''), axis=1)

In [21]:
# Compute metrics
# id -> true EC lookup (kept separate from ec_dict, the EC -> index mapping)
id_to_ec = dict(zip(test.id, test.ec_number))
deepec_results['ec_groundtruth'] = deepec_results.apply(lambda x: id_to_ec.get(x['id']), axis=1)
deepec_results['iscorrect'] = deepec_results.apply(lambda x: x.ec_number == x.ec_groundtruth, axis=1)

correct = sum(deepec_results['iscorrect'])
find  = len(deepec_results)  # queries that received a prediction
total = len(test)
print('Total query records are: {0}'.format(total))
print('Matched records are: {0}'.format(find))
print('Accuracy: {0}({1}/{2})'.format(correct/total, correct, total))
print('Precision: {0}({1}/{2})'.format(correct/find, correct, find))
# "Recall" here is the share of queries that received any prediction (coverage)
print('Recall: {0}({1}/{2})'.format(find/total, find, total))

Total query records are: 18868
Matched records are: 12915
Accuracy: 0.5264468942124232(9933/18868)
Precision: 0.7691056910569106(9933/12915)
Recall: 0.6844922620309519(12915/18868)


In [22]:
deepec_results

Unnamed: 0,id,ec_number,ec_groundtruth,iscorrect
0,B0VRF9,6.4.1.2,2.1.3.15,False
1,Q339X2,3.2.1.21,3.2.1.21,True
2,Q0DIT2,3.2.1.21,3.2.1.21,True
3,A9WBQ9,6.4.1.2,2.1.3.15,False
4,Q2G8S9,6.4.1.2,2.1.3.15,False
...,...,...,...,...
12910,G3E4M4,4.2.1.133,5.5.1.13,False
12911,A0A1W6QDJ1,4.2.1.133,5.5.1.13,False
12912,Q1LW01,3.4.24.67,3.4.24.67,True
12913,Q9RN59,5.5.1.23,5.5.1.26,False


In [23]:
mmc= test.merge(deepec_results, how='left', on='id')

In [25]:
mmc[mmc['date_integraged']>='2019-01-01']

Unnamed: 0,id,name,isemzyme,isMultiFunctional,functionCounts,ec_number_x,ec_specific_level,date_integraged,date_sequence_update,date_annotation_update,...,f1896,f1897,f1898,f1899,f1900,ec_label,ec_appears,ec_number_y,ec_groundtruth,iscorrect
16955,A0A2U9GGW3,THS2_PAPSO,True,False,1,4.2.99.24,4,2019-01-16,2018-09-12,2021-04-07,...,0.106471,0.015043,-0.021910,-0.090344,0.070594,4503,2,,,
16956,A0A2U9GHG9,THS1_PAPSO,True,False,1,4.2.99.24,4,2019-01-16,2018-09-12,2020-12-02,...,0.208322,0.000840,-0.009849,-0.142026,0.055294,4503,2,,,
16957,A0A142C7A4,PHNF_PENHR,True,False,1,2.5.1.-,3,2019-01-16,2016-06-08,2020-04-22,...,0.212885,-0.011224,0.079702,-0.293908,-0.006126,2032,452,,,
16958,A0A0H3MBJ2,PKN1_CHLT2,True,False,1,2.7.11.1,4,2019-01-16,2015-09-16,2021-02-10,...,-0.076654,-0.006188,0.299633,-0.033017,0.009245,1577,2358,2.7.11.1,2.7.11.1,True
16959,B9JN19,DERI_AGRRK,True,False,1,5.3.1.34,4,2019-01-16,2009-03-24,2020-12-02,...,0.000623,0.110547,0.024155,-0.100186,0.379172,1625,5,5.3.1.6,5.3.1.34,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19032,Q9RN59,SNOAL_STRNO,True,False,1,5.5.1.26,4,2021-04-07,2000-05-01,2021-04-07,...,-0.191364,-0.001056,-0.002186,0.123507,-0.126100,2343,1,5.5.1.23,5.5.1.26,False
19033,G3F5K2,BKT_PROBT,True,False,1,1.14.99.63,4,2021-04-07,2011-11-16,2021-04-07,...,-0.077786,-0.067015,0.046054,-0.030360,0.019115,3711,5,,,
19034,P93789,SGT1_SOLTU,True,False,1,2.4.1.-,3,2021-04-07,2006-01-10,2021-04-07,...,0.255037,-0.003941,0.037680,0.005358,-0.186378,321,811,,,
19035,A0A1Z3GBS4,CYPH3_ISORU,True,False,1,1.14.14.175,4,2021-04-07,2017-09-27,2021-04-07,...,0.103256,-0.021321,0.142498,-0.021635,-0.000273,2268,2,1.14.13.190,1.14.14.175,False


In [26]:
mmc[(mmc['date_integraged']>='2019-01-01')&(mmc.iscorrect==True)]

Unnamed: 0,id,name,isemzyme,isMultiFunctional,functionCounts,ec_number_x,ec_specific_level,date_integraged,date_sequence_update,date_annotation_update,...,f1896,f1897,f1898,f1899,f1900,ec_label,ec_appears,ec_number_y,ec_groundtruth,iscorrect
16958,A0A0H3MBJ2,PKN1_CHLT2,True,False,1,2.7.11.1,4,2019-01-16,2015-09-16,2021-02-10,...,-0.076654,-0.006188,0.299633,-0.033017,0.009245,1577,2358,2.7.11.1,2.7.11.1,True
16960,P0DPS8,PKND_CHLTR,True,False,1,2.7.11.1,4,2019-01-16,2019-01-16,2021-04-07,...,-0.016689,-0.180407,0.152834,-0.020861,-0.005935,1577,2358,2.7.11.1,2.7.11.1,True
16976,D4GT97,IPYR_HALVD,True,False,1,3.6.1.1,4,2019-01-16,2010-05-18,2021-04-07,...,-0.086756,-0.009807,0.004317,0.055228,0.011670,4500,216,3.6.1.1,3.6.1.1,True
16978,A1B9Z3,HTPA_PARDP,True,False,1,2.6.1.77,4,2019-01-16,2007-01-23,2021-04-07,...,0.294928,-0.004621,0.047856,-0.013025,0.193465,3126,3,2.6.1.77,2.6.1.77,True
16981,A1B4L2,ALDH_PARDP,True,False,1,1.2.1.3,4,2019-01-16,2007-01-23,2020-12-02,...,-0.442951,0.004721,0.015430,-0.025434,-0.017034,1092,83,1.2.1.3,1.2.1.3,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18974,A0A0S2IHL6,BASM1_KALSE,True,False,1,5.4.99.39,4,2021-04-07,2016-02-17,2021-04-07,...,0.169727,-0.127872,0.029169,-0.039093,-0.007797,3619,10,5.4.99.39,5.4.99.39,True
18980,Q0KHV6,PINK1_DROME,True,False,1,2.7.11.1,4,2021-04-07,2006-10-03,2021-04-07,...,0.015917,0.004286,0.010538,-0.083933,-0.022021,1577,2358,2.7.11.1,2.7.11.1,True
18998,P96890,ACCA3_MYCTU,True,False,1,6.3.4.14,4,2021-04-07,1997-07-01,2021-04-07,...,0.055945,0.006023,0.648527,0.101467,-0.012606,2880,3,6.3.4.14,6.3.4.14,True
19020,Q06GJ0,HEXC_OSTFU,True,False,1,3.2.1.52,4,2021-04-07,2006-10-31,2021-04-07,...,-0.001001,-0.014013,-0.191038,-0.024860,0.031504,3863,130,3.2.1.52,3.2.1.52,True


## 5. ECPred

In [4]:
# Read the ECPred results
ecpred_results = pd.read_csv('./data/ecpred_results.tsv', sep='\t', header=0)
# id -> true EC lookup (kept separate from ec_dict, the EC -> index mapping)
id_to_ec = dict(zip(test.id, test.ec_number))
ecpred_results['ec_groundtruth'] = ecpred_results.apply(lambda x: id_to_ec.get(x['Protein ID']), axis=1)
ecpred_results['iscorrect'] = ecpred_results.apply(lambda x: x['EC Number'] == x.ec_groundtruth, axis=1)

In [5]:
ecpred_results = ecpred_results.drop_duplicates(subset=['Protein ID'], keep='first')

In [27]:
ecpred_results

Unnamed: 0,Protein ID,EC Number,Confidence Score(max 1.0),ec_groundtruth,iscorrect
0,B0VRF9,6.4.1.2,1.00,2.1.3.15,False
1,Q339X2,3.2.1.21,1.00,3.2.1.21,True
2,A0LEQ5,1.6.5.11,1.00,7.1.1.-,False
3,Q0DIT2,3.2.1.21,1.00,3.2.1.21,True
4,Q28FQ5,3.1.4.-,0.83,3.1.4.-,True
...,...,...,...,...,...
18862,E5ASS2,2.7.7.68,0.79,2.7.7.106,False
18863,Q9RN59,4.-.-.-,0.55,5.5.1.26,False
18864,G3F5K2,1.14.19.-,0.66,1.14.99.63,False
18866,A0A1Z3GBS4,1.14.13.-,0.82,1.14.14.175,False


In [59]:
# Compute metrics
correct = sum(ecpred_results['iscorrect'])
find  = len(ecpred_results)  # queries that received a prediction
total = len(test)
print('Total query records are: {0}'.format(total))
print('Matched records are: {0}'.format(find))
print('Accuracy: {0}({1}/{2})'.format(correct/total, correct, total))
print('Precision: {0}({1}/{2})'.format(correct/find, correct, find))
# "Recall" here is the share of queries that received any prediction (coverage)
print('Recall: {0}({1}/{2})'.format(find/total, find, total))

Total query records are: 18868
Matched records are: 18868
Accuracy: 0.4903540385838457(9252/18868)
Precision: 0.4903540385838457(9252/18868)
Recall: 1.0(18868/18868)
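
The accuracy above counts only exact string matches, yet the ECPred table contains partial predictions such as `1.14.13.-` against a true `1.14.14.175`. As a complementary view, the sketch below counts how many leading EC levels agree. The function `matching_levels` and the choice to stop at the first `-` are assumptions for illustration, not part of the notebook's evaluation; the example pairs are taken from the table above.

```python
def matching_levels(pred, truth):
    """Number of leading EC levels (0 to 4) on which pred and truth agree.

    Comparison stops at the first level that differs or is unspecified ('-').
    """
    depth = 0
    for p, t in zip(pred.split("."), truth.split(".")):
        if p == "-" or t == "-" or p != t:
            break
        depth += 1
    return depth

# (prediction, ground truth) pairs from the ECPred result table.
pairs = [
    ("3.2.1.21", "3.2.1.21"),
    ("1.14.13.-", "1.14.14.175"),
    ("4.-.-.-", "5.5.1.26"),
    ("3.1.4.-", "3.1.4.-"),
]
levels = [matching_levels(p, t) for p, t in pairs]
print(levels)
```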


## 6. EC number prediction with machine learning
### 6.1 UniRep + machine learning

In [145]:
train[train.ec_appears<10]

Unnamed: 0,id,name,isemzyme,isMultiFunctional,functionCounts,ec_number,ec_specific_level,date_integraged,date_sequence_update,date_annotation_update,...,f1893,f1894,f1895,f1896,f1897,f1898,f1899,f1900,ec_label,ec_appears
468,P00736,C1R_HUMAN,True,False,1,3.4.21.41,4,1986-07-21,2008-12-16,2021-04-07,...,0.075576,0.066957,-0.170500,0.038070,-0.023567,-0.008000,0.399710,-0.391056,4391,6
435,P00778,PRLA_LYSEN,True,False,1,3.4.21.12,4,1986-07-21,1996-02-01,2021-04-07,...,-0.001054,0.226078,-0.234684,0.113336,0.001585,-0.132556,-0.143564,0.046508,1791,2
437,P00801,PRLB_LYSEN,True,False,1,3.4.24.32,4,1986-07-21,1986-07-21,2021-04-07,...,0.107413,0.041127,-0.118509,0.103417,-0.013218,0.414396,0.006479,0.182354,3705,2
505,P00798,PEPA1_PENJA,True,False,1,3.4.23.20,4,1986-07-21,1998-07-15,2021-04-07,...,0.087106,0.006279,0.003213,0.012956,-0.243595,0.035452,0.014719,0.101651,4766,6
479,P00748,FA12_HUMAN,True,False,1,3.4.21.38,4,1986-07-21,2011-01-11,2021-04-07,...,0.020127,0.089754,-0.138130,0.231970,-0.087967,-0.041153,0.738969,-0.031532,4324,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
185339,Q5VQG4,RFS_ORYSJ,True,False,1,2.4.1.82,4,2009-11-24,2004-12-07,2020-10-07,...,-0.384832,0.178423,-0.083334,-0.073801,0.003934,0.027571,0.060408,0.005245,4227,7
185340,Q9FND9,RFS5_ARATH,True,False,1,2.4.1.82,4,2009-11-24,2001-03-01,2021-04-07,...,-0.356027,0.256136,-0.134222,-0.121302,0.015219,0.057752,0.033300,0.002145,4227,7
185341,Q9SYJ4,RFS4_ARATH,True,False,1,2.4.1.82,4,2009-11-24,2011-05-03,2020-12-02,...,-0.066600,0.022275,-0.059235,0.026875,-0.021814,0.363252,-0.192488,0.012332,4227,7
185348,Q03C44,PAGL1_LACP3,True,False,1,3.2.1.122,4,2009-11-24,2006-11-14,2021-04-07,...,-0.258700,0.126944,-0.105607,0.047332,0.005912,0.006230,-0.070941,-0.060853,2305,6


In [139]:
print(len(set(train.ec_number)))
print(len(set(test.ec_number)))

3364
3041
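
The two counts above (3,364 distinct EC numbers in the training set, 3,041 in the test set) do not show how many test labels are absent from training, which matters because a classifier trained on the training labels cannot output a class it never saw. A minimal sketch of that check on hypothetical labels; `unseen_label_share` is an illustrative helper, not a notebook function.

```python
def unseen_label_share(train_labels, test_labels):
    """Fraction of test records whose label never occurs in the training set."""
    seen = set(train_labels)
    unseen = [label for label in test_labels if label not in seen]
    return len(unseen) / len(test_labels)

# Hypothetical label lists.
train_label_toy = ["2.7.11.1", "3.2.1.21", "6.4.1.2", "2.7.11.1"]
test_label_toy = ["2.7.11.1", "5.5.1.26", "4.2.99.24", "3.2.1.21"]
share = unseen_label_share(train_label_toy, test_label_toy)
print(share)
```

On the real tables the same idea would be applied to `train.ec_number` and `test.ec_number`; the resulting share is an upper bound on the error that no training-set-only model can avoid.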


In [179]:
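X_train = train.iloc[:,12:1912]
# Test features must come from `test`; slicing `train` here would yield
# 185,453 predictions for 18,868 test labels.
X_test = test.iloc[:,12:1912]
Y_train = train.ec_number.apply(lambda x: ec_dict.get(x))
Y_test = test.ec_number.apply(lambda x: ec_dict.get(x))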

In [None]:
# NOTE: num_class=5 is far below the number of distinct EC labels; for
# multi:softmax it should equal the number of classes in Y_train.
xgboost_clf = XGBClassifier(min_child_weight=6,max_depth=15, objective='multi:softmax',num_class=5, n_jobs=-2)
print("-" * 60)
print("XGBoost model:", xgboost_clf)
clf = xgboost_clf.fit(X_train, Y_train)
res = clf.predict(X_test)
# Collect ground truth and predictions, then count exact matches
aa = pd.DataFrame()
aa['ground_truth'] = Y_test
aa['pred'] = res
aa['iscorrect']= aa.apply(lambda x: x.ground_truth == x.pred, axis=1)
aa.iscorrect.sum()

------------------------------------------------------------
XGBoost model: XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=15,
              min_child_weight=6, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=-2, num_class=5, num_parallel_tree=None,
              objective='multi:softmax', random_state=None, reg_alpha=None,
              reg_lambda=None, scale_pos_weight=None, subsample=None,
              tree_method=None, validate_parameters=None, verbosity=None)






In [184]:
aa['pred'] = res

ValueError: Length of values (185453) does not match length of index (18868)

In [16]:
train_sub= train[train.ec_appears>=100]
test_sub= test[test.ec_appears>=100]

In [19]:
trainset = train_sub[['id', 'ec_label','seq', 'seqlength']].reset_index(drop=True)
testset = test_sub[['id', 'ec_label','seq', 'seqlength']].reset_index(drop=True)

MAX_SEQ_LENGTH = 1000  # maximum sequence length; longer sequences are truncated, shorter ones padded with 'X'
trainset.seq = trainset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))
testset.seq = testset.seq.map(lambda x : x[0:MAX_SEQ_LENGTH].ljust(MAX_SEQ_LENGTH, 'X'))

f_train = funclib.dna_onehot(trainset)  # one-hot encode the training set
f_test = funclib.dna_onehot(testset)    # one-hot encode the test set

train_full = pd.concat([trainset, f_train], axis=1, join='inner' )  # join metadata and features (training set)
test_full = pd.concat([testset, f_test], axis=1, join='inner' )     # join metadata and features (test set)

X_train = np.array(train_full.iloc[:,4:])
X_test = np.array(test_full.iloc[:,4:])
Y_train = np.array(train_full.ec_label.astype('int'))
Y_test = np.array(test_full.ec_label.astype('int'))
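
The cell above pads or truncates every sequence to a fixed length and then one-hot encodes it. The sketch below shows one way such an encoding can work in plain Python. It is not the implementation of `funclib.dna_onehot`, which is not shown in the notebook; the 21-letter alphabet (20 standard amino acids plus `X` for padding and unknown residues) and the flattened output layout are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = AMINO_ACIDS + "X"  # 'X' doubles as padding and unknown residue
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def pad_or_truncate(seq, max_len, pad="X"):
    """Cut the sequence at max_len, or right-pad it with the pad symbol."""
    return seq[:max_len].ljust(max_len, pad)

def one_hot(seq, max_len):
    """Flattened one-hot vector of length max_len * len(ALPHABET)."""
    flat = []
    for aa in pad_or_truncate(seq, max_len):
        row = [0] * len(ALPHABET)
        row[INDEX.get(aa, INDEX["X"])] = 1  # unknown letters fall back to 'X'
        flat.extend(row)
    return flat

vec = one_hot("MKV", max_len=5)
print(len(vec), sum(vec))
```

With a maximum length of 1000 this layout gives 21,000 mostly zero features per sequence, so a sparse representation may be worth considering before fitting a tree model.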

In [20]:
# NOTE: num_class=5 is far below the number of distinct EC labels; for
# multi:softmax it should equal the number of classes in Y_train.
xgboost_clf = XGBClassifier(min_child_weight=6,max_depth=15, objective='multi:softmax',num_class=5, n_jobs=-2)
print("-" * 60)
print("XGBoost model:", xgboost_clf)
clf = xgboost_clf.fit(X_train, Y_train)
res = clf.predict(X_test)
# Collect ground truth and predictions, then count exact matches
aa = pd.DataFrame()
aa['ground_truth'] = Y_test
aa['pred'] = res
aa['iscorrect']= aa.apply(lambda x: x.ground_truth == x.pred, axis=1)
aa.iscorrect.sum()

------------------------------------------------------------
XGBoost model: XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=15,
              min_child_weight=6, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=-2, num_class=5, num_parallel_tree=None,
              objective='multi:softmax', random_state=None, reg_alpha=None,
              reg_lambda=None, scale_pos_weight=None, subsample=None,
              tree_method=None, validate_parameters=None, verbosity=None)






1604

In [182]:
aa

Unnamed: 0,ground_truth
185453,1630
185837,3099
185838,4684
185839,3099
185840,2968
...,...
204222,4488
204223,824
204224,1889
204226,645


In [22]:
1604/len(test_sub)

0.16916262391900444

In [23]:
len(test_sub)

9482

In [30]:
traincc = train.iloc[:, 12:-2]

In [31]:
traincc.to_csv('./data/sprot_trn_ft_mat_dense.txt', header=0, index=0, sep=' ')

In [37]:
traincc_label = train.iloc[:, -2:-1]

In [49]:
traincc_label.to_csv('../../baselines/slice/Sandbox/Data/sprot/sprot_trn_lbl_mat.txt', header=0, index=0, sep=' ')

In [6]:
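from sklearn import tree

# Fit a decision tree (entropy criterion) and predict on the test features
dtc = tree.DecisionTreeClassifier(criterion="entropy")
clf = dtc.fit(X_train, Y_train)
res = clf.predict(X_test)
# print(y_test)
# dot_data = tree.export_graphviz(clf, out_file=None)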

In [86]:
xmlcnn_trn_lbl_mat

In [48]:
test_label = test.iloc[:, -2:-1]
test_label.to_csv('../../baselines/slice/Sandbox/Data/sprot/sprot_tst_lbl_mat.txt', header=0, index=0, sep=' ')

In [47]:
testcc = test.iloc[:, 12:-2]
testcc.to_csv('../../baselines/slice/Sandbox/Data/sprot/sprot_tst_ft_mat_dense.txt', header=0, index=0, sep=' ')

In [7]:
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Sanity check of the decision-tree workflow on the iris dataset
iris = load_iris()
x = iris['data']
y = iris['target']
dtc = tree.DecisionTreeClassifier(criterion="entropy")
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
clf = dtc.fit(x_train, y_train)
print(clf.predict(x_test))
print(y_test)
dot_data = tree.export_graphviz(clf, out_file=None)

In [21]:
160/18000

0.008888888888888889