# Prepare Datasets for Benchmarking Tasks

> author: Shizhenkun   
> email: zhenkun.shi@tib.cas.cn   
> date: 2022-10-05  


## Dataset 1. Enzyme Non-enzyme Dataset
The enzyme dataset consists of two parts: <u>a training set</u> and <u>a testing set</u>.   
The training set is taken from the Feb-2018 snapshot and ***excludes*** items that were <u>deleted</u> or whose <u>sequences changed</u> in the Feb-2022 snapshot.    
The training set consists of ***469,134*** records, of which ***222,567*** are enzymes and ***246,567*** are non-enzymes.   
The testing set is taken from the Feb-2022 snapshot and excludes items that already appeared in the Feb-2018 snapshot.   
The testing set consists of ***10,614*** records, of which ***5,111*** are enzymes and ***5,503*** are non-enzymes.   
Unlike previous works, we did not filter sequences by length or homology, which makes the data more inclusive. Each sequence is labeled 1 for enzyme and 0 for non-enzyme.   
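
The split rules above can be sketched with plain dictionaries. This is only a minimal illustration, assuming each snapshot is a mapping from accession ID to sequence and that "appeared in the earlier snapshot" is decided by ID; the function name `split_by_snapshot` and the toy IDs and sequences are hypothetical and not part of the notebook's code.

```python
def split_by_snapshot(old_snapshot, new_snapshot):
    """Split two snapshots (dicts of id -> sequence) into train and test sets.

    Train: entries of the old snapshot that still exist, unchanged, in the new one.
    Test: entries of the new snapshot whose ID did not appear in the old one.
    """
    train = {
        acc: seq
        for acc, seq in old_snapshot.items()
        if acc in new_snapshot and new_snapshot[acc] == seq
    }
    test = {
        acc: seq
        for acc, seq in new_snapshot.items()
        if acc not in old_snapshot
    }
    return train, test


# Toy data (hypothetical IDs and sequences)
snapshot_2018 = {"A1": "MKT", "A2": "MLV", "A3": "MGG"}
snapshot_2022 = {"A1": "MKT", "A3": "MGGA", "A4": "MPP"}  # A2 deleted, A3 changed, A4 new

train, test = split_by_snapshot(snapshot_2018, snapshot_2022)
print(train)  # {'A1': 'MKT'}
print(test)   # {'A4': 'MPP'}
```

In this toy case `A2` (deleted) and `A3` (sequence changed) are dropped from the training set, and only the newly added `A4` ends up in the testing set.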

## Dataset 2. Enzyme Function Quantity Dataset
The enzyme function quantity dataset contains only enzyme data, ***222,567*** records in total. The function quantity (the number of EC numbers assigned to an enzyme) ranges from 1 to 8.

## Dataset 3. EC Dataset

The EC dataset consists of 227,678 enzyme records, of which 222,567 form the training set and the remaining 5,111 form the testing set, covering 6,031 EC numbers. As of Feb 2022, ***compared with [ExplorEnz](https://www.enzyme-database.org/stats.php), where CURRENT EC = 6674***, there are still 643 EC numbers that the model cannot handle in the benchmark. Thus, we exclude the sequences with these 267 EC numbers from the evaluation process. However, this problem can be resolved in the production scenario because we use the entire data from Swiss-Prot. The EC coverage is currently 6,031 and can be extended automatically, because the model is retrained with each Swiss-Prot release, which is published every 8 weeks. 
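
The coverage gap quoted above is, in essence, a set difference between the EC numbers listed by the reference database and those seen in the dataset. A small sketch with made-up EC lists follows; the function name `uncovered_ecs` and the toy lists are illustrative only, and the final subtraction equals the gap only under the assumption that every EC number in the dataset is also in the reference list.

```python
def uncovered_ecs(reference_ecs, dataset_ecs):
    """Return EC numbers present in the reference list but absent from the dataset."""
    return sorted(set(reference_ecs) - set(dataset_ecs))


# Toy example with hypothetical EC lists
reference = ["1.1.1.1", "2.3.1.286", "3.5.1.68", "4.2.1.1"]
covered = ["1.1.1.1", "3.5.1.68"]
missing = uncovered_ecs(reference, covered)
print(missing)  # ['2.3.1.286', '4.2.1.1']

# The same arithmetic with the counts quoted in the text
current_ec_total = 6674  # ExplorEnz count quoted above
dataset_ec_total = 6031  # EC numbers covered by the dataset
print(current_ec_total - dataset_ec_total)  # 643
```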

## 1. Import packages

In [1]:
import numpy as np
import pandas as pd
import sys,os
from tqdm import tqdm
import config as cfg
from functools import reduce
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
from tools import filetool as ftool
from tools import exact_ec_from_uniprot as exactec
from tools import funclib
from tools import minitools as mtool
from tools import embdding_onehot as onehotebd
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=False)

from sklearn.preprocessing import MultiLabelBinarizer
from keras.models import Model
from keras.optimizers import Adam
from keras.layers import Input, Dense
from keras.layers import GRU, Bidirectional
from tools import Attention

%load_ext autoreload
%autoreload 2



INFO: Pandarallel will run on 52 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## 2. Download raw data from UniProt

> If this is the first run, please uncomment the cell below

In [2]:
# load data20221118
ftool.wget(download_url=cfg.URL_DATA_20221118, save_file=cfg.FILE_DATA_20221118)

wget -q https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz -O /home/dengrui/DMLF/data/uniprot/uniprot_sprot_leatest_20221118.dat.gz


## 3. Extract records from rawdata

In [3]:
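cmd_array = [
    # latest data
    # NOTE: the download is a gzip-compressed flat file, not a tar archive, so this
    # tar call fails (return value 512 in the output below); the renamed .gz file
    # is passed directly to the extraction step in the next cell
    f'tar -zxvf {cfg.FILE_DATA_20221118} -C {cfg.DIR_UNIPROT}',
    f'mv {cfg.DIR_UNIPROT}uniprot_sprot_leatest_20221118.dat.gz {cfg.DIR_UNIPROT}uniprot_sprot_leatest_20221118.data.gz', 
    f'rm -f {cfg.DIR_UNIPROT}uniprot_sprot.fasta.gz {cfg.DIR_UNIPROT}uniprot_sprot_varsplic.fasta.gz {cfg.DIR_UNIPROT}uniprot_sprot.xml.gz'
]

# run each shell command; the list holds the return value of every call
[os.system(item) for item in cmd_array]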

tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors


[512, 0, 0]
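
The `tar` error above suggests that the downloaded `.dat.gz` file is a plain gzip-compressed flat file rather than a tar archive. Such a file can be inspected with Python's standard `gzip` module without unpacking it first. The sketch below is only a demonstration: the temporary file and its content are made up, the helper `count_entries` is hypothetical, and counting `ID` lines is an illustrative sanity check, not the notebook's extraction logic.

```python
import gzip
import os
import tempfile


def count_entries(gz_path):
    """Count lines starting with 'ID ' in a gzip-compressed flat file."""
    count = 0
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("ID "):
                count += 1
    return count


# Build a tiny stand-in file (hypothetical content) and read it back
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.dat.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("ID   FIB35_BPT4\nAC   P03742;\n//\nID   RNBR_BACAM\nAC   P00648;\n//\n")

n_entries = count_entries(demo_path)
print(n_entries)  # 2
```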

In [4]:
# latest data
exactec.run_exact_task(infile=f'{cfg.DIR_UNIPROT}uniprot_sprot_leatest_20221118.data.gz', outfile=f'{cfg.DIR_UNIPROT}uniprot_sprot_leatest_20221118.tsv')


finished use time 313.882 s


## 4. Load records & Drop Duplicates

In [5]:
# Load the data and convert the date columns
uniprot_sprot_leatest_20221118 = pd.read_csv(f'{cfg.DIR_UNIPROT}uniprot_sprot_leatest_20221118.tsv', sep='\t',header=0) # read the file
uniprot_sprot_leatest_20221118 = mtool.convert_DF_dateTime(inputdf = uniprot_sprot_leatest_20221118)

#Drop Duplicates
uniprot_sprot_leatest_20221118.drop_duplicates(subset=['seq'], keep='first', inplace=True)
uniprot_sprot_leatest_20221118.reset_index(drop=True, inplace=True)


In [6]:
uniprot_sprot_leatest_20221118.head(3)

| Unnamed: 0 | id | name | isenzyme | isMultiFunctional | functionCounts | ec_number | ec_specific_level | date_integraged | date_sequence_update | date_annotation_update | seq | seqlength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P03742 | FIB35_BPT4 | False | False | 0 | - | 0 | 1986-07-21 | 2001-12-19 | 2020-08-12 | MEKFMAEFGQGYVQTPFLSESNSVRYKISIAGSCPLSTAGPSYVKF... | 372 |
| 1 | P00648 | RNBR_BACAM | True | False | 1 | 3.1.27.- | 3 | 1986-07-21 | 1989-07-01 | 2022-05-25 | MMKMEGIALKKRLSWISVCLLVLVSAAGMLFSTAAKTETSSHKAHT... | 157 |
| 2 | P00155 | CYF_PEA | False | False | 0 | - | 0 | 1986-07-21 | 1989-07-01 | 2022-10-12 | MQTRNAFSWIKKEITRSISVLLMIYIITRAPISNAYPIFAQQGYEN... | 320 |


## 5. Preprocessing
### 5.1 Format EC

In [7]:
uniprot_sprot_leatest_20221118['ec_number'] = uniprot_sprot_leatest_20221118.ec_number.parallel_apply(lambda x: mtool.format_ec(x))
uniprot_sprot_leatest_20221118['ec_number'] = uniprot_sprot_leatest_20221118.ec_number.parallel_apply(lambda x: mtool.specific_ecs(x))
uniprot_sprot_leatest_20221118['functionCounts'] = uniprot_sprot_leatest_20221118.ec_number.parallel_apply(lambda x: 0 if x=='-'  else len(x.split(',')))

print('uniprot_sprot_leatest_20221118 finished')

uniprot_sprot_leatest_20221118 finished
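
The `functionCounts` rule used in the cell above (0 for `-`, otherwise the number of comma-separated EC numbers) can be checked on the sample values from the table output. The second helper, `ec_level`, is a hypothetical reading of the `ec_specific_level` column inferred from the sample rows (the number of positions that are not `-`); the actual extraction tool may compute it differently.

```python
def function_count(ec_field):
    """Number of EC numbers in a comma-separated field; '-' means non-enzyme."""
    return 0 if ec_field == "-" else len(ec_field.split(","))


def ec_level(ec_number):
    """Count the specified (non '-') positions of one EC number (hypothetical helper)."""
    return sum(1 for part in ec_number.split(".") if part != "-")


print(function_count("-"))                    # 0
print(function_count("3.1.27.-"))             # 1
print(function_count("2.3.1.300,2.3.1.180"))  # 2
print(ec_level("3.1.27.-"))                   # 3
print(ec_level("2.3.1.286"))                  # 4
print(ec_level("-"))                          # 0
```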


In [8]:
uniprot_sprot_leatest_20221118.to_feather(cfg.DIR_UNIPROT + '/uniprot_sprot_leatest_20221118.feather')

### 5.2 Split Train Test

### 5.3 Remove changed sequences in test set

### 5.4 Trim string

In [9]:
with pd.option_context('mode.chained_assignment', None):
    uniprot_sprot_leatest_20221118.ec_number = uniprot_sprot_leatest_20221118.ec_number.parallel_apply(lambda x : str(x).strip()) #ec trim
    uniprot_sprot_leatest_20221118.seq = uniprot_sprot_leatest_20221118.seq.parallel_apply(lambda x : str(x).strip()) #seq trim

In [10]:
uniprot_sprot_leatest_20221118

| Unnamed: 0 | id | name | isenzyme | isMultiFunctional | functionCounts | ec_number | ec_specific_level | date_integraged | date_sequence_update | date_annotation_update | seq | seqlength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P03742 | FIB35_BPT4 | False | False | 0 | - | 0 | 1986-07-21 | 2001-12-19 | 2020-08-12 | MEKFMAEFGQGYVQTPFLSESNSVRYKISIAGSCPLSTAGPSYVKF... | 372 |
| 1 | P00648 | RNBR_BACAM | True | False | 1 | 3.1.27.- | 3 | 1986-07-21 | 1989-07-01 | 2022-05-25 | MMKMEGIALKKRLSWISVCLLVLVSAAGMLFSTAAKTETSSHKAHT... | 157 |
| 2 | P00155 | CYF_PEA | False | False | 0 | - | 0 | 1986-07-21 | 1989-07-01 | 2022-10-12 | MQTRNAFSWIKKEITRSISVLLMIYIITRAPISNAYPIFAQQGYEN... | 320 |
| 3 | P01630 | KV2A6_MOUSE | False | False | 0 | - | 0 | 1986-07-21 | 1986-07-21 | 2022-05-25 | DIVMTQTAPSALVTPGESVSISCRSSKSLLHSNGNTYLYWFLQRPG... | 113 |
| 4 | P01629 | KV2A4_MOUSE | False | False | 0 | - | 0 | 1986-07-21 | 1986-07-21 | 2022-10-12 | DIVMTQAAFSNPVTLGTSASFSCRSSKSLQQSKGITYLYWYLQKPG... | 112 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 480276 | Q7XWV4 | SRT1_ORYSJ | True | False | 1 | 2.3.1.286 | 4 | 2022-10-12 | 2004-03-01 | 2022-10-12 | MSLGYAEKLSYREDVGNVGMPEIFDSPELLHKKIEELAVMVRESKH... | 483 |
| 480277 | O31201 | HUTG_PSEPU | True | False | 1 | 3.5.1.68 | 4 | 2022-10-12 | 1998-01-01 | 2022-10-12 | MDKVLSFHQGRLPLLISMPHAGLRLSDAVRDGLVEEARSLPDTDWH... | 267 |
| 480278 | Q9HU92 | HUTG_PSEAE | True | False | 1 | 3.5.1.68 | 4 | 2022-10-12 | 2001-03-01 | 2022-10-12 | MDEVLSFKRGRVPLLISMPHPGTRLTPAVDAGLVEEARALTDTDWH... | 266 |
| 480279 | P0DW99 | FABH_CUTAC | True | True | 2 | 2.3.1.300,2.3.1.180 | 4 | 2022-10-12 | 2022-10-12 | 2022-10-12 | MTAIKTRPVHGYSKFLSTGSARGSRVVTNKEMCTLIDSTPEWIEQR... | 332 |


### 5.5 Save train test

In [11]:
uniprot_sprot_leatest_20221118.to_feather(cfg.DATADIR + 'datasets/uniprot_sprot_leatest_20221118.feather')

## 6. Build benchmarking datasets
### 6.1 Task 1 isEnzyme

In [12]:
train_leatest = pd.read_feather(cfg.DIR_DATASETS + 'uniprot_sprot_leatest_20221118.feather')
task1_train_leatest_20221118 = train_leatest[['id','seq','isenzyme']]
task1_train_leatest_20221118.to_feather(cfg.FILE_TASK1_TRAIN_LEATEST)
funclib.table2fasta(table=task1_train_leatest_20221118[['id', 'seq']], file_out=cfg.FILE_TASK1_TRAIN_LEATEST_FASTA)


Write finished
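
`funclib.table2fasta` is a project-specific helper whose implementation is not shown here. As a rough stand-in, the sketch below assumes it writes one `>id` header line followed by the sequence for every record; the function `write_fasta` is hypothetical, and the sequences are truncated versions of the ones in the table above.

```python
import io


def write_fasta(records, handle):
    """Write (id, sequence) pairs to a text handle in FASTA format."""
    for acc, seq in records:
        handle.write(f">{acc}\n{seq}\n")


# Two records with IDs from the table above and truncated sequences
records = [("P03742", "MEKFMAEF"), ("P00648", "MMKMEGIA")]
buffer = io.StringIO()
write_fasta(records, buffer)
fasta_text = buffer.getvalue()
print(fasta_text)
```

Writing to a file instead only requires passing an open file handle, for example `open(path, "w")`, in place of the `StringIO` buffer.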


### 6.2 Task2 Function Counts

In [13]:
task2_train_leatest_20221118 = train_leatest[train_leatest.functionCounts > 0].reset_index(drop=True)
task2_train_leatest_20221118 = task2_train_leatest_20221118[['id','seq','functionCounts']]
task2_train_leatest_20221118.to_feather(cfg.FILE_TASK2_TRAIN_LEATEST)
funclib.table2fasta(table=task2_train_leatest_20221118[['id', 'seq']], file_out=cfg.FILE_TASK2_TRAIN_LEATEST_FASTA)


Write finished


### 6.3 Task3 EC Number

In [14]:
task3_train_leatest_20221118 = train_leatest[train_leatest.functionCounts > 0].reset_index(drop=True)
task3_train_leatest_20221118 = task3_train_leatest_20221118[['id','seq','ec_number']]
task3_train_leatest_20221118.to_feather(cfg.FILE_TASK3_TRAIN_LEATEST)
funclib.table2fasta(table=task3_train_leatest_20221118[['id', 'seq']], file_out=cfg.FILE_TASK3_TRAIN_LEATEST_FASTA)

Write finished


## 7. Make Feature Bank

### 7.1 ESM embedding 

In [15]:
# loading sprot data
uniprot_sprot_leatest_20221118 = pd.read_feather(cfg.DIR_UNIPROT + '/uniprot_sprot_leatest_20221118.feather')

# merge
uniprot_sprot_leatest_20221118 = uniprot_sprot_leatest_20221118.sort_values(by=['id', 'date_annotation_update'], ascending=False)
uniprot_sprot_leatest_20221118 = uniprot_sprot_leatest_20221118[['id', 'seq']].drop_duplicates(subset='id', keep='first')
uniprot_sprot_leatest_20221118.reset_index(drop=True, inplace=True)


# load existing features
if ftool.isfileExists(cfg.FILE_FEATURE_ESM0):
    feature_esm0 = pd.read_feather(cfg.FILE_FEATURE_ESM0)
    feature_esm32 = pd.read_feather(cfg.FILE_FEATURE_ESM32)
    feature_esm33 = pd.read_feather(cfg.FILE_FEATURE_ESM33)
    feature_unirep = pd.read_feather(cfg.FILE_FEATURE_UNIREP)
    feature_onehot = pd.read_feather(cfg.FILE_FEATURE_ONEHOT)
    # work out which sequences still need an embedding
    needesm = uniprot_sprot_leatest_20221118[~uniprot_sprot_leatest_20221118.id.isin(list(set(feature_esm33.id)))]
    needunirep = uniprot_sprot_leatest_20221118[~uniprot_sprot_leatest_20221118.id.isin(list(set(feature_unirep.id)))]
    needonehot = uniprot_sprot_leatest_20221118[~uniprot_sprot_leatest_20221118.id.isin(list(set(feature_onehot.id)))]
else:
    # first run: start from empty feature tables so that the pd.concat calls
    # in the following cells do not fail on undefined names
    feature_esm0, feature_esm32, feature_esm33 = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
    feature_unirep, feature_onehot = pd.DataFrame(), pd.DataFrame()
    needesm = uniprot_sprot_leatest_20221118
    needunirep = uniprot_sprot_leatest_20221118
    needonehot = uniprot_sprot_leatest_20221118
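
The caching idea in the cell above is to embed only those sequences whose IDs are not yet present in the stored feature files. A plain-Python sketch of that bookkeeping follows; `ids_needing_features` and `fake_embed` are hypothetical names, and `fake_embed` is merely a placeholder for a real embedding model.

```python
def ids_needing_features(all_ids, cached_ids):
    """Return the IDs (in original order) that have no cached feature yet."""
    cached = set(cached_ids)
    return [acc for acc in all_ids if acc not in cached]


def fake_embed(acc):
    """Placeholder embedder (hypothetical): returns a one-element feature vector."""
    return [float(len(acc))]


# Cached features from an earlier run (toy values)
feature_cache = {"P03742": [6.0]}
all_ids = ["P03742", "P00648", "P00155"]

todo = ids_needing_features(all_ids, feature_cache.keys())
for acc in todo:
    feature_cache[acc] = fake_embed(acc)

print(todo)                   # ['P00648', 'P00155']
print(sorted(feature_cache))  # ['P00155', 'P00648', 'P03742']
```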



In [16]:
# !pip install fair-esm
from tools import embedding_esm as esmebd
if len(needesm)>0:
    tr_rep0, tr_rep32, tr_rep33 = esmebd.get_rep_multi_sequence(sequences=needesm, model='esm1b_t33_650M_UR50S',seqthres=1022)

    #merge existing
    feature_esm0 = pd.concat([feature_esm0, tr_rep0], axis=0).reset_index(drop=True)
    feature_esm32 = pd.concat([feature_esm32, tr_rep32], axis=0).reset_index(drop=True)
    feature_esm33 = pd.concat([feature_esm33, tr_rep33], axis=0).reset_index(drop=True)


    #save
    feature_esm0.to_feather(cfg.FILE_FEATURE_ESM0)
    feature_esm32.to_feather(cfg.FILE_FEATURE_ESM32)
    feature_esm33.to_feather(cfg.FILE_FEATURE_ESM33)



### 7.2 Unirep

In [17]:
if len(needunirep) > 0:
    from tools import embedding_unirep as unirep
    tr_unirep = unirep.getunirep(needunirep, 40)

    feature_unirep = pd.concat([feature_unirep, tr_unirep],axis=0).reset_index(drop=True)
    feature_unirep.to_feather(cfg.FILE_FEATURE_UNIREP)


### 7.3 one-hot

In [18]:
if len(needonehot) > 0:
    tr_onehot = onehotebd.get_onehot(sequences=needonehot, padding=True, padding_window=1500)
    feature_onehot = pd.concat([feature_onehot, tr_onehot],axis=0).reset_index(drop=True)
    feature_onehot.to_feather(cfg.FILE_FEATURE_ONEHOT)
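
How `onehotebd.get_onehot` works internally is not shown in this notebook. The sketch below illustrates one common way to one-hot encode a protein sequence with a fixed padding window; the 20-letter alphabet, the truncation of long sequences, and the all-zero rows for unknown residues are assumptions made for this example, and `onehot_encode` is a hypothetical function.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def onehot_encode(seq, padding_window):
    """Encode a sequence as a flat 0/1 list of length padding_window * 20.

    Sequences longer than padding_window are truncated; shorter ones are
    zero-padded. Residues outside the alphabet become all-zero rows.
    """
    flat = []
    for pos in range(padding_window):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(seq) and seq[pos] in AA_INDEX:
            row[AA_INDEX[seq[pos]]] = 1
        flat.extend(row)
    return flat


vec = onehot_encode("MKX", padding_window=4)
print(len(vec))  # 80
print(sum(vec))  # 2
```

With `padding_window=1500`, as used in the cell above, the same scheme would yield 30,000 values per sequence.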

## 8. Train Model

In [21]:
# embedding feather
feature_esm0 = pd.read_feather(cfg.FILE_FEATURE_ESM0)
feature_esm32 = pd.read_feather(cfg.FILE_FEATURE_ESM32)
feature_esm33 = pd.read_feather(cfg.FILE_FEATURE_ESM33)
feature_unirep = pd.read_feather(cfg.FILE_FEATURE_UNIREP)
feature_onehot = pd.read_feather(cfg.FILE_FEATURE_ONEHOT)

In [22]:
# train data 
task1_train = pd.read_feather(cfg.FILE_TASK1_TRAIN_LEATEST)
task2_train = pd.read_feather(cfg.FILE_TASK2_TRAIN_LEATEST)
task3_train = pd.read_feather(cfg.FILE_TASK3_TRAIN_LEATEST)
print('task1_train: ,task2_train: ,task3_train: ',task1_train.shape,task2_train.shape,task3_train.shape)

task1_train: ,task2_train: ,task3_train:  (480281, 3) (232874, 3) (232874, 3)


### 8.1 Task 1

In [3]:
# the embedding method can be changed
task1_train = task1_train.merge(feature_onehot,on='id',how='left')
# feature
task1_feature = task1_train.iloc[:, 3:]
task1_feature = np.array(task1_feature)
task1_feature =np.reshape(task1_feature,(task1_feature.shape[0],1,task1_feature.shape[1]))


NameError: name 'task1_train' is not defined

In [24]:
# build labels
ecs = set(task1_train.isenzyme)
label_dict = dict(zip(ecs, range(len(ecs))))

def get_label(ecnum_str):
    # one-hot vector with a 1 at the index of the given class
    label_init = np.zeros(len(label_dict),  dtype=int)
    label_init[label_dict.get(ecnum_str)] = 1
    return list(label_init)

train_label = task1_train.isenzyme.apply(lambda x: get_label(ecnum_str=x))

train_label = [item for item in train_label]
y_train = np.array(train_label)


In [28]:
# build, train, and save the model
inputs = Input(shape=(1,task1_feature.shape[2]), name="input")
gru = Bidirectional(GRU(512, dropout=0.2, return_sequences=True), name="bi-gru")(inputs)
attention = Attention(32)(gru)
num_class = len(ecs)
output = Dense(num_class, activation='sigmoid', name="dense")(attention)
model = Model(inputs, output)

model.compile(loss='binary_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])
history = model.fit(task1_feature, y_train, batch_size=512, epochs= 400)
model.save(cfg.MODELDIR+'task1_onehot.h5')


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'


### 8.2 Task 2

In [29]:
# the embedding method can be changed
task2_train = task2_train.merge(feature_onehot,on='id',how='left')
# feature
task2_feature = task2_train.iloc[:, 3:]
task2_feature = np.array(task2_feature)
task2_feature =np.reshape(task2_feature,(task2_feature.shape[0],1,task2_feature.shape[1]))

In [32]:
# build labels
ecs = set(task2_train.functionCounts)
label_dict = dict(zip(ecs, range(len(ecs))))

def get_label(ecnum_str):
    # one-hot vector with a 1 at the index of the given function count
    label_init = np.zeros(len(label_dict),  dtype=int)
    label_init[label_dict.get(ecnum_str)] = 1
    return list(label_init)

train_label = task2_train.functionCounts.apply(lambda x: get_label(ecnum_str=x))

train_label = [item for item in train_label]
y_train = np.array(train_label)


In [33]:
# build, train, and save the model
inputs = Input(shape=(1,task2_feature.shape[2]), name="input")
gru = Bidirectional(GRU(512, dropout=0.2, return_sequences=True), name="bi-gru")(inputs)
attention = Attention(32)(gru)
num_class = len(ecs)
output = Dense(num_class, activation='sigmoid', name="dense")(attention)
model = Model(inputs, output)

model.compile(loss='binary_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])
history = model.fit(task2_feature, y_train, batch_size=512, epochs= 1)
model.save(cfg.MODELDIR+'task2_onehot.h5')

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'


### 8.3 Task 3

In [40]:
# the embedding method can be changed
task3_train = task3_train.merge(feature_onehot,on='id',how='left')
# feature
task3_feature = task3_train.iloc[:, 3:]
task3_feature = np.array(task3_feature)
task3_feature =np.reshape(task3_feature,(task3_feature.shape[0],1,task3_feature.shape[1]))

In [42]:
# build labels
ecs = set(task3_train.ec_number)
eclist = []
for item in ecs:
    eclist.extend(item.split(','))

# sorted so that the EC-to-index mapping is reproducible across runs
label_dict = {ec: idx for idx, ec in enumerate(sorted(set(eclist)))}

def get_label(ecnum_str):
    # multi-hot vector with a 1 for every EC number of the record
    label_init = np.zeros(len(label_dict),  dtype=int)
    for item in ecnum_str.split(','):
        if label_dict.get(item) is not None:
            label_init[label_dict.get(item)] = 1
    return list(label_init)

train_label = task3_train.ec_number.apply(lambda x: get_label(ecnum_str=x))

train_label = [item for item in train_label]
y_train = np.array(train_label)
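
The multi-label encoding for the EC task can be illustrated on a few EC fields taken from the table output earlier in this notebook. This is a plain-Python sketch with hypothetical helper names (`build_label_index`, `multi_hot`). It sorts the EC numbers before assigning column indices so that the column order stays stable across runs, which matters if the label mapping has to be rebuilt when a saved model is loaded.

```python
def build_label_index(ec_fields):
    """Map every distinct EC number to a column index (sorted for reproducibility)."""
    distinct = sorted({ec for field in ec_fields for ec in field.split(",")})
    return {ec: i for i, ec in enumerate(distinct)}


def multi_hot(ec_field, label_index):
    """Return a 0/1 vector with a 1 for each known EC number in the field."""
    vector = [0] * len(label_index)
    for ec in ec_field.split(","):
        idx = label_index.get(ec)
        if idx is not None:
            vector[idx] = 1
    return vector


ec_fields = ["3.5.1.68", "2.3.1.300,2.3.1.180", "2.3.1.286"]
label_index = build_label_index(ec_fields)
labels = [multi_hot(field, label_index) for field in ec_fields]
print(label_index)
print(labels)
```

EC numbers that are missing from `label_index` are silently skipped, so a record carrying only unseen EC numbers gets an all-zero vector.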

In [54]:
# build, train, and save the model
inputs = Input(shape=(1,task3_feature.shape[2]), name="input")
gru = Bidirectional(GRU(512, dropout=0.2, return_sequences=True), name="bi-gru")(inputs)
attention = Attention(32)(gru)
num_class = len(label_dict)
output = Dense(num_class, activation='sigmoid', name="dense")(attention)
model = Model(inputs, output)

model.compile(loss='binary_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])
history = model.fit(task3_feature, y_train, batch_size=3948, epochs= 400)
model.save(cfg.MODELDIR+'task3_onehot.h5')

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
