# Construct Datasets for benchmarking tasks

> author: Shizhenkun   
> email: zhenkun.shi@tib.cas.cn   
> date: 2021-10-08  

## Dataset 1: Enzyme / Non-enzyme Dataset
The enzyme dataset consists of two parts: a training set and a testing set. The training set is taken from the Feb-2018 snapshot, excluding items that had been deleted by the Jun-2020 snapshot. It consists of 467,973 records, of which 222,290 are enzymes and 245,683 are non-enzymes. The testing set is taken from the Jun-2020 snapshot, excluding items that already appeared in the Feb-2018 snapshot. It consists of 8,033 records, of which 3,579 are enzymes and 4,454 are non-enzymes. Unlike previous work, we did not filter sequences by length or homology, in order to make the data more inclusive. Each sequence is labeled 1 for enzyme and 0 for non-enzyme.
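
The snapshot-based split described above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: matching records by an `id` column, the helper name `split_by_snapshot`, and the toy IDs are all assumptions made for the example.

```python
import pandas as pd

def split_by_snapshot(old: pd.DataFrame, new: pd.DataFrame, key: str = "id"):
    """Split records into train/test using two database snapshots.

    train: records of the old snapshot that still exist in the new one
    test:  records of the new snapshot that were absent from the old one
    """
    train = old[old[key].isin(new[key])].reset_index(drop=True)
    test = new[~new[key].isin(old[key])].reset_index(drop=True)
    return train, test

# Toy snapshots with hypothetical IDs (illustration only)
snap_2018 = pd.DataFrame({"id": ["A", "B", "C"], "isenzyme": [True, False, True]})
snap_2020 = pd.DataFrame({"id": ["A", "C", "D"], "isenzyme": [True, True, False]})

train, test = split_by_snapshot(snap_2018, snap_2020)

# Binary label: 1 for enzyme, 0 for non-enzyme
train["label"] = train["isenzyme"].astype(int)
test["label"] = test["isenzyme"].astype(int)

print(train["id"].tolist(), test["id"].tolist())
```

Here record `B` is dropped from training because it no longer exists in the newer snapshot, and only `D` ends up in the test set because it is new.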

## Dataset 2: Enzyme Quantity Dataset
The enzyme quantity dataset contains only enzyme data, with 13,108 records. The function quantity ranges from 2 to 10.

## Dataset 3: EC number Dataset

Similar to the enzyme quantity dataset, the EC number dataset consists of 225,221 enzyme records: 221,642 in the training set and the remaining 3,579 in the testing set, covering 4,852 EC numbers. Up to Feb 2020, there were still 267 EC numbers that the model cannot handle in the benchmark, so sequences with these 267 EC numbers are excluded from the evaluation. This problem can be resolved in a production scenario, because the entire Swiss-Prot data is used there. The EC coverage is now 5,307 and can be extended automatically, because the model is retrained as each Swiss-Prot release is published (every 8 weeks).

## 1. Import packages

In [1]:
import numpy as np
import pandas as pd

import os
import re
import time
import datetime
import sys
from tqdm import tqdm
import config as cfg
from functools import reduce
import matplotlib.pyplot as plt

sys.path.append("./tools/")
import exact_ec_from_uniprot as exactec
import minitools as mtool
import embedding_esm as esmebd
# import embedding_unirep as unirep

from pandarallel import pandarallel # import pandarallel
pandarallel.initialize() # initialize the parallel library


%load_ext autoreload
%autoreload 2

INFO: Pandarallel will run on 80 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## 2. Define Functions

In [2]:
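# install axel (used to download the dataset) if it is not already available
def install_axel():
    isExists = !which axel
    # `which` may print nothing when axel is missing, so guard against an empty result
    if isExists and 'axel' in str(isExists[0]):
        return True
    else:
        !sudo apt install axel -y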

# add missing '-' for ec number
def refill_ec(ec):   
    if ec == '-':
        return ec
    levelArray = ec.split('.')
    if  levelArray[3]=='':
        levelArray[3] ='-'
    ec = '.'.join(levelArray)
    return ec

def specific_ecs(ecstr):
    if '-' not in ecstr or len(ecstr)<4:
        return ecstr
    ecs = ecstr.split(',')
    if len(ecs)==1:
        return ecstr
    
    reslist=[]
    
    for ec in ecs:
        recs = ecs.copy()
        recs.remove(ec)
        ecarray = np.array([x.split('.') for x in recs])
        
        if '-' not in ec:
            reslist +=[ec]
            continue
        linearray= ec.split('.')
        if linearray[1] == '-':
            #l1 in l1s and l2 not empty
            if (linearray[0] in  ecarray[:,0]) and (len(set(ecarray[:,0]) - set({'-'}))>0):
                continue
        if linearray[2] == '-':
            # l1, l2 in l1s l2s, l3 not empty
            if (linearray[0] in  ecarray[:,0]) and (linearray[1] in  ecarray[:,1]) and (len(set(ecarray[:,2]) - set({'-'}))>0):
                continue
        if linearray[3] == '-':
            # l1, l2, l3 in l1s l2s l3s, l4 not empty
            if (linearray[0] in  ecarray[:,0]) and (linearray[1] in  ecarray[:,1]) and (linearray[2] in  ecarray[:,2]) and (len(set(ecarray[:,3]) - set({'-'}))>0):
                continue
                
        reslist +=[ec]
    return ','.join(reslist)

#format ec
def format_ec(ecstr):
    ecArray= ecstr.split(',')
    ecArray=[x.strip() for x in ecArray] #strip blank
    ecArray=[refill_ec(x) for x in ecArray] #format ec to full
    ecArray = list(set(ecArray)) # remove duplicates
    
    return ','.join(ecArray)
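
To make the intent of the helpers above easier to follow, here is a compact, standalone sketch of the same three ideas: padding truncated EC numbers, removing duplicates, and dropping a partial EC number when a more specific one is present. It is a simplified variant, not a copy of the functions above: the names `pad_ec`, `normalize_ecs`, and `drop_subsumed` are hypothetical, padding of EC numbers with fewer than four levels is an added assumption, and the subsumption check uses strict prefix matching.

```python
def pad_ec(ec: str) -> str:
    """Pad missing EC levels with '-', e.g. '1.1.1.' -> '1.1.1.-'."""
    if ec == "-":
        return ec
    levels = [lvl if lvl else "-" for lvl in ec.split(".")]
    levels += ["-"] * (4 - len(levels))  # assumption: also pad short EC numbers
    return ".".join(levels)

def normalize_ecs(ecstr: str) -> list:
    """Strip blanks, pad each EC number, and remove duplicates.

    The result is sorted so that the output order is deterministic.
    """
    return sorted({pad_ec(ec.strip()) for ec in ecstr.split(",")})

def drop_subsumed(ecs: list) -> list:
    """Drop a partial EC number if another, more specific EC shares its prefix."""
    kept = []
    for ec in ecs:
        levels = ec.split(".")
        if "-" not in levels:
            kept.append(ec)  # fully specified EC numbers are always kept
            continue
        prefix = levels[: levels.index("-")]
        subsumed = bool(prefix) and any(
            other != ec
            and other.split(".")[: len(prefix)] == prefix
            and other.split(".").count("-") < levels.count("-")
            for other in ecs
        )
        if not subsumed:
            kept.append(ec)
    return kept

normalized = normalize_ecs(" 1.1.1., 2.7.1.1 ,2.7.1.1")
specific = drop_subsumed(["1.1.1.-", "1.1.1.1", "2.7.-.-"])
print(normalized)
print(specific)
```

In the second call, `1.1.1.-` is dropped because `1.1.1.1` is more specific, while `2.7.-.-` is kept because no other entry refines it.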

## 3. Download raw data from UniProt

In [4]:
# download location: ./data
install_axel()
!axel -n 10 https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz -o ./data/uniprot_sprot_latest.dat.gz -q -c

## 4. Extract records from raw data

In [5]:
# 2018 data
! tar -zxvf ./tmp/uniprot_sprot-only2018_02.tar.gz -C ./tmp/
! mv ./tmp/uniprot_sprot.dat.gz ./tmp/sprot2018.data.gz
! rm -rf ./tmp/*.fasta.gz ./tmp/*.xml.gz

# 2020 data
! tar -zxvf ./tmp/uniprot_sprot-only2020_06.tar.gz -C ./tmp/
! mv ./tmp/uniprot_sprot.dat.gz ./tmp/sprot2020.data.gz
! rm -rf ./tmp/*.fasta.gz ./tmp/*.xml.gz

uniprot_sprot.dat.gz
uniprot_sprot.fasta.gz
uniprot_sprot_varsplic.fasta.gz
uniprot_sprot.xml.gz
uniprot_sprot.dat.gz
uniprot_sprot.fasta.gz
uniprot_sprot_varsplic.fasta.gz
uniprot_sprot.xml.gz


In [5]:
exactec.run_exact_task(infile=cfg.DATADIR+'uniprot_sprot_latest.dat.gz', outfile=cfg.DATADIR+'sprot_latest.tsv')

565928it [04:41, 2011.66it/s]


finished use time 279.099 s
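
The internals of `exactec.run_exact_task` are not shown in this notebook. As a rough illustration of what extracting records from a Swiss-Prot flat file involves, the sketch below reads the `ID`, `AC`, and `DE` lines of each entry and collects `EC=` annotations until the `//` terminator. The function name `iter_records` and the regular expression are assumptions for this example, and the two toy entries are heavily abbreviated (real entries contain many more line types).

```python
import io
import re

# Matches annotations such as "EC=1.10.3.2" on DE lines
EC_PATTERN = re.compile(r"EC=([\w.\-]+)")

def iter_records(handle):
    """Yield (entry_name, accession, ec_numbers) for each flat-file entry."""
    name, accession, ecs = None, None, []
    for line in handle:
        if line.startswith("ID   "):
            name = line.split()[1]
        elif line.startswith("AC   ") and accession is None:
            # keep only the primary (first) accession
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("DE   "):
            ecs.extend(EC_PATTERN.findall(line))
        elif line.startswith("//"):
            # end of entry: emit it and reset the state
            yield name, accession, ecs
            name, accession, ecs = None, None, []

# Abbreviated toy entries (illustration only)
sample = """\
ID   LAC1_FOMPI              Reviewed;         539 AA.
AC   S8FIE4;
DE            EC=1.10.3.2;
//
ID   H32_XENLA               Reviewed;         136 AA.
AC   P84233;
//
"""

records = list(iter_records(io.StringIO(sample)))
for record in records:
    print(record)
```

For the real file, the same generator could be fed with `gzip.open(path, "rt")` so that the archive is streamed rather than decompressed to disk first.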


## 5. Load records

In [6]:
# Load the data and convert the date columns to datetime format
sprot_latest = pd.read_csv(cfg.DATADIR+'sprot_latest.tsv', sep='\t',header=0) # read the file
sprot_latest = mtool.convert_DF_dateTime(inputdf = sprot_latest)

## 6. Preprocessing
### 6.1 Drop Duplicates

In [7]:
sprot_latest.drop_duplicates(subset=['seq'], keep='first', inplace=True)
sprot_latest.reset_index(drop=True, inplace=True)
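
As a small illustration of this step, the toy frame below contains two records with the same sequence; `keep='first'` retains the earlier one, and resetting the index restores a contiguous 0..n-1 range afterwards. The IDs and sequences are hypothetical.

```python
import pandas as pd

# Hypothetical records: P1 and P2 share the same sequence
records = pd.DataFrame({
    "id": ["P1", "P2", "P3"],
    "seq": ["MKT", "MKT", "WAGG"],
})

deduped = records.drop_duplicates(subset=["seq"], keep="first")
deduped = deduped.reset_index(drop=True)  # index becomes 0, 1 again

print(deduped)
```

A default index also matters later on, since `DataFrame.to_feather` does not accept a non-default index.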


### 6.2 Format EC

In [8]:
#sprot_latest
sprot_latest['ec_number'] = sprot_latest.ec_number.parallel_apply(lambda x: format_ec(x))
sprot_latest['ec_number'] = sprot_latest.ec_number.parallel_apply(lambda x: specific_ecs(x))
sprot_latest['functionCounts'] = sprot_latest.ec_number.parallel_apply(lambda x: 0 if x=='-'  else len(x.split(',')))
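
The `functionCounts` rule can be checked on a few values with plain pandas. `parallel_apply` is pandarallel's parallel counterpart of `apply`, so the sketch below uses `apply` to stay dependency-free; the helper name `count_functions` is introduced only for this example.

```python
import pandas as pd

def count_functions(ec_number: str) -> int:
    """Return 0 for non-enzymes ('-'), otherwise the number of EC numbers."""
    return 0 if ec_number == "-" else len(ec_number.split(","))

ec = pd.Series(["-", "1.10.3.2", "2.7.1.1,3.1.3.2"])
counts = ec.apply(count_functions)

print(counts.tolist())
```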

In [9]:
sprot_latest

| Unnamed: 0 | id | name | isenzyme | isMultiFunctional | functionCounts | ec_number | ec_specific_level | date_integraged | date_sequence_update | date_annotation_update | seq | seqlength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P84233 | H32_XENLA | False | False | 0 | - | 0 | 1986-07-21 | 2007-01-23 | 2021-09-29 | MARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGT... | 136 |
| 1 | P0A7F3 | PYRI_ECOLI | False | False | 0 | - | 0 | 1986-07-21 | 2007-01-23 | 2021-09-29 | MTHDNKLQVEAIKRGTVIDHIPAQIGFKLLSLFKLTETDQRITIGL... | 153 |
| 2 | P03212 | GL_EBVB9 | False | False | 0 | - | 0 | 1986-07-21 | 1986-07-21 | 2021-09-29 | MRAVGVFLAICLVTIFVLPTWGNWAYPCCHVTQLRAQHLLALENIS... | 137 |
| 3 | P01158 | DSIP_RABIT | False | False | 0 | - | 0 | 1986-07-21 | 1986-07-21 | 2019-12-11 | WAGGDASGE | 9 |
| 4 | P0A6U4 | MNMG_ECOL6 | False | False | 0 | - | 0 | 1986-07-21 | 2005-03-29 | 2021-09-29 | MFYPDPFDVIIIGGGHAGTEAAMAAARMGQQTLLLTHNIDTLGQMS... | 629 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 477912 | A0A0U5GMR5 | AUSJ_ASPCI | False | False | 0 | - | 0 | 2021-09-29 | 2016-03-16 | 2021-09-29 | MTTTRHRLLATASRFVTTLESLDVDAMLAVRSPTCLHHMCLPSFRN... | 165 |
| 477913 | S8FIE4 | LAC1_FOMPI | True | False | 1 | 1.10.3.2 | 4 | 2021-09-29 | 2013-10-16 | 2021-09-29 | MAFTAISLFLAALGVINTAFAQSAVIGPVTDLDIINAEVNLDGFPR... | 539 |
| 477914 | S8FGV1 | LAC2_FOMPI | True | False | 1 | 1.10.3.2 | 4 | 2021-09-29 | 2013-10-16 | 2021-09-29 | MLLSSAFVGSCLAILNFAAAVSAQGGLSRTTLNIVNKVISPDGYSR... | 530 |
| 477915 | Q57286 | MMDB_VEIPA | True | False | 1 | 7.2.4.3 | 4 | 2021-09-29 | 1996-11-01 | 2021-09-29 | MEAFAVAIQSVINDSGFLAFTTGNAIMILVGLILLYLAFAREFEPL... | 373 |


### 6.3 Get Training Set

In [10]:
# keep columns 0, 2-6 and 10-11; copy so later column assignments do not act on a slice
train = sprot_latest.iloc[:,np.r_[0,2:7,10:12]].copy()
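
Because the selection above is positional, it can help to see which positions `np.r_` expands to and which column names they correspond to. The sketch below assumes the column order displayed in section 6.2, with `id` at position 0 (the leading `Unnamed: 0` field is treated as the row index); that layout is an assumption and should be checked against the actual frame.

```python
import numpy as np
import pandas as pd

# Assumed column order, taken from the table shown in section 6.2
columns = [
    "id", "name", "isenzyme", "isMultiFunctional", "functionCounts",
    "ec_number", "ec_specific_level", "date_integraged",
    "date_sequence_update", "date_annotation_update", "seq", "seqlength",
]
frame = pd.DataFrame(columns=columns)  # empty frame, only the layout matters

positions = np.r_[0, 2:7, 10:12]  # concatenates 0, 2..6 and 10..11
selected = frame.iloc[:, positions].columns.tolist()

print(positions.tolist())
print(selected)
```

Under this assumption the training frame keeps the identifier, the label-related columns, the sequence, and its length, and drops the name and date columns.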

### 6.4 Trim string

In [11]:
with pd.option_context('mode.chained_assignment', None):
    train.ec_number = train.ec_number.parallel_apply(lambda x : str(x).strip()) #ec trim
    train.seq = train.seq.parallel_apply(lambda x : str(x).strip()) #seq trim

### 6.5 Save train latest

In [12]:
train.to_feather(cfg.DATADIR + 'sprot_latest.feature')

### 6.51 Move Existing Features to Bank

In [20]:
! mv $cfg.DATADIR'sprot_latest_rep0.feather' $cfg.DATADIR'featureBank/sprot_latest_rep0.feather'
! mv $cfg.DATADIR'sprot_latest_rep32.feather' $cfg.DATADIR'featureBank/sprot_latest_rep32.feather'
! mv $cfg.DATADIR'sprot_latest_rep33.feather' $cfg.DATADIR'featureBank/sprot_latest_rep33.feather'
! mv $cfg.DATADIR'sprot_latest_unirep.feather' $cfg.DATADIR'featureBank/sprot_latest_unirep.feather'

mv: cannot stat '/home/shizhenkun/codebase/DMLF/data/sprot_latest_rep0.feather': No such file or directory


### 6.52 Loading existing records with features

In [21]:
existing= pd.read_feather(cfg.DATADIR + 'featureBank/sprot_latest_rep0.feather')

In [22]:
existing

| Unnamed: 0 | id | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f1271 | f1272 | f1273 | f1274 | f1275 | f1276 | f1277 | f1278 | f1279 | f1280 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P0A6V3 | -0.083714 | 0.005133 | 0.030848 | 0.019758 | -0.052391 | 0.025399 | -0.003676 | 0.003147 | -0.031114 | ... | -0.016008 | -0.083440 | -0.017645 | 0.062802 | 0.001200 | 0.109981 | 0.028567 | 0.001691 | -0.028847 | 0.068695 |
| 1 | P0AEE5 | -0.173797 | 0.002517 | 0.027010 | 0.029303 | -0.010111 | 0.067154 | 0.016163 | 0.037110 | -0.033415 | ... | 0.007333 | -0.102797 | -0.023060 | 0.076121 | 0.044377 | 0.104551 | 0.021937 | -0.000274 | 0.051581 | 0.093139 |
| 2 | P01508 | -0.202491 | -0.003329 | -0.018311 | -0.001395 | -0.037488 | -0.142075 | 0.008514 | 0.031612 | -0.018615 | ... | 0.051290 | -0.127064 | 0.013276 | 0.049693 | 0.068135 | 0.113995 | 0.021024 | 0.005433 | -0.097443 | 0.135079 |
| 3 | P01936 | -0.181227 | 0.002400 | 0.025886 | 0.033929 | -0.007706 | -0.010443 | 0.038471 | 0.053604 | 0.004288 | ... | 0.051255 | -0.125306 | -0.003652 | 0.063140 | 0.039054 | 0.106840 | 0.017322 | 0.005568 | 0.016057 | 0.076521 |
| 4 | P01949 | -0.163552 | 0.001912 | 0.024957 | 0.034715 | -0.003451 | -0.003466 | 0.028878 | 0.101189 | 0.006893 | ... | 0.051574 | -0.116994 | -0.003302 | 0.069067 | 0.044195 | 0.108148 | 0.009724 | 0.007866 | 0.023943 | 0.072685 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 477257 | C0HLS6 | -0.217145 | 0.035837 | -0.023007 | 0.004089 | -0.027746 | -0.181881 | 0.028747 | -0.001439 | -0.035573 | ... | 0.069679 | -0.115121 | 0.003409 | 0.066552 | 0.014156 | 0.169604 | 0.022391 | 0.000741 | -0.065113 | 0.126902 |
| 477258 | A0A0D1BUH6 | -0.077459 | 0.000840 | 0.011526 | 0.009182 | -0.024317 | 0.026610 | -0.019136 | -0.032119 | -0.009435 | ... | -0.005582 | -0.076323 | -0.004898 | 0.034576 | 0.024867 | 0.025274 | 0.009874 | 0.001545 | -0.032846 | 0.019942 |
| 477259 | D0LHE5 | -0.174094 | 0.009709 | 0.023257 | 0.016269 | -0.000720 | 0.104040 | -0.035836 | -0.070802 | -0.005294 | ... | 0.002129 | -0.071617 | -0.032393 | 0.064785 | -0.002218 | 0.110696 | 0.005110 | 0.008968 | 0.056669 | 0.051867 |
| 477260 | P0DUN4 | -0.168995 | 0.034663 | 0.017448 | 0.022124 | -0.004571 | 0.073529 | 0.010942 | 0.056266 | -0.009900 | ... | 0.043128 | -0.082297 | -0.020699 | 0.056861 | 0.015562 | 0.130053 | 0.015234 | 0.004304 | 0.047166 | 0.060338 |
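
One plausible use of the feature bank is to avoid recomputing embeddings for records that already have them. The notebook does not show this step, so the sketch below is an assumption about intent: it compares record IDs against the banked IDs and keeps only the records that still need embedding. The toy sequences are shortened for the example.

```python
import pandas as pd

# Records to embed (sequences shortened for illustration)
train = pd.DataFrame({
    "id": ["P0A6V3", "P0AEE5", "S8FIE4"],
    "seq": ["MKT", "MNK", "MAFTAI"],
})

# Records that already have features in the bank (only f1 shown)
existing = pd.DataFrame({
    "id": ["P0A6V3", "P0AEE5"],
    "f1": [-0.083714, -0.173797],
})

# Keep only records whose ID is not yet in the feature bank
missing = train[~train["id"].isin(existing["id"])].reset_index(drop=True)

print(missing)
```

The newly computed features for `missing` could then be concatenated with `existing` to rebuild the full feature table.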


### 6.6 ESM embedding 

In [2]:
train= pd.read_feather(cfg.DATADIR + 'sprot_latest.feature')

In [3]:
# !pip install fair-esm
tr_rep0, tr_rep32, tr_rep33 = esmebd.get_rep_multi_sequence(sequences=train, model='esm1b_t33_650M_UR50S',seqthres=1022)
tr_rep0.to_feather(cfg.DATADIR + 'sprot_latest_rep0.feather')
tr_rep32.to_feather(cfg.DATADIR + 'sprot_latest_rep32.feather')
tr_rep33.to_feather(cfg.DATADIR + 'sprot_latest_rep33.feather')

Transferred model to GPU


100%|█████████████████████████████████████████████████████████████████████████████████████████████| 477917/477917 [6:50:07<00:00, 19.42it/s]
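
ESM-1b accepts at most 1,022 residues per sequence (plus one start and one end token), which is what `seqthres=1022` refers to, and its hidden size of 1,280 matches the `f1` to `f1280` feature columns. How `esmebd.get_rep_multi_sequence` handles longer sequences and how it pools per-residue vectors is not shown here. The sketch below assumes truncation and mean pooling over residues, and uses random numbers in place of real model output so that it runs without downloading the model.

```python
import numpy as np

MAX_RESIDUES = 1022  # ESM-1b limit, excluding the start and end tokens

def truncate(seq: str, max_len: int = MAX_RESIDUES) -> str:
    """Cut a sequence to the maximum length the model accepts (assumed policy)."""
    return seq[:max_len]

def mean_pool(token_reps: np.ndarray, seq_len: int) -> np.ndarray:
    """Average per-residue vectors into one fixed-size vector.

    token_reps has shape (seq_len + 2, dim): row 0 is the start token and
    the last row is the end token, so both are excluded from the mean.
    """
    return token_reps[1 : seq_len + 1].mean(axis=0)

rng = np.random.default_rng(0)

seq = truncate("M" * 1500)  # hypothetical over-long sequence
# Stand-in for one layer of model output (random values, not real embeddings)
reps = rng.normal(size=(len(seq) + 2, 1280))
vec = mean_pool(reps, len(seq))

print(len(seq), vec.shape)
```

With the real model, `reps` would come from the requested representation layer (for example layer 33) instead of a random generator.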


### 6.7 Unirep embedding

In [5]:
# requires `import embedding_unirep as unirep`, which is commented out in the import cell
tr_unirep = unirep.getunirep(train, 200)
tr_unirep.to_feather(cfg.DATADIR + 'sprot_latest_unirep.feather')

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2387/2387 [67:02:55<00:00, 101.12s/it]

length not match





## 7. Build benchmarking datasets
### 7.1 Task 1 isEnzyme

In [15]:
# `test` is assumed to be loaded already (it is not defined in the cells shown here)
task1_train = train.iloc[:,np.r_[0,7,1]]
task1_test = test.iloc[:,np.r_[0,7,1]]

task1_train.to_feather(cfg.DATADIR + 'task1/train.feather')
task1_test.to_feather(cfg.DATADIR + 'task1/test.feather')
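
The task frames in this section are built with positional indices, so the result depends on the column order of `train` and `test`. The sketch below resolves the index triples used for the three tasks to column names, assuming the eight-column layout that the selection in section 6.3 would produce; the actual layout may differ and should be verified.

```python
import pandas as pd

# Assumed layout of `train` after the column selection in section 6.3
train_columns = [
    "id", "isenzyme", "isMultiFunctional", "functionCounts",
    "ec_number", "ec_specific_level", "seq", "seqlength",
]
frame = pd.DataFrame(columns=train_columns)  # empty frame, only the layout matters

# Positional triples used for the three tasks in this section
task_positions = {
    "task1": [0, 7, 1],
    "task2": [0, 7, 3],
    "task3": [0, 7, 4],
}

resolved = {
    task: frame.columns[positions].tolist()
    for task, positions in task_positions.items()
}

for task, names in resolved.items():
    print(task, names)
```

Under this assumed layout, position 7 is `seqlength` and the sequence itself sits at position 6, so it is worth confirming the real column order before relying on these indices.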

### 7.2 Task2 Function Counts

In [16]:
task2_train = train[train.functionCounts >0]
task2_train.reset_index(drop=True, inplace=True)
task2_train = task2_train.iloc[:,np.r_[0,7,3]]

task2_test = test[test.functionCounts >0]
task2_test.reset_index(drop=True, inplace=True)
task2_test = task2_test.iloc[:,np.r_[0,7,3]]

task2_train.to_feather(cfg.DATADIR + 'task2/train.feather')
task2_test.to_feather(cfg.DATADIR + 'task2/test.feather')


### 7.3 Task3 EC Number

In [13]:
task3_train = train[train.functionCounts >0]
task3_train.reset_index(drop=True, inplace=True)
task3_train = task3_train.iloc[:,np.r_[0,7,4]]

task3_test = test[test.functionCounts >0]
task3_test.reset_index(drop=True, inplace=True)
task3_test = task3_test.iloc[:,np.r_[0,7,4]]

task3_train.to_feather(cfg.DATADIR + 'task3/train.feather')
task3_test.to_feather(cfg.DATADIR + 'task3/test.feather')
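
The three tasks follow the same pattern: optionally keep only enzymes, select an identifier, a feature column and a label column, then reset the index. A name-based variant of that pattern is sketched below. The helper `build_task` and the choice of `seq` as the default feature column are assumptions for this example, not the notebook's code, and the toy sequences are truncated.

```python
import pandas as pd

def build_task(df, label_col, feature_col="seq", enzymes_only=False):
    """Select id, feature and label columns by name for one benchmark task."""
    if enzymes_only:
        df = df[df["functionCounts"] > 0]  # keep records with at least one EC number
    return df[["id", feature_col, label_col]].reset_index(drop=True)

# Toy records (sequences truncated for illustration)
toy = pd.DataFrame({
    "id": ["P84233", "S8FIE4", "Q57286"],
    "isenzyme": [False, True, True],
    "functionCounts": [0, 1, 1],
    "ec_number": ["-", "1.10.3.2", "7.2.4.3"],
    "seq": ["MARTKQ", "MAFTAI", "MEAFAV"],
})

task1 = build_task(toy, "isenzyme")
task2 = build_task(toy, "functionCounts", enzymes_only=True)
task3 = build_task(toy, "ec_number", enzymes_only=True)

print(task1)
print(task3)
```

Selecting by name keeps the task definitions stable if columns are added to or reordered in the source frame.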
