# Make the splits for task 1 (protein extrapolation)

For this task we perform enzyme classification and build training and test splits at different levels of difficulty.

### Task A: Predicting easily misclassified enzymes
Here we use the Price et al. dataset ("Mutant phenotypes for thousands of bacterial genes of unknown function"), which includes 149 enzymes whose activity was challenging to determine.

### Task B: Predicting Promiscuous enzymes
Classifying enzymes that catalyse multiple reactions remains a challenge, so we test these separately.

### Task C: Predicting enzymes with low sequence identity 
Given that many enzymes share high sequence similarity, we investigate the efficacy of each method using sequences at 30% and 50% sequence identity.

In [47]:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [48]:
swissprot = pd.read_csv('../processed_data/protein2EC.csv')

In [49]:
swissprot

Unnamed: 0,Entry,Entry Name,Sequence,EC number,Length,EC All,clusterRes50,clusterRes30,EC3,EC2,EC1
0,A0A009IHW8,ABTIR_ACIB9,MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENA...,3.2.2.6,269,3.2.2.-; 3.2.2.6,A0A009IHW8,A1AY86,3.2.2,3.2,3
1,A0A024SC78,CUTI1_HYPJR,MRSLAILTTLLAGHAFAYPKPAPQSVNRRDWPSINEFLSELAKVMP...,3.1.1.74,248,3.1.1.74,A0A024SC78,A0A024SC78,3.1.1,3.1,3
2,A0A024SH76,GUX2_HYPJR,MIVGILTTLATLATLAASVPLEERQACSSVWGQCGGQNWSGPTCCA...,3.2.1.91,471,3.2.1.91,G4MM92,G4MM92,3.2.1,3.2,3
3,A0A059TC02,CCR1_PETHY,MRSVSGQVVCVTGAGGFIASWLVKILLEKGYTVRGTVRNPDDPKNG...,1.2.1.44,333,1.2.1.44,Q9S9N9,P14721,1.2.1,1.2,1
4,A0A068J840,UGT1_PANGI,MKSELIFLPAPAIGHLVGMVEMAKLFISRHENLSVTVLIAKFYMDT...,2.4.1.363,475,2.4.1.363,Q2V6K0,Q40287,2.4.1,2.4,2
...,...,...,...,...,...,...,...,...,...,...,...
149783,P36352,POLR_PHMV,VIVGTPPISPNWPAIKDLLHLKFKTEITSSPLFCGYYLSPAGCIRN...,2.7.7.48,178,2.7.7.48,P36352,P36352,2.7.7,2.7,2
149784,P39262,VG56_BPT4,MAHFNECAHLIEGVDKAQNEYWDILGDEKDPLQVMLDMQRFLQIRL...,3.6.1.12,171,3.6.1.12,Q94MV8,Q94MV8,3.6.1,3.6,3
149785,Q05115,AMDA_BORBO,MQQASTPTIGMIVPPAAGLVPADGARLYPDLPFIASGLGLGSVTPE...,4.1.1.76,240,4.1.1.76,Q05115,Q05115,4.1.1,4.1,4
149786,Q94MV8,VG56_BPLZ5,MAHFNECAHLIEGVDKANRAYAENIMHNIDPLQVMLDMQRHLQIRL...,3.6.1.12,172,3.6.1.12,Q94MV8,Q94MV8,3.6.1,3.6,3
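
The `EC2` column is rendered as `3.2` in the table above and as `1.20` in a later output of this notebook, which suggests pandas is parsing it as a float. The sketch below uses a made-up two-row CSV (not the real `protein2EC.csv`) to show how float parsing can merge distinct subclasses such as `1.1` and `1.10`, and how passing `dtype=str` for that column keeps them apart. Whether this matters here depends on whether `EC2` is used downstream; this notebook only relies on `EC3`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV: subclasses 1.1 and 1.10 are different EC subclasses
csv_text = "Entry,EC2\nX1,1.1\nX2,1.10\n"

# Default parsing infers a float column, so "1.10" becomes 1.1
as_float = pd.read_csv(io.StringIO(csv_text))

# Forcing string dtype keeps the original text of each EC level
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"EC2": str})

n_float = as_float["EC2"].nunique()
n_text = as_text["EC2"].nunique()
print(n_float, n_text)
```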


## Task A: challenging enzymes

The Price dataset was downloaded from: https://github.com/tttianhao/CLEAN/blob/main/app/data/price.fasta

In [50]:
price = pd.read_csv('raw_data/price.tsv', sep='\t')
#remove sequences in price that are in swissprot
price = price[~price['Sequence'].isin(swissprot['Sequence'])]
price.to_csv('../splits/task1/price_protein_test.csv', index=False)
price

Unnamed: 0,Entry,EC number,Sequence
0,WP_063460136,5.3.1.7,MAIPPYPDFRSAAFLRQHLRATMAFYDPVATDASGGQFHFFLDDGT...
1,WP_063462980,4.2.1.43,VPTTFHEDGTLDLDSQKRCLDFMIDAGVDGVCILANFSEQFSLSDA...
2,WP_063462990,1.1.1.48,LIDCNIDMTQLFAPSSSSTDATGAPQGLAKFPSLQGRAVFVTGGGS...
3,WP_041412631,4.2.1.25,MCLGRRRCHMNNKKPKTLRSASWFGSDDKNGFMYRSWMKNQGIPEH...
4,WP_011717048,5.1.3.3,MQLSVTQKSLQHAAFADELQLVTLTNSHGLEVVLSNYGASIWSVKL...
...,...,...,...
144,WP_010207013,1.3.8.7,MADYKAPLRDMRFVLNEVFEVATTWAQLPALADTVDAETVEAILEE...
145,WP_010207016,1.3.8.7,MPDYKAPLRDIRFVRDELLGYEAHYQSLPACQDATPDMVDAILEEG...
146,WP_010207340,2.6.1.19,MSSNNPQTREWQALSSDHHLAPFSDFKQLKEKGPRIITKAHGVYLW...
147,WP_010207341,6.3.1.11,MSVPPRAVQLNEANAFLKDHPEVLYVDLLIADMNGVVRGKRIERTS...


## Task C: Selecting low sequence identity proteins

We test two levels of sequence identity: 30% and 50%.
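
The selection used in the cells below keeps one row per cluster with `drop_duplicates` and puts every other row in the complementary set. A minimal sketch on a made-up five-row table (the entries and cluster labels are hypothetical) shows what each set ends up containing:

```python
import pandas as pd

# Hypothetical entries; the cluster column names the representative of each cluster
toy = pd.DataFrame({
    "Entry": ["P1", "P2", "P3", "P4", "P5"],
    "clusterRes30": ["P1", "P1", "P3", "P3", "P5"],
})

# Keep the first row seen for each cluster
representatives = toy.drop_duplicates(subset="clusterRes30")

# Everything that was not picked as a representative
remainder = toy[~toy["Entry"].isin(representatives["Entry"])]

rep_entries = representatives["Entry"].tolist()
rem_entries = remainder["Entry"].tolist()
print(rep_entries, rem_entries)
```

In this toy case every row in the remainder belongs to a cluster that also has a representative, which is worth keeping in mind when deciding which of the two sets plays the training role.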

In [51]:
from sciutil import SciUtil

u = SciUtil()

train_isolated30 = swissprot.drop_duplicates(subset='clusterRes30')
test_isolated30 = swissprot[~swissprot['Entry'].isin(train_isolated30['Entry'].values)]

# Make a validation set that is completely held out.
np.random.seed(42)
random.seed(42)

# Sample one random entry per unique EC at level 3 for validation.
# These rows are drawn from train_isolated30; they are excluded from the final training set further down.
validation_30 = train_isolated30.groupby('EC3').sample(1)
u.dp(['Testing set: ', len(test_isolated30), 'Training: ', len(train_isolated30), 'Validation:', len(validation_30)])

--------------------------------------------------------------------------------
Testing set: 142188    Training: 6958    Validation: 230
--------------------------------------------------------------------------------


At 50% sequence identity clustering (moderate sequence identity)

In [52]:
train_isolated50 = swissprot.drop_duplicates(subset='clusterRes50')
test_isolated50 = swissprot[~swissprot['Entry'].isin(train_isolated50['Entry'].values)]

# Make a validation set that is completely held out.
np.random.seed(42)
random.seed(42)

# Sample one random entry per unique EC at level 3 for validation.
# These rows are drawn from train_isolated50; they are excluded from the final training set further down.
validation_50 = train_isolated50.groupby('EC3').sample(1)
u.dp(['Testing set 50%: ', len(test_isolated50), 'Training: ', len(train_isolated50), 'Validation:', len(validation_50)])

--------------------------------------------------------------------------------
Testing set 50%: 124532    Training: 23633    Validation: 234
--------------------------------------------------------------------------------


## Task B: Promiscuous enzymes

Here we look at the promiscuous enzymes and see how well they can be classified.


In [59]:
swissprot['Promiscuous'] = swissprot['Sequence'].duplicated(keep=False)
not_promiscuous = swissprot[~swissprot['Promiscuous']]
promiscuous = swissprot[swissprot['Promiscuous']]
promiscuous = promiscuous.groupby('Sequence').agg({'EC number': lambda x: list(x)}).reset_index()
promiscuous

Unnamed: 0,Sequence,EC number
0,AAAWMLNGCLQVMDSRTIPANRNADNVDPALQTATHLCFPTRPVRV...,"[1.1.1.100, 2.3.1.41, 2.3.1.86]"
1,AEVCYSHLGCFSDEKPWAGTSQRPIKSLPSDPKKINTRFLLYTNEN...,"[3.1.1.26, 3.1.1.3]"
2,AGTVGKVIKCKAAVAWEAGKPLCIEEIEVAPPKAHEVRIKIIATAV...,"[1.1.1.1, 1.1.1.284]"
3,AHNIVLYTGAKMPILGLGTWKSPPGKVTEAVKVAIDLGYRHIDCAH...,"[1.1.1.21, 1.1.1.300, 1.1.1.372, 1.1.1.54]"
4,AICACCKVLNSNEKASCFSNKTFKGLGNAGGLPWKCNSVDMKHFVS...,"[1.5.1.3, 2.1.1.45]"
...,...,...
7208,SSWWAHVEMGPPDPILGVTEAYKRDTNSKKMNLGVGAYRDDNGKPY...,"[2.6.1.1, 2.6.1.7]"
7209,TDATGKPIKCMAAIAWEAKKPLSIEEVEVAPPKSGEVRIKILHSGV...,"[1.1.1.1, 1.1.1.284]"
7210,VQNPGASAIQCRAAVLRKEGQPMKIEQVLIQAPGPNQVRVKMVSSG...,"[1.1.1.347, 1.1.1.354]"
7211,VVSAGLIAGDFVTVLQALPRSEHQVVAVAARDLRRAEEFARTHGIP...,"[1.1.1.179, 1.3.1.20]"


In [60]:
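def get_difference_level(ec_numbers):
    """Return the largest pairwise difference level among a set of EC numbers.

    The difference level of a pair is 4 minus the number of leading EC fields
    the two numbers share: 0 means identical, 4 means they differ at the first level.
    """
    # Wrap a single EC string in a list so we iterate over ECs, not characters
    if isinstance(ec_numbers, str):
        ec_numbers = [ec_numbers]

    counters = []
    for true_ec in ec_numbers:
        true_split = true_ec.split('.')
        for other_ec in ec_numbers:
            other_split = other_ec.split('.')
            counter = 0
            # Count matching fields from the left, stopping at the first mismatch
            for other, true in zip(other_split, true_split):
                if other == true:
                    counter += 1
                else:
                    break
            counters.append(4 - counter)

    return np.max(counters)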
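
As a worked example of the difference level, the sketch below recomputes it in plain Python for a few EC lists taken from the tables in this notebook. The helper names `shared_prefix_depth` and `max_difference_level` are hypothetical and only mirror the logic of the function above.

```python
def shared_prefix_depth(ec_a, ec_b):
    """Number of leading EC fields that two EC numbers have in common."""
    depth = 0
    for field_a, field_b in zip(ec_a.split("."), ec_b.split(".")):
        if field_a != field_b:
            break
        depth += 1
    return depth


def max_difference_level(ec_numbers):
    """Largest value of 4 - shared depth over all pairs (including self-pairs)."""
    return max(
        4 - shared_prefix_depth(a, b) for a in ec_numbers for b in ec_numbers
    )


# Differs at the first level (1.x vs 5.x)
level_far = max_difference_level(["1.1.1.145", "1.1.1.210", "1.1.1.270", "5.3.3.1"])
# Differs at the third level (4.1.1 vs 4.1.3)
level_mid = max_difference_level(["4.1.1.112", "4.1.3.17"])
# Differs only at the fourth level
level_near = max_difference_level(["2.6.1.13", "2.6.1.36"])
print(level_far, level_mid, level_near)
```

A higher level means the reactions are classified further apart in the EC hierarchy.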

In [62]:
promiscuous['Surprise Level'] = promiscuous['EC number'].apply(get_difference_level)
promiscuous['Number of ECs'] = promiscuous['EC number'].apply(len)
# Flag EC combinations shared by more than one sequence (lists are unhashable, so compare them as tuples)
promiscuous['Duplicated EC'] = promiscuous['EC number'].apply(tuple).duplicated(keep=False)
# Always False at this point: the groupby above leaves one row per sequence
promiscuous['Duplicated Sequence'] = promiscuous['Sequence'].duplicated(keep=False)
promiscuous = promiscuous.sort_values(['Duplicated EC', 'Surprise Level', 'Number of ECs'], ascending=False)
promiscuous

Unnamed: 0,Sequence,EC number,Surprise Level,Number of ECs,Duplicated EC,Duplicated Sequence
350,MAGWSCLVTGAGGFVGQRIIKMLVQEKELQEVRALDKVFRPETKEE...,"[1.1.1.145, 1.1.1.210, 1.1.1.270, 5.3.3.1]",4,4,True,False
495,MALLRGVFIVAAKRTPFGAYGGLLKDFSATDLTEFAARAALSAGKV...,"[2.3.1.16, 2.3.1.9, 3.1.2.1, 3.1.2.2]",4,4,True,False
496,MALLRGVFIVAAKRTPFGAYGGLLKDFTATDLTEFAARAALSAGKV...,"[2.3.1.16, 2.3.1.9, 3.1.2.1, 3.1.2.2]",4,4,True,False
497,MALLRGVFIVAAKRTPFGAYGGLLKDFTPTDMAEFAARAALSAGRV...,"[2.3.1.16, 2.3.1.9, 3.1.2.1, 3.1.2.2]",4,4,True,False
498,MALLRGVFVVAAKRTPFGAYGGLLKDFTATDLSEFAAKAALSAGKV...,"[2.3.1.16, 2.3.1.9, 3.1.2.1, 3.1.2.2]",4,4,True,False
...,...,...,...,...,...,...
7053,MVTVEEVRKAQRAEGPATVLAIGTATPPNCVGQSTYPDYYFRITNS...,"[2.3.1.212, 2.3.1.74]",1,2,False,False
7067,MVVRPNVKELPGPKAKEVIERNFKYLAMTTQDPENLPIVIERGEGI...,"[2.6.1.13, 2.6.1.36]",1,2,False,False
7076,MVYPRLLINLKEIEENARKVVEMASRRGIEIVGVTKVTLGDPRFAE...,"[5.1.1.12, 5.1.1.5]",1,2,False,False
7194,MYSLHDFKFPEDWIEPPANDKCIYTCYKEVVDFKLFEENKKTLEYY...,"[4.2.3.15, 4.2.3.47]",1,2,False,False


## Potentially remove duplicated sequences for promiscuous enzymes that have similar reactions

In [64]:
#promiscuous.drop_duplicates(subset='EC number', inplace=True)
# promiscuous.reset_index(inplace=True)
# promiscuous

In [63]:
promiscuous = promiscuous[promiscuous['Surprise Level'] >= 2]
promiscuous = promiscuous[promiscuous['Duplicated EC']]
promiscuous

Unnamed: 0,Sequence,EC number,Surprise Level,Number of ECs,Duplicated EC,Duplicated Sequence
350,MAGWSCLVTGAGGFVGQRIIKMLVQEKELQEVRALDKVFRPETKEE...,"[1.1.1.145, 1.1.1.210, 1.1.1.270, 5.3.3.1]",4,4,True,False
495,MALLRGVFIVAAKRTPFGAYGGLLKDFSATDLTEFAARAALSAGKV...,"[2.3.1.16, 2.3.1.9, 3.1.2.1, 3.1.2.2]",4,4,True,False
496,MALLRGVFIVAAKRTPFGAYGGLLKDFTATDLTEFAARAALSAGKV...,"[2.3.1.16, 2.3.1.9, 3.1.2.1, 3.1.2.2]",4,4,True,False
497,MALLRGVFIVAAKRTPFGAYGGLLKDFTPTDMAEFAARAALSAGRV...,"[2.3.1.16, 2.3.1.9, 3.1.2.1, 3.1.2.2]",4,4,True,False
498,MALLRGVFVVAAKRTPFGAYGGLLKDFTATDLSEFAAKAALSAGKV...,"[2.3.1.16, 2.3.1.9, 3.1.2.1, 3.1.2.2]",4,4,True,False
...,...,...,...,...,...,...
7101,MWKTTDLCDEFENELQICRQPFRSFGKKEQFHGKIATVKVKDDNVL...,"[4.1.1.112, 4.1.3.17]",2,2,True,False
7102,MWKTTDLCDEFENELQICRQSFRSFGKKEQFHGKIATVKVKDDNVL...,"[4.1.1.112, 4.1.3.17]",2,2,True,False
7153,MYHVYLLSDATGETVERVARAALTQFRDVDIRLRRMGQIRNREDIL...,"[2.7.11.32, 2.7.4.27]",2,2,True,False
7154,MYHVYLLSDATGETVERVARAALTQFRDVDIRLRRMGQIRNREDIL...,"[2.7.11.32, 2.7.4.27]",2,2,True,False


BLAST these to check sequence identity to the training set
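
BLAST takes its queries in FASTA format, so the selected sequences first need to be written out. The following is a minimal sketch of that step only: `write_fasta` is a hypothetical helper, the identifiers and sequences are made up, and the BLAST run itself is not shown.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs to an open text handle in FASTA format."""
    for identifier, sequence in records:
        handle.write(f">{identifier}\n")
        # Wrap long sequences at a fixed line width
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Made-up records; in the notebook these would come from promiscuous['Sequence']
records = [
    ("query_0", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("query_1", "MSLEQKK"),
]

buffer = io.StringIO()
write_fasta(records, buffer, width=10)
fasta_text = buffer.getvalue()
print(fasta_text)
```

Writing to a handle rather than a path keeps the helper usable with both real files and in-memory buffers.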

## Create a training dataset that doesn't include any of our proteins for validation

Pool the sequences from every held-out set so that they can be removed before the remaining proteins are used for training.

In [67]:
test_pooled_seqs = pd.concat([train_isolated30, train_isolated50, validation_30, 
                              validation_50, price, promiscuous])['Sequence'].unique()
len(test_pooled_seqs)

27046

In [68]:
# Remove the pooled held-out sequences from the training set
train_swissprot = swissprot[~swissprot['Sequence'].isin(test_pooled_seqs)]
train_swissprot

Unnamed: 0,Entry,Entry Name,Sequence,EC number,Length,EC All,clusterRes50,clusterRes30,EC3,EC2,EC1,Promiscuous
6,A0A072VDF2,CCR1_MEDTR,MPAATAAAAAESSSVSGETICVTGAGGFIASWMVKLLLEKGYTVRG...,1.2.1.44,342,1.2.1.-; 1.2.1.44,Q9S9N9,P14721,1.2.1,1.20,1,False
20,A0A0C5QRZ2,C76H2_SALFT,MDPFPLVAAALFIAATWFITFKRRRNLPPGPFPYPIVGNMLQLGSQ...,1.14.14.60,492,1.14.14.-; 1.14.14.175; 1.14.14.60,A0A0Y0GRS3,Q6XQ14,1.14.14,1.14,1,False
32,A0A0E3T552,AADH1_MALDO,MAIQIPSRLLFIDGEWREPVLKKRIPIINPATEEIIGHIPAATAED...,1.2.1.19,503,1.2.1.-; 1.2.1.19,C6KEM4,P52476,1.2.1,1.20,1,False
64,A0A140JWS2,PTMG_PENSI,MLFLAPGYIFPHVATPVTVAIDFAQAVKEGAYSFLDLKASPVPNPE...,2.5.1.1,356,2.5.1.-; 2.5.1.1; 2.5.1.10; 2.5.1.29,P24322,P24322,2.5.1,2.50,2,True
65,A0A140JWS2,PTMG_PENSI,MLFLAPGYIFPHVATPVTVAIDFAQAVKEGAYSFLDLKASPVPNPE...,2.5.1.10,356,2.5.1.-; 2.5.1.1; 2.5.1.10; 2.5.1.29,P24322,P24322,2.5.1,2.50,2,True
...,...,...,...,...,...,...,...,...,...,...,...,...
149776,W5AWH5,HIS7C_WHEAT,MTTAPVVSPSLSRLHSAPASPFPKAPVGSGAGVAFPARPYGPSLRL...,4.2.1.19,269,4.2.1.19,Q43072,Q43072,4.2.1,4.20,4,False
149777,W5BGD1,HIS7B_WHEAT,MTTAPFLFPSLSRLHSARASSFPKPPVGSGAGVAFPARPYGSSLRL...,4.2.1.19,269,4.2.1.19,Q43072,Q43072,4.2.1,4.20,4,False
149779,W6Q1E9,IFGF2_PENRF,MTILVLGGRGKTASRLAALLDQAKTPFLVGSSSASPSDPYKSSQFN...,1.5.1.44,287,1.5.1.44,A2TBU1,A2TBU1,1.5.1,1.50,1,False
149780,W6QRI9,IFGF1_PENRF,MTILVLGGRGKTASRLAALLDAAKTPFLVGSSSTSQESPYNSSHFN...,1.5.1.44,287,1.5.1.44,A2TBU1,A2TBU1,1.5.1,1.50,1,False


In [69]:
# Save the row indices of train_swissprot to a text file
np.savetxt('../splits/task1/protein2EC_train_indices.txt', train_swissprot.index, fmt='%d')
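
The saved indices are only meaningful against the same table loaded in the same way, since they are row labels of `swissprot`. A minimal sketch of the round trip on a made-up table (the file name and sequences are hypothetical) shows how the indices could be reloaded and how to check that no held-out sequence slipped back in:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for swissprot and the pooled held-out sequences
full = pd.DataFrame({"Sequence": ["MAAA", "MKKK", "MTTT", "MGGG"]})
held_out = ["MKKK", "MGGG"]
train = full[~full["Sequence"].isin(held_out)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_indices.txt")
    np.savetxt(path, train.index, fmt="%d")
    # ndmin=1 keeps a one-dimensional array even if only one index was saved
    loaded = np.loadtxt(path, dtype=int, ndmin=1)

# Label-based lookup, matching how the indices were taken from the DataFrame
restored = full.loc[loaded]
overlap = set(restored["Sequence"]) & set(held_out)
print(loaded.tolist(), overlap)
```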