### Run from within cloned CLEAN repo so that it's easier
1. Clone the CLEAN repo
2. Follow the installation instruction
3. Move this file to `CLEAN/app` (should be in same folder as demo.ipynb)
4. Move `EC2protein_train.csv` and `price_protein_test.csv` to `CLEAN/app/data`
5. Run this notebook from the CLEAN directory

In [1]:
from CLEAN.utils import *
import pandas as pd
ensure_dirs("data/esm_data")
ensure_dirs("data/pretrained")

In [None]:
#convert to the correct format for clean
test_set_list = ['30_protein_test', '30-50_protein_test', 'price_protein_test', 'promiscuous_protein_test']
for test_set in test_set_list:
    df = pd.read_csv('./data/{}.csv')
    df = df[['Entry', 'EC number', 'Sequence']]
    df.to_csv('./data/{}.csv'.format(test_set), sep='\t', index=False)

### training
might be better to do this in a script if it takes a long time


In [3]:
from CLEAN.utils import mutate_single_seq_ECs, retrive_esm1b_embedding, compute_esm_distance
train_set = "protein_train"

In [4]:
df = pd.read_csv('./data/{}.csv'.format(train_set))
df = df.groupby(['Entry', 'Sequence']).agg({'EC number': lambda x: list(x)}).reset_index()
df['EC number'] = df['EC number'].apply(lambda x: ';'.join(x))
df = df[['Entry', 'EC number', 'Sequence']]
df.to_csv('./data/{}.csv'.format(train_set), sep='\t', index=False)
csv_to_fasta("data/{}.csv".format(train_set), "data/{}.fasta".format(train_set))

KeyError: 'Entry'

Steps below are a bit slower

In [74]:
retrive_esm1b_embedding(train_set)

Transferred model to GPU
Read data/protein2EC_train.fasta with 1000 sequences
Processing 1 of 73 batches (37 sequences)
Processing 2 of 73 batches (33 sequences)
Processing 3 of 73 batches (32 sequences)
Processing 4 of 73 batches (31 sequences)
Processing 5 of 73 batches (31 sequences)
Processing 6 of 73 batches (29 sequences)
Processing 7 of 73 batches (26 sequences)
Processing 8 of 73 batches (25 sequences)
Processing 9 of 73 batches (23 sequences)
Processing 10 of 73 batches (21 sequences)
Processing 11 of 73 batches (21 sequences)
Processing 12 of 73 batches (20 sequences)
Processing 13 of 73 batches (19 sequences)
Processing 14 of 73 batches (18 sequences)
Processing 15 of 73 batches (18 sequences)
Processing 16 of 73 batches (17 sequences)
Processing 17 of 73 batches (17 sequences)
Processing 18 of 73 batches (17 sequences)
Processing 19 of 73 batches (16 sequences)
Processing 20 of 73 batches (16 sequences)
Processing 21 of 73 batches (15 sequences)
Processing 22 of 73 batches 

In [75]:
#this is to generate masked sequences for examples iwth only one positive
masked_fasta_file = mutate_single_seq_ECs(train_set) #get rid of duplicates

#retrieving embeddings and distance matrix needs to be done once per training dataset
retrive_esm1b_embedding(masked_fasta_file)

Number of EC numbers with only one sequences: 219
Number of single-seq EC number sequences need to mutate:  204
Number of single-seq EC numbers already mutated:  15


Transferred model to GPU
Read data/protein2EC_train_single_seq_ECs.fasta with 2040 sequences
Processing 1 of 237 batches (25 sequences)
Processing 2 of 237 batches (24 sequences)
Processing 3 of 237 batches (23 sequences)
Processing 4 of 237 batches (22 sequences)
Processing 5 of 237 batches (21 sequences)
Processing 6 of 237 batches (20 sequences)
Processing 7 of 237 batches (19 sequences)
Processing 8 of 237 batches (18 sequences)
Processing 9 of 237 batches (18 sequences)
Processing 10 of 237 batches (17 sequences)
Processing 11 of 237 batches (17 sequences)
Processing 12 of 237 batches (17 sequences)
Processing 13 of 237 batches (16 sequences)
Processing 14 of 237 batches (16 sequences)
Processing 15 of 237 batches (16 sequences)
Processing 16 of 237 batches (16 sequences)
Processing 17 of 237 batches (15 sequences)
Processing 18 of 237 batches (15 sequences)
Processing 19 of 237 batches (15 sequences)
Processing 20 of 237 batches (15 sequences)
Processing 21 of 237 batches (15 seq

In [76]:
compute_esm_distance(train_set)

100%|██████████| 368/368 [00:00<00:00, 34855.45it/s]


Calculating distance map, number of unique EC is 368


0it [00:00, ?it/s]

368it [00:00, 8477.58it/s]


In [106]:
# #if this step is too long
# #make a temporary empty file
# model_name = '{}_triplet_2.pth'.format(train_set)
# with open('./data/model/{}'.format(model_name), 'w') as fp:
#     pass

### train with triplet loss for now

In [109]:
!python ./train-triplet.py --training_data protein_train --model_name protein_train_triplet --epoch 7000

==> device used: cuda:0 | dtype used:  torch.float32 
==> args: Namespace(learning_rate=0.0005, epoch=2, model_name='protein2EC_train_triplet', training_data='protein2EC_train', hidden_dim=512, out_dim=128, adaptive_rate=100, verbose=False)
The number of unique EC numbers:  368
---------------------------------------------------------------------------
| end of epoch   1 | time:  0.41s | training loss 0.8793
---------------------------------------------------------------------------
Best from epoch :   2; loss: 0.8413
---------------------------------------------------------------------------
| end of epoch   2 | time:  0.12s | training loss 0.8413
---------------------------------------------------------------------------


In [98]:
#this might work better, but it's slower and you need to manually change the dimension size during inference
# !python ./train-supconH.py --training_data protein2EC_train --model_name protein2EC_train_supconH --epoch 5250 --n_pos 9 --n_neg 30 -T 0.1

==> device used: cuda:0 | dtype used:  torch.float32 
==> args: Namespace(learning_rate=0.0005, epoch=2, model_name='protein2EC_train_supconH', training_data='protein2EC_train', temp=0.1, n_pos=9, n_neg=30, hidden_dim=512, out_dim=256, adaptive_rate=60, verbose=False)
The number of unique EC numbers:  368
---------------------------------------------------------------------------
| end of epoch   1 | time:  2.32s | training loss 3.5778
---------------------------------------------------------------------------
Best from epoch :   2; loss: 3.3484
---------------------------------------------------------------------------
| end of epoch   2 | time:  2.56s | training loss 3.3484
---------------------------------------------------------------------------


### inference on the test set

In [89]:
for test_set in test_set_list:
    test_set = 'price_protein_test'
    csv_to_fasta("data/{}.csv".format(test_set), "data/{}.fasta".format(test_set))
    retrive_esm1b_embedding(test_set)

Transferred model to GPU
Read data/price_protein_test.fasta with 148 sequences
Processing 1 of 16 batches (15 sequences)
Processing 2 of 16 batches (13 sequences)
Processing 3 of 16 batches (13 sequences)
Processing 4 of 16 batches (11 sequences)
Processing 5 of 16 batches (11 sequences)
Processing 6 of 16 batches (10 sequences)
Processing 7 of 16 batches (10 sequences)
Processing 8 of 16 batches (9 sequences)
Processing 9 of 16 batches (9 sequences)
Processing 10 of 16 batches (8 sequences)
Processing 11 of 16 batches (8 sequences)
Processing 12 of 16 batches (8 sequences)
Processing 13 of 16 batches (7 sequences)
Processing 14 of 16 batches (6 sequences)
Processing 15 of 16 batches (6 sequences)
Processing 16 of 16 batches (4 sequences)


In [None]:
#add this chunk of code to change the output of CLEAN so that is compatible with our evaluation framework
#CLEAN/app/src/CLEAN/infer.py after line 51

### manually added this part to save the results in a different format ###
# new_df = pd.read_csv('./data/{}.csv'.format(test_data), delimiter='\t')
# num_cols = len(new_df.columns)
# #new_df['Entry'] = eval_df.columns
# for j in range(len(eval_df)):
#     new_df[j] = np.nan

# for i, col in enumerate(eval_df.columns):
#     sorted_ECs = eval_df[col].sort_values(ascending=True).index.values
#     new_df.iloc[i, num_cols:] = sorted_ECs

# new_df.to_csv('./results/{}/{}_results_df.csv'.format(train_data, test_data), index=False)
### end of manual addition ###

### move the trained model to the pretrained folder and rename to the [train_set].pth

In [110]:
from CLEAN.infer import infer_pvalue

for test_set in test_set_list:
    infer_pvalue(train_set, test_set, p_value=1e-5, nk_random=20, report_metrics=True, pretrained=True)

The embedding sizes for train and test: torch.Size([1060, 128]) torch.Size([148, 128])


100%|██████████| 368/368 [00:00<00:00, 36703.78it/s]


Calculating eval distance map, between 148 test ids and 368 train EC cluster centers


148it [00:00, 7698.64it/s]
100%|██████████| 368/368 [00:00<00:00, 34840.50it/s]
20000it [00:01, 10726.82it/s]
100%|██████████| 148/148 [00:00<00:00, 929.15it/s]


############ EC calling results using random chosen 20k samples ############
---------------------------------------------------------------------------
>>> total samples: 148 | total ec: 56 
>>> precision: 0.0326 | recall: 0.053| F1: 0.0404 | AUC: 0.526 | accuracy: 0.0541 
---------------------------------------------------------------------------


In [111]:
#don't use this for now
# from CLEAN.infer import infer_maxsep
# infer_maxsep(train_set, test_set, report_metrics=True, pretrained=True)

The embedding sizes for train and test: torch.Size([1060, 128]) torch.Size([148, 128])


100%|██████████| 368/368 [00:00<00:00, 49129.58it/s]


Calculating eval distance map, between 148 test ids and 368 train EC cluster centers


148it [00:00, 11704.89it/s]

############ EC calling results using maximum separation ############
---------------------------------------------------------------------------
>>> total samples: 148 | total ec: 56 
>>> precision: 0.0283 | recall: 0.053| F1: 0.0369 | AUC: 0.526 | accuracy: 0.0541 
---------------------------------------------------------------------------
############ EC calling results using maximum separation ############
---------------------------------------------------------------------------
>>> total samples: 148 | total ec: 56 
>>> precision: 0.0283 | recall: 0.053| F1: 0.0369 | AUC: 0.526 
---------------------------------------------------------------------------



