### Run from within the cloned CLEAN repo (the easiest setup)
1. Clone the CLEAN repo
2. Follow the installation instructions
3. Move this file to `CLEAN/app` (should be in same folder as demo.ipynb)
4. Move `protein_train50.csv` and `{}_test.csv` to `CLEAN/app/data`
5. Run this notebook from the `CLEAN/app` directory, so that the relative `data/` paths below resolve
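
Before running anything, it may help to check that the files from steps 4 and 5 are where the notebook expects them. A minimal sketch, assuming the working directory is the one that contains `data/`; `missing_inputs` is a hypothetical helper and not part of CLEAN:

```python
from pathlib import Path

# File stems used later in this notebook.
TRAIN_SET = "protein_train50"
TEST_SETS = ["30_protein_test", "30-50_protein_test",
             "price_protein_test", "promiscuous_protein_test"]


def missing_inputs(data_dir="data", names=(TRAIN_SET, *TEST_SETS)):
    """Return the expected CSV paths that do not exist under data_dir."""
    data_dir = Path(data_dir)
    return [str(data_dir / f"{name}.csv") for name in names
            if not (data_dir / f"{name}.csv").is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    print("All input CSVs found." if not missing else f"Missing: {missing}")
```

An empty list means every expected input file is present.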

In [1]:
from CLEAN.utils import *
import pandas as pd
ensure_dirs("data/esm_data")
ensure_dirs("data/pretrained")
test_set_list = ['30_protein_test', '30-50_protein_test', 'price_protein_test', 'promiscuous_protein_test']

In [8]:
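#convert each test set to the tab-separated format that CLEAN expects
#note: this overwrites the input file in place, so re-running the cell on an
#already converted file most likely causes the KeyError shown below
for test_set in test_set_list:
    df = pd.read_csv(f'./data/{test_set}.csv')
    df = df[['Entry', 'EC number', 'Sequence']]
    df.to_csv(f'./data/{test_set}.csv', sep='\t', index=False)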

KeyError: "None of [Index(['Entry', 'EC number', 'Sequence'], dtype='object')] are in the [columns]"
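
A hedged sketch of a re-runnable version of the conversion cell: because the cell writes tab-separated output over its comma-separated input, reading with a sniffed delimiter (`sep=None` with the Python engine) lets it accept either form. `to_clean_format` is a hypothetical helper name, not part of CLEAN.

```python
import pandas as pd

CLEAN_COLUMNS = ["Entry", "EC number", "Sequence"]


def to_clean_format(path):
    """Rewrite the file at path as a tab-separated file with CLEAN's columns.

    sep=None makes pandas detect the delimiter from the header line, so the
    function also works on a file that has already been converted.
    """
    df = pd.read_csv(path, sep=None, engine="python")
    df = df[CLEAN_COLUMNS]
    df.to_csv(path, sep="\t", index=False)
    return df
```

Calling `to_clean_format(f"./data/{test_set}.csv")` inside the loop would then be safe to repeat. If the original KeyError has a different cause (for example, differently named columns in the source CSV), this sketch will not fix it.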

### training
If training takes a long time, it may be better to run this part as a script rather than in the notebook.


In [9]:
from CLEAN.utils import mutate_single_seq_ECs, retrive_esm1b_embedding, compute_esm_distance
train_set = "protein_train50"

In [10]:
df = pd.read_csv('./data/{}.csv'.format(train_set))
df = df.groupby(['Entry', 'Sequence']).agg({'EC number': lambda x: list(x)}).reset_index()
df['EC number'] = df['EC number'].apply(lambda x: ';'.join(x))
df = df[['Entry', 'EC number', 'Sequence']]
df.to_csv('./data/{}.csv'.format(train_set), sep='\t', index=False)
csv_to_fasta("data/{}.csv".format(train_set), "data/{}.fasta".format(train_set))
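
To make the grouping step concrete: rows that share an `Entry` and `Sequence` are collapsed into one row whose `EC number` field joins the individual EC numbers with `;`. A toy illustration follows; the entry IDs, EC numbers and sequences are invented for the example.

```python
import pandas as pd

# Invented rows: entry P1 is annotated with two EC numbers.
toy = pd.DataFrame({
    "Entry": ["P1", "P1", "P2"],
    "EC number": ["1.1.1.1", "2.7.1.1", "3.4.21.4"],
    "Sequence": ["MKVL", "MKVL", "MAAG"],
})

merged = (
    toy.groupby(["Entry", "Sequence"])["EC number"]
    .agg(";".join)          # one ';'-joined string per protein
    .reset_index()
)[["Entry", "EC number", "Sequence"]]

print(merged)
```

Here `P1` ends up with the single field `1.1.1.1;2.7.1.1`, which is the multi-label form written to the tab-separated training file.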

The steps below are somewhat slower.

In [11]:
retrive_esm1b_embedding(train_set)

Transferred model to GPU
Read data/protein_train50.fasta with 28182 sequences
Processing 1 of 2994 batches (39 sequences)
Processing 2 of 2994 batches (37 sequences)
Processing 3 of 2994 batches (36 sequences)
Processing 4 of 2994 batches (35 sequences)
Processing 5 of 2994 batches (34 sequences)
Processing 6 of 2994 batches (34 sequences)
Processing 7 of 2994 batches (33 sequences)
Processing 8 of 2994 batches (33 sequences)
Processing 9 of 2994 batches (32 sequences)
Processing 10 of 2994 batches (31 sequences)
Processing 11 of 2994 batches (31 sequences)
Processing 12 of 2994 batches (30 sequences)
Processing 13 of 2994 batches (30 sequences)
Processing 14 of 2994 batches (29 sequences)
Processing 15 of 2994 batches (29 sequences)
Processing 16 of 2994 batches (28 sequences)
Processing 17 of 2994 batches (28 sequences)
Processing 18 of 2994 batches (28 sequences)
Processing 19 of 2994 batches (27 sequences)
Processing 20 of 2994 batches (27 sequences)
Processing 21 of 2994 batches (

In [12]:
#this generates masked sequences for examples with only one positive
masked_fasta_file = mutate_single_seq_ECs(train_set) #get rid of duplicates

#retrieving embeddings and distance matrix needs to be done once per training dataset
retrive_esm1b_embedding(masked_fasta_file)

Number of EC numbers with only one sequences: 2792
Number of single-seq EC number sequences need to mutate:  1517
Number of single-seq EC numbers already mutated:  1275
Transferred model to GPU
Read data/protein_train50_single_seq_ECs.fasta with 15170 sequences
Processing 1 of 2821 batches (23 sequences)
Processing 2 of 2821 batches (21 sequences)
Processing 3 of 2821 batches (20 sequences)
Processing 4 of 2821 batches (19 sequences)
Processing 5 of 2821 batches (18 sequences)
Processing 6 of 2821 batches (18 sequences)
Processing 7 of 2821 batches (17 sequences)
Processing 8 of 2821 batches (17 sequences)
Processing 9 of 2821 batches (17 sequences)
Processing 10 of 2821 batches (16 sequences)
Processing 11 of 2821 batches (16 sequences)
Processing 12 of 2821 batches (16 sequences)
Processing 13 of 2821 batches (16 sequences)
Processing 14 of 2821 batches (16 sequences)
Processing 15 of 2821 batches (16 sequences)
Processing 16 of 2821 batches (15 sequences)
Processing 17 of 2821 batch

In [13]:
compute_esm_distance(train_set)

100%|██████████| 4936/4936 [00:00<00:00, 39891.49it/s]


Calculating distance map, number of unique EC is 4936


4936it [00:03, 1383.01it/s]


In [106]:
# #if this step is too long
# #make a temporary empty file
# model_name = '{}_triplet_2.pth'.format(train_set)
# with open('./data/model/{}'.format(model_name), 'w') as fp:
#     pass

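### train with triplet loss for now

The output shown under the training command is from a short test run (`epoch=2` on `protein2EC_train`, per the logged `args`), not from the 7000-epoch command itself.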

In [109]:
!python ./train-triplet.py --training_data protein_train50 --model_name protein_train50_triplet --epoch 7000

==> device used: cuda:0 | dtype used:  torch.float32 
==> args: Namespace(learning_rate=0.0005, epoch=2, model_name='protein2EC_train_triplet', training_data='protein2EC_train', hidden_dim=512, out_dim=128, adaptive_rate=100, verbose=False)
The number of unique EC numbers:  368
---------------------------------------------------------------------------
| end of epoch   1 | time:  0.41s | training loss 0.8793
---------------------------------------------------------------------------
Best from epoch :   2; loss: 0.8413
---------------------------------------------------------------------------
| end of epoch   2 | time:  0.12s | training loss 0.8413
---------------------------------------------------------------------------


In [98]:
#this might work better, but it's slower and you need to manually change the dimension size during inference
#we did not explore this further
# !python ./train-supconH.py --training_data protein2EC_train --model_name protein2EC_train_supconH --epoch 2 --n_pos 9 --n_neg 30 -T 0.1

==> device used: cuda:0 | dtype used:  torch.float32 
==> args: Namespace(learning_rate=0.0005, epoch=2, model_name='protein2EC_train_supconH', training_data='protein2EC_train', temp=0.1, n_pos=9, n_neg=30, hidden_dim=512, out_dim=256, adaptive_rate=60, verbose=False)
The number of unique EC numbers:  368
---------------------------------------------------------------------------
| end of epoch   1 | time:  2.32s | training loss 3.5778
---------------------------------------------------------------------------
Best from epoch :   2; loss: 3.3484
---------------------------------------------------------------------------
| end of epoch   2 | time:  2.56s | training loss 3.3484
---------------------------------------------------------------------------


### inference on the test set

In [2]:
for test_set in test_set_list:
    csv_to_fasta("data/{}.csv".format(test_set), "data/{}.fasta".format(test_set))
    retrive_esm1b_embedding(test_set)

Transferred model to GPU
Read data/30_protein_test.fasta with 432 sequences
Processing 1 of 46 batches (27 sequences)
Processing 2 of 46 batches (22 sequences)
Processing 3 of 46 batches (20 sequences)
Processing 4 of 46 batches (18 sequences)
Processing 5 of 46 batches (17 sequences)
Processing 6 of 46 batches (16 sequences)
Processing 7 of 46 batches (15 sequences)
Processing 8 of 46 batches (14 sequences)
Processing 9 of 46 batches (13 sequences)
Processing 10 of 46 batches (12 sequences)
Processing 11 of 46 batches (12 sequences)
Processing 12 of 46 batches (11 sequences)
Processing 13 of 46 batches (11 sequences)
Processing 14 of 46 batches (11 sequences)
Processing 15 of 46 batches (10 sequences)
Processing 16 of 46 batches (10 sequences)
Processing 17 of 46 batches (10 sequences)
Processing 18 of 46 batches (10 sequences)
Processing 19 of 46 batches (9 sequences)
Processing 20 of 46 batches (9 sequences)
Processing 21 of 46 batches (9 sequences)
Processing 22 of 46 batches (8 se

In [13]:
#move the trained model to the pretrained folder and rename it to [train_set].pth

train_set = "protein_train50"
#make a directory
ensure_dirs("results/" + train_set)
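
The move-and-rename step mentioned in the comment above can also be scripted. A sketch under the assumption that training wrote `data/model/protein_train50_triplet.pth` (the `--model_name` used earlier) and that inference reads from `data/pretrained`; `install_model` is a hypothetical helper, not part of CLEAN:

```python
import shutil
from pathlib import Path


def install_model(model_name, train_set, root="data"):
    """Copy <root>/model/<model_name>.pth to <root>/pretrained/<train_set>.pth."""
    src = Path(root) / "model" / f"{model_name}.pth"
    dst = Path(root) / "pretrained" / f"{train_set}.pth"
    if not src.is_file():
        raise FileNotFoundError(f"trained model not found: {src}")
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy rather than move, so the original is kept
    return dst
```

For this notebook the call would be `install_model("protein_train50_triplet", "protein_train50")`. Copying instead of moving is a deliberate choice here so that the training output stays in place.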

### Modification to the CLEAN code during inference time

Move the trained model `.pth` file from `data/model` to `data/pretrained` and rename it from `protein_train50_triplet.pth` to `protein_train50.pth`.
To output retrieval results that are compatible with the CARE benchmarks, add the following after line 51 in `CLEAN/app/src/CLEAN/infer.py`:

```
### manually added this part to save the results in a different format ###
#assumes `pd` and `np` are already available in infer.py
new_df = pd.read_csv('./data/{}.csv'.format(test_data), delimiter='\t')
num_cols = len(new_df.columns)
#new_df['Entry'] = eval_df.columns

#add one empty column per row of eval_df (one per training EC number)
#need to fix this
for j in range(len(eval_df)):
    new_df[j] = np.nan

#for each test entry (a column of eval_df), store the ECs sorted by increasing distance
for i, col in enumerate(eval_df.columns):
    sorted_ECs = eval_df[col].sort_values(ascending=True).index.values
    new_df.iloc[i, num_cols:] = sorted_ECs

new_df.to_csv('./results/{}/{}_results_df.csv'.format(train_data, test_data), index=False)
### end of manual addition ###
```
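
The ranking logic in the snippet above can be tried outside `infer.py` on a toy distance table. This sketch assumes, as the snippet suggests, that `eval_df` has one row per training EC number and one column per test entry, with smaller values meaning closer. The data are invented, and `rank_ecs` is a hypothetical helper that builds all rank columns at once rather than adding them one by one:

```python
import pandas as pd


def rank_ecs(eval_df, meta_df):
    """Append to meta_df, per test entry, the EC numbers sorted by distance.

    Row i of meta_df is assumed to describe the test entry in column i
    of eval_df, matching the positional assignment in the snippet above.
    """
    ranked = pd.DataFrame(
        [eval_df[col].sort_values(ascending=True).index.to_numpy()
         for col in eval_df.columns],
        index=meta_df.index,
    )
    return pd.concat([meta_df, ranked], axis=1)


# Invented distances: rows are EC numbers, columns are test entries.
eval_df = pd.DataFrame(
    {"P1": [0.9, 0.1, 0.5], "P2": [0.2, 0.8, 0.4]},
    index=["1.1.1.1", "2.7.1.1", "3.4.21.4"],
)
meta_df = pd.DataFrame({"Entry": ["P1", "P2"]})
print(rank_ecs(eval_df, meta_df))
```

Column `0` then holds the closest EC number for each entry, column `1` the second closest, and so on.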


In [15]:
from CLEAN.infer import infer_pvalue

for test_set in test_set_list:
    infer_pvalue(train_set, test_set, p_value=1e-5, nk_random=20, report_metrics=True, pretrained=True)

The embedding sizes for train and test: torch.Size([28316, 128]) torch.Size([432, 128])


100%|██████████| 4936/4936 [00:00<00:00, 38658.55it/s]


Calculating eval distance map, between 432 test ids and 4936 train EC cluster centers


432it [00:00, 1065.74it/s]
100%|██████████| 4936/4936 [00:00<00:00, 42414.53it/s]
20000it [00:14, 1409.58it/s]
100%|██████████| 432/432 [00:05<00:00, 84.17it/s]


############ EC calling results using random chosen 20k samples ############
---------------------------------------------------------------------------
>>> total samples: 432 | total ec: 333 
>>> precision: 0.575 | recall: 0.576| F1: 0.567 | AUC: 0.788 | accuracy: 0.562 
---------------------------------------------------------------------------
The embedding sizes for train and test: torch.Size([28316, 128]) torch.Size([560, 128])


100%|██████████| 4936/4936 [00:00<00:00, 43541.81it/s]


Calculating eval distance map, between 560 test ids and 4936 train EC cluster centers


560it [00:00, 1381.65it/s]
100%|██████████| 4936/4936 [00:00<00:00, 24706.21it/s]
20000it [00:14, 1359.03it/s]
100%|██████████| 560/560 [00:06<00:00, 87.69it/s] 


############ EC calling results using random chosen 20k samples ############
---------------------------------------------------------------------------
>>> total samples: 560 | total ec: 389 
>>> precision: 0.839 | recall: 0.825| F1: 0.825 | AUC: 0.912 | accuracy: 0.818 
---------------------------------------------------------------------------
The embedding sizes for train and test: torch.Size([28316, 128]) torch.Size([148, 128])


100%|██████████| 4936/4936 [00:00<00:00, 37189.21it/s]


Calculating eval distance map, between 148 test ids and 4936 train EC cluster centers


148it [00:00, 1064.36it/s]
100%|██████████| 4936/4936 [00:00<00:00, 44855.08it/s]
20000it [00:14, 1387.84it/s]
100%|██████████| 148/148 [00:01<00:00, 75.82it/s]


############ EC calling results using random chosen 20k samples ############
---------------------------------------------------------------------------
>>> total samples: 148 | total ec: 56 
>>> precision: 0.477 | recall: 0.338| F1: 0.37 | AUC: 0.669 | accuracy: 0.338 
---------------------------------------------------------------------------
The embedding sizes for train and test: torch.Size([28316, 128]) torch.Size([209, 128])


100%|██████████| 4936/4936 [00:00<00:00, 43420.44it/s]


Calculating eval distance map, between 209 test ids and 4936 train EC cluster centers


209it [00:00, 1576.31it/s]
100%|██████████| 4936/4936 [00:00<00:00, 44588.25it/s]
20000it [00:14, 1386.92it/s]
100%|██████████| 209/209 [00:03<00:00, 54.93it/s]


############ EC calling results using random chosen 20k samples ############
---------------------------------------------------------------------------
>>> total samples: 209 | total ec: 384 
>>> precision: 0.559 | recall: 0.449| F1: 0.48 | AUC: 0.725 | accuracy: 0.0766 
---------------------------------------------------------------------------


Now move the output CSV files from `results/protein_train50` to your CARE directory under `task1_baselines/results_summary/CLEAN` and continue with the analysis.
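
This hand-off could be scripted along the following lines. The sketch assumes the result files follow the `{test_set}_results_df.csv` naming used in the `infer.py` addition; `copy_results` and the `care_dir` argument (the path to your CARE checkout) are hypothetical.

```python
import shutil
from pathlib import Path


def copy_results(train_set, care_dir, results_root="results"):
    """Copy <results_root>/<train_set>/*_results_df.csv into the CARE tree."""
    dest = Path(care_dir) / "task1_baselines" / "results_summary" / "CLEAN"
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for csv_path in sorted((Path(results_root) / train_set).glob("*_results_df.csv")):
        # shutil.copy2 returns the path of the newly created file
        copied.append(Path(shutil.copy2(csv_path, dest)))
    return copied
```

The function returns the list of copied paths, which makes it easy to confirm that all four test sets were transferred.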

In [17]:
# from CLEAN.infer import infer_maxsep
# infer_maxsep(train_set, test_set, report_metrics=True, pretrained=True)