In [1]:
import pandas as pd 

This notebook provides a brief description of the data used in the following notebooks.

# Notebook descriptions

`00_brief_data_description`: contains descriptions of all notebooks, of the folders hosted on the machine (PC), and of the datasets from each notebook


`01_create_seq_dset`: shows how the data was collected, together with an exploratory data analysis of the protein sequences present in the Beta-Lactamase DataBase (BLDB)


`02_generate_embeddings`: shows how to generate embeddings from 8 protein language models


`03_dim_redo_split_class`: shows the dimensionality reduction procedure using PCA, t-SNE and UMAP


`04_create_functional_dsets_and_tanimoto`: shows how to merge the sequence data retrieved from BLDB with the MICs from BLDB2 to create the main dataset. It also includes the procedure to create a second dataset of enzyme kinetics, which is not analyzed in this master thesis. Finally, it shows how to compute the Tanimoto similarity between 50 beta-lactam antibiotics.


`05_EDA_functional_dsets_50anti`: shows a simple exploratory data analysis of the functional (MICs and kinetics) and chemical datasets


`06_map_seqspace_split_class_pt1`


`07_classA_FullDset_n_singal`


`08_map_seqspace_pt2_split_class`


`09_select_best_regressor`


`10_train_with_best_model`
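As a rough illustration of the Tanimoto similarity mentioned for `04_create_functional_dsets_and_tanimoto`, the sketch below computes it for two fingerprints stored as sets of "on" bit indices. The function name `tanimoto` and the toy fingerprints are hypothetical; the notebook itself may rely on a cheminformatics library and a different fingerprint type.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not fp_a and not fp_b:
        # Convention assumed here: two empty fingerprints count as identical
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Toy fingerprints (made-up bit indices, not real antibiotic data)
fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 8}
print(tanimoto(fp_x, fp_y))  # 2 shared bits / 5 distinct bits = 0.4
```

Applying such a function to every pair of the 50 antibiotics would give a symmetric 50 x 50 similarity matrix with ones on the diagonal.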



# Folder description 

In [2]:
! tree -L 2 /home/gama/bla_analysis/

/home/gama/bla_analysis/
├── data
│   ├── ancestors
│   ├── bldb
│   ├── bldb2
│   ├── lra5
│   ├── recognized_fams
│   ├── risso_consA
│   └── varG
├── notebooks
│   ├── 00_brief_data_description.ipynb
│   ├── 01_create_seq_dset.ipynb
│   ├── 02_generate_embeddings.ipynb
│   ├── 03_dim_redo_split_class.ipynb
│   ├── 04_create_functional_dsets_and_tanimoto.ipynb
│   ├── 05_EDA_functional_dsets_50anti.ipynb
│   ├── 06_map_seqspace_split_class_pt1.ipynb
│   ├── 07_classA_FullDset_n_singal.ipynb
│   ├── 08_map_seqspace_pt2_split_class.ipynb
│   ├── 09_select_best_regressor.ipynb
│   ├── 10_train_with_best_model.ipynb
│   ├── CSN_tutorial
│   ├── borrame
│   ├── cv_example.ipynb
│   ├── extract_accessions.ipynb
│   ├── netgpi.tsv
│   ├── netgpi_dataset.fasta
│   ├── neura

# Datasets description 

## Notebook 1

df_annots contains all of the sequences used in this master thesis, with their respective:
1. beta-lactamase class, subclass and family
2. biochemical properties
3. taxonomy, annotated with DIAMOND against GTDB_RS207
4. signal peptide predicted by SignalP 6.0

In [4]:
df_annots = pd.read_csv("../results/tables/df_annot_all.csv", sep = "\t")
print(df_annots.columns)
df_annots

Index(['#name', 'seq', 'length', 'filename', 'protein_name', 'protein_family',
       'bla_class', 'bla_subclass', 'protein_family_header', 'seq_id',
       'molecular_weight', 'aromaticity', 'instability', 'gravy',
       'isoelectric_point', 'entropy', 'helix', 'turn', 'sheet',
       'is_clust90_rep', 'bitscore', 'Domain', 'Phylum', 'Class', 'Order',
       'Family', 'Genus', 'Species', 'predicted_signal_peptide',
       'Other_sigpept', 'SP(Sec/SPI)', 'LIPO(Sec/SPII)', 'TAT(Tat/SPI)',
       'TATLIPO(Sec/SPII)', 'PILIN(Sec/SPIII)', 'CS Position',
       'seq_without_sigpept'],
      dtype='object')


Unnamed: 0,#name,seq,length,filename,protein_name,protein_family,bla_class,bla_subclass,protein_family_header,seq_id,...,Species,predicted_signal_peptide,Other_sigpept,SP(Sec/SPI),LIPO(Sec/SPII),TAT(Tat/SPI),TATLIPO(Sec/SPII),PILIN(Sec/SPIII),CS Position,seq_without_sigpept
0,gi|5596421|emb|CAB51471.1|ACI-1| class A exten...,MKKFCFLFLIICGLMVFCLQDCQARQKLNLADLENKYNAVIGVYAV...,284,A-ACI-1-prot.fasta,ACI-1,ACI,Class A,Class A,ACI,seq_0,...,,LIPO,0.000157,0.499448,0.500170,0.000095,0.000088,0.000081,CS pos: 21-22. Pr: 0.4839,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
1,AHN92697.1|ACI-2| beta-lactamase [uncultured b...,MKKFCFLFLIICGLMFFCLQDCQARQKLNLADLENKYNAVIGVYAV...,284,A-ACI-2-prot.fasta,ACI-2,ACI,Class A,Class A,ACI,seq_1,...,,LIPO,0.000125,0.499552,0.500104,0.000080,0.000071,0.000069,CS pos: 21-22. Pr: 0.4476,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
2,EHO64028.1|ACI-3| hypothetical protein HMPREF9...,MKKFCFLFLIICGLMVFSLQDCQARQKLNLADLENKYNAVIGVYAV...,284,A-ACI-3-prot.fasta,ACI-3,ACI,Class A,Class A,ACI,seq_2,...,,LIPO,0.000129,0.499544,0.500129,0.000084,0.000075,0.000072,CS pos: 21-22. Pr: 0.4842,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
3,AHN92696.1|ACI-4| beta-lactamase [uncultured b...,MKKFCFLFLIICGLMVFCLQGCQARQKLNLADLENKYNAVIGVYAV...,284,A-ACI-4-prot.fasta,ACI-4,ACI,Class A,Class A,ACI,seq_3,...,,LIPO,0.000000,0.000000,1.000020,0.000000,0.000000,0.000000,CS pos: 21-22. Pr: 0.9938,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
4,AMP55835.1|ACI-5| classA [uncultured bacterium],MKKFCFLFLIICGLMVFCLQDCQARQKLNLADLENKYNAVIGVYAV...,284,A-ACI-5-prot.fasta,ACI-5,ACI,Class A,Class A,ACI,seq_4,...,,LIPO,0.000208,0.499119,0.500343,0.000120,0.000129,0.000109,CS pos: 21-22. Pr: 0.4842,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26018,WP_000778170.1 VarG family subclass B1-like me...,MKLSTLALAPIAAALFAFNVSANGHDHDNQRAIFFHGEKAPIAQTE...,373,VarG seq,VarG seq,VarG,VarG,VarG,VarG,seq_26018,...,,SP,0.000361,0.998432,0.000266,0.000355,0.000282,0.000309,CS pos: 22-23. Pr: 0.9734,NGHDHDNQRAIFFHGEKAPIAQTEVEPSATQTLKVGQKINNLYERQ...
26019,WP_000701356.1 VarG family subclass B1-like me...,MKIPTLALAPIAAALFAFNANAHEHKRSIYFPDETSSKVVQTEVEP...,370,VarG seq,VarG seq,VarG,VarG,VarG,VarG,seq_26019,...,s__Vibrio mimicus,SP,0.000308,0.998656,0.000250,0.000315,0.000242,0.000260,CS pos: 22-23. Pr: 0.9758,HEHKRSIYFPDETSSKVVQTEVEPSATKSLKLGQKINNLYDRQFDD...
26020,WP_000701355.1 VarG family subclass B1-like me...,MKIPTLALAPIAAALFAFNANAHEHKRSIYFPDETSSEVVQTEVEP...,370,VarG seq,VarG seq,VarG,VarG,VarG,VarG,seq_26020,...,s__Vibrio mimicus,SP,0.000304,0.998664,0.000250,0.000308,0.000237,0.000254,CS pos: 22-23. Pr: 0.9755,HEHKRSIYFPDETSSEVVQTEVEPSATKSLKLGQKINNLYDRQFDD...
26021,WP_000701301.1 VarG family subclass B1-like me...,MKIPNLALAPIAAALFAFNANAHEHKRSIYFPDETSSKVVQTEVEP...,370,VarG seq,VarG seq,VarG,VarG,VarG,VarG,seq_26021,...,s__Vibrio mimicus,SP,0.000273,0.998793,0.000229,0.000276,0.000213,0.000227,CS pos: 22-23. Pr: 0.9768,HEHKRSIYFPDETSSKVVQTEVEPSATKSLKIGQKINNLYDRQFDD...
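The `CS Position` column stores the predicted cleavage site as free text. A minimal sketch of how it could be parsed is shown below; the helper `parse_cs_position` and the regular expression are hypothetical and assume the `CS pos: A-B. Pr: P` layout seen in the rows displayed above.

```python
import re
from typing import Optional, Tuple

# Layout assumed from the displayed rows, e.g. "CS pos: 21-22. Pr: 0.4839"
CS_PATTERN = re.compile(r"CS pos: (\d+)-(\d+)\. Pr: (\d*\.?\d+)")


def parse_cs_position(value) -> Optional[Tuple[int, int, float]]:
    """Return (last signal-peptide residue, first mature residue, probability).

    Returns None when no cleavage site is reported (e.g. NaN in NO_SP rows).
    """
    if not isinstance(value, str):
        return None
    match = CS_PATTERN.search(value)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


print(parse_cs_position("CS pos: 21-22. Pr: 0.4839"))  # (21, 22, 0.4839)
print(parse_cs_position(float("nan")))  # None

# First residues of seq_0 as displayed above; slicing at the last
# signal-peptide residue gives the start of seq_without_sigpept
seq_prefix = "MKKFCFLFLIICGLMVFCLQDCQARQKLNL"
last_sp, first_mature, prob = parse_cs_position("CS pos: 21-22. Pr: 0.4839")
print(seq_prefix[last_sp:])  # CQARQKLNL
```

For the rows shown, the residue positions are 1-based, so slicing `seq` at the first number reproduces the start of `seq_without_sigpept`.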


## Notebook 2

df_plm contains the embeddings from 7 protein language models and the amino acid composition vectors, as well as the CARP640M likelihoods (column `carp640M_logp`)

In [2]:
df_plm = pd.read_pickle("../results/embeddings/all_plm.pkl")
df_plm

Unnamed: 0,seq_id,seq,esm1b,esm,onehot,t5xlu50,t5bfd,xlnet,bepler,carp640M,carp640M_logp
0,seq_0,MKKFCFLFLIICGLMVFCLQDCQARQKLNLADLENKYNAVIGVYAV...,"[0.1102982, 0.014699467, -0.14032564, 0.258299...","[-1.9012868, 0.22372644, 0.21296117, -0.166722...","[0.08450704, 0.024647888, 0.08450704, 0.045774...","[0.008956722, 0.02665889, 0.02402008, 0.014990...","[0.023663048, -0.0007369746, -0.0076412507, 0....","[-0.19811592, 0.051563144, -0.005958873, 0.039...","[0.08450704, 0.045774646, 0.07042254, 0.084507...","[8.911173, 9.043713, 2.4614172, 12.103228, -8....",-0.294909
1,seq_1,MKKFCFLFLIICGLMFFCLQDCQARQKLNLADLENKYNAVIGVYAV...,"[0.10703633, 0.017742604, -0.14941521, 0.25984...","[-1.91362, 0.15840982, 0.18624821, -0.169153, ...","[0.08098592, 0.024647888, 0.08450704, 0.042253...","[0.0076766736, 0.027014462, 0.027420135, 0.013...","[0.026570462, 0.0026666287, -0.006873358, -0.0...","[-0.21204372, 0.057212643, -0.011840693, 0.035...","[0.08098592, 0.045774646, 0.07042254, 0.084507...","[8.887652, 9.320681, 2.41787, 12.194232, -8.64...",-0.292038
2,seq_2,MKKFCFLFLIICGLMVFSLQDCQARQKLNLADLENKYNAVIGVYAV...,"[0.10849612, 0.017496616, -0.14619425, 0.26836...","[-1.8146169, 0.20834586, 0.23776329, -0.163906...","[0.08450704, 0.02112676, 0.08450704, 0.0457746...","[0.009324158, 0.02478514, 0.022525383, 0.01392...","[0.02341963, -0.004910054, -0.008731285, 0.005...","[-0.1872002, 0.043762427, -0.03699141, 0.03243...","[0.08450704, 0.045774646, 0.07042254, 0.084507...","[8.827172, 8.74543, 2.4937172, 12.3279085, -8....",-0.287659
3,seq_3,MKKFCFLFLIICGLMVFCLQGCQARQKLNLADLENKYNAVIGVYAV...,"[0.12845756, 0.017484514, -0.13535379, 0.24379...","[-1.8452643, 0.23816845, 0.23408653, -0.152660...","[0.08450704, 0.024647888, 0.08098592, 0.045774...","[0.004128871, 0.033653855, 0.028860169, 0.0124...","[0.014434682, -0.0026827375, -0.003042692, 0.0...","[-0.2078117, 0.05418699, 0.0056420905, 0.02040...","[0.08450704, 0.045774646, 0.07042254, 0.080985...","[8.501886, 9.030052, 2.3223486, 11.952613, -9....",-0.289380
4,seq_4,MKKFCFLFLIICGLMVFCLQDCQARQKLNLADLENKYNAVIGVYAV...,"[0.10706033, 0.013306678, -0.14212927, 0.25449...","[-1.8473366, 0.2200377, 0.23291773, -0.1457919...","[0.08450704, 0.024647888, 0.08450704, 0.045774...","[0.0077304444, 0.029016294, 0.023319585, 0.014...","[0.020652149, -0.00015262279, -0.0065017454, 0...","[-0.19173458, 0.04517981, -0.0330168, 0.023556...","[0.08450704, 0.045774646, 0.07042254, 0.084507...","[8.907578, 8.904117, 2.4000638, 12.001774, -9....",-0.287441
...,...,...,...,...,...,...,...,...,...,...,...
26018,seq_26017,MKLSTLALAPIAAALLTFNASAKGHDHDNQRAIFFPGETVQDTVKI...,"[0.033977885, 0.2099972, -0.03804998, 0.106422...","[-1.7197676, 0.03665721, 0.4062646, 0.27256438...","[0.07219251, 0.0, 0.07486631, 0.040106952, 0.0...","[0.007419476, 0.007811315, -0.021516398, 0.019...","[-0.013961067, -0.020913692, -0.023646962, 0.0...","[-0.1441164, -0.02171367, 0.11034167, 0.145684...","[0.07219251, 0.021390375, 0.06417112, 0.074866...","[5.539616, 9.056223, -0.5425722, 10.059907, -0...",-0.399117
26019,seq_26018,MKLSTLALAPIAAALFAFNVSANGHDHDNQRAIFFHGEKAPIAQTE...,"[0.040876266, 0.2043898, -0.015767168, 0.10294...","[-1.6569757, 0.103168234, 0.37261948, 0.298090...","[0.08042896, 0.0, 0.061662197, 0.048257373, 0....","[0.011291199, 0.003993584, -0.023162339, 0.025...","[-0.007916385, -0.015900034, -0.02986404, 0.03...","[-0.14152464, -0.020039544, 0.056525566, 0.157...","[0.08042896, 0.021447722, 0.077747986, 0.06166...","[5.2507486, 9.655591, -0.83597654, 8.914883, 0...",-0.391041
26020,seq_26019,MKIPTLALAPIAAALFAFNANAHEHKRSIYFPDETSSKVVQTEVEP...,"[0.044713225, 0.1989737, -0.016214604, 0.09130...","[-1.8027551, -0.054471187, 0.356291, 0.2774920...","[0.08108108, 0.0, 0.06216216, 0.045945946, 0.0...","[0.009738536, 0.0032915308, -0.022912802, 0.02...","[4.8287977e-05, -0.01686286, -0.03582304, 0.02...","[-0.16496508, -0.0207604, 0.077249505, 0.15154...","[0.08108108, 0.021621622, 0.06486487, 0.062162...","[5.8736224, 8.736176, -0.58174294, 9.6299715, ...",-0.416431
26021,seq_26020,MKIPTLALAPIAAALFAFNANAHEHKRSIYFPDETSSEVVQTEVEP...,"[0.044318486, 0.19327989, -0.017119728, 0.0940...","[-1.8163025, -0.007135033, 0.3879411, 0.245572...","[0.08108108, 0.0, 0.06216216, 0.048648648, 0.0...","[0.011506303, 0.00066137884, -0.02221559, 0.02...","[0.004785993, -0.01697322, -0.0355139, 0.02028...","[-0.15620558, -0.031111313, 0.07699773, 0.1406...","[0.08108108, 0.021621622, 0.067567565, 0.06216...","[5.918565, 8.701509, -0.7631483, 9.431531, 0.6...",-0.415698
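Each embedding column holds one vector per sequence, so most downstream tools need it stacked into a 2D matrix first. The sketch below shows one way to do that on a toy frame; `toy_plm` and its values are made up, and it assumes every vector in a column has the same length (whether `df_plm` stores lists or NumPy arrays is not confirmed here).

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_plm: each cell of the embedding column is one fixed-length vector
toy_plm = pd.DataFrame(
    {
        "seq_id": ["seq_0", "seq_1", "seq_2"],
        "esm1b": [
            [0.1, 0.0, -0.1],
            [0.2, 0.1, -0.2],
            [0.3, 0.2, -0.3],
        ],
    }
)

# Stack the per-sequence vectors into an (n_sequences, n_features) matrix
X = np.vstack(toy_plm["esm1b"].tolist())
print(X.shape)  # (3, 3)
```

The row order of `X` follows the row order of the frame, so `seq_id` can still be used to map rows back to sequences.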


## Notebook 3

df_{dim_redo_algo} contains the sequence dataset with annotations and the respective two axes (dimensions) generated with PCA, t-SNE or UMAP for serine (sbl) or metallo (mbl) beta-lactamases.

In [6]:
# serine beta-lactamases (sbl)
df_pca_sbl = pd.read_csv("../results/dim_redo/splitted_classes/pca/all_plm_sbl.csv")
print(df_pca_sbl.columns)
df_pca_sbl

Index(['PC1_t5xlu50', 'PC2_t5xlu50', 'seq_id', 'PC1_t5bfd', 'PC2_t5bfd',
       'PC1_aa_composition', 'PC2_aa_composition', 'PC1_esm', 'PC2_esm',
       'PC1_bepler', 'PC2_bepler', 'PC1_xlnet', 'PC2_xlnet', 'PC1_esm1b',
       'PC2_esm1b', 'PC1_carp640M', 'PC2_carp640M', '#name', 'seq', 'length',
       'filename', 'protein_name', 'protein_family', 'bla_class',
       'bla_subclass', 'protein_family_header', 'molecular_weight',
       'aromaticity', 'instability', 'gravy', 'isoelectric_point', 'entropy',
       'helix', 'turn', 'sheet', 'is_clust90_rep', 'bitscore', 'Domain',
       'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species',
       'predicted_signal_peptide', 'Other_sigpept', 'SP(Sec/SPI)',
       'LIPO(Sec/SPII)', 'TAT(Tat/SPI)', 'TATLIPO(Sec/SPII)',
       'PILIN(Sec/SPIII)', 'CS Position', 'seq_without_sigpept'],
      dtype='object')


Unnamed: 0,PC1_t5xlu50,PC2_t5xlu50,seq_id,PC1_t5bfd,PC2_t5bfd,PC1_aa_composition,PC2_aa_composition,PC1_esm,PC2_esm,PC1_bepler,...,Species,predicted_signal_peptide,Other_sigpept,SP(Sec/SPI),LIPO(Sec/SPII),TAT(Tat/SPI),TATLIPO(Sec/SPII),PILIN(Sec/SPIII),CS Position,seq_without_sigpept
0,-0.261651,-0.301408,seq_25809,0.708376,-0.152442,-0.101818,-0.034964,16.876826,-8.010551,-0.182967,...,,SP,0.337944,0.660240,0.000750,0.000413,0.000395,0.000265,CS pos: 20-21. Pr: 0.3357,RKGGRLGVAALDTTGRTVGYRADERFPMCSTFKALAAAAVLARVDA...
1,0.122469,0.096706,seq_25814,-0.144349,-0.446235,-0.005307,0.050003,-27.135322,-5.184893,0.003813,...,,SP,0.333512,0.666018,0.000158,0.000119,0.000108,0.000094,CS pos: 16-17. Pr: 0.6087,IVDAAIKPLMQQYDIPGMAVAVTVDGKPYFFNYGVASKETGQPVTE...
2,-0.198478,0.297920,seq_25815,0.501707,0.224767,0.060004,-0.034718,-7.878228,8.389995,-0.135011,...,,SP,0.000377,0.998711,0.000320,0.000234,0.000192,0.000187,CS pos: 17-18. Pr: 0.9675,TFVLYDLKTGKYYVYNKERAETRFSPASTFKIPNSLIGLETGVVKD...
3,-0.279257,0.135967,seq_0,0.562092,0.281681,0.067126,-0.039581,10.885861,5.198114,-0.140575,...,,LIPO,0.000157,0.499448,0.500170,0.000095,0.000088,0.000081,CS pos: 21-22. Pr: 0.4839,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
4,-0.272597,0.132858,seq_1,0.559413,0.274475,0.068621,-0.038422,11.495364,5.311646,-0.011053,...,,LIPO,0.000125,0.499552,0.500104,0.000080,0.000071,0.000069,CS pos: 21-22. Pr: 0.4476,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22682,0.319475,-0.460170,seq_25816,0.026779,0.090934,-0.028279,-0.016085,7.469034,-4.413349,-0.454162,...,,NO_SP,1.000003,0.000000,0.000000,0.000000,0.000000,0.000000,,MHPQTLEQIKESESQLSGRVGMVELDLASGRTLSYRADERFPMMST...
22683,-0.036200,-0.390047,seq_25817,0.445791,-0.011490,-0.064154,-0.028494,1.972817,-9.880397,-0.302004,...,,NO_SP,1.000019,0.000000,0.000000,0.000000,0.000000,0.000000,,AAQLSEQLAELEKRSGGRLGVAVLDTATGRRIAYRGDERFPMCSTF...
22684,0.009534,-0.388566,seq_25818,0.405283,0.018454,-0.053026,-0.026618,1.418307,-8.428426,-0.239582,...,,NO_SP,0.999994,0.000000,0.000000,0.000000,0.000000,0.000000,,MAAQLSEQLAELEKRSGGRVGVIVLDTATGRRIAYRGDERFPMMST...
22685,-0.041142,-0.365464,seq_25819,0.416195,-0.008874,-0.063963,-0.026689,3.798993,-8.789639,-0.288557,...,,NO_SP,1.000009,0.000000,0.000000,0.000000,0.000000,0.000000,,AAALSEQLAELEKRSGGRLGVAVLDTATGRRIAYRGDERFPMCSTF...


In [7]:
# metallo-beta-lactamases (mbl)
df_pca_mbl = pd.read_csv("../results/dim_redo/splitted_classes/pca/all_plm_mbl.csv")
print(df_pca_mbl.columns)
df_pca_mbl

Index(['PC1_t5bfd', 'PC2_t5bfd', 'seq_id', 'PC1_bepler', 'PC2_bepler',
       'PC1_esm1b', 'PC2_esm1b', 'PC1_esm', 'PC2_esm', 'PC1_t5xlu50',
       'PC2_t5xlu50', 'PC1_aa_composition', 'PC2_aa_composition', 'PC1_xlnet',
       'PC2_xlnet', 'PC1_carp640M', 'PC2_carp640M', '#name', 'seq', 'length',
       'filename', 'protein_name', 'protein_family', 'bla_class',
       'bla_subclass', 'protein_family_header', 'molecular_weight',
       'aromaticity', 'instability', 'gravy', 'isoelectric_point', 'entropy',
       'helix', 'turn', 'sheet', 'is_clust90_rep', 'bitscore', 'Domain',
       'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species',
       'predicted_signal_peptide', 'Other_sigpept', 'SP(Sec/SPI)',
       'LIPO(Sec/SPII)', 'TAT(Tat/SPI)', 'TATLIPO(Sec/SPII)',
       'PILIN(Sec/SPIII)', 'CS Position', 'seq_without_sigpept'],
      dtype='object')


Unnamed: 0,PC1_t5bfd,PC2_t5bfd,seq_id,PC1_bepler,PC2_bepler,PC1_esm1b,PC2_esm1b,PC1_esm,PC2_esm,PC1_t5xlu50,...,Species,predicted_signal_peptide,Other_sigpept,SP(Sec/SPI),LIPO(Sec/SPII),TAT(Tat/SPI),TATLIPO(Sec/SPII),PILIN(Sec/SPIII),CS Position,seq_without_sigpept
0,0.322672,-0.348620,seq_25810,-0.903629,0.235284,-1.285285,-0.782621,-7.572600,-14.052075,-0.344300,...,,SP,0.000137,0.832758,0.166779,0.000130,0.000110,0.000104,CS pos: 15-16. Pr: 0.8213,LEITKLSDNVYVHTSYLETEGGKVPSNGLIVVTGKEAVLIDTPWDD...
1,0.065078,-0.351984,seq_25811,0.810333,0.487223,-0.659456,-0.303967,9.057257,3.028028,-0.143731,...,,SP,0.000162,0.999294,0.000136,0.000151,0.000122,0.000123,CS pos: 19-20. Pr: 0.9786,GKLSLKHLKGPVYVVEDDYYVQENSMVYIGADHVTVIGATWTPDTA...
2,-0.498740,-0.128038,seq_25812,-0.547966,-0.543544,0.674196,0.233531,-2.456959,3.376650,0.295581,...,,NO_SP,1.000007,0.000000,0.000000,0.000000,0.000000,0.000000,,MKLLLLAAAQEWNKPAPPFRIFGNLYYVGTCGLSAYLITTPEGHIL...
3,-0.099861,-0.318269,seq_25813,-0.280012,0.465049,-0.182208,0.615494,15.734534,9.621734,-0.011604,...,,NO_SP,1.000010,0.000000,0.000000,0.000000,0.000000,0.000000,,PFRILGNYYVGNGLLITTPKGHILIDTPWDSAPTEALIRWLGFKLK...
4,0.362291,0.077299,seq_13314,0.753634,0.714091,-0.924770,-0.261127,8.081576,-2.227204,-0.262180,...,s__Acinetobacter marinus,SP,0.000240,0.998998,0.000194,0.000219,0.000177,0.000185,CS pos: 24-25. Pr: 0.9701,EPDALIKPIPNTKSSSITQSTAKTVYQSEDLVITELAPNVYQHTSY...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3129,-0.105010,-0.252139,seq_16439,-0.519806,-0.399834,0.190802,0.916323,-2.916168,-3.162789,0.105222,...,s__46-32 sp001898405,SP,0.000192,0.999179,0.000176,0.000173,0.000144,0.000144,CS pos: 20-21. Pr: 0.9737,QKVMEPANVPAEWSKPYPAFQIAGNLYYVGTYDLASYLITTPQGHI...
3130,-0.120790,-0.227920,seq_16440,-0.420186,0.024951,0.184837,0.975495,0.349084,-0.266100,0.123011,...,s__Chitinophaga costaii,SP,0.000165,0.999229,0.000157,0.000169,0.000133,0.000140,CS pos: 14-15. Pr: 0.9704,QQVAEPANTNPEWSKPYPPFKIAGNLYYVGTYELACYLIVTGKGNI...
3131,-0.100708,-0.301275,seq_16441,-0.499148,0.120810,0.156052,1.021881,-0.509809,-1.166266,0.118810,...,s__Chitinophaga sp001975825,SP,0.000212,0.999106,0.000194,0.000181,0.000155,0.000148,CS pos: 21-22. Pr: 0.9684,QKVAEPPTTNNPEWSKPYEPFQIAGNLYYVGTYDLACYLIVTPRGN...
3132,-0.170427,-0.241142,seq_16442,-0.674393,-0.171530,0.156819,0.990984,-3.166446,-2.324971,0.125879,...,,SP,0.000190,0.999232,0.000160,0.000163,0.000131,0.000132,CS pos: 24-25. Pr: 0.9654,QKVFEPKDTPPEWSRPYKPFRIVGNLYYVGTYDLACYLVTTPEGNI...


In [8]:
# serine beta-lactamases (sbl)
df_tsne_sbl = pd.read_csv("../results/dim_redo/splitted_classes/tsne/all_plm_sbl.csv")
print(df_tsne_sbl.columns)
df_tsne_sbl

Index(['tSNE1_t5bfd', 'tSNE2_t5bfd', 'seq_id', 'tSNE1_bepler', 'tSNE2_bepler',
       'tSNE1_aa_composition', 'tSNE2_aa_composition', 'tSNE1_carp640M',
       'tSNE2_carp640M', 'tSNE1_t5xlu50', 'tSNE2_t5xlu50', 'tSNE1_esm1b',
       'tSNE2_esm1b', 'tSNE1_xlnet', 'tSNE2_xlnet', 'tSNE1_esm', 'tSNE2_esm',
       '#name', 'seq', 'length', 'filename', 'protein_name', 'protein_family',
       'bla_class', 'bla_subclass', 'protein_family_header',
       'molecular_weight', 'aromaticity', 'instability', 'gravy',
       'isoelectric_point', 'entropy', 'helix', 'turn', 'sheet',
       'is_clust90_rep', 'bitscore', 'Domain', 'Phylum', 'Class', 'Order',
       'Family', 'Genus', 'Species', 'predicted_signal_peptide',
       'Other_sigpept', 'SP(Sec/SPI)', 'LIPO(Sec/SPII)', 'TAT(Tat/SPI)',
       'TATLIPO(Sec/SPII)', 'PILIN(Sec/SPIII)', 'CS Position',
       'seq_without_sigpept'],
      dtype='object')


Unnamed: 0,tSNE1_t5bfd,tSNE2_t5bfd,seq_id,tSNE1_bepler,tSNE2_bepler,tSNE1_aa_composition,tSNE2_aa_composition,tSNE1_carp640M,tSNE2_carp640M,tSNE1_t5xlu50,...,Species,predicted_signal_peptide,Other_sigpept,SP(Sec/SPI),LIPO(Sec/SPII),TAT(Tat/SPI),TATLIPO(Sec/SPII),PILIN(Sec/SPIII),CS Position,seq_without_sigpept
0,-29.033724,-7.047474,seq_25809,-36.586930,-16.665535,-42.243176,-11.396341,-35.861507,-14.432572,-25.205770,...,,SP,0.337944,0.660240,0.000750,0.000413,0.000395,0.000265,CS pos: 20-21. Pr: 0.3357,RKGGRLGVAALDTTGRTVGYRADERFPMCSTFKALAAAAVLARVDA...
1,37.123615,1.660976,seq_25814,60.677860,-7.761523,14.958840,-27.312570,40.827442,-1.834042,31.757998,...,,SP,0.333512,0.666018,0.000158,0.000119,0.000108,0.000094,CS pos: 16-17. Pr: 0.6087,IVDAAIKPLMQQYDIPGMAVAVTVDGKPYFFNYGVASKETGQPVTE...
2,16.552319,24.847303,seq_25815,-30.163970,-17.247374,20.583840,20.162498,-3.822715,23.043743,16.548096,...,,SP,0.000377,0.998711,0.000320,0.000234,0.000192,0.000187,CS pos: 17-18. Pr: 0.9675,TFVLYDLKTGKYYVYNKERAETRFSPASTFKIPNSLIGLETGVVKD...
3,-2.078641,11.452413,seq_0,-32.508600,-6.162993,18.819221,30.933647,6.483561,14.063234,-11.488042,...,,LIPO,0.000157,0.499448,0.500170,0.000095,0.000088,0.000081,CS pos: 21-22. Pr: 0.4839,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
4,-2.077973,11.446576,seq_1,-32.509750,-6.151229,18.831842,30.913744,6.488060,14.064252,-11.484268,...,,LIPO,0.000125,0.499552,0.500104,0.000080,0.000071,0.000069,CS pos: 21-22. Pr: 0.4476,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22682,-10.746155,-41.972286,seq_25816,-36.237595,-17.005812,-17.100424,9.764537,-5.307428,-34.616585,-8.772803,...,,NO_SP,1.000003,0.000000,0.000000,0.000000,0.000000,0.000000,,MHPQTLEQIKESESQLSGRVGMVELDLASGRTLSYRADERFPMMST...
22683,-17.738310,-10.539475,seq_25817,-35.529907,-17.646555,-42.516064,2.234632,-36.037495,-28.916195,-17.752954,...,,NO_SP,1.000019,0.000000,0.000000,0.000000,0.000000,0.000000,,AAQLSEQLAELEKRSGGRLGVAVLDTATGRRIAYRGDERFPMCSTF...
22684,-17.691074,-10.481162,seq_25818,-35.975002,-17.689636,-27.336082,-1.132149,-35.987570,-28.915380,-17.704632,...,,NO_SP,0.999994,0.000000,0.000000,0.000000,0.000000,0.000000,,MAAQLSEQLAELEKRSGGRVGVIVLDTATGRRIAYRGDERFPMMST...
22685,-17.754385,-10.543181,seq_25819,-35.430300,-17.754396,-42.473106,2.300715,-36.019756,-28.930180,-17.751310,...,,NO_SP,1.000009,0.000000,0.000000,0.000000,0.000000,0.000000,,AAALSEQLAELEKRSGGRLGVAVLDTATGRRIAYRGDERFPMCSTF...


In [9]:
# metallo-beta-lactamases (mbl)
df_tsne_mbl = pd.read_csv("../results/dim_redo/splitted_classes/tsne/all_plm_mbl.csv")
print(df_tsne_mbl.columns)
df_tsne_mbl

Index(['tSNE1_bepler', 'tSNE2_bepler', 'seq_id', 'tSNE1_aa_composition',
       'tSNE2_aa_composition', 'tSNE1_t5bfd', 'tSNE2_t5bfd', 'tSNE1_carp640M',
       'tSNE2_carp640M', 'tSNE1_esm', 'tSNE2_esm', 'tSNE1_esm1b',
       'tSNE2_esm1b', 'tSNE1_xlnet', 'tSNE2_xlnet', 'tSNE1_t5xlu50',
       'tSNE2_t5xlu50', '#name', 'seq', 'length', 'filename', 'protein_name',
       'protein_family', 'bla_class', 'bla_subclass', 'protein_family_header',
       'molecular_weight', 'aromaticity', 'instability', 'gravy',
       'isoelectric_point', 'entropy', 'helix', 'turn', 'sheet',
       'is_clust90_rep', 'bitscore', 'Domain', 'Phylum', 'Class', 'Order',
       'Family', 'Genus', 'Species', 'predicted_signal_peptide',
       'Other_sigpept', 'SP(Sec/SPI)', 'LIPO(Sec/SPII)', 'TAT(Tat/SPI)',
       'TATLIPO(Sec/SPII)', 'PILIN(Sec/SPIII)', 'CS Position',
       'seq_without_sigpept'],
      dtype='object')


Unnamed: 0,tSNE1_bepler,tSNE2_bepler,seq_id,tSNE1_aa_composition,tSNE2_aa_composition,tSNE1_t5bfd,tSNE2_t5bfd,tSNE1_carp640M,tSNE2_carp640M,tSNE1_esm,...,Species,predicted_signal_peptide,Other_sigpept,SP(Sec/SPI),LIPO(Sec/SPII),TAT(Tat/SPI),TATLIPO(Sec/SPII),PILIN(Sec/SPIII),CS Position,seq_without_sigpept
0,11.160006,8.020817,seq_25810,0.590570,13.623016,15.488024,3.058999,11.591803,12.254585,5.168570,...,,SP,0.000137,0.832758,0.166779,0.000130,0.000110,0.000104,CS pos: 15-16. Pr: 0.8213,LEITKLSDNVYVHTSYLETEGGKVPSNGLIVVTGKEAVLIDTPWDD...
1,4.931104,-4.019466,seq_25811,3.300360,0.218640,4.997390,5.330502,6.460155,-2.176258,0.086119,...,,SP,0.000162,0.999294,0.000136,0.000151,0.000122,0.000123,CS pos: 19-20. Pr: 0.9786,GKLSLKHLKGPVYVVEDDYYVQENSMVYIGADHVTVIGATWTPDTA...
2,-11.637862,-5.437962,seq_25812,-3.734423,-10.011774,-13.064893,4.187890,-4.320327,-3.223510,-5.527527,...,,NO_SP,1.000007,0.000000,0.000000,0.000000,0.000000,0.000000,,MKLLLLAAAQEWNKPAPPFRIFGNLYYVGTCGLSAYLITTPEGHIL...
3,2.905549,2.769480,seq_25813,-1.041104,0.374860,4.748127,6.516093,8.659981,-1.821597,-0.462570,...,,NO_SP,1.000010,0.000000,0.000000,0.000000,0.000000,0.000000,,PFRILGNYYVGNGLLITTPKGHILIDTPWDSAPTEALIRWLGFKLK...
4,11.248569,-2.515673,seq_13314,1.078179,6.731807,11.546330,-1.904159,12.982421,8.546435,8.071766,...,s__Acinetobacter marinus,SP,0.000240,0.998998,0.000194,0.000219,0.000177,0.000185,CS pos: 24-25. Pr: 0.9701,EPDALIKPIPNTKSSSITQSTAKTVYQSEDLVITELAPNVYQHTSY...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3129,-6.704741,-4.195854,seq_16439,4.373461,2.770175,-4.280950,5.621791,-2.599182,-6.889552,-2.745226,...,s__46-32 sp001898405,SP,0.000192,0.999179,0.000176,0.000173,0.000144,0.000144,CS pos: 20-21. Pr: 0.9737,QKVMEPANVPAEWSKPYPAFQIAGNLYYVGTYDLASYLITTPQGHI...
3130,-4.476643,-2.739342,seq_16440,3.840333,1.443490,-4.429151,5.933342,-2.634642,-6.274363,-3.167640,...,s__Chitinophaga costaii,SP,0.000165,0.999229,0.000157,0.000169,0.000133,0.000140,CS pos: 14-15. Pr: 0.9704,QQVAEPANTNPEWSKPYPPFKIAGNLYYVGTYELACYLIVTGKGNI...
3131,-3.712742,-2.864039,seq_16441,4.288371,2.206416,-4.301567,5.901580,-2.521488,-6.475953,-3.094878,...,s__Chitinophaga sp001975825,SP,0.000212,0.999106,0.000194,0.000181,0.000155,0.000148,CS pos: 21-22. Pr: 0.9684,QKVAEPPTTNNPEWSKPYEPFQIAGNLYYVGTYDLACYLIVTPRGN...
3132,-7.082638,-4.411496,seq_16442,4.440018,1.820756,-4.863302,5.123221,-2.806676,-6.575401,-2.983438,...,,SP,0.000190,0.999232,0.000160,0.000163,0.000131,0.000132,CS pos: 24-25. Pr: 0.9654,QKVFEPKDTPPEWSRPYKPFRIVGNLYYVGTYDLACYLVTTPEGNI...


In [10]:
# serine beta-lactamases (sbl)
df_umap_sbl = pd.read_csv("../results/dim_redo/splitted_classes/umap/all_plm_sbl.csv")
print(df_umap_sbl.columns)
df_umap_sbl

Index(['umap1_carp640M', 'umap2_carp640M', 'seq_id', 'umap1_t5bfd',
       'umap2_t5bfd', 'umap1_esm1b', 'umap2_esm1b', 'umap1_t5xlu50',
       'umap2_t5xlu50', 'umap1_esm', 'umap2_esm', 'umap1_aa_composition',
       'umap2_aa_composition', 'umap1_bepler', 'umap2_bepler', 'umap1_xlnet',
       'umap2_xlnet', '#name', 'seq', 'length', 'filename', 'protein_name',
       'protein_family', 'bla_class', 'bla_subclass', 'protein_family_header',
       'molecular_weight', 'aromaticity', 'instability', 'gravy',
       'isoelectric_point', 'entropy', 'helix', 'turn', 'sheet',
       'is_clust90_rep', 'bitscore', 'Domain', 'Phylum', 'Class', 'Order',
       'Family', 'Genus', 'Species', 'predicted_signal_peptide',
       'Other_sigpept', 'SP(Sec/SPI)', 'LIPO(Sec/SPII)', 'TAT(Tat/SPI)',
       'TATLIPO(Sec/SPII)', 'PILIN(Sec/SPIII)', 'CS Position',
       'seq_without_sigpept'],
      dtype='object')


Unnamed: 0,umap1_carp640M,umap2_carp640M,seq_id,umap1_t5bfd,umap2_t5bfd,umap1_esm1b,umap2_esm1b,umap1_t5xlu50,umap2_t5xlu50,umap1_esm,...,Species,predicted_signal_peptide,Other_sigpept,SP(Sec/SPI),LIPO(Sec/SPII),TAT(Tat/SPI),TATLIPO(Sec/SPII),PILIN(Sec/SPIII),CS Position,seq_without_sigpept
0,0.565765,10.055883,seq_25809,7.406665,5.729660,-0.994178,8.406624,0.484468,5.853086,-0.683621,...,,SP,0.337944,0.660240,0.000750,0.000413,0.000395,0.000265,CS pos: 20-21. Pr: 0.3357,RKGGRLGVAALDTTGRTVGYRADERFPMCSTFKALAAAAVLARVDA...
1,7.719544,7.213110,seq_25814,10.183612,1.211918,7.067026,7.495677,7.245858,2.424474,8.168442,...,,SP,0.333512,0.666018,0.000158,0.000119,0.000108,0.000094,CS pos: 16-17. Pr: 0.6087,IVDAAIKPLMQQYDIPGMAVAVTVDGKPYFFNYGVASKETGQPVTE...
2,1.364387,5.335743,seq_25815,5.406605,8.356627,1.015777,0.995464,0.172149,-0.304742,0.378465,...,,SP,0.000377,0.998711,0.000320,0.000234,0.000192,0.000187,CS pos: 17-18. Pr: 0.9675,TFVLYDLKTGKYYVYNKERAETRFSPASTFKIPNSLIGLETGVVKD...
3,0.721085,5.898996,seq_0,8.931684,6.633514,1.133505,12.531065,-1.445493,4.830019,2.034181,...,,LIPO,0.000157,0.499448,0.500170,0.000095,0.000088,0.000081,CS pos: 21-22. Pr: 0.4839,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
4,0.686704,5.812041,seq_1,8.735252,6.804988,1.175978,12.373275,-1.535052,4.967071,1.681483,...,,LIPO,0.000125,0.499552,0.500104,0.000080,0.000071,0.000069,CS pos: 21-22. Pr: 0.4476,CQARQKLNLADLENKYNAVIGVYAVDMENGKKICYKPDTRFSYCST...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22682,0.004976,8.574091,seq_25816,8.518968,12.824478,-0.651975,9.909252,0.199094,10.276077,-0.248034,...,,NO_SP,1.000003,0.000000,0.000000,0.000000,0.000000,0.000000,,MHPQTLEQIKESESQLSGRVGMVELDLASGRTLSYRADERFPMMST...
22683,0.916486,10.175001,seq_25817,7.816970,5.975520,-1.345385,8.268382,0.062902,6.169547,-1.327439,...,,NO_SP,1.000019,0.000000,0.000000,0.000000,0.000000,0.000000,,AAQLSEQLAELEKRSGGRLGVAVLDTATGRRIAYRGDERFPMCSTF...
22684,1.023403,10.151779,seq_25818,7.895245,5.975049,-1.182719,8.444343,0.011228,6.292199,-1.268524,...,,NO_SP,0.999994,0.000000,0.000000,0.000000,0.000000,0.000000,,MAAQLSEQLAELEKRSGGRVGVIVLDTATGRRIAYRGDERFPMMST...
22685,0.978685,10.042521,seq_25819,7.743480,6.016940,-1.353705,8.328938,0.075632,6.195929,-1.339367,...,,NO_SP,1.000009,0.000000,0.000000,0.000000,0.000000,0.000000,,AAALSEQLAELEKRSGGRLGVAVLDTATGRRIAYRGDERFPMCSTF...


In [11]:
# metallo-beta-lactamases (mbl)
df_umap_mbl = pd.read_csv("../results/dim_redo/splitted_classes/umap/all_plm_mbl.csv")
print(df_umap_mbl.columns)
df_umap_mbl

Index(['umap1_esm', 'umap2_esm', 'seq_id', 'umap1_aa_composition',
       'umap2_aa_composition', 'umap1_bepler', 'umap2_bepler', 'umap1_t5xlu50',
       'umap2_t5xlu50', 'umap1_t5bfd', 'umap2_t5bfd', 'umap1_esm1b',
       'umap2_esm1b', 'umap1_xlnet', 'umap2_xlnet', 'umap1_carp640M',
       'umap2_carp640M', '#name', 'seq', 'length', 'filename', 'protein_name',
       'protein_family', 'bla_class', 'bla_subclass', 'protein_family_header',
       'molecular_weight', 'aromaticity', 'instability', 'gravy',
       'isoelectric_point', 'entropy', 'helix', 'turn', 'sheet',
       'is_clust90_rep', 'bitscore', 'Domain', 'Phylum', 'Class', 'Order',
       'Family', 'Genus', 'Species', 'predicted_signal_peptide',
       'Other_sigpept', 'SP(Sec/SPI)', 'LIPO(Sec/SPII)', 'TAT(Tat/SPI)',
       'TATLIPO(Sec/SPII)', 'PILIN(Sec/SPIII)', 'CS Position',
       'seq_without_sigpept'],
      dtype='object')


Unnamed: 0,umap1_esm,umap2_esm,seq_id,umap1_aa_composition,umap2_aa_composition,umap1_bepler,umap2_bepler,umap1_t5xlu50,umap2_t5xlu50,umap1_t5bfd,...,Species,predicted_signal_peptide,Other_sigpept,SP(Sec/SPI),LIPO(Sec/SPII),TAT(Tat/SPI),TATLIPO(Sec/SPII),PILIN(Sec/SPIII),CS Position,seq_without_sigpept
0,3.932276,8.107975,seq_25810,4.834155,9.727823,8.938259,5.733058,4.282558,7.726621,4.992811,...,,SP,0.000137,0.832758,0.166779,0.000130,0.000110,0.000104,CS pos: 15-16. Pr: 0.8213,LEITKLSDNVYVHTSYLETEGGKVPSNGLIVVTGKEAVLIDTPWDD...
1,5.796630,8.310654,seq_25811,6.301899,9.216569,7.857910,6.754916,5.337908,8.227730,4.126945,...,,SP,0.000162,0.999294,0.000136,0.000151,0.000122,0.000123,CS pos: 19-20. Pr: 0.9786,GKLSLKHLKGPVYVVEDDYYVQENSMVYIGADHVTVIGATWTPDTA...
2,7.116475,9.192446,seq_25812,7.040029,9.768353,7.559695,4.213998,8.706751,9.076670,1.644929,...,,NO_SP,1.000007,0.000000,0.000000,0.000000,0.000000,0.000000,,MKLLLLAAAQEWNKPAPPFRIFGNLYYVGTCGLSAYLITTPEGHIL...
3,5.936379,8.521885,seq_25813,6.007249,9.590334,8.226659,6.272790,5.377499,8.676945,4.310386,...,,NO_SP,1.000010,0.000000,0.000000,0.000000,0.000000,0.000000,,PFRILGNYYVGNGLLITTPKGHILIDTPWDSAPTEALIRWLGFKLK...
4,4.759280,8.776646,seq_13314,5.602692,9.298289,7.871040,5.689137,5.144113,7.018634,4.395852,...,s__Acinetobacter marinus,SP,0.000240,0.998998,0.000194,0.000219,0.000177,0.000185,CS pos: 24-25. Pr: 0.9701,EPDALIKPIPNTKSSSITQSTAKTVYQSEDLVITELAPNVYQHTSY...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3129,7.050779,9.610131,seq_16439,6.140892,9.016062,7.647017,4.869938,7.656897,9.333304,2.300631,...,s__46-32 sp001898405,SP,0.000192,0.999179,0.000176,0.000173,0.000144,0.000144,CS pos: 20-21. Pr: 0.9737,QKVMEPANVPAEWSKPYPAFQIAGNLYYVGTYDLASYLITTPQGHI...
3130,6.965789,9.616343,seq_16440,6.217888,9.051498,7.436782,5.051413,7.670532,9.315625,2.199916,...,s__Chitinophaga costaii,SP,0.000165,0.999229,0.000157,0.000169,0.000133,0.000140,CS pos: 14-15. Pr: 0.9704,QQVAEPANTNPEWSKPYPPFKIAGNLYYVGTYELACYLIVTGKGNI...
3131,7.073654,9.594904,seq_16441,6.152545,8.965333,7.486247,5.133616,7.650896,9.295524,2.278639,...,s__Chitinophaga sp001975825,SP,0.000212,0.999106,0.000194,0.000181,0.000155,0.000148,CS pos: 21-22. Pr: 0.9684,QKVAEPPTTNNPEWSKPYEPFQIAGNLYYVGTYDLACYLIVTPRGN...
3132,7.033867,9.564788,seq_16442,6.170478,9.013315,7.722808,4.756796,7.643445,9.268057,2.151616,...,,SP,0.000190,0.999232,0.000160,0.000163,0.000131,0.000132,CS pos: 24-25. Pr: 0.9654,QKVFEPKDTPPEWSRPYKPFRIVGNLYYVGTYDLACYLVTTPEGNI...


## Notebook 

`df_mics` contains the MIC values and fold changes retrieved and corrected from BLDB2

In [16]:
df_mics = pd.read_csv("../data/bldb2/all_mics_raw.csv", sep = "\t")
df_mics

Unnamed: 0,protein_name,bla_class,antibiotic,antibiotic_class,mic_without_bla,mic_with_bla,fold,log2_fold
0,SFO-1,Class A,Amoxicillin,Penicillins,1.000,256.00,256.00000,8.000000
1,SFO-1,Class A,Amoxicillin-CLA,Penicillins,2.000,256.00,128.00000,7.000000
2,SFO-1,Class A,Ceftazidime,Cephalosporins,0.125,16.00,128.00000,7.000000
3,SFO-1,Class A,Piperacillin,Penicillins,0.500,256.00,512.00000,9.000000
4,SFO-1,Class A,Imipenem,Carbapenems,0.250,0.50,2.00000,1.000000
...,...,...,...,...,...,...,...,...
2378,OXA-97,Class D,Ampicillin,Penicillins,4.000,512.00,128.00000,7.000000
2379,OXA-97,Class D,Imipenem,Carbapenems,0.060,1.00,16.66670,4.058897
2380,OXA-97,Class D,Ticarcillin,Penicillins,4.000,512.00,128.00000,7.000000
2381,OXA-97,Class D,Meropenem,Carbapenems,0.060,0.50,8.33333,3.058893
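
The `fold` and `log2_fold` values displayed above look consistent with `fold = mic_with_bla / mic_without_bla` and `log2_fold = log2(fold)`. The sketch below checks that relationship on three rows copied from the output. It is an inference from the displayed numbers, not necessarily how the original columns were produced, and the helper name `fold_change` is hypothetical.

```python
import math

# (mic_without_bla, mic_with_bla, fold, log2_fold) copied from the rows displayed above
rows = [
    (1.000, 256.00, 256.0, 8.0),       # SFO-1, Amoxicillin
    (0.250, 0.50, 2.0, 1.0),           # SFO-1, Imipenem
    (0.060, 1.00, 16.6667, 4.058897),  # OXA-97, Imipenem
]


def fold_change(mic_without_bla, mic_with_bla):
    """Return (fold, log2_fold) for one pair of MIC values."""
    fold = mic_with_bla / mic_without_bla
    return fold, math.log2(fold)


# Recompute both columns and print them next to the stored values
recomputed = [fold_change(without, with_) for without, with_, _, _ in rows]
for (_, _, fold, log2_fold), (new_fold, new_log2) in zip(rows, recomputed):
    print(f"fold {fold:.4f} vs {new_fold:.4f} | log2 {log2_fold:.6f} vs {new_log2:.6f}")
```

Small differences in the last decimals are expected where the stored `fold` was rounded before the logarithm was taken.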


`df_training_mics` contains the MICs dataset merged with the sequence dataset, including annotations and embeddings

In [18]:
df_training_mics = pd.read_pickle("../results/mics/training_dset_mics.pkl")
print(df_training_mics.columns)
df_training_mics

Index(['protein_name', 'antibiotic', 'antibiotic_class', 'mic_without_bla',
       'mic_with_bla', 'fold', 'log2_fold', '#name', 'seq', 'length',
       'filename', 'protein_family', 'bla_class', 'bla_subclass',
       'protein_family_header', 'seq_id', 'molecular_weight', 'aromaticity',
       'instability', 'gravy', 'isoelectric_point', 'entropy', 'helix', 'turn',
       'sheet', 'is_clust90_rep', 'bitscore', 'Domain', 'Phylum', 'Class',
       'Order', 'Family', 'Genus', 'Species', 'predicted_signal_peptide',
       'Other_sigpept', 'SP(Sec/SPI)', 'LIPO(Sec/SPII)', 'TAT(Tat/SPI)',
       'TATLIPO(Sec/SPII)', 'PILIN(Sec/SPIII)', 'CS Position',
       'seq_without_sigpept', 'esm1b', 'esm', 'onehot', 't5xlu50', 't5bfd',
       'xlnet', 'bepler', 'carp640M', 'carp640M_logp'],
      dtype='object')


Unnamed: 0,protein_name,antibiotic,antibiotic_class,mic_without_bla,mic_with_bla,fold,log2_fold,#name,seq,length,...,seq_without_sigpept,esm1b,esm,onehot,t5xlu50,t5bfd,xlnet,bepler,carp640M,carp640M_logp
0,SFO-1,Amoxicillin,Penicillins,1.000,256.00,256.00000,8.000000,gi|226440723|gb|ACO57221.1|SFO-1| class A exte...,MVKNTLRQTTLMVATVMPLLFGSAPLWAQSANAKANIQQQLSELEK...,295,...,QSANAKANIQQQLSELEKNSGGRLGVALIDTADNSQILYRGDERFP...,"[0.12884726, -0.014014147, -0.14221768, 0.2163...","[-1.3039657, 0.1604022, 0.17407279, 0.09927225...","[0.1220339, 0.0033898305, 0.050847456, 0.04745...","[0.015114492, 0.04356348, 0.019283863, 0.03084...","[0.0176283, 0.018968944, -0.04833036, 0.040485...","[-0.46188685, 0.24787414, -0.09181613, -0.2707...","[0.1220339, 0.040677965, 0.050847456, 0.050847...","[9.180886, 8.3046465, 2.616951, 13.753964, -8....",-0.227483
1,SFO-1,Amoxicillin-CLA,Penicillins,2.000,256.00,128.00000,7.000000,gi|226440723|gb|ACO57221.1|SFO-1| class A exte...,MVKNTLRQTTLMVATVMPLLFGSAPLWAQSANAKANIQQQLSELEK...,295,...,QSANAKANIQQQLSELEKNSGGRLGVALIDTADNSQILYRGDERFP...,"[0.12884726, -0.014014147, -0.14221768, 0.2163...","[-1.3039657, 0.1604022, 0.17407279, 0.09927225...","[0.1220339, 0.0033898305, 0.050847456, 0.04745...","[0.015114492, 0.04356348, 0.019283863, 0.03084...","[0.0176283, 0.018968944, -0.04833036, 0.040485...","[-0.46188685, 0.24787414, -0.09181613, -0.2707...","[0.1220339, 0.040677965, 0.050847456, 0.050847...","[9.180886, 8.3046465, 2.616951, 13.753964, -8....",-0.227483
2,SFO-1,Ceftazidime,Cephalosporins,0.125,16.00,128.00000,7.000000,gi|226440723|gb|ACO57221.1|SFO-1| class A exte...,MVKNTLRQTTLMVATVMPLLFGSAPLWAQSANAKANIQQQLSELEK...,295,...,QSANAKANIQQQLSELEKNSGGRLGVALIDTADNSQILYRGDERFP...,"[0.12884726, -0.014014147, -0.14221768, 0.2163...","[-1.3039657, 0.1604022, 0.17407279, 0.09927225...","[0.1220339, 0.0033898305, 0.050847456, 0.04745...","[0.015114492, 0.04356348, 0.019283863, 0.03084...","[0.0176283, 0.018968944, -0.04833036, 0.040485...","[-0.46188685, 0.24787414, -0.09181613, -0.2707...","[0.1220339, 0.040677965, 0.050847456, 0.050847...","[9.180886, 8.3046465, 2.616951, 13.753964, -8....",-0.227483
3,SFO-1,Piperacillin,Penicillins,0.500,256.00,512.00000,9.000000,gi|226440723|gb|ACO57221.1|SFO-1| class A exte...,MVKNTLRQTTLMVATVMPLLFGSAPLWAQSANAKANIQQQLSELEK...,295,...,QSANAKANIQQQLSELEKNSGGRLGVALIDTADNSQILYRGDERFP...,"[0.12884726, -0.014014147, -0.14221768, 0.2163...","[-1.3039657, 0.1604022, 0.17407279, 0.09927225...","[0.1220339, 0.0033898305, 0.050847456, 0.04745...","[0.015114492, 0.04356348, 0.019283863, 0.03084...","[0.0176283, 0.018968944, -0.04833036, 0.040485...","[-0.46188685, 0.24787414, -0.09181613, -0.2707...","[0.1220339, 0.040677965, 0.050847456, 0.050847...","[9.180886, 8.3046465, 2.616951, 13.753964, -8....",-0.227483
4,SFO-1,Imipenem,Carbapenems,0.250,0.50,2.00000,1.000000,gi|226440723|gb|ACO57221.1|SFO-1| class A exte...,MVKNTLRQTTLMVATVMPLLFGSAPLWAQSANAKANIQQQLSELEK...,295,...,QSANAKANIQQQLSELEKNSGGRLGVALIDTADNSQILYRGDERFP...,"[0.12884726, -0.014014147, -0.14221768, 0.2163...","[-1.3039657, 0.1604022, 0.17407279, 0.09927225...","[0.1220339, 0.0033898305, 0.050847456, 0.04745...","[0.015114492, 0.04356348, 0.019283863, 0.03084...","[0.0176283, 0.018968944, -0.04833036, 0.040485...","[-0.46188685, 0.24787414, -0.09181613, -0.2707...","[0.1220339, 0.040677965, 0.050847456, 0.050847...","[9.180886, 8.3046465, 2.616951, 13.753964, -8....",-0.227483
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2378,OXA-97,Ampicillin,Penicillins,4.000,512.00,128.00000,7.000000,gi|132252066|gb|ABO33299.1|OXA-97| OXA-58 fami...,MKLLKILSLVCLSISIGACAEHSMSRAKTSTIPQVNNSIIDQNVQA...,280,...,CAEHSMSRAKTSTIPQVNNSIIDQNVQALFNEISGDAVFVTYDGQN...,"[0.12914298, 0.13434997, -0.19863299, 0.234990...","[-1.4116695, 0.5572757, 0.36231503, -0.5086213...","[0.08214286, 0.007142857, 0.05357143, 0.05, 0....","[0.012257758, 0.04736982, 0.019818751, 0.01624...","[0.013028533, -0.02087584, -0.029086696, 0.013...","[-0.2422296, 0.07161489, -0.10969193, -0.01573...","[0.08214286, 0.035714287, 0.035714287, 0.05357...","[9.83884, 8.274292, 3.2852623, 12.469421, -6.4...",-0.255275
2379,OXA-97,Imipenem,Carbapenems,0.060,1.00,16.66670,4.058897,gi|132252066|gb|ABO33299.1|OXA-97| OXA-58 fami...,MKLLKILSLVCLSISIGACAEHSMSRAKTSTIPQVNNSIIDQNVQA...,280,...,CAEHSMSRAKTSTIPQVNNSIIDQNVQALFNEISGDAVFVTYDGQN...,"[0.12914298, 0.13434997, -0.19863299, 0.234990...","[-1.4116695, 0.5572757, 0.36231503, -0.5086213...","[0.08214286, 0.007142857, 0.05357143, 0.05, 0....","[0.012257758, 0.04736982, 0.019818751, 0.01624...","[0.013028533, -0.02087584, -0.029086696, 0.013...","[-0.2422296, 0.07161489, -0.10969193, -0.01573...","[0.08214286, 0.035714287, 0.035714287, 0.05357...","[9.83884, 8.274292, 3.2852623, 12.469421, -6.4...",-0.255275
2380,OXA-97,Ticarcillin,Penicillins,4.000,512.00,128.00000,7.000000,gi|132252066|gb|ABO33299.1|OXA-97| OXA-58 fami...,MKLLKILSLVCLSISIGACAEHSMSRAKTSTIPQVNNSIIDQNVQA...,280,...,CAEHSMSRAKTSTIPQVNNSIIDQNVQALFNEISGDAVFVTYDGQN...,"[0.12914298, 0.13434997, -0.19863299, 0.234990...","[-1.4116695, 0.5572757, 0.36231503, -0.5086213...","[0.08214286, 0.007142857, 0.05357143, 0.05, 0....","[0.012257758, 0.04736982, 0.019818751, 0.01624...","[0.013028533, -0.02087584, -0.029086696, 0.013...","[-0.2422296, 0.07161489, -0.10969193, -0.01573...","[0.08214286, 0.035714287, 0.035714287, 0.05357...","[9.83884, 8.274292, 3.2852623, 12.469421, -6.4...",-0.255275
2381,OXA-97,Meropenem,Carbapenems,0.060,0.50,8.33333,3.058893,gi|132252066|gb|ABO33299.1|OXA-97| OXA-58 fami...,MKLLKILSLVCLSISIGACAEHSMSRAKTSTIPQVNNSIIDQNVQA...,280,...,CAEHSMSRAKTSTIPQVNNSIIDQNVQALFNEISGDAVFVTYDGQN...,"[0.12914298, 0.13434997, -0.19863299, 0.234990...","[-1.4116695, 0.5572757, 0.36231503, -0.5086213...","[0.08214286, 0.007142857, 0.05357143, 0.05, 0....","[0.012257758, 0.04736982, 0.019818751, 0.01624...","[0.013028533, -0.02087584, -0.029086696, 0.013...","[-0.2422296, 0.07161489, -0.10969193, -0.01573...","[0.08214286, 0.035714287, 0.035714287, 0.05357...","[9.83884, 8.274292, 3.2852623, 12.469421, -6.4...",-0.255275
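
As an illustration of what "merged" means here, the sketch below joins a toy MIC table with a toy sequence table. It assumes `protein_name` is the join key (it appears in both tables above, but the key actually used in notebook 04 is not shown here), and all values are stand-ins.

```python
import pandas as pd

# Toy stand-ins for the MIC table and the annotated sequence table
mics = pd.DataFrame({
    "protein_name": ["SFO-1", "SFO-1", "OXA-97"],
    "antibiotic": ["Amoxicillin", "Imipenem", "Meropenem"],
    "log2_fold": [8.0, 1.0, 3.058893],
})
seqs = pd.DataFrame({
    "protein_name": ["SFO-1", "OXA-97"],
    "length": [295, 280],
    "bla_class": ["Class A", "Class D"],
})

# Several MIC rows map to one sequence row; validate= raises if a protein
# appears more than once in the sequence table.
merged = mics.merge(seqs, on="protein_name", how="inner", validate="many_to_one")
print(merged)
```

With this layout the per-sequence columns (annotations, embeddings) are repeated for every antibiotic tested against the same protein, which matches the repeated values visible in the output above.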


`df_kins` contains the dataset of enzyme kinetics retrieved and curated from BLDB2

In [20]:
df_kins = pd.read_csv("../results/kinetics/kinetics_dset_clean.csv")
df_kins

Unnamed: 0,protein_name,bla_class,antibiotic,antibiotic_class,kcat,km,catalytic_efficiency
0,TEM-109,Class A,Amoxicillin,Penicillins,66.0,15.0,4.00
1,TEM-109,Class A,Cefuroxime,Cephalosporins,22.0,93.0,0.20
2,TEM-109,Class A,Ticarcillin,Penicillins,42.0,14.0,3.00
3,TEM-109,Class A,Benzylpenicillin,Penicillins,171.0,20.0,8.00
4,TEM-109,Class A,Piperacillin,Penicillins,152.0,64.0,2.00
...,...,...,...,...,...,...,...
838,OXA-63,Class D,Ampicillin,Penicillins,6.6,43.0,0.15
839,OXA-63,Class D,Benzylpenicillin,Penicillins,4.9,19.0,0.25
840,OXA-63,Class D,Oxacillin,Penicillins,113.0,115.0,0.98
841,OXA-63,Class D,Carbenicillin,Penicillins,1.1,17.0,0.07


`df_training_kins` contains the kinetics dataset merged with the sequence annotations and embeddings

In [21]:
df_training_kins = pd.read_pickle("../results/kinetics/training_dset_kinetics.pkl")
print(df_training_kins.columns)
df_training_kins

Index(['protein_name', 'antibiotic', 'antibiotic_class', 'kcat', 'km',
       'catalytic_efficiency', '#name', 'seq', 'length', 'filename',
       'protein_family', 'bla_class', 'bla_subclass', 'protein_family_header',
       'seq_id', 'molecular_weight', 'aromaticity', 'instability', 'gravy',
       'isoelectric_point', 'entropy', 'helix', 'turn', 'sheet',
       'is_clust90_rep', 'bitscore', 'Domain', 'Phylum', 'Class', 'Order',
       'Family', 'Genus', 'Species', 'predicted_signal_peptide',
       'Other_sigpept', 'SP(Sec/SPI)', 'LIPO(Sec/SPII)', 'TAT(Tat/SPI)',
       'TATLIPO(Sec/SPII)', 'PILIN(Sec/SPIII)', 'CS Position',
       'seq_without_sigpept', 'esm1b', 'esm', 'onehot', 't5xlu50', 't5bfd',
       'xlnet', 'bepler', 'carp640M', 'carp640M_logp'],
      dtype='object')


Unnamed: 0,protein_name,antibiotic,antibiotic_class,kcat,km,catalytic_efficiency,#name,seq,length,filename,...,seq_without_sigpept,esm1b,esm,onehot,t5xlu50,t5bfd,xlnet,bepler,carp640M,carp640M_logp
0,TEM-109,Amoxicillin,Penicillins,66.0,15.0,4.00,gi|48734422|gb|AAT46413.1|TEM-109| inhibitor-r...,MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIE...,286,A-TEM-109-prot.fasta,...,HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMLST...,"[0.099508114, -0.019877695, -0.12678668, 0.244...","[-1.956021, -0.2566141, 0.22891153, -0.4786245...","[0.1013986, 0.01048951, 0.055944055, 0.0664335...","[0.05116419, 0.104610845, 0.03149046, 0.047825...","[0.04448658, 0.057673056, 0.006611364, 0.03816...","[-0.29250857, 0.26983652, 0.03196501, -0.25515...","[0.1013986, 0.062937066, 0.027972028, 0.055944...","[9.46786, 10.623341, 4.9245224, 15.516748, -7....",-0.213099
1,TEM-109,Cefuroxime,Cephalosporins,22.0,93.0,0.20,gi|48734422|gb|AAT46413.1|TEM-109| inhibitor-r...,MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIE...,286,A-TEM-109-prot.fasta,...,HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMLST...,"[0.099508114, -0.019877695, -0.12678668, 0.244...","[-1.956021, -0.2566141, 0.22891153, -0.4786245...","[0.1013986, 0.01048951, 0.055944055, 0.0664335...","[0.05116419, 0.104610845, 0.03149046, 0.047825...","[0.04448658, 0.057673056, 0.006611364, 0.03816...","[-0.29250857, 0.26983652, 0.03196501, -0.25515...","[0.1013986, 0.062937066, 0.027972028, 0.055944...","[9.46786, 10.623341, 4.9245224, 15.516748, -7....",-0.213099
2,TEM-109,Ticarcillin,Penicillins,42.0,14.0,3.00,gi|48734422|gb|AAT46413.1|TEM-109| inhibitor-r...,MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIE...,286,A-TEM-109-prot.fasta,...,HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMLST...,"[0.099508114, -0.019877695, -0.12678668, 0.244...","[-1.956021, -0.2566141, 0.22891153, -0.4786245...","[0.1013986, 0.01048951, 0.055944055, 0.0664335...","[0.05116419, 0.104610845, 0.03149046, 0.047825...","[0.04448658, 0.057673056, 0.006611364, 0.03816...","[-0.29250857, 0.26983652, 0.03196501, -0.25515...","[0.1013986, 0.062937066, 0.027972028, 0.055944...","[9.46786, 10.623341, 4.9245224, 15.516748, -7....",-0.213099
3,TEM-109,Benzylpenicillin,Penicillins,171.0,20.0,8.00,gi|48734422|gb|AAT46413.1|TEM-109| inhibitor-r...,MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIE...,286,A-TEM-109-prot.fasta,...,HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMLST...,"[0.099508114, -0.019877695, -0.12678668, 0.244...","[-1.956021, -0.2566141, 0.22891153, -0.4786245...","[0.1013986, 0.01048951, 0.055944055, 0.0664335...","[0.05116419, 0.104610845, 0.03149046, 0.047825...","[0.04448658, 0.057673056, 0.006611364, 0.03816...","[-0.29250857, 0.26983652, 0.03196501, -0.25515...","[0.1013986, 0.062937066, 0.027972028, 0.055944...","[9.46786, 10.623341, 4.9245224, 15.516748, -7....",-0.213099
4,TEM-109,Piperacillin,Penicillins,152.0,64.0,2.00,gi|48734422|gb|AAT46413.1|TEM-109| inhibitor-r...,MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIE...,286,A-TEM-109-prot.fasta,...,HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMLST...,"[0.099508114, -0.019877695, -0.12678668, 0.244...","[-1.956021, -0.2566141, 0.22891153, -0.4786245...","[0.1013986, 0.01048951, 0.055944055, 0.0664335...","[0.05116419, 0.104610845, 0.03149046, 0.047825...","[0.04448658, 0.057673056, 0.006611364, 0.03816...","[-0.29250857, 0.26983652, 0.03196501, -0.25515...","[0.1013986, 0.062937066, 0.027972028, 0.055944...","[9.46786, 10.623341, 4.9245224, 15.516748, -7....",-0.213099
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
838,OXA-63,Ampicillin,Penicillins,6.6,43.0,0.15,gi|52854083|gb|AAU88145.1|OXA-63| oxacillin-hy...,MSKKNFILIFIFVILISCKNTEKISNETTLIDNIFTNSNAEGTLVI...,268,D-OXA-63-prot.fasta,...,CKNTEKISNETTLIDNIFTNSNAEGTLVIYNLNDDKYIIHNKERAE...,"[0.12508959, 0.132962, -0.2024833, 0.18717058,...","[-1.396667, 1.3622009, 0.42777172, -0.23479563...","[0.063432835, 0.0037313432, 0.05597015, 0.0746...","[0.009941773, 0.02417964, 0.03828127, 0.031959...","[0.043379027, -0.018720381, -0.018860383, 0.02...","[-0.2827446, -0.0847767, -0.179751, 0.26336384...","[0.063432835, 0.026119404, 0.09328358, 0.05597...","[9.563412, 8.137673, 2.5791569, 13.973948, -6....",-0.360572
839,OXA-63,Benzylpenicillin,Penicillins,4.9,19.0,0.25,gi|52854083|gb|AAU88145.1|OXA-63| oxacillin-hy...,MSKKNFILIFIFVILISCKNTEKISNETTLIDNIFTNSNAEGTLVI...,268,D-OXA-63-prot.fasta,...,CKNTEKISNETTLIDNIFTNSNAEGTLVIYNLNDDKYIIHNKERAE...,"[0.12508959, 0.132962, -0.2024833, 0.18717058,...","[-1.396667, 1.3622009, 0.42777172, -0.23479563...","[0.063432835, 0.0037313432, 0.05597015, 0.0746...","[0.009941773, 0.02417964, 0.03828127, 0.031959...","[0.043379027, -0.018720381, -0.018860383, 0.02...","[-0.2827446, -0.0847767, -0.179751, 0.26336384...","[0.063432835, 0.026119404, 0.09328358, 0.05597...","[9.563412, 8.137673, 2.5791569, 13.973948, -6....",-0.360572
840,OXA-63,Oxacillin,Penicillins,113.0,115.0,0.98,gi|52854083|gb|AAU88145.1|OXA-63| oxacillin-hy...,MSKKNFILIFIFVILISCKNTEKISNETTLIDNIFTNSNAEGTLVI...,268,D-OXA-63-prot.fasta,...,CKNTEKISNETTLIDNIFTNSNAEGTLVIYNLNDDKYIIHNKERAE...,"[0.12508959, 0.132962, -0.2024833, 0.18717058,...","[-1.396667, 1.3622009, 0.42777172, -0.23479563...","[0.063432835, 0.0037313432, 0.05597015, 0.0746...","[0.009941773, 0.02417964, 0.03828127, 0.031959...","[0.043379027, -0.018720381, -0.018860383, 0.02...","[-0.2827446, -0.0847767, -0.179751, 0.26336384...","[0.063432835, 0.026119404, 0.09328358, 0.05597...","[9.563412, 8.137673, 2.5791569, 13.973948, -6....",-0.360572
841,OXA-63,Carbenicillin,Penicillins,1.1,17.0,0.07,gi|52854083|gb|AAU88145.1|OXA-63| oxacillin-hy...,MSKKNFILIFIFVILISCKNTEKISNETTLIDNIFTNSNAEGTLVI...,268,D-OXA-63-prot.fasta,...,CKNTEKISNETTLIDNIFTNSNAEGTLVIYNLNDDKYIIHNKERAE...,"[0.12508959, 0.132962, -0.2024833, 0.18717058,...","[-1.396667, 1.3622009, 0.42777172, -0.23479563...","[0.063432835, 0.0037313432, 0.05597015, 0.0746...","[0.009941773, 0.02417964, 0.03828127, 0.031959...","[0.043379027, -0.018720381, -0.018860383, 0.02...","[-0.2827446, -0.0847767, -0.179751, 0.26336384...","[0.063432835, 0.026119404, 0.09328358, 0.05597...","[9.563412, 8.137673, 2.5791569, 13.973948, -6....",-0.360572
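
Each embedding column (`esm1b`, `esm`, `t5xlu50`, ...) stores one vector per row. A minimal sketch of how such a column could be turned into a feature matrix for a regressor is shown below. The 3-dimensional vectors and the choice of `catalytic_efficiency` as target are assumptions made for brevity.

```python
import numpy as np
import pandas as pd

# Toy frame: each cell of "esm1b" holds one fixed-length vector (3 dims here)
df = pd.DataFrame({
    "protein_name": ["TEM-109", "TEM-109", "OXA-63"],
    "esm1b": [
        np.array([0.10, -0.02, -0.13]),
        np.array([0.10, -0.02, -0.13]),
        np.array([0.13, 0.13, -0.20]),
    ],
    "catalytic_efficiency": [4.0, 0.2, 0.15],
})

# Stack the per-row vectors into a 2-D array of shape (n_rows, embedding_dim)
X = np.stack(df["esm1b"].tolist())
y = df["catalytic_efficiency"].to_numpy()
print(X.shape, y.shape)
```

Storing arrays inside cells is also the likely reason these tables are saved as pickles rather than CSV, since a CSV round trip would turn the vectors into strings.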


`df_smlies_canon` contains the canonical and isomeric SMILES for 50 betalactam antibiotics

In [22]:
df_smlies_canon = pd.read_csv("../results/tanimoto/betalactam_antibiotics_smiles.csv")
df_smlies_canon

Unnamed: 0,antibiotic,antibiotic_class,anti_class_expand,canonical_smile,isomeric_smile
0,Amoxicillin,Penicillins,Penicillins,CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=C(C=C3)O)N)C...,CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)[C@@H]...
1,Ampicillin,Penicillins,Penicillins,CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...,CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)[C@@H]...
2,PenicillinV,Penicillins,Penicillins,CC1(C(N2C(S1)C(C2=O)NC(=O)COC3=CC=CC=C3)C(=O)O)C,CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)COC3=C...
3,PenicillinG,Penicillins,Penicillins,CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C,CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)CC3=CC...
4,Dicloxacillin,Penicillins,Penicillins,CC1=C(C(=NO1)C2=C(C=CC=C2Cl)Cl)C(=O)NC3C4N(C3=...,CC1=C(C(=NO1)C2=C(C=CC=C2Cl)Cl)C(=O)N[C@H]3[C@...
5,Nafcillin,Penicillins,Penicillins,CCOC1=C(C2=CC=CC=C2C=C1)C(=O)NC3C4N(C3=O)C(C(S...,CCOC1=C(C2=CC=CC=C2C=C1)C(=O)N[C@H]3[C@@H]4N(C...
6,Oxacillin,Penicillins,Penicillins,CC1=C(C(=NO1)C2=CC=CC=C2)C(=O)NC3C4N(C3=O)C(C(...,CC1=C(C(=NO1)C2=CC=CC=C2)C(=O)N[C@H]3[C@@H]4N(...
7,Azlocillin,Penicillins,Penicillins,CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)NC(=O)...,CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)[C@@H]...
8,Carbenicillin,Penicillins,Penicillins,CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)C(=O)O...,CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)C(C3=C...
9,Methicillin,Penicillins,Penicillins,CC1(C(N2C(S1)C(C2=O)NC(=O)C3=C(C=CC=C3OC)OC)C(...,CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)C3=C(C...


`df_tanimoto_canon` contains the pairwise Tanimoto similarity computed from canonical SMILES

In [25]:
df_tanimoto_canon = pd.read_csv("../results/tanimoto/tanimoto_sim_canon.csv")
df_tanimoto_canon

Unnamed: 0,antibiotic,Amoxicillin,Ampicillin,PenicillinV,PenicillinG,Dicloxacillin,Nafcillin,Oxacillin,Azlocillin,Carbenicillin,...,Carumonam,Nocardicin_a,Tigemonam,Imipenem,Meropenem,Biapenem,Doripenem,Ertapenem,Panipenem,antibiotic_class
0,Amoxicillin,1.0,0.736842,0.528736,0.547619,0.45098,0.5,0.484536,0.490385,0.593023,...,0.195122,0.296875,0.236842,0.174757,0.192982,0.166667,0.188034,0.226562,0.183486,Penicillins
1,Ampicillin,0.736842,1.0,0.679487,0.706667,0.46,0.542553,0.595506,0.62766,0.776316,...,0.178862,0.205882,0.219298,0.15534,0.175439,0.159292,0.17094,0.220472,0.165138,Penicillins
2,PenicillinV,0.528736,0.679487,1.0,0.767123,0.455446,0.586957,0.588889,0.524752,0.638554,...,0.196721,0.213235,0.238938,0.142857,0.163793,0.147826,0.159664,0.209302,0.153153,Penicillins
3,PenicillinG,0.547619,0.706667,0.767123,1.0,0.469388,0.571429,0.609195,0.540816,0.6625,...,0.181818,0.182482,0.223214,0.147059,0.168142,0.151786,0.163793,0.214286,0.157407,Penicillins
4,Dicloxacillin,0.45098,0.46,0.455446,0.469388,1.0,0.504673,0.698925,0.385246,0.438095,...,0.166667,0.16129,0.20155,0.125,0.162791,0.148438,0.141791,0.212766,0.144,Penicillins
5,Nafcillin,0.5,0.542553,0.586957,0.571429,0.504673,1.0,0.602041,0.448276,0.515152,...,0.176471,0.226027,0.212598,0.135593,0.164062,0.140625,0.151515,0.214286,0.145161,Penicillins
6,Oxacillin,0.484536,0.595506,0.588889,0.609195,0.698925,0.602041,1.0,0.486486,0.56383,...,0.171642,0.181208,0.208,0.12931,0.168,0.153226,0.146154,0.218978,0.14876,Penicillins
7,Azlocillin,0.490385,0.62766,0.524752,0.540816,0.385246,0.448276,0.486486,1.0,0.645833,...,0.183099,0.198718,0.208955,0.136,0.180451,0.166667,0.176471,0.244755,0.181102,Penicillins
8,Carbenicillin,0.593023,0.776316,0.638554,0.6625,0.438095,0.515152,0.56383,0.645833,1.0,...,0.2,0.233577,0.230769,0.148148,0.177966,0.152542,0.173554,0.259843,0.168142,Penicillins
9,Methicillin,0.505495,0.516854,0.52809,0.528736,0.556701,0.578947,0.56383,0.423423,0.489362,...,0.190476,0.173611,0.230769,0.137615,0.188034,0.152542,0.154472,0.221374,0.157895,Penicillins
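
For reference, the Tanimoto similarity between two binary fingerprints is the number of shared "on" bits divided by the number of bits set in either one. The sketch below computes a small pairwise matrix in plain Python. The fingerprints are hypothetical bit sets, not derived from the SMILES above, and the fingerprint type used in notebook 04 is not stated here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical fingerprints (indices of set bits), not real molecules
fingerprints = {
    "drug_a": {1, 4, 7, 9},
    "drug_b": {1, 4, 8, 9},
    "drug_c": {2, 3},
}

names = list(fingerprints)
matrix = [[tanimoto(fingerprints[r], fingerprints[c]) for c in names] for r in names]
for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```

The result is symmetric with 1.0 on the diagonal, the same pattern as the table above (for example, Amoxicillin/Ampicillin is 0.736842 in both directions).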


`df_tanimoto_iso` contains the pairwise Tanimoto similarity computed from isomeric SMILES

In [26]:
df_tanimoto_iso = pd.read_csv("../results/tanimoto/tanimoto_sim_iso.csv")
df_tanimoto_iso

Unnamed: 0,antibiotic,Amoxicillin,Ampicillin,PenicillinV,PenicillinG,Dicloxacillin,Nafcillin,Oxacillin,Azlocillin,Carbenicillin,...,Carumonam,Nocardicin_a,Tigemonam,Imipenem,Meropenem,Biapenem,Doripenem,Ertapenem,Panipenem,antibiotic_class
0,Amoxicillin,1.0,0.736842,0.528736,0.547619,0.45098,0.5,0.484536,0.490385,0.593023,...,0.195122,0.296875,0.236842,0.174757,0.192982,0.166667,0.188034,0.226562,0.183486,Penicillins
1,Ampicillin,0.736842,1.0,0.679487,0.706667,0.46,0.542553,0.595506,0.62766,0.776316,...,0.178862,0.205882,0.219298,0.15534,0.175439,0.159292,0.17094,0.220472,0.165138,Penicillins
2,PenicillinV,0.528736,0.679487,1.0,0.767123,0.455446,0.586957,0.588889,0.524752,0.638554,...,0.196721,0.213235,0.238938,0.142857,0.163793,0.147826,0.159664,0.209302,0.153153,Penicillins
3,PenicillinG,0.547619,0.706667,0.767123,1.0,0.469388,0.571429,0.609195,0.540816,0.6625,...,0.181818,0.182482,0.223214,0.147059,0.168142,0.151786,0.163793,0.214286,0.157407,Penicillins
4,Dicloxacillin,0.45098,0.46,0.455446,0.469388,1.0,0.504673,0.698925,0.385246,0.438095,...,0.166667,0.16129,0.20155,0.125,0.162791,0.148438,0.141791,0.212766,0.144,Penicillins
5,Nafcillin,0.5,0.542553,0.586957,0.571429,0.504673,1.0,0.602041,0.448276,0.515152,...,0.176471,0.226027,0.212598,0.135593,0.164062,0.140625,0.151515,0.214286,0.145161,Penicillins
6,Oxacillin,0.484536,0.595506,0.588889,0.609195,0.698925,0.602041,1.0,0.486486,0.56383,...,0.171642,0.181208,0.208,0.12931,0.168,0.153226,0.146154,0.218978,0.14876,Penicillins
7,Azlocillin,0.490385,0.62766,0.524752,0.540816,0.385246,0.448276,0.486486,1.0,0.645833,...,0.183099,0.198718,0.208955,0.136,0.180451,0.166667,0.176471,0.244755,0.181102,Penicillins
8,Carbenicillin,0.593023,0.776316,0.638554,0.6625,0.438095,0.515152,0.56383,0.645833,1.0,...,0.2,0.233577,0.230769,0.148148,0.177966,0.152542,0.173554,0.259843,0.168142,Penicillins
9,Methicillin,0.505495,0.516854,0.52809,0.528736,0.556701,0.578947,0.56383,0.423423,0.489362,...,0.190476,0.173611,0.230769,0.137615,0.188034,0.152542,0.154472,0.221374,0.157895,Penicillins


## Notebook 

`df_corr_mic_tani` contains pairs of 16 distinct antibiotics together with their Tanimoto similarity, the Spearman correlation of their MICs, and the associated p-value

In [27]:
df_corr_mic_tani = pd.read_csv("../results/tanimoto/paired_tanimoto_mic_pval.csv")
df_corr_mic_tani

Unnamed: 0,pair1,pair2,tanimoto,mic,mic_pval
0,Amoxicillin,Amoxicillin,1.000000,1.000000,1.000000e+00
1,Amoxicillin,Ticarcillin,0.576471,0.872485,1.020054e-24
2,Amoxicillin,Piperacillin,0.447368,0.490132,2.219791e-06
3,Amoxicillin,Ampicillin,0.736842,0.907407,1.849315e-03
4,Amoxicillin,Aztreonam,0.228070,0.154695,1.549769e-01
...,...,...,...,...,...
251,Meropenem,Ceftazidime,0.147651,0.208500,3.937363e-02
252,Meropenem,Ceftriaxone,0.183099,0.205978,4.440674e-01
253,Meropenem,Cefotaxime,0.188976,0.174549,1.018389e-01
254,Meropenem,Cefepime,0.153285,0.077528,4.601198e-01
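
A sketch of how the `mic` and `mic_pval` columns could be obtained for one antibiotic pair is given below, using `scipy.stats.spearmanr` on proteins measured against both antibiotics. The data are hypothetical, and whether the original analysis correlated `log2_fold`, `fold` or raw MICs is an assumption.

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy long-format MIC data (hypothetical values), one row per protein/antibiotic
mics = pd.DataFrame({
    "protein_name": ["P1", "P2", "P3", "P4"] * 2,
    "antibiotic": ["Amoxicillin"] * 4 + ["Ticarcillin"] * 4,
    "log2_fold": [8.0, 5.0, 1.0, 3.0, 7.0, 6.0, 0.5, 2.0],
})

# One column per antibiotic, one row per protein
wide = mics.pivot(index="protein_name", columns="antibiotic", values="log2_fold")

# Keep only proteins measured against both antibiotics before correlating
shared = wide[["Amoxicillin", "Ticarcillin"]].dropna()
rho, pval = spearmanr(shared["Amoxicillin"], shared["Ticarcillin"])
print(f"rho={rho:.3f}, p={pval:.3g}, n={len(shared)}")
```

Because Spearman correlation depends only on ranks, using `fold` instead of `log2_fold` would give the same coefficient. The number of shared proteins differs from pair to pair, which affects the p-value.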


## Notebook 

## Notebook 

## Notebook 

## Notebook 

## Notebook 

# Main conclusion and plots


## notebook 1 

1. Creation of the dataset 

![image.png](attachment:image.png)

## notebook 2 

1. Creation of embeddings from protein language models 

![image-2.png](attachment:image-2.png)

## notebook 3

1. Comparison of how 7 distinct protein language models and amino acid composition organize the sequences.
2. Comparison of 3 dimensionality reduction algorithms (PCA, tSNE and UMAP)
3. Hyperparameter effects in tSNE and UMAP

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)
![image-5.png](attachment:image-5.png)
![image-6.png](attachment:image-6.png)
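
Of the three methods compared, PCA is the only linear one and has no neighbourhood hyperparameters, so it makes a convenient baseline. A minimal NumPy sketch of a 2-D PCA projection follows. The random matrix stands in for an embedding matrix, and `pca_2d` is a hypothetical helper, not the code used in notebook 03.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))  # stand-in for an (n_sequences, embedding_dim) matrix


def pca_2d(X):
    """Project the rows of X onto their first two principal components via SVD."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    explained = (s[:2] ** 2) / np.sum(s ** 2)  # fraction of variance per component
    return coords, explained


coords, explained = pca_2d(X)
print(coords.shape, explained)
```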

## notebook 4 

1. The "manually curated" DB of 50 betalactam antibiotics contains errors
2. Based on their SMILES encoding, penicillins are relatively more similar to one another than the other main classes of betalactam antibiotics are

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)

## notebook 5 

1. 3 outlier points were removed from the MICs dataset because they have extreme log2_fold values
2. The distributions of betalactam antibiotics follow the trends reported in the literature
3. There is "functional redundancy" in cephalosporins and penicillins

![bokeh_plot%20%281%29.png](attachment:bokeh_plot%20%281%29.png)



![bokeh_plot%20%282%29.png](attachment:bokeh_plot%20%282%29.png)



![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

![image-2.png](attachment:image-2.png)
![image.png](attachment:image.png)

## notebook 6


## notebook 7

## notebook 8

## notebook 9


## notebook 10