# PLAPT affinities only for HRAS-P01112

PLAPT: Protein-Ligand Binding Affinity Prediction Using Pretrained Transformers

https://www.biorxiv.org/content/10.1101/2024.02.08.575577v3.full

https://github.com/trrt-good/WELP-PLAPT/tree/main

Predict the binding affinity of a protein - ligand complex from the SMILES string of the ligand and the amino-acid sequence of the protein.

We download the ligand structures from https://hmdb.ca/downloads# as one single multi-compound file in SDF format (hmdb_structures_v5.sdf). SDF is a format for ligands that includes the coordinates of their atoms. We need only the SMILES field of each structure/ligand. This script assumes that all the fields we need are present in the multi-SDF file. If a ligand has no SMILES field, the script must be modified to calculate the SMILES for that ligand.
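The extraction of the fields from the multi-SDF file is not part of this notebook. As a rough illustration only, a minimal pure-Python sketch could look like the one below. It assumes that each record ends with `$$$$`, that data fields are introduced by header lines such as `> <SMILES>`, and that the tag names match the column names used later in this notebook (`DATABASE_ID`, `SMILES`, `GENERIC_NAME`). The function name `parse_sdf_fields` and the example record are made up for the illustration.

```python
import re

# Data-field header line in an SDF record, e.g. "> <SMILES>"
FIELD_HEADER = re.compile(r"^>\s*<([^>]+)>")

def parse_sdf_fields(sdf_text, wanted=("DATABASE_ID", "SMILES", "GENERIC_NAME")):
    """Collect the wanted data fields from each record of a multi-SDF string.

    Hypothetical helper: only the first value line of each field is kept,
    and a missing field is returned as an empty string.
    """
    records, current, field = [], {}, None
    for line in sdf_text.splitlines():
        header = FIELD_HEADER.match(line)
        if line.strip() == "$$$$":           # end of one compound record
            records.append({name: current.get(name, "") for name in wanted})
            current, field = {}, None
        elif header:                          # start of a data field
            field = header.group(1)
            current[field] = ""
        elif field is not None:
            if not line.strip():              # blank line closes the field value
                field = None
            elif not current[field]:          # keep the first value line only
                current[field] = line.strip()
    return records

# Made-up record (not a real HMDB entry); the molblock is a placeholder.
example = """\
example molecule
  molblock lines would go here
M  END
> <DATABASE_ID>
EXAMPLE0001

> <SMILES>
CCO

> <GENERIC_NAME>
Example compound, with a comma

$$$$
"""

records = parse_sdf_fields(example)
print(records)
```

An empty `SMILES` value in a returned record marks a ligand for which the SMILES would have to be calculated separately.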

From the SMILES - protein sequence pairs, the deep learning model PLAPT predicts the negative base-10 logarithm of the binding affinity and the binding affinity itself. The model uses the pre-trained transformers ProtBERT and ChemBERTa to transform the protein sequence and the SMILES string into embeddings, which serve as input to the affinity predictor.
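The two predicted values appear later in the results as `neg_log10_affinity_M` and `affinity_uM`. Assuming these names mean what they say (the negative base-10 logarithm of the affinity in mol/L, and the same affinity in micromolar), the two are related by a simple conversion. The helper below is only an illustration of that arithmetic, not a description of how PLAPT computes its output internally.

```python
def affinity_uM_from_neg_log10(neg_log10_affinity_M):
    """Convert -log10(affinity in mol/L) into an affinity in micromolar."""
    affinity_M = 10.0 ** (-neg_log10_affinity_M)  # back to mol/L
    return affinity_M * 1e6                       # mol/L -> micromol/L

# A higher neg_log10 value corresponds to a lower micromolar value,
# i.e. a stronger predicted binder.
for value in (4.0, 6.0, 9.0):
    print(value, affinity_uM_from_neg_log10(value))
```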

This script calculates the binding affinities for only one protein (HRAS, P01112) and multiple ligands. Keeping one target protein per script makes it possible to run several scripts, one for each target protein.
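Pairing one protein with many ligands means repeating the same sequence once per SMILES string. With more than 200,000 ligands it may be convenient to process the pairs in chunks. The sketch below shows one possible way to do this. `predict_in_chunks`, `predict_fn` and `fake_predict` are hypothetical names: `predict_fn` stands for any function that takes two lists of equal length (sequences and SMILES) and returns one result per pair, which is the role `plapt.predict_affinity` plays in this notebook.

```python
def predict_in_chunks(sequence, smiles, predict_fn, chunk_size=10000):
    """Pair one protein sequence with every ligand, one chunk at a time."""
    results = []
    for start in range(0, len(smiles), chunk_size):
        chunk = smiles[start:start + chunk_size]
        # repeat the single sequence so both lists have the same length
        results.extend(predict_fn([sequence] * len(chunk), chunk))
    return results

def fake_predict(sequences, smiles):
    """Stand-in for a real model: returns the length of each SMILES string."""
    return [{"smiles_length": len(s)} for s in smiles]

# "MKTAYIAK" is a made-up sequence fragment used only for the demonstration.
demo = predict_in_chunks("MKTAYIAK", ["CCO", "O", "c1ccccc1"], fake_predict, chunk_size=2)
print(demo)
```

The order of the results follows the order of the input SMILES, so the ligand information can still be attached by position afterwards.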

Thus, these are the steps:

- Reading the SMILES and other info for the ligands.
- Reading the sequence of the protein from the FASTA file.
- Predicting the binding affinities.
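In the cells below, the protein sequences are already stored in a table (column `V3` of `InfoBestGenes.csv`). If the starting point is a FASTA file instead, a minimal reader could look like the following sketch. The function name `read_fasta` and the example entries are made up for the illustration; the code assumes the usual FASTA layout of a `>` header line followed by one or more sequence lines.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                        # skip empty lines
        if line.startswith(">"):
            if header is not None:          # close the previous entry
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)             # sequences may span several lines
    if header is not None:                  # close the last entry
        entries.append((header, "".join(chunks)))
    return entries

# Made-up entries, not real proteins.
example_fasta = """\
>example_1 made-up protein
MKTAYIAK
QRQISFVK
>example_2 another made-up protein
MSDNE
"""

entries = read_fasta(example_fasta)
print(entries)
```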

Because the ligand descriptions contain commas, we use TAB-separated files for the ligand data and the results, not CSV.
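The small example below illustrates the problem with a made-up ligand row (the ID and the name are not real HMDB values). It uses only the standard library: splitting a comma-joined line naively produces one field too many, while the TAB-separated line splits back into the original three fields.

```python
import csv
import io

rows = [
    ["DATABASE_ID", "GENERIC_NAME", "SMILES"],
    ["EXAMPLE0001", "Example acid, sodium salt", "CC(=O)[O-].[Na+]"],  # made-up row
]

# Write the rows as TAB-separated text.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()

# Naive comma splitting breaks the name into two fields...
naive_csv_fields = ",".join(rows[1]).split(",")
# ...while TAB splitting recovers the original fields.
tsv_fields = tsv_text.splitlines()[1].split("\t")

print(len(naive_csv_fields), naive_csv_fields)
print(len(tsv_fields), tsv_fields)
```

A CSV writer with proper quoting would also handle the commas; TAB separation simply keeps the files easy to read with tools that do not interpret quotes.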

## Import the libraries

In [1]:
import time
import pandas as pd
import re

import torch

## Settings

In [2]:
# info we need for ligands (extracted from multisdf_file)
LigandInfo = './ligands_hmdb.tsv'

# gene, protein accession, FASTA description and sequence (columns gene, V1, V2, V3)
BestPredProts = './InfoBestGenes.csv'

## Get protein sequences

In [3]:
df_BestProts = pd.read_csv(BestPredProts)

In [4]:
ProtSeqs = df_BestProts['V3'].tolist()
print(ProtSeqs[0])
print(len(ProtSeqs))

MELWRQCTHWLIQCRVLPPSHRVTWDGAQVCELAQALRDGVLLCQLLNNLLPHAINLREVNLRPQMSQFLCLKNIRTFLSTCCEKFGLKRSELFEAFDLFDVQDFGKVIYTLSALSWTPIAQNRGIMPFPTEEESVGDEDIYSGLSDQIDDTVEEDEDLYDCVENEEAEGDEIYEDLMRSEPVSMPPKMTEYDKRCCCLREIQQTEEKYTDTLGSIQQHFLKPLQRFLKPQDIEIIFINIEDLLRVHTHFLKEMKEALGTPGAANLYQVFIKYKERFLVYGRYCSQVESASKHLDRVAAAREDVQMKLEECSQRANNGRFTLRDLLMVPMQRVLKYHLLLQELVKHTQEAMEKENLRLALDAMRDLAQCVNEVKRDNETLRQITNFQLSIENLDQSLAHYGRPKIDGELKITSVERRSKMDRYAFLLDKALLICKRRGDSYDLKDFVNLHSFQVRDDSSGDRDNKKWSHMFLLIEDQGAQGYELFFKTRELKKKWMEQFEMAISNIYPENATANGHDFQMFSFEETTSCKACQMLLRGTFYQGYRCHRCRASAHKECLGRVPPCGRHGQDFPGTMKKDKLHRRAQDKKRNELGLPKMEVFQEYYGLPPPPGAIGPFLRLNPGDIVELTKAEAEQNWWEGRNTSTNEIGWFPCNRVKPYVHGPPQDLSVHLWYAGPMERAGAESILANRSDGTFLVRQRVKDAAEFAISIKYNVEVKHIKIMTAEGLYRITEKKAFRGLTELVEFYQQNSLKDCFKSLDTTLQFPFKEPEKRTISRPAVGSTKYFGTAKARYDFCARDRSELSLKEGDIIKILNKKGQQGWWRGEIYGRVGWFPANYVEEDYSEYC
23


In [5]:
df_BestProts

Unnamed: 0,gene,V1,V2,V3
0,VAV1,P15498,VAV_HUMAN Proto-oncogene vav OS=Homo sapiens O...,MELWRQCTHWLIQCRVLPPSHRVTWDGAQVCELAQALRDGVLLCQL...
1,TSC1,Q92574,TSC1_HUMAN Hamartin OS=Homo sapiens OX=9606 GN...,MAQQANVGELLAMLDSPMLGVRDDVTAVFKENLNSDRGPMLVNTLV...
2,TPR,P12270,TPR_HUMAN Nucleoprotein TPR OS=Homo sapiens OX...,MAAVLQQVLERTELNKLPKSVQNKLEKFLADQQSEIDGLKGRHEKF...
3,SMARCA4,P51532,SMCA4_HUMAN Transcription activator BRG1 OS=Ho...,MSTPDPPLGGTPRPGPSPGPGPSPGAMLGPSPGPSPGSAHSMMGPS...
4,SETD2,Q9BYW2,SETD2_HUMAN Histone-lysine N-methyltransferase...,MKQLQPQPPPKMGDFYDPEHPTPEEEENEAKIENVQKTGFIKGPMF...
5,RB1,P06400,RB_HUMAN Retinoblastoma-associated protein OS=...,MPPKTPRKTAATAAAAAAEPPAPPPPPPPEEDPEQDSGPEDLPLVR...
6,PREX2,Q70Z35,PREX2_HUMAN Phosphatidylinositol 3_4_5-trispho...,MSEDSRGDSRAESAKDLEKQLRLRVCVLSELQKTERDYVGTLEFLV...
7,PPP2R1A,P30153,2AAA_HUMAN Serine/threonine-protein phosphatas...,MAAADGDDSLYPIAVLIDELRNEDVQLRLNSIKKLSTIALALGVER...
8,NBN,O60934,NBN_HUMAN Nibrin OS=Homo sapiens OX=9606 GN=NB...,MWKLLPAAGPAGGEPYRLLTGVEYVVGRKNCAILIENDQSISRNHA...
9,MUTYH,Q9UIF7,MUTYH_HUMAN Adenine DNA glycosylase OS=Homo sa...,MTPLVSRLSRLWAIMRKPRAAVGSGHRKQAASQEGRQKHAKNNSQA...


In [None]:
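# keep only the row at position 12 (HRAS-P01112); the double brackets keep
# the result a DataFrame, so .iterrows() still works in the loop below
df_BestProts = df_BestProts.iloc[[12]]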

## Get ligand SMILES

In [7]:
# Read the file with ligand SMILES
df_ligands = pd.read_csv(LigandInfo, sep='\t')

In [8]:
# get only the list of SMILES
smiles = list(df_ligands['SMILES'])

In [9]:
# check the number of ligands / SMILES
len(smiles)

217776

## Calculate affinity ligand - protein

In [None]:
from plapt import Plapt

for index, row in df_BestProts.iterrows():
    
    xgene  = row['gene']
    xprot  = row['V1']
    xdescr = row['V2']
    xseq   = row['V3']
    print(f"\n-> {index} = Gene:{row['gene']}, Prot:{row['V1']}, Info:{row['V2']}, Seq:{row['V3']}")
    
    sequences = [xseq] * len(smiles)
    
    # set cuda for the calculations
    plapt = Plapt(device="cuda")
    
    # set a timer
    start_time = time.time()

    # calculate affinities for all protein - ligand pairs using two lists of equal length (sequences and smiles)
    results = plapt.predict_affinity(sequences, smiles)

    end_time = time.time()
    execution_time = end_time - start_time

    print("Execution time:", execution_time, "seconds")
    print("Exec time in hours = ", execution_time/60/60)
    
    
    # get the results as dataframe
    data = {
        "smiles": smiles,
        "neg_log10_affinity_M": [d["neg_log10_affinity_M"] for d in results],
        "affinity_uM": [d["affinity_uM"] for d in results],
    }
    df_affinities = pd.DataFrame(data)
    
    # add ligand info columns to the affinity results
    df_affinities['DATABASE_ID']  = list(df_ligands['DATABASE_ID'])
    df_affinities['HMDB_ID']  = list(df_ligands['HMDB_ID'])
    df_affinities['GENERIC_NAME'] = list(df_ligands['GENERIC_NAME'])
    
    # sort the results by affinity in uM, ascending (lowest value = strongest predicted binding first)
    df_affinities = df_affinities.sort_values(by='affinity_uM')
    
    # add columns with the gene, protein ID, FASTA description and sequence of the target protein
    df_affinities['GeneID'] = xgene
    df_affinities['ProtID'] = xprot
    df_affinities['FastaDescription'] = xdescr
    df_affinities['ProtSequence'] = xseq
    
    outFile = './results/affinities_hmdb_HRAS-P01112.tsv'
    df_affinities.to_csv(outFile, sep='\t', na_rep='N/A', index=False)
    
    
    del sequences
    del results
    del data
    del df_affinities
    del plapt
    torch.cuda.empty_cache()