<a href="https://colab.research.google.com/github/kithmini-wijesiri/protein-structure-prediction-with-ESMFold/blob/main/human_GNAT1_structure_with_ESMFold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Protein Structure Prediction with ESMFold**

We load the ESMFold model and its tokenizer from the Hugging Face Hub.

In [4]:
#install the requirements
!pip install torch
!pip install transformers
!pip install py3Dmol
!pip install accelerate



In [5]:
#load the tokenizer and model from hugging face
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1", low_cpu_mem_usage=True)

model = model.cuda()

## To run on CPU instead, skip the .cuda() call above and use:
#model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1", low_cpu_mem_usage=True)
#model = model.cpu()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of EsmForProteinFolding were not initialized from the model checkpoint at facebook/esmfold_v1 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


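Because ESMFold is a large model, memory usage and performance need some care. The next cell converts the language model to half precision, enables TF32 matrix multiplication, and sets a chunk size for the folding trunk. This step matters if you have less than 16 GB of GPU memory or are using a free Colab notebook.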

In [6]:
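import torch

# Keep the ESM language model in float16 to roughly halve its memory footprint
model.esm = model.esm.half()
# Allow TF32 matrix multiplications, which are faster on GPUs that support them
torch.backends.cuda.matmul.allow_tf32 = True
# Process the folding trunk in chunks to lower peak memory, at some cost in speed
model.trunk.set_chunk_size(64)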
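
To see why half precision helps, here is a rough back-of-the-envelope sketch. The figure of about 3 billion parameters for the language model is an assumption used for illustration (it is not stated in this notebook), and `weight_memory_gb` is a hypothetical helper. The estimate covers weights only and ignores activations and the folding trunk.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption for illustration: roughly 3 billion language-model parameters.
ASSUMED_LM_PARAMS = 3_000_000_000

fp32_gb = weight_memory_gb(ASSUMED_LM_PARAMS, 4)  # float32 uses 4 bytes per value
fp16_gb = weight_memory_gb(ASSUMED_LM_PARAMS, 2)  # float16 uses 2 bytes per value

print(f"float32: {fp32_gb:.1f} GB, float16: {fp16_gb:.1f} GB")
```

Under that assumption, calling `.half()` on the language model frees several gigabytes, which can decide whether the model fits on a smaller GPU.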

### **Folding a Single Protein Chain** <a name="sESM"></a>

Input Protein Sequence:

In [7]:
#This is the sequence for human GNAT1
test_protein = "MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF"
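
Before folding, it can be worth checking that a pasted sequence contains only the 20 standard one-letter amino acid codes. The sketch below is a conservative sanity check, not part of the original notebook: `find_invalid_residues` is a hypothetical helper, and rejecting every non-standard letter is an assumption about what you want to flag.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence):
    """Return (1-based position, character) pairs for non-standard letters."""
    return [
        (position, residue)
        for position, residue in enumerate(sequence.upper(), start=1)
        if residue not in STANDARD_AMINO_ACIDS
    ]


# First ten residues of the GNAT1 sequence above: nothing to report.
print(find_invalid_residues("MGAGASAEEK"))

# A sequence with an unknown residue and a stray stop symbol.
print(find_invalid_residues("MGAXGA*"))
```

An empty list means the sequence passed the check; otherwise the positions point to the characters to review, such as stray whitespace or stop symbols left over from copying.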

Tokenize the protein sequence

In [8]:
#tokenize the input protein
tokenized_input = tokenizer([test_protein], return_tensors="pt", add_special_tokens=False)['input_ids']

#If you're using a GPU, you'll need to move the tokenized data to the GPU now.
tokenized_input = tokenized_input.cuda()

Now, we predict the 3D structure.

In [9]:
#generate 3d structure
with torch.no_grad():
    output = model(tokenized_input)

We save the predicted structure to a PDB file so that it can be reused for downstream tasks such as druggability assessment or functional domain prediction.

In [10]:
from transformers.models.esm.openfold_utils.protein import to_pdb, Protein as OFProtein
from transformers.models.esm.openfold_utils.feats import atom14_to_atom37

def convert_outputs_to_pdb(outputs):
    final_atom_positions = atom14_to_atom37(outputs["positions"][-1], outputs)
    outputs = {k: v.to("cpu").numpy() for k, v in outputs.items()}
    final_atom_positions = final_atom_positions.cpu().numpy()
    final_atom_mask = outputs["atom37_atom_exists"]
    pdbs = []
    for i in range(outputs["aatype"].shape[0]):
        aa = outputs["aatype"][i]
        pred_pos = final_atom_positions[i]
        mask = final_atom_mask[i]
        resid = outputs["residue_index"][i] + 1
        pred = OFProtein(
            aatype=aa,
            atom_positions=pred_pos,
            atom_mask=mask,
            residue_index=resid,
            b_factors=outputs["plddt"][i],
            chain_index=outputs["chain_index"][i] if "chain_index" in outputs else None,
        )
        pdbs.append(to_pdb(pred))
    return pdbs

# Convert the model output into a list of PDB-format strings (one per sequence)
pdb = convert_outputs_to_pdb(output)

In [11]:
# Save the PDB file
with open("./our_predicted_3D_structure.pdb", "w") as f:
    f.write(pdb[0])
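
Because the conversion function above passes `plddt` as `b_factors`, the confidence scores travel with the PDB file in the B-factor column. Here is a minimal sketch of reading them back using the fixed-column PDB layout. `atom_line` and `per_residue_bfactor` are hypothetical helpers, and the two-residue record is invented demo data rather than model output.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z, b):
    """Build one fixed-width PDB ATOM record (demo helper, occupancy fixed at 1.00)."""
    return (
        f"ATOM  {serial:5d} {name:^4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.00:6.2f}{b:6.2f}"
    )


def per_residue_bfactor(pdb_text):
    """Map residue number to the B-factor of its alpha carbon (CA)."""
    scores = {}
    for line in pdb_text.splitlines():
        # Atom name: columns 13-16, residue number: 23-26, B-factor: 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


# Invented example records, not real model output.
demo_pdb = "\n".join([
    atom_line(1, "N", "MET", "A", 1, 1.0, 2.0, 3.0, 0.85),
    atom_line(2, "CA", "MET", "A", 1, 1.5, 2.5, 3.5, 0.85),
    atom_line(3, "CA", "GLY", "A", 2, 4.0, 5.0, 6.0, 0.42),
])

print(per_residue_bfactor(demo_pdb))
```

To apply the same function to the prediction, pass it the text of the saved file or `pdb[0]` directly.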

Visualise the Predicted Structure

In [12]:
!pip install nglview
!apt-get install -y libgl1-mesa-glx
import nglview as nv



In [13]:
!pip install py3Dmol



In [14]:
# Visualise the protein structure
import py3Dmol

view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js', width=800, height=400)
view.addModel("".join(pdb), 'pdb')
view.setStyle({'model': -1}, {"cartoon": {'color': 'spectrum'}})
# Centre and scale the camera on the loaded structure
view.zoomTo()

<py3Dmol.view at 0x7817d82977c0>

The pLDDT (predicted Local Distance Difference Test) score is a key output of ESMFold. It measures the model's confidence in the predicted structure for each residue. In the version used in this notebook it ranges from 0 to 1, with 1 indicating the highest confidence.
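
To make individual scores easier to interpret, we can bin them into named confidence bands. The sketch below borrows the thresholds commonly used for AlphaFold models (90, 70 and 50 on the 0-100 scale). Applying them to ESMFold is an assumption, and `confidence_band` is a hypothetical helper. Values up to 1.0 are treated as being on the 0-1 scale.

```python
def confidence_band(plddt):
    """Name the confidence band for one pLDDT value on either the 0-1 or 0-100 scale."""
    # Assumption: anything at or below 1.0 is on the 0-1 scale.
    score = plddt * 100 if plddt <= 1.0 else plddt
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


for value in (0.95, 82, 0.6, 35):
    print(value, "->", confidence_band(value))
```

Residues in the lower bands often fall in flexible loops or disordered regions, so they deserve extra caution in downstream analysis.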

In [15]:
# In the ESMFold version used in this notebook, the plddt field is scaled from 0 to 1;
# it will be updated to match AlphaFold's scale of 0-100 in future versions.
# Blue indicates high confidence.

if torch.max(output['plddt']) <= 1.0:
    vmin = 0.5
    vmax = 0.95
else:
    vmin = 50
    vmax = 95

view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min': vmin,'max': vmax}}})

<py3Dmol.view at 0x7817d82977c0>