# Protein Structure Prediction with the AlphaFold2 NIM

[Reference notebook from NVIDIA](https://github.com/NVIDIA/bionemo-examples/blob/62aef816070399814e478234dc47eb2ccddfd1a0/examples/nims/alphafold2/AlphaFold2-NIM-example.ipynb)

[AlphaFold2 NIM documentation and endpoints reference](https://docs.nvidia.com/nim/bionemo/alphafold2/latest/endpoints.html)

This notebook assumes that all requirements in `requirements.txt` are already installed on the client and that the application is up and running in OCI.

In [None]:
# import required packages
import py3Dmol
import ipywidgets as widgets
from IPython.display import display
from concurrent.futures import ThreadPoolExecutor

import json
import os
import requests
from enum import StrEnum, Enum
from typing import Tuple, Dict, Any, List
from pathlib import Path
from Bio import SeqIO

In [None]:
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY") or input("Paste Run Key: ")

The input data are amino acid sequences. We query [UniProt](https://www.uniprot.org/) for all proteins named "Dihydrofolate reductase" and keep only the human (*Homo sapiens*) entries.

In [None]:
# get data from UniProt with the following filters:
# protein_name = Dihydrofolate reductase
# organism_id = 9606 (Homo sapiens)
!wget "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28%28protein_name%3A%22Dihydrofolate+reductase%22%29+AND+%28organism_id%3A9606%29%29" -O dataset.gz
!gzip -d dataset.gz
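
The percent-encoded query string in the `wget` call is hard to read and edit by hand. As a minimal sketch, the same URL can be assembled in Python with the standard library. The helper name `build_uniprot_url` and the constant `UNIPROT_STREAM_URL` are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

# Base URL taken from the wget command above
UNIPROT_STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_uniprot_url(protein_name: str, organism_id: int) -> str:
    """Build a UniProtKB stream URL that returns gzip-compressed FASTA."""
    query = f'((protein_name:"{protein_name}") AND (organism_id:{organism_id}))'
    params = {"compressed": "true", "format": "fasta", "query": query}
    # urlencode percent-encodes parentheses, colons and quotes,
    # and encodes spaces as "+"
    return f"{UNIPROT_STREAM_URL}?{urlencode(params)}"


url = build_uniprot_url("Dihydrofolate reductase", 9606)
print(url)
```

Changing the protein name or the organism ID then only requires changing the function arguments.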

In [None]:
# Set the public IP of the application's load balancer
AF2_NIM_HOST = 'http://<Load_Balancer_IP>:8081'

In [None]:
# Parse the FASTA file into records, skipping isoform entries
fasta_file = "dataset"
records = [rec for rec in SeqIO.parse(fasta_file, "fasta") if "Isoform" not in rec.description]

In [None]:
str(records[0].seq)

In [None]:
# check that the server is up and ready
response = requests.get(f'{AF2_NIM_HOST}/v1/health/ready')

In [7]:
response.text

'{"status":"ready"}'
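
If the service is still starting, a single health check may fail or report a status other than `ready`. The following is a hedged sketch of a polling loop. The helper `wait_until_ready` and its default timeout and interval are assumptions chosen for illustration, not values from the NIM documentation.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception as exc:
            # e.g. a connection error while the service is still starting
            print("Not ready yet:", exc)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage against the NIM (assumes `requests` and AF2_NIM_HOST from the cells above):
# ready = wait_until_ready(
#     lambda: requests.get(f"{AF2_NIM_HOST}/v1/health/ready", timeout=5)
#     .json()
#     .get("status") == "ready"
# )
```

Passing the check as a callable keeps the loop independent of the HTTP client, so it can be tried out without a running server.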

In [None]:
# Run a query against the endpoint alphafold2/predict-structure-from-sequence
# and write each predicted structure to a local PDB file.
def predict_structure(elt):
    name = elt.id.replace("|", "_")  # avoid "|" in file names
    try:
        print(name)
        protein = str(elt.seq)
        response = requests.post(
            f'{AF2_NIM_HOST}/protein-structure/alphafold2/predict-structure-from-sequence',
            json={
                'sequence': protein,
                'databases': ['uniref90', 'mgnify', 'small_bfd'],
                'msa_algorithm': 'jackhmmer',
                'e_value': 0.0001,
                'bit_score': -1,  # -1 means fall back to the e-value
                'msa_iterations': 1,
                'relax_prediction': True,
            },
            timeout=None,  # predictions take minutes, so do not time out
        )
        response.raise_for_status()  # surface HTTP errors before parsing JSON
        af2_response = response.json()
        folded_protein = af2_response[0]
        # [OPTIONAL STEP]: Write the structure coordinates to a file
        filename = name + ".pdb"
        with open(filename, 'w') as file:
            file.write(folded_protein)
        return {str(elt.id): folded_protein}
    except Exception as e:
        print(f'Request for {name} failed due to error:', e)
        return None

In [None]:
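# Send 2 requests at once. This can be increased along with the replica count.
# With A10 GPUs, a single request can take 10-15 minutes.
MAX_THREADS = 2
with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    # Despite its name, prot_dict is a list with one entry per record:
    # a {id: pdb_string} dict on success, or None if the request failed
    prot_dict = list(executor.map(predict_structure, records))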
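
Because `executor.map` returns one result per record, `prot_dict` is a list of single-entry dictionaries, with `None` for any request that failed. A minimal sketch of flattening it into one lookup table follows. The helper name `merge_predictions` is hypothetical, and the IDs in the example are placeholders rather than real UniProt entries.

```python
from typing import Dict, List, Optional


def merge_predictions(results: List[Optional[Dict[str, str]]]) -> Dict[str, str]:
    """Merge per-record {id: pdb_string} dicts into one dict, skipping failures."""
    merged: Dict[str, str] = {}
    for result in results:
        if result is not None:  # None marks a failed request
            merged.update(result)
    return merged


# Placeholder data standing in for the list returned by executor.map
example = [{"id_a": "ATOM ..."}, None, {"id_b": "ATOM ..."}]
structures = merge_predictions(example)
print(sorted(structures))
```

In the notebook this would be called as `merge_predictions(prot_dict)`, and the number of failed requests is then `len(prot_dict) - len(structures)`.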

At this stage, the protein structures have been predicted and can be visualised with py3Dmol.

In [None]:
# replace with the correct file name
prot_file = "FILE.pdb"
with open(prot_file) as ifile:
    system = ifile.read()

In [21]:
view = py3Dmol.view(width=1200, height=900)
view.addModelsAsFrames(system)

view.zoomTo()
# Set the style and color by B-factor (approximating colors for pLDDT scores)
view.setStyle({'cartoon': {'colorscheme': {'prop': 'b', 'gradient': 'roygb', 'min': 40, 'max': 100}}})

<py3Dmol.view at 0x7f764c173080>
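
The colouring above reads the B-factor column, where AlphaFold-style PDB files conventionally store the per-residue pLDDT score. Assuming the NIM output follows that convention, a rough numeric summary can be computed directly from the PDB text. The helper `mean_plddt` is an illustrative sketch, not part of the original notebook.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column over C-alpha atoms of a PDB string."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16, B-factor in columns 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)


# Usage with the structure loaded above (assumes `system` holds the PDB text):
# print(f"mean pLDDT: {mean_plddt(system):.1f}")
```

Restricting the average to C-alpha atoms counts each residue once, so residues with many atoms do not dominate the result.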