Run Protein Structure Design and Protein Structure Prediction

NOTE: The authors recommend running this notebook in Amazon SageMaker Studio with the following environment settings:  
* **PyTorch 1.13 Python 3.9 GPU-optimized** image  
* **Python 3** kernel  
* **ml.g4dn.xlarge** instance type 

Analyzing large macromolecules like proteins is an essential part of designing new therapeutics. Recently, a number of deep-learning based approaches have improved the speed and accuracy of protein structure analysis. Some of these methods are shown in the image below.

In this module, we will use several AI algorithms to design a protein binder: an IL-7 mimetic that binds to the IL-7 receptor alpha chain (IL-7Rα).

* [RFDiffusion](https://github.com/RosettaCommons/RFdiffusion) is used to generate a small number of candidate binder backbones. We keep the target chain fixed and generate a new 80-residue binder chain against it.
* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN) is then used to discover novel sequences that are expected to fold to the novel structure.
* [ESMFold](https://github.com/facebookresearch/esm) is then used to score each of the candidate proteins. ESMFold returns the predicted local distance difference test (pLDDT) score, which represents the confidence in the predicted structure. We use its average over all residues to assess whether the predicted structure is likely to be correct.
For running ESMFold, we will use the ESMFold endpoint deployed in Module 1, so please ensure that you have run that module **before** running this one.

---
## 1. Setup and installation

Install RFDiffusion and its dependencies.

In [None]:
%pip install -U -q -r protein-design-requirements.txt --disable-pip-version-check

Download and extract the RFDiffusion and ProteinMPNN model weights. This will take several minutes.

In [None]:
%%bash
# Install the 3D viewer used later in this notebook
pip install py3Dmol
mkdir -p "data/weights/rfdiffusion" "data/weights/proteinmpnn" 
aws s3 cp --no-sign-request "s3://aws-batch-architecture-for-alphafold-public-artifacts/compressed/rfdiffusion_parameters_220407.tar.gz" "weights.tar.gz"
tar --extract -z --file="weights.tar.gz" --directory="data/weights/rfdiffusion" --no-same-owner
rm "weights.tar.gz"
wget -q -P "data/weights/proteinmpnn" https://github.com/dauparas/ProteinMPNN/raw/main/vanilla_model_weights/v_48_020.pt
#wget -q -P "data" https://files.rcsb.org/download/1N8Z.pdb

## Prep PDBS.

## 2. Generate Protein Binder Backbone Structures with RFDiffusion
We want to generate an IL-7 mimetic that binds to IL-7Rα (7OPB).
We pick the following hotspot residues in the interface between IL-7 and IL-7Rα: L65, I67, Y124, Y177, and F178.
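The hotspot residues are passed to RFDiffusion as a single override string of chain-prefixed residue numbers. The snippet below is a minimal sketch of how that string could be assembled from a plain list. The helper name `format_hotspots` is hypothetical and is not part of `prothelpers` or RFDiffusion.

```python
def format_hotspots(chain, residues):
    """Build a hotspot override string such as 'ppi.hotspot_res=[A65,A67]'.

    `chain` is the chain ID of the target and `residues` is a list of
    residue numbers on that chain. Hypothetical helper for illustration.
    """
    labels = ",".join(f"{chain}{resnum}" for resnum in residues)
    return f"ppi.hotspot_res=[{labels}]"


# Hotspot residues chosen above: L65, I67, Y124, Y177, F178 on chain A
hotspot_residues = [65, 67, 124, 177, 178]
hotspot_override = format_hotspots("A", hotspot_residues)
print(hotspot_override)
```

Keeping the residue numbers in one list makes it easier to reuse them for both the visualization and the RFDiffusion call.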

In [None]:
import py3Dmol
view = py3Dmol.view(width=600, height=400)
with open("/root/AI4PD_2023/pdbs/il7apo.pdb") as ifile:
    experimental_structure = ifile.read()
view.addModel(experimental_structure)
view.setStyle({"chain": "A"}, {"cartoon": {"color": "blue", "opacity": 0.6}})
# Highlight the hotspot residues as orange sticks
for resi in ["65", "67", "124", "177", "178"]:
    view.setStyle(
        {"chain": "A", "resi": resi}, {"stick": {"color": "orange", "opacity": 1.0}}
    )
view.zoomTo()
view.show()


We can now run RFDiffusion. The job will take about 5 minutes to complete on an ml.g4dn.xlarge instance type.

In [None]:
%%time
!mkdir -p data/results/rfdiffusion
from prothelpers.rfdiffusion import create_structures
create_structures(
    overrides=[
        "inference.input_pdb=/root/AI4PD_2023/pdbs/il7apo.pdb",
        "inference.output_prefix=data/results/rfdiffusion/rfdiffusion_result",
        "inference.model_directory_path=data/weights/rfdiffusion",
        "contigmap.contigs=[80-80/0 A1-196]",
        "ppi.hotspot_res=[A65,A67,A124,A177,A178]",
        "inference.num_designs=4",
    ]
)

Our new designs are in the `data/results/rfdiffusion` folder. Let's take a look at them.

In [None]:
import os

rfdiffusion_results_dir = "data/results/rfdiffusion"

# Extract all PDB structures from the specified directory
def extract_structures_from_dir(directory):
    # Sort the file names so the structures load in a reproducible order
    pdb_files = sorted(f for f in os.listdir(directory) if f.endswith('.pdb'))
    structures = []
    for pdb_file in pdb_files:
        with open(os.path.join(directory, pdb_file), 'r') as file:
            structures.append(file.read())
    return structures

structures = extract_structures_from_dir(rfdiffusion_results_dir)

view = py3Dmol.view()

# Add each structure to the view
for structure in structures:
    view.addModel(structure, format="pdb")

# Setting up the visualization styles
view.setStyle({"chain": "B"}, {"cartoon": {"color": "blue", "opacity": 0.6}})
# You can uncomment the following lines to apply additional styles
# view.setStyle({"chain": "B", "resi": "65"}, {"stick": {"color": "orange", "opacity": 1.0}})
# view.setStyle({"chain": "B", "resi": "67"}, {"stick": {"color": "orange", "opacity": 1.0}})
# view.setStyle({"chain": "B", "resi": "124"}, {"stick": {"color": "orange", "opacity": 1.0}})
# view.setStyle({"chain": "B", "resi": "177"}, {"stick": {"color": "orange", "opacity": 1.0}})
# view.setStyle({"chain": "B", "resi": "178"}, {"stick": {"color": "orange", "opacity": 1.0}})
view.setStyle({"chain": "A"}, {"cartoon": {"color": "green"}})
view.show()


## 3. Translate Structure into Sequence with ProteinMPNN
ProteinMPNN is a tool for **inverse protein folding**. In inverse protein folding, the input is a protein tertiary structure, and the output is a sequence (or set of sequences) predicted to fold into that structure. Here is a schematic of how it works:
<div style="text-align: left;">
    <img src="img/06.png" alt="A diagram of inverse protein folding" width="700" />
</div>
                        
*image credit: https://huggingface.co/spaces/simonduerr/ProteinMPNN.*        
                               
ProteinMPNN returns the sequences in FASTA format.
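In FASTA format, each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The snippet below is a minimal sketch of a parser for that layout. It is not the implementation behind `prothelpers.sequence.extract_seqs_from_dir`, and the record names and sequences are made-up placeholders rather than ProteinMPNN output.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap, so append to the current record
            records[header] += line
    return records


# Placeholder records for illustration only
example_fasta = """>design_0
MKTAYIAK
QRQISFVK
>design_1
GSHMLEDP
"""

fasta_records = parse_fasta(example_fasta)
for name, sequence in fasta_records.items():
    print(name, len(sequence), sequence)
```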


We gather the locations of the RFDiffusion output structures and submit them to ProteinMPNN. This will take about 15 seconds on a ml.g4dn.xlarge instance.

In [None]:
%%time
!mkdir -p data/results/proteinmpnn

from prothelpers import proteinmpnn
from prothelpers.sequence import list_files_in_dir

rfdiffusion_candidates = list_files_in_dir(rfdiffusion_results_dir, extension=".pdb")

proteinmpnn_results_dir = "data/results/proteinmpnn"

for path in rfdiffusion_candidates:
    proteinmpnn.design(
        pdb_path=path,
        out_folder=proteinmpnn_results_dir,
        num_seq_per_target=8,
        pdb_path_chains="A",
        path_to_model_weights="data/weights/proteinmpnn",
        batch_size=1,
        suppress_print=1,
    )

Let's look at the results

In [None]:
import os
from prothelpers.sequence import extract_seqs_from_dir

mpnn_dir = os.path.join(proteinmpnn_results_dir, "seqs")
mpnn_sequences = extract_seqs_from_dir(mpnn_dir, extension="fa")
print(mpnn_sequences)

## 4. Run Inference on ESMFold
With 4 RFDiffusion designs and `num_seq_per_target=8`, ProteinMPNN has generated 32 new sequences, 8 per designed structure. But which one should we test in the lab? There are lots of metrics we could use, like isoelectric point, patent status, or homology with other sequences. For this example, we're going to measure the "foldability" of each sequence using ESMFold. This is a popular way to identify those sequences that are most similar to other experimentally determined structures. Specifically, we'll use the average predicted local distance difference test (pLDDT) score, a measure of the ESMFold prediction confidence.
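To make the averaging concrete, here is a minimal sketch of how a mean pLDDT could be computed from PDB text. It assumes the per-residue pLDDT is stored in the B-factor column (columns 61-66) of each `ATOM` record, as is common for predicted-structure PDB files, and it counts one value per residue by reading only the alpha-carbon (`CA`) atoms. Both function names are hypothetical; `get_mean_plddt` in `prothelpers` may work differently. The example records use dummy coordinates and made-up scores.

```python
def make_atom_line(serial, atom_name, res_name, chain, res_seq, b_factor):
    """Format one fixed-width PDB ATOM record with dummy coordinates."""
    return (
        f"ATOM  {serial:>5} {atom_name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{b_factor:>6.2f}"
    )


def mean_plddt_from_pdb(pdb_text):
    """Average the B-factor column over CA atoms (one value per residue)."""
    scores = []
    for line in pdb_text.splitlines():
        # Atom name is in columns 13-16, B-factor in columns 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("No CA atoms found in PDB text")
    return sum(scores) / len(scores)


# Two made-up residues with pLDDT 0.90 and 0.70 (0-1 scale assumed)
example_pdb = "\n".join(
    [
        make_atom_line(1, "N", "MET", "A", 1, 0.90),
        make_atom_line(2, "CA", "MET", "A", 1, 0.90),
        make_atom_line(3, "N", "LYS", "A", 2, 0.70),
        make_atom_line(4, "CA", "LYS", "A", 2, 0.70),
    ]
)

example_mean = mean_plddt_from_pdb(example_pdb)
print(f"Mean pLDDT: {example_mean:.2f}")
```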

In [None]:
%store -r endpoint_name

In [None]:
import boto3

# Wait until ESMFold endpoint from module 1 is in service
waiter = boto3.client('sagemaker').get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)

In [None]:
%%time
!mkdir -p data/results/esmfold

import json
from prothelpers.structure import get_mean_plddt
import pandas as pd
import sagemaker
from sagemaker.predictor import Predictor

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(),
    serializer=sagemaker.serializers.CSVSerializer(),
    deserializer=sagemaker.deserializers.StringDeserializer(),
)

metrics = []
for i, seq in enumerate(mpnn_sequences):
    print(f"Generating structure prediction {i} for {seq}")
    esmfold_output = json.loads(predictor.predict(seq))[0]
    mean_plddt = get_mean_plddt(esmfold_output)
    output_file = f"data/results/esmfold/prediction_{i}.pdb"
    with open(output_file, "w") as f:
        f.write(esmfold_output)
    metrics.append(
        {"seq": seq, "esmfold_result": output_file, "mean_plddt": mean_plddt}
    )

metrics_df = (
    pd.DataFrame(metrics)
    .sort_values(by="mean_plddt", ascending=False)
    .reset_index(drop=True)
)
metrics_df

You can see from the results above that the designed proteins have a mean pLDDT of 0.8 or greater, meaning that ESMFold has high confidence in the structures. The highest-scoring sequences are good candidates for synthesis and testing.
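One simple way to turn these scores into a shortlist is to keep only the sequences above a confidence cutoff and rank them. The snippet below is a sketch using plain Python dictionaries with the same `seq` and `mean_plddt` keys as the metrics above. The function name, the 0.8 cutoff as a default, and the example values are illustrative assumptions, not results from this notebook.

```python
def select_candidates(records, min_plddt=0.8):
    """Return records with mean_plddt >= min_plddt, highest score first."""
    passing = [r for r in records if r["mean_plddt"] >= min_plddt]
    return sorted(passing, key=lambda r: r["mean_plddt"], reverse=True)


# Made-up scores for illustration only
example_metrics = [
    {"seq": "SEQ_A", "mean_plddt": 0.83},
    {"seq": "SEQ_B", "mean_plddt": 0.71},
    {"seq": "SEQ_C", "mean_plddt": 0.91},
]

shortlist = select_candidates(example_metrics)
for record in shortlist:
    print(record["seq"], record["mean_plddt"])
```

In practice the cutoff would be chosen together with other criteria, such as the metrics mentioned earlier in this section.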

Here is a screenshot of one example of the designed antibody (blue) superimposed on the original antibody (green). The orange and red regions correspond to the extracellular domain of HER2. Note that the structure of the designed antibody is similar, but not identical, to the original.

![Picture of designed protein](img/03.png)


When you are finished with this module, run the cell below to delete the ESMFold endpoint. Note that `%store -z` also clears all variables saved with `%store`.

In [None]:
%store -z

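try:
    predictor.delete_endpoint()
except Exception as err:
    # The endpoint may already be deleted, or `predictor` may not be defined
    print(f"Endpoint was not deleted: {err}")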