Run Protein Structure Design and Protein Structure Prediction

NOTE: The authors recommend running this notebook in Amazon SageMaker Studio with the following environment settings:  
* **PyTorch 1.13 Python 3.9 GPU-optimized** image  
* **Python 3** kernel  
* **ml.g4dn.xlarge** instance type 

Analyzing large macromolecules like proteins is an essential part of designing new therapeutics. Recently, a number of deep-learning-based approaches have improved the speed and accuracy of protein structure analysis. Some of these methods are shown in the image below.

In this module, we will use several AI algorithms to design a protein binder to XY. 

* [RFDiffusion](https://github.com/RosettaCommons/RFdiffusion) is used to generate a small number of variant designs. We will only attempt to redesign parts of the variable region.
* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN) is then used to discover novel sequences that are expected to fold to the novel structure.
* [ESMFold](https://github.com/facebookresearch/esm) is then used to score each of the candidate proteins. ESMFold returns the average predicted local distance difference test (pLDDT) score, which represents the model's confidence in the predicted structure, averaged over all residues. We use this score to assess whether the predicted structure is likely to be correct.
For running ESMFold, we will use the ESMFold endpoint deployed in Module 1, so please ensure that you have run that module **before** running this one.
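
To make the "averaged over all residues" part concrete, here is a minimal sketch of how a mean pLDDT could be recovered from a predicted structure file. It assumes the prediction is a PDB file whose B-factor column holds the per-residue pLDDT, which is how ESMFold-style PDB outputs usually store it; `mean_plddt_from_pdb` is a hypothetical helper and not part of this workshop's `prothelpers` package.

```python
def mean_plddt_from_pdb(pdb_text, atom_name="CA"):
    """Average the B-factor column over the selected atoms of a PDB string.

    Assumption: the B-factor field stores pLDDT, one value per residue,
    so reading it from the C-alpha atoms gives one value per residue.
    """
    values = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Fixed-width PDB columns (1-based): atom name 13-16, B-factor 61-66.
        if line[12:16].strip() != atom_name:
            continue
        values.append(float(line[60:66]))
    if not values:
        raise ValueError("no matching ATOM records found")
    return sum(values) / len(values)
```

A call such as `mean_plddt_from_pdb(open(path).read())` would return a single confidence number per structure. Whether that number is on a 0-1 or 0-100 scale depends on how the prediction was written out.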

---
## 1. Setup and installation

Install RFDiffusion and its dependencies.

In [None]:
%pip install -U -q -r protein-design-requirements.txt --disable-pip-version-check

Download and extract the RFDiffusion and ProteinMPNN model weights. This will take several minutes.

In [None]:
!pip install prody
!pip install py3Dmol

In [None]:
%%bash
mkdir -p "data/weights/rfdiffusion" "data/weights/proteinmpnn" 
aws s3 cp --no-sign-request "s3://aws-batch-architecture-for-alphafold-public-artifacts/compressed/rfdiffusion_parameters_220407.tar.gz" "weights.tar.gz"
tar --extract -z --file="weights.tar.gz" --directory="data/weights/rfdiffusion" --no-same-owner
rm "weights.tar.gz"
wget -q -P "data/weights/proteinmpnn" https://github.com/dauparas/ProteinMPNN/raw/main/vanilla_model_weights/v_48_020.pt
wget -q -P "data" https://files.rcsb.org/download/1N8Z.pdb

## 2. Prep PDBs

We can change the numbers in the `contigmap.contigs=[100-100]` override to specify the length of the protein that we want to generate by diffusion.

The RFDiffusion job will take about 5 minutes to complete on an ml.g4dn.xlarge instance type.
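
As a small illustration of the contig setting, the sketch below builds the override string for a chosen length. `contig_override` is a hypothetical helper written for this example. The fixed-length form matches the `[100-100]` value used in this notebook; the range form (for example `[80-120]`, where a length is sampled from the range) follows the RFdiffusion README and is not exercised here.

```python
def contig_override(min_length, max_length=None):
    """Build a `contigmap.contigs` override for unconditional generation.

    Hypothetical helper: a single length L becomes "[L-L]";
    a (min, max) pair becomes "[min-max]".
    """
    if max_length is None:
        max_length = min_length
    if min_length < 1 or max_length < min_length:
        raise ValueError("expected 1 <= min_length <= max_length")
    return f"contigmap.contigs=[{min_length}-{max_length}]"


print(contig_override(100))      # fixed length of 100 residues
print(contig_override(80, 120))  # length range of 80 to 120 residues
```

The returned string can be placed in the `overrides` list in place of the hard-coded contig entry.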

In [None]:
%%time
!mkdir -p data/results/rfdiffusion
from prothelpers.rfdiffusion import create_structures
create_structures(
    overrides=[
        "inference.output_prefix=data/results/rfdiffusion/rfdiffusion_result",
        "inference.model_directory_path=data/weights/rfdiffusion",
        "contigmap.contigs=[100-100]",
        "inference.num_designs=4",
        "inference.input_pdb=/root/AI4PD_2023/RFdiffusion/examples/input_pdbs/1qys.pdb",
    ]
)

Our new designs are in the `data/results/rfdiffusion` folder. Let's take a look at them.

In [None]:
import os
import py3Dmol

def extract_structures_from_dir(directory):
    pdb_files = [f for f in os.listdir(directory) if f.endswith('.pdb')]
    structures = []
    for pdb_file in pdb_files:
        with open(os.path.join(directory, pdb_file), 'r') as file:
            structures.append(file.read())
    return structures

structures = extract_structures_from_dir('data/results/rfdiffusion')  # replace with your directory

# Display each structure in a separate window
for structure in structures:
    view = py3Dmol.view()
    view.addModel(structure, format="pdb")
    view.setStyle({"chain": "A"}, {"cartoon": {"color": "blue", "opacity": 1.0}})
    view.zoomTo()
    view.show()


## 3. Translate Structure into Sequence with ProteinMPNN
ProteinMPNN is a tool for **inverse protein folding**. In inverse protein folding, the input is a protein tertiary structure, and the output is a sequence (or several sequences) predicted to fold into that structure. Here is a schematic of how it works:
<div style="text-align: left;">
    <img src="img/06.png" alt="A diagram of inverse protein folding" width="700" />
</div>
                        
*image credit: https://huggingface.co/spaces/simonduerr/ProteinMPNN.*        
                               
ProteinMPNN returns the sequences in FASTA format.
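
For readers who have not worked with FASTA before, each record is a header line starting with `>` followed by one or more sequence lines. The sketch below is a generic, minimal parser for that layout. `parse_fasta` is a hypothetical helper for illustration only; this notebook reads the ProteinMPNN output with `prothelpers.sequence.extract_seqs_from_dir` instead.

```python
def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs.

    Sequence lines belonging to one record are joined together, and
    blank lines are ignored.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Keeping the header alongside each sequence is useful because design tools often record per-sequence metadata there.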


We gather the locations of the RFDiffusion output structures and submit them to ProteinMPNN. This will take about 15 seconds on an ml.g4dn.xlarge instance.

In [None]:
%%time
!mkdir -p data/results/proteinmpnn

from prothelpers import proteinmpnn
from prothelpers.sequence import list_files_in_dir

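# Folder that the RFDiffusion step wrote its designs to
rfdiffusion_results_dir = "data/results/rfdiffusion"
rfdiffusion_candidates = list_files_in_dir(rfdiffusion_results_dir, extension=".pdb")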

proteinmpnn_results_dir = "data/results/proteinmpnn"

for path in rfdiffusion_candidates:
    proteinmpnn.design(
        pdb_path=path,
        out_folder=proteinmpnn_results_dir,
        num_seq_per_target=8,
        pdb_path_chains="A",
        path_to_model_weights="data/weights/proteinmpnn",
        batch_size=1,
        suppress_print=1,
    )

Let's look at the results.

In [None]:
import os
from prothelpers.sequence import extract_seqs_from_dir

mpnn_dir = os.path.join(proteinmpnn_results_dir, "seqs")
mpnn_sequences = extract_seqs_from_dir(mpnn_dir, extension="fa")
#print(mpnn_sequences)

def to_fasta(sequences):
    fasta_format = ""
    for i, seq in enumerate(sequences, 1):
        fasta_format += f">sequence_{i}\n{seq}\n"
    return fasta_format

fasta_output = to_fasta(mpnn_sequences)
print(fasta_output)

# If you want to save the FASTA format to a file:
with open("output.fasta", "w") as f:
    f.write(fasta_output)

## Run Inference with AlphaFold2 on AWS
Download the FASTA file with the sequences you generated with ProteinMPNN.

Navigate to the AlphaFold -->


Upload your FASTA file and submit the job.


## Compare RFdiffusion backbone to AlphaFold2 predicted structure

In [None]:
!pip install prody

In [None]:
prediction_pdb = '' # this is the path to your AF2/ESMFold Prediction
reference_pdb = '' # this is the path to your original de novo design (prior to sequence design)

In [None]:
def rmsd_calc(pred, ref):
    import prody
    prody.confProDy(verbosity='none')
    
    p = prody.parsePDB(pred,subset='ca')
    r = prody.parsePDB(ref,subset='ca')
    prody.calcTransformation(p,r).apply(p)

    return prody.calcRMSD(p, r), p, r

def visualize_pdb_overlay(mol1_path, mol2_path):
    # Create the viewer with specified dimensions
    view = py3Dmol.view(width=400, height=300)
    
    # Load the first PDB (the context manager closes the file afterwards)
    with open(mol1_path, 'r') as f:
        view.addModel(f.read(), 'pdb')
    
    # Load the second PDB
    with open(mol2_path, 'r') as f:
        view.addModel(f.read(), 'pdb')
    
    # Set style. For demonstration purposes, we'll use different colors for each structure.
    view.setStyle({'model': 0}, {"cartoon": {'color': 'blue'}})  # First structure in blue
    view.setStyle({'model': 1}, {"cartoon": {'color': 'red'}})   # Second structure in red
    
    # Set the background color and zoom to fit the structures
    view.setBackgroundColor('white')
    view.zoomTo()
    
    # Display the overlaid PDBs
    view.show()


In [None]:
%time rmsd, p, r = rmsd_calc(prediction_pdb, reference_pdb)
print(rmsd)
visualize_pdb_overlay(prediction_pdb, reference_pdb)
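
The number printed above is a Cα RMSD computed after superposition. As a reminder of what that number measures, here is a minimal pure-Python sketch of the RMSD formula itself. It assumes the two coordinate lists are already superimposed and matched atom for atom; `rmsd` is a hypothetical helper for illustration and is not part of ProDy.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z).

    Assumption: the structures are already superimposed, so no rotation
    or translation is applied here.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

An RMSD of 0 means the matched atoms coincide exactly. Larger values mean the predicted backbone deviates further from the designed one.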