# ESM3
ESM3 is a frontier generative model for biology, able to jointly reason across three fundamental biological properties of proteins: sequence, structure, and function. These three data modalities are represented as tracks of discrete tokens at the input and output of ESM3. You can present the model with a combination of partial inputs across the tracks, and ESM3 will provide output predictions for all the tracks.

ESM3 is a generative masked language model. You can prompt it with partial sequence, structure, and function keywords, and iteratively sample masked positions until all positions are unmasked. This iterative sampling is what the `.generate()` function does.

![image.png](https://github.com/evolutionaryscale/esm/blob/main/_assets/esm3_diagram.png?raw=true)

The ESM3 architecture is highly scalable due to its transformer backbone and all-to-all reasoning over discrete token sequences. At its largest scale, ESM3 was trained with 1.07e24 FLOPs on 2.78 billion proteins and 771 billion unique tokens, and has 98 billion parameters.
Here we present `esm3-open-small`. With 1.4B parameters it is the smallest and fastest model in the family, trained specifically to be open sourced. ESM3-open is available under a non-commercial license.

# Imports

If you're running in Colab, you probably want to get a GPU runtime first (Runtime > Change runtime type > T4 GPU).

In [None]:
%set_env TOKENIZERS_PARALLELISM=false
!pip install esm
import numpy as np
import torch
!pip install py3Dmol
import py3Dmol
import huggingface_hub

from esm.utils.structure.protein_chain import ProteinChain
from esm.models.esm3 import ESM3
from esm.sdk import client
from esm.sdk.api import (
    ESMProtein,
    GenerationConfig,
)

#  Load `esm-open-small` on GPU

In [None]:
huggingface_hub.login()  # will prompt you to get an API key and accept the ESM3 license.
model =  ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda"))

Alternatively, you could use the Forge API running the model remotely, and use the local `client` to call the API just like you're used to with the model running locally on your GPU:

In [None]:
# from getpass import getpass
# token = getpass("Token from Forge console: ")
# model = client(
#     model="esm3-lg-alpha1",
#     url="https://forge.evolutionaryscale.ai",
#     token=token,
# )

# Let's construct a prompt for ESM3, focusing on the task of scaffolding a motif from a natural protein

First, we can use the `ProteinChain` class from the `esm` sdk to grab a protein structure from the PDB.
We'll work with a human renal (kidney) dipeptidase (a protein that breaks up two amino acids bound together). Renal dipeptidases are of particular interest because they metabolize certain antibiotics.

In [None]:
pdb_id = "1ITU" # PDB ID corresponding to Renal Dipeptidase
chain_id = "A" # Chain ID corresponding to Renal Dipeptidase in the PDB structure
renal_dipep_chain = ProteinChain.from_rcsb(pdb_id, chain_id)
# Alternatively, we could have used ProteinChain.from_pdb() to load a protein structure from a local PDB file

The `ProteinChain` class is a object that makes it easy to work with protein structures. It contains a `sequence` attribute that contains the amino acid sequence of the protein


In [None]:
print(renal_dipep_chain.sequence)

`ProteinChain` also contains an `atom37_positions` numpy array that contains the atomic coordinates of each of the residues in the protein. 

The shape of the array is `(n_residues, 37, 3)` where `n_residues` is the number of residues in the protein and 37 is the number of possible distinct atoms that may be present across all amino acids (e.g. the first three atoms are the N, C-alpha, and C atoms corresponding to the protein backbone). The 3 corresponds to the x, y, and z coordinates of each atom. The atom37 representation of protein structure allows us to use a single format to conveniently represent all amino acids -- **coordinates are only present for the atoms that are present in the amino acid and `nan` otherwise**.

In [None]:
print("atom37_positions shape: ", renal_dipep_chain.atom37_positions.shape)
print(renal_dipep_chain.atom37_positions[:3])

We can visualize the protein chain using the `py3Dmol` library

In [None]:
# First we can create a `py3Dmol` view object
view = py3Dmol.view(width=500, height=500)
# py3Dmol requires the atomic coordinates to be in PDB format, so we convert the `ProteinChain` object to a PDB string
pdb_str = renal_dipep_chain.to_pdb_string()
# Load the PDB string into the `py3Dmol` view object
view.addModel(pdb_str, "pdb")
# Set the style of the protein chain
view.setStyle({"cartoon": {"color": "spectrum"}})
# Zoom in on the protein chain
view.zoomTo()
# Display the protein chain
view.show()

Now, let's try to scaffold a motif from this protein using ESM3 -- we'll prompt the model with the sequence and structure of a helix-coil motif from renal dipeptidase and have the model generate a larger scaffold that includes the motif

In [None]:
motif_inds = np.arange(123, 146)
# `ProteinChain` objects can be indexed like numpy arrays to extract the sequence and atomic coordinates of a subset of residues
motif_sequence = renal_dipep_chain[motif_inds].sequence
motif_atom37_positions = renal_dipep_chain[motif_inds].atom37_positions
print("Motif sequence: ", motif_sequence)
print("Motif atom37_positions shape: ", motif_atom37_positions.shape)

We can also visualize the motif in the original chain using `py3Dmol`. We'll color the original chain in grey and the motif in blue

In [None]:
view = py3Dmol.view(width=500, height=500)
view.addModel(pdb_str, "pdb")
view.setStyle({"cartoon": {"color": "lightgrey"}})
motif_res_inds = (motif_inds + 1).tolist() # residue indices are 1-indexed in PDB files, so we add 1 to the indices
view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
view.zoomTo()
view.show()

Now, we can use the `ESMProtein` class to construct a prompt that will instruct ESM3 to scaffold the motif

In [None]:
prompt_length = 200
# First, we can construct a sequence prompt of all masks
sequence_prompt = ["_"]*prompt_length
# Then, we can randomly insert the motif sequence into the prompt (we randomly choose 72 here)
sequence_prompt[72:72+len(motif_sequence)] = list(motif_sequence)
sequence_prompt = "".join(sequence_prompt)
print("Sequence prompt: ", sequence_prompt)
print("Length of sequence prompt: ", len(sequence_prompt))

# Next, we can construct a structure prompt of all nan coordinates
structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
# Then, we can insert the motif atomic coordinates into the prompt, starting at index 72
structure_prompt[72:72+len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
print("Structure prompt shape: ", structure_prompt.shape)
print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

# Finally, we can use the ESMProtein class to compose the sequence and structure prompts into a single prompt that can be passed to ESM3
protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

Now, we can use the `generate` method of the model to iteratively sample a protein sequence based on the prompt. Under the hood, the model performs num_steps forward passes and samples a set of tokens at each step until the chosen track being generated is fully unmasked. 

In [None]:
# We'll have to first construct a `GenerationConfig` object that specifies the decoding parameters that we want to use
sequence_generation_config = GenerationConfig(
    track="sequence", # We want ESM3 to generate tokens for the sequence track
    num_steps=sequence_prompt.count("_") // 2, # We'll use num(mask tokens) // 2 steps to decode the sequence
    temperature=0.5, # We'll use a temperature of 0.5 to control the randomness of the decoding process
)

# Now, we can use the `generate` method of the model to decode the sequence
sequence_generation = model.generate(protein_prompt, sequence_generation_config)
print("Sequence Prompt:\n\t", protein_prompt.sequence)
print("Generated sequence:\n\t", sequence_generation.sequence)

We can also use the `generate` method to predict the structure of the generated sequence by iteratively sampling structure tokens.

In [None]:
structure_prediction_config = GenerationConfig(
    track="structure", # We want ESM3 to generate tokens for the structure track
    num_steps=len(sequence_generation) // 8,
    temperature=0.7, 
)
structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

Now, we can visualize the generated structure using `py3Dmol`. We'll visualize the generated structure (right, green) alongside the original structure (left, grey) from which the motif was drawn. The motif residues are colored in cyan.

In [None]:
# Convert the generated structure to a back into a ProteinChain object
structure_prediction_chain = structure_prediction.to_protein_chain()
# Align the generated structure to the original structure using the motif residues
motif_inds_in_generation = np.arange(72, 72+len(motif_sequence))
structure_prediction_chain.align(renal_dipep_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
crmsd = structure_prediction_chain.rmsd(renal_dipep_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
print("cRMSD of the motif in the generated structure vs the original structure: ", crmsd)

view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
view.addModel(pdb_str, "pdb", viewer=(0, 0))
view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
view.addStyle({"resi": (motif_inds_in_generation+1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
view.zoomTo()
view.show()

# Secondary Structure Editing Example: Helix Shortening

Now, we can try another generation task with ESM3. We'll use the secondary structure track, along with the sequence track, to shorten a helix-coil-helix region (residues 39-111) in a protein structure (colored in blue below)

In [None]:
helix_shortening_chain = ProteinChain.from_rcsb("7XBQ", "A")
view = py3Dmol.view(width=500, height=500)
view.addModel(helix_shortening_chain.to_pdb_string(), "pdb")
view.setStyle({"cartoon": {"color": "lightgrey"}})
helix_region = np.arange(38, 111) # zero-indexed
view.addStyle({"resi": (helix_region + 1).tolist()}, {"cartoon": {"color":"lightblue"}})
view.zoomTo()
view.show()
helix_shortening_ss8 = "CCCSHHHHHHHHHHHTTCHHHHHHHHHHHHHTCSSCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHTTCHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIIIIGGGCCSHHHHHHHHHHHHHHHHHHHHHCCHHHHHHHHHHHHHHHHHHHHHHHHHSCTTCHHHHHHHHHHHHHIIIIICCHHHHHHHHHHHHHHHHTTCTTCCSSHHHHHHHHHHHHHHHHHHHC"
print("Secondary structure of protein: (H: Alpha Helix, E: Beta Strand, C: Coil) \n\t", helix_shortening_ss8)

The helix-coil-helix region in the original protein is 73 residues long. We will try to shorten it to 45 residues by prompting the model with partial sequence and secondary structure

In [None]:
shortened_region_length = 45

# We'll construct a sequence prompt that masks the (shortened) helix-coil-helix region, but leaves the flanking regions unmasked
sequence_prompt = helix_shortening_chain.sequence[:helix_region[0]] + "_" * shortened_region_length + helix_shortening_chain.sequence[helix_region[-1] + 1:]
print("Sequence prompt:\n\t", sequence_prompt)

# We'll construct a secondary structure prompt that retains the secondary structure of the flanking regions, and shortens the lengths of helices in the helix-coil-helix region
ss8_prompt = helix_shortening_ss8[:helix_region[0]] + (((shortened_region_length - 3) // 2) * "H" + "C"*3 + ((shortened_region_length - 3) // 2) * "H") + helix_shortening_ss8[helix_region[-1] + 1:]
print("SS8 prompt:\n\t", ss8_prompt)
print("Proposed SS8 for shortened helix-coil-helix region:\n\t", " "*helix_region[0] + ss8_prompt[helix_region[0]:helix_region[0]+45])

print("")
print("Original sequence:\n\t", helix_shortening_chain.sequence)
print("Original SS8:\n\t", helix_shortening_ss8)
print("Original SS8 for helix-coil-helix region:\n\t", " "*helix_region[0] + helix_shortening_ss8[helix_region[0]:helix_region[-1]+1])


# We can again use the ESMProtein class to compose the sequence and secondary structure prompts into a single prompt that can be passed to ESM3
protein_prompt = ESMProtein(sequence=sequence_prompt, secondary_structure=ss8_prompt)

We can again use the `generate` method of the model to iteratively decode a protein sequence based on the prompt

In [None]:
print("Generating protein sequence...")
sequence_generation = model.generate(protein_prompt, GenerationConfig(track="sequence", num_steps=protein_prompt.sequence.count("_") // 2, temperature=0.5))
print("Folding protein...")
structure_prediction = model.generate(ESMProtein(sequence=sequence_generation.sequence), GenerationConfig(track="structure", num_steps=len(protein_prompt) // 4, temperature=0))

Now, we can visualize the generated structure using `py3Dmol`. We'll visualize the generated structure (right) alongside the original structure (left) from which the motif was drawn. The helix-coil-helix region in the original structure is colored in blue and the shortened region in the generated structure is colored in pink.

In [None]:
predicted_chain = structure_prediction.to_protein_chain()
predicted_chain = predicted_chain.align(helix_shortening_chain, mobile_inds=np.arange(len(predicted_chain) - 120, len(predicted_chain)), target_inds=np.arange(len(helix_shortening_chain) - 120, len(helix_shortening_chain)))
view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
view.addModel(helix_shortening_chain.to_pdb_string(), "pdb", viewer=(0, 0))
view.addModel(predicted_chain.to_pdb_string(), "pdb", viewer=(0, 1))
view.setStyle({"cartoon": {"color": "lightgrey"}})
view.addStyle({"resi": (helix_region + 1).tolist()}, {"cartoon": {"color":"lightblue"}},viewer=(0, 0))
view.addStyle({"resi": (np.arange(helix_region[0], helix_region[0] + 45) + 1).tolist()}, {"cartoon": {"color":"pink"}},viewer=(0, 1))
view.zoomTo()
view.show()

# SASA Editing Example: Exposing a buried helix

Let's grab 1LBS from the PDB and visualize it using `py3Dmol`. 1LBS has an alternating alpha-beta sandwich fold, with a buried helix in the center, highlighted in red

In [None]:
lipase_chain = ProteinChain.from_rcsb("1LBS", "A")
span_start = 105
span_end = 116
view = py3Dmol.view(width=500, height=500)
view.addModel(lipase_chain.to_pdb_string(), "pdb")
view.setStyle({"cartoon": {"color": "lightgrey"}})
view.addStyle({"resi": (np.arange(span_start, span_end) + 1).tolist()}, {"cartoon": {"color":"red"}})
view.zoomTo()
view.show()
lipase_ss8 = "CCSSCCCCSSCHHHHHHTEEETTBBTTBCSSEEEEECCTTCCHHHHHTTTHHHHHHHTTCEEEEECCTTTTCSCHHHHHHHHHHHHHHHHHHTTSCCEEEEEETHHHHHHHHHHHHCGGGGGTEEEEEEESCCTTCBGGGHHHHHTTCBCHHHHHTBTTCHHHHHHHHTTTTBCSSCEEEEECTTCSSSCCCCSSSTTSTTCCBTSEEEEHHHHHCTTCCCCSHHHHHBHHHHHHHHHHHHCTTSSCCGGGCCSTTCCCSBCTTSCHHHHHHHHSTHHHHHHHHHHSCCBSSCCCCCGGGGGGSTTCEETTEECCC"

We can construct a multimodal prompt for ESM3 to instruct it to expose the buried helix as follows:
1. Prompt with the **structure** of the buried helix highlighted in red -- this will prompt ESM3 to generate a protein that contains that same helix
2. Prompt with high **SASA** values for the residues in the buried helix -- this will prompt ESM3 to expose the helix to the surface of the protein

In [None]:
structure_prompt = torch.full((len(lipase_chain), 37, 3), torch.nan)
structure_prompt[span_start:span_end] = torch.tensor(lipase_chain[span_start:span_end].atom37_positions, dtype=torch.float32)   

sasa_prompt = [None]*len(lipase_chain)
sasa_prompt[span_start:span_end] = [40.0]*(span_end - span_start)

print("SASA prompt (just for buried region): ", sasa_prompt[span_start:span_end])

protein_prompt = ESMProtein(sequence="_"*len(lipase_chain), coordinates=structure_prompt, sasa=sasa_prompt)

This is a more difficult task, so you may need to sample more generations from ESM before you find a solution. We'll sample 32 here and sort by the generations with the highest predicted TM-score (pTM) by ESM3. 

In [None]:
generated_proteins = []
N_SAMPLES = 16
for i in range(N_SAMPLES):
    print("Generating protein sequence...")
    sequence_generation = model.generate(protein_prompt, GenerationConfig(track="sequence", num_steps=len(protein_prompt) // 8, temperature=0.7))
    print("Folding protein...")
    structure_prediction = model.generate(ESMProtein(sequence=sequence_generation.sequence), GenerationConfig(track="structure", num_steps=len(protein_prompt) // 32))
    generated_proteins.append(structure_prediction)

# Sort generations by ptm
generated_proteins = sorted(generated_proteins, key=lambda x: x.ptm.item(), reverse=True)

Let's visualize the top 4 generations by pTM, alongside with the original protein (on the left)

In [None]:
N_SAMPLES_TO_SHOW = 4
view = py3Dmol.view(width=1000, height=500, viewergrid=(1, N_SAMPLES_TO_SHOW+1))
view.addModel(lipase_chain.to_pdb_string(), "pdb", viewer=(0, 0))
for i in range(N_SAMPLES_TO_SHOW):
    print("PTM of generated protein {}: {:.2f}".format(i+1, generated_proteins[i].ptm.item()))
    view.addModel(generated_proteins[i].to_protein_chain().to_pdb_string(), "pdb", viewer=(0, i+1))
view.setStyle({"cartoon": {"color": "lightgrey"}})
view.addStyle({"resi": (np.arange(span_start, span_end) + 1).tolist()}, {"cartoon": {"color": "red"}})
view.zoomTo()
view.show()

In [1]:
# Import necessary libraries
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.utils.structure import ProteinChain
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

# Set environment variables and install required packages
%set_env TOKENIZERS_PARALLELISM=false
!pip install esm py3Dmol

# Log in to Hugging Face Hub
login(token="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")

# Load ESM3 model onto CUDA-enabled GPU
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda"))

# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID
beta_lactamase_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

# Active site indices (example indices for the 7 active sites)
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    motif_sequence = beta_lactamase_chain[motif_inds].sequence
    motif_atom37_positions = beta_lactamase_chain[motif_inds].atom37_positions
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = beta_lactamase_chain.to_pdb_string()
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()
!pip install --upgrade esm


ImportError: cannot import name 'ProteinChain' from 'esm.utils.structure' (unknown location)

In [2]:
!pip install --upgrade esm


[0m

In [3]:
from esm.utils.structure.protein_chain import ProteinChain


  @autocast(enabled=False)
  @autocast(enabled=False)
  @autocast(enabled=False)


In [4]:
!pip install esm py3Dmol numpy torch huggingface_hub


[0m

In [5]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.utils.structure import ProteinChain
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
!pip install esm py3Dmol numpy torch huggingface_hub biopython


ImportError: cannot import name 'ProteinChain' from 'esm.utils.structure' (unknown location)

In [6]:
!pip install esm py3Dmol numpy torch huggingface_hub biopython


[0m

In [9]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder


In [10]:
%set_env TOKENIZERS_PARALLELISM=false


env: TOKENIZERS_PARALLELISM=false


In [11]:
login(token="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")


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [12]:
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda"))


LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

In [13]:
# Import necessary libraries
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.utils.structure import ProteinChain
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

# Set environment variables and install required packages
%set_env TOKENIZERS_PARALLELISM=false
!pip install esm py3Dmol

# Log in to Hugging Face Hub
login(token="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")

# Load ESM3 model onto CUDA-enabled GPU
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda"))

# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID
beta_lactamase_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

# Active site indices (example indices for the 7 active sites)
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    motif_sequence = beta_lactamase_chain[motif_inds].sequence
    motif_atom37_positions = beta_lactamase_chain[motif_inds].atom37_positions
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = beta_lactamase_chain.to_pdb_string()
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()


ImportError: cannot import name 'ProteinChain' from 'esm.utils.structure' (unknown location)

In [14]:
# Import necessary libraries
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

# Set environment variables and install required packages
%set_env TOKENIZERS_PARALLELISM=false
!pip install esm py3Dmol

# Log in to Hugging Face Hub
login(token="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")

# Load ESM3 model onto CUDA-enabled GPU
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda"))

# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID
beta_lactamase_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

# Active site indices (example indices for the 7 active sites)
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    motif_sequence = beta_lactamase_chain[motif_inds].sequence
    motif_atom37_positions = beta_lactamase_chain[motif_inds].atom37_positions
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = beta_lactamase_chain.to_pdb_string()
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()


env: TOKENIZERS_PARALLELISM=false
[0mThe token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

In [15]:
# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID
beta_lactamase_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

# Print protein sequence and atomic coordinates
print(beta_lactamase_chain.sequence)
print("atom37_positions shape: ", beta_lactamase_chain.atom37_positions.shape)
print(beta_lactamase_chain.atom37_positions[:3])

# Visualize protein structure using py3Dmol
view = py3Dmol.view(width=500, height=500)
pdb_str = beta_lactamase_chain.to_pdb_string()
view.addModel(pdb_str, "pdb")
view.setStyle({"cartoon": {"color": "spectrum"}})
view.zoomTo()
view.show()

# Extract and visualize a specific motif
motif_inds = np.arange(123, 146)
motif_sequence = beta_lactamase_chain[motif_inds].sequence
motif_atom37_positions = beta_lactamase_chain[motif_inds].atom37_positions
print("Motif sequence: ", motif_sequence)
print("Motif atom37_positions shape: ", motif_atom37_positions.shape)

view = py3Dmol.view(width=500, height=500)
view.addModel(pdb_str, "pdb")
view.setStyle({"cartoon": {"color": "lightgrey"}})
motif_res_inds = (motif_inds + 1).tolist()
view.zoomTo()
view.show()

HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
atom37_positions shape:  (263, 37, 3)
[[[  2.033  -8.121  80.082]
  [  2.759  -8.052  81.378]
  [  4.206  -8.489  81.153]
  [  2.018  -8.919  82.407]
  [  4.446  -9.392  80.35 ]
  [  2.516  -8.823  83.811]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [  2.023  -8.175  84.903]
  [  3.638  -9.516  84.244]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [  3.848  -9.271  85.535]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan

Motif sequence:  TAFLHNMGDHVTRLDRWEPELNE
Motif atom37_positions shape:  (23, 37, 3)


In [16]:
# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID
beta_lactamase_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

# Active site indices (example indices for the 7 active sites)
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    motif_sequence = beta_lactamase_chain[motif_inds].sequence
    motif_atom37_positions = beta_lactamase_chain[motif_inds].atom37_positions
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = beta_lactamase_chain.to_pdb_string()
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()


Active Site 1 Motif sequence:  LCGAVLSRIDAGQEQLGRRIHYSQN
Active Site 1 Motif atom37_positions shape:  (25, 37, 3)


Sequence prompt:  ________________________________________________________________________LCGAVLSRIDAGQEQLGRRIHYSQN_______________________________________________________________________________________________________
Length of sequence prompt:  200
Structure prompt shape:  torch.Size([200, 37, 3])
Indices with structure conditioning:  [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96]


NameError: name 'model' is not defined

In [17]:
# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID

# Use Biopython to parse the PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"https://files.rcsb.org/download/{pdb_id}.pdb")
model_structure = structure[0]  # Get the first model
chain = model_structure[chain_id]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

FileNotFoundError: [Errno 2] No such file or directory: 'https://files.rcsb.org/download/7qlp.pdb'

In [18]:
# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID
beta_lactamase_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

# Active site indices (example indices for the 7 active sites)
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    motif_sequence = beta_lactamase_chain[motif_inds].sequence
    motif_atom37_positions = beta_lactamase_chain[motif_inds].atom37_positions
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = beta_lactamase_chain.to_pdb_string()
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()


Active Site 1 Motif sequence:  LCGAVLSRIDAGQEQLGRRIHYSQN
Active Site 1 Motif atom37_positions shape:  (25, 37, 3)


Sequence prompt:  ________________________________________________________________________LCGAVLSRIDAGQEQLGRRIHYSQN_______________________________________________________________________________________________________
Length of sequence prompt:  200
Structure prompt shape:  torch.Size([200, 37, 3])
Indices with structure conditioning:  [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96]


NameError: name 'model' is not defined

In [19]:
# Load protein structure from a local PDB file
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")  # Ensure the PDB file is in the same directory
model_structure = structure[0]  # Get the first model
chain = model_structure[chain_id]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

FileNotFoundError: [Errno 2] No such file or directory: '7qlp.pdb'

In [20]:
!pip install biopython


[0m

In [21]:
import os
from Bio import PDB

# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDB.PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Check if the file was downloaded successfully
if not os.path.exists(pdb_file_path):
    raise FileNotFoundError(f"Failed to download PDB file for {pdb_id}")

# Now parse the local PDB file
parser = PDB.PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# ... rest of your code ...

# Clean up: remove the downloaded PDB file after processing
os.remove(pdb_file_path)


Downloading PDB structure '7qlp'...




In [22]:

# Load protein structure from a local PDB file
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")  # Ensure the PDB file is in the same directory
model_structure = structure[0]  # Get the first model
chain = model_structure[chain_id]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

FileNotFoundError: [Errno 2] No such file or directory: '7qlp.pdb'

In [23]:
import os
from Bio import PDB

# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDB.PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Check if the file was downloaded successfully
if not os.path.exists(pdb_file_path):
    raise FileNotFoundError(f"Failed to download PDB file for {pdb_id}")

# Now parse the local PDB file
parser = PDB.PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# ... rest of your code ...

# Clean up: remove the downloaded PDB file after processing
os.remove(pdb_file_path)






# Load protein structure from a local PDB file
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")  # Ensure the PDB file is in the same directory
model_structure = structure[0]  # Get the first model
chain = model_structure[chain_id]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

Downloading PDB structure '7qlp'...




FileNotFoundError: [Errno 2] No such file or directory: '7qlp.pdb'

In [24]:
from Bio.PDB import PDBParser, PPBuilder, PDBList
# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


NameError: name 'PDBList' is not defined

In [32]:
from Bio.PDB import PDBParser, PPBuilder, PDBList

from esm.models.esm3 import ESM3
import torch

# Load ESM3 model onto CUDA-enabled GPU
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda"))

# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


Downloading PDB structure '7qlp'...
Full protein sequence: HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
atom37_positions shape:  (478, 37, 3)

Processing Active Site 1
Active Site 1 Motif sequence:  LCGAVLSRIDAGQEQLGRRIHYSQN
Active Site 1 Motif atom37_positions shape:  (25, 37, 3)




Sequence prompt:  ________________________________________________________________________LCGAVLSRIDAGQEQLGRRIHYSQN_______________________________________________________________________________________________________
Length of sequence prompt:  200
Structure prompt shape:  torch.Size([200, 37, 3])
Indices with structure conditioning:  [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96]


NameError: name 'model' is not defined

In [33]:
from Bio.PDB import PDBParser, PPBuilder, PDBList

from esm.models.esm3 import ESM3
import torch

# Load ESM3 model onto CUDA-enabled GPU
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cpu")

# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

In [34]:
from Bio.PDB import PDBParser, PPBuilder, PDBList

from esm.models.esm3 import ESM3
import torch

# Load ESM3 model onto CUDA-enabled GPU
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cpu"))

# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


SyntaxError: '(' was never closed (3971933513.py, line 7)

In [35]:
from Bio.PDB import PDBParser, PPBuilder, PDBList

from esm.models.esm3 import ESM3
import torch

# Load ESM3 model onto CUDA-enabled GPU
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cpu"))

# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

In [1]:
 Set up CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load ESM3 model
model_name = "facebook/esm3_t12_8M_UR50S"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).to(device)

# Load protein structure from a local PDB file
pdb_id = "7qlp"
chain_id = "A"

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")
model_structure = structure[0]
chain = model_structure[chain_id]

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Function to generate optimized sequence
def generate_optimized_sequence(input_sequence, num_variants=5):
    input_ids = tokenizer.encode(input_sequence, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            input_ids, 
            max_length=len(input_sequence) + 10,
            num_return_sequences=num_variants,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in output]

# Function to calculate sequence similarity
def sequence_similarity(seq1, seq2):
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    
    # Generate optimized sequences
    optimized_sequences = generate_optimized_sequence(motif_sequence)
    
    print(f"Generated {len(optimized_sequences)} potential inhibitor designs:")
    for j, opt_seq in enumerate(optimized_sequences):
        similarity = sequence_similarity(motif_sequence, opt_seq)
        print(f"  Design {j+1}: {opt_seq}")
        print(f"    Similarity to original: {similarity:.2f}")
    
    # Visualize the original motif
    view = py3Dmol.view(width=800, height=400)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo({"resi": motif_res_inds})
    view.show()


SyntaxError: expected 'else' after 'if' expression (2042777606.py, line 1)

In [2]:

# Set up CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load ESM3 model
model_name = "facebook/esm3_t12_8M_UR50S"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).to(device)

# Load protein structure from a local PDB file
pdb_id = "7qlp"
chain_id = "A"

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")
model_structure = structure[0]
chain = model_structure[chain_id]

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Function to generate optimized sequence
def generate_optimized_sequence(input_sequence, num_variants=5):
    input_ids = tokenizer.encode(input_sequence, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            input_ids, 
            max_length=len(input_sequence) + 10,
            num_return_sequences=num_variants,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in output]

# Function to calculate sequence similarity
def sequence_similarity(seq1, seq2):
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    
    # Generate optimized sequences
    optimized_sequences = generate_optimized_sequence(motif_sequence)
    
    print(f"Generated {len(optimized_sequences)} potential inhibitor designs:")
    for j, opt_seq in enumerate(optimized_sequences):
        similarity = sequence_similarity(motif_sequence, opt_seq)
        print(f"  Design {j+1}: {opt_seq}")
        print(f"    Similarity to original: {similarity:.2f}")
    
    # Visualize the original motif
    view = py3Dmol.view(width=800, height=400)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo({"resi": motif_res_inds})
    view.show()

    print("\nVisualization of the original motif shown above.")
    print("The cyan region represents the active site.")
    
    print(f"\nNote: Further analysis is required to determine the actual inhibition potential of these designs.")

print("\nDesign process completed for all 7 active sites.")

NameError: name 'torch' is not defined

In [3]:
# Install required packages
!pip install torch transformers biopython py3Dmol numpy

# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from Bio.PDB import PDBParser, PPBuilder
import py3Dmol
import numpy as np
from io import StringIO
from Bio.PDB import PDBIO

# Set up CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Rest of the code remains the same...

[0mUsing device: cuda


In [4]:
!pip install torch transformers biopython py3Dmol numpy

[0m

In [5]:
# Install required packages
!pip install torch transformers biopython py3Dmol numpy

# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from Bio.PDB import PDBParser, PPBuilder
import py3Dmol
import numpy as np
from io import StringIO
from Bio.PDB import PDBIO

# Set up CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load ESM3 model
model_name = "facebook/esm3_t12_8M_UR50S"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).to(device)

# Load protein structure from a local PDB file
pdb_id = "7qlp"
chain_id = "A"

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")
model_structure = structure[0]
chain = model_structure[chain_id]

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Function to generate optimized sequence
def generate_optimized_sequence(input_sequence, num_variants=5):
    input_ids = tokenizer.encode(input_sequence, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            input_ids, 
            max_length=len(input_sequence) + 10,
            num_return_sequences=num_variants,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in output]

# Function to calculate sequence similarity
def sequence_similarity(seq1, seq2):
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    
    # Generate optimized sequences
    optimized_sequences = generate_optimized_sequence(motif_sequence)
    
    print(f"Generated {len(optimized_sequences)} potential inhibitor designs:")
    for j, opt_seq in enumerate(optimized_sequences):
        similarity = sequence_similarity(motif_sequence, opt_seq)
        print(f"  Design {j+1}: {opt_seq}")
        print(f"    Similarity to original: {similarity:.2f}")
    
    # Visualize the original motif
    view = py3Dmol.view(width=800, height=400)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo({"resi": motif_res_inds})
    view.show()

    print("\nVisualization of the original motif shown above.")
    print("The cyan region represents the active site.")
    
    print(f"\nNote: Further analysis is required to determine the actual inhibition potential of these designs.")

print("\nDesign process completed for all 7 active sites.")

[0mUsing device: cuda


OSError: facebook/esm3_t12_8M_UR50S is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

In [6]:

# Load protein structure from a local PDB file
pdb_id = "7qlp"
chain_id = "A"

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"{pdb_id}.pdb")
model_structure = structure[0]
chain = model_structure[chain_id]

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Function to generate optimized sequence
def generate_optimized_sequence(input_sequence, num_variants=5):
    input_ids = tokenizer.encode(input_sequence, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            input_ids, 
            max_length=len(input_sequence) + 10,
            num_return_sequences=num_variants,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in output]

# Function to calculate sequence similarity
def sequence_similarity(seq1, seq2):
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    
    # Generate optimized sequences
    optimized_sequences = generate_optimized_sequence(motif_sequence)
    
    print(f"Generated {len(optimized_sequences)} potential inhibitor designs:")
    for j, opt_seq in enumerate(optimized_sequences):
        similarity = sequence_similarity(motif_sequence, opt_seq)
        print(f"  Design {j+1}: {opt_seq}")
        print(f"    Similarity to original: {similarity:.2f}")
    
    # Visualize the original motif
    view = py3Dmol.view(width=800, height=400)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo({"resi": motif_res_inds})
    view.show()

FileNotFoundError: [Errno 2] No such file or directory: '7qlp.pdb'

In [7]:
# Install required packages
!pip install torch transformers biopython py3Dmol numpy

[0m

In [8]:

# Fetch PDB file content
pdb_id = "7qlp"
url = f"https://files.rcsb.org/view/{pdb_id}.pdb"
response = requests.get(url)
if response.status_code != 200:
    raise Exception(f"Failed to fetch PDB file: {response.status_code}")
pdb_content = response.text

# Parse the PDB content
parser = PDBParser()
structure = parser.get_structure(pdb_id, StringIO(pdb_content))
model_structure = structure[0]
chain = model_structure["A"]

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to generate optimized sequence
def generate_optimized_sequence(input_sequence, num_variants=5):
    input_ids = tokenizer.encode(input_sequence, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            input_ids, 
            max_length=len(input_sequence) + 10,
            num_return_sequences=num_variants,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in output]

# Function to calculate sequence similarity
def sequence_similarity(seq1, seq2):
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    
    # Generate optimized sequences
    optimized_sequences = generate_optimized_sequence(motif_sequence)
    
    print(f"Generated {len(optimized_sequences)} potential inhibitor designs:")
    for j, opt_seq in enumerate(optimized_sequences):
        similarity = sequence_similarity(motif_sequence, opt_seq)
        print(f"  Design {j+1}: {opt_seq}")
        print(f"    Similarity to original: {similarity:.2f}")
    
    # Visualize the original motif
    view = py3Dmol.view(width=800, height=400)
    view.addModel(pdb_content, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo({"resi": motif_res_inds})
    view.show()

    print("\nVisualization of the original motif shown above.")
    print("The cyan region represents the active site.")
    
    print(f"\nNote: Further analysis is required to determine the actual inhibition potential of these designs.")

print("\nDesign process completed for all 7 active sites.")

NameError: name 'requests' is not defined

In [9]:
# Install required packages
!pip install torch transformers biopython py3Dmol numpy requests

# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from Bio.PDB import PDBParser, PPBuilder
import py3Dmol
import numpy as np
from io import StringIO
import requests  # Add this import

# Rest of the code remains the same...

[0m

In [10]:

# Fetch PDB file content
pdb_id = "7qlp"
url = f"https://files.rcsb.org/view/{pdb_id}.pdb"
response = requests.get(url)
if response.status_code != 200:
    raise Exception(f"Failed to fetch PDB file: {response.status_code}")
pdb_content = response.text

# Parse the PDB content
parser = PDBParser()
structure = parser.get_structure(pdb_id, StringIO(pdb_content))
model_structure = structure[0]
chain = model_structure["A"]

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to generate optimized sequence
def generate_optimized_sequence(input_sequence, num_variants=5):
    input_ids = tokenizer.encode(input_sequence, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            input_ids, 
            max_length=len(input_sequence) + 10,
            num_return_sequences=num_variants,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in output]

# Function to calculate sequence similarity
def sequence_similarity(seq1, seq2):
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    
    # Generate optimized sequences
    optimized_sequences = generate_optimized_sequence(motif_sequence)
    
    print(f"Generated {len(optimized_sequences)} potential inhibitor designs:")
    for j, opt_seq in enumerate(optimized_sequences):
        similarity = sequence_similarity(motif_sequence, opt_seq)
        print(f"  Design {j+1}: {opt_seq}")
        print(f"    Similarity to original: {similarity:.2f}")
    
    # Visualize the original motif
    view = py3Dmol.view(width=800, height=400)
    view.addModel(pdb_content, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo({"resi": motif_res_inds})
    view.show()

    print("\nVisualization of the original motif shown above.")
    print("The cyan region represents the active site.")
    
    print(f"\nNote: Further analysis is required to determine the actual inhibition potential of these designs.")

print("\nDesign process completed for all 7 active sites.")



Full protein sequence: HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW

Processing Active Site 1
Active Site 1 Motif sequence:  LCGAVLSRIDAGQEQLGRRIHYSQN


NameError: name 'tokenizer' is not defined

In [11]:
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda" if torch.cuda.is_available() else "cpu"))


NameError: name 'ESM3' is not defined

In [12]:
%set_env TOKENIZERS_PARALLELISM=false


env: TOKENIZERS_PARALLELISM=false


In [13]:
login(token="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")


NameError: name 'login' is not defined

In [15]:
token="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"

In [16]:
# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)


NameError: name 'PDBList' is not defined

In [17]:
# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)


NameError: name 'PDBList' is not defined

In [18]:

# Download and parse PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]
chain = model_structure["A"]

# Extract sequence and atomic coordinates
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Process each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


NameError: name 'PDBList' is not defined

In [19]:
from Bio.PDB import PDBParser, PPBuilder, PDBList

# ... (other imports and code)

# Download the PDB file
pdb_id = "7qlp"
pdbl = PDBList()
pdb_file_path = pdbl.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')


Structure exists: './pdb7qlp.ent' 


In [20]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder, PDBList
import os


# Download and parse PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]
chain = model_structure["A"]

# Extract sequence and atomic coordinates
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Process each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


Structure exists: './pdb7qlp.ent' 
Full protein sequence: HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
atom37_positions shape:  (478, 37, 3)

Processing Active Site 1
Active Site 1 Motif sequence:  LCGAVLSRIDAGQEQLGRRIHYSQN
Active Site 1 Motif atom37_positions shape:  (25, 37, 3)




Sequence prompt:  ________________________________________________________________________LCGAVLSRIDAGQEQLGRRIHYSQN_______________________________________________________________________________________________________
Structure prompt shape:  torch.Size([200, 37, 3])


NameError: name 'ESMProtein' is not defined

In [21]:
from esm.sdk.api import ESMProtein, GenerationConfig


  @autocast(enabled=False)
  @autocast(enabled=False)
  @autocast(enabled=False)


In [23]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder, PDBList
import os


In [24]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder, PDBList
import os


# Download and parse PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]
chain = model_structure["A"]

# Extract sequence and atomic coordinates
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Process each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


Structure exists: './pdb7qlp.ent' 
Full protein sequence: HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
atom37_positions shape:  (478, 37, 3)

Processing Active Site 1
Active Site 1 Motif sequence:  LCGAVLSRIDAGQEQLGRRIHYSQN
Active Site 1 Motif atom37_positions shape:  (25, 37, 3)




Sequence prompt:  ________________________________________________________________________LCGAVLSRIDAGQEQLGRRIHYSQN_______________________________________________________________________________________________________
Structure prompt shape:  torch.Size([200, 37, 3])


NameError: name 'model' is not defined

In [25]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder, PDBList
import os

# Load ESM3 model onto CUDA-enabled GPU or CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
try:
    model = ESM3.from_pretrained("esm3_sm_open_v1", device=device)
except Exception as e:
    print(f"Error loading model: {e}")
    print("Attempting to load model from local cache...")
    model = ESM3.from_pretrained("esm3_sm_open_v1", device=device, local_files_only=True)

# Verify model is loaded
if model is None:
    raise ValueError("Failed to load the ESM3 model. Please check your internet connection and Hugging Face token.")

print("Model loaded successfully.")

# Download the PDB file
pdb_id = "7qlp"
pdbl = PDBList()
pdb_file_path = pdbl.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]
chain = model_structure["A"]

# Extract sequence and atomic coordinates
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Process each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


Error loading model: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.
Attempting to load model from local cache...


TypeError: ESM3.from_pretrained() got an unexpected keyword argument 'local_files_only'

In [26]:
from esm.models.esm3 import ESM3
import torch

# Load ESM3 model onto CUDA-enabled GPU or CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
try:
    model = ESM3.from_pretrained("esm3_sm_open_v1", device=device)
except Exception as e:
    print(f"Error loading model: {e}")
    print("Unable to load the model. Please check your internet connection and Hugging Face token.")
    raise

# Verify model is loaded
if model is None:
    raise ValueError("Failed to load the ESM3 model.")

print("Model loaded successfully.")


Error loading model: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.
Unable to load the model. Please check your internet connection and Hugging Face token.


LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

In [27]:
# Install required packages
!pip install esm py3Dmol numpy torch huggingface_hub biopython

# Import necessary libraries
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder, PDBList
import os

# Set environment variable
%env TOKENIZERS_PARALLELISM=false

# Retrieve the Hugging Face token from environment variables
hf_token = os.getenv("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")

# Log in to Hugging Face Hub
login(token=hf_token)

# Load ESM3 model onto CUDA-enabled GPU or CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
try:
    model = ESM3.from_pretrained("esm3_sm_open_v1", device=device)
except Exception as e:
    print(f"Error loading model: {e}")
    print("Unable to load the model. Please check your internet connection and Hugging Face token.")
    raise

# Verify model is loaded
if model is None:
    raise ValueError("Failed to load the ESM3 model.")

print("Model loaded successfully.")

# Download the PDB file
pdb_id = "7qlp"
pdbl = PDBList()
pdb_file_path = pdbl.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]
chain = model_structure["A"]

# Extract sequence and atomic coordinates
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Process each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


[0menv: TOKENIZERS_PARALLELISM=false


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Error loading model: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.
Unable to load the model. Please check your internet connection and Hugging Face token.


LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

In [1]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder, PDBList
import os


# Download and parse PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]
chain = model_structure["A"]

# Extract sequence and atomic coordinates
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Process each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


ModuleNotFoundError: No module named 'py3Dmol'

In [2]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder, PDBList
import os


# Download and parse PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]
chain = model_structure["A"]

# Extract sequence and atomic coordinates
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Process each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


ModuleNotFoundError: No module named 'py3Dmol'

In [3]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.PDB import PDBParser, PPBuilder

# Define the PDB ID and file path
pdb_id = "7qlp"
pdb_file_path = f"{pdb_id}.pdb"

# Download the PDB file if not already present
from Bio.PDB import PDBList
pdbl = PDBList()
pdbl.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)

# Extract the amino acid sequence
ppb = PPBuilder()
seq = ppb.build_peptides(structure)[0].get_sequence()

# Define the active site indices (adjust these based on your previous analysis)
active_site_indices = [
    range(50, 75),
    range(100, 125),
    range(150, 175),
    range(200, 225),
    range(250, 275),
    range(300, 325),
    range(350, 375)
]

# Function to get codons for a given amino acid sequence
def get_codons(aa_seq):
    # This is a simplified codon table. In reality, there are multiple codons for most amino acids.
    codon_table = {
        'A': 'GCT', 'C': 'TGT', 'D': 'GAT', 'E': 'GAA', 'F': 'TTT',
        'G': 'GGT', 'H': 'CAT', 'I': 'ATT', 'K': 'AAA', 'L': 'CTT',
        'M': 'ATG', 'N': 'AAT', 'P': 'CCT', 'Q': 'CAA', 'R': 'CGT',
        'S': 'TCT', 'T': 'ACT', 'V': 'GTT', 'W': 'TGG', 'Y': 'TAT'
    }
    return ''.join(codon_table[aa] for aa in aa_seq)

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = seq[site_range.start:site_range.stop]
    codon_seq = get_codons(site_seq)
    
    print(f"Active Site {i}:")
    print(f"Amino Acid Sequence: {site_seq}")
    print(f"Codon Sequence: {codon_seq}")
    print()


ModuleNotFoundError: No module named 'Bio'

In [6]:
# Download the PDB file
pdb_id = "7qlp"
pdbl = PDBList()
pdb_file_path = pdbl.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)


NameError: name 'PDBList' is not defined

In [7]:
# Extract the amino acid sequence
ppb = PPBuilder()
sequence = ppb.build_peptides(structure)[0].get_sequence()
print("Full protein sequence:", sequence)

# Define the codon table
def get_codons(aa_seq):
    codon_table = {
        'A': 'GCT', 'C': 'TGT', 'D': 'GAT', 'E': 'GAA', 'F': 'TTT',
        'G': 'GGT', 'H': 'CAT', 'I': 'ATT', 'K': 'AAA', 'L': 'CTT',
        'M': 'ATG', 'N': 'AAT', 'P': 'CCT', 'Q': 'CAA', 'R': 'CGT',
        'S': 'TCT', 'T': 'ACT', 'V': 'GTT', 'W': 'TGG', 'Y': 'TAT'
    }
    return ''.join(codon_table[aa] for aa in aa_seq)

# Define active site indices (adjust these based on your previous analysis)
active_site_indices = [
    range(50, 75),
    range(100, 125),
    range(150, 175),
    range(200, 225),
    range(250, 275),
    range(300, 325),
    range(350, 375)
]

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = sequence[site_range.start:site_range.stop]
    codon_seq = get_codons(site_seq)
    
    print(f"Active Site {i}:")
    print(f"Amino Acid Sequence: {site_seq}")
    print(f"Codon Sequence: {codon_seq}")
    print()


NameError: name 'PPBuilder' is not defined

In [10]:
# Import necessary libraries
import numpy as np
import torch
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder, PDBList
import os

ModuleNotFoundError: No module named 'esm'

In [11]:
# Download the PDB file
pdb_id = "7qlp"
pdbl = PDBList()
pdb_file_path = pdbl.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)

# Extract the amino acid sequence
ppb = PPBuilder()
sequence = ppb.build_peptides(structure)[0].get_sequence()
print("Full protein sequence:", sequence)

# Define the codon table
def get_codons(aa_seq):
    codon_table = {
        'A': 'GCT', 'C': 'TGT', 'D': 'GAT', 'E': 'GAA', 'F': 'TTT',
        'G': 'GGT', 'H': 'CAT', 'I': 'ATT', 'K': 'AAA', 'L': 'CTT',
        'M': 'ATG', 'N': 'AAT', 'P': 'CCT', 'Q': 'CAA', 'R': 'CGT',
        'S': 'TCT', 'T': 'ACT', 'V': 'GTT', 'W': 'TGG', 'Y': 'TAT'
    }
    return ''.join(codon_table[aa] for aa in aa_seq)

# Define active site indices (adjust these based on your previous analysis)
active_site_indices = [
    range(50, 75),
    range(100, 125),
    range(150, 175),
    range(200, 225),
    range(250, 275),
    range(300, 325),
    range(350, 375)
]

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = sequence[site_range.start:site_range.stop]
    codon_seq = get_codons(site_seq)
    
    print(f"Active Site {i}:")
    print(f"Amino Acid Sequence: {site_seq}")
    print(f"Codon Sequence: {codon_seq}")
    print()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


NameError: name 'PDBList' is not defined

In [12]:
import numpy as np
import torch
from esm.utils.structure.protein_chain import ProteinChain
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

# Load protein structure from PDB
pdb_id = "7QLP"
chain_id = "A"
beta_lactamase_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

# Create a prompt for ESM3
motif_inds = np.arange(50, 75)
motif_sequence = beta_lactamase_chain[motif_inds].sequence
motif_atom37_positions = beta_lactamase_chain[motif_inds].atom37_positions

sequence_prompt = ["_"]*200
sequence_prompt[50:75] = list(motif_sequence)
sequence_prompt = "".join(sequence_prompt)

structure_prompt = torch.full((200, 37, 3), np.nan)
structure_prompt[50:75] = torch.tensor(motif_atom37_positions)

protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

# Generate new protein sequence
sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
sequence_generation = model.generate(protein_prompt, sequence_generation_config)

# Predict structure
structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

# Visualize generated structure
import py3Dmol
view = py3Dmol.view(width=1000, height=500)
view.addModel(beta_lactamase_chain.to_pdb_string(), "pdb")
view.addModel(structure_prediction.to_protein_chain().to_pdb_string(), "pdb")
view.setStyle({"cartoon": {"color": "lightgrey"}})
view.zoomTo()
view.show()


ModuleNotFoundError: No module named 'esm'

In [13]:
# Use an official Python runtime as a parent image
FROM python:3.8-slim

# Install dependencies
RUN apt-get update && apt-get install -y \
    git \
    curl \
    && apt-get clean

# Install Jupyter
RUN pip install jupyter

# Install Hugging Face Transformers
RUN pip install transformers

# Set up environment variables
ENV TOKENIZERS_PARALLELISM=false

# Create a working directory
WORKDIR /app

# Copy the Jupyter notebook file
COPY your_notebook.ipynb /app

# Expose the port Jupyter runs on
EXPOSE 8888

# Run Jupyter notebook
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]


SyntaxError: invalid syntax (667033890.py, line 2)

In [14]:
# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /usr/src/app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip install --no-cache-dir jupyterlab torch

# Install ESM package from Hugging Face
RUN pip install --no-cache-dir git+https://github.com/facebookresearch/esm.git

# Expose the port for Jupyter Notebook
EXPOSE 8888

# Run Jupyter Notebook
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]


SyntaxError: invalid syntax (1967147953.py, line 2)

In [15]:
!pip install torch
!pip install transformers
!pip install esm


[0mCollecting transformers
  Downloading transformers-4.44.0-py3-none-any.whl.metadata (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.7/43.7 kB[0m [31m1.1 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers)
  Downloading huggingface_hub-0.24.5-py3-none-any.whl.metadata (13 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2024.7.24-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.5/40.5 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
Collecting safetensors>=0.4.1 (from transformers)
  Downloading safetensors-0.4.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.44.0-py3-none-any.whl (9.5 MB

In [16]:
import torch
from transformers import EsmForProteinFolding, AutoTokenizer

# Load the ESM-3 model from Hugging Face
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")
tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")

# Example protein sequence (replace with your specific sequence)
protein_sequence = "MGSSHHHHHHSSGLVPRGSHM...<Your Sequence>"

# Tokenize the input sequence
inputs = tokenizer(protein_sequence, return_tensors="pt", padding=True)

# Forward pass through the model to get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Extract motifs from the model's output (e.g., from the logits or attention)
motifs = outputs.logits.argmax(dim=-1)

# Decode the motifs
decoded_motifs = tokenizer.decode(motifs.squeeze().tolist())

# Display the motifs
print("Extracted Motifs:")
print(decoded_motifs)


RuntimeError: Failed to import transformers.models.esm.modeling_esmfold because of the following error (look up to see its traceback):
cannot import name 'is_sparse_any' from 'torch._subclasses.meta_utils' (/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py)

In [17]:
# Install necessary packages
!pip install pymol-open-source torch transformers esm

import torch
from transformers import EsmForProteinFolding, AutoTokenizer
import pymol
from pymol import cmd
from pymol.cgo import *

# Initialize PyMOL in the Jupyter Notebook
pymol.finish_launching(['pymol', '-cq'])

# Load the ESM-3 model from Hugging Face
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")
tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")

# Full protein sequence for 7qlp
protein_sequence = "HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW"

# Tokenize the input sequence
inputs = tokenizer(protein_sequence, return_tensors="pt", padding=True)

# Forward pass through the model to get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Extract motifs from the model's output (using logits for example)
motifs = outputs.logits.argmax(dim=-1)

# Decode the motifs (in this case, we'll just map them back to the sequence)
decoded_motifs = [protein_sequence[i] for i in motifs.squeeze().tolist()]

# Load the protein structure from the PDB ID
cmd.fetch('7qlp', async_=0)

# Define a dummy function to extract the motif positions; this should be replaced with actual motif extraction logic
# Here we'll just assume positions 45-48 and 120-122 as example motif positions.
motif_residues = [45, 46, 47, 48, 120, 121, 122]

# Highlight the motifs in the protein structure
for resi in motif_residues:
    cmd.color('red', f'resi {resi}')
    cmd.show('sticks', f'resi {resi}')

# Visualize the protein structure with motifs
cmd.show('cartoon')
cmd.color('blue', 'all')
cmd.png('/mnt/data/protein_motifs_visualization.png', dpi=300)  # Save the image

# Quit PyMOL session
cmd.quit()


[31mERROR: Could not find a version that satisfies the requirement pymol-open-source (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for pymol-open-source[0m[31m
[0m

RuntimeError: Failed to import transformers.models.esm.modeling_esmfold because of the following error (look up to see its traceback):
cannot import name 'is_sparse_any' from 'torch._subclasses.meta_utils' (/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py)

In [18]:
# Install necessary packages
!pip install torch esm pymol-open-source

import torch
import esm
from pymol import cmd
from pymol.cgo import *

# Initialize PyMOL in the Jupyter Notebook
pymol.finish_launching(['pymol', '-cq'])

# Load the ESM-3 model from Hugging Face using the esm package
model, alphabet = esm.pretrained.esmfold_v1()
batch_converter = alphabet.get_batch_converter()

# Full protein sequence for 7qlp
protein_sequence = "HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW"

# Prepare the input data
data = [("protein", protein_sequence)]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Run the model to get the predicted structure
with torch.no_grad():
    output = model(batch_tokens, repr_layers=[33], return_contacts=True)
    logits = output["logits"]
    contacts = output["contacts"]

# Extract motifs (for demonstration purposes, select certain residues as motifs)
motif_positions = [45, 46, 47, 48, 120, 121, 122]

# Load the protein structure from the PDB ID
cmd.fetch('7qlp', async_=0)

# Highlight the motifs in the protein structure
for pos in motif_positions:
    cmd.color('red', f'resi {pos}')
    cmd.show('sticks', f'resi {pos}')

# Visualize the protein structure with motifs
cmd.show('cartoon')
cmd.color('blue', 'all')
cmd.png('/mnt/data/protein_motifs_visualization.png', dpi=300)  # Save the image

# Quit PyMOL session
cmd.quit()


[31mERROR: Could not find a version that satisfies the requirement pymol-open-source (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for pymol-open-source[0m[31m
[0m

ModuleNotFoundError: No module named 'pymol'

In [19]:
# Install necessary packages
!pip install torch esm

import torch
import esm

# Load the ESMFold model
model, alphabet = esm.pretrained.esmfold_v1()
batch_converter = alphabet.get_batch_converter()

# Full protein sequence for 7qlp
protein_sequence = "HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW"

# Prepare the input data
data = [("protein", protein_sequence)]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Run the model to get the predicted structure
with torch.no_grad():
    output = model(batch_tokens, repr_layers=[33], return_contacts=True)
    logits = output["logits"]
    contacts = output["contacts"]

# Select specific positions as motifs (example positions)
motif_positions = [45, 46, 47, 48, 120, 121, 122]

# Create a dictionary to map motif positions to amino acids
motif_sequence = {pos: protein_sequence[pos - 1] for pos in motif_positions}

# Save the motif information for visualization
with open('/mnt/data/motif_positions.txt', 'w') as f:
    for pos, aa in motif_sequence.items():
        f.write(f'{pos}: {aa}\n')

print("Motif positions and their respective amino acids have been saved to 'motif_positions.txt'.")


[0m

AttributeError: module 'esm' has no attribute 'pretrained'

In [20]:
import torch
import esm

# Load the ESM-3 model from Hugging Face
model, alphabet = esm.pretrained.esmfold_v1()
batch_converter = alphabet.get_batch_converter()

# Full protein sequence for 7qlp
protein_sequence = "HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW"

# Prepare the input data
data = [("protein", protein_sequence)]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Run the model to get the predicted structure
with torch.no_grad():
    output = model(batch_tokens, repr_layers=[33], return_contacts=True)
    logits = output["logits"]
    contacts = output["contacts"]

# Define specific positions as motifs (example positions)
motif_positions = [45, 46, 47, 48, 120, 121, 122]

# Create a dictionary to map motif positions to amino acids
motif_sequence = {pos: protein_sequence[pos - 1] for pos in motif_positions}

# Save the motif information for visualization
with open('/mnt/data/motif_positions.txt', 'w') as f:
    for pos, aa in motif_sequence.items():
        f.write(f'{pos}: {aa}\n')

print("Motif positions and their respective amino acids have been saved to 'motif_positions.txt'.")


AttributeError: module 'esm' has no attribute 'pretrained'

In [21]:
import torch
import esm

# Load the ESM-3 model
model = esm.pretrained.esmfold_v1()
batch_converter = model.alphabet.get_batch_converter()

# Full protein sequence for 7qlp
protein_sequence = "HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW"

# Prepare the input data
data = [("protein", protein_sequence)]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Run the model to get the predicted structure
with torch.no_grad():
    output = model.infer(batch_tokens)

# Example: Define specific positions as motifs (based on your research or predictions)
motif_positions = [45, 46, 47, 48, 120, 121, 122]

# Create a dictionary to map motif positions to amino acids
motif_sequence = {pos: protein_sequence[pos - 1] for pos in motif_positions}

# Save the motif information for visualization
with open('/mnt/data/motif_positions.txt', 'w') as f:
    for pos, aa in motif_sequence.items():
        f.write(f'{pos}: {aa}\n')

print("Motif positions and their respective amino acids have been saved to 'motif_positions.txt'.")


AttributeError: module 'esm' has no attribute 'pretrained'

In [22]:
# Import necessary libraries
import torch
from esm import pretrained, InverseFoldingModel, inverse_folding

# Load the ESM3 model
model, alphabet = pretrained.esm_if3_gvp4_t16_142m_UR50()
model.eval()  # Set model to evaluation mode

# Load the structure file (Make sure to replace with the correct path or ensure the file is in the same directory)
pdb_id = "7qld"
chain_id = "A"  # Assuming you're working with chain A

# Define the motif sequence based on the active site information provided
motif_sequence = "LCGAVLSRIDAGQEQLGRRIHYSQN"

# Extract atomic coordinates using PyMOL (or directly if you have preprocessed the PDB file)

# Assuming you already have the atom37_positions for the motif, based on the uploaded file data:
atom37_positions = torch.tensor([ # hypothetical example, this should match your specific data
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], ...],  # 37 atoms for each residue
])

# Use the InverseFoldingModel to design inhibitor motifs
inverse_folding_model = InverseFoldingModel(model)
designed_motif, _ = inverse_folding.inverse_fold(motif_sequence, atom37_positions, model=inverse_folding_model)

# Save the designed motif as a PDB file
with open("designed_motif.pdb", "w") as f:
    f.write(designed_motif)

# Output the designed sequence
print("Designed motif sequence:", designed_motif)


ImportError: cannot import name 'InverseFoldingModel' from 'esm' (/opt/conda/lib/python3.10/site-packages/esm/__init__.py)

In [23]:
# Import necessary libraries
import torch
from esm import pretrained

# Load the ESM3 model
model, alphabet = pretrained.esm_if3_gvp4_t16_142m_UR50()
model.eval()  # Set model to evaluation mode

# Define the motif sequence based on the active site information provided
motif_sequence = "LCGAVLSRIDAGQEQLGRRIHYSQN"

# Tokenize the motif sequence using the alphabet from the model
tokens = alphabet.get_batch_converter()([("", motif_sequence)])[2]

# Pass the tokens through the model to get the sequence embeddings
with torch.no_grad():
    results = model(tokens, repr_layers=[33], return_contacts=True)
    token_representations = results["representations"][33]

# Extract the embedding for the motif sequence
motif_embedding = token_representations.mean(1).squeeze()

# Output the motif embedding vector
print("Motif Embedding Vector:", motif_embedding)

# Optionally, save the embedding to a file for further analysis or use in inhibitor design
torch.save(motif_embedding, "motif_embedding.pt")


AttributeError: module 'esm.pretrained' has no attribute 'esm_if3_gvp4_t16_142m_UR50'

In [24]:
# Import necessary libraries
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder

# Set environment variable
%set_env TOKENIZERS_PARALLELISM=false

# Log in to Hugging Face Hub
login(token="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")

# Load ESM3 model onto CUDA-enabled GPU
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda"))

# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID

# Use Biopython to parse the PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"https://files.rcsb.org/download/{pdb_id}.pdb")
model_structure = structure[0]  # Get the first model
chain = model_structure[chain_id]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()


ModuleNotFoundError: No module named 'py3Dmol'

In [25]:
!pip install esm py3Dmol numpy torch huggingface_hub biopython


Collecting py3Dmol
  Downloading py3Dmol-2.3.0-py2.py3-none-any.whl.metadata (1.9 kB)
Downloading py3Dmol-2.3.0-py2.py3-none-any.whl (7.0 kB)
Installing collected packages: py3Dmol
Successfully installed py3Dmol-2.3.0
[0m

In [31]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from Bio.PDB import PDBParser, PPBuilder
%set_env TOKENIZERS_PARALLELISM=false
login(token="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")


model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda"))

pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID

# Use Biopython to parse the PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, f"https://files.rcsb.org/download/{pdb_id}.pdb")
model_structure = structure[0]  # Get the first model
chain = model_structure[chain_id]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print(sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)
print(atom37_positions[:3])

view = py3Dmol.view(width=500, height=500)
pdb_str = structure_to_pdb_string(structure)
view.addModel(pdb_str, "pdb")
view.setStyle({"cartoon": {"color": "spectrum"}})
view.zoomTo()
view.show()

from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

for i, motif_inds in enumerate(active_site_indices):
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()




env: TOKENIZERS_PARALLELISM=false
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Fetching 22 files:   0%|          | 0/22 [00:00<?, ?it/s]

LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

In [32]:
# Extract the amino acid sequence
ppb = PPBuilder()
sequence = ppb.build_peptides(structure)[0].get_sequence()
print("Full protein sequence:", sequence)

# Define the codon table
def get_codons(aa_seq):
    codon_table = {
        'A': 'GCT', 'C': 'TGT', 'D': 'GAT', 'E': 'GAA', 'F': 'TTT',
        'G': 'GGT', 'H': 'CAT', 'I': 'ATT', 'K': 'AAA', 'L': 'CTT',
        'M': 'ATG', 'N': 'AAT', 'P': 'CCT', 'Q': 'CAA', 'R': 'CGT',
        'S': 'TCT', 'T': 'ACT', 'V': 'GTT', 'W': 'TGG', 'Y': 'TAT'
    }
    return ''.join(codon_table[aa] for aa in aa_seq)

# Define active site indices (adjust these based on your previous analysis)
active_site_indices = [
    range(50, 75),
    range(100, 125),
    range(150, 175),
    range(200, 225),
    range(250, 275),
    range(300, 325),
    range(350, 375)
]

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = sequence[site_range.start:site_range.stop]
    codon_seq = get_codons(site_seq)
    
    print(f"Active Site {i}:")
    print(f"Amino Acid Sequence: {site_seq}")
    print(f"Codon Sequence: {codon_seq}")
    print()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


NameError: name 'structure' is not defined

In [33]:
# Extract the amino acid sequence
ppb = PPBuilder()
sequence = ppb.build_peptides(structure)[0].get_sequence()
print("Full protein sequence:", sequence)

# Define the codon table
def get_codons(aa_seq):
    codon_table = {
        'A': 'GCT', 'C': 'TGT', 'D': 'GAT', 'E': 'GAA', 'F': 'TTT',
        'G': 'GGT', 'H': 'CAT', 'I': 'ATT', 'K': 'AAA', 'L': 'CTT',
        'M': 'ATG', 'N': 'AAT', 'P': 'CCT', 'Q': 'CAA', 'R': 'CGT',
        'S': 'TCT', 'T': 'ACT', 'V': 'GTT', 'W': 'TGG', 'Y': 'TAT'
    }
    return ''.join(codon_table[aa] for aa in aa_seq)

# Define active site indices (adjust these based on your previous analysis)
active_site_indices = [
    range(50, 75),
    range(100, 125),
    range(150, 175),
    range(200, 225),
    range(250, 275),
    range(300, 325),
    range(350, 375)
]

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = sequence[site_range.start:site_range.stop]
    codon_seq = get_codons(site_seq)
    
    print(f"Active Site {i}:")
    print(f"Amino Acid Sequence: {site_seq}")
    print(f"Codon Sequence: {codon_seq}")
    print()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


NameError: name 'structure' is not defined

In [34]:
# Import necessary libraries
import torch
from esm import pretrained

# Load the ESM3 model
model, alphabet = pretrained.esm_if3_gvp4_t16_142m_UR50()
model.eval()  # Set model to evaluation mode

# Define the motif sequence based on the active site information provided
motif_sequence = "LCGAVLSRIDAGQEQLGRRIHYSQN"

# Tokenize the motif sequence using the alphabet from the model
tokens = alphabet.get_batch_converter()([("", motif_sequence)])[2]

# Pass the tokens through the model to get the sequence embeddings
with torch.no_grad():
    results = model(tokens, repr_layers=[33], return_contacts=True)
    token_representations = results["representations"][33]

# Extract the embedding for the motif sequence
motif_embedding = token_representations.mean(1).squeeze()

# Output the motif embedding vector
print("Motif Embedding Vector:", motif_embedding)

# Optionally, save the embedding to a file for further analysis or use in inhibitor design
torch.save(motif_embedding, "motif_embedding.pt")


AttributeError: module 'esm.pretrained' has no attribute 'esm_if3_gvp4_t16_142m_UR50'

In [35]:
from Bio.PDB import PDBParser, PPBuilder
import os

# Define the PDB ID and file path
pdb_id = "7qld"
pdb_file_path = f"{pdb_id}.pdb"

# Load the PDB structure using Bio.PDB
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)

# Extract the amino acid sequence from the structure
ppb = PPBuilder()
sequence = ppb.build_peptides(structure)[0].get_sequence()
print("Full protein sequence:", sequence)

# Define the codon table
def get_codons(aa_seq):
    codon_table = {
        'A': 'GCT', 'C': 'TGT', 'D': 'GAT', 'E': 'GAA', 'F': 'TTT',
        'G': 'GGT', 'H': 'CAT', 'I': 'ATT', 'K': 'AAA', 'L': 'CTT',
        'M': 'ATG', 'N': 'AAT', 'P': 'CCT', 'Q': 'CAA', 'R': 'CGT',
        'S': 'TCT', 'T': 'ACT', 'V': 'GTT', 'W': 'TGG', 'Y': 'TAT'
    }
    return ''.join(codon_table[aa] for aa in aa_seq)

# Define active site indices (adjust these based on your previous analysis)
active_site_indices = [
    range(50, 75),
    range(100, 125),
    range(150, 175),
    range(200, 225),
    range(250, 275),
    range(300, 325),
    range(350, 375)
]

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = sequence[site_range.start:site_range.stop]
    codon_seq = get_codons(site_seq)
    
    print(f"Active Site {i}:")
    print(f"Amino Acid Sequence: {site_seq}")
    print(f"Codon Sequence: {codon_seq}")
    print()

# Clean up: remove the downloaded PDB file if necessary
if os.path.exists(pdb_file_path):
    os.remove(pdb_file_path)


FileNotFoundError: [Errno 2] No such file or directory: '7qld.pdb'

In [36]:
from Bio.PDB import PDBParser, PPBuilder, PDBList
import os

# Define the PDB ID
pdb_id = "7qld"
pdb_file_path = f"{pdb_id}.pdb"

# Check if the PDB file exists, if not, download it
if not os.path.exists(pdb_file_path):
    pdbl = PDBList()
    pdbl.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')
    # The retrieved file will have a prefix 'pdb', so rename it
    os.rename(f"pdb{pdb_id}.ent", pdb_file_path)

# Load the PDB structure using Bio.PDB
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)

# Extract the amino acid sequence from the structure
ppb = PPBuilder()
sequence = ppb.build_peptides(structure)[0].get_sequence()
print("Full protein sequence:", sequence)

# Define the codon table
def get_codons(aa_seq):
    codon_table = {
        'A': 'GCT', 'C': 'TGT', 'D': 'GAT', 'E': 'GAA', 'F': 'TTT',
        'G': 'GGT', 'H': 'CAT', 'I': 'ATT', 'K': 'AAA', 'L': 'CTT',
        'M': 'ATG', 'N': 'AAT', 'P': 'CCT', 'Q': 'CAA', 'R': 'CGT',
        'S': 'TCT', 'T': 'ACT', 'V': 'GTT', 'W': 'TGG', 'Y': 'TAT'
    }
    return ''.join(codon_table[aa] for aa in aa_seq)

# Define active site indices (adjust these based on your previous analysis)
active_site_indices = [
    range(50, 75),
    range(100, 125),
    range(150, 175),
    range(200, 225),
    range(250, 275),
    range(300, 325),
    range(350, 375)
]

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = sequence[site_range.start:site_range.stop]
    codon_seq = get_codons(site_seq)
    
    print(f"Active Site {i}:")
    print(f"Amino Acid Sequence: {site_seq}")
    print(f"Codon Sequence: {codon_seq}")
    print()

# Clean up: remove the downloaded PDB file if necessary
if os.path.exists(pdb_file_path):
    os.remove(pdb_file_path)


Downloading PDB structure '7qld'...
Full protein sequence: HMATTINASSS
Active Site 1:
Amino Acid Sequence: 
Codon Sequence: 

Active Site 2:
Amino Acid Sequence: 
Codon Sequence: 

Active Site 3:
Amino Acid Sequence: 
Codon Sequence: 

Active Site 4:
Amino Acid Sequence: 
Codon Sequence: 

Active Site 5:
Amino Acid Sequence: 
Codon Sequence: 

Active Site 6:
Amino Acid Sequence: 
Codon Sequence: 

Active Site 7:
Amino Acid Sequence: 
Codon Sequence: 





In [37]:
# Define active site indices based on PyMOL observation (adjust as needed)
active_site_indices = [
    range(65, 90),    # Example range 1
    range(120, 145),  # Example range 2
    range(180, 205),  # Example range 3
    # Add more ranges as needed based on the visualization
]

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = sequence[site_range.start:site_range.stop]
    codon_seq = get_codons(site_seq)
    
    print(f"Active Site {i}:")
    print(f"Amino Acid Sequence: {site_seq}")
    print(f"Codon Sequence: {codon_seq}")
    print()


Active Site 1:
Amino Acid Sequence: 
Codon Sequence: 

Active Site 2:
Amino Acid Sequence: 
Codon Sequence: 

Active Site 3:
Amino Acid Sequence: 
Codon Sequence: 



In [38]:
# Active site ranges (for testing)
active_site_indices = [
    range(50, 75),    # Adjust as needed
    range(100, 125),  # Adjust as needed
    range(150, 175),  # Adjust as needed
    range(200, 225),  # Adjust as needed
    range(250, 275),  # Adjust as needed
    range(300, 325),  # Adjust as needed
    range(350, 375)   # Adjust as needed
]

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = sequence[site_range.start:site_range.stop]
    
    if site_seq:  # Check if the sequence is non-empty
        codon_seq = get_codons(site_seq)
        
        print(f"Active Site {i} range {site_range.start}-{site_range.stop}:")
        print(f"Amino Acid Sequence: {site_seq}")
        print(f"Codon Sequence: {codon_seq}")
        print()
    else:
        print(f"Active Site {i} range {site_range.start}-{site_range.stop}: Sequence not found")


Active Site 1 range 50-75: Sequence not found
Active Site 2 range 100-125: Sequence not found
Active Site 3 range 150-175: Sequence not found
Active Site 4 range 200-225: Sequence not found
Active Site 5 range 250-275: Sequence not found
Active Site 6 range 300-325: Sequence not found
Active Site 7 range 350-375: Sequence not found


In [39]:
print("HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW", len(sequence))


HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW 11


In [40]:
# Define the codon table function
def get_codons(aa_seq):
    codon_table = {
        'A': 'GCT', 'C': 'TGT', 'D': 'GAT', 'E': 'GAA', 'F': 'TTT',
        'G': 'GGT', 'H': 'CAT', 'I': 'ATT', 'K': 'AAA', 'L': 'CTT',
        'M': 'ATG', 'N': 'AAT', 'P': 'CCT', 'Q': 'CAA', 'R': 'CGT',
        'S': 'TCT', 'T': 'ACT', 'V': 'GTT', 'W': 'TGG', 'Y': 'TAT'
    }
    return ''.join(codon_table[aa] for aa in aa_seq if aa in codon_table)

# Define new active site indices based on a total length of 263 residues
active_site_indices = [
    range(10, 30),   # Site 1
    range(50, 70),   # Site 2
    range(90, 110),  # Site 3
    range(130, 150), # Site 4
    range(170, 190), # Site 5
    range(210, 230), # Site 6
    range(250, 263)  # Site 7 (adjusted to the end of the sequence)
]

# Extract and print the sequences for each active site
for i, site_range in enumerate(active_site_indices, 1):
    site_seq = sequence[site_range.start:site_range.stop]
    
    if site_seq:  # Check if the sequence is non-empty
        codon_seq = get_codons(site_seq)
        
        print(f"Active Site {i} range {site_range.start}-{site_range.stop}:")
        print(f"Amino Acid Sequence: {site_seq}")
        print(f"Codon Sequence: {codon_seq}")
        print()
    else:
        print(f"Active Site {i} range {site_range.start}-{site_range.stop}: Sequence not found")
›

Active Site 1 range 10-30:
Amino Acid Sequence: S
Codon Sequence: TCT

Active Site 2 range 50-70: Sequence not found
Active Site 3 range 90-110: Sequence not found
Active Site 4 range 130-150: Sequence not found
Active Site 5 range 170-190: Sequence not found
Active Site 6 range 210-230: Sequence not found
Active Site 7 range 250-263: Sequence not found
