# ESM3
ESM3 is a frontier generative model for biology, able to jointly reason across three fundamental biological properties of proteins: sequence, structure, and function. These three data modalities are represented as tracks of discrete tokens at the input and output of ESM3. You can present the model with a combination of partial inputs across the tracks, and ESM3 will provide output predictions for all the tracks.

ESM3 is a generative masked language model. You can prompt it with partial sequence, structure, and function keywords, and iteratively sample masked positions until all positions are unmasked. This iterative sampling is what the `.generate()` function does.

![image.png](https://github.com/evolutionaryscale/esm/blob/main/_assets/esm3_diagram.png?raw=true)

The ESM3 architecture is highly scalable due to its transformer backbone and all-to-all reasoning over discrete token sequences. At its largest scale, ESM3 was trained with 1.07e24 FLOPs on 2.78 billion proteins and 771 billion unique tokens, and has 98 billion parameters.
Here we present `esm3-open-small`. With 1.4B parameters it is the smallest and fastest model in the family, trained specifically to be open sourced. ESM3-open is available under a non-commercial license.

In [4]:
!pip install esm py3Dmol numpy torch huggingface_hub


[0m

In [5]:
import numpy as np
import torch
import py3Dmol
from huggingface_hub import login
from esm.utils.structure import ProteinChain
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
!pip install esm py3Dmol numpy torch huggingface_hub biopython


ImportError: cannot import name 'ProteinChain' from 'esm.utils.structure' (unknown location)

In [6]:
!pip install esm py3Dmol numpy torch huggingface_hub biopython


[0m

In [15]:
# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID
beta_lactamase_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

# Print protein sequence and atomic coordinates
print(beta_lactamase_chain.sequence)
print("atom37_positions shape: ", beta_lactamase_chain.atom37_positions.shape)
print(beta_lactamase_chain.atom37_positions[:3])

# Visualize protein structure using py3Dmol
view = py3Dmol.view(width=500, height=500)
pdb_str = beta_lactamase_chain.to_pdb_string()
view.addModel(pdb_str, "pdb")
view.setStyle({"cartoon": {"color": "spectrum"}})
view.zoomTo()
view.show()

# Extract and visualize a specific motif
motif_inds = np.arange(123, 146)
motif_sequence = beta_lactamase_chain[motif_inds].sequence
motif_atom37_positions = beta_lactamase_chain[motif_inds].atom37_positions
print("Motif sequence: ", motif_sequence)
print("Motif atom37_positions shape: ", motif_atom37_positions.shape)

view = py3Dmol.view(width=500, height=500)
view.addModel(pdb_str, "pdb")
view.setStyle({"cartoon": {"color": "lightgrey"}})
motif_res_inds = (motif_inds + 1).tolist()
view.zoomTo()
view.show()

HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
atom37_positions shape:  (263, 37, 3)
[[[  2.033  -8.121  80.082]
  [  2.759  -8.052  81.378]
  [  4.206  -8.489  81.153]
  [  2.018  -8.919  82.407]
  [  4.446  -9.392  80.35 ]
  [  2.516  -8.823  83.811]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [  2.023  -8.175  84.903]
  [  3.638  -9.516  84.244]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [  3.848  -9.271  85.535]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan]
  [    nan     nan     nan

Motif sequence:  TAFLHNMGDHVTRLDRWEPELNE
Motif atom37_positions shape:  (23, 37, 3)


In [16]:
# Load protein structure from PDB
pdb_id = "7qlp"  # PDB ID for the specific β-lactamase
chain_id = "A"  # Chain ID
beta_lactamase_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

# Active site indices (example indices for the 7 active sites)
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    motif_sequence = beta_lactamase_chain[motif_inds].sequence
    motif_atom37_positions = beta_lactamase_chain[motif_inds].atom37_positions
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = beta_lactamase_chain.to_pdb_string()
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(beta_lactamase_chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()


Active Site 1 Motif sequence:  LCGAVLSRIDAGQEQLGRRIHYSQN
Active Site 1 Motif atom37_positions shape:  (25, 37, 3)


Sequence prompt:  ________________________________________________________________________LCGAVLSRIDAGQEQLGRRIHYSQN_______________________________________________________________________________________________________
Length of sequence prompt:  200
Structure prompt shape:  torch.Size([200, 37, 3])
Indices with structure conditioning:  [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96]


NameError: name 'model' is not defined

In [21]:
import os
from Bio import PDB

# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDB.PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Check if the file was downloaded successfully
if not os.path.exists(pdb_file_path):
    raise FileNotFoundError(f"Failed to download PDB file for {pdb_id}")

# Now parse the local PDB file
parser = PDB.PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# ... rest of your code ...

# Clean up: remove the downloaded PDB file after processing
os.remove(pdb_file_path)


Downloading PDB structure '7qlp'...




In [32]:
from Bio.PDB import PDBParser, PPBuilder, PDBList

from esm.models.esm3 import ESM3
import torch

# Load ESM3 model onto CUDA-enabled GPU
model = ESM3.from_pretrained("esm3_sm_open_v1", device=torch.device("cuda"))

# Download the PDB file
pdb_id = "7qlp"
pdb_list = PDBList()
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir='.', file_format='pdb')

# Parse the local PDB file
parser = PDBParser()
structure = parser.get_structure(pdb_id, pdb_file_path)
model_structure = structure[0]  # Get the first model
chain = model_structure["A"]  # Get the specified chain

# Extract the sequence
ppb = PPBuilder()
sequence = "".join([str(pp.get_sequence()) for pp in ppb.build_peptides(chain)])
print("Full protein sequence:", sequence)

# Extract atomic coordinates
atom37_positions = []
for residue in chain:
    atom_positions = []
    for atom in residue:
        atom_positions.append(atom.get_coord())
    while len(atom_positions) < 37:
        atom_positions.append([np.nan, np.nan, np.nan])
    atom37_positions.append(atom_positions)
atom37_positions = np.array(atom37_positions)
print("atom37_positions shape: ", atom37_positions.shape)

# Define the 7 active site indices
active_site_indices = [
    np.arange(50, 75),
    np.arange(100, 125),
    np.arange(150, 175),
    np.arange(200, 225),
    np.arange(250, 275),
    np.arange(300, 325),
    np.arange(350, 375)
]

# Function to convert structure to PDB string
from io import StringIO
from Bio.PDB import PDBIO

def structure_to_pdb_string(structure):
    io = PDBIO()
    io.set_structure(structure)
    string_io = StringIO()
    io.save(string_io)
    return string_io.getvalue()

# Iterate over each active site
for i, motif_inds in enumerate(active_site_indices):
    print(f"\nProcessing Active Site {i+1}")
    
    motif_sequence = "".join([sequence[ind] for ind in motif_inds])
    motif_atom37_positions = atom37_positions[motif_inds]
    print(f"Active Site {i+1} Motif sequence: ", motif_sequence)
    print(f"Active Site {i+1} Motif atom37_positions shape: ", motif_atom37_positions.shape)

    # Visualize the motif
    view = py3Dmol.view(width=500, height=500)
    pdb_str = structure_to_pdb_string(structure)
    view.addModel(pdb_str, "pdb")
    view.setStyle({"cartoon": {"color": "lightgrey"}})
    motif_res_inds = (motif_inds + 1).tolist()
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}})
    view.zoomTo()
    view.show()

    # Generate sequence and structure prompts
    prompt_length = 200
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[72:72 + len(motif_sequence)] = list(motif_sequence)
    sequence_prompt = "".join(sequence_prompt)
    print("Sequence prompt: ", sequence_prompt)
    print("Length of sequence prompt: ", len(sequence_prompt))

    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[72:72 + len(motif_atom37_positions)] = torch.tensor(motif_atom37_positions)
    print("Structure prompt shape: ", structure_prompt.shape)
    print("Indices with structure conditioning: ", torch.where(~torch.isnan(structure_prompt).any(dim=-1).all(dim=-1))[0].tolist())

    protein_prompt = ESMProtein(sequence=sequence_prompt, coordinates=structure_prompt)

    # Generate sequence using ESM3
    sequence_generation_config = GenerationConfig(track="sequence", num_steps=sequence_prompt.count("_") // 2, temperature=0.5)
    sequence_generation = model.generate(protein_prompt, sequence_generation_config)
    print("Sequence Prompt:\n\t", protein_prompt.sequence)
    print("Generated sequence:\n\t", sequence_generation.sequence)

    # Predict structure using ESM3
    structure_prediction_config = GenerationConfig(track="structure", num_steps=len(sequence_generation) // 8, temperature=0.7)
    structure_prediction_prompt = ESMProtein(sequence=sequence_generation.sequence)
    structure_prediction = model.generate(structure_prediction_prompt, structure_prediction_config)

    # Convert the generated structure to a ProteinChain object and align it
    structure_prediction_chain = structure_prediction.to_protein_chain()
    motif_inds_in_generation = np.arange(72, 72 + len(motif_sequence))
    structure_prediction_chain.align(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    crmsd = structure_prediction_chain.rmsd(chain, mobile_inds=motif_inds_in_generation, target_inds=motif_inds)
    print(f"cRMSD of the motif in the generated structure vs the original structure for Active Site {i+1}: ", crmsd)

    # Visualize the original and generated structures
    view = py3Dmol.view(width=1000, height=500, viewergrid=(1, 2))
    view.addModel(pdb_str, "pdb", viewer=(0, 0))
    view.addModel(structure_prediction_chain.to_pdb_string(), "pdb", viewer=(0, 1))
    view.setStyle({"cartoon": {"color": "lightgrey"}}, viewer=(0, 0))
    view.setStyle({"cartoon": {"color": "lightgreen"}}, viewer=(0, 1))
    view.addStyle({"resi": motif_res_inds}, {"cartoon": {"color": "cyan"}}, viewer=(0, 0))
    view.addStyle({"resi": (motif_inds_in_generation + 1).tolist()}, {"cartoon": {"color": "cyan"}}, viewer=(0, 1))
    view.zoomTo()
    view.show()

# Clean up: remove the downloaded PDB file
os.remove(pdb_file_path)


Downloading PDB structure '7qlp'...
Full protein sequence: HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
atom37_positions shape:  (478, 37, 3)

Processing Active Site 1
Active Site 1 Motif sequence:  LCGAVLSRIDAGQEQLGRRIHYSQN
Active Site 1 Motif atom37_positions shape:  (25, 37, 3)




Sequence prompt:  ________________________________________________________________________LCGAVLSRIDAGQEQLGRRIHYSQN_______________________________________________________________________________________________________
Length of sequence prompt:  200
Structure prompt shape:  torch.Size([200, 37, 3])
Indices with structure conditioning:  [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96]


NameError: name 'model' is not defined