# Post-Translational Modification of a Protein - PROTOTYPE

This notebook demonstrates Open Force Field's prototype workflow for simulating a post-translationally modified protein.

We'd love to know your thoughts! Please submit feedback as an [issue] on the [`ptm_prototype` repository].

[`ptm_prototype` repository]: https://github.com/openforcefield/ptm_prototype
[issue]: https://github.com/openforcefield/ptm_prototype/issues/new

### Outline

1. Prepare a residue definition for the PTM residue
2. Load a PDB file containing the PTM residue
3. Solvate with the OpenFF PackMOL wrapper
4. Parametrize the solvated system using a combination of the Sage and ff14sb force fields and the NAGL graph charge package
5. Run a short simulation in OpenMM

To achieve a high-quality protein simulation, we apply as many parameters as possible from the SMIRNOFF port of Amber ff14sb. This includes charges, LJ parameters, and valence parameters for the canonical amino acids and NME/ACE caps. As with all SMIRNOFF-format force fields, these are applied via direct chemical perception as specified by SMARTS substructures. For the remaining atoms, LJ and valence parameters are filled in from Sage, while charges come from NAGL. NAGL uses a graph neural network to compute AM1-BCC partial charges for the entire protein, which avoids running a much more expensive quantum chemical calculation on such a large molecule.

We'll perform most of our imports ahead of time, but API points prepared for this prototype will be imported in the cells in which they're used so they stand out. These API points may eventually be shipped in OpenFF packages after we get feedback on them!

In [None]:
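import openmm
import openmm.app  # imported explicitly because openmm.app.DCDReporter is used later
import openmm.unit  # imported explicitly because openmm.unit is used later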
from ipywidgets import Image
from openff.toolkit import ForceField, Molecule, Topology
from openff.units import Quantity
from rdkit.Chem import Draw
from rdkit.Chem.rdChemReactions import ReactionFromSmarts

## Prepare a residue definition for the PTM residue

The new Pablo PDB loader uses a unified [`ResidueDefinition`] dataclass to specify how to load a particular residue. To load a residue, Pablo matches the atom names in the PDB file against the residue definitions registered for that residue name, then takes the detailed chemical information needed to build an OpenFF `Topology` from the matching `ResidueDefinition`. Multiple residue definitions can be provided for a given residue name; if they disagree about how to assign chemical information to a particular residue in a PDB file, an error is raised. Residue definitions are provided to the PDB loader as a mapping from residue names to lists of residue definitions:

```py
residue_database: Mapping[str, list[ResidueDefinition]]
```
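
The matching rule described above can be sketched in plain Python. This is only an illustration of the idea, not Pablo's implementation: `ToyResidueDefinition`, `match_residue`, and the two-atom cysteine side-chain fragment are all hypothetical, and real residue definitions carry far more information (elements, bonds, synonyms, and so on).

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class ToyResidueDefinition:
    """Hypothetical stand-in for a residue definition (not Pablo's class)."""

    description: str
    # Maps each atom name to its formal charge
    formal_charges: Mapping[str, int]


def match_residue(
    residue_name: str,
    pdb_atom_names: set[str],
    database: Mapping[str, list[ToyResidueDefinition]],
) -> ToyResidueDefinition:
    """Pick the definition whose atom names equal those found in the PDB file."""
    matches = [
        resdef
        for resdef in database.get(residue_name, [])
        if set(resdef.formal_charges) == pdb_atom_names
    ]
    if not matches:
        raise ValueError(f"no definition of {residue_name} matches the atom names")
    # Several definitions may match, but only if they assign the same chemistry
    if any(m.formal_charges != matches[0].formal_charges for m in matches[1:]):
        raise ValueError(f"definitions of {residue_name} disagree")
    return matches[0]


# Toy database: two protonation states of a cysteine side-chain fragment
toy_database = {
    "CYS": [
        ToyResidueDefinition("thiol", {"SG": 0, "HG": 0}),
        ToyResidueDefinition("thiolate", {"SG": -1}),
    ],
}

matched = match_residue("CYS", {"SG", "HG"}, toy_database)
print(matched.description)
```

Because the two toy definitions use different sets of atom names, the names found in the file are enough to choose between them; an error is only raised when nothing matches or when matching definitions assign different chemistry.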

The default residue database used by Pablo is `STD_CCD_CACHE`. This object presents the `Mapping` interface, so residues can be looked up just as in a dictionary. Behind the scenes, the cache downloads and caches CIF files from the CCD, processes them into residue definitions, and patches them to improve compatibility with diverse PDB files. For example, we can take a look at the CCD's cysteine definition:

[`ResidueDefinition`]: https://openff-pablo.readthedocs.io/en/latest/api/generated/openff.pablo.ResidueDefinition.html

In [None]:
from openff.pablo import STD_CCD_CACHE

cysteine_resdef = STD_CCD_CACHE["CYS"][0]
cysteine_resdef.visualize()

Residue definitions can be written to and read from OpenFF `Molecule` objects so that they can be visualized and prepared with existing tools. In this depiction, each atom is labeled by its index in the molecule, the possible names it may have in a PDB file, and finally a caret ("^") if it is absent when a bond is formed between this residue and another.

Our post-translationally modified protein contains a cysteine residue that has been labeled with a fluorescein maleimide dye. Labeling occurs in the lab via a synthetic thiol-maleimide "click" reaction that is specific to cysteine residues in proteins. To prepare the residue definition for the PTM residue, we will combine the cysteine and the maleimide using the following reaction SMARTS:

In [None]:
reactants_smarts = [
    "[C:10]-[S:1]-[H:2]",
    "[N:3]1-[C:4](=[O:5])-[C:6](-[H:11])=[C:7](-[H:12])-[C:8](=[O:9])-1",
]
products_smarts = [
    "[N:3]1-[C:4](=[O:5])-[C:6](-[H:2])(-[H:11])-[C@:7](-[S:1]-[C:10])(-[H:12])-[C:8](=[O:9])-1",
]
thiol_maleimide_click_smarts = (
    ".".join(reactants_smarts)
    + ">>"
    + ".".join(products_smarts)
)

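rxn = ReactionFromSmarts(thiol_maleimide_click_smarts)
d2d = Draw.MolDraw2DCairo(800, 300)
# Reuse the parsed reaction rather than parsing the SMARTS a second time
d2d.DrawReaction(rxn, highlightByReactant=True)
# Finalize the drawing so that GetDrawingText() returns the complete PNG
d2d.FinishDrawing()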
Image(value=d2d.GetDrawingText())
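
The atom map numbers (the `N` in `[X:N]`) are what tie each reactant atom to its place in the product. A quick sanity check, sketched here with only the standard library, is that every map number appears exactly once on each side of the reaction. The regular expression is a simplification that assumes simple bracket atoms like the ones in this notebook, and `atom_map_numbers` is a hypothetical helper; a cheminformatics toolkit's own reaction validation would be more thorough.

```python
import re


def atom_map_numbers(smarts: str) -> list[int]:
    """Return every atom map number (the N in `[X:N]`) in a SMARTS string."""
    return [int(n) for n in re.findall(r":(\d+)\]", smarts)]


# Same reactant and product SMARTS as in the cell above
reactants = [
    "[C:10]-[S:1]-[H:2]",
    "[N:3]1-[C:4](=[O:5])-[C:6](-[H:11])=[C:7](-[H:12])-[C:8](=[O:9])-1",
]
products = [
    "[N:3]1-[C:4](=[O:5])-[C:6](-[H:2])(-[H:11])-[C@:7](-[S:1]-[C:10])(-[H:12])-[C:8](=[O:9])-1",
]

reactant_maps = sorted(atom_map_numbers(".".join(reactants)))
product_maps = sorted(atom_map_numbers(".".join(products)))

# Every mapped atom should appear exactly once on each side of the reaction
atoms_conserved = reactant_maps == product_maps
no_duplicates = len(set(reactant_maps)) == len(reactant_maps)
print(atoms_conserved, no_duplicates)
```

Reading the maps also shows what the reaction does: the thiol hydrogen (`:2`) ends up on carbon `:6`, and the sulfur (`:1`) bonds to carbon `:7`, which becomes a stereocenter.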

Next, we'll load the maleimide from an SDF file. It could equally come from any of the usual sources of an OpenFF `Molecule`, including an RDKit molecule object or a SMILES string. We'll also generate atom names. The generated names are suitable if you have not yet prepared the PDB file and control the atom names that will be written to it; to load an existing PDB file, they will have to be changed later to match.

In [None]:
from openff.pablo import ResidueDefinition

maleimide_resdef = ResidueDefinition.anon_from_sdf("maleimide.sdf")
maleimide_resdef.visualize()

Now, we'll perform the reaction with the prototype `ResidueDefinition.react()` method. `react()` takes a list of reactant residue definitions together with the reactant and product SMARTS, and produces a list of the possible outcomes of the reaction, each represented by a list of products. For a single-product reaction that can only happen in a single way given the reactants, this is a single product wrapped in two lists, which is why the result is indexed with `[0][0]`.

In [None]:
dye_resdef = ResidueDefinition.react(
    reactants=[cysteine_resdef, maleimide_resdef],
    reactant_smarts=reactants_smarts,
    product_smarts=products_smarts,
)[0][0]
dye_resdef.visualize()

## Load a PDB file containing the PTM residue

Now, we just load the PDB file using the CCD residue definition cache augmented with our new residue. At the moment, residues are matched on atom names only, so every atom name in the residue must match a synonym from the residue definition. We will soon add support for connectivity-based matching, which will allow this residue to be identified from CONECT records instead.
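
Because matching depends on atom names, it can help to list the names a residue actually uses in the PDB file before loading it. The sketch below relies only on the fixed-column layout of PDB coordinate records (atom name in columns 13-16, residue name in columns 18-20). The helper names, the example atom names, and the `known_synonyms` set are hypothetical and are not part of Pablo.

```python
def toy_pdb_line(record: str, serial: int, atom_name: str, residue_name: str) -> str:
    """Build the first 20 columns of a PDB coordinate record (illustration only)."""
    return f"{record:<6}{serial:>5} {atom_name:<4} {residue_name:>3}"


def atom_names_in_residue(pdb_lines, residue_name: str) -> set[str]:
    """Collect the atom names used by every residue with the given name."""
    names = set()
    for line in pdb_lines:
        is_coordinate_record = line.startswith(("ATOM", "HETATM"))
        if is_coordinate_record and line[17:20].strip() == residue_name:
            names.add(line[12:16].strip())
    return names


# Made-up records; a real file would be read with open(path).readlines()
lines = [
    toy_pdb_line("ATOM", 1, "N", "CYS"),
    toy_pdb_line("ATOM", 2, "SG", "CYS"),
    toy_pdb_line("HETATM", 3, "C1", "DYE"),
    toy_pdb_line("HETATM", 4, "O1", "DYE"),
    "REMARK this line is ignored",
]

dye_names = atom_names_in_residue(lines, "DYE")
known_synonyms = {"C1"}  # hypothetical names covered by a residue definition
unmatched = dye_names - known_synonyms
print(sorted(dye_names), sorted(unmatched))
```

Any name left in `unmatched` would need to be added to the residue definition, or renamed in the file, before name-based loading can succeed.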

In [None]:
from openff.pablo import topology_from_pdb

topology = topology_from_pdb(
    "3ip9_dye.pdb",
    additional_definitions=[dye_resdef],
)

We now have a standard OpenFF `Topology` object, which we can visualize with the familiar methods:

In [None]:
w = topology.visualize()
w.clear_representations()
w.add_cartoon()
w.add_line(opacity=0.5, crossSize=1.0)
w.add_licorice("DYE", radius=0.3)
w.add_unitcell()
w.center("DYE")
w

## Solvate with the OpenFF PackMOL wrapper

Now that we have an OpenFF `Topology` of the post-translationally modified protein, we can solvate it in familiar ways. For example, with the `solvate_topology` function from the experimental Interchange PackMOL wrapper:

In [None]:
from openff.interchange.components._packmol import (
    RHOMBIC_DODECAHEDRON,
    solvate_topology,
)

topology = solvate_topology(
    topology,
    nacl_conc=Quantity(0.1, "mol/L"),
    padding=Quantity(1.2, "nm"),
    box_shape=RHOMBIC_DODECAHEDRON,
)

In [None]:
w = topology.visualize()
w.clear_representations()
w.add_cartoon()
w.add_line(opacity=0.5, crossSize=1.0)
w.add_licorice("DYE", radius=0.3)
w.add_unitcell()
w.center("DYE")
w

Note that this box requires NPT equilibration before production simulation.
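
To get a feel for the numbers involved, the following back-of-the-envelope sketch compares a cubic box with a rhombic dodecahedron of the same periodic image distance, and estimates how many NaCl ion pairs a 0.1 mol/L solution contains in each. It is not how `solvate_topology` computes anything: the 8 nm image distance is made up, the estimate ignores the volume occupied by the protein and any neutralizing ions, and both function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # particles per mole (exact SI value)
LITRES_PER_NM3 = 1e-24  # 1 nm^3 = 1e-27 m^3 = 1e-24 L


def rhombic_dodecahedron_volume(image_distance_nm: float) -> float:
    """Volume of a rhombic dodecahedral cell with the given image distance."""
    return math.sqrt(2) / 2 * image_distance_nm**3


def expected_ion_pairs(conc_mol_per_l: float, volume_nm3: float) -> int:
    """Number of ion pairs at this concentration in this volume, rounded."""
    return round(conc_mol_per_l * AVOGADRO * volume_nm3 * LITRES_PER_NM3)


image_distance = 8.0  # nm; hypothetical, for illustration only
cube_volume = image_distance**3
dodecahedron_volume = rhombic_dodecahedron_volume(image_distance)

print(f"volume ratio: {dodecahedron_volume / cube_volume:.3f}")
print("ion pairs (cube):", expected_ion_pairs(0.1, cube_volume))
print("ion pairs (dodecahedron):", expected_ion_pairs(0.1, dodecahedron_volume))
```

For the same distance between periodic images, the rhombic dodecahedron holds about 71% of the cube's volume, so fewer solvent molecules need to be simulated.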

## Parametrize the solvated system using a combination of the Sage and ff14sb force fields and the NAGL graph charge package

The final new component of the prototype is the "swiss cheese" parametrization method. This refers to applying library charges to the parts of the protein for which they are defined in the ff14sb force field, and "filling in the holes" with NAGL graph charges. It is a streamlined stopgap until a future force field supports both proteins and NAGL charges natively and can be applied directly. Note that this produces a bit of a Frankenstein's monster of a parametrization: while NAGL charges are philosophically compatible with both Sage and Amber force fields, and Sage and Amber force fields are philosophically compatible with each other, the actual quality of the resulting simulations has never been rigorously tested and might hold some surprises! If you perform such testing, please [let us know!]

This'll take a few minutes; graph charges are much faster than quantum chemical methods, but a protein is still a large molecule.

[let us know!]: https://github.com/openforcefield/ptm_prototype/issues/new
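
The "swiss cheese" idea can be illustrated with a few lines of plain Python. This is a conceptual sketch only: `swiss_cheese_charges` is a hypothetical function, the charge values are made up, and the sketch does nothing to keep the total charge an integer, which a real implementation may need to handle.

```python
from typing import Optional, Sequence


def swiss_cheese_charges(
    library_charges: Sequence[Optional[float]],
    graph_charges: Sequence[float],
) -> list[float]:
    """Use the library charge where one exists, else fall back to the graph charge."""
    if len(library_charges) != len(graph_charges):
        raise ValueError("need one entry per atom in both inputs")
    return [
        graph if library is None else library
        for library, graph in zip(library_charges, graph_charges)
    ]


# Made-up per-atom charges; None marks a "hole" with no library charge
library = [-0.4157, 0.2719, None, None]
graph = [-0.40, 0.25, -0.12, 0.08]

merged = swiss_cheese_charges(library, graph)
print(merged)
```

Atoms covered by the library keep their library charge untouched, and only the holes take their values from the graph-based model.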

In [None]:
sage_ff14sb = ForceField("openff_no_water-3.0.0-alpha0.offxml", "opc3.offxml")

interchange = sage_ff14sb.create_interchange(topology)

## Run a short simulation in OpenMM

Now that we have an Interchange, we can prepare simulations in any of the usual output engines. Here we'll use OpenMM. We'll also save a copy of the system to disk so we have an exact record of what we simulated. We're not aiming to tell you how to run a simulation here, just demonstrate what we can do; you'll need much more substantial equilibration to clean up the PackMOL box.

In [None]:
temperature = 300 * openmm.unit.kelvin
pressure = 1 * openmm.unit.bar

timestep = 2 * openmm.unit.femtosecond
friction_coeff = 1 / openmm.unit.picosecond
barostat_frequency = 25

print("making OpenMM simulation ...")
simulation = interchange.to_openmm_simulation(
    integrator=openmm.LangevinMiddleIntegrator(
        temperature,
        friction_coeff,
        timestep,
    ),
    additional_forces=[
        openmm.MonteCarloBarostat(
            pressure,
            temperature,
            barostat_frequency,
        ),
    ],
)

dcd_reporter = openmm.app.DCDReporter("trajectory.dcd", 1000)
simulation.reporters.append(dcd_reporter)

print("serializing OpenMM system ...")
with open("system.xml", "w") as f:
    f.write(openmm.XmlSerializer.serialize(simulation.system))
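
For orientation, the step-based settings in the cell above translate into simulation time as follows. The sketch is plain arithmetic on the same numbers (2 fs timestep, a trajectory frame every 1000 steps, a barostat move attempted every 25 steps). `frames_written` is a hypothetical helper that assumes the run starts at step zero.

```python
timestep_fs = 2  # femtoseconds per step, as in the cell above
report_interval_steps = 1000  # DCD reporter interval
barostat_frequency_steps = 25  # steps between barostat attempts

# Simulation time between consecutive trajectory frames, in picoseconds
frame_spacing_ps = report_interval_steps * timestep_fs / 1000

# Simulation time between Monte Carlo volume moves, in femtoseconds
barostat_spacing_fs = barostat_frequency_steps * timestep_fs


def frames_written(n_steps: int, interval: int = report_interval_steps) -> int:
    """Frames in the trajectory after n_steps, assuming a start at step zero."""
    return n_steps // interval


print(frame_spacing_ps, barostat_spacing_fs, frames_written(250_000))
```

Since the run length below is set in wall-clock time rather than steps, the number of frames in the trajectory depends on how fast your hardware is.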

Minimize the energy:

In [None]:
simulation.context.computeVirtualSites()
simulation.minimizeEnergy()
simulation.context.setVelocitiesToTemperature(simulation.integrator.getTemperature())

Run the simulation for a minute of wall time:

In [None]:
simulation.runForClockTime(1.0 * openmm.unit.minute)

Finally, visualize the resulting trajectory in NGLView!

In [None]:
import mdtraj
import nglview

traj = mdtraj.load(
    "trajectory.dcd",
    top=mdtraj.Topology.from_openmm(simulation.topology),
)

widget = nglview.show_mdtraj(traj)

widget.clear_representations()
widget.add_cartoon()
widget.add_line(opacity=0.5, crossSize=1.0)
widget.add_licorice("DYE", radius=0.3)
widget.center("DYE")

widget