# Preparing Structures Using OpenMM

In the previous notebook, we demonstrated how to load and visualize a PDB file, either from a local source or directly from the RCSB Protein Data Bank. Most crystal structures, however, lack hydrogen atoms and solvents, which are critical for accurate molecular dynamics (MD) simulations. In this section, we'll address these missing components to prepare our structure for MD simulation.

It's worth noting that the presence of non-standard residues or ligands can complicate the application of force fields, as classical MD relies on predefined parameters (bonds, angles, etc.) for simulation. To simplify our example, we'll remove ligands and focus solely on the protein.

Software tools like Antechamber and pdb4amber can help correct structures so that they are suitably prepared for MD simulations.

# OpenMM
OpenMM is a versatile molecular dynamics package capable of simulating condensed phase chemistry using a variety of force fields. Its main advantages are its ease of installation and its user-friendly Python interface, which we'll leverage in the following steps.

For a deeper understanding of OpenMM and its capabilities, refer to the user guide here: http://docs.openmm.org/latest/userguide/introduction.html

# Molecular dynamics

Molecular dynamics (MD) is a simulation method used to model the time-dependent behavior of atoms and molecules. In the realm of protein science, this technique is most commonly realized by using pre-tabulated forces that describe familiar chemical interactions, such as bonds, angles, dihedrals, and non-bonded interactions. These forces are usually derived from high-level simulations and then fitted into a classical mechanics framework, often representing bonds as a harmonic potential (akin to a spring potential).

This choice of bonding description is advantageous because its simplistic functional form allows for rapid computation. However, it typically lacks the ability to accurately represent the dissociation of molecular bonds, meaning that the connectivity at the start of the simulation is the same at the end. There exist several ways to enhance this model, such as ReaxFF, but they fall beyond the scope of this workbook.
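To make the limitation concrete, here is a minimal sketch comparing a harmonic bond with a Morse potential. The Morse form is not used in this workbook; it is included only as a familiar example of a potential that does level off as a bond stretches. All parameter values are hypothetical and chosen purely for illustration.

```python
import math


def harmonic_bond_energy(r, r0, k):
    """Harmonic (spring) bond energy: E = 0.5 * k * (r - r0)**2."""
    return 0.5 * k * (r - r0) ** 2


def morse_bond_energy(r, r0, depth, a):
    """Morse bond energy, which approaches `depth` as the bond stretches."""
    return depth * (1.0 - math.exp(-a * (r - r0))) ** 2


# Hypothetical, illustrative parameters (not taken from any force field)
R0 = 0.1          # equilibrium bond length in nm
K = 250000.0      # harmonic force constant in kJ/mol/nm^2
DEPTH = 400.0     # Morse well depth in kJ/mol
# Choose the Morse width so both curves have the same curvature at R0
A = math.sqrt(K / (2.0 * DEPTH))

for r in (0.10, 0.12, 0.20, 0.50):
    e_harm = harmonic_bond_energy(r, R0, K)
    e_morse = morse_bond_energy(r, R0, DEPTH, A)
    print(f"r = {r:.2f} nm  harmonic = {e_harm:10.1f}  morse = {e_morse:7.1f}")
```

Near the equilibrium length the two curves agree closely, but the harmonic energy keeps growing without bound as the bond is stretched, so the bond can never break.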

Proteins, RNA, and DNA are particularly well-suited to this approach as they are typically composed of repeating molecular units, allowing for a robust construction and tabulation of force field parameters. However, the inherent conformational flexibility of these biomolecules can be a drawback, as they can adopt a multitude of conformations that are necessary to describe their activities.

One area where MD proves invaluable is in cases where a crystal structure doesn't illuminate a potential binding pocket. This is common in cryptic pockets, where the ground state structure doesn't represent the active form of the protein. In such cases, to investigate the protein more closely, it must be thermalized so that conformational changes can reveal potential target pockets.



### Prepping the protein for dynamics

The cell below takes a PDB file and tries to clean it so that it is ready for simulation.

The code is broken down as follows:

1. **Import required libraries**: The code begins by importing the necessary libraries and modules, such as OpenMM for running Molecular Dynamics simulations and PDBFixer for fixing issues in the PDB file.
2. **Define input and output PDB files**: Here, the input PDB file (pdb_start) is specified, along with the cleaned output file (pdb_out).
3. **Clean PDB records using pdb4amber**: The pdb4amber utility is used to clean up records in the PDB file for compatibility with the Amber force field. The resulting cleaned PDB file is saved as 'pdb_out'.
4. **Add hydrogens using Reduce**: The Reduce program is used to add hydrogens to the cleaned PDB file according to Amber's preferences. The PDB file with added hydrogens is again saved as 'pdb_out'.
5. **Fix structural issues with PDBFixer**: OpenMM's PDBFixer is used to address any remaining issues in the PDB file, such as missing residues, nonstandard residues, and heterogens. Missing atoms and hydrogens are added based on a specified pH of 7.0. The final PDB file is saved as 'pdb_out'.

The result is a cleaned and fixed PDB file that is suitable for Molecular Dynamics simulations. The code ensures that the protein structure is compatible with the chosen force field, and that any missing or problematic atoms, residues, or heterogens have been addressed.
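Because the preparation steps shell out to external programs, it can help to confirm that they are installed before running the cell. The following is a small sketch that uses only the Python standard library; the helper name `find_missing_tools` is hypothetical and not part of any of the packages used here.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


# The two command-line programs called by the preparation cell
missing = find_missing_tools(["pdb4amber", "reduce"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools were found.")
```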

## This will produce a lot of output!!


In [2]:
from openmm.app import * 
from openmm import *
from openmm.unit import *
from openmm.openmm import *
from pdbfixer import PDBFixer
import subprocess

# PDB file that we will use as a starting structure
pdb_start = "assets/cookbook/7pav.pdb"

# PDB file that we will use as the cleaned output structure
pdb_out = 'assets/cookbook/cleaned_output.pdb'

# Use pdb4amber to clean up records for use with the Amber force field
# (--nohyd strips hydrogens, --dry strips waters)
out = subprocess.check_output(["pdb4amber", "--nohyd", "--dry", pdb_start])
with open(pdb_out, 'wb') as f:
    f.write(out)

# Use reduce to add hydrogens according to Amber's preferences
try:
    out = subprocess.check_output(["reduce", "-build", "-nuclear", pdb_out], stderr=subprocess.PIPE)
    # Save the hydrogenated structure back to pdb_out
    with open(pdb_out, 'wb') as f:
        f.write(out)
except subprocess.CalledProcessError as e:
    print("Error message from reduce:", e.stderr.decode())

# Use OpenMM's PDBFixer to fix some final issues that can crop up
fixed_pdb = PDBFixer(filename=pdb_out)
fixed_pdb.findMissingResidues()
fixed_pdb.findNonstandardResidues()
#fixed_pdb.replaceNonstandardResidues()
fixed_pdb.removeHeterogens(True) # comment out to run with ligand
fixed_pdb.findMissingAtoms()
fixed_pdb.addMissingAtoms()
fixed_pdb.addMissingHydrogens(7.0)
with open(pdb_out, 'w') as f:
    PDBFile.writeFile(fixed_pdb.topology, fixed_pdb.positions, f)



Summary of pdb4amber for: assets/cookbook/pdbs/7oun.pdb

----------Chains
The following (original) chains have been found:
A
B

---------- Alternate Locations (Original Residues!))

The following residues had alternate locations:
CYS_40
TYR_56
VAL_68
ASN_96
CYS_114
-----------Non-standard-resnames

Traceback (most recent call last):
  File "/opt/conda/bin/pdb4amber", line 33, in <module>
    sys.exit(load_entry_point('pdb4amber==22.0', 'console_scripts', 'pdb4amber')())
  File "/opt/conda/lib/python3.9/site-packages/pdb4amber/pdb4amber.py", line 819, in main
    run(
  File "/opt/conda/lib/python3.9/site-packages/pdb4amber/pdb4amber.py", line 579, in run
    gaplist = pdbfixer.find_gaps()
  File "/opt/conda/lib/python3.9/site-packages/pdb4amber/pdb4amber.py", line 208, in find_gaps
    N_atom = parm.atoms[N_atoms[i + 1]]
IndexError: list index out of range


CalledProcessError: Command '['pdb4amber', '--nohyd', '--dry', 'assets/cookbook/pdbs/7oun.pdb']' returned non-zero exit status 1.

# Viewing the prepared protein

The cell below renders the protein without the ligand present.

In [3]:
import nglview as nv
view = nv.show_structure_file("assets/cookbook/cleaned_output.pdb")
view



NGLWidget()

In [1]:
!python -m openmm.testInstallation



OpenMM Version: 8.0
Git Revision: a7800059645f4471f4b91c21e742fe5aa4513cda

There are 3 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Successfully computed forces

Median difference in forces between platforms:

Reference vs. CPU: 6.30666e-06
Reference vs. CUDA: 6.7397e-06
CPU vs. CUDA: 6.97494e-07

All differences are within tolerance.


## Setting up the forcefield and simulation for protein dynamics


After preparing a unified set of coordinates, a corresponding topology, and a suitable force field to describe the physical interactions between the atoms and molecules, the next step is to define the physical conditions for the molecular dynamics (MD) simulation.

The parameters below were chosen for speed; at a minimum, you'll likely want to increase the non-bonded cutoff and the number of timesteps.

* Integrator: Langevin (LangevinMiddleIntegrator)
* Nonbonded cutoff distance: 1 nanometer
* Friction coefficient: 1 per picosecond
* Temperature: 300 kelvin
* Timestep: 0.004 picoseconds
* Number of timesteps: 3000 steps
* Total simulation time: 12 picoseconds
* Number of steps between state reports: 100 steps
* Simulated time between state reports: 0.4 picoseconds

The chosen integrator, LangevinMiddleIntegrator, applies Langevin dynamics, in which friction and random-force terms couple the system to a heat bath and so act as a thermostat.

The nonbonded cutoff distance is set to 1 nanometer, the distance beyond which nonbonded interactions between atoms are ignored. Note that the code below passes nonbondedMethod=NoCutoff, so this value only takes effect if a cutoff-based method such as CutoffNonPeriodic or CutoffPeriodic is selected instead. The friction coefficient, specified as 1/picosecond, sets the coupling strength between the system and the heat bath.

The simulation temperature is set at 300 kelvin, and each simulation step advances the system by 0.004 picoseconds. The simulation will run for a total of 3000 steps, giving a simulated time of 12 picoseconds.
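The arithmetic behind these numbers can be checked with a short sketch. The helper name `simulation_summary` is hypothetical; the inputs are the values passed to the simulation cell (3000 steps, a 0.004 ps timestep, state reports every 100 steps, and trajectory frames every 10 steps).

```python
def simulation_summary(n_steps, timestep_ps, report_interval, traj_interval):
    """Derive simulated time and output counts from the run settings."""
    return {
        "total_time_ps": n_steps * timestep_ps,
        "time_between_reports_ps": report_interval * timestep_ps,
        "n_state_reports": n_steps // report_interval,
        "n_trajectory_frames": n_steps // traj_interval,
    }


summary = simulation_summary(
    n_steps=3000, timestep_ps=0.004, report_interval=100, traj_interval=10
)
for name, value in summary.items():
    print(f"{name}: {value:g}")
```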

To monitor the simulation progress and capture the system's evolving state, three reporters are added to the simulation instance:

* DCDReporter: records the trajectory, i.e., the atomic positions at each saved time point, into a DCD file. In this instance, a frame is saved every 10 steps.
* XTCReporter (from MDTraj): records the same trajectory into an XTC file, also every 10 steps.
* StateDataReporter: prints the simulation step number, potential energy, and temperature every 100 steps to the standard output (your console).

The simulation is then set in motion with the simulation.step(3000) command. Depending on the complexity of the molecular system and the computational resources available, the simulation may take some time to complete.

For exploratory purposes and speed, the parameters chosen here are somewhat minimal. In a more rigorous study, one might want to consider a longer simulation time, a smaller timestep, and more frequent output for detailed analysis.

Note: While the DCD format is widely used to store molecular dynamics trajectories, the XTC format, native to the GROMACS simulation package, offers a significant advantage in terms of file size. XTC files typically have a smaller footprint than their DCD counterparts due to the more efficient compression scheme employed in the XTC format. This makes XTC files particularly well-suited for storing long trajectories that can span large timescales, helping to manage disk space usage more efficiently.
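Once both trajectory files have been written, the size difference can be measured directly. This is a minimal sketch that uses only the standard library; the helper name `size_ratio` is hypothetical, and the paths are the output locations used by the simulation cell in this notebook.

```python
import os


def size_ratio(path_a, path_b):
    """Return the size of path_a divided by the size of path_b (in bytes)."""
    return os.path.getsize(path_a) / os.path.getsize(path_b)


dcd_path = "assets/cookbook/first_output.dcd"
xtc_path = "assets/cookbook/first_output.xtc"

# Only compare when both files exist, i.e. after the simulation has run
if os.path.exists(dcd_path) and os.path.exists(xtc_path):
    print(f"DCD is {size_ratio(dcd_path, xtc_path):.1f}x the size of XTC")
else:
    print("Run the simulation first to produce both trajectory files.")
```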

In [3]:
from sys import stdout
from openmm.app import ForceField
from mdtraj.reporters import XTCReporter


pdbfile = PDBFile("assets/cookbook/cleaned_output.pdb")
modeller = Modeller(pdbfile.topology, pdbfile.positions)
#forcefield = generate_forcefield(pdbfile)

forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')


# Uncomment the line below to use GPU acceleration
# (then also pass platform=platform when creating the Simulation)
#platform = Platform.getPlatformByName('CUDA')

# Set up the chemical system
# Note: with nonbondedMethod=NoCutoff, nonbondedCutoff is not applied
system = forcefield.createSystem(modeller.topology, nonbondedMethod=NoCutoff,
        nonbondedCutoff=1*nanometer, constraints=HBonds)

# Integrator settings: temperature, friction coefficient, and timestep
integrator = LangevinMiddleIntegrator(300*kelvin, 1/picosecond, 0.004*picoseconds)

# Collect everything together to make a simulation instance
simulation = Simulation(modeller.topology, system, integrator)

# Set starting positions
simulation.context.setPositions(modeller.positions)
simulation.minimizeEnergy()

# File locations to save output and how often to save
simulation.reporters.append(DCDReporter('assets/cookbook/first_output.dcd', 10))
simulation.reporters.append(XTCReporter('assets/cookbook/first_output.xtc', 10))


# Report the physical properties
simulation.reporters.append(StateDataReporter(stdout, 100, step=True,
        potentialEnergy=True, temperature=True))

# Number of steps to run
simulation.step(3000)

#"Step","Potential Energy (kJ/mole)","Temperature (K)"
100,-66195.91466961296,103.43620645819622
200,-58351.3835313319,171.4416356236536
300,-53507.00704744891,216.58331634592045
400,-50695.20875445424,246.48338548072894
500,-48219.45319869359,262.2715299365983
600,-47485.49734134712,278.0932186436287
700,-46570.62959330164,287.09594307364256
800,-46135.34883567411,295.9231184810379
900,-46016.1057402907,297.4766041447673
1000,-45686.53074091987,297.05753747597356
1100,-45824.16347793475,302.19764166825377
1200,-45949.522423677816,303.07563916235375
1300,-45811.015381954974,302.1533137376454
1400,-45964.567375071216,305.4288919104988
1500,-45915.188858292735,304.35700566673665
1600,-46409.45471201939,301.2026888360696
1700,-46280.75085425019,300.99990721542287
1800,-45814.33068081018,300.7316689891411
1900,-45966.6910447492,306.7858282673234
2000,-46610.80426938769,304.9087845274102
2100,-45923.685613718764,303.42564162349703
2200,-46285.29044090149,302.40355295231524
2300,-46791.91527

# Visualising the output

As you can see, every 100 steps a report is produced detailing the state of the system. This reporting frequency is determined by our settings for the StateDataReporter class. Reporting more often gives finer time resolution for subsequent analysis, but it also slows the simulation because of the extra file I/O. This is independent of the trajectory reporters (the DCD/XTC output), which are set to save a frame every 10 steps; saving frames more often results in larger trajectory files.

In the context of Nanome, it may be beneficial to inspect the simulation every 100 timesteps under the conditions we've used here. The playback speed can then be adjusted as needed in VR. The user experience becomes a factor to consider, particularly in terms of how much the atoms move and the timestep interval.

It's also important to note that the number of steps loaded in VR significantly impacts the load time. Therefore, using PCVR is recommended when loading molecular dynamics data for an optimal user experience.

The output data is presented in the following format: Step number, Potential Energy (kJ/mol), Temperature (K).

In the output above, the system takes roughly 1000 steps to reach the target temperature of 300 K. Beyond this point, the thermostat continues to interact with the system, exchanging energy with it to maintain the target temperature.
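The time needed to reach the target temperature can also be read off programmatically. The sketch below parses comma-separated text in the StateDataReporter format and returns the first step whose temperature falls within a tolerance of the target. The helper name `first_step_near_target` and the 5 K tolerance are assumptions made for illustration, and the sample rows are copied from the output shown earlier.

```python
import csv
import io

# A few rows copied from the reporter output above
SAMPLE = '''#"Step","Potential Energy (kJ/mole)","Temperature (K)"
100,-66195.91466961296,103.43620645819622
500,-48219.45319869359,262.2715299365983
800,-46135.34883567411,295.9231184810379
900,-46016.1057402907,297.4766041447673
1000,-45686.53074091987,297.05753747597356
'''


def first_step_near_target(text, target_k=300.0, tolerance_k=5.0):
    """Return the first step whose temperature is within tolerance_k of target_k."""
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines and the commented header line
        if not row or row[0].startswith("#"):
            continue
        step, temperature = int(row[0]), float(row[2])
        if abs(temperature - target_k) <= tolerance_k:
            return step
    return None


print(first_step_near_target(SAMPLE))
```

Tightening the tolerance moves the answer later in the run, so the value reported depends on how "reached" is defined.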

In [4]:
traj = nv.SimpletrajTrajectory("assets/cookbook/first_output.dcd", "assets/cookbook/cleaned_output.pdb")
print(f"Trajectory has {traj.n_frames} frames")
viewtraj = nv.show_simpletraj(traj)
viewtraj.add_unitcell()
viewtraj 

Trajectory has 300 frames


NGLWidget(max_frame=299)

# Periodic boundary conditions

The previous cell illustrates how the simulation coordinate system can intersect with the edge of the periodic boundary conditions of the system. The first frame displays the geometry derived from the initial PDB, but during the dynamics, some parts may cross over the boundary of the simulation box. This is merely a visualization artifact of the periodic system - the underlying simulation remains accurate as the forces are correctly applied across the periodic boundary.

Such visual artifacts can sometimes create the impression of a split protein, which may be disorienting or confusing.

There are a few methods to correct these visual anomalies. The easiest approach often involves using VMD's tools to modify the periodic boundary conditions with the 'wrap'/'unwrap' command. However, we can also address this issue programmatically within this notebook, as demonstrated in the following section.
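To illustrate what wrapping does to the coordinates, here is a minimal NumPy sketch. It assumes an orthorhombic (rectangular) box with its origin at zero, which is a simplification; the function names and the example values are hypothetical and are not part of MDAnalysis or VMD.

```python
import numpy as np


def wrap_into_box(positions, box_lengths):
    """Map each coordinate into the range [0, L) along every box axis."""
    positions = np.asarray(positions, dtype=float)
    box_lengths = np.asarray(box_lengths, dtype=float)
    return positions - np.floor(positions / box_lengths) * box_lengths


def minimum_image(delta, box_lengths):
    """Return the shortest periodic image of a displacement vector."""
    delta = np.asarray(delta, dtype=float)
    box_lengths = np.asarray(box_lengths, dtype=float)
    return delta - np.round(delta / box_lengths) * box_lengths


box = [5.0, 5.0, 5.0]            # box edge lengths (illustrative)
coords = [[5.5, -0.5, 2.0]]      # one atom that has drifted outside the box

print(wrap_into_box(coords, box))
print(minimum_image([4.5, 0.0, 0.0], box))
```

Wrapping changes only which periodic image of an atom is displayed. Distances computed with the minimum-image convention are unaffected, which is why the split appearance is purely visual.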

In [15]:
from MDAnalysis import transformations
import MDAnalysis as mda
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Build the Universe "u" from the OpenMM topology and the DCD trajectory
u = mda.Universe(pdbfile.topology, "assets/cookbook/first_output.dcd")

# Select the alpha carbons (CA) of the protein
prot = u.select_atoms("protein and name CA")
# Select the beta carbons (CB) of the protein
protb = u.select_atoms("protein and name CB")

# Atoms of the second residue (index 1), used as the centering reference
ag = u.residues[1].atoms

# Unwrap the reference, center it in the box by mass, then wrap the selection
workflow = (transformations.unwrap(ag),
            transformations.center_in_box(ag, center='mass'),
            transformations.wrap(prot, compound='fragments'))
u.trajectory.add_transformations(*workflow)

   
view = nv.show_mdanalysis(u)
view.add_unitcell()
view

NGLWidget(max_frame=299)

# Exporting the new trajectory for Nanome

In [10]:
from MDAnalysis import Writer

u = mda.Universe(pdbfile.topology, "assets/cookbook/first_output.dcd")
# Output only the residues tagged as protein
# protein = u.select_atoms("protein")
# Writer expects the number of atoms as its second argument
with Writer("assets/cookbook/first_output.xtc", u.atoms.n_atoms) as W:
    for ts in u.trajectory:
        W.write(u)

In [16]:
traj = nv.SimpletrajTrajectory("assets/cookbook/first_output.xtc", "assets/cookbook/cleaned_output.pdb")
print(f"Trajectory has {traj.n_frames} frames")
viewtraj = nv.show_simpletraj(traj)
viewtraj.add_unitcell()
viewtraj

Trajectory has 300 frames


NGLWidget(max_frame=299)