# Protein in Water

You can run this notebook in your browser: 

[![Open On Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openmm/openmm_workshop_july2023/blob/main/section_1/protein_in_water.ipynb)


This notebook is a good place to start if you have not used OpenMM before. It covers one of the most basic procedures in molecular simulation, which includes:

1. Loading a protein structure from a PDB file
1. Solvating the protein in water
1. Equilibrating the system
1. Running NVT or NPT production simulations

Additionally, this notebook demonstrates how to set up simulation checkpoints, enabling you to restart simulations and perform long production runs on HPC clusters.

## Table of Contents

- Setup conda environment
- Download the protein file
- Load a PDB file into OpenMM
- Define the force field
- Solvate the protein with water and ions
- Setup system and integrator
- Run local minimization
- Setup reporting
- Run NVT equilibration
- Run NPT production molecular dynamics
- Basic analysis
- How to use checkpoints
- Visualization


The exercises are announced by the block

<div class="alert alert-block alert-info">
    ℹ️ <b>Exercise:</b> Description of the exercise.
</div>

and followed by an incomplete cell. Missing parts are indicated by:

```python
FIXME
```

which will throw an error when the cell is run.

## Setup
<a id="setup"></a>

### OpenMM on a local machine

- If you want to run OpenMM on your own machine, please take a look at the [setup instructions](../setup/conda_setup.md). 


### OpenMM on Colab
- If you are using Colab, you can run the cell below to install `mamba` in the Colab environment.

    <div class="alert alert-block alert-info">
    ⚠️ <b>First try and change runtime type to GPU</b>

    You can change to a GPU instance on Colab by clicking `runtime`→`change runtime type` and selecting `T4 GPU` from the `Hardware accelerator` dropdown menu. OpenMM also runs on CPUs, but simulations will be slower.
    </div>

- Remember that you can replace `mamba` with `conda` if you have not installed `mamba`.

- The first time you run a notebook in Colab, you will get a warning like this:

    ```
    Warning: This notebook was not authored by Google
    This notebook is being loaded from GitHub.
    ```

    You can ignore it and click the `run anyway` button.

In [None]:
# Execute this cell to install mamba in the Colab environment and then install openmm

if 'google.colab' in str(get_ipython()):
    print('Running on colab')
    !pip install -q condacolab
    import condacolab
    condacolab.install_mambaforge()
else:
    print('Not running on colab.')
    print('Make sure you create and activate a new conda environment!')

**Notes:** 
- During this step on Colab the kernel will be restarted. This will produce the error message:
"Your session crashed for an unknown reason. " This is normal and can be safely ignored.
- Installing the necessary packages may take several minutes.

### Install OpenMM

Now we can install `openmm` from the `conda-forge` repository:

In [None]:
!mamba install -y -c conda-forge openmm

Test the installation:

In [None]:
!python -m openmm.testInstallation

## Download the protein structure file
<a id="download"></a>

We will download the file from the workshop GitHub repository.

In [None]:
!wget https://raw.githubusercontent.com/openmm/openmm_workshop_july2023/main/section_1/villin.pdb

The protein is the villin headpiece. This is a small, fast-folding protein commonly used as a toy system. Note that this PDB file has been cleaned up and is ready for use in OpenMM. If you try to use a PDB file directly from the Protein Data Bank, you may encounter errors. In that case, please look at the [OpenMM FAQs](https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions) and [PDBfixer](https://github.com/openmm/pdbfixer).

![villin](./images/villin.png)
**Figure 1:** Villin headpiece protein.

## Load the PDB File into OpenMM
<a id="load"></a>

First we need to import OpenMM.
We then load the PDB file using the [PDBFile](http://docs.openmm.org/latest/api-python/generated/openmm.app.pdbfile.PDBFile.html#openmm.app.pdbfile.PDBFile) class.

In [None]:
from openmm.app import *
from openmm import *
from openmm.unit import *
from sys import stdout

# Load the pdb file
pdb = PDBFile('villin.pdb')

`PDBFile('file_name.pdb')` loads the PDB file from disk and puts the information into a `PDBFile` object, which we have assigned to the variable `pdb`. The object contains the molecular topology (atom names, residue types, bonds, etc.) and the atomic positions. These can be accessed as `pdb.topology` and `pdb.positions`. Take a look at the [API documentation](http://docs.openmm.org/latest/api-python/generated/openmm.app.pdbfile.PDBFile.html#openmm.app.pdbfile.PDBFile). All OpenMM classes have documentation available on the Python API reference: http://docs.openmm.org/latest/api-python/.

## Define the Force Field
<a id="ff"></a>

We need to define the force field we want to use. In this example, we will use the Amber14 forcefield and the TIP3P-FB water model. You can explore all the forcefields available by default in OpenMM in the [documentation](http://docs.openmm.org/latest/userguide/application/02_running_sims.html?highlight=forcefield#force-fields).

In [None]:
# Specify the forcefield
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

Force fields in OpenMM are defined by XML files. The line above loads the specified files. You can look at them in the OpenMM source code (e.g. [`amber14/tip3pfb.xml`](https://github.com/openmm/openmm/blob/master/wrappers/python/openmm/app/data/amber14/tip3pfb.xml)). It is also possible to create your own XML force field file. You can find details in the [user guide](http://docs.openmm.org/latest/userguide/application/05_creating_ffs.html#creating-force-fields).

## Solvate
<a id="solvate"></a>

We can use the [`Modeller`](http://docs.openmm.org/latest/userguide/application/03_model_building_editing.html#model-building-and-editing) class to solvate the protein in a waterbox. 

In [None]:
# Create a Modeller object
modeller = Modeller(pdb.topology, pdb.positions)

# Solvate the protein in a box of water
modeller.addSolvent(forcefield, padding=1.0*nanometer)

This command creates a box that has edges at least 1 nm away from the solute and fills it with water molecules. Additionally, it adds the required number of Cl- and Na+ ions to neutralize the system's charge. Optionally, you can specify the ion concentration as an argument to [`addSolvent`](http://docs.openmm.org/latest/api-python/generated/openmm.app.modeller.Modeller.html#openmm.app.modeller.Modeller.addSolvent). 
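
As a rough sanity check on what a given salt concentration means for a box of this kind, the sketch below estimates the number of ion pairs from the number of water molecules. It assumes pure water is about 55.4 mol/L. The helper name `estimate_ion_pairs` and the water count are illustrative and not part of OpenMM, and OpenMM's own ion placement may count slightly differently. In `addSolvent`, the concentration is passed through the `ionicStrength` argument (for example `ionicStrength=0.15*molar`).

```python
def estimate_ion_pairs(n_waters: int, conc_molar: float, water_molar: float = 55.4) -> int:
    """Estimate the salt ion pairs needed to reach conc_molar among n_waters molecules."""
    # Pure water holds roughly 55.4 mol of molecules per litre, so the ratio of
    # ion pairs to water molecules is approximately conc_molar / water_molar.
    return round(n_waters * conc_molar / water_molar)

# Hypothetical box with 3000 water molecules at 0.15 mol/L salt
print(estimate_ion_pairs(3000, 0.15))
```

For a small solvated protein, this suggests that only a handful of ion pairs are added on top of the neutralizing ions.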

Note that the `nanometer` variable is a unit definition that was imported from `openmm.unit`. This is part of the powerful units tracking and automatic conversion system built into the OpenMM Python API, which makes working with unit-bearing quantities both convenient and less error-prone. For example, we could have equivalently specified `10*angstrom` instead of `1*nanometer` to achieve the same result. You can read more about the units library [here](http://docs.openmm.org/latest/userguide/library/05_languages_not_cpp.html#units-and-dimensional-analysis).


## Setup System and Integrator
<a id='system'></a>

We now need to perform the following steps to create a simulation:

1. Combine the molecular topology and force field to create a complete system description using the [`ForceField`](http://docs.openmm.org/latest/api-python/generated/openmm.app.forcefield.ForceField.html#forcefield) object's [`createSystem()`](http://docs.openmm.org/latest/api-python/generated/openmm.app.forcefield.ForceField.html#openmm.app.forcefield.ForceField.createSystem) method.
1. Create an integrator to control the simulation dynamics.
1. Combine the integrator and system to create the `Simulation` object.
1. Set the initial atomic positions for the system.

In [None]:
# Create a system. Here we define some forcefield settings such as the nonbonded method
system = forcefield.createSystem(modeller.topology, nonbondedMethod=PME, nonbondedCutoff=1.0*nanometer, constraints=HBonds)

# Define the integrator. The Langevin integrator is also a thermostat
integrator = LangevinMiddleIntegrator(300*kelvin, 1/picosecond, 0.004*picoseconds)

# Create the Simulation
simulation = Simulation(modeller.topology, system, integrator)
simulation.context.setPositions(modeller.positions)

The [System](http://docs.openmm.org/latest/api-python/generated/openmm.openmm.System.html#system) is an important object in OpenMM that contains the complete mathematical description of the system we want to simulate. Specifically, it contains four key pieces of information:
 - The set of particles in the simulation
 - The forces acting on them
 - Details of any constraints
 - The dimensions of the periodic box

The `integrator` propagates the equations of motion. There are a variety of [integrators available in OpenMM](http://docs.openmm.org/latest/api-python/library.html#integrators). We are using the [LangevinMiddleIntegrator](http://docs.openmm.org/latest/api-python/generated/openmm.openmm.LangevinMiddleIntegrator.html#openmm.openmm.LangevinMiddleIntegrator), which performs Langevin dynamics.

The [`Simulation`](http://docs.openmm.org/latest/api-python/generated/openmm.app.simulation.Simulation.html#openmm.app.simulation.Simulation) object manages all the processes involved in running a simulation, such as advancing time and writing output. 


## Local Energy Minimization
<a id="minim"></a>

It is a good idea to run a local energy minimization at the start of a simulation. The initial coordinates of the system might be far from an energetically stable state, leading to very large forces that could cause the simulation to crash.
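
To get a feel for how large these forces can become, here is a small sketch that evaluates the Lennard-Jones force between two atoms at an overlapping distance and at a more typical contact distance. The parameter values (`sigma_nm=0.315`, `epsilon_kj=0.636`) are illustrative, roughly in the range used for a water oxygen, and the function `lj_force` is a hypothetical helper, not an OpenMM call.

```python
def lj_force(r_nm: float, sigma_nm: float = 0.315, epsilon_kj: float = 0.636) -> float:
    """Lennard-Jones force in kJ/mol/nm at separation r_nm; positive means repulsive."""
    sr6 = (sigma_nm / r_nm) ** 6
    # F = -dU/dr for U = 4*eps*((sigma/r)**12 - (sigma/r)**6)
    return 24.0 * epsilon_kj / r_nm * (2.0 * sr6 * sr6 - sr6)

overlapping = lj_force(0.20)  # two atoms placed too close together
relaxed = lj_force(0.35)      # a more typical contact distance
print(f"force at 0.20 nm: {overlapping:.0f} kJ/mol/nm")
print(f"force at 0.35 nm: {relaxed:.2f} kJ/mol/nm")
```

Because the repulsive term scales as the inverse 13th power of the distance, a small overlap in the starting structure produces forces several orders of magnitude larger than normal, which is what minimization removes before dynamics begin.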

Because of how the minimizer is implemented, no progress information will be printed during the run. You’ll need to be patient and wait for it to finish. This minimization step typically takes around 1 minute on a CPU and only a few seconds on a GPU.

In [None]:
print("Minimizing energy")
simulation.minimizeEnergy()

## Setup Reporting
<a id="reporting"></a>

To obtain output from a simulation, you need to add "reporters" to your `Simulation` object. In OpenMM, we commonly use two types of reporters:

- **`DCDReporter`**: This reporter writes the coordinates of the simulation system to a file at specified intervals. For instance, you can configure `DCDReporter` to save the coordinates every 1000 timesteps to a file named `traj.dcd`. You can find more information about it in the [DCDReporter documentation](http://docs.openmm.org/latest/api-python/generated/openmm.app.dcdreporter.DCDReporter.html).

- **`StateDataReporter`**: This reporter outputs important simulation data, such as timestep, potential energy, temperature, and volume. It prints this data to the screen and also writes it to a file called `md_log.txt`. More details are available in the [StateDataReporter documentation](http://docs.openmm.org/development/api-python/generated/openmm.app.statedatareporter.StateDataReporter.html).

To add these reporters to your simulation, you need to modify the `simulation.reporters` list by appending the desired reporter objects.

In [None]:
# Write trajectory to a file called traj.dcd every 1000 steps
simulation.reporters.append(DCDReporter('traj.dcd', 1000))

# Print state information to the screen every 1000 steps
simulation.reporters.append(StateDataReporter(stdout, 1000, step=True,
        potentialEnergy=True, temperature=True, volume=True))

# Print the same info to a log file every 100 steps
simulation.reporters.append(StateDataReporter('md_log.txt', 100, step=True,
        potentialEnergy=True, temperature=True, volume=True))


## NVT Equilibration
<a id=nvt></a>

We are using a Langevin integrator, which means we are simulating in the NVT ensemble. To equilibrate the temperature, we just need to run the simulation for a given number of timesteps.
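
It can help to translate a step count into simulated time and into the amount of output you should expect. The sketch below does this arithmetic for the settings used in this notebook. The helper `run_summary` is hypothetical, and it assumes the step counter starts at a multiple of each reporting interval.

```python
def run_summary(n_steps: int, timestep_fs: int, traj_interval: int, log_interval: int) -> dict:
    """Return the simulated time and the number of reports for a run of n_steps."""
    return {
        "simulated_ps": n_steps * timestep_fs / 1000,  # 1 ps = 1000 fs
        "traj_frames": n_steps // traj_interval,       # one frame per interval
        "log_rows": n_steps // log_interval,           # one log line per interval
    }

# Values used in this notebook: 10000 steps, 4 fs timestep,
# trajectory every 1000 steps, log file every 100 steps
print(run_summary(10000, 4, 1000, 100))
```

With these settings, the equilibration covers 40 ps of simulated time, which is short; production studies typically run far longer.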

In [None]:
print('Running NVT')
simulation.step(10000)

## NPT Production MD
<a id=npt></a>

To run our simulation in the NPT ensemble, we need to add a barostat to the system to control the pressure. For this, we can use [`MonteCarloBarostat`](http://docs.openmm.org/latest/api-python/generated/openmm.openmm.MonteCarloBarostat.html#openmm.openmm.MonteCarloBarostat). The parameters are the pressure (1 bar) and the temperature (300 K). 

The barostat assumes that the simulation is being run at constant temperature, but it does not regulate the temperature itself. Therefore, it is critical to always use it together with a Langevin integrator or an Andersen thermostat, and to specify the same temperature for the barostat as for the integrator or thermostat. Otherwise, the simulation will sample the wrong ensemble and produce incorrect results.

In [None]:
system.addForce(MonteCarloBarostat(1*bar, 300*kelvin))

# It is important to call the reinitialize method on the simulation
# otherwise the modifications will not be applied.
simulation.context.reinitialize(preserveState=True)

We then run the simulation for 10000 steps.

<div class="alert alert-block alert-info">
ℹ️ <b>Exercise 1</b>

Replace the `FIXME` in the cell below with code to run the simulation for 10000 steps.
</div>

In [None]:
print('Running NPT')

# run for 10000 steps
FIXME

## Analysis
<a id=analysis></a>

We can now do some basic analysis using Python. We will plot the time evolution of the potential energy, temperature, and box volume. Remember that OpenMM itself is primarily an MD engine. For in-depth analysis of your simulations you can use other Python packages such as [MDtraj](https://www.mdtraj.org/) or [MDAnalysis](https://www.mdanalysis.org/).
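
Besides plotting, it is often useful to reduce the log to a few numbers, such as the average temperature after an initial equilibration period. The sketch below parses reporter-style CSV text using only the standard library. The sample rows are made up for illustration (they are not real simulation output), the header layout is an assumption about what `StateDataReporter` writes with the options used above, and `read_log` is a hypothetical helper.

```python
import csv
import io
from statistics import mean

def read_log(text: str) -> list[dict]:
    """Parse StateDataReporter-style CSV text into a list of row dictionaries."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        # Skip blank lines and the header line, which starts with '#'
        if not record or record[0].startswith("#"):
            continue
        step, energy, temperature, volume = record[:4]
        rows.append({"step": int(step), "energy": float(energy),
                     "temperature": float(temperature), "volume": float(volume)})
    return rows

# Made-up sample in the assumed log layout (illustrative values only)
sample = '''#"Step","Potential Energy (kJ/mole)","Temperature (K)","Box Volume (nm^3)"
100,-150000.0,250.0,120.0
200,-149000.0,290.0,120.0
300,-148500.0,300.0,120.0
400,-148400.0,302.0,120.0
'''
rows = read_log(sample)
# Average temperature after discarding the first 200 steps as equilibration
print(mean(r["temperature"] for r in rows if r["step"] > 200))
```

To apply this to the real log, pass the contents of `md_log.txt` to `read_log` instead of the sample string.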


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Load the data and extract the columns
data = np.loadtxt('md_log.txt', delimiter=',')
step = data[:,0]
potential_energy = data[:,1]
temperature = data[:,2]
volume = data[:,3]

# Potential Energy
plt.figure(figsize=(10, 10)) 
plt.subplot(3, 1, 1) 
plt.plot(step, potential_energy, color='b', linewidth=1.5)
plt.xlabel("Step")
plt.ylabel("Potential Energy (kJ/mol)")

# Temperature
plt.subplot(3, 1, 2) 
plt.plot(step, temperature, color='r', linewidth=1.5)
plt.xlabel("Step")
plt.ylabel("Temperature (K)")

# Volume
plt.subplot(3, 1, 3) 
plt.plot(step, volume, color='g', linewidth=1.5)
plt.xlabel("Step")
plt.ylabel("Volume (nm³)")
plt.tight_layout()
plt.show()

## Checkpointing
<a id=checkpoints></a>

When you run long simulations, it is useful to save checkpoints. They let you restart the simulation after a crash, and also resume it across several jobs when you need to fit within the time limits of an HPC job scheduler.

To resume a simulation, we need three files saved to disk that we can load:

1. The topology file. This will be a PDB file of our solvated system.
2. A serialized `System`. This is an XML file that contains the force field settings.
3. A checkpoint file. This is a binary file that contains the positions, velocities, box vectors, and other internal data such as the states of the random number generators used in the simulation.

The first 2 files only need to be saved once, because they are constant throughout the simulation. The checkpoint file needs to be saved frequently. You can then resume the simulation from the timestep at which the checkpoint file was last saved.
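
In a job script, a common pattern is to check for these files at startup and resume only if all of them are present. The following is a minimal sketch of that check, using the file names that appear later in this notebook. The helper `missing_restart_files` is hypothetical, and the demonstration uses empty placeholder files in a temporary directory.

```python
from pathlib import Path
import tempfile

RESTART_FILES = ("topology.pdb", "system.xml", "checkpoint.chk")

def missing_restart_files(directory) -> list[str]:
    """Return the names of restart files that are absent from directory."""
    return [name for name in RESTART_FILES if not (Path(directory) / name).is_file()]

# Demonstration in a temporary directory holding only two of the three files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "topology.pdb").touch()
    (Path(tmp) / "system.xml").touch()
    missing = missing_restart_files(tmp)
    print("resume" if not missing else f"start fresh, missing: {missing}")
```

A script built around this check can be submitted repeatedly without modification: the first job starts from scratch and every later job picks up from the last checkpoint.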

### Setup the Checkpoint

We will create the topology file using `PDBFile` to write a PDB file of the system. We will use `XmlSerializer` to save the serialized system to an XML file. Finally, we will use `CheckpointReporter` to regularly create checkpoint files.

In [None]:
# Save the topology as a PDB file.
with open('topology.pdb', 'w') as output:
    PDBFile.writeFile(simulation.topology, simulation.context.getState(getPositions=True).getPositions(),output)

# save a serialized version of the system. This stores the forcefield parameters.
with open('system.xml', 'w') as output:
    output.write(XmlSerializer.serialize(system))

# Setup a checkpoint reporter. This stores the positions, velocities, and box vectors. 
# It will save a checkpoint every 1000 timesteps.
simulation.reporters.append(CheckpointReporter('checkpoint.chk', 1000))

`CheckpointReporter` saves periodic checkpoints of a simulation. The checkpoints will overwrite one another, i.e., only the last checkpoint will be saved in the file. Loading a checkpoint will restore a simulation to a reasonably close, but usually not identical, state to when it was written. The checkpoint contains data that is highly specific to the `System`, `Platform`, and the hardware and software of the computer it was created on. If you try and load it on a computer with different hardware, it is likely to fail. Checkpoints created with different versions of OpenMM are often incompatible. 

For a more portable way of saving the state of a simulation, you can save the checkpoint as an XML state file. Read the [API docs](http://docs.openmm.org/development/api-python/generated/openmm.app.checkpointreporter.CheckpointReporter.html) for more information.

### Running for a Set Time Limit

We can run a simulation for a set amount of wall clock time using the `Simulation`'s [`runForClockTime`](http://docs.openmm.org/latest/api-python/generated/openmm.app.simulation.Simulation.html#openmm.app.simulation.Simulation.runForClockTime) method. By [wall clock](https://en.wikipedia.org/wiki/Elapsed_real_time) time, we mean the actual time a program runs for, as measured by looking at a clock on a wall (or a watch, or a timer, etc), as opposed to the simulated time.
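
Conceptually, running for a fixed wall clock time means advancing the simulation in small chunks until a deadline passes. The sketch below illustrates that idea with a stand-in step function. It is not OpenMM's implementation, and the names `run_for_clock_time`, `step_fn`, `chunk_steps`, and `clock` are all hypothetical.

```python
import time

def run_for_clock_time(step_fn, budget_seconds, chunk_steps=10, clock=time.monotonic):
    """Call step_fn(chunk_steps) repeatedly until budget_seconds of wall time elapse."""
    deadline = clock() + budget_seconds
    total_steps = 0
    while clock() < deadline:
        step_fn(chunk_steps)  # advance the simulation by a small chunk
        total_steps += chunk_steps
    return total_steps

# Stand-in for a simulation step call: it just sleeps briefly per chunk
steps_done = run_for_clock_time(lambda n: time.sleep(0.01), budget_seconds=0.1)
print(f"completed {steps_done} steps in about 0.1 s of wall time")
```

The number of steps completed depends on how fast each chunk runs, so the same time budget yields more simulated time on a GPU than on a CPU.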

In [None]:
# run for 30 seconds
simulation.runForClockTime(30.0*seconds)

### Resume From a Checkpoint

We now have the required files `'topology.pdb'`, `'system.xml'`, and `'checkpoint.chk'`. We need to load them so we can resume the simulation from the last checkpoint. Note that we have to define the integrator again, as well as the simulation reporters. Furthermore, we set `append=True` on the DCD reporter and on the `StateDataReporter` that writes to the log file, so that the existing output files are extended rather than overwritten.

<div class="alert alert-block alert-info">
ℹ️ <b>Exercise 2</b>

Add a line of code to make the simulation run for 30 seconds of wall time.
</div>

In [None]:
pdb = PDBFile('topology.pdb')

with open('system.xml') as input:
    system = XmlSerializer.deserialize(input.read())

# Define the integrator.
integrator = LangevinMiddleIntegrator(300*kelvin, 1/picosecond, 0.004*picoseconds)

# Create the Simulation
simulation = Simulation(pdb.topology, system, integrator)

# set the positions, velocities, and box vectors from the checkpoint file
simulation.loadCheckpoint('checkpoint.chk')

# We still need to define the reporters again

# Write trajectory to a file called traj.dcd every 1000 steps
simulation.reporters.append(DCDReporter('traj.dcd', 1000, append=True))

# Print state information to the screen every 1000 steps
simulation.reporters.append(StateDataReporter(stdout, 1000, step=True,
        potentialEnergy=True, temperature=True, volume=True))

# Print the same info to a log file every 100 steps
simulation.reporters.append(StateDataReporter('md_log.txt', 100, step=True,
        potentialEnergy=True, temperature=True, volume=True, append=True))

# Setup a checkpoint reporter. This stores the positions, velocities, and box vectors.
simulation.reporters.append(CheckpointReporter('checkpoint.chk', 1000))

# write the code to run for 30 seconds of wall clock time
FIXME

### Resume Multiple Times

We can practice resuming multiple times. This is something you might have to do to fit a long simulation within the limits of an HPC job scheduler.

<div class="alert alert-block alert-info">
ℹ️ <b>Exercise 3</b>

Add the code required to create the `Simulation` object.
</div>

In [None]:
for i in range(3):
    print("Resuming from checkpoint iteration = ", i)

    pdb = PDBFile('topology.pdb')

    with open('system.xml') as input:
        system = XmlSerializer.deserialize(input.read())

    # Define the integrator.
    integrator = LangevinMiddleIntegrator(300*kelvin, 1/picosecond, 0.004*picoseconds)

    # Create the Simulation
    # write the code to create the simulation object
    simulation = FIXME

    # set the positions, velocities, and box vectors from the checkpoint file
    simulation.loadCheckpoint('checkpoint.chk')

    # We still need to define the reporters again

    # Write trajectory to a file called traj.dcd every 1000 steps
    simulation.reporters.append(DCDReporter('traj.dcd', 1000, append=True))

    # Print state information to the screen every 1000 steps
    simulation.reporters.append(StateDataReporter(stdout, 1000, step=True,
            potentialEnergy=True, temperature=True, volume=True))

    # Print the same info to a log file every 100 steps
    simulation.reporters.append(StateDataReporter('md_log.txt', 100, step=True,
        potentialEnergy=True, temperature=True, volume=True, append=True))

    # Setup a checkpoint reporter. This stores the positions, velocities, and box vectors.
    simulation.reporters.append(CheckpointReporter('checkpoint.chk', 1000))

    # run for 30 seconds
    simulation.runForClockTime(30.0*seconds)


### Analysis

We can redo the analysis on the longer trajectory.

In [None]:
# Load the data and extract the columns
data = np.loadtxt('md_log.txt', delimiter=',')
step = data[:,0]
potential_energy = data[:,1]
temperature = data[:,2]
volume = data[:,3]

# Potential Energy
plt.figure(figsize=(10, 10)) 
plt.subplot(3, 1, 1) 
plt.plot(step, potential_energy, color='b', linewidth=1.5)
plt.xlabel("Step")
plt.ylabel("Potential Energy (kJ/mol)")

# Temperature
plt.subplot(3, 1, 2) 
plt.plot(step, temperature, color='r', linewidth=1.5)
plt.xlabel("Step")
plt.ylabel("Temperature (K)")

# Volume
plt.subplot(3, 1, 3) 
plt.plot(step, volume, color='g', linewidth=1.5)
plt.xlabel("Step")
plt.ylabel("Volume (nm³)")
plt.tight_layout()
plt.show()

## Visualization
<a id="viz"></a>

We can use the `nglview` package to view the simulation structures and trajectories in the Jupyter notebook.

For more serious visualization and rendering, there is a variety of programs available (https://en.wikipedia.org/wiki/List_of_molecular_graphics_systems). A couple of the most popular ones are:
- [VMD](https://www.ks.uiuc.edu/Research/vmd/)
- [PyMol](https://pymol.org/)

<div class="alert alert-block alert-info">
⚠️ <b>Note this part does not currently work in Colab</b>
</div>

In [None]:
if 'google.colab' in str(get_ipython()):
    # https://github.com/googlecolab/colabtools/issues/3409
    import locale
    locale.getpreferredencoding = lambda: "UTF-8"

!mamba install -y -c conda-forge nglview mdtraj

In [None]:
import mdtraj
import nglview

traj = mdtraj.load("traj.dcd", top="topology.pdb")
view = nglview.show_mdtraj(traj)
view.add_representation('licorice',selection="water")
view

<div class="alert alert-block alert-info">
ℹ️ <b>Exercise 4</b>

Download the files `topology.pdb` and `traj.dcd` from Colab. You can find the files in the left side pane:

<img src="images/screenshot1.png" alt="screenshot of Colab interface">
</div>


<div class="alert alert-block alert-info">
ℹ️ <b>Exercise 5 (optional)</b>
    
Open `topology.pdb` with VMD or another program of your choice. In VMD, you can then go `file`→`load data into molecule` and select `traj.dcd`. You should now be able to view the trajectory.

When you first open `topology.pdb` with VMD, it will look like this:

<img src="images/screenshot2.png" alt="screenshot of topology.pdb opened in VMD">

The blue region of atoms in the middle is the protein and the red atoms are the water box.
</div>

## Next
Now go to the protein-ligand complex [notebook](./protein_ligand_complex.ipynb).

## Solutions
<a id="solutions"></a>

*Exercise 1.* Run for 10000 steps:
```python
simulation.step(10000)
```

*Exercise 2.* Run for 30 seconds of wall time:
```python
simulation.runForClockTime(30.0*seconds)
```

*Exercise 3.* Create the simulation:
```python
simulation = Simulation(pdb.topology, system, integrator)
```