# Umbrella Sampling


## Umbrella sampling simulations

Most umbrella sampling simulations have three main steps:

1. Preparing the windows. This is usually done using steered MD (SMD).
2. Running the windows. These can be run in parallel.
3. Analysing the results. Typically this involves computing a potential of mean force (PMF) with the weighted histogram analysis method (WHAM).

Often you will find in step 3 that the simulations do not produce a well-converged PMF. You will then need to go back to steps 1 and 2 and change some settings; this trial-and-error feedback loop is a normal part of the process. For this tutorial we will use settings already known to work.

## System

This tutorial uses umbrella sampling to compute the free energy profile along the end-to-end distance of deca-alanine in vacuum. Deca-alanine is commonly used as a toy system [1]; its equilibrium structure is a stable alpha-helix. We will start with the alpha-helix structure, as used in [1], perform a steered MD simulation to pull it from helix to coil, and then run umbrella sampling windows to compute the free energy profile along the collective variable (CV), which is the distance between the first and last alpha-carbon.


[1] Sanghyun Park, Fatemeh Khalili-Araghi, Emad Tajkhorshid, Klaus Schulten; Free energy calculation from steered molecular dynamics simulations using Jarzynski’s equality. J. Chem. Phys. 8 August 2003; 119 (6): 3559–3566. https://doi.org/10.1063/1.1590311

## Step 1 - setting up the windows with SMD

The script below does the following:
-   Loads a PDB file.
-   Defines a collective variable (CV) as the distance between the first and last alpha-carbon.
-   Adds a harmonic restraint to the CV.
-   Runs a simulation in which the centre of the harmonic restraint is moved at constant velocity, starting from 1.3 nm (this is called constant-velocity steered MD). With the settings used here (0.02 nm/ps for 120 ps) the restraint centre ends at 3.7 nm.
-   Saves a configuration for each of the 24 equally spaced windows between 1.3 nm and 3.3 nm.

In [None]:
import openmm as mm
import openmm.app as app
import openmm.unit as unit
from sys import stdout
import numpy as np

pdb = app.PDBFile("deca-ala.pdb")

forcefield = app.ForceField('amber14-all.xml')

modeller = app.Modeller(pdb.topology, pdb.positions)
modeller.addHydrogens(forcefield)
system = forcefield.createSystem(modeller.topology, nonbondedMethod=app.NoCutoff,constraints=app.HBonds)
integrator = mm.LangevinMiddleIntegrator(300*unit.kelvin, 1/unit.picosecond, 0.002*unit.picoseconds)
simulation = app.Simulation(modeller.topology, system, integrator)
simulation.context.setPositions(modeller.positions)
simulation.reporters.append(app.PDBReporter('smd_traj.pdb', 10000))
simulation.reporters.append(app.StateDataReporter(stdout, 10000, step=True,time=True,
        potentialEnergy=True, temperature=True, speed=True))

# equilibrate
simulation.step(1000)

# define the CV as the distance between the CAs of the two end residues
index1 = 8
index2 = 98
cv = mm.CustomBondForce('r')
cv.addBond(index1, index2)

# now setup SMD

# starting value
r0=1.3 # nm

# force constant
fc_pull=1000.0 # kJ/mol/nm^2

# pulling speed
v_pulling=0.02 # nm / ps

# simulation time step
dt=0.002 # ps

# total number of steps
N=60000

# frequency to increment r0 (1 makes the simulation slow)
M=10

# define a harmonic restraint on the CV
# the location of the restraint will be moved as we run the simulation
# this is constant velocity steered MD
pullingForce = mm.CustomCVForce('0.5 * fc_pull * (cv-r0)^2')
pullingForce.addGlobalParameter('fc_pull', fc_pull)
pullingForce.addGlobalParameter('r0', r0)
pullingForce.addCollectiveVariable("cv",cv)
system.addForce(pullingForce)
simulation.context.reinitialize(preserveState=True)

# pulling loop, update CV target location every 10 steps
# save specific configurations when the current_cv_value first reaches the specified windows
windows = np.linspace(1.3,3.3,24)
window_coords = []
window_index = 0

for i in range(N//M):
    simulation.step(M)
    # getCollectiveVariableValues returns one value per CV; take the only one
    current_cv_value = pullingForce.getCollectiveVariableValues(simulation.context)[0]
    
    if (i*M)%10000==0:
        print("r0 = ",r0, "r = ",current_cv_value)
    
    # increment the location of the CV based on the pulling velocity
    r0 += v_pulling*dt*M
    simulation.context.setParameter('r0',r0)

    # check if we should save this config as a window starting structure
    if (window_index < len(windows) and current_cv_value >= windows[window_index]):
        window_coords.append(simulation.context.getState(getPositions=True, enforcePeriodicBox=False).getPositions())
        window_index+=1

# save the window structures
for i, coords in enumerate(window_coords):
    with open("window_"+str(i)+".pdb","w") as outfile:
        app.PDBFile.writeFile(simulation.topology, coords, outfile)

Once the script has finished running there will be 24 new PDB files called "window_n.pdb", where n is an integer from 0 to 23.
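
As a quick sanity check on the pulling settings used above, the arithmetic below works out how far the restraint centre travels and how the 24 windows are spaced. It is a standalone sketch that only repeats the numbers from the script (the names `n_steps` and `update_every` are illustrative stand-ins for `N` and `M`); it does not run any simulation.

```python
# Pulling schedule from the SMD script, reproduced as plain arithmetic.
r0_start = 1.3        # nm, initial restraint centre
v_pulling = 0.02      # nm/ps, pulling speed
dt = 0.002            # ps, time step
n_steps = 60000       # N in the script
update_every = 10     # M in the script

total_time = n_steps * dt                     # ps
increment = v_pulling * dt * update_every     # nm added to r0 at each update
r0_final = r0_start + increment * (n_steps // update_every)

# 24 equally spaced window centres between 1.3 nm and 3.3 nm
n_windows = 24
spacing = (3.3 - 1.3) / (n_windows - 1)

print(f"total pulling time: {total_time:.0f} ps")
print(f"r0 increment per update: {increment:.4f} nm")
print(f"final restraint centre: {r0_final:.2f} nm")
print(f"window spacing: {spacing:.4f} nm")
```

The restraint centre ends beyond the last window centre at 3.3 nm, which gives the CV room to reach every window during the pull. The CV typically lags behind the restraint centre, so it is still worth confirming that all 24 files were written.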

We now have the initial configurations for the umbrella sampling windows.
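
The script hard-codes the atom indices of the two terminal alpha-carbons (8 and 98). One way to avoid hard-coding is to look them up by atom name. The helper below is a hypothetical sketch, not part of the original script: it accepts any iterable of objects with `name` and `index` attributes, which is what OpenMM's `topology.atoms()` yields, and is demonstrated here on stand-in objects so that it runs without a PDB file.

```python
from types import SimpleNamespace

def terminal_ca_indices(atoms):
    """Return the indices of the first and last atoms named 'CA'."""
    ca = [atom.index for atom in atoms if atom.name == "CA"]
    if len(ca) < 2:
        raise ValueError("need at least two CA atoms to define an end-to-end distance")
    return ca[0], ca[-1]

# Stand-in atoms for a made-up three-residue backbone (not deca-alanine).
names = ["N", "CA", "C", "O"] * 3
demo_atoms = [SimpleNamespace(name=n, index=i) for i, n in enumerate(names)]
first_ca, last_ca = terminal_ca_indices(demo_atoms)
print(first_ca, last_ca)
```

In the SMD script this could be called as `terminal_ca_indices(modeller.topology.atoms())` after the hydrogens have been added, since adding atoms changes the indices.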

## Step 2 - running the windows

The script to run the windows is very similar to the script in step 1. The key differences are that we load in an initial structure that corresponds to each specific window and the harmonic restraint on the CV does not move. The script below defines a function to run one window.

In [None]:
def run_window(window_index):

    import openmm as mm
    import openmm.app as app
    import openmm.unit as unit
    import numpy as np

    # window centres, must match those used in step 1
    windows = np.linspace(1.3,3.3,24)

    print("running window", window_index)

    pdb = app.PDBFile("window_"+str(window_index)+".pdb")
    forcefield = app.ForceField('amber14-all.xml')
    system = forcefield.createSystem(pdb.topology, nonbondedMethod=app.NoCutoff,constraints=app.HBonds)
    integrator = mm.LangevinMiddleIntegrator(300*unit.kelvin, 1/unit.picosecond, 0.002*unit.picoseconds)
    simulation = app.Simulation(pdb.topology, system, integrator)
    simulation.context.setPositions(pdb.positions)

    # define the CV as the distance between the CAs of the two end residues
    index1 = 8
    index2 = 98

    cv = mm.CustomBondForce('r')
    cv.addBond(index1, index2)

    # fixed harmonic restraint at the window location

    # starting value
    r0 = windows[window_index]

    # force constant
    fc_pull=1000.0 # kJ/mol/nm^2

    # total number of steps and the interval (in steps) between recorded CV values
    N=100000 # 200 ps
    M=1000 # record the CV every 2 ps

    # define a harmonic restraint on the CV
    pullingForce = mm.CustomCVForce('0.5 * fc_pull * (cv-r0)^2')
    pullingForce.addGlobalParameter('fc_pull', fc_pull)
    pullingForce.addGlobalParameter('r0', r0)
    pullingForce.addCollectiveVariable("cv",cv)
    system.addForce(pullingForce)
    simulation.context.reinitialize(preserveState=True)

    # record the CV timeseries
    cv_values=[]
    for i in range(N//M):
        simulation.step(M)

        # get the value of the cv
        current_cv_value = pullingForce.getCollectiveVariableValues(simulation.context)
        cv_values.append([i, current_cv_value[0]])

    np.savetxt("cv_values_window_"+str(window_index)+".txt",np.array(cv_values))

    print("Completed window", window_index)

We then run all 24 windows in parallel.

This takes about 3 minutes on an M2 MacBook.

Note that we use `ThreadPool` here because it works in a Jupyter notebook. You could just as easily use a bash script to launch a copy of the script for each window.

In [None]:

from multiprocessing.pool import ThreadPool

window_indices = list(range(0,24))
with ThreadPool(24) as p:
    p.map(run_window, window_indices)
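
If you prefer to launch each window as a separate process, for example from a bash loop, the window index can be taken from the command line. The sketch below shows only the argument handling; the function name `parse_window_index` is hypothetical, and it assumes `run_window` from the cell above is defined in the same script.

```python
import argparse

N_WINDOWS = 24  # must match the number of windows prepared in step 1

def parse_window_index(argv=None):
    """Read a window index in the range 0..N_WINDOWS-1 from the command line."""
    parser = argparse.ArgumentParser(description="Run one umbrella sampling window")
    parser.add_argument("window", type=int, choices=range(N_WINDOWS),
                        metavar="WINDOW", help="window index, 0 to 23")
    return parser.parse_args(argv).window

# Example with an explicit argument list; in a script you would call
# parse_window_index() with no arguments and pass the result to run_window().
print(parse_window_index(["7"]))
```

Restricting the argument with `choices` means an out-of-range index is rejected before any simulation is set up.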


Once all the window simulations have completed you will have the CV timeseries files "cv_values_window_n.txt", one per window.

## Step 3 - analysis - compute the PMF

The first thing to check is that the histograms of the CV timeseries from the windows have good overlap. Here is an example:
![histogram](hist.png)
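
For a rough idea of how much overlap to expect, the width of each histogram can be compared with the window spacing. The sketch below assumes the underlying free energy is flat across a window, so that the CV distribution is a Gaussian with standard deviation sqrt(kT/k); real histograms will be narrower or broader depending on the shape of the PMF. The value of the gas constant is the only number not taken from the scripts in this tutorial.

```python
import math

R = 0.0083144626       # kJ/(mol K), molar gas constant
temperature = 300.0    # K
fc = 1000.0            # kJ/mol/nm^2, restraint force constant

kT = R * temperature
sigma = math.sqrt(kT / fc)            # nm, width under the flat-profile assumption
spacing = (3.3 - 1.3) / (24 - 1)      # nm, distance between window centres

print(f"sigma = {sigma:.3f} nm, spacing = {spacing:.3f} nm, ratio = {spacing / sigma:.2f}")
```

With these settings the spacing is under two standard deviations, so under this assumption neighbouring windows should overlap. If your own histograms show gaps, a smaller force constant or more closely spaced windows are the usual things to try.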


You can plot yours with the script below. This script also writes the metafile we will need for the next step.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# plot the histograms
metafilelines=[]
windows = np.linspace(1.3,3.3,24)
for i in range(len(windows)):
    data=np.loadtxt("cv_values_window_"+str(i)+".txt")
    plt.hist(data[:,1])
    metafileline= "cv_values_window_"+str(i)+".txt "+str(windows[i])+ " 1000\n"
    metafilelines.append(metafileline)

plt.xlabel("r (nm)")
plt.ylabel("count")

with open("metafile.txt", "w") as f:
    f.writelines(metafilelines)

To compute the PMF we can use WHAM. An easy-to-use and widely compatible implementation is the WHAM program by Alan Grossfield, which can be downloaded here: http://membrane.urmc.rochester.edu/?page_id=126

In [None]:
!wget http://membrane.urmc.rochester.edu/sites/default/files/wham/wham-release-2.0.11.tgz
!tar xf wham-release-2.0.11.tgz
!cd wham/wham && make

To use `wham` we need a metadata file that lists, for each window, the name of the CV timeseries file, the location of the harmonic restraint and the value of the spring constant. We created this in the histogram plotting script earlier.
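
For reference, the metadata file written by the plotting script has one line per window: file name, restraint centre, spring constant. The sketch below rebuilds those lines without reading any simulation output, so you can see the expected layout; the fixed six-decimal formatting is a choice made here for readability, not something the original script does.

```python
n_windows = 24
r_min, r_max = 1.3, 3.3    # nm, same range as np.linspace(1.3, 3.3, 24)
fc = 1000                  # same force constant as the restraint

lines = []
for i in range(n_windows):
    # centre of window i, equally spaced between r_min and r_max
    centre = r_min + i * (r_max - r_min) / (n_windows - 1)
    lines.append(f"cv_values_window_{i}.txt {centre:.6f} {fc}")

# show the first three lines of the metadata file
print("\n".join(lines[:3]))
```

The restraint centres and spring constant listed here must match the values actually used when the windows were run, otherwise the reweighting will be wrong.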

The `wham` program is run using command-line arguments. Read the documentation to find out more, including the energy units it assumes for the spring constant and the output PMF: http://membrane.urmc.rochester.edu/sites/default/files/wham/doc.pdf

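The command below computes the PMF from our data in the range 1.3 nm to 3.5 nm, with 50 bins and a convergence tolerance of 1e-6, at our simulated temperature of 300 K. The `0` after the temperature is the padding argument, which only matters for periodic variables.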

In [None]:
!./wham/wham/wham 1.3 3.5 50 1e-6 300 0 metafile.txt pmf.txt > wham_log.txt

We can then plot the computed PMF. It should look something like this:
![pmf.png](pmf.png)

In [None]:
# plot the pmf
pmf=np.loadtxt("pmf.txt")
plt.plot(pmf[:,0], pmf[:,1])
plt.xlabel("r (nm)")
plt.ylabel("PMF (kJ/mol)")
plt.show()
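
A PMF is only defined up to an additive constant, so it is common to shift the curve so that its minimum is zero before plotting or comparing runs. The sketch below does this on a small made-up array standing in for `pmf.txt`; it assumes, as in the plotting cell above, that the first column is the CV and the second is the free energy, and it guards against non-finite entries, which may appear for bins that were never sampled. The function name `shift_pmf_to_zero` is hypothetical.

```python
import numpy as np

def shift_pmf_to_zero(free_energy):
    """Shift a PMF so its lowest finite value is zero; non-finite bins are left as they are."""
    free_energy = np.asarray(free_energy, dtype=float)
    finite = np.isfinite(free_energy)
    if not finite.any():
        raise ValueError("no finite PMF values to shift")
    return free_energy - free_energy[finite].min()

# Made-up values for illustration only (not simulation results).
demo = np.array([[1.3, 5.0], [1.5, 2.0], [1.7, 8.0], [1.9, np.inf]])
shifted = shift_pmf_to_zero(demo[:, 1])
print(shifted)
```

With real data you would pass `pmf[:,1]` to the function and plot the result against `pmf[:,0]`.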