# Molecular simulations with *Yaff*: The basics

Welcome to this notebook, in which we dive into using *Yaff* for molecular simulations. In this *Python* notebook, we will cover:

1. The *MolMod* package
2. Instantiating a molecular system in *Yaff*
3. Geometry optimizations
4. *Velocity Verlet* as the molecular dynamics engine
5. Controlling temperature and pressure: Thermostats and barostats
6. Trajectory analysis
7. Restarting files

For more detailed documentation, we refer to the online documentation pages of [MolMod](http://molmod.github.io/molmod/) and [Yaff](http://molmod.github.io/yaff/).

## Chapter 1: The *MolMod* package

*MolMod* is the underlying *Python* library for many projects at the CMM, such as *Yaff*. It contains several auxiliary modules for the development of molecular modeling programs. Here, we will focus on using *MolMod* as a library reference for *Yaff*. But first things first: installing the package. All installation details can be found [here](http://molmod.github.io/molmod/tutorial/install.html) (note that you can also easily install *MolMod* using *Anaconda*).

To check whether the installation was successful, try executing the next chunk of code.

In [None]:
import numpy as np
from molmod.units import kjmol

print(5*kjmol)

### 1.1. Conversion from and to atomic units

Internally, CMM codes work in atomic units. This unit system is consistent, like the SI unit system. Using this, one does not need conversion factors in the middle of a computation once all values are converted to atomic units. This facilitates the programming and reduces accidental bugs due to forgetting these conversion factors in the body of the code. To make life easier, *MolMod* provides a library of conversion factors called `molmod.units`:

In [None]:
from molmod.units import *

Now, if we want to set the time step of the Verlet integration to 0.5 fs, for instance, we need to know the conversion from femtosecond to the atomic unit of time. With *MolMod*, we can simply type

In [None]:
timestep = 0.5*femtosecond
print(timestep)

Now, `timestep` contains the time value, in atomic units, corresponding to 0.5 fs. Similarly, you can define distances, temperatures, masses, etc. To convert a value in atomic units back to a given unit, divide by the corresponding constant. A complete list of conversion constants can be found [here](http://molmod.github.io/molmod/reference/const.html#module-molmod.units).
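To make the multiply/divide convention concrete, the sketch below rebuilds the femtosecond conversion factor by hand. It deliberately does not import *MolMod*: the constant `AU_TIME_IN_SECONDS` is a hard-coded CODATA value chosen for this illustration, and the local name `femtosecond` only mimics the constant of the same name in `molmod.units`.

```python
# Hard-coded for illustration; molmod.units provides the real constants.
AU_TIME_IN_SECONDS = 2.4188843265857e-17  # one atomic unit of time, in seconds
FEMTOSECOND_IN_SECONDS = 1.0e-15

# Number of atomic time units in one femtosecond
femtosecond = FEMTOSECOND_IN_SECONDS / AU_TIME_IN_SECONDS

# Multiply by the constant to convert *to* atomic units ...
timestep = 0.5 * femtosecond
# ... and divide by it to convert *back* to femtoseconds
timestep_in_fs = timestep / femtosecond

print(f"0.5 fs corresponds to {timestep:.4f} atomic units of time")
print(f"converted back: {timestep_in_fs} fs")
```

The same pattern applies to every unit: values enter a computation as `value*unit` and are reported as `value/unit`.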

### 1.2. Physical constants

In the `molmod.constants` module, the four main physical constants for atomic simulations are listed:

In [None]:
from molmod.constants import *

# Print out Avogadro's number N_A
print(avogadro)

# Print out the Boltzmann constant k_B
print(boltzmann)

# Print out the speed of light c
print(lightspeed)

# Print out Planck's constant h
print(planck)

### 1.3. The periodic table

Finally, *MolMod* is also very useful to access the chemical information contained in the periodic table. This information is stored in the `periodic` object of the `molmod.periodic` module:

In [None]:
from molmod.periodic import periodic

For instance, let's say we are interested in information about the carbon atom, only knowing its symbol `"C"`. We can then easily retrieve its atomic number and mass:

In [None]:
print(periodic['C'].number)
print(periodic['C'].mass)

Conversely, once we know the atomic number, we can retrieve for instance its symbol and mass:

In [None]:
print(periodic[6].symbol)
print(periodic[6].mass)

<br/>

## Chapter 2: Instantiating a molecular system in *Yaff*

A comprehensive installation guide for *Yaff* can be found [here](http://molmod.github.io/yaff/ug_install.html). To check its correct installation, please execute the following code.

In [None]:
from yaff.sampling.verlet import VerletIntegrator

### 2.1. Introduction

A `System` instance in *Yaff* contains all the physical properties of a molecular system plus some extra information that is useful to define a force field. Most properties are optional.

The **basic molecular properties** are:

* Atomic numbers
* Positions of the atoms
* 0, 1, 2, or 3 Cell vectors (optional)
* Atomic charges (optional)
* Atomic masses (optional)


Additionally, there are some optional auxiliary properties, such as bonds and atom types, which are useful to define force fields.

The positions and the cell parameters may change during the simulation. All other properties (including the number of atoms and the number of cell vectors) do not change during a simulation. If such changes seem to be necessary, one should create a new `System` instance instead of modifying an existing one.

The `System` constructor arguments can be specified with some *Python* code:

In [None]:
from yaff import *
import numpy as np

system = System(
    numbers=np.array([8, 1, 1]*2),
    pos=np.array([[-4.583, 5.333, 1.560], [-3.777, 5.331, 0.943],
                  [-5.081, 4.589, 1.176], [-0.083, 4.218, 0.070],
                  [-0.431, 3.397, 0.609], [0.377, 3.756, -0.688]])*angstrom,
    scopes=['WAT']*6,
    ffatypes=['O', 'H', 'H']*2,
    bonds=np.array([[(i//3)*3, i] for i in range(6) if i%3!=0]),
    rvecs=np.array([[9.865, 0.0, 0.0], [0.0, 9.865, 0.0], [0.0, 0.0, 9.865]])*angstrom,
)

where the `*angstrom` converts the numbers from angstrom to atomic units. 
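The list comprehension passed to `bonds` is compact, so it can help to evaluate it on its own. The sketch below uses plain *Python* lists (no *Yaff* objects) and simply prints which atom pairs the expression produces for the two water molecules above.

```python
n_atoms = 6
numbers = [8, 1, 1] * 2  # atomic numbers: O, H, H, O, H, H

# Atoms 0 and 3 are the oxygens. Every other atom i is bonded to the
# oxygen that starts its own group of three atoms, i.e. index (i//3)*3.
bonds = [[(i // 3) * 3, i] for i in range(n_atoms) if i % 3 != 0]

for i, j in bonds:
    print(f"bond between atom {i} (Z={numbers[i]}) and atom {j} (Z={numbers[j]})")
```

Each water molecule thus contributes two O-H bonds, and no bond connects the two molecules.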

One can also load the system from one or more files using the `from_file` function. For instance, if we want to read the atomic positions from the file `"initial.xyz"` and additionally provide the cell parameters, the following syntax can be used:

    system = System.from_file('initial.xyz', rvecs=np.identity(3)*9.865*angstrom)

The `from_file` class method accepts one or more files and any constructor argument from the `System` class as keyword arguments.

Conversely, a system can be easily stored to a file using the `to_file` method:

    system.to_file('last.chk')

where the `.chk`-format is the standard text-based checkpoint file format in *Yaff*. It can also be used in the `from_file` method.

### 2.2. Working with the *System* class

For production runs, we recommend that one writes a separate script to prepare a `System` instance, which is then written to the `.chk` format for later use in scripts that perform the actual simulation and/or analysis. The example below shows how this can be done, starting from a simple `.xyz` file with coordinates for a water box with 32 molecules.
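Before building a system, it is worth checking that the cell size is physically reasonable. The sketch below estimates the mass density of 32 water molecules in a cubic cell with the 9.865 Å edge used in this chapter. The constants are hard-coded for illustration (the molar mass of water is an approximate value), so the block runs without *MolMod*.

```python
# Illustrative constants, hard-coded so that the check runs on its own
AVOGADRO = 6.02214076e23   # 1/mol
WATER_MOLAR_MASS = 18.015  # g/mol, approximate
ANGSTROM_IN_CM = 1.0e-8

n_molecules = 32
edge = 9.865               # cubic cell edge in angstrom

volume_cm3 = (edge * ANGSTROM_IN_CM) ** 3
mass_g = n_molecules * WATER_MOLAR_MASS / AVOGADRO
density = mass_g / volume_cm3

print(f"density = {density:.3f} g/cm^3")
```

A value close to the density of liquid water at ambient conditions indicates that the box size and the number of molecules are consistent.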

The first step is to load the `.xyz` file and add some extra information, cell parameters in this example, through keyword arguments.

In [None]:
sys = System.from_file('waterbox.xyz', rvecs=np.identity(3)*9.865*angstrom)

In order to run a force field simulation, one has to identify covalent bonds in the system. We could have added these via keyword arguments of the `from_file` method. In this example, the `yaff.system.System.detect_bonds()` method is used:

In [None]:
sys.detect_bonds()
print('The number of bonds:', len(sys.bonds))
print(sys.bonds)

For the analysis of some simulations on crystals, it may be useful to align the unit cell vectors with the Cartesian frame. This can be done with the `yaff.system.System.align_cell()` method. The following will align the 110 vector with the x-axis and the 001 vector with the z-axis:

In [None]:
sys.align_cell(np.array([[1, 1, 0], [0, 0, 1]]))

On several occasions, it is also useful to construct a supercell:

In [None]:
sys2 = sys.supercell(3, 3, 3)

Although one can assign arbitrary masses to each atom, one is typically interested in assigning standard atomic weights. This is done as follows:

In [None]:
sys2.set_standard_masses()

When the system is finally ready to be used as a starting point for a *Yaff* simulation, it is convenient to write it as a `.chk` file that can be easily loaded in subsequent scripts:

In [None]:
sys.to_file('waterbox.chk')
sys2.to_file('waterbox333.chk')

It is instructive to open this `.chk` file with a text editor. You will see that all attributes of the `System` class are present in this file.

### 2.3. Force-field models

Once the system is defined, as above, we need to specify the interactions between the different atoms. With *Yaff*, these interactions are modelled by analytical terms: a so-called **force field**. Each force-field term is defined by its analytical form and by the groups of atoms or atom types it acts on. **Atom types** are a set of names, one for each atom, that distinguish between different atoms and atomic environments. More details can be found [here](http://molmod.github.io/yaff/ug_forcefield.html). For now, it is sufficient to know that we can simply use the atomic symbols as atom types for this simple water model. To set these atom types in the `.chk` file, the following syntax can be used:

In [None]:
sys.ffatypes = ['O', 'H']
sys.ffatype_ids = np.array([0, 1, 1]*32)
sys.to_file('waterbox.chk')

**Note**: We strongly recommend not assigning atom types after a `System` object has been initialized, but rather providing them during construction of the `System` object. The atom types are only checked for consistency during initialization. For the sake of this tutorial, however, we have done this check for you, and you may assume that the lines above correctly assign atom types to the `System` object.
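To see how the two attributes set above fit together, the sketch below expands them with plain *Python* lists: `ffatypes` holds the unique atom-type names and `ffatype_ids` holds, for each atom, an index into that list. The two consistency checks are only an illustration of the kind of test one would want; they are not taken from the *Yaff* source.

```python
ffatypes = ['O', 'H']         # unique atom-type names
ffatype_ids = [0, 1, 1] * 32  # one index into ffatypes per atom

# Expand the indices into one atom-type name per atom
types_per_atom = [ffatypes[i] for i in ffatype_ids]

# Illustrative consistency checks (not Yaff's own checks)
assert len(types_per_atom) == 96
assert all(0 <= i < len(ffatypes) for i in ffatype_ids)

counts = {name: types_per_atom.count(name) for name in ffatypes}
print(types_per_atom[:6], counts)
```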

We have provided all relevant terms for a basic water force field in the force-field parameter file `parameters.txt`. This force field can be generated using the following syntax:

In [None]:
# This line ensures that warnings are not printed, as we have taken care of them
# Be careful when you do this
log.set_level(log.silent)
# Actually generate the force field
ff = ForceField.generate(sys, 'parameters.txt')

**Beware**: The atom types in the force-field parameter file and the system object should match. Otherwise, the force field will not know on which tuples of atoms each force-field term acts.

<br/>

## Chapter 3: Geometry optimizations

Now that we have a representation of our system and the molecular interactions, we can finally start simulating. The goal of a molecular simulation is to extract macroscopic information about the system at hand. For this, the relevant parts of the total phase space, containing all atomic positions and momenta, need to be sampled during a simulation. Here, we distinguish between two types of simulations:
* Optimizations: During an **optimization**, one is interested in finding a local **stable** structure. As a stable structure corresponds to a minimum in (free) energy, an optimization boils down to finding the point in phase space that corresponds to a minimum in (free) energy. In this case, we only sample the region around a given initial configuration, looking to alter our initial configuration in a direction of lower (free) energy. If a minimum is found, the final structure of such an optimization yields a stable configuration of the material.
* Dynamic simulations: During a **dynamic** simulation, one is interested in sampling the regions in phase space that are accessible at the given macroscopic parameters such as energy, temperature, and pressure. These macroscopic parameters define the probability with which a given point in phase space will be visited. For instance, if the total energy of a system is fixed, one can (classically) never visit points in phase space corresponding to an energy larger than this total energy. For dynamic simulations, two large classes of methods exist: **Monte Carlo** (MC) methods and **molecular dynamics** (MD) methods, both of which you are already familiar with after reading *Understanding Molecular Simulation: From Algorithms to Applications*.

In this notebook, we will focus on geometry optimizations (this chapter) and molecular dynamics simulations (Chapters 4 to 7).

### 3.1. Introduction

A basic geometry optimization (with trajectory output in an HDF5 file) is implemented as follows.

In [None]:
# Import h5py (see further)
import h5py
import os

# Reset the logging to the normal level
log.set_level(log.medium)
# Define the degrees of freedom (DOF) object
dof = CartesianDOF(ff)
# Initialize a hook to write the output to the HDF5-file output.h5
with h5py.File('output.h5', mode = 'w') as f_h5:
    hdf5 = HDF5Writer(f_h5, step=10)
    # Define the optimizer
    opt = CGOptimizer(dof, hooks = hdf5)
    # Actually run the optimizer (i.e.: optimize)
    opt.run(250)

This standard output yields the following information:

1. The iteration counter: How many optimization steps have passed?
2. The convergence value: This is the highest ratio of a convergence criterion over its threshold (about convergence criteria: see later)
3. N: The number of convergence criteria that are not met
4. Worst: The name of the criterion that is the worst (highest ratio)
5. Energy: The total energy of the system at that step (this should decrease when the optimization progresses)
6. Walltime: How much time (in seconds) has passed since the start

The optimization above will end if either all convergence criteria are met (and N = 0), or if the predefined number of steps has been carried out. In the sections below, we will go into a bit more detail on the parameters in the optimization.

### 3.2. The degrees-of-freedom object

The degrees-of-freedom or DOF object, called `dof` in the chunk of code above, specifies a set of degrees of freedom. These degrees of freedom are those parameters, such as Cartesian coordinates and cell parameters, that need to be optimized. We distinguish between the following main DOF objects, contained in `yaff.sampling.dof`:

1. `CartesianDOF`: In this DOF object, only the Cartesian coordinates of the atoms are optimized. If a periodic system is provided, the cell parameters are *not* optimized. Using the `select` keyword, one can define the indices of the atoms whose Cartesian coordinates must be optimized, while those that are absent from this list will be kept fixed. If the `select` keyword is not provided, all atoms are selected.
2. `FullCellDOF`: In this DOF object, all fractional coordinates and the nine cell parameters are optimized.
3. `StrainCellDOF`: In this DOF object, all fractional coordinates and the six independent cell parameters are optimized (note that a cell matrix is uniquely defined by six parameters, the other three merely define the orientation of the coordinate system).
4. `AnisoCellDOF`: In this DOF object, all fractional coordinates and the lengths of the three cell vectors are optimized. The angles between the different cell vectors are kept constant.
5. `IsoCellDOF`: The same as above, except that all cell vectors are rescaled by the same parameter.

In the chunk of code above, we hence only allowed for an optimization of the Cartesian coordinates, while the cell parameters were kept constant.

Depending on the DOF object, different thresholds can be defined for convergence. However, this is optional, and default values are provided if absent.
1. `gpos_rms`: The root-mean-square of the norm of the gradients of the atoms.
2. `dpos_rms`: The root-mean-square of the norm of the displacements of the atoms.
3. `grvecs_rms`: The root-mean-square of the norm of the gradients of the cell vectors.
4. `drvecs_rms`: The root-mean-square of the norm of the displacements of the cell vectors.
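The convergence columns printed by the screen logger can be understood from these thresholds. The sketch below is a simplified picture written for this notebook: the helper `convergence_summary` and all numbers in it are hypothetical, and *Yaff*'s actual bookkeeping may track more criteria than the ones listed above. It computes the ratio of each criterion to its threshold, the largest ratio, the number of unmet criteria, and the name of the worst criterion.

```python
def convergence_summary(values, thresholds):
    """Return (largest ratio, number of unmet criteria, worst criterion).

    Hypothetical helper for illustration; not part of Yaff.
    """
    ratios = {name: values[name] / thresholds[name] for name in thresholds}
    worst = max(ratios, key=ratios.get)
    n_unmet = sum(1 for ratio in ratios.values() if ratio > 1.0)
    return ratios[worst], n_unmet, worst


# Hypothetical thresholds and current values (atomic units)
thresholds = {'gpos_rms': 1e-6, 'dpos_rms': 1e-4}
values = {'gpos_rms': 5e-6, 'dpos_rms': 2e-5}

conv_val, n_unmet, worst = convergence_summary(values, thresholds)
print(f"Conv.val. = {conv_val:.2f}, N = {n_unmet}, Worst = {worst}")
```

In this picture, the optimization is converged once the largest ratio drops below one, which is equivalent to N = 0.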

In the cell below, we define a `CartesianDOF`, but with more stringent conditions than the default:

In [None]:
dof = CartesianDOF(ff, gpos_rms=1e-6, dpos_rms=1e-4)

### 3.3. The optimizer

Two main optimizers are currently defined in *Yaff* and contained in the `yaff.sampling.opt` module:

1. `CGOptimizer`: A conjugate gradient optimizer. During each iteration, this optimizer will look at the gradient of the energy in phase space. This gradient corresponds to the direction along which the energy increases the most from the given configuration. The conjugate gradient optimizer will then move along the negative of this direction, towards lower energy.
2. `QNOptimizer`: A quasi-Newton optimizer. During each iteration, the local gradient and an approximate Hessian are calculated, and the configuration is moved towards lower energy.

For both optimizers, a DOF object should be provided as an argument, and optional hooks can be defined (see below).
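To give a feeling for what a conjugate gradient scheme does, the sketch below minimizes a two-dimensional quadratic model energy $E(x) = \frac{1}{2} x^T A x - b^T x$. This is the textbook linear conjugate gradient method, written from scratch for this notebook; it is not *Yaff*'s `CGOptimizer`, which has to deal with non-quadratic energies. The matrix `hessian`, the vector `b`, and all function names are made up for the example. Note that only the first search direction is the negative gradient: later directions mix in the previous direction through the factor `beta`.

```python
def matvec(a, x):
    """Matrix-vector product for nested lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def dot(x, y):
    return sum(x_i * y_i for x_i, y_i in zip(x, y))


def conjugate_gradient(a, b, x0, tol=1e-12, max_iter=50):
    """Minimize E(x) = 0.5 x.A.x - b.x for symmetric positive-definite A."""
    x = list(x0)
    # The residual r = b - A x equals minus the gradient of E
    r = [b_i - ax_i for b_i, ax_i in zip(b, matvec(a, x))]
    p = list(r)  # first search direction: steepest descent
    for iteration in range(max_iter):
        rr = dot(r, r)
        if rr ** 0.5 < tol:
            return x, iteration
        ap = matvec(a, p)
        alpha = rr / dot(p, ap)  # exact minimum along direction p
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * ap_i for r_i, ap_i in zip(r, ap)]
        beta = dot(r, r) / rr    # Fletcher-Reeves mixing factor
        p = [r_i + beta * p_i for r_i, p_i in zip(r, p)]
    return x, max_iter


hessian = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_min, n_steps = conjugate_gradient(hessian, b, [0.0, 0.0])
print(f"minimum at {x_min} after {n_steps} iterations")
```

For a quadratic energy in $n$ dimensions, this scheme reaches the minimum in at most $n$ iterations in exact arithmetic, which is why it typically outperforms plain steepest descent.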

### 3.4. The hooks

A **hook** in *Yaff* is called after every predefined number of iterations during an optimization or MD simulation. Examples of hooks, for instance, are output methods (such as the HDF5 file we created above), or the thermostats and barostats that will be defined in Chapter 5.

A list of hooks can be provided to the optimizer. The most-used hooks are:

1. `OptScreenLog`: This is a hook that prints the basic information about the optimizer. In the introductory example we gave before, this `OptScreenLog` hook was called implicitly. If no hooks are provided to the optimizer instance, this hook is automatically added and printed every iteration.
2. `HDF5Writer`: This is a hook that writes the basic information about the optimization to a HDF5-file, such as in the example above. This is typically not done every step, as these files can become quite large (certainly for molecular dynamics simulations). This hook takes a `h5py` instance, such as illustrated above.
3. `XYZWriter`: This is a hook that writes the molecular configuration to a `.xyz`-file. Again, this is typically not done every step, as these files can become quite large (certainly for molecular simulations). This hook takes a string as input, defining the name of the file to which the data is written.

For every hook, you can define the first iteration it starts to act (using the `start` keyword, 0 by default) and the frequency with which it acts (using the `step` keyword, 1 by default). 

With this in mind, we can define a quasi-Newton optimizer, using our DOF object created above, and with the three hooks above defined explicitly:

In [None]:
# First define the OptScreenLog
screenlog = OptScreenLog(step=10)
# Second, define the HDF5 writer
with h5py.File('output_2.h5', mode = 'w') as f_h5:
    h5 = HDF5Writer(f_h5, step=10)
    # Define the XYZ writer
    xyz = XYZWriter("output_2.xyz", step=10)

    # Define the optimizer
    opt = QNOptimizer(dof, hooks=[screenlog, h5, xyz])
    
    # Everything that is left, is to run the optimization:
    opt.run(1000)

To visualize the optimization, we will use *VMD*. *VMD* is a visualization program for molecular simulations, and can be downloaded [here](http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=VMD). Although you will need to make an account, the program itself is free.

Once downloaded, you can visualize the trajectory by going to the folder containing the output `output_2.xyz`, and typing in the terminal `vmd output_2.xyz`. This opens several screens, including a display of the system and the main interface. In the main interface, you can scroll through the iterations with the scrollbar, or automatically progress through the optimization using the bottom right button.

<br/>

## Chapter 4: *Velocity Verlet* as the molecular dynamics engine

In contrast to an optimization, an MD simulation much more closely resembles experiment. During an MD simulation, we follow the trajectory of the system, subject to macroscopic constraints on for instance the energy, temperature or pressure. If we then want to extract macroscopic properties from this system, for instance the average separation between two water molecules, or the average number of hydrogen bonds, we have to average these properties over the whole trajectory (assuming the system is equilibrated at the start). One main question arises, however: How is such a trajectory defined?

At every given point on the trajectory, we have the configuration of the system, as well as the forces acting on each atom, which follow from the force field definition. The main idea is to use this information to obtain the next configuration. Although several methods exist, we will only discuss the *Velocity Verlet* method, as it is one of the most efficient MD integrators.

During each *Velocity Verlet* iteration, let's say at a time $t$, the positions $r(t)$ and velocities $v(t)$ of each atom are updated to the next iteration, $t+\Delta t$ ($\Delta t$ is the timestep), using the following update algorithm:

$$ v\left(t+ \frac{\Delta t}{2} \right) = v(t) + \frac{f(t)}{2m} \Delta t $$
$$ r ( t + \Delta t ) = r(t) + v\left(t+ \frac{\Delta t}{2} \right)  \Delta t $$
$$v\left(t+ \Delta t \right) = v\left(t+ \frac{\Delta t}{2} \right)  + \frac{f\left(t+ \Delta t \right)}{2m} \Delta t $$

The **Verlet timestep** $\Delta t$ is an important parameter. It can be interpreted as the time in between two subsequent trajectory points. One may be tempted to increase this parameter in order to obtain longer total simulation times. However, the Nyquist theorem limits the maximum allowed timestep: the timestep should be sufficiently small to adequately sample the fastest motion described by the force field. In practice, when O-H bonds are present, this limits the timestep for classical simulations to about 0.5 femtosecond.
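The update equations and the role of the timestep can be tried out on the simplest possible system. The sketch below is a stand-alone toy implementation for a one-dimensional harmonic oscillator in reduced units (mass, force constant, and timestep are arbitrary illustrative values); it is not the *Yaff* `VerletIntegrator`. It runs once with a small timestep and once with a timestep beyond the stability limit of this oscillator.

```python
def force(x, k=1.0):
    """Harmonic restoring force f = -k x."""
    return -k * x


def velocity_verlet(x, v, dt, n_steps, m=1.0, k=1.0):
    """Propagate (x, v) and record the total energy after every step."""
    f = force(x, k)
    energies = []
    for _ in range(n_steps):
        v += 0.5 * f / m * dt  # first half-step velocity update
        x += v * dt            # full-step position update
        f = force(x, k)        # force at the new position
        v += 0.5 * f / m * dt  # second half-step velocity update
        energies.append(0.5 * m * v * v + 0.5 * k * x * x)
    return x, v, energies


# Small timestep: the total energy only shows small bounded fluctuations
_, _, energies = velocity_verlet(x=1.0, v=0.0, dt=0.05, n_steps=2000)
drift = max(energies) - min(energies)
print(f"energy spread with dt=0.05: {drift:.2e}")

# Too large timestep for this oscillator: the energy blows up
_, _, energies_unstable = velocity_verlet(x=1.0, v=0.0, dt=2.1, n_steps=200)
print(f"final energy with dt=2.1: {energies_unstable[-1]:.2e}")
```

The second run illustrates why the fastest vibration in the system dictates the largest timestep that can be used.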

In the example below, we initialize a short MD simulation of the water system defined earlier, using the *Velocity Verlet* update scheme. Once again, the `hooks` keyword can be added to write to output files or to print directly to the screen. Note that the screen logger for the `VerletIntegrator` is called `VerletScreenLog`.

In [None]:
# Define the screen logging
vsl = VerletScreenLog(step=10)
# Define the Velocity Verlet integrator
verlet = VerletIntegrator(ff, 0.5*femtosecond, hooks=vsl)
# Run the Velocity Verlet integrator
verlet.run(1000)

The screen logger provides the following information:
1. Iteration number
2. Conserved error: During a simulation, an ensemble-specific conserved energy can be defined. As the name suggests, this conserved energy should be constant in the long term, only showing possible short-term fluctuations. A nonconstant conserved energy is often the first indicator that something went wrong when defining the system or force field. Under the header `Cons. Err.`, *Yaff* outputs the standard deviation of this conserved energy, normalized by the standard deviation of the kinetic energy. If the conserved energy is indeed conserved, this ratio should be small (and vanish for a conserved energy that is conserved even at short time scales).
3. Temperature: The instantaneous temperature, defined from the velocities of the atoms.
4. The root-mean-square displacement of the atoms.
5. The root-mean-square gradient of the energy.
6. The total walltime that has passed.

If no hooks are provided, this screenlogger is added automatically with `step=1`.
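The `Cons. Err.` ratio described above is easy to reproduce by hand. The sketch below uses a few made-up energy samples (arbitrary units) and the population standard deviation from the *Python* standard library. How *Yaff* accumulates these statistics internally is not shown here, so treat this as an illustration of the definition only.

```python
from statistics import pstdev

# Hypothetical energies (arbitrary units) sampled along a short trajectory
kinetic = [1.0, 2.0, 3.0, 2.0]
conserved = [10.0, 10.1, 10.0, 9.9]

# Fluctuations of the conserved quantity relative to those of the kinetic energy
cons_err = pstdev(conserved) / pstdev(kinetic)
print(f"Cons. Err. = {cons_err:.3f}")
```

A ratio well below one means that the conserved quantity fluctuates much less than the kinetic energy, as it should.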

As outlined in *Understanding Molecular Simulation: From Algorithms to Applications*, this update scheme yields trajectories in the microcanonical or $NVE$ ensemble. This raises the question: how can we sample other ensembles, for instance defined by fixing the temperature or the pressure? Let's take a look at the next chapter.

<br/>

## Chapter 5: Controlling temperature and pressure: Thermostats and barostats

As explained above, the standard implementation of the *Velocity Verlet* update scheme yields trajectories in the microcanonical ensemble. To obtain trajectories in the iso-enthalpic, isothermal or isobaric-isothermal ensemble, additional steps need to be undertaken. In *Yaff*, this is done using specified hooks. We distinguish between hooks to control the temperature (so-called **thermostats**) and hooks to control the pressure (so-called **barostats**).

### 5.1. Thermostats to control the temperature

In the canonical or $NVT$ ensemble, we no longer restrict the energy (such as in the $NVE$ ensemble), but rather control the temperature $T$. Note that we say **controlling** the temperature and not **fixing** it: during an $NVT$ simulation, the instantaneous temperature (defined at every iteration) will fluctuate around a fixed average (the macroscopic temperature). However, the distribution of the instantaneous temperature during an $NVT$ simulation is fixed, and its exact form can be derived by requiring all velocities to satisfy the Maxwell-Boltzmann distribution at that temperature $T$.
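The instantaneous temperature follows from the kinetic energy through $T = 2 E_\text{kin} / (N_\text{dof} k_B)$. The sketch below implements this relation in atomic units. It assumes $N_\text{dof} = 3N$ unless stated otherwise (a simulation code may subtract constrained degrees of freedom), hard-codes the Boltzmann constant instead of importing it from *MolMod*, and uses a hypothetical helper name.

```python
BOLTZMANN = 3.1668115634556e-6  # Hartree per kelvin, hard-coded for illustration


def instantaneous_temperature(masses, velocities, ndof=None):
    """Temperature from masses and velocity vectors, all in atomic units."""
    ekin = 0.5 * sum(
        m * sum(component ** 2 for component in v)
        for m, v in zip(masses, velocities)
    )
    if ndof is None:
        ndof = 3 * len(masses)  # assumption: no constraints
    return 2.0 * ekin / (ndof * BOLTZMANN)


# One particle whose velocity components each carry k_B*T/2 at T = 300 K
mass = 1837.0  # roughly the mass of a hydrogen atom in atomic units
component = (BOLTZMANN * 300.0 / mass) ** 0.5
temp = instantaneous_temperature([mass], [[component, component, component]])
print(f"instantaneous temperature: {temp:.2f} K")
```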

To obtain a constant-temperature simulation, we have to alter the velocities before and/or after the *Velocity Verlet* update. In *Yaff*, several thermostat hooks are available in `yaff.sampling.nvt`:

* `AndersenThermostat`: The Andersen thermostat is a stochastic thermostat. Each time this hook is called, a subset of atoms is selected, and their velocities are redrawn from the Maxwell-Boltzmann distribution at the target temperature.
* `BerendsenThermostat`: The Berendsen thermostat is a deterministic thermostat. During each iteration, the difference between the instantaneous temperature and the macroscopic temperature is exponentially damped by rescaling the velocities. Note that this thermostat *does not* yield the correct ensemble, but is rather used to equilibrate the system around a given temperature.
* `LangevinThermostat`: The Langevin thermostat is a stochastic thermostat. During each iteration, the system is subject to a drag and random force mimicking the effect of the system being immersed in a physical bath of small particles.
* `CSVRThermostat`: The CSVR (Canonical Sampling through Velocity Rescaling) thermostat is a stochastic thermostat based on the Langevin thermostat. It alters the Langevin drag and random forces in order to minimize the disturbance for the system.
* `NHCThermostat`: The NHC (Nosé-Hoover chain) thermostat is a deterministic thermostat. Here, the system is coupled to an external chain of heat baths which can supply energy to the system or take energy from the system, such that the correct temperature distribution is obtained. By controlling the optional `chainlength` keyword, the number of beads in the chain can be set (should be larger than 1; default=3).

All these thermostats take the temperature as a compulsory keyword. Furthermore, all thermostats (except for the Andersen thermostat) have an optional keyword, `timecon`, which is the time constant. This time constant defines the aggressiveness of the thermostat: the higher the time constant, the more time you allow the system to recuperate, and hence the less invasive the thermostat is. Conversely, the lower the time constant, the more aggressive the thermostat. This time constant has a system-dependent lower limit (it should not be too aggressive). While there is no theoretical upper limit, longer time constants also result in the need for longer simulation times. The default value of 100 femtosecond is often a good choice.
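The effect of the time constant is easiest to see for the Berendsen scheme, where the velocities are scaled by $\lambda = \sqrt{1 + \frac{\Delta t}{\tau}\left(\frac{T_0}{T} - 1\right)}$ every step. The sketch below applies this textbook formula to the temperature alone, ignoring the actual dynamics, so it is an idealized picture and not the *Yaff* `BerendsenThermostat`. All function names and numbers are chosen for illustration; times are in femtoseconds since only the ratio $\Delta t / \tau$ matters.

```python
def berendsen_scale(temp_inst, temp_target, dt, timecon):
    """Velocity scaling factor of the textbook Berendsen scheme."""
    return (1.0 + dt / timecon * (temp_target / temp_inst - 1.0)) ** 0.5


def relax(temp_start, temp_target, dt, timecon, n_steps):
    """Follow the temperature when only the rescaling acts on the system."""
    temp = temp_start
    for _ in range(n_steps):
        scale = berendsen_scale(temp, temp_target, dt, timecon)
        temp *= scale ** 2  # kinetic energy scales with the velocities squared
    return temp


# Same 200 fs of simulation time, two different time constants
gentle = relax(500.0, 300.0, dt=0.5, timecon=1000.0, n_steps=400)
aggressive = relax(500.0, 300.0, dt=0.5, timecon=10.0, n_steps=400)
print(f"timecon = 1000 fs: T = {gentle:.1f} K")
print(f"timecon =   10 fs: T = {aggressive:.1f} K")
```

With the short time constant the temperature is forced to the target almost immediately, whereas the long time constant leaves the system far more time to relax.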

Below, we define a Nosé-Hoover chain thermostat, and run a water simulation in the $NVT$ ensemble at 300 Kelvin. Note the first line in the output, indicating we successfully coupled a thermostat.

In [None]:
nhc = NHCThermostat(temp=300*kelvin, timecon=100*femtosecond, chainlength=3)

In [None]:
# Define the screen logging
vsl = VerletScreenLog(step=10)
# Define the Velocity Verlet integrator, with thermostat hook
verlet = VerletIntegrator(ff, 0.5*femtosecond, hooks = [vsl, nhc])
# Run the Velocity Verlet integrator
verlet.run(1000)

When comparing the above output to the output obtained during the $NVE$ run, you should observe that the instantaneous temperature (third column) is now controlled to fluctuate around 300 K.

### 5.2. Barostats to control the pressure

Similar to thermostats, barostats are hooks that modify the *Velocity Verlet* update scheme to control the pressure $P$, yielding for instance the $NPH$ ensemble ($H$ being the enthalpy). Once again, the pressure is controlled rather than fixed, and several barostats are available in `yaff.sampling.npt`. They all alter the cell parameters of the system, as well as the atomic positions and momenta.

1. `BerendsenBarostat`: Similar to the thermostat, the Berendsen barostat exponentially damps the difference between the instantaneous pressure and the macroscopic pressure.
2. `LangevinBarostat`: Similar to the thermostat, the Langevin barostat subjects the system to a combination of random and drag forces to alter the unit cell via a piston.
3. `MTKBarostat`: The Martyna-Tobias-Klein barostat, sometimes called the Martyna-Tobias-Tuckerman-Klein (MTTK) barostat, is a deterministic barostat. The basic idea is somewhat similar to the NHC thermostat.

For all these barostats, it is necessary to define the mean temperature and the mean pressure (note that the mean temperature of the barostat can be different from that of the system, but it is suggested to keep it the same), as well as provide the force field instance. Moreover, the optional `timecon` keyword can be provided (note that barostat time constants are usually one order of magnitude larger than thermostat time constants). Depending on the barostat, there are also two keywords which correspond to degrees of freedom:

1. The `anisotropic` keyword defines whether anisotropic cell fluctuations are allowed. If it is False, only the cell volume can fluctuate. By default, this parameter is True. Note that, for fluids, it doesn't make sense to allow anisotropic cell fluctuations, and this keyword should be False.
2. The `vol_constraint` keyword defines whether the cell volume is constrained. By default, this parameter is False, so that the volume is allowed to fluctuate. If set to True, the $(N, V, {\bf \sigma}_a = {\bf 0}, T)$ ensemble is sampled, which is useful to construct pressure-versus-volume equations of state. These equations of state can be used to predict the mechanical and thermal stability of materials at operating conditions.

Below, we define an MTK barostat, and run a water simulation in the $NPH$ ensemble at 1 bar. As this is a simulation of a fluid, the `anisotropic` keyword is set to False.

In [None]:
mtk = MTKBarostat(ff, temp=300*kelvin, press=1*bar, timecon=1000*femtosecond, anisotropic=False)

In [None]:
# Define the screen logging
vsl = VerletScreenLog(step=10)
# Define the Velocity Verlet integrator, with barostat hook
verlet = VerletIntegrator(ff, 0.5*femtosecond, hooks=[vsl, mtk])
# Run the Velocity Verlet integrator
verlet.run(1000)

### 5.3. Putting it together: The $NPT$ ensemble

If we want to control both the temperature and the pressure, we need to initialize both a thermostat and a barostat. However, to ensure that both hooks are called at the right time, they need to be combined in a single hook. This can be done using the `TBCombination` hook, provided in the `yaff.sampling.npt` module. Simply pass both the thermostat and barostat to this instance:

In [None]:
tbc = TBCombination(nhc, mtk)

Finally, let's do a short $NPT$ simulation, and write the output to an HDF5 file we will use later on.

In [None]:
# Define the screen logging
vsl = VerletScreenLog(step=10)
# Define the HDF5 writer
with h5py.File('output_NPT.h5', mode = 'w') as f_h5:
    h5 = HDF5Writer(f_h5, step=10)
    # Define the Velocity Verlet integrator, with the combined thermostat-barostat hook
    verlet = VerletIntegrator(ff, 0.5*femtosecond, hooks=[vsl, h5, tbc])
    # Run the Velocity Verlet integrator
    verlet.run(1000)

As you can probably observe above, $NPT$ simulations suffer from large fluctuations in the conserved energy in the first few steps. However, when extending the simulation time, these fluctuations will damp out.

<br/>

## Chapter 6: Trajectory analysis

So, we've done it. We have sampled the phase space using MD, and written the trajectory to an HDF5 file. Now, how do we cash in on the results?

### 6.1 Basic analysis

A few basic analysis routines are provided in `yaff.analysis.basic` to quickly check the sanity of an MD simulation.

First off, `plot_energies` makes a plot of the kinetic and the total energy as a function of the time:

In [None]:
with h5py.File('output_NPT.h5', mode = 'r') as f:
    plot_energies(f)

The resulting plot is saved in the default filename `energies.png`. Likewise, we can plot the temperature and pressure as a function of time using the `plot_temperature` and `plot_pressure` functions.

`plot_temp_dist` plots the distribution of the instantaneous temperature and compares it with the expected analytical result for a constant-temperature ensemble. For example:

In [None]:
with h5py.File('output_NPT.h5', mode = 'r') as f:
    plot_temp_dist(f)

The result, shown in `temp_dist.png`, should indicate that the instantaneous temperature is indeed spread around a mean value of about 300 K.
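
To get a feeling for how wide this distribution should be, the sketch below uses the standard canonical-ensemble result that the instantaneous temperature of a system with `ndof` degrees of freedom has mean $T_0$ and standard deviation $T_0\sqrt{2/\text{ndof}}$. It is a minimal illustration, not the code used by `plot_temp_dist`. The function name and the value of `ndof` are hypothetical, and the samples are drawn from a chi-squared distribution rather than taken from a simulation.

```python
import numpy as np

def expected_temperature_spread(temp0, ndof):
    """Mean and standard deviation of the instantaneous temperature
    in the canonical ensemble with ndof degrees of freedom."""
    return temp0, temp0 * np.sqrt(2.0 / ndof)

# Hypothetical example: 300 K and 200 degrees of freedom
temp0, ndof = 300.0, 200
mean, std = expected_temperature_spread(temp0, ndof)

# Synthetic check: 2*E_kin/(k_B*T0) follows a chi-squared distribution with
# ndof degrees of freedom, so T_inst = T0 * chi2 / ndof
rng = np.random.default_rng(0)
samples = temp0 * rng.chisquare(ndof, size=100000) / ndof

print('analytical:', mean, std)
print('sampled:   ', samples.mean(), samples.std())
```

The smaller the system, the broader the distribution, so large temperature fluctuations in a small simulation cell do not by themselves indicate a faulty thermostat.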

### 6.2. More advanced analysis

More advanced analysis methods, such as radial distribution functions, are also provided in *Yaff*. An overview is given [here](http://molmod.github.io/yaff/ug_analysis.html). In addition, you can access the raw trajectory data directly from the HDF5 file, very similar to slicing lists in *Python*:

In [None]:
# The following line is only necessary to show the plots in the notebook and not as a pop-up screen
%matplotlib inline

# Open the HDF5 file
with h5py.File('output_NPT.h5', mode = 'r') as f:
    # Read in the temperature and the time stamps, skipping the first 50 samples
    time = f['trajectory/time'][50:]
    temp = f['trajectory/temp'][50:]

# Import matplotlib
import matplotlib.pyplot as plt

# Simply plot the temperature as function of the time
plt.plot(time/femtosecond, temp/kelvin, 'k-')
plt.show()
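
Once the raw data are available as *NumPy* arrays, further processing is plain array manipulation. As a minimal sketch, the helper below (a hypothetical name, not part of *Yaff*) smooths a noisy time series with a running average. The temperature array is a synthetic stand-in, not simulation output.

```python
import numpy as np

def running_average(values, window):
    """Average over a sliding window of `window` consecutive samples."""
    kernel = np.ones(window) / window
    # 'valid' keeps only the positions where the window fits completely
    return np.convolve(values, kernel, mode='valid')

# Synthetic stand-in for the instantaneous temperature (in K)
rng = np.random.default_rng(1)
temp = 300.0 + 20.0 * rng.standard_normal(100)

smooth = running_average(temp, 10)
print(len(temp), len(smooth))
```

Note that the smoothed series is shorter than the input by `window - 1` samples, so the time axis has to be trimmed accordingly before plotting.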

### 6.3. But most importantly: visualize your simulation

While the basic and advanced analysis techniques mentioned above will help you to verify whether your simulation outcome makes sense, most incorrect simulations are revealed simply by visualizing the simulation, for instance in *VMD*.

As an example, we will perform an $NPT$ simulation at 300 K and 100 MPa of MIL-53(Al), a metal-organic framework (MOF) for which an accurate force field was derived before. As we are performing 10000 iterations, the simulation may take a few tens of minutes. Pay attention to the conserved quantity: we noted before that this quantity may be prone to fluctuations at the beginning of a simulation. As a result, the `Cons. Err.` will be quite large in the beginning, but will gradually decrease with increasing simulation time.

Note that the code below throws a set of warnings, indicating that certain parameters are absent. You can ignore these warnings in this specific case.

In [None]:
# Load in the modules
import h5py
from yaff import *
import numpy as np

# Generate the system object:
sys0 = System.from_file("MIL53_init.chk")
# Enlarge the cell along the y-axis
sys = sys0.supercell(1, 2, 1)
# Load the forcefield
ff = ForceField.generate(sys, "MIL53_pars.txt")

# Define the thermostat
nhc = NHCThermostat(temp=300*kelvin, timecon=100*femtosecond, chainlength=3)
# Define the barostat
mtk = MTKBarostat(ff, temp=300*kelvin, press=1e8*pascal, timecon=1000*femtosecond)
# Define the TB combination
tbc = TBCombination(nhc, mtk)

# Initialize the output: the screen log, the HDF5 file and the XYZ file
vsl = VerletScreenLog(step=100)
xyz = XYZWriter('output_MIL53.xyz')
with h5py.File('output_MIL53.h5', mode = 'w') as f:
    h5 = HDF5Writer(f, step=10)
    # Define the Velocity Verlet integrator, with the hooks defined before
    verlet = VerletIntegrator(ff, 0.5*femtosecond, hooks = [vsl, h5, xyz, tbc])
    # Run the Velocity Verlet integrator
    verlet.run(10000)

Before viewing the `.xyz` file in *VMD*, first think about what you expect to happen to a material under a pressure of 100 MPa. Now, visualize the output and watch the movie (you might want to rotate the view so that you're looking along the inorganic chain).

As you can see, the original, open structure of MIL-53(Al) contracts towards a closed structure at these high pressures. MIL-53(Al) is one of the so-called flexible or stimulus-responsive MOFs, which are characterized by large-amplitude structural deformations in response to external stimuli. This behavior opens up a variety of possible applications in efficient gas storage, sensing, and shock absorption, among others.
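
The contraction can also be quantified from the cell vectors: the volume of a periodic cell is the absolute value of the determinant of its 3x3 cell matrix. The sketch below assumes the cell matrices are available as an array of shape `(nsamples, 3, 3)`, for instance read from a `trajectory/cell` dataset if your output file contains one. The function name and the example cells are hypothetical.

```python
import numpy as np

def cell_volumes(cells):
    """Volumes of a series of 3x3 cell matrices (one cell vector per row)."""
    # np.linalg.det handles a stack of matrices in a single call
    return np.abs(np.linalg.det(cells))

# Hypothetical orthorhombic cells in arbitrary length units:
# an open structure followed by a contracted one
cells = np.array([
    np.diag([10.0, 20.0, 5.0]),
    np.diag([10.0, 12.0, 5.0]),
])

volumes = cell_volumes(cells)
# Volume relative to the first snapshot
print(volumes / volumes[0])
```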

<br/>

## Chapter 7: Restarting files

In some cases, you might want to use the last snapshot of a simulation to start a new one. In *Yaff*, this is possible using the `RestartWriter` hook and the `restart_h5` option in the `VerletIntegrator`.

First, let's start a simple $NPT$ simulation of MIL-53(Al) at 300 K and 1 MPa. Note that we instantiate a `RestartWriter` object, and provide it as a hook to the `VerletIntegrator`. This simulation may again take a few minutes to complete.

In [None]:
from molmod.units import *
from yaff import *
import numpy as np
import h5py
import os

if os.path.exists('traj_all.h5'):
    os.remove('traj_all.h5')
if os.path.exists('restart_all.h5'):
    os.remove('restart_all.h5')

sys0 = System.from_file('MIL53_init.chk')
system = sys0.supercell(1,2,1)
ff = ForceField.generate(system, 'MIL53_pars.txt', rcut=15*angstrom, alpha_scale=3.2, gcut_scale=1.5, smooth_ei=True)

temp = 300*kelvin
press = 1e6*pascal
timestep = 0.5*femtosecond

thermo = NHCThermostat(temp)
baro_thermo = NHCThermostat(temp)
baro = MTKBarostat(ff, temp, press)
tbc = TBCombination(thermo, baro)

vsl = VerletScreenLog(step=100)
with h5py.File('traj_all.h5', mode = 'w') as ftraj, h5py.File('restart_all.h5', mode = 'w') as frestart:
    hdf = HDF5Writer(ftraj, step=100)
    hdf_restart = RestartWriter(frestart, step=1000)
    md = VerletIntegrator(ff, timestep, hooks=[hdf, tbc, vsl, hdf_restart])
    md.run(6000)

Normally, we would use the file `restart_all.h5` to start a new simulation. However, as we want to compare the original file with the restarted file, we will first separate the first half of the simulation above, and store it in `restart_1.h5`. To do so, just execute the code below.

In [None]:
from molmod.units import *
from yaff import *
import numpy as np
import h5py
import os

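# Open the full restart file read-only and create the new restart file for writing
with h5py.File('restart_all.h5', mode = 'r') as f, h5py.File('restart_1.h5', mode = 'w') as g:
    # Select a single restart snapshot (index 2) as a slice of length one
    loc = 2
    start = loc
    end = loc+1
    step = 1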

    tgrp_f0 = f['restart']
    tgrp_f1 = f['system']
    tgrp_f2 = f['trajectory']

    tgrp_g0 = g.create_group('restart')
    tgrp_g1 = g.create_group('system')
    tgrp_g2 = g.create_group('trajectory')

    for key in tgrp_f0.keys():
        shape = tgrp_f0[key].shape
        data = tgrp_f0[key]
        tgrp_g0.create_dataset(key, shape, data=data)

    for key in tgrp_f1.keys():
        shape = tgrp_f1[key].shape
        data = tgrp_f1[key]
        tgrp_g1.create_dataset(key, shape, data = data)

    for key in tgrp_f2.attrs.keys():
        tgrp_g2.attrs[key] = tgrp_f2.attrs[key]

    for key in tgrp_f2.keys():
        data = tgrp_f2[key][start:end:step]
        shape = data.shape
        tgrp_g2.create_dataset(key, shape, data = data)
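
The cell above selects one snapshot with the slice `start:end:step` (with `end = start + 1`) rather than with a plain index. The sketch below, using a synthetic *NumPy* array in place of an HDF5 dataset, illustrates the difference: a slice of length one keeps the leading snapshot axis, whereas an integer index drops it. That keeping this axis matters for the restart file is an assumption based on the layout of the original trajectory datasets.

```python
import numpy as np

# Synthetic stand-in for a trajectory dataset: 7 snapshots of 4 atoms in 3D
data = np.arange(7 * 4 * 3, dtype=float).reshape(7, 4, 3)

loc = 2
as_slice = data[loc:loc + 1:1]  # slice of length one: snapshot axis is kept
as_index = data[loc]            # integer index: snapshot axis is dropped

print(as_slice.shape, as_index.shape)
```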

Now, we will use this `restart_1.h5` file to start a new simulation. For this, we first create a `System` object based on the restart file, and generate the force field.

In [None]:
# Generate system from h5 file, then create force field object
with h5py.File('restart_1.h5', mode = 'r') as f_res:
    system = System.from_hdf5(f_res)
    ff = ForceField.generate(system, 'MIL53_pars.txt', rcut=15*angstrom, alpha_scale=3.2, gcut_scale=1.5, smooth_ei=True)

Second, we define the usual output hooks to generate an output HDF5 file and define the screen logger. Finally, we start a new simulation by calling the `VerletIntegrator`, using the `restart_h5` keyword to let *Yaff* know we want to continue our previous simulation. By doing so, *Yaff* will **automatically add** the necessary simulation details, such as the timestep, thermostats, and barostats. You can override this by manually adding these hooks or keywords, but in that case, the new run will not be an exact continuation of the previous simulation. You can check in the output below that *Yaff* automatically detects that we want to simulate in the $NPT$ ensemble. It also starts counting from the last step in the previous file.

In [None]:
# Create output hooks
with h5py.File('traj_2.h5', mode = 'w') as f, h5py.File('restart_1.h5', mode = 'r') as f_res:
    hdf = HDF5Writer(f, step=100)
    vsl = VerletScreenLog(step=100)
    md = VerletIntegrator(ff, hooks=[hdf, vsl], restart_h5 = f_res)
    md.run(3000)

To validate the accuracy of the restart option, we will plot a few properties from the original simulation, and overlay the results of the restarted simulation. Here, we choose to plot the temperature, pressure, conserved energy and volume as a function of time.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def get_time(f, start, end, step):
    if 'trajectory/time' in f:
        label = 'Time [%s]' % log.time.notation
        time = f['trajectory/time'][start:end:step]/log.time.conversion
    else:
        label = 'Step'
        time = np.array(range(len(f['trajectory/epot'][:])), float)[start:end:step]
    return time, label


with h5py.File('traj_all.h5', mode = 'r') as f1, h5py.File('traj_2.h5', mode = 'r') as f2:
    start, end, step = get_slice(f1)

    time1, tlabel = get_time(f1, start, end, step)
    time2, tlabel = get_time(f2, start, end, step)
    time1 = np.array(time1)/1000
    time2 = np.array(time2)/1000

    temp1 = np.array(f1['trajectory/temp'][start:end:step])/kelvin
    temp2 = np.array(f2['trajectory/temp'][start:end:step])/kelvin

    press1 = np.array(f1['trajectory/press'][start:end:step])/(1e6*pascal)
    press2 = np.array(f2['trajectory/press'][start:end:step])/(1e6*pascal)

    econs1 = np.array(f1['trajectory/econs'][start:end:step])/kjmol
    econs2 = np.array(f2['trajectory/econs'][start:end:step])/kjmol

    # Divide by 2 to obtain the volume per unit cell of the 1x2x1 supercell
    vol1 = np.array(f1['trajectory/volume'][start:end:step])/(2*angstrom**3)
    vol2 = np.array(f2['trajectory/volume'][start:end:step])/(2*angstrom**3)
   
plt.clf()
plt.subplots_adjust(wspace = 0.8, hspace = 0.5)
plt.rc('xtick', labelsize=12) 
plt.rc('ytick', labelsize=12)

plt.subplot(221)
plt.plot(time1, temp1, color = '0.5', linewidth=2.5, label = 'original')
plt.plot(time2, temp2, color = 'k', linewidth=1, label='restart')
plt.xlim(time1[0], time1[-1])
plt.ylabel('Temperature [K]', fontsize=14)

plt.subplot(222)
plt.plot(time1, press1, color = '0.5', linewidth=2.5, label = 'original')
plt.plot(time2, press2, color = 'k', linewidth=1, label='restart')
plt.xlim(time1[0], time1[-1])
plt.ylabel('Pressure [MPa]', fontsize=14)
legend = plt.legend(loc=0, fontsize=14, frameon=False, bbox_to_anchor=(1.0, 0.83))

plt.subplot(223)
plt.plot(time1, econs1, color = '0.5', linewidth=2.5, label = 'original')
plt.plot(time2, econs2, color = 'k', linewidth=1, label='restart')
plt.xlim(time1[0], time1[-1])
plt.xlabel('Time [ps]', fontsize=14)
plt.ylabel('Econs [kJ/mol]', fontsize=14)

plt.subplot(224)
plt.plot(time1, vol1, color = '0.5', linewidth=2.5, label = 'original')
plt.plot(time2, vol2, color = 'k', linewidth=1, label='restart')
plt.xlim(time1[0], time1[-1])
plt.xlabel('Time [ps]', fontsize=14)
plt.ylabel('Volume [A**3]', fontsize=14)
legend = plt.legend(loc=0, fontsize=14, frameon=False, bbox_to_anchor=(1.0, 0.83))

plt.show()

If everything went right, the black restart curve should coincide with the gray original curve for the second half of the plots.
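
Besides the visual check, the agreement can be expressed as a number. The sketch below compares two time series on the time stamps they have in common and returns the largest absolute difference. The helper name is hypothetical, and the arrays are synthetic stand-ins for, e.g., `time1`, `temp1`, `time2` and `temp2` from the cell above.

```python
import numpy as np

def max_deviation(time_a, values_a, time_b, values_b, decimals=6):
    """Largest absolute difference on the time stamps shared by two series."""
    # Round the time stamps so that tiny floating-point differences do not matter
    common, idx_a, idx_b = np.intersect1d(
        np.round(time_a, decimals), np.round(time_b, decimals),
        return_indices=True)
    if common.size == 0:
        raise ValueError('The two series share no time stamps')
    return float(np.max(np.abs(values_a[idx_a] - values_b[idx_b])))

# Synthetic stand-ins: a full run and a restart covering its second half
time_all = 0.05 * np.arange(61)
values_all = np.sin(time_all)
time_restart = time_all[30:]
values_restart = values_all[30:].copy()

print(max_deviation(time_all, values_all, time_restart, values_restart))
```

For a real restart, small deviations that grow over time are not necessarily a sign of an error, because MD trajectories are sensitive to round-off differences.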

<br/>

Congratulations, you completed the notebook on basic MD in *Yaff*!