# Massively parallel Transition Path Sampling

## Notebook 4: Rerun TPS simulation with changed parameters/Recover crashed simulations

This is the fourth in a series of example notebooks on massively parallel transition path sampling. Here you will learn how to rerun a TPS simulation from the folder structure and files on disk (possibly changing the reaction coordinate model architecture and/or the `descriptor_transform`). Note that the same setup and logic can also be used to recover and continue a simulation that has incomplete Monte Carlo steps, either because the machine it was running on crashed or because you terminated the simulation at runtime. In both cases we will use the `reinitialize_from_workdir` method, which takes care of adding all existing Monte Carlo steps to the new storage and then finishing all partially finished trials. The result is a brain object that has the same internal state as if it had run the simulation up to the current step and that can be used to continue the simulation.

**This notebook should be run on a multi-core workstation preferably with a GPU**, otherwise you will have a very long coffee break and a very hot laptop.

**Required knowledge/recommended reading:** This notebook assumes some familiarity with `asyncmd` (namely the [gromacs] engine and `TrajectoryFunctionWrapper` classes). Please see the example notebooks in `asyncmd` for an introduction.

## Imports and set working directory

In [1]:
%matplotlib inline

In [2]:
import os
import asyncio
import numpy as np
import matplotlib.pyplot as plt
import MDAnalysis as mda
# asyncmd for the engine and Trajectory class
import asyncmd
import asyncmd.gromacs as asyncgmx
from asyncmd import Trajectory
# aimmd for the TPS
import aimmd
import aimmd.distributed as aimmdd
# and pytorch for the reaction coordinate model
import torch.nn.functional as F
import torch

Could not initialize SLURM cluster handling. If you are sure SLURM (sinfo/sacct/etc) is available try calling `asyncmd.config.set_slurm_settings()` with the appropriate arguments.
Tensorflow/Keras not available


In [3]:
# setup working directory
scratch_dir = "."

workdir = os.path.join(scratch_dir, "TransitionPathSampling_ala")

## Reinitialize the TPS simulation from state on disk

To reinitialize the TPS simulation we need to create a fresh brain object from the usual ingredients, but this time we must take care to set many of them to the same values as in the simulation we are reinitializing:
 - We must set the number of Markov chain samplers to the number we used previously
 - We must create a new storage file to save our models, trainset and other simulation results
 - We must use the same metastable state definition as previously (otherwise we will break the Markov chain by changing the length of the transitions and potentially changing the state assignment of endstates)
 - We must define the underlying dynamics to be the same as in the previous simulation, i.e. we can only change engine options that do not change the propagator $p(x_{t + \Delta t} | x_{t}, \Delta t)$ (where $x_{t}$ is a phase space point on a trajectory at time $t$). That means we cannot change the forcefield, the temperature and pressure coupling, or any other option that affects the dynamics, but we should be able to change engine options such as the number of threads, e.g. to optimize efficiency on changed hardware.
 - There is no need to load the old initial transitions, except if you (like we do here) want to get the dimensionality of your `descriptor_transform` by applying it to them.
 - You can change both the `descriptor_transform` and the reaction coordinate model architecture. Note, however, that the acceptances in the Markov chains will be unchanged (i.e. calculated by the old model) up to the last previously finished step.
 - You must define the same sampling scheme, i.e. use the same number of movers and corresponding probabilities.
 - You should create a new trainset into which we will add the simulation results (shooting outcomes) including the ones found on disk.
 - You can define different `Task`s to run after a specified number of trials.

### Number of Markov chains

In [4]:
n_samplers = 5  # results in 2*n_samplers gmx engines

### Create storage file

In [5]:
storage = aimmd.Storage(os.path.join(workdir, "new_storage.h5"))

### State definition

We must use the same `alpha_R` and `C7_eq` state definitions as before!

In [6]:
# state functions
from state_funcs_mda import alpha_R, C7_eq

wrapped_alphaR = asyncmd.trajectory.PyTrajectoryFunctionWrapper(alpha_R)
wrapped_C7_eq = asyncmd.trajectory.PyTrajectoryFunctionWrapper(C7_eq)

### Underlying dynamics

Again, make sure you do not change the propagator. We will simply use the same options as in the previous notebooks.

In [7]:
# Define the engine(s) for the PathMovers
# (they will all be the same)
gro = "gmx_infiles/conf.gro"
top = "gmx_infiles/topol_amber99sbildn.top"
ndx = "gmx_infiles/index.ndx"
mdp = asyncgmx.MDP("gmx_infiles/md.mdp")

gmx_engine_kwargs = {"mdconfig": mdp,
                     "gro_file": gro,
                     "top_file": top,
                     "ndx_file": ndx,
                     "output_traj_type": "XTC",
                     #"mdrun_extra_args": "-nt 2",
                     # use this for gmx sans (thread) MPI
                     "mdrun_extra_args": "-ntomp 2",
                     }
gmx_engine_cls = asyncgmx.GmxEngine
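
One way to guard against accidentally changing the propagator is to compare the dynamics-relevant options of the old and the new run before restarting. The following is a minimal sketch on plain dictionaries. The helper `changed_propagator_options` and the selection of keys in `PROPAGATOR_KEYS` are illustrative assumptions (they are not part of `asyncmd` or `aimmd`), the list of keys is not exhaustive, and the example values are made up.

```python
# Hypothetical helper, not part of asyncmd/aimmd.
# A few GROMACS mdp options that influence the propagator (assumption: not exhaustive).
PROPAGATOR_KEYS = ("integrator", "dt", "tcoupl", "ref-t", "tau-t", "pcoupl", "ref-p")


def changed_propagator_options(old_options, new_options, keys=PROPAGATOR_KEYS):
    """Return {key: (old, new)} for every checked key whose value differs."""
    changed = {}
    for key in keys:
        old_val, new_val = old_options.get(key), new_options.get(key)
        if old_val != new_val:
            changed[key] = (old_val, new_val)
    return changed


# Example values are invented for illustration only.
old = {"integrator": "md", "dt": 0.002, "tcoupl": "v-rescale", "ref-t": [300.0]}
same = dict(old)
hotter = {**old, "ref-t": [310.0]}

print(changed_propagator_options(old, same))    # no differences found
print(changed_propagator_options(old, hotter))  # reports the changed temperature
```

An empty result only means that none of the checked keys differ; options outside `PROPAGATOR_KEYS` still need to be reviewed by hand.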

### Define reaction coordinate model and `descriptor_transform`

Here we will now use `descriptor_func_psi_phi` instead of the full internal coordinate representation. This function returns only the $\psi$ and $\phi$ dihedral angles, which are a decent but not fully informative representation, so you might see a drop in prediction quality if you continue to run the TPS simulation.

In [8]:
# import descriptor_transform for the model
# descriptor_func_ic gives us an internal coordinate representation (i.e. bond lengths, angles and dihedrals)
# descriptor_func_psi_phi gives us the ψ and φ dihedral angles (we use it to project to a 2d space in which we can look at the TPE)
from state_funcs_mda import descriptor_func_ic, descriptor_func_psi_phi

# and as usual wrap them to become awaitable
wrapped_transform = asyncmd.trajectory.PyTrajectoryFunctionWrapper(descriptor_func_ic, call_kwargs={"molecule_selection": "protein"})
wrapped_psi_phi = asyncmd.trajectory.PyTrajectoryFunctionWrapper(descriptor_func_psi_phi)

In [9]:
# model architecture definition
# we use a pyramidal ResNet as described in "Machine-guided path sampling to discover mechanisms of molecular self-organization" (Nat.Comput.Sci 2023)
# Note that now this is not pyramidal anymore as we only have 2 inputs, it is just n_lay_pyramid resunits stacked on top of each other

n_lay_pyramid = 5  # number of resunits
n_unit_top = 2  # number of units in the last layer before the log_predictor
n_unit_base = cv_ndim = 2 # descriptors_for_tp.shape[1]  # input dimension
# the factor by which we reduce the number of units per layer (the width) and the dropout fraction
fact = (n_unit_top / n_unit_base)**(1./(n_lay_pyramid))

# create a list of modules to build our pytorch reaction coordinate model from
modules = []

for i in range(1, n_lay_pyramid + 1):
    print(f"ResUnit {i} is {max(n_unit_top, int(n_unit_base * fact**(i)))} units wide.")
    modules += [aimmd.pytorch.networks.ResNet(n_units=max(n_unit_top, int(n_unit_base * fact**i)),
                                              n_blocks=1)
                ]

torch_model = aimmd.pytorch.networks.ModuleStack(n_out=1,  # using a single output we will predict only p_B and use a binomial loss
                                                           # we could have also used n_out=n_states to use a multinomial loss and predict all states,
                                                           # but this is probably only worthwhile if n_states > 2 as it would increase the number of free parameters in the NN
                                                 modules=modules,  # modules is a list of initialized torch.nn.Modules from aimmd.pytorch.networks
                                                 )

# move model to GPU if CUDA is available
if torch.cuda.is_available():
    torch_model = torch_model.to('cuda')

# choose and initialize an optimizer to train the model
optimizer = torch.optim.Adam(torch_model.parameters(), lr=1e-3)

ResUnit 1 is 2 units wide.
ResUnit 2 is 2 units wide.
ResUnit 3 is 2 units wide.
ResUnit 4 is 2 units wide.
ResUnit 5 is 2 units wide.
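
With only two inputs, the reduction factor is $(2/2)^{1/5} = 1$, which is why every ResUnit above is 2 units wide. The sketch below reuses the width formula from the loop above in a small stand-alone function (the name `pyramid_widths` and the 32-dimensional input are assumptions chosen for illustration) to show how the widths would shrink for a larger input dimension.

```python
def pyramid_widths(n_unit_base, n_unit_top, n_lay_pyramid):
    """Width of each ResUnit, using the same formula as the loop above."""
    fact = (n_unit_top / n_unit_base) ** (1.0 / n_lay_pyramid)
    return [max(n_unit_top, int(n_unit_base * fact**i))
            for i in range(1, n_lay_pyramid + 1)]


print(pyramid_widths(2, 2, 5))   # two inputs: every layer has the same width
print(pyramid_widths(32, 2, 5))  # hypothetical 32-dimensional input: widths shrink layer by layer
```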


In [10]:
# wrap the pytorch neural network model in a RCModel class
model = aimmd.pytorch.EEScalePytorchRCModelAsync(nnet=torch_model,
                                                 optimizer=optimizer,
                                                 states=[wrapped_C7_eq, wrapped_alphaR],
                                                 ee_params={'lr_0': 1e-3,  
                                                            'lr_min': 5e-5,  # lr_min = lr_0 / 20 is a good choice empirically
                                                            'epochs_per_train': 3,
                                                            'window': 100,
                                                            'batch_size': 8192,
                                                           },
                                                 descriptor_transform=wrapped_psi_phi,
                                                 cache_file=storage,
                                                 )

### Define the sampling scheme

We use the same sampling scheme as in the previous notebooks.

In [11]:
spselector = aimmdd.spselectors.RCModelSPSelectorFromTraj()

In [12]:
movers_cls = [aimmdd.pathmovers.TwoWayShootingPathMover]
movers_kwargs = [{'states': [wrapped_alphaR, wrapped_C7_eq],
                  'engine_cls': gmx_engine_cls,
                  'engine_kwargs': gmx_engine_kwargs,
                  # NOTE: we could change the walltime per part, this could e.g. optimize queueing times
                  #'walltime_per_part': 0.000015625,  # 0.05625 s per part
                  'walltime_per_part': 0.00003125,  # 0.1125 s per part
                  #'walltime_per_part': 0.0000625,  # 0.225 s per part
                  #'walltime_per_part': 0.000125,  # 0.45 s per part
                  #'walltime_per_part': 0.001,  # 3.6 s per part
                  #'walltime_per_part': 0.004,  # 14.4 s per part
                  'T': mdp["ref-t"][0],
                  "sp_selector": spselector,  # use the spselector we have defined above
                  "max_steps": 500 * 10**5,  # 5e7 steps * dt (2 fs) = 100 ns
                  }
                 ]
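
The comments in the cell above suggest that `walltime_per_part` is given in hours and `max_steps` in integration steps. Under that assumption, and assuming a 2 fs time step, the conversions can be checked with a few lines of arithmetic. The function names below are made up for this sketch.

```python
SECONDS_PER_HOUR = 3600


def walltime_in_seconds(walltime_per_part_hours):
    """Convert a walltime given in hours to seconds."""
    return walltime_per_part_hours * SECONDS_PER_HOUR


def simulated_time_ns(max_steps, dt_fs=2.0):
    """Simulated time in ns for max_steps integration steps of dt_fs femtoseconds."""
    return max_steps * dt_fs * 1e-6  # 1 fs = 1e-6 ns


for hours in (0.000015625, 0.00003125, 0.001):
    print(f"{hours} h = {walltime_in_seconds(hours):.5f} s per part")

print(f"{simulated_time_ns(500 * 10**5):.1f} ns for 500 * 10**5 steps")
```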

### Trainset

We initialize an empty trainset, but we could also use one that already contains shooting results. Just be careful with the results from the simulation we are reinitializing; otherwise they will be in there twice (at least if we use `reinitialize_from_workdir` with `run_tasks=True`).

In [13]:
trainset = aimmd.TrainSet(n_states=2)

### Brain tasks

We will use the same Tasks as in the notebooks before, but you could e.g. change the run intervals or saving intervals of certain tasks.

In [14]:
tasks = [
    aimmdd.pathsampling.TrainingTask(model=model, trainset=trainset),
    aimmdd.pathsampling.SaveTask(storage=storage, model=model, trainset=trainset),
    aimmdd.pathsampling.DensityCollectionTask(model=model,
                                              first_collection=100,
                                              recreate_interval=250,
                                              ),
         ]

In [15]:
# and initialize the brain as before
brain = aimmdd.Brain.samplers_from_moverlist(model=model, workdir=workdir, storage=storage,
                                             n_sampler=n_samplers,
                                             movers_cls=movers_cls, movers_kwargs=movers_kwargs,
                                             samplers_use_same_stepcollection=False,
                                             tasks=tasks)

## Reinitialize the brain from workdir

The coroutine `reinitialize_from_workdir` adds all previously finished trials to the new storage and brain. If we call it with `run_tasks=True` (the default), it will also run all attached tasks for those trials in the order they finished. This adds the trials to the trainset and trains the model as if it had steered the simulation itself (except that it did not select the shooting points; it will still predict for them before observing the result and be trained according to its prediction quality). After adding all finished trials, the coroutine will check for any unfinished trials and finish them. After that you can continue the simulation with the brain object and potentially a new reaction coordinate model.

In [16]:
await brain.reinitialize_from_workdir(run_tasks=True)
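
The idea of replaying finished trials "in the order they finished" across several samplers can be pictured with a toy example. This is only a conceptual sketch and not how `aimmd` implements `reinitialize_from_workdir`; the `(finish_time, sampler_index, step_number)` tuples and the function `replay_order` are made up for illustration.

```python
import heapq


def replay_order(steps_per_sampler):
    """Merge per-sampler step lists (each sorted by finish time) into one global order."""
    return list(heapq.merge(*steps_per_sampler))


# (finish_time, sampler_index, step_number); all values are invented
sampler_0 = [(1.0, 0, 1), (4.0, 0, 2), (5.5, 0, 3)]
sampler_1 = [(2.5, 1, 1), (3.0, 1, 2)]

for finish_time, sampler, step in replay_order([sampler_0, sampler_1]):
    # in the real simulation the attached tasks (training, saving, ...) would run here
    print(f"t={finish_time}: sampler {sampler}, step {step}")
```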

After adding all finished steps we have a total of 10018 steps. Note that any unfinished steps will only be finished when calling `Brain.run_for_n_steps()` or `Brain.run_for_n_accepts()`.


In [17]:
brain.total_steps

10018

## Continue the TPS simulation

We can now continue the TPS simulation as usual.

In [18]:
import time

In [19]:
n_steps = 100
start = time.time()

await brain.run_for_n_steps(n_steps)

end = time.time()
print(f"Running for {n_steps} cumulative MCSteps took {end-start} s (= {(end-start)/60} min).")

Running for 100 cumulative MCSteps took 107.79982328414917 s (= 1.796663721402486 min).


In [20]:
brain.total_steps

10118
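
The cells above rely on the top-level `await` that Jupyter provides. In a plain Python script the same calls would need to run inside an event loop, for example via `asyncio.run`. The sketch below shows only this pattern: `StandInBrain` is a hypothetical placeholder that mimics the two coroutine names used in this notebook, does no sampling, and uses invented step counts.

```python
import asyncio


class StandInBrain:
    """Hypothetical placeholder for the real brain object; it only counts steps."""

    def __init__(self):
        self.total_steps = 0

    async def reinitialize_from_workdir(self, run_tasks=True):
        await asyncio.sleep(0)   # stands in for reading finished steps from disk
        self.total_steps = 10    # invented number of recovered steps

    async def run_for_n_steps(self, n_steps):
        await asyncio.sleep(0)   # stands in for running the MD engines
        self.total_steps += n_steps


async def main(brain, n_steps):
    """Recover the state first, then continue sampling."""
    await brain.reinitialize_from_workdir(run_tasks=True)
    await brain.run_for_n_steps(n_steps)
    return brain.total_steps


total = asyncio.run(main(StandInBrain(), n_steps=5))
print(total)
```

In an actual script, `StandInBrain()` would be replaced by the brain object constructed as in the cells above.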

## Save the last model, trainset and brain to storage
As usual, save the last model, trainset and brain. Then close the storage.

In [21]:
storage.rcmodels["model_to_continue_with"] = model
storage.save_trainset(trainset)
storage.save_brain(brain)

In [22]:
storage.close()