<a href="https://colab.research.google.com/github/lorenzopallante/BiomeccanicaMultiscala/blob/main/LAB/06-Gromacs/06-Gromacs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Laboratorio 6
**Introduction to GROMACS**


Authors:
    
- Prof. Marco A. Deriu (marco.deriu@polito.it)
- Lorenzo Pallante (lorenzo.pallante@polito.it)
- Eric A. Zizzi (eric.zizzi@polito.it)
- Marcello Miceli (marcello.miceli@polito.it)
- Marco Cannariato (marco.cannariato@polito.it)

# Table of Contents

3. Gromacs Setup 
4. Molecular Dynamics (MD) Flow chart
5. GROMACS 1 (make_ndx, editconf, pdb2gmx)

**Learning outcomes:** 
- how to install GROMACS and launch it
- General flow chart of a MD simulation
- GROMACS: create an index file, generate a simulation box, create gmx starting files

# Setup

If you don't have GROMACS and NGLView installed on your machine, or if you are using Google COLAB, run the following cells.

In [None]:
#@title Installing GROMACS
!apt install gromacs &> /dev/null

In [None]:
#@title Installing NGLview
!pip install nglview  &> /dev/null
from google.colab import output
output.enable_custom_widget_manager()

Clone the files from GitHub if you are using COLAB

In [None]:
# IF YOU ARE USING COLAB EXECUTE THIS CELL (to copy over data repository)
!git clone https://github.com/lorenzopallante/BiomeccanicaMultiscala.git
!mv BiomeccanicaMultiscala/LAB/06-Gromacs/* .

# Introduction

**GROMACS is a major free, open-source, and fast code developed for Molecular Dynamics (MD) simulations**. Its continuous updates (1 major release/year), speed, efficiency and flexibility, along with the inbuilt availability of force fields specific for proteins, make it one of the most popular choices for biomolecular simulations.
(http://www.gromacs.org/).

Complete **GROMACS tutorial** available at: http://www.mdtutorials.com/gmx/

To see what gromacs version is installed on your PC:

In [None]:
!gmx -version

All gromacs commands can be used with the syntax:

`
    gmx <command>
`

And all the commands have a help message that can be displayed using the option “-h”, e.g.:

`
    gmx pdb2gmx -h    
`

For basic and advanced GROMACS tutorials, please visit: http://www.mdtutorials.com/gmx/


## Files and suggested directory tree
You will need the following files to run a simulation:
>1. Atom coordinates: *.pdb (or *.gro)
>2. Topology file: *.top
>3. MD parameters: *.mdp

**1. PDB**
- coordinates of atoms in the x,y,z space
- see also the lesson *03-Intro_LinuxBash* for further info

![title](imgs/pdb.png)

**2. TOP**
- topology of the system, i.e. bonds, angles, dihedrals, non-bonded interactions and the related parameters
- generated by GROMACS once you select the forcefield (or made by yourself when you're a pro!)


<img src="imgs/top2.png" width="1200" align="center">

**3. MDP**
- simulation parameters, such as integration time step, simulation duration, cut-offs, temperature and pressure coupling, etc.

![mdp](imgs/mdp.png)

*We suggest you create a new folder for each simulation you run, put all the necessary files into the simulation folder itself, and let GROMACS write into that same folder.* 

<div class="alert alert-block alert-warning"> 

Keep in mind that GROMACS is a **command-line program**: 
    
if you don’t specify the full paths to the required input files, but only the file name (example call: `$ gmx make_ndx -f struct.tpr -o index.ndx`), GROMACS will expect the input file (struct.tpr) to be in the folder you’re calling it from, and will write the output file (index.ndx) into the same current folder. If struct.tpr is not in the current folder, GROMACS will throw an error and fail! GROMACS error messages are pretty explicit, so chances are that READING THE ERROR is enough to troubleshoot most issues (e.g. if a required file is missing).

</div>
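
Since relative file names are resolved against the current folder, it can help to check for the inputs before calling GROMACS. The following is a minimal sketch of such a check; the helper `missing_inputs` and the file name in the example are hypothetical and only serve as an illustration.

```python
from pathlib import Path


def missing_inputs(filenames, folder="."):
    """Return the file names that do not exist inside `folder`."""
    base = Path(folder)
    return [name for name in filenames if not (base / name).is_file()]


# Hypothetical input list, for illustration only.
required = ["struct.tpr"]
missing = missing_inputs(required)
if missing:
    print("Missing in", Path.cwd(), ":", ", ".join(missing))
else:
    print("All input files found in", Path.cwd())
```

If the list is not empty, either move the files into the simulation folder or pass their full paths to the GROMACS command.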

## Molecule: Cell-Penetrating Peptide

The discovery of Cell-Penetrating Peptides (CPPs) represents an important breakthrough for the delivery of large cargo molecules or nanoparticles in several clinical applications. A main feature of CPPs is the ability to
penetrate the cell membrane at low micromolar concentrations in vivo and in vitro, without binding any chiral receptors and without causing significant membrane damage. This ability offers significant therapeutic potential,
such as targeting areas that are normally difficult for drugs to access.

TAT peptide and the Drosophila Antennapedia homeodomain-derived penetratin peptide (pAntp) are the most
extensively studied CPPs. In particular, pAntp is a 16-residue cationic peptide derived from the third
helix of the homeodomain of the Drosophila transcription factor Antennapedia. This amphipathic CPP, largely
unstructured in solution, is positively charged at neutral pH (since it contains four lysine and three arginine
residues). Interestingly, pAntp is able to cross biological membranes and enter a hydrophobic environment
upon interaction with negatively charged molecules, like phosphatidic acid (PA) or phosphatidylserine (PS).
However, the mechanisms by which pAntp enters cells are not completely understood. Proposed
mechanisms of pAntp cellular uptake hypothesize direct crossing of the membrane at low
peptide concentrations (1 μM) and an endocytotic pathway at high concentrations (10 μM). Several studies have
suggested that pAntp amphiphilicity may not be enough to drive membrane penetration, indicating instead
tryptophan as a key player. Replacement of tryptophan by phenylalanine resulted in a loss of penetration activity
when interacting with membranes and bicelles. Moreover, a recent computational work has characterized the
binding mode of pAntp to DPPC bilayers, proposing arginine, lysine and tryptophan as the residues driving the penetration
mechanism.

This extraordinary ability of CPPs to penetrate cell membranes has led to their designation as ideal
functionalization molecules for drug delivery systems.
For example, pAntp might be employed to decorate Magnetic Nanoparticles (MNPs), thus combining their
fascinating physico-chemical properties with a cell-penetrating ability to design novel effective therapeutic
strategies as well as innovative biotechnology methodologies. Size, biocompatibility and excellent magnetic
properties have made MAG and Silica-coated MAG the object of a remarkable amount of research in the last
decade, and numerous biomedical applications have been reported. Recently, MNPs combined with magnetic
fields were used to enable cell positioning under non-permissive conditions, local gene therapy and/or
optimization of MNP-assisted lentiviral gene transfer.

Functionalization strategies comprise grafting with organic molecules, including small organic biomolecules such
as CPPs, and/or coating with an inorganic layer (e.g., silica).
The design of functionalization strategies and the prediction of their properties may be addressed by computational molecular
modelling. In this context, Kubiak-Ossowska and coworkers have recently employed computational modelling
to investigate the adsorption of TAT peptides onto three silica surface models. This work has suggested that the
TAT-silica adsorption mechanism is driven by electrostatic and hydrophobic interactions mainly involving arginine
residues and the nanoparticle surface.

## Molecular Dynamics (MD) Simulations

Briefly, the general workflow of an MD simulation with GROMACS is
divided into the following steps (see also “MD_FlowChart.pdf”):

1. System Preparation
> - Retrieve starting structure (e.g. RCSB)
> - GROMACS structure conversion (from pdb to gro*)
>- Topology (and position restraints) generation
>- System box generation and addition of water and ions

2. Energy Minimization
3. Simulation
> - (Equilibration: Molecular Dynamics Simulation normally with position restraints)
> - Production: Molecular Dynamics Simulation

4. Analysis of the Simulation

\*Note that .gro and .pdb files both contain atomic coordinates and
differ only in minor aspects:
> - .gro coordinates are in nm, whereas .pdb coordinates are in Å
> - a .gro file can also contain the atoms’ velocities

GROMACS can work with both types of files and does not strictly need the .gro file format for most operations.
It can, however, be handy to have a .gro file, for example to quickly extract the box vectors.

<img src="imgs/MD_FlowChart.png" width="500" align="center">

# STEP 1 – PDB conversion and system preparation

The protein of interest for the present tutorial is called penetratin.pdb (already present in the data/ folder)

Using visualization software (for example VMD, PyMol, UCSF Chimera, Schrödinger Maestro, MOE, ...) you can view and rotate the structure. Here we will use **NGL View** for simplicity.

In [None]:
import nglview as nv
from IPython.display import IFrame

with open("data/penetratin.pdb") as f:
    view = nv.show_file(f, ext="pdb")
view

Let’s now have a look at the actual pdb file, which is nothing more than a text file!

In [None]:
!cat data/penetratin.pdb

You will see that the pdb is essentially a text file organized into fixed-width columns (at first glance it looks space-separated).
Briefly:
- The first column defines the row type (e.g. REMARK tells you that this row is a comment, ATOM tells you that this row contains actual atomic coordinates, etc.).
- In a typical ATOM row, fields 7, 8 and 9 contain the x, y, z coordinates of the atom (in Å). Strictly speaking, they occupy character columns 31-38, 39-46 and 47-54 of the line.

For more information: https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html.
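
Because the PDB format is defined by character positions, reading a record by slicing is more robust than splitting on whitespace (neighbouring numbers can touch when they fill their whole field). Below is a minimal sketch of this idea; the function name `parse_atom_line` and the example record are made up for illustration and are not taken from penetratin.pdb.

```python
def parse_atom_line(line):
    """Extract a few fields from one ATOM/HETATM record using fixed-width slices."""
    return {
        "record": line[0:6].strip(),    # columns 1-6
        "serial": int(line[6:11]),      # columns 7-11
        "name": line[12:16].strip(),    # columns 13-16
        "resname": line[17:20].strip(), # columns 18-20
        "chain": line[21].strip(),      # column 22
        "resseq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),        # columns 31-38, in Angstrom
        "y": float(line[38:46]),        # columns 39-46
        "z": float(line[46:54]),        # columns 47-54
    }


# Made-up record, assembled field by field so that the widths are explicit.
example = "{:<6s}{:>5d} {:<4s}{:1s}{:>3s} {:1s}{:>4d}{:1s}   {:8.3f}{:8.3f}{:8.3f}".format(
    "ATOM", 1, " N", "", "ARG", "A", 1, "", 11.104, 6.134, -6.504
)

atom = parse_atom_line(example)
print(example)
print(atom["name"], atom["resname"], atom["x"], atom["y"], atom["z"])
```

For real work, a dedicated parser (for example the one behind NGL View or any structural biology library) is a safer choice than hand-written slicing.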

## Make Index File (gmx make_ndx)


We now introduce a very powerful tool in GROMACS: index files. Index files are created starting from an atomic system (e.g. a pdb file), and split atoms into specific groups. These can be useful to restrict operations and analyses to specific parts of the system only. For example, if you want to see the fluctuations of the alpha-Carbons only, you can select that specific group in the analysis tool.

To have detailed information about the command to make index files, just type:

In [None]:
!gmx make_ndx -h

Let’s first create an index file (it will not be created by default when you launch the simulation!):

In [None]:
!echo -e "q \n" | gmx make_ndx -f data/penetratin.pdb -o index.ndx

You will see that make_ndx will read the pdb file (option “-f”), create some default groups and then wait for your input. 

You can add groups by typing a selection of atoms (we will see the selection syntax later on...), or simply accept the default groups and save them by typing q and hitting Enter. The final index file will be saved with the name chosen in the option “-o” (here, index.ndx).
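
An index file is itself plain text: each group starts with its name in square brackets, followed by the (1-based) atom numbers that belong to it. As a rough sketch, it can be read into a Python dictionary as follows; `parse_ndx` and the miniature file content are made up for illustration.

```python
def parse_ndx(text):
    """Map each group name to its list of 1-based atom indices."""
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new group header, e.g. "[ C-alpha ]"
            current = line[1:-1].strip()
            groups[current] = []
        elif current is not None:
            groups[current].extend(int(token) for token in line.split())
    return groups


# Made-up miniature index file, for illustration only.
sample = """\
[ System ]
   1    2    3    4    5    6
[ C-alpha ]
   2    5
"""

groups = parse_ndx(sample)
print({name: len(atoms) for name, atoms in groups.items()})
```

On the real file you would pass `open("index.ndx").read()` instead of `sample`.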

## Edit molecular configuration (gmx editconf)

As mentioned, the previous groups obtained with gmx make_ndx can be used to tell GROMACS to do operations on a specific subset of atoms only. Let’s say for example I want a pdb file containing only the alpha Carbons of my starting penetratin.pdb structure. I can achieve this by calling editconf (which, as the name suggests, is an editor for molecular configurations) and also including the index.ndx file as input:

In [None]:
!echo "C-alpha" | gmx editconf -f data/penetratin.pdb -o calpha.pdb -n index.ndx

In simpler words, this command is saying:

***“Call the editconf utility, read the penetratin.pdb and the index.ndx files, and write the result to calpha.pdb”***

Have a look at the structure containing only the C-alpha atoms:

In [None]:
with open("calpha.pdb") as f:
    view = nv.show_file(f, ext="pdb")

view.add_representation("hyperball")    
view

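When called with an index file, editconf prompts for a group. In the cell above, the group name C-alpha is passed to the prompt through `echo`; if you run the command interactively in a terminal, choose C-alpha by typing 3 and hitting Enter. Editconf will confirm the selection and write the atoms belonging to the C-alpha group to the file calpha.pdb (check it by opening the file with a text editor!). Easy, right?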
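
The same "answer the prompt for me" trick can be done from Python instead of `echo`. The sketch below uses the standard `subprocess` module; the wrapper `run_gmx` is a hypothetical helper, and the command is only executed if a `gmx` executable is found on the machine.

```python
import shutil
import subprocess


def run_gmx(args, stdin_text=""):
    """Run `gmx <args>` and send `stdin_text` to its interactive prompt."""
    return subprocess.run(
        ["gmx", *args],
        input=stdin_text,      # what you would otherwise type at the prompt
        text=True,
        capture_output=True,
        check=False,           # inspect the return code instead of raising
    )


# Same call as in the cell above, written as an argument list.
editconf_args = ["editconf", "-f", "data/penetratin.pdb",
                 "-o", "calpha.pdb", "-n", "index.ndx"]

if shutil.which("gmx"):  # only run when GROMACS is actually installed
    result = run_gmx(editconf_args, stdin_text="C-alpha\n")
    print("exit code:", result.returncode)
else:
    print("gmx not found, skipping the call")
```

GROMACS writes most of its messages to standard error, so `result.stderr` is usually the place to look when the exit code is not zero.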

## Prepare file for Gromacs (gmx pdb2gmx)

We have to generate a **topology** for the system and convert it to the **.gro file format**. 

The tool for conversion is **pdb2gmx**. As always, you can use the -h flag to print some help:

In [None]:
!gmx pdb2gmx -h

The main options of the **pdb2gmx** tool are:
> - f: input file (coordinates, so .pdb or .gro)
> - p: topology output file (a text file, ending in .top)
> - o: structure output file (coordinates after processing, again .pdb or .gro)
> - i: position restraints output file (.itp)

In [None]:
!gmx pdb2gmx -f data/penetratin.pdb -i penetratin_posre.itp -p penetratin.top -o penetratin.gro -ff amber99sb-ildn  -ignh -heavyh -water tip3p

Let's have a look at all the options of the program: 

- **ff** : forcefield -> The force field will contain the information that will be written to the topology. This is a very important choice! You should always read thoroughly about each force field and decide which is most applicable to your situation.

- **water**: water model chosen among TIP3P, TIP4P, TIP4P-Ew, TIP5P, SPC, SPC/E

- **ignh**: it stands for “ignore hydrogens”, so it will ignore all the hydrogens in the input coordinate file

- **heavyh**: makes hydrogen atoms heavier, which lowers the frequency of the fastest oscillations in the system (those involving hydrogens)

Have a look at the generated files:

In [None]:
!ls -l

Have a look at the *.gro file with a text editor to see the differences with the pdb file!
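
To make the comparison concrete, here is a minimal sketch that reads one atom line of a .gro file and converts the coordinates from nm to Å. It assumes the default fixed-width layout of the .gro format (5-character residue number, residue name, atom name and atom number, then 8-character coordinates); the function names and the example line are made up for illustration.

```python
def parse_gro_atom(line):
    """Read one atom line of a .gro file (default fixed-width layout)."""
    atom = {
        "resnum": int(line[0:5]),
        "resname": line[5:10].strip(),
        "atomname": line[10:15].strip(),
        "atomnum": int(line[15:20]),
        # positions in nm, three fields of 8 characters each
        "xyz_nm": tuple(float(line[20 + 8 * i:28 + 8 * i]) for i in range(3)),
        "vel": None,
    }
    if len(line.rstrip()) >= 68:
        # optional velocities in nm/ps, three more fields of 8 characters
        atom["vel"] = tuple(float(line[44 + 8 * i:52 + 8 * i]) for i in range(3))
    return atom


def nm_to_angstrom(xyz_nm):
    """Convert a coordinate triplet from nm (.gro) to Angstrom (.pdb)."""
    return tuple(10.0 * value for value in xyz_nm)


# Made-up atom line, built with the same field widths used by the format.
example = "%5d%-5s%5s%5d%8.3f%8.3f%8.3f" % (1, "ARG", "N", 1, 1.110, 0.613, -0.650)

atom = parse_gro_atom(example)
print(atom["resname"], atom["atomname"], atom["xyz_nm"], nm_to_angstrom(atom["xyz_nm"]))
```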

## Define the simulation box

We have now defined the parameters and the water model to be used in the simulation. The next step in the preparation of the system is to actually put the system in a simulation box with the appropriate shape and dimensions.

There are several possible shapes of the unit cell, such as the cube or the rhombic dodecahedron. The dodecahedron is often chosen to simulate globular systems, since its volume is ~71% of that of a cubic box with the same distance between periodic images.

In real-life problems the volume of the box should be kept as small as is safe, because this lowers the computational cost of the simulation. Indeed, as we will see in the following steps, the box will be filled with water molecules and ions, so the number of atoms in the system grows with the box volume, i.e. with the cube of the box size.
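
The ~71% figure can be checked with a few lines of arithmetic. The sketch below uses the standard volume formulas for a cube, a rhombic dodecahedron and a truncated octahedron with the same distance `d` between periodic images; the value of `d` and the water number density (about 33.4 molecules per nm³ for liquid water at ambient conditions) are assumptions made for this illustration, and the solute volume is ignored.

```python
import math

# Volume of the unit cell divided by d**3, where d is the image distance.
VOLUME_FACTOR = {
    "cubic": 1.0,
    "dodecahedron": math.sqrt(2) / 2,    # rhombic dodecahedron, ~0.707
    "octahedron": 4 * math.sqrt(3) / 9,  # truncated octahedron, ~0.770
}

WATER_PER_NM3 = 33.4  # approximate number density of liquid water


def box_volume(image_distance, box_type="cubic"):
    """Volume (nm^3) of a unit cell with the given periodic image distance (nm)."""
    return VOLUME_FACTOR[box_type] * image_distance ** 3


d = 5.0  # nm, hypothetical image distance
for box_type in VOLUME_FACTOR:
    volume = box_volume(d, box_type)
    print(f"{box_type:13s} V = {volume:7.2f} nm^3, "
          f"roughly {volume * WATER_PER_NM3:6.0f} water molecules at most")
```

Doubling `d` multiplies every volume, and hence the amount of solvent, by eight.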

<div class="alert alert-block alert-info"><b>Which terms of the potential energy definition are linked to the steep growth of the computational cost with the number of atoms in the system?</b></div>

In this tutorial, we will put our protein in the center of a cubic box big enough to have a minimum distance between the protein and the box wall of 0.8 nm.

<center> <div class="alert alert-block alert-info"><b>Is this distance enough to respect the minimum image convention?</b></div> </center>

The GROMACS command that allows us to do this operation is:
```bash
$ gmx editconf -f penetratin.gro -o box.gro -c -d 0.8 -bt cubic 
```
The above command takes the atoms in the file `penetratin.gro`, centers them in the box (`-c`), and builds around them a cubic box (`-bt cubic`) such that the minimum distance between the atoms of the system and the box wall is 0.8 nm (`-d 0.8`).
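
The value passed to `-d` matters because of periodic boundary conditions: every atom interacts with the nearest periodic image of the others. As a reminder of what the minimum image convention means in practice, here is a small sketch for a cubic box; the function name, the box edge and the coordinates are made up for illustration.

```python
def minimum_image_distance(r1, r2, box_edge):
    """Distance between r1 and the nearest periodic image of r2 (cubic box)."""
    total = 0.0
    for a, b in zip(r1, r2):
        delta = a - b
        # shift by a whole number of box edges to get the closest image
        delta -= box_edge * round(delta / box_edge)
        total += delta * delta
    return total ** 0.5


box_edge = 3.0  # nm, hypothetical cubic box
p = (0.1, 1.0, 1.0)
q = (2.9, 1.0, 1.0)

# The two points are 2.8 nm apart inside the box, but only about 0.2 nm
# apart across the periodic boundary.
print(minimum_image_distance(p, q, box_edge))
```

With a solute-to-wall distance `d`, the solute and its own periodic image are separated by at least `2 * d`; this is the number to compare with the interaction cut-off.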

The `editconf` command can manipulate molecular systems in several different ways and has many options. Some of them are:

|Option|Type|Description|
|-----|---------|-----------|
|-f|.gro, .pdb| Input structure file |
|-n|.ndx| Input index file when you want to consider only a subset of the atoms |
|-o|.gro, .pdb| Output structure file |
|-bt| \<enum\> | Box type for -box and -d: triclinic, cubic, dodecahedron, octahedron |
|-box| \<vector\> | Box vector lengths. E.g. -box 10 5 20 |
|-angles| \<vector\> | Angles between box vectors. E.g. -angles 90 30 60 |
|-d| \<real\> | Distance between the solute and the box |
|-c| None | Center molecule in box (implied by -box and -d)|
|-align|\<vector\>| Align to target vector. E.g. -align 1 0 0. NB: you have to tell GROMACS what to align! |
|-translate|\<vector\>| Translate the system by the provided vector |
|-rotate|\<vector\>| Rotation around the X, Y and Z axes in degrees |
|-princ| None |  Orient molecule(s) along their principal axes. NB: you have to tell GROMACS what to orient and use to define the principal axis! |

In [None]:
!gmx editconf -f penetratin.gro -o box.gro -c -d 0.8 -bt cubic

If we now check the last line of the new file we can find the box dimensions:

In [None]:
!tail -n 1 box.gro
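
The last line of a .gro file holds the box vectors in nm: three numbers for a rectangular box, nine for a general triclinic one (the three diagonal elements come first, followed by the off-diagonal ones). With the GROMACS convention that the box matrix is triangular, the volume is simply the product of the first three numbers. The sketch below shows this; the function name and the example lines are made up for illustration.

```python
def box_volume_from_gro_line(line):
    """Box volume (nm^3) from the last line of a .gro file."""
    values = [float(token) for token in line.split()]
    if len(values) not in (3, 9):
        raise ValueError("expected 3 or 9 box values, got %d" % len(values))
    # The first three values are v1(x), v2(y), v3(z), i.e. the diagonal.
    return values[0] * values[1] * values[2]


# Made-up box lines, for illustration only.
cubic_line = "   4.50000   4.50000   4.50000"
triclinic_line = "   5.00000   5.00000   3.53553   0.00000   0.00000   0.00000   0.00000   2.50000   2.50000"

print(box_volume_from_gro_line(cubic_line))
print(box_volume_from_gro_line(triclinic_line))
```

On the real file, the line printed by `tail -n 1 box.gro` can be passed directly to the function.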

We can also check our new system using VMD. Once you have opened `box.gro` in VMD, show the box using the terminal or the Tk console:

```bash
vmd > pbc box
```

## Solvation of the system

Now that the box has been set, we have to introduce the **environment**. We will model the environment as a simple **aqueous system**, but it is also possible to use **different solvents**, as long as good parameters are available for the species involved. Here, we use an explicit model of water.

Earlier, in the `pdb2gmx` step, we chose the TIP3P water model to define the topology of the water molecule. Let's look for a moment at how this water model is defined in the built-in GROMACS topologies:

In [None]:
%%bash
dd=$(gmx --version | grep "Data prefix:" | cut -d ":" -f 2)
cat $dd/share/gromacs/top/amber99sb-ildn.ff/tip3p.itp | head -n 10

- the molecule name, under the **[ moleculetype ]** section, is **SOL**. Keep that in mind for a moment;
- the water molecule is made of three atoms, one oxygen and two hydrogens. You can find the definition of these atoms under the **[ atoms ]** section. Notice that the two hydrogens are of the same _atom type_, which means that they are modelled with the same parameters. However, they have two different _atom names_, since they are two distinct atoms of the same "residue";
- the water "residue" is also named **SOL**.

In this topology you can also find the _bond_ and _angle_ parameters:

In [None]:
%%bash
dd=$(gmx --version | grep "Data prefix:" | cut -d ":" -f 2)
cat $dd/share/gromacs/top/amber99sb-ildn.ff/tip3p.itp | tail -n 12 | head -n -1

Here we can read the parameters to model the bonded interaction between the atoms of the **TIP3P water molecule**.
Notice that the two _bond_ parameters are identical since they represent the same bond between hydrogen and oxygen.


After this initial consideration, we can solvate the system, i.e. fill the box with water. The GROMACS command that allows us to do this operation is:
```bash
$ gmx solvate -cp box.gro -o solvated.gro -p topol.top [-cs spc216.gro] 
```
The above command takes the system contained in the file `box.gro`, fills the unit cell with water molecules while avoiding clashes between the inserted molecules and the atoms already present in the system, writes the output to `solvated.gro`, then counts how many water molecules have been placed and updates the provided topology file (`-p topol.top`). The argument `-cs spc216.gro` is shown in brackets because it is optional: if nothing is specified, GROMACS will use this simple equilibrated 3-point solvent configuration, which can be used for the SPC and TIP3P water models. The file `spc216.gro` is present in the built-in GROMACS libraries.

With the `solvate` command, you can also insert only a shell of water molecules around the protein specifying `-shell <radius>`.

In [None]:
!gmx solvate -cp box.gro -o solvated.gro -p penetratin.top

Let's look at the output file and the updated topology:

In [None]:
!tail -n 20 solvated.gro

In [None]:
!tail penetratin.top -n 5

<div class="alert alert-block alert-warning"><b>Important observation</b><br>As you have seen, GROMACS has updated the topology file to account for the modification that we made to the system. The old topology file is not lost: it is saved as a backup, in our case #penetratin.top.1#.</div>
<br>
<div class="alert alert-block alert-danger"><b>Important observation</b><br> GROMACS is a fantastic computer program, but still a computer program. This means that it does what you tell it to do. If you run the command above twice, GROMACS will take the empty box, fill it and update the penetratin.top file both times. This will result in a topology file declaring twice the water that is actually present in the solvated.gro file, which will generate a <b>Fatal Error</b> in the following steps.</div>
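
A quick way to catch this mismatch early is to compare the number of water molecules declared in the topology with the number actually present in the coordinate file. The sketch below is one possible check, not a GROMACS feature: the two helper functions are hypothetical, the sample data are made up, and the water oxygen is assumed to be named `OW` in a residue named `SOL`.

```python
def count_sol_in_topology(top_text):
    """Sum the SOL entries listed in the [ molecules ] section."""
    in_molecules = False
    total = 0
    for raw in top_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments
        if not line:
            continue
        if line.startswith("["):
            in_molecules = line.strip("[] ").lower() == "molecules"
            continue
        if in_molecules:
            name, count = line.split()[:2]
            if name == "SOL":
                total += int(count)
    return total


def count_sol_in_gro(gro_text):
    """Count water molecules as OW atoms belonging to SOL residues."""
    lines = gro_text.splitlines()
    natoms = int(lines[1])
    return sum(
        1 for line in lines[2:2 + natoms]
        if line[5:10].strip() == "SOL" and line[10:15].strip() == "OW"
    )


# Made-up data: a topology updated twice and a box with two waters.
top_sample = """\
[ molecules ]
; Compound        #mols
Protein_chain_A     1
SOL                 2
SOL                 2
"""
names = [(1, "OW"), (1, "HW1"), (1, "HW2"), (2, "OW"), (2, "HW1"), (2, "HW2")]
rows = [f"{res:5d}{'SOL':<5s}{name:>5s}{i:5d}{0:8.3f}{0:8.3f}{0:8.3f}"
        for i, (res, name) in enumerate(names, start=1)]
gro_sample = "\n".join(["made-up water box", f"{len(rows):5d}", *rows,
                        "   1.00000   1.00000   1.00000"])

declared = count_sol_in_topology(top_sample)
present = count_sol_in_gro(gro_sample)
print("declared in topology:", declared, "| present in coordinates:", present)
```

On the real files, read `penetratin.top` and `solvated.gro` with `open(...).read()` and compare the two numbers before moving on.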

### Influence of the box shape and dimension on system size

As mentioned before, a dodecahedral box reduces the volume of the system and therefore the computational cost of the simulation. Let's now look in practice at the number of water molecules that would have been added using a dodecahedral box.

Since we do not want to mess with the files produced so far, let's move everything into a folder.

In [None]:
%%bash
ls

In [None]:
%%bash
# backup everything in a folder 
mkdir -p cubic_box
mv box.gro solvated.gro penetratin.top \#penetratin.top.1# cubic_box/

# restore the original topology (only protein and not water inside) -> remove last line of the topology (SOL)
cat cubic_box/penetratin.top | head -n -1 >penetratin.top

# now, repeat everything with different box
gmx editconf -f penetratin.gro -o dodecahedron.gro -c -d 0.8 -bt dodecahedron > /dev/null 2>&1
gmx solvate -cp dodecahedron.gro -o dodecahedron_solvated.gro -p penetratin.top > /dev/null 2>&1

# Let's look at the new topology
tail penetratin.top -n 5