Zebrafish Haemoglobin Protein and Phenylhydrazine Complex
======================


## ``Gromacs_py`` simulation

Here is an example of a short simulation of the zebrafish haemoglobin (`zeb_hb`) protein in complex with the phenylhydrazine (`phz`) ligand, using an AMBER force field.

Seven successive steps are used:

1. Load the protein-ligand complex in its best-docked state. Docking was performed externally using AutoDock Vina through the `AMDock` GUI.
   
2. Create the protein topology within the complex using ``GmxSys.add_top()``.
   
3. Create the ligand topology using `prepare_top()`, which calls `acpype` to build it.
   
4. Solvation of the complex using ``GmxSys.solvate_add_ions()``.

5. Minimisation of the structure using ``GmxSys.em_2_steps()``.

6. Equilibration of the system using ``GmxSys.em_equi_three_step_iter_error()``.

7. Production run using ``GmxSys.production()``.

### Import

In [1]:
import sys
import os
import shutil

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## To use `gromacs_py` in a project

In [2]:
from gromacs_py import gmx

## Simulation setup

- Define a few variables for your simulation, such as:
  
    1. simulation output folders
    2. ionic concentration
    3. number of minimisation steps
    4. equilibration and production time

### Regarding equilibration time:
The following variables set the simulation time of each stage of the three-stage equilibration process. Given the conversion `1000 * time / dt` used below, with `dt` in ps, the times are effectively in ns. See the notes below for details:

1. `HA_time`
2. `CA_time`
3. `CA_LOW_time` 
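
The conversion from a stage time to a step count can be sketched as follows, assuming the times are in ns and the time step is in ps (the GROMACS unit for `dt`). The helper name `ns_to_steps` is hypothetical and not part of `gromacs_py`.

```python
def ns_to_steps(time_ns: float, dt_ps: float) -> int:
    """Return the number of MD steps covering time_ns at a time step of dt_ps."""
    if dt_ps <= 0:
        raise ValueError("dt_ps must be positive")
    # 1 ns = 1000 ps; round to absorb floating-point error before casting
    return int(round(1000.0 * time_ns / dt_ps))


# Values taken from the setup cell of this notebook
print(ns_to_steps(0.5, 0.001))  # HA_time with dt_HA
print(ns_to_steps(1.0, 0.002))  # CA_time with dt
print(ns_to_steps(4.0, 0.002))  # CA_LOW_time with dt
```

Rounding before the cast avoids an off-by-one step count when the floating-point division lands just below an integer.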


In [3]:
DATA_OUT = 'zeb_hb_phz_complex_sim'

# System Setup
vsite='none'
sys_top_folder = os.path.join(DATA_OUT, 'sys_top')
#ignore_hydrogen = {'ignh': None}

# Energy Minimisation
em_folder = os.path.join(DATA_OUT, 'em')
em_sys_folder = os.path.join(DATA_OUT, 'sys_em')
em_step_number = 10000
emtol = 10.0        # Stop minimisation when the maximum force < 10 kJ/mol/nm
emstep = 0.01       # Initial minimisation step size (nm)


# Equilibration
equi_folder = os.path.join(DATA_OUT, 'sys_equi')
HA_time = 0.5
CA_time = 1.0
CA_LOW_time = 4.0

dt_HA = 0.001
dt = 0.002

# Convert stage times (ns) to integer step counts (dt in ps)
HA_step = int(round(1000 * HA_time / dt_HA))
CA_step = int(round(1000 * CA_time / dt))
CA_LOW_step = int(round(1000 * CA_LOW_time / dt))

# Production
os.makedirs(DATA_OUT, exist_ok = True)
prod_folder = os.path.join(DATA_OUT, 'sys_prod')
prod_time = 100.0

prod_step = int(round(1000 * prod_time / dt))

## Create the `GmxSys` object

Load only the protein information from the docked PDB file on disk.

In [4]:
pdb_file = "zeb_hb_phz_docking/input/ZEB_HB_REfined_h.pdb"
sys_name = "zeb_hb_phz_complex"
complex_sys = gmx.GmxSys(name=sys_name, coor_file=pdb_file)


## Create topology:

Phenylhydrazine SMILES: "C1=CC=C(C=C1)NN"

**Note:** Hydrogen atoms need to be ignored, or else this won't work with this particular PDB.

Topology creation involves:
- topology creation using `pdb2gmx` via the `prepare_top()` function.
  * The docked complex is needed for this, as the original PDB has no hydrogen information.
- box creation using `editconf`.
- an AMBER force field, so that `acpype` can be used to prepare ligand topologies automatically.
  * Get the SMILES code from the ligand PDB at [Openbabel online](https://www.cheminfo.org/Chemistry/Cheminformatics/FormatConverter/index.html)

**TODO:** `acpype` is not being run to produce the ligand topology. The cause is not yet known.


In [5]:
complex_sys.prepare_top(out_folder=DATA_OUT, ff='amber99sb-ildn')
complex_sys.create_box(dist=1.0, box_type="dodecahedron", check_file_out=True)
complex_sys.solvate_add_ions(out_folder=DATA_OUT, name=sys_name,create_box_flag=False)

pdb2pqr30 --ff AMBER --ffout AMBER --keep-chain --titration-state-method=propka --with-ph=7.00 tmp_pdb2pqr.pdb 00_zeb_hb_phz_complex.pqr
gmx pdb2gmx -f 01_zeb_hb_phz_complex_good_his.pdb -o zeb_hb_phz_complex_pdb2gmx.pdb -p zeb_hb_phz_complex_pdb2gmx.top -i zeb_hb_phz_complex_posre.itp -water tip3p -ff amber99sb-ildn -ignh -vsite none
gmx editconf -f zeb_hb_phz_complex_sim/zeb_hb_phz_complex_pdb2gmx.pdb -o zeb_hb_phz_complex_sim/zeb_hb_phz_complex_pdb2gmx_box.pdb -bt dodecahedron -d 1.0
gmx grompp -f ../../../../../../usr/local/miniforge3/envs/mdanalysis/lib/python3.12/site-packages/gromacs_py/gmx/template/mini.mdp -c zeb_hb_phz_complex_water.pdb -r zeb_hb_phz_complex_water.pdb -p zeb_hb_phz_complex_water_ion.top -po out_mini.mdp -o genion_zeb_hb_phz_complex_water_ion.tpr -maxwarn 1
gmx genion -s genion_zeb_hb_phz_complex_water_ion.tpr -p zeb_hb_phz_complex_water_ion.top -o zeb_hb_phz_complex_water_ion.gro -np 17 -pname NA -nn 18 -nname CL


In [6]:
lig_name = 'phz'
lig_resname = lig_name.upper()

gmx.gmxsys.ambertools.smile_to_pdb("C1=CC=C(C=C1)NN",os.path.join(DATA_OUT,f"{lig_name}.pdb"),lig_resname)
lig_sys = gmx.GmxSys(name=lig_name, coor_file=os.path.join(DATA_OUT,f"{lig_name}.pdb"))
# acpype seems to create the topology correctly, but gromacs_py does not pick it up
try:
    lig_sys.prepare_top(out_folder=os.path.join(DATA_OUT,lig_name),ff='amber99sb-ildn', include_mol={lig_resname: 'C1=CC=C(C=C1)NN'})
except Exception:
    # pdb2gmx fails on the ligand; the acpype output is set manually in the next cell
    pass

No amino acids present in pdb file, no PD2PQR calculation



acpype -i PHZ_h_unique.pdb -b PHZ -c bcc -a gaff -o gmx -n 0
gmx pdb2gmx -f 01_phz_good_his.pdb -o phz_pdb2gmx.pdb -p phz_pdb2gmx.top -i phz_posre.itp -water tip3p -ff amber99sb-ildn -ignh -vsite none
The following command could not be executed correctly :
gmx pdb2gmx -f 01_phz_good_his.pdb -o phz_pdb2gmx.pdb -p phz_pdb2gmx.top -i phz_posre.itp -water tip3p -ff amber99sb-ildn -ignh -vsite none


In [7]:
#Since gromacs_py does not recognize the topology file, we need to set it manually
lig_sys.coor_file = os.path.join("../../",DATA_OUT,lig_name,lig_resname+".acpype",f"{lig_resname}_h_unique.pdb")
lig_sys.top_file = os.path.join("../../",DATA_OUT,lig_name,lig_resname+".acpype",f"{lig_resname}_GMX.top")

In [8]:
complex_sys.insert_mol_sys(lig_sys, 1, sys_name, DATA_OUT)

gmx trjconv -f ../PHZ.acpype/PHZ_h_unique.pdb -o ../PHZ.acpype/PHZ_h_unique_compact.pdb -s ../PHZ.acpype/PHZ_h_unique.pdb -ur compact -pbc none
gmx trjconv -f ../../zeb_hb_phz_complex_water_ion.gro -o ../../zeb_hb_phz_complex_water_ion_compact.pdb -s ../../genion_zeb_hb_phz_complex_water_ion.tpr -ur compact -pbc mol
The following command could not be executed correctly :
gmx trjconv -f ../../zeb_hb_phz_complex_water_ion.gro -o ../../zeb_hb_phz_complex_water_ion_compact.pdb -s ../../genion_zeb_hb_phz_complex_water_ion.tpr -ur compact -pbc mol


RuntimeError: Following Command Fails : /usr/local/miniforge3/envs/mdanalysis/bin/gmx trjconv -f ../../zeb_hb_phz_complex_water_ion.gro -o ../../zeb_hb_phz_complex_water_ion_compact.pdb -s ../../genion_zeb_hb_phz_complex_water_ion.tpr -ur compact -pbc mol 
 Ret code = 1 
 Note that major changes are planned in future for trjconv, to improve usability and utility.
Select group for output
Selected 0: 'System'
 
                :-) GROMACS - gmx trjconv, 2024.2-conda_forge (-:

Executable:   /usr/local/miniforge3/envs/mdanalysis/bin.AVX2_256/gmx
Data prefix:  /usr/local/miniforge3/envs/mdanalysis
Working dir:  /home/daneel/gitrepos/gromacs_sims/zeb_hb/zeb_hb_phz_complex_sim/phz/zeb_hb_phz_complex_sim
Command line:
  gmx trjconv -f ../../zeb_hb_phz_complex_water_ion.gro -o ../../zeb_hb_phz_complex_water_ion_compact.pdb -s ../../genion_zeb_hb_phz_complex_water_ion.tpr -ur compact -pbc mol

Will write pdb: Protein data bank file
Reading file ../../genion_zeb_hb_phz_complex_water_ion.tpr, VERSION 2024.2-conda_forge (single precision)
Reading file ../../genion_zeb_hb_phz_complex_water_ion.tpr, VERSION 2024.2-conda_forge (single precision)
Group     0 (         System) has 21639 elements
Group     1 (        Protein) has  2220 elements
Group     2 (      Protein-H) has  1095 elements
Group     3 (        C-alpha) has   143 elements
Group     4 (       Backbone) has   429 elements
Group     5 (      MainChain) has   573 elements
Group     6 (   MainChain+Cb) has   708 elements
Group     7 (    MainChain+H) has   711 elements
Group     8 (      SideChain) has  1509 elements
Group     9 (    SideChain-H) has   522 elements
Group    10 (    Prot-Masses) has  2220 elements
Group    11 (    non-Protein) has 19419 elements
Group    12 (          Water) has 19419 elements
Group    13 (            SOL) has 19419 elements
Group    14 (      non-Water) has  2220 elements
Select a group: Reading frames from gro file 'Protein in water', 21569 atoms.
Reading frame       0 time    0.000   
Precision of ../../zeb_hb_phz_complex_water_ion.gro is 0.001 (nm)

-------------------------------------------------------
Program:     gmx trjconv, version 2024.2-conda_forge
Source file: src/gromacs/tools/trjconv.cpp (line 1037)

Fatal error:
Index[21569] 21570 is larger than the number of atoms in the
trajectory file (21569). There is a mismatch in the contents
of your -f, -s and/or -n files.

For more information and tips for troubleshooting, please check the GROMACS
website at https://manual.gromacs.org/current/user-guide/run-time-errors.html
-------------------------------------------------------


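**Notes on the error above:** Visual inspection of `../../zeb_hb_phz_complex_water_ion.gro` shows the expected number of atoms ($21569$). The `.tpr` does not match it: its `System` group has $21639$ elements, $70$ more. That difference is exactly what `genion` removes when it replaces $17 + 18 = 35$ three-atom water molecules with single-atom ions ($35 \times 2 = 70$). This suggests that `../../genion_zeb_hb_phz_complex_water_ion.tpr` describes the system before ion insertion, while the `.gro` describes it after.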
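
The atom-count bookkeeping behind this error can be checked with a few lines of arithmetic. The numbers are read from the logs above (`-np 17`, `-nn 18`, `-water tip3p`, and the reported group and frame sizes). Both helper names are hypothetical; `gro_atom_count` relies only on the `.gro` format storing the atom count on its second line.

```python
def gro_atom_count(path: str) -> int:
    """Read the atom count stored on the second line of a .gro file."""
    with open(path) as gro:
        gro.readline()              # title line
        return int(gro.readline())  # number of atoms


def atoms_after_genion(n_atoms_before: int, n_ions: int, atoms_per_water: int = 3) -> int:
    """Each ion replaces one water molecule, removing atoms_per_water - 1 atoms."""
    return n_atoms_before - n_ions * (atoms_per_water - 1)


n_tpr = 21639      # 'System' group size reported by trjconv for the .tpr
n_gro = 21569      # atom count reported for the .gro
n_ions = 17 + 18   # -np 17 (NA) and -nn 18 (CL) from the genion command

# True when the .tpr is the pre-genion system and the .gro the post-genion one
print(atoms_after_genion(n_tpr, n_ions) == n_gro)
```

If the two counts agree once the ions are accounted for, the mismatch is consistent with the `.tpr` predating ion insertion rather than with a corrupted coordinate file.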

## Energy minimisation

Set parallelization and GPU options here. Change them later, if needed.

In [None]:
#Parallelization
nthreads = int(os.environ.get('PBS_NCPUS', '16'))

#Set Parallelization
complex_sys.nt = nthreads
#complex_sys.ntmpi = 1
complex_sys.gpu_id = '0'

complex_sys.em_2_steps(out_folder=em_folder,
        no_constr_nsteps=em_step_number,
        constr_nsteps=em_step_number,
        posres="",
        create_box_flag=False, emtol=emtol, emstep=emstep)
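
A slightly more defensive way to pick the thread count is sketched below. It reads the same `PBS_NCPUS` variable but falls back to the local CPU count instead of a fixed value. The function name `pick_nthreads` is hypothetical and not part of `gromacs_py`.

```python
import os


def pick_nthreads(env_var: str = "PBS_NCPUS", fallback: int = 1) -> int:
    """Return the scheduler-provided CPU count, else the local CPU count."""
    value = os.environ.get(env_var)
    if value is not None and value.isdigit() and int(value) > 0:
        return int(value)
    # os.cpu_count() may return None on some platforms
    return os.cpu_count() or fallback


nthreads = pick_nthreads()
print(nthreads)
```

The result can be assigned to `complex_sys.nt` in the same way as the fixed default used in the cell above.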

## Plot energy:

In [None]:
ener_pd_1 = complex_sys.sys_history[-1].get_ener(selection_list=['Potential'])
ener_pd_2 = complex_sys.get_ener(selection_list=['Potential'])

ener_pd_1['label'] = 'no bond constr'
ener_pd_2['label'] = 'bond constr'

ener_pd = pd.concat([ener_pd_1, ener_pd_2])

ener_pd['Time (ps)'] = np.arange(len(ener_pd))

In [None]:
ax = sns.lineplot(x="Time (ps)", y="Potential",
        hue="label",
        data=ener_pd)
ax.set_xlabel('step')
ax.set_ylabel('energy (kJ/mol)')
plt.grid()

## System minimisation and equilibration

Based on the `gromacs_py` docs, this is a three-stage equilibration process.

All three stages seem to be NPT, with Berendsen pressure coupling and v-rescale temperature coupling. Only the restraints differ between stages. This seems reasonable, as NPT is closer to lab conditions.

Since the statistical ensemble is NPT throughout, this differs from the Lemkul lysozyme tutorial at [MDTutorials](http://www.mdtutorials.com/gmx/lysozyme/).

**Note:** This had to be run on param; it was too slow even on the office workstation.

In [None]:
complex_sys.em_equi_three_step_iter_error(out_folder=equi_folder,
    no_constr_nsteps=em_step_number,
    constr_nsteps=em_step_number,
    nsteps_HA=HA_step,  
    nsteps_CA=CA_step,
    nsteps_CA_LOW=CA_LOW_step,
    dt=dt, dt_HA=dt_HA,
    vsite=vsite, maxwarn=1)


### Plot Equilibration

Because the ensemble is NPT throughout, unlike in the Lemkul lysozyme tutorial at [MDTutorials](http://www.mdtutorials.com/gmx/lysozyme/), volume needs to be inspected as well as pressure, temperature, and density.

In [None]:
quantities = ["Temperature", "Pressure", "Volume", "Density"]
units = ["$K$", "$bar$", "$nm^3$", "$kg/m^3$"]

pd_1 = complex_sys.sys_history[-2].get_ener(selection_list=quantities)
pd_2 = complex_sys.sys_history[-1].get_ener(selection_list=quantities)
pd_3 = complex_sys.get_ener(selection_list=quantities)

pd_1['label'] = 'HA_constr'
pd_2['label'] = 'CA_constr'
pd_2['Time (ps)'] = pd_2['Time (ps)'] + pd_1['Time (ps)'].max()
pd_3['label'] = 'CA_LOW_constr'
pd_3['Time (ps)'] = pd_3['Time (ps)'] + pd_2['Time (ps)'].max()

display(pd.concat([pd_1, pd_2, pd_3]))
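
The time shifting in the cell above places the three stages on one continuous axis: each stage starts where the previous one ended. A minimal stand-alone sketch of that bookkeeping follows, using the stage durations from the setup cell and assuming they are in ns. The helper `stage_offsets` is hypothetical.

```python
from itertools import accumulate


def stage_offsets(durations):
    """Return the start time of each stage when stages run back to back."""
    if not durations:
        return []
    # accumulate yields running totals; prepend 0 and drop the final total
    return [0.0] + list(accumulate(durations))[:-1]


durations_ns = {"HA_constr": 0.5, "CA_constr": 1.0, "CA_LOW_constr": 4.0}
offsets = stage_offsets(list(durations_ns.values()))

for label, start in zip(durations_ns, offsets):
    print(f"{label} starts at {start} ns")
```

Computing the offsets from the planned durations, rather than from the last recorded time of each stage, gives the same axis whenever every stage ran to completion.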

In [None]:
plt.rcParams.update({'font.size': 22})

fig, axs = plt.subplots(4, 1, figsize=(24,13.5), sharex=True, tight_layout=True)

for ax, quantity, unit in zip(axs, quantities, units):
    for df in (pd_1, pd_2, pd_3):
        ax.plot(df["Time (ps)"], df[quantity], label=str(df['label'].iloc[0]))
    # Label and grid once per axis; repeated bare grid() calls toggle visibility
    ax.set_ylabel(quantity + " (" + unit + ")")
    ax.grid()

axs[0].legend()
axs[-1].set_xlabel("Time (ps)");

Looks okay to me. Fluctuations are high at the end because the CA restraints are weak, but there is a well-defined average.

Alternatively, we **could** skip the `CA_LOW_constr` part.

### Plot RMSD

In [None]:
# Define reference structure for RMSD calculation
ref_sys = complex_sys.sys_history[1]

struct = "Protein"

rmsd_pd_1 = complex_sys.sys_history[-2].get_rmsd([struct, struct], ref_sys=ref_sys)
rmsd_pd_2 = complex_sys.sys_history[-1].get_rmsd([struct, struct], ref_sys=ref_sys)
rmsd_pd_3 = complex_sys.get_rmsd([struct, struct], ref_sys=ref_sys)


rmsd_pd_1['label'] = 'HA_constr'
rmsd_pd_2['label'] = 'CA_constr'
rmsd_pd_2['time'] = rmsd_pd_2['time'] + rmsd_pd_1['time'].max()
rmsd_pd_3['label'] = 'CA_LOW_constr'
rmsd_pd_3['time'] = rmsd_pd_3['time'] + rmsd_pd_2['time'].max()

display(pd.concat([rmsd_pd_1, rmsd_pd_2, rmsd_pd_3]))


In [None]:
fig, ax = plt.subplots(1, 1, figsize=(24,13.5))

for df in (rmsd_pd_1, rmsd_pd_2, rmsd_pd_3):
    ax.plot(df["time"], df["Protein"], label=str(df['label'].iloc[0]))

ax.legend()
        
ax.set_title(struct)
ax.set_ylabel('RMSD (nm)')
ax.set_xlabel('Time (ps)')
plt.grid()

## Checkpointing

Checkpoint the system object by pickling it. This makes it easier to restore on the cluster.

In [None]:
import pickle

with open('checkpoint_equi.pycpt', 'wb') as py_cpt:
    pickle.dump(complex_sys, py_cpt)
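
Restoring the checkpoint is the mirror operation. The sketch below round-trips a stand-in dictionary rather than a real `GmxSys` object, so it runs anywhere. Whether a `GmxSys` restored on another machine still points at valid file paths is not something this example verifies.

```python
import os
import pickle
import tempfile

# Stand-in for the simulation object; a real run would pickle the GmxSys instance
state = {"name": "zeb_hb_phz_complex", "stage": "equilibration"}

cpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint_equi.pycpt")

with open(cpt_path, "wb") as py_cpt:
    pickle.dump(state, py_cpt)

# Later, possibly in a new Python session
with open(cpt_path, "rb") as py_cpt:
    restored = pickle.load(py_cpt)

print(restored == state)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.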

## Production MD 

In [None]:
complex_sys.production(out_folder=prod_folder,
        nsteps=prod_step,
        dt=dt, vsite=vsite, maxwarn=1)


## Checkpointing Again


In [None]:
import pickle

with open('checkpoint_prod.pycpt', 'wb') as py_cpt:
    pickle.dump(complex_sys, py_cpt)

## Post-Production

### Prepare trajectory

In [None]:
# Center trajectory
complex_sys.center_mol_box(traj=True)

### Trajectory Conversion for better viewing

In [None]:
# Align the protein coordinates
complex_sys.convert_trj(select='Protein\nSystem\n', fit='rot+trans', pbc='none', skip='10')

In [None]:
complex_sys.display_history()