# ASMSA: Prepare and check input files

**Next steps**
- [tune.ipynb](tune.ipynb): Perform initial hyperparameter tuning for this molecule
- [train.ipynb](train.ipynb): Use results of previous tuning in more thorough training
- [md.ipynb](md.ipynb): Use a trained model in MD simulation with Gromacs

In [1]:
%cd p53

/home/jovyan/ASMSA/p53


In [2]:
# prevent TF from consuming GPU memory
import tensorflow as tf
tf.config.set_visible_devices([], 'GPU')
tf.config.list_logical_devices()

2024-10-24 14:09:02.019707: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-24 14:09:02.031962: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-24 14:09:02.035614: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-24 14:09:02.045320: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


[LogicalDevice(name='/device:CPU:0', device_type='CPU')]

In [3]:
import mdtraj as md
import numpy as np
import urllib.request
import asmsa
import os
import re
import gromacs as gmx
import gromacs.fileformats as gf
import matplotlib.pyplot as plt
import nglview as nv

NOTE: Some configuration directories are not set up yet: 
	/home/jovyan/.gromacswrapper
	/home/jovyan/.gromacswrapper/qscripts
	/home/jovyan/.gromacswrapper/templates
NOTE: You can create the configuration file and directories with:
	>>> import gromacs
	>>> gromacs.config.setup()


pod/asmsa-gmx-13210-c4vzn condition met




## Prepare input files

This section prepares the input files for the p53 example.

This is for demonstration purposes; in real use, the inputs should be placed here and the _conf, traj, topol, index_ variables set to their file names.

In [4]:
# Define input files

base = 'p53'

# input conformation, it should not contain hydrogens
orig_conf = 'eq_nh.gro'
conf = base + '.pdb'

# input trajectory
# atom numbering must be consistent with {conf}, no hydrogens as well

orig_traj = "traj_comp_nh.xtc"
traj = base + '-pdc.xtc'

# everything else is generated with pdb2gmx to make sure the files are consistent

In [5]:
v=nv.show_file(orig_conf)
v.clear()
v.add_representation('licorice')
v

NGLWidget()

In [6]:
tr1 = md.load(orig_conf)
tr1.atom_slice(tr1.topology.select('element != H')).save_pdb(conf)

In [7]:
gmx.trjconv(f=orig_traj,s=conf,pbc='nojump',input='System System'.split(),o=traj)

pod/asmsa-gmx-13210-c4vzn condition met
            :-) GROMACS - gmx trjconv, 2023.2-plumed_2.10.0_dev (-:

Executable:   /gromacs/AVX2_256_ts/bin/gmx
Data prefix:  /gromacs/AVX2_256_ts
Working dir:  /mnt/ASMSA/p53
Command line:
  gmx trjconv -f traj_comp_nh.xtc -s p53.pdb -pbc nojump -o p53-pdc.xtc

Will write xtc: Compressed trajectory (portable xdr format): xtc
Group     0 (         System) has  1566 elements
Group     1 (        Protein) has  1565 elements
Group     2 (      Protein-H) has  1565 elements
Group     3 (        C-alpha) has   199 elements
Group     4 (       Backbone) has   597 elements
Group     5 (      MainChain) has   797 elements
Group     6 (   MainChain+Cb) has   983 elements
Group     7 (    MainChain+H) has   797 elements
Group     8 (      SideChain) has   768 elements
Group     9 (    SideChain-H) has   768 elements
Group    10 (    Prot-Masses) has  1565 elements
Group    11 (    non-Protein) has     1 elements
Group    12 (          Other) has     1 elem

Note that major changes are planned in future for trjconv, to improve usability and utility.
Select group for output
Selected 0: 'System'


Last frame      10000 time 1000000.000    ->  frame   9999 time 999900.000      
 ->  frame  10000 time 1000000.000      
Last written: frame  10000 time 1000000.000


GROMACS reminds you: "Only entropy comes easy." (Anton Chekov)



(0, None, None)

#### Density of additional internal coordinates

The average number of randomly sampled atom-to-atom distances (out of all possible pairs) in which each atom should appear.
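
To make the meaning of this density concrete, the sketch below draws random atom pairs so that each atom appears in `density` pairs on average. It is an illustration with NumPy only and is not the code ASMSA runs inside `NBDistancesSparse`; the helper name `sample_sparse_pairs` and the toy size of 100 atoms are assumptions made for this example.

```python
import numpy as np

def sample_sparse_pairs(n_atoms, density, seed=0):
    """Hypothetical helper (not the ASMSA implementation): draw random atom
    pairs (i < j) so that each atom appears in `density` pairs on average."""
    rng = np.random.default_rng(seed)
    # each pair involves two atoms, so n_atoms * density / 2 pairs are needed
    n_pairs = n_atoms * density // 2
    # enumerate all unordered pairs i < j and pick a random subset
    i, j = np.triu_indices(n_atoms, k=1)
    chosen = rng.choice(len(i), size=n_pairs, replace=False)
    return np.stack([i[chosen], j[chosen]], axis=1)

pairs = sample_sparse_pairs(n_atoms=100, density=2)
# how many sampled pairs each atom takes part in
counts = np.bincount(pairs.ravel(), minlength=100)
print(pairs.shape, counts.mean())
```

With a density of 2, the number of sampled distances grows linearly with the number of atoms, whereas the number of all atom pairs grows quadratically.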

In [8]:
nb_density = 2 # integer in [1, n_atoms-1]

In [9]:
topol = base + '.top'
index = base + '.ndx'
gro = base + '.gro'

with open('inputs.py','w') as i:
    i.write(f'''
base = '{base}'
conf = '{conf}'
traj = '{traj}'
topol = '{topol}'
index = '{index}'
gro = '{gro}'

nb_density = {nb_density}
''')

In [11]:
# XXX: replaced OH with OH1 in charmm22st.ff/aminoacids.c.tdb
gmx.pdb2gmx(f=orig_conf,ignh=True,p=topol,n=index,o=gro,water='tip3p',ff='charmm22st')

pod/asmsa-gmx-13210-c4vzn condition met


Using the Charmm22st force field in directory ./charmm22st.ff

going to rename ./charmm22st.ff/aminoacids.r2b

going to rename ./charmm22st.ff/rna.r2b
Reading eq_nh.gro...
Read 'Protein in water', 1566 atoms

Analyzing pdb file
Splitting chemical chains based on TER records or chain id changing.

There are 1 chains and 0 blocks of water and 200 residues with 1566 atoms

  chain  #res #atoms

  1 ' '   200   1566  

No occupancies in eq_nh.gro

Reading residue database... (Charmm22st)

Processing chain 1 (1566 atoms, 200 residues)

Identified residue TRP91 as a starting terminus.

Identified residue LEU289 as a ending terminus.
Start terminus TRP-91: NH3+
End terminus LEU-289: COO-

Checking for duplicate atoms....

With the -remh option the generated index file (p53.ndx) might be useless (the index file is generated before hydrogens are added)

Generating any missing hydrogen atoms and/or adding termini.

Now there are 200 residues with 3087 atoms

Making bonds...

Number of bonds was 

            :-) GROMACS - gmx pdb2gmx, 2023.2-plumed_2.10.0_dev (-:

Executable:   /gromacs/AVX2_256_ts/bin/gmx
Data prefix:  /gromacs/AVX2_256_ts
Working dir:  /mnt/ASMSA/p53
Command line:
  gmx pdb2gmx -f eq_nh.gro -ignh -p p53.top -n p53.ndx -o p53.gro -water tip3p -ff charmm22st

Opening force field file ./charmm22st.ff/aminoacids.r2b
Opening force field file ./charmm22st.ff/rna.r2b
No occupancies in eq_nh.gro
Opening force field file ./charmm22st.ff/atomtypes.atp
Opening force field file ./charmm22st.ff/aminoacids.rtp
Opening force field file ./charmm22st.ff/dna.rtp
Opening force field file ./charmm22st.ff/lipids.rtp
Opening force field file ./charmm22st.ff/rna.rtp
Opening force field file ./charmm22st.ff/aminoacids.hdb
Opening force field file ./charmm22st.ff/dna.hdb
Opening force field file ./charmm22st.ff/lipids.hdb
Opening force field file ./charmm22st.ff/rna.hdb
Opening force field file ./charmm22st.ff/aminoacids.n.tdb
Opening force field file ./charmm22st.ff/dna.n.tdb
Openin

(0, None, None)

## Sanity checks

In [13]:
# Load the trajectory; it should report the expected numbers of frames, atoms, and residues

tr = md.load(traj,top=conf)
idx=tr[0].top.select("name CA")

# for trivial cases like Ala-Ala, where superposing on CAs fails
#idx=tr[0].top.select("element != H")

tr.superpose(tr[0],atom_indices=idx)

<mdtraj.Trajectory with 10001 frames, 1566 atoms, 200 residues, and unitcells at 0x7f63f0761570>

In [14]:
# Visual check, all frames should look "reasonable"

# Because of different conventions of numbering atoms in proteins,
# PDB file {conf} and the trajectory {traj} can become inconsistent, and this would appear here 
# as rather weird shapes of the molecule

import nglview as nv

v = nv.show_mdtraj(tr)
v.clear()
v.add_representation("licorice")
v

NGLWidget(max_frame=10000)

## Split datasets

Split the trajectory into three parts, which serve as the training, validation, and test datasets respectively. The workflow is as follows:
1. Shuffle configurations in trajectory
2. Select proportions to divide the trajectory
3. Divide the trajectory
4. Compute RMSD between
   * **train x validation** trajectory and filter similar structures in train trajectory
   * **train x test** trajectory and filter similar structures in train trajectory
   * **test x validation** trajectory and filter similar structures in test trajectory
5. Transform into internal coordinates
6. Save internal coordinates as datasets which can be loaded in **train.ipynb** and **tune.ipynb** notebooks
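
The filtering in step 4 can be sketched on a toy RMSD matrix. The matrix values below are made up for illustration; in this notebook the matrix is produced by `gmx rms` and read with `gromacs.fileformats.XPM`.

```python
import numpy as np

# toy RMSD matrix: rows = train frames, columns = validation frames (values in nm)
rmsd = np.array([
    [0.30, 0.02, 0.40],   # train frame 0 nearly duplicates validation frame 1
    [0.25, 0.31, 0.12],
    [0.04, 0.50, 0.45],   # train frame 2 nearly duplicates validation frame 0
    [0.20, 0.22, 0.19],
])
threshold = 0.05

# distance from each train frame to its nearest validation frame
nearest = rmsd.min(axis=1)
# indices of train frames that are far enough from every validation frame
keep = np.flatnonzero(nearest > threshold)
print(keep)  # → [1 3]
```

Only the frames listed in `keep` would stay in the training set; the near-duplicates are dropped so that validation does not measure memorized structures.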

In [15]:
# shuffle the frames so the configurations are dispersed across all datasets
# (indexing with a permutation keeps coordinates, unit cells, and times in sync)
tr = tr[np.random.permutation(len(tr))]

In [16]:
# - set proportions for train, validation and test datasets
# - the proportions must sum to 1
train = .7
validation = .15
test = .15

# compare with a tolerance, the floating-point sum need not be exactly 1
assert np.isclose(train + validation + test, 1.0)

tr_i = len(tr) * train
X_train = tr.slice(slice(0,int(tr_i)))

va_i = len(tr) * validation
X_validate = tr.slice(slice(int(tr_i),int(tr_i)+int(va_i)))

te_i = len(tr) * test
X_test = tr.slice(slice(int(tr_i)+int(va_i),len(tr)))

X_train.xyz.shape, X_validate.xyz.shape, X_test.xyz.shape

((7000, 1566, 3), (1500, 1566, 3), (1501, 1566, 3))

In [17]:
X_train.save_xtc('train.xtc')
X_validate.save_xtc('validate.xtc')
X_test.save_xtc('test.xtc')

In [None]:
# recovery, if needed
"""
X_train = md.load_xtc('train.xtc',conf)
X_validate = md.load_xtc('validate.xtc',conf)
X_test = md.load_xtc('test.xtc',conf)
"""

In [31]:
# create an index file with the backbone atoms; it is used in the RMSD calculations below
gmx.select(s=conf,on='backbone.ndx',select='Backbone')

pod/asmsa-gmx-13210-c4vzn condition met



         based on residue and atom names, since they could not be
         definitively assigned from the information in your input
         files. These guessed numbers might deviate from the mass
         and radius of the atom type. Please check the output
         files if necessary. Note, that this functionality may
         be removed in a future GROMACS version. Please, consider
         using another file format for your input.



             :-) GROMACS - gmx select, 2023.2-plumed_2.10.0_dev (-:

Executable:   /gromacs/AVX2_256_ts/bin/gmx
Data prefix:  /gromacs/AVX2_256_ts
Working dir:  /mnt/ASMSA/p53
Command line:
  gmx select -s p53.pdb -on backbone.ndx -select Backbone

Analyzed topology coordinates

GROMACS reminds you: "Don't You Wish You Never Met Her, Dirty Blue Gene?" (Captain Beefheart)



(0, None, None)

In [None]:
gmx.rms(s=conf,f='train.xtc',f2='validate.xtc',n='backbone.ndx',m='trainxval_rmsd.xpm')

In [None]:
# load the RMSD matrix
txv = gf.XPM('trainxval_rmsd.xpm')
txv.array.shape

In [None]:
# minima per row -- for each configuration in train, how far is the nearest one from validation
txv_min = np.min(txv.array,axis=1)
txv_min.shape

In [None]:
plt.hist(txv_min,bins=50)
plt.show()

In [None]:
# drop train structures that are too similar to the validation trajectory, to avoid biasing the dataset
txv_difference = 0.05

train_tr = X_train[np.argwhere(txv_min > txv_difference).flat]
train_tr.xyz.shape

In [None]:
train_tr.save_xtc('tmp_train.xtc')
gmx.rms(s=conf,f='tmp_train.xtc',f2='test.xtc',n='backbone.ndx',m='trainxtest_rmsd.xpm')

In [None]:
txt = gf.XPM('trainxtest_rmsd.xpm')
txt.array.shape

In [None]:
txt_min = np.min(txt.array,axis=1)
txt_min.shape

In [None]:
plt.hist(txt_min,bins=50)
plt.show()

In [None]:
# ... the same filtering once more, this time train x test
txt_difference = 0.05

x_train = train_tr[np.argwhere(txt_min > txt_difference).flat]
x_train.save_xtc('x_train.xtc')

In [None]:
# test x validation
gmx.rms(f='test.xtc',f2='validate.xtc',s=conf,n='backbone.ndx',m='testxvalidate_rmsd.xpm')

In [None]:
txv = gf.XPM('testxvalidate_rmsd.xpm')
txv.array.shape

In [None]:
txv_min = np.min(txv.array,axis=1)
txv_min.shape

In [None]:
plt.hist(txv_min,bins=50)
plt.show()

In [None]:
# drop test structures that are too similar to the validation trajectory
txv_difference = 0.05

x_test = X_test[np.argwhere(txv_min > txv_difference).flat]
x_test.save_xtc('x_test.xtc')

In [18]:
# shortcut: skip the RMSD filtering if you don't want to wait for it
! ln -sf train.xtc x_train.xtc
! ln -sf test.xtc x_test.xtc

In [19]:
# recovery

x_train = md.load('x_train.xtc', top=conf)
x_test = md.load('x_test.xtc', top=conf)


In [20]:
# get shapes of filtered trajectories that are to be used as datasets
validate_tr = md.load('validate.xtc', top=conf)

trajs = [x_train, validate_tr, x_test]
x_train.xyz.shape, validate_tr.xyz.shape, x_test.xyz.shape

((7000, 1566, 3), (1500, 1566, 3), (1501, 1566, 3))

In [21]:
# move the frame axis last so that we can use vectorized calculations later on
geoms = []

for i in range(len(trajs)):
    geoms.append(np.moveaxis(trajs[i].xyz,0,-1))
    print(geoms[i].shape)

(1566, 3, 7000)
(1566, 3, 1500)
(1566, 3, 1501)


In [22]:
# save geometries

tf.data.Dataset.from_tensor_slices(geoms[0]).save('datasets/geoms/train')
tf.data.Dataset.from_tensor_slices(geoms[1]).save('datasets/geoms/validate')
tf.data.Dataset.from_tensor_slices(geoms[2]).save('datasets/geoms/test')

### Internal coordinates computation

Exercise the ASMSA library on your input. Just check everything appears to work.

There are multiple options that can be combined:
- use traditional internal coordinates (bond distances, angles, and dihedrals) or not
- include additional distances between atoms that may not be bound to express protein folding state more directly
   - dense (all-to-all) atom distances, feasible for very small peptides only
   - sparse atom distances (only some pairs are chosen)
   
**Choose the suitable option in the cell below and copy the same choice to [tune.ipynb](tune.ipynb) and [train.ipynb](train.ipynb); they must be consistent.**


In [23]:
# traditional internal coordinates (bond distances, angles, and torsions) only

# mol = asmsa.Molecule(conf,topol)

# internal coordinates and sparse any-any atom distances (not restricted to bonds)
# optionally, top (and index) can be left out to use sparse distances only


mols = []
for i in range(len(geoms)):
    sparse_dists = asmsa.NBDistancesSparse(geoms[i].shape[0], density=nb_density)
    mols.append(asmsa.Molecule(pdb=conf,top=topol,ndx=index,fms=[sparse_dists]))

# dense distances are feasible for very small peptides (up to 5 residues) only

# dense_dists = asmsa.NBDistancesDense(geoms[0].shape[0])
# mol = asmsa.Molecule(pdb=conf,top=topol,ndx=index,fms=[dense_dists])

In [24]:
intcoords = []
for i in range(len(mols)):
    intcoords.append(mols[i].intcoord(geoms[i]).T)
    print(intcoords[i].shape)

(7000, 13656)
(1500, 13656)
(1501, 13656)


In [25]:
[train,validate,test] = intcoords

In [26]:
# normalize training set
train_mean = np.mean(train,axis=0)
train -= train_mean
train_scale = np.std(train,axis=0)
train /= train_scale

In [27]:
# normalize test and validation sets
test -= train_mean
test /= train_scale
validate -= train_mean
validate /= train_scale

In [28]:
# save for usage in tune/train/test phase

tf.data.Dataset.from_tensor_slices(train).save('datasets/intcoords/train')
tf.data.Dataset.from_tensor_slices(validate).save('datasets/intcoords/validate')
tf.data.Dataset.from_tensor_slices(test).save('datasets/intcoords/test')

np.savetxt('datasets/intcoords/mean.txt',train_mean)
np.savetxt('datasets/intcoords/scale.txt',train_scale)

### Density of the conformational space

- Sample the training trajectory randomly
- For each point in the trajectory:
  - calculate RMSD to all points in the sample
  - pick some number $n$ of nearest ones
  - calculate the _density_ at this point as $$ d = \frac{1}{n} \sum_{i=1}^n e^{-d_i} $$ where $d_i$ is the RMSD to the $i$-th nearest sample point, i.e. the nearer the sample points are, the higher the density
 
Altogether, $d$ roughly corresponds to the probability that the molecule during simulation ends up in this area of the conformational space.
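
As a minimal sketch of the formula above, the following NumPy snippet computes the density for a made-up RMSD matrix. The helper name `knn_density` and the toy values are assumptions for illustration, and the cell further below additionally discounts the zero self-distance term.

```python
import numpy as np

def knn_density(rmsd, n):
    """Hypothetical helper: mean of exp(-d_i) over the n smallest
    distances in each row of the RMSD matrix."""
    nearest = np.sort(rmsd, axis=1)[:, :n]
    return np.exp(-nearest).mean(axis=1)

# toy RMSD matrix: 2 trajectory points x 4 sample points (values in nm)
rmsd = np.array([
    [0.1, 0.2, 0.9, 1.5],   # point with close neighbours
    [1.0, 1.2, 1.4, 2.0],   # isolated point
])
dens = knn_density(rmsd, n=2)
print(dens)
```

The first point, which has neighbours within 0.2 nm, gets a higher density than the isolated second point.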

In [29]:
sample_size = 5000
x_train = md.load('x_train.xtc', top=conf)
tr_sample = x_train[np.random.choice(len(x_train),sample_size,False)]
tr_sample.save('sample.xtc')

In [32]:
# XXX: Broken with Zn
gmx.rms(f='x_train.xtc',f2='sample.xtc',s=conf,n='backbone.ndx',m='sample_rmsd.xpm')

pod/asmsa-gmx-13210-c4vzn condition met



         based on residue and atom names, since they could not be
         definitively assigned from the information in your input
         files. These guessed numbers might deviate from the mass
         and radius of the atom type. Please check the output
         files if necessary. Note, that this functionality may
         be removed in a future GROMACS version. Please, consider
         using another file format for your input.



              :-) GROMACS - gmx rms, 2023.2-plumed_2.10.0_dev (-:

Executable:   /gromacs/AVX2_256_ts/bin/gmx
Data prefix:  /gromacs/AVX2_256_ts
Working dir:  /mnt/ASMSA/p53
Command line:
  gmx rms -f x_train.xtc -f2 sample.xtc -s p53.pdb -n backbone.ndx -m sample_rmsd.xpm

Can not find mass in database for atom ZN in residue 300 ZN2

-------------------------------------------------------
Program:     gmx rms, version 2023.2-plumed_2.10.0_dev
Source file: src/gromacs/fileio/confio.cpp (line 522)

Fatal error:
Masses were requested, but for some atom(s) masses could not be found in the
database. Use a tpr file as input, if possible, or add these atoms to the mass
database.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
command terminated with exit co

GromacsError: [Errno 1] Gromacs tool failed
Command invocation: gmx rms -f x_train.xtc -f2 sample.xtc -s p53.pdb -n backbone.ndx -m sample_rmsd.xpm

In [None]:
rms = gf.XPM('sample_rmsd.xpm')

#### Visual check to verify the sample size is representative
- typically, not many distances should be less than 0.1 nm or more than 1 nm (the latter depends on the molecule and can be more for e.g. big disordered proteins)
- the histogram should be semi-smooth

In [None]:
plt.hist(rms.array.flatten(),bins=50)
plt.show()

In [None]:
k_nearest = 200
rms_sort = np.sort(rms.array.astype(np.float32))
erms = np.exp(-rms_sort[:,:k_nearest])
dens = (np.sum(erms,axis=1)-1.) / (erms.shape[1] - 1)

#### Histogram of densities
- a fairly large number of points should fall above 0.8; those are low-energy basins
- the interval [0.5, 1.0] should be reasonably covered
- conversely, too many points below 0.4 would indicate either insufficient sampling in the step above or a too sparse trajectory
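
In addition to reading the histogram, these rules of thumb can be checked numerically. The sketch below is one possible way to do so; the helper name `density_summary` and the example densities are made up for illustration, and no particular fraction is prescribed here as a pass/fail limit.

```python
import numpy as np

def density_summary(dens):
    """Hypothetical helper: fractions of points in the density ranges
    discussed above."""
    dens = np.asarray(dens)
    return {
        "above_0.8": float(np.mean(dens > 0.8)),
        "0.5_to_1.0": float(np.mean((dens >= 0.5) & (dens <= 1.0))),
        "below_0.4": float(np.mean(dens < 0.4)),
    }

# made-up densities, for illustration only
summary = density_summary([0.95, 0.85, 0.7, 0.6, 0.3])
print(summary)
```

On real data, the same function could be applied to the `dens` array computed in the previous cell.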

In [None]:
plt.hist(dens,bins=20)
plt.show()

In [None]:
len(dens),len(x_train)

In [None]:
np.savetxt('datasets/train_density.txt',dens)