# ASMSA: Prepare and check input files

**Next steps**
- [tune.ipynb](tune.ipynb): Perform initial hyperparameter tuning for this molecule
- [train.ipynb](train.ipynb): Use results of previous tuning in more thorough training
- [md.ipynb](md.ipynb): Use a trained model in MD simulation with Gromacs

In [None]:
%cd trpcage

In [None]:
# prevent TensorFlow from allocating GPU memory
import tensorflow as tf

tf.config.set_visible_devices([], 'GPU')
tf.config.list_logical_devices()

import torch

In [None]:
import mdtraj as md
import numpy as np
import urllib.request
import asmsa
import os
import re
import tensorflow as tf
import gromacs as gmx
import gromacs.fileformats as gf
import matplotlib.pyplot as plt

## Prepare input files

Tryptophan cage files are downloaded in this section from our Google Drive.

This is for demonstration purposes; in real use, place your own inputs here and set the _conf, traj, topol, index_ variables to their filenames.

In [None]:
# Define input files
base = 'trpcage_correct'

# input conformation, it should not contain hydrogens
conf = base + '.pdb'

# input trajectory
# atom numbering must be consistent with {conf}, and it must not contain hydrogens either

traj = 'trpcage_red.xtc'

# everything else is generated with pdb2gmx to make sure the files are consistent

#### Density of additional internal coordinates

The number of randomly sampled atom-to-atom distances in which each atom should appear, on average.
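
One way to read this parameter, as a rough sketch and not the ASMSA implementation: if every atom should take part in about `nb_density` distances on average, roughly `n_atoms * nb_density / 2` distinct pairs have to be drawn, because each pair involves two atoms. The helper `sample_pairs` below is hypothetical and only illustrates that arithmetic.

```python
import itertools
import random
from collections import Counter

def sample_pairs(n_atoms, density, seed=0):
    """Hypothetical helper: draw distinct atom pairs so that each atom
    appears in about `density` pairs on average."""
    all_pairs = list(itertools.combinations(range(n_atoms), 2))
    # each pair involves two atoms, hence the division by 2
    n_pairs = n_atoms * density // 2
    return random.Random(seed).sample(all_pairs, n_pairs)

n_atoms, density = 20, 2
pairs = sample_pairs(n_atoms, density)

# count in how many sampled pairs each atom takes part
counts = Counter(atom for pair in pairs for atom in pair)
mean_appearances = sum(counts.values()) / n_atoms
print(len(pairs), mean_appearances)
```

Individual atoms can appear more or less often than the average; only the mean is fixed by this construction.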

In [None]:
nb_density = 2 # integer in [1, n_atoms-1]

In [None]:
topol = base + '.top'
index = base + '.ndx'
gro = base + '.gro'

with open('inputs.py','w') as i:
    i.write(f'''
base = '{base}'
conf = '{conf}'
traj = '{traj}'
topol = '{topol}'
index = '{index}'
gro = '{gro}'

nb_density = {nb_density}
''')

In [None]:
gmx.pdb2gmx(f=conf, ignh=True,p=topol,n=index,o=gro,water='tip3p',ff='amber99sb-ildn')

## Sanity checks

In [None]:
# Load the trajectory; it should report the expected numbers of frames and atoms/residues

tr = md.load(traj,top=conf)
idx=tr[0].top.select("name CA")

# for trivial cases like Ala-Ala, where superposing on CAs fails
#idx=tr[0].top.select("element != H")

tr.superpose(tr[0],atom_indices=idx)

In [None]:
# Visual check, all frames should look "reasonable"

# Because of different conventions of numbering atoms in proteins,
# PDB file {conf} and the trajectory {traj} can become inconsistent, and this would appear here 
# as rather weird shapes of the molecule

import nglview as nv

v = nv.show_mdtraj(tr)
v.clear()
v.add_representation("licorice")
#v.add_representation('ball+stick', selection='ZN2', radius=0.5, color="green") no Zn :(
v

## Split datasets

Split the trajectory into 3 parts, which will serve as the training, validation and testing datasets respectively. The workflow is as follows:
1. Shuffle configurations in trajectory
2. Select proportions to divide the trajectory
3. Divide the trajectory
4. Compute RMSD between
   * **train x validation** trajectory and filter similar structures in train trajectory
   * **train x test** trajectory and filter similar structures in train trajectory
   * **test x validation** trajectory and filter similar structures in test trajectory
5. Transform into internal coordinates
6. Save internal coordinates as datasets which can be loaded in **train.ipynb** and **tune.ipynb** notebooks
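
Steps 1 to 4 can be sketched on plain NumPy arrays, independently of mdtraj and Gromacs. This is an illustration under assumptions, not the notebook's own code: `split_indices` and `keep_dissimilar` are hypothetical names, the toy RMSD matrix is made up, and the 0.05 threshold simply mirrors the value used in the cells below.

```python
import numpy as np

def split_indices(n_frames, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle frame indices and cut them into train/validation/test parts."""
    assert np.isclose(sum(fractions), 1.0)
    perm = np.random.default_rng(seed).permutation(n_frames)
    n_train = int(n_frames * fractions[0])
    n_val = int(n_frames * fractions[1])
    # the test part takes whatever remains, so no frame is lost to rounding
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

def keep_dissimilar(rmsd, threshold):
    """Indices of rows whose nearest column structure is farther than threshold.

    rmsd[i, j] is the RMSD between frame i of one set and frame j of the other.
    """
    return np.flatnonzero(rmsd.min(axis=1) > threshold)

train_idx, val_idx, test_idx = split_indices(1000)

# toy RMSD matrix: 3 train frames (rows) against 2 validation frames (columns)
toy_rmsd = np.array([[0.30, 0.02],
                     [0.20, 0.40],
                     [0.06, 0.90]])
kept = keep_dissimilar(toy_rmsd, threshold=0.05)
print(kept)
```

Here the first train frame is dropped because it lies within 0.05 of a validation frame; the other two are kept.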

In [None]:
# shuffle the trajectory so the configurations are dispersed across all datasets
np.random.shuffle(tr.xyz)

In [None]:
# - set proportions for train, validation and test datasets
# - proportions must sum to 1
train = .7
validation = .15
test = .15

# compare with a tolerance; the original "== x or 1" test was always true
assert np.isclose(train + validation + test, 1.0)

tr_i = len(tr) * train
X_train = tr.slice(slice(0,int(tr_i)))

va_i = len(tr) * validation
X_validate = tr.slice(slice(int(tr_i),int(tr_i)+int(va_i)))

te_i = len(tr) * test
X_test = tr.slice(slice(int(tr_i)+int(va_i),len(tr)))

X_train.xyz.shape, X_validate.xyz.shape, X_test.xyz.shape

In [None]:
X_train.save_xtc('train.xtc')
X_validate.save_xtc('validate.xtc')
X_test.save_xtc('test.xtc')

## Recovery

In [None]:
# optional recovery: reload the split trajectories saved above

X_train = md.load_xtc('train.xtc',conf)
X_validate = md.load_xtc('validate.xtc',conf)
X_test = md.load_xtc('test.xtc',conf)


# Continue with the calculation

In [None]:
# create the backbone index group used by the RMSD calculations below
gmx.select(s=conf,on='backbone.ndx',select='Backbone')

In [None]:
gmx.rms(s=conf,f='train.xtc',f2='validate.xtc',n='backbone.ndx',m='trainxval_rmsd.xpm')

In [None]:
# load the RMSD matrix
txv = gf.XPM('trainxval_rmsd.xpm')
txv.array.shape

In [None]:
# minima per row -- for each configuration in train, how far is the nearest one from validation
txv_min = np.min(txv.array,axis=1)
txv_min.shape

In [None]:
plt.hist(txv_min,bins=50)
plt.show()

In [None]:
# drop train structures that are too similar to the validation trajectory, to avoid biasing the datasets
txv_difference = 0.05

train_tr = X_train[np.argwhere(txv_min > txv_difference).flat]
train_tr.xyz.shape

In [None]:
train_tr.save_xtc('tmp_train.xtc')
gmx.rms(s=conf,f='tmp_train.xtc',f2='test.xtc',n='backbone.ndx',m='trainxtest_rmsd.xpm')

In [None]:
txt = gf.XPM('trainxtest_rmsd.xpm')
txt.array.shape

In [None]:
txt_min = np.min(txt.array,axis=1)
txt_min.shape

In [None]:
plt.hist(txt_min,bins=50)
plt.show()

In [None]:
# ... the same filtering once more, now train against test
txt_difference = 0.05

x_train = train_tr[np.argwhere(txt_min > txt_difference).flat]
x_train.save_xtc('x_train.xtc')

In [None]:
# test x validation
gmx.rms(f='test.xtc',f2='validate.xtc',s=conf,n='backbone.ndx',m='testxvalidate_rmsd.xpm')

In [None]:
txv = gf.XPM('testxvalidate_rmsd.xpm')
txv.array.shape

In [None]:
txv_min = np.min(txv.array,axis=1)
txv_min.shape

In [None]:
plt.hist(txv_min,bins=50)
plt.show()

In [None]:
# ... and once more, now test against validation
txv_difference = 0.05

x_test = X_test[np.argwhere(txv_min > txv_difference).flat]
x_test.save_xtc('x_test.xtc')

In [None]:
# alternative: skip the thorough RMSD filtering and hard-link the unfiltered trajectories
! ln train.xtc x_train.xtc
! ln test.xtc x_test.xtc

# Recovery

In [None]:
exec(open('inputs.py').read())

In [None]:
# recovery

x_train = md.load('x_train.xtc', top=conf)
x_test = md.load('x_test.xtc', top=conf)


# Anyway

In [None]:
# get shapes of filtered trajectories that are to be used as datasets
validate_tr = md.load('validate.xtc', top=conf)

trajs = [x_train, validate_tr, x_test]
x_train.xyz.shape, validate_tr.xyz.shape, x_test.xyz.shape

In [None]:
# move the frame axis last so that we can use vectorized calculations later on
geoms = [ np.moveaxis(t.xyz,0,-1) for t in trajs]
print ([ g.shape for g in geoms ])

In [None]:
# save geometries

tf.data.Dataset.from_tensor_slices(geoms[0]).save('../Thermal-unfolding/datasets/geoms/train')
tf.data.Dataset.from_tensor_slices(geoms[1]).save('../Thermal-unfolding/datasets/geoms/validate')
tf.data.Dataset.from_tensor_slices(geoms[2]).save('../Thermal-unfolding/datasets/geoms/test')

# Internal coordinates computation

Exercise the ASMSA library on your input. Just check everything appears to work.

There are multiple options that can be combined:
- use traditional internal coordinates (bond distances, angles, and dihedrals) or not
- include additional distances between atoms that may not be bound to express protein folding state more directly
   - dense (all-to-all) atom distances, feasible for very small peptides only
   - sparse atom distances (only some pairs are chosen)
   

We save the computed internal coordinates for training, and a feature extraction model here, therefore everything in the other notebooks should work too.
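
To make the notion of internal coordinates concrete, here is a minimal NumPy sketch of two of them, an atom-to-atom distance and a dihedral (torsion) angle, for a single frame. It is an independent illustration with hypothetical function names, not the ASMSA implementation, which works on whole trajectories at once.

```python
import numpy as np

def distance(p0, p1):
    """Euclidean distance between two atoms."""
    return float(np.linalg.norm(p1 - p0))

def dihedral(p0, p1, p2, p3):
    """Torsion angle (radians) around the p1-p2 axis."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # project the outer bonds onto the plane perpendicular to the axis
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.arctan2(y, x))

p0 = np.array([1.0, 0.0, 0.0])
p1 = np.array([0.0, 0.0, 0.0])
p2 = np.array([0.0, 0.0, 1.0])

# last atom on the same side as p0 (cis) and on the opposite side (trans)
cis = dihedral(p0, p1, p2, np.array([1.0, 0.0, 1.0]))
trans = dihedral(p0, p1, p2, np.array([-1.0, 0.0, 1.0]))
print(distance(p0, p2), cis, abs(trans))
```

Unlike Cartesian coordinates, both quantities are unchanged when the whole molecule is rotated or translated.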


## New - Traditional internal coordinates (all bond distances, angles, and torsions)


In [None]:
'''
# mol = asmsa.Molecule(conf,topol)

# internal coordinates and sparse any-any atom distances (not restricted to bonds)
# alternatively, top (and index) can be left out to use sparse distances only

sparse_dists = asmsa.NBDistancesSparse(geoms[0].shape[0], density=nb_density)
mol=asmsa.Molecule(pdb=conf,top=topol,ndx=index,fms=[sparse_dists])

# dense distances are feasible for very small (up to 5 residues) peptides only

# dense_dists = asmsa.NBDistancesDense(geoms[0].shape[0])
# mol = asmsa.Molecule(pdb=conf,top=topol,ndx=index,fms=[dense_dists])
'''

In [None]:
#print(+ mol.__dict__)

##### Alternative: only backbone + Cbeta angles and dihedrals

In [None]:
with open('backbone.ndx') as i:
    i.readline()
    bb = np.array([ int(j)-1 for j in " ".join(i.readlines()).split() ])

In [None]:
bb

In [None]:
# backbone angles and dihedrals
#angles = np.array([ bb[i:i+3] for i in range(0,len(bb)-3) ])
diheds = np.array([ bb[i:i+4] for i in range(0,len(bb)-4) ])
diheds

In [None]:
# XXX: select alpha carbons and matching betas
tr1 = md.load(conf)
cas = tr1.topology.select('name CA and not resname GLY')
cbs = tr1.topology.select('name CB')
assert(len(cas) == len(cbs))

In [None]:
cas

In [None]:
# indices of CAs (non-GLY) on the backbone
cai = np.argwhere(bb.reshape(1,-1) == cas.reshape(-1,1))[:,1]
cai

In [None]:
# angles of CB-CA-X, where X is the next backbone atom for the first residue
# and the previous backbone atom for all the others
cbangles = np.array([[ cbs[0], cas[0], bb[cai[0]+1] ]] +
                   [[cbs[i], bb[cai[i]], bb[cai[i]-1] ] for i in range(1,len(cbs))])
# just check 
cbangles+1

In [None]:
cbdiheds = np.array([[ cbs[0], cas[0], bb[cai[0]+1], bb[cai[0]+2] ]] +
                   [[cbs[i], bb[cai[i]], bb[cai[i]-1], bb[cai[i]-2]] for i in range(1,len(cbs))])
cbdiheds+1

In [None]:
'''
# just angles
mol=asmsa.Molecule(pdb=conf,n_atoms=geoms[0].shape[0],
                   angles=np.concatenate((angles,cbangles)),
                   diheds=np.concatenate((diheds,cbdiheds)))
'''

In [None]:
'''
# molecule model with explicit angles and dihedrals, and sparse distances from among Calpha and Cbetas
# (don't bother with distances now)
sparse_dists = asmsa.NBDistancesSparse(geoms[0].shape[0], density=nb_density, atoms = np.concatenate((cas,cbs)))
mol=asmsa.Molecule(pdb=conf,n_atoms=geoms[0].shape[0],
                   angles=np.concatenate((angles,cbangles)),
                   diheds=np.concatenate((diheds,cbdiheds)),
                   fms=[sparse_dists]) 
'''

In [None]:
sparse_dists = asmsa.NBDistancesSparse(geoms[0].shape[0], density=nb_density, atoms = np.concatenate((cas,cbs)))
dense_dists = asmsa.NBDistancesDense(geoms[0].shape[0])


mol=asmsa.Molecule(pdb=conf,n_atoms=geoms[0].shape[0],
                   diheds=np.concatenate((diheds,cbdiheds)),
                   fms=[sparse_dists]) 

In [None]:
mol_model = mol.get_model()

example_input = torch.randn((*geoms[0].shape[:2],1))

In [None]:
mol_model.angles

In [None]:
mol_model.get_indices()

In [None]:
len(sparse_dists.bonds)

In [None]:
len(dense_dists.bonds)

##### Save the features (molecule) model

In [None]:
mol_model = mol.get_model()

example_input = torch.randn((*geoms[0].shape[:2],1))
traced_script_module = torch.jit.trace(mol_model, example_input)

traced_script_module.save('features-thermal-unfolding.pt')

##### Compute the internal coordinates now

In [None]:
intcoords = [ mol.intcoord(g).T for g in geoms]
print(
    [ g.shape for g in geoms ],
    [ i.shape for i in intcoords ]
)

In [None]:
[train,validate,test] = intcoords

In [None]:
# validate the saved model -- the maximum absolute difference should be nearly 0

test_from_model = mol_model(torch.from_numpy(geoms[2])).numpy()
np.max(np.abs(test - test_from_model.T))

In [None]:
# normalize training set
train_mean = np.mean(train,axis=0)
train -= train_mean
train_scale = np.std(train,axis=0)
train /= train_scale

In [None]:
# normalize test and validation sets
test -= train_mean
test /= train_scale
validate -= train_mean
validate /= train_scale

## Old gold

In [None]:
# traditional internal coordinates (bond distances, angles, and torsions) only

# mol = asmsa.Molecule(conf,topol)

# internal coordinates and sparse any-any atom distances (not restricted to bonds)
# alternatively, top (and index) can be left out to use sparse distances only


mols = []
for i in range(len(geoms)):
    sparse_dists = asmsa.NBDistancesSparse(geoms[i].shape[0], density=nb_density)
    mols.append(asmsa.Molecule(pdb=conf,top=topol,ndx=index,fms=[sparse_dists]))

# dense distances are feasible for very small (up to 5 residues) peptides only

# dense_dists = asmsa.NBDistancesDense(geoms[0].shape[0])
# mol = asmsa.Molecule(pdb=conf,top=topol,ndx=index,fms=[dense_dists])

In [None]:
intcoords = []
for i in range(len(mols)):
    intcoords.append(mols[i].intcoord(geoms[i]).T)
    print(intcoords[i].shape)

In [None]:
[train,validate,test] = intcoords

In [None]:
# normalize training set
train_mean = np.mean(train,axis=0)
train -= train_mean
train_scale = np.std(train,axis=0)
train /= train_scale

In [None]:
# normalize test and validation sets
test -= train_mean
test /= train_scale
validate -= train_mean
validate /= train_scale

## Save 

In [None]:
# save for use in the tune/train/test phases

tf.data.Dataset.from_tensor_slices(train).save('../Thermal-unfolding/datasets/intcoords/train')
tf.data.Dataset.from_tensor_slices(validate).save('../Thermal-unfolding/datasets/intcoords/validate')
tf.data.Dataset.from_tensor_slices(test).save('../Thermal-unfolding/datasets/intcoords/test')

np.savetxt('../Thermal-unfolding/datasets/intcoords/mean.txt',train_mean)
np.savetxt('../Thermal-unfolding/datasets/intcoords/scale.txt',train_scale)

# Density of the conformational space

- Sample the training trajectory randomly
- For each point in the trajectory:
  - calculate RMSD to all points in the sample
  - pick some number $n$ of nearest ones
  - calculate the _density_ at this point as $$ d = \frac{1}{n} \sum_{i=1}^n e^{-d_i} $$ where $d_i$ is the RMSD to the $i$-th nearest sample point, i.e. the nearer the sample points are, the higher the density
 
Altogether, $d$ roughly corresponds to the probability that the molecule during simulation ends up in this area of the conformational space.
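
The formula above can be sketched directly on an RMSD matrix. This is a minimal illustration of the plain formula with a hypothetical function name and made-up numbers; the notebook cell further down additionally discounts one term for the zero distance of a point to itself.

```python
import numpy as np

def knn_density(rmsd, k):
    """Mean of exp(-d_i) over the k smallest distances in each row.

    rmsd[i, j] is the RMSD between trajectory point i and sample point j.
    """
    nearest = np.sort(rmsd, axis=1)[:, :k]
    return np.exp(-nearest).mean(axis=1)

# toy matrix: 2 trajectory points against 3 sample points (distances in nm)
toy_rmsd = np.array([[0.0, 0.0, 5.0],
                     [1.0, 2.0, 3.0]])
dens_toy = knn_density(toy_rmsd, k=2)
print(dens_toy)
```

The first point has two neighbours at zero distance and reaches the maximum density of 1; the second point, with more distant neighbours, gets a lower value.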

In [None]:
sample_size = 5000
x_train = md.load('x_train.xtc', top=conf)
tr_sample = x_train[np.random.choice(len(x_train),sample_size,False)]
tr_sample.save('sample.xtc')

In [None]:
gmx.rms(f='x_train.xtc',f2='sample.xtc',s=conf,n='backbone.ndx',m='sample_rmsd.xpm')

In [None]:
rms = gf.XPM('sample_rmsd.xpm')

#### Visual check to verify the sample size is representative
- typically, not many distances should be less than 0.1 nm or more than 1 nm (the latter depends on the molecule and can be higher, e.g. for big disordered proteins)
- the histogram should be semi-smooth
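
Besides looking at the histogram, the two tails can be quantified. The sketch below uses a hypothetical helper and a made-up toy matrix; the 0.1 nm and 1 nm limits are the ones named in the list above.

```python
import numpy as np

def distance_fractions(rmsd, low=0.1, high=1.0):
    """Fractions of all pairwise distances below `low` and above `high` (nm)."""
    flat = np.asarray(rmsd).ravel()
    return float(np.mean(flat < low)), float(np.mean(flat > high))

toy = np.array([[0.05, 0.3, 0.5, 0.7],
                [0.2, 0.4, 0.6, 1.2]])
frac_low, frac_high = distance_fractions(toy)
print(frac_low, frac_high)
```

With the real data, `rms.array` would take the place of the toy matrix; both fractions are expected to be small.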

In [None]:
plt.hist(rms.array.flatten(),bins=50)
plt.show()

In [None]:
k_nearest = 200
rms_sort = np.sort(rms.array.astype(np.float32))
erms = np.exp(-rms_sort[:,:k_nearest])
dens = (np.sum(erms,axis=1)-1.) / (erms.shape[1] - 1)

#### Histogram of densities
- a fairly high number of points should fall above 0.8; those are low-energy basins
- the interval [0.5, 1.0] should be reasonably covered
- conversely, too many points below 0.4 would indicate either insufficient sampling in the step above or a too sparse trajectory
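
The three criteria above can also be checked numerically. This is a sketch with a hypothetical helper and synthetic densities; the bin count of 10 is an arbitrary assumption.

```python
import numpy as np

def density_report(dens, bins=10):
    """Summarize how densities are distributed with respect to the checks above."""
    dens = np.asarray(dens)
    # histogram restricted to [0.5, 1.0]; values outside the range are ignored
    counts, _ = np.histogram(dens, bins=bins, range=(0.5, 1.0))
    return {
        "frac_above_0.8": float(np.mean(dens > 0.8)),
        "frac_below_0.4": float(np.mean(dens < 0.4)),
        "empty_bins_0.5_1.0": int(np.sum(counts == 0)),
    }

# synthetic densities spread evenly over [0.3, 1.0]
toy_dens = np.linspace(0.3, 1.0, 71)
report = density_report(toy_dens)
print(report)
```

Empty bins in [0.5, 1.0] would hint at poor coverage of that interval, and a large fraction below 0.4 at the sampling problems mentioned above.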

In [None]:
plt.hist(dens,bins=20)
plt.show()

In [None]:
len(dens),len(x_train)

In [None]:
np.savetxt('datasets/train_density.txt',dens)