# Internal coordinates and averages

This notebook focuses on internal coordinates: distances, angles and dihedral angles. The name "internal coordinates" refers to the fact that they are insensitive to external (or global) rotations or translations of the system.
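
The invariance can be checked numerically. The following is a minimal sketch with made-up coordinates (not taken from the trajectory): a random rotation and a translation are applied to all atoms, and one interatomic distance is computed before and after. The names `coords`, `moved` and `distance` are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical coordinates of 5 atoms (in nm); not taken from the trajectory.
coords = rng.normal(size=(5, 3))

# Build a random orthogonal matrix from a QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1  # flip one axis to turn a reflection into a proper rotation
shift = np.array([1.0, -2.0, 0.5])

# Apply the same global rotation and translation to all atoms.
moved = coords @ q.T + shift


def distance(xyz, i, j):
    """Euclidean distance between atoms i and j."""
    return np.linalg.norm(xyz[i] - xyz[j])


d_before = distance(coords, 0, 1)
d_after = distance(moved, 0, 1)
print(d_before, d_after)
```

The Cartesian coordinates change, while the two printed distances agree up to rounding errors.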

These internal coordinates will also be used to demonstrate the computation of averages of time series and the error on such averages.

In [None]:
%matplotlib widget
import numpy as np
import mdtraj
import pandas
import matplotlib.pyplot as plt
import nglview

The following cell sets up nglview with GUI controls (under development). When clicking on an atom, one can find all its details under the tab `Extra` and then `Picked`. We will often need the atom index in this notebook.

In [None]:
traj = mdtraj.load('traj.dcd', top='init.pdb')
df = pandas.read_csv("scalars.csv")
view = nglview.show_mdtraj(traj, gui=True)
view.add_representation("ball+stick", selection="protein")
view

## 1. Interatomic distances

The following example computes two bond lengths as a function of time, namely for the `N-CA` and `CA-C` bonds in the leucine residue. Note that all indices in Python start counting from zero, so the bond between the first two atoms in the PDB and DCD files is denoted by `[0, 1]`.

The documentation of `compute_distances` can be found here: http://mdtraj.org/latest/api/generated/mdtraj.compute_distances.html#mdtraj.compute_distances
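
To make the indexing and the shape of the result concrete, the following sketch computes distances directly with NumPy for an array of shape `(n_frames, n_atoms, 3)`. It ignores periodic boundary conditions and is only meant as an illustration, not as a replacement for `compute_distances`. The function name `distances_from_xyz` and the coordinates are hypothetical.

```python
import numpy as np


def distances_from_xyz(xyz, pairs):
    """Distances for each frame and atom pair, without periodic images.

    xyz has shape (n_frames, n_atoms, 3); pairs has shape (n_pairs, 2).
    """
    pairs = np.asarray(pairs)
    delta = xyz[:, pairs[:, 0], :] - xyz[:, pairs[:, 1], :]
    return np.linalg.norm(delta, axis=-1)


# Two frames of a hypothetical three-atom system (coordinates in nm).
xyz = np.array([
    [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.1, 0.2, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.3, 0.0], [0.0, 0.3, 0.4]],
])
result = distances_from_xyz(xyz, [[0, 1], [1, 2]])
print(result.shape)  # one row per frame, one column per pair
print(result)
```

The pair `[0, 1]` refers to the first and second atom, and each column of the result is the time series of one pair.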

In [None]:
distances = mdtraj.compute_distances(traj, [[0, 1], [1, 2]])
print(distances.shape)
print(distances[400, 0])
plt.close(1)  # This is needed to rerun the code cell correctly
plt.figure(1)
for counter, col in enumerate(distances.T):
    plt.plot(df["Time (ps)"], col, label=str(counter))
plt.xlabel("Time [ps]")
plt.ylabel("Distances [nm]")
plt.legend(loc=0)
plt.show()

**<span style="color:#A03;font-size:14pt">
&#x270B; HANDS-ON! &#x1F528;
</span>**

> For the simple bond lengths shown above, the initialization (warming up of the protein during the first 5 picoseconds) seems to have little effect.
> Try to find the atomic indices of strongly electrostatically interacting pairs, e.g. the nitrogen atoms in the lysine side chains interact with carboxylic groups in a glutamine and the C-terminus. Plot the distances between two pairs of electrostatically interacting atoms. Can you recognize the thermalization in these interatomic distances? Add two more distances, one for a hydrogen bond in an alpha helix and one outside an alpha helix.

## 2. Valence and dihedral angles

A similar analysis of the C-S-C angle, the first backbone dihedral angle and a side-chain dihedral angle is carried out with [compute_angles](http://mdtraj.org/latest/api/generated/mdtraj.compute_angles.html) and [compute_dihedrals](http://mdtraj.org/latest/api/generated/mdtraj.compute_dihedrals.html) in the following code cell.

In [None]:
angles = mdtraj.compute_angles(traj, [[171, 172, 173]])
dihedrals = mdtraj.compute_dihedrals(traj, [[0, 1, 2, 21], [222, 225, 226, 231]])
print(angles.shape)
plt.close(2)  # This is needed to rerun the code cell correctly
plt.figure(2)
plt.plot(df["Time (ps)"], angles[:, 0]/np.pi*180, label='C-S-C angle')
plt.plot(df["Time (ps)"], dihedrals[:, 0]/np.pi*180, label='N-CA-C-N angle (psi)')
plt.plot(df["Time (ps)"], dihedrals[:, 1]/np.pi*180, label='CA-CB-OG-HG angle (chi2)')
plt.xlabel("Time [ps]")
plt.ylabel("Angles [deg]")
plt.legend(loc=0)
plt.show()

Valence angles in proteins are usually rather stiff while the (psi and phi) dihedrals in the backbone and the chiN dihedrals in the side chains explain most of the conformational changes.
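
For readers who want to see what a dihedral angle is geometrically, here is a minimal NumPy sketch that computes it from four points using the common `arctan2` formula. The function name `dihedral_degrees` and the test points are hypothetical, and the sign convention of this sketch is an assumption that has not been checked against MDTraj.

```python
import numpy as np


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees) defined by four points, in (-180, 180]."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b1 = p1 - p0
    b2 = p2 - p1
    b3 = p3 - p2
    n1 = np.cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = np.cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = np.linalg.norm(b2) * np.dot(b1, n2)
    x = np.dot(n1, n2)
    return np.degrees(np.arctan2(y, x))


# Hypothetical geometries with the central bond along the z-axis.
cis = dihedral_degrees([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1])
trans = dihedral_degrees([1, 0, 0], [0, 0, 0], [0, 0, 1], [-1, 0, 1])
gauche = dihedral_degrees([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1])
print(cis, trans, gauche)
```

The cis arrangement gives an angle of zero, the trans arrangement an angle of 180 degrees in magnitude, and the third geometry a quarter turn.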

## 3. Averages, fluctuations, error on the average

Computing averages and fluctuations (standard deviations) of a time-dependent data series is easily carried out with NumPy. The following code cell contains some examples.

In [None]:
print("average distance 0 [nm]", distances[:, 0].mean())
print("average distance 0 [nm]", np.mean(distances[:, 0]))
print("all average distances [nm]", distances.mean(axis=0))
print("all average distances [nm]", np.mean(distances, axis=0))
print("st.dev. distance 0 [nm]", distances[:, 0].std())
print("st.dev. distance 0 [nm]", np.std(distances[:, 0]))
print("all st.dev. distances [nm]", distances.std(axis=0))
print("all st.dev. distances [nm]", np.std(distances, axis=0))

The quantities from an MD simulation, of which an average is computed, are stochastic. Hence, any average over a finite number of steps is also a stochastic quantity, subject to an uncertainty. When computing an average over $N$ **uncorrelated** data points, the error on the average is

$$\sigma_{\langle a \rangle} = \sqrt{\frac{1}{N-1}\,\frac{1}{N}\sum_{i=1}^N (a_i - \langle a \rangle)^2}$$

This formula is rarely useful for estimating the error on averages (over time) of quantities from MD simulations, because the values for subsequent time steps are often correlated. (See e.g. the inter-atomic distances for electrostatically interacting pairs of atoms.)
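
The effect of correlations can be illustrated with synthetic data. The sketch below generates many independent realizations of a correlated series (a first-order autoregressive process, chosen here purely for illustration and unrelated to the MD data) and compares the actual spread of the averages with the estimate that treats all samples as uncorrelated. The names `ar1_series` and `naive_error` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)


def ar1_series(n, rho, rng):
    """Generate a correlated series x[i] = rho * x[i-1] + noise."""
    noise = rng.normal(size=n)
    x = np.empty(n)
    x[0] = noise[0]
    for i in range(1, n):
        x[i] = rho * x[i - 1] + noise[i]
    return x


def naive_error(values):
    """Error on the average, valid only for uncorrelated samples."""
    return np.std(values, ddof=1) / np.sqrt(len(values))


n, rho, nrep = 2000, 0.9, 200
series = [ar1_series(n, rho, rng) for _ in range(nrep)]
# Spread of the averages over independent repetitions: the actual error.
true_error = np.std([s.mean() for s in series], ddof=1)
# Typical error estimate when correlations are ignored.
mean_naive = np.mean([naive_error(s) for s in series])
print("actual error", true_error)
print("naive error ", mean_naive)
print("ratio       ", true_error / mean_naive)
```

For this strongly correlated series, the naive estimate is several times too small.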

One should correct the factor $1/(N-1)$ and replace $N$ by the number of uncorrelated data points in a time series. The function `num_independent` below estimates this number of independent samples with the block-averaging method.

**TODO: add reference and double check algorithm with Allen and Tildesley.**

**TODO: take a look at the following paper: https://pubs.acs.org/doi/10.1021/acs.jctc.5b00784**

In [None]:
def averror(values, num=None):
    """Compute the error on the average of uncorrelated samples."""
    if num is None:
        num = len(values)
    # Also covers the case where len(values) <= 1.
    if num <= 1:
        return np.nan
    return np.std(values) / np.sqrt(num - 1)

from scipy.optimize import minimize_scalar

def num_independent(values, doplot=False):
    """Estimate the number of independent samples in a time series.
    
    This is a basic implementation of the block-averaging method.

    Parameters
    ----------
    values
        A time-correlated series, must be a numpy array with shape (N,).
    doplot
        When True, plot the block-average variances and the fitted trend.

    Returns
    -------
    num
        The estimated number of independent samples.
    """
    if values.ndim != 1:
        raise TypeError("Only one-dimensional arrays are supported.")
    bs_grid = []
    blav_vars = []
    bs = 1
    while bs < len(values) / 10:
        nb = len(values) // bs
        blocks = values[:nb * bs].reshape(nb, bs)
        block_averages = blocks.mean(axis=1)
        blav_vars.append(block_averages.var(ddof=1)/(nb-1))
        bs_grid.append(bs)
        bs = max(bs+1, int(1.1**len(bs_grid)))
    bs_grid = np.array(bs_grid)
    blav_vars = np.array(blav_vars)
    
    def get_trend(bs):
        trend = bs_grid / (bs + bs_grid)
        #trend = (1 - np.exp(-bs_grid / bs))
        prefac = np.dot(trend, blav_vars) / np.dot(trend, trend)
        return prefac * trend
    
    def mismatch(bs):
        return (((blav_vars - get_trend(bs))/bs_grid)**2).sum()
    
    result = minimize_scalar(mismatch, bounds=[1, len(values)*10], method='bounded')
    optbs = result.x
    
    if doplot:
        plt.figure()
        plt.plot(bs_grid, get_trend(optbs))
        plt.plot(bs_grid, blav_vars)
        plt.show()
    
    return len(values)/optbs

values = distances[:, 0]
num = num_independent(values)
print("Number of values", len(values))
print("Number of uncorrelated values", num)
print("Error on average", averror(values, num))
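
As a cross-check of `num_independent`, one could estimate the number of independent samples in a different way, namely from the integrated autocorrelation function. The sketch below is not the block-averaging method: it sums the normalized autocorrelation function up to its first non-positive value, which is a simple heuristic truncation. The function names and the synthetic test series are hypothetical.

```python
import numpy as np


def statistical_inefficiency(values):
    """Estimate g = 1 + 2 * sum of the normalized autocorrelation function.

    The sum is truncated at the first non-positive autocorrelation value.
    The number of independent samples is then roughly len(values) / g.
    """
    values = np.asarray(values, dtype=float)
    n = len(values)
    dev = values - values.mean()
    var = np.dot(dev, dev) / n
    if var == 0.0:
        return 1.0  # constant series: nothing to correct
    g = 1.0
    for lag in range(1, n // 2):
        acf = np.dot(dev[:-lag], dev[lag:]) / ((n - lag) * var)
        if acf <= 0:
            break
        g += 2.0 * acf
    return g


# Synthetic test data: white noise and a correlated AR(1) series.
rng = np.random.default_rng(2)
n, rho = 20000, 0.9
white = rng.normal(size=n)
correlated = np.empty(n)
correlated[0] = white[0]
for i in range(1, n):
    correlated[i] = rho * correlated[i - 1] + white[i]

g_white = statistical_inefficiency(white)
g_corr = statistical_inefficiency(correlated)
print("white noise: g =", g_white, " independent samples:", n / g_white)
print("correlated:  g =", g_corr, " independent samples:", n / g_corr)
```

For the correlated series, the exact value in the limit of a long series is $(1+\rho)/(1-\rho) = 19$, so the estimate should land in that neighborhood, while white noise should give a value close to 1.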

**<span style="color:#A03;font-size:14pt">
&#x270B; HANDS-ON! &#x1F528;
</span>**

> Try estimating the error on the average of all internal coordinates computed so far. Does the number of independent samples correlate with the time dependence plotted previously?

## 4. Average of a dihedral angle

The average of a dihedral angle can become meaningless because these angles are periodic, i.e. $\phi=35^\circ$ and $\phi=395^\circ$ represent the same dihedral angle. As a consequence, a dihedral angle that increases beyond $180^\circ$ continues at $-180^\circ$. Such discontinuous jumps result in meaningless averages.
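
The periodicity can be made explicit with a small helper that maps any angle onto one period. This is a sketch with a hypothetical function name, and it uses the half-open interval $[-180^\circ, 180^\circ)$ as one possible choice.

```python
import numpy as np


def wrap_degrees(phi):
    """Map angles in degrees onto the interval [-180, 180)."""
    return (np.asarray(phi, dtype=float) + 180.0) % 360.0 - 180.0


wrapped = wrap_degrees([35.0, 395.0, 190.0, -190.0])
print(wrapped)
```

The first two inputs map onto the same value, and an angle slightly beyond $180^\circ$ reappears just above $-180^\circ$.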

This problem can be surmounted by computing a slightly different type of average, i.e. the average cosine and sine of the angle, followed by a conversion back into an angle. For example:

In [None]:
x = np.cos(dihedrals[:, 1]).mean()
y = np.sin(dihedrals[:, 1]).mean()
avdihed1 = np.arctan2(y, x)
avdihed1_wrong = dihedrals[:, 1].mean()
print(avdihed1/np.pi*180)
print(avdihed1_wrong/np.pi*180)
plt.close(3)  # This is needed to rerun the code cell correctly
plt.figure(3)
plt.plot(df["Time (ps)"], dihedrals[:, 1]/np.pi*180, label='CA-CB-OG-HG angle (chi2)')
plt.axhline(avdihed1/np.pi*180, color='k')
plt.axhline(avdihed1_wrong/np.pi*180, color='r')
plt.xlabel("Time [ps]")
plt.ylabel("Angles [deg]")
plt.legend(loc=0)
plt.show()

In this case, the difference is small but noticeable. The naive calculation of the average includes some samples close to $-180^\circ$, whereas the alternative approach does not suffer from this issue. To understand why this works, consider a plot of the sine versus the cosine:

In [None]:
import matplotlib.patches as mpatches
plt.close(4)  # This is needed to rerun the code cell correctly
plt.figure(4)
plt.plot(np.cos(dihedrals[:, 1]), np.sin(dihedrals[:, 1]), 'k+', alpha=0.1)
plt.plot([x], [y], 'ro')
plt.plot([0, x], [0, y], 'r-')
plt.gca().add_patch(mpatches.Arc([0, 0], 1, 1, angle=0, theta1=0, theta2=avdihed1/np.pi*180, color='r'))
plt.axhline(0, color='k')
plt.axvline(0, color='k')
plt.gca().set_aspect('equal')
plt.show()

When plotting the dihedral angles as points on a circle, i.e. $x=\cos\phi$ and $y=\sin\phi$, there is no longer a discontinuity and the average is well-behaved. The angle is then derived from the average (red dot) by computing the angle with the X-axis, using the `arctan2` function. A few remarks:

- The function `arctan2` takes the Y-coordinate as first argument.

- This approach will not work when the dihedral angles are uniformly distributed over the interval $[-180^\circ,180^\circ]$. In this case, the average cosine and sine are (nearly) zero and the angle with the X-axis is not (or ill) defined.
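
Both remarks can be illustrated with a few made-up angles. In the sketch below, the function `circular_mean_degrees` (a hypothetical name) returns the average angle together with the length of the average vector, which can serve as a rough indicator of how well the average is defined.

```python
import numpy as np


def circular_mean_degrees(phi_deg):
    """Average of angles (degrees) via the mean cosine and sine.

    Returns the mean angle and the length of the mean vector, which is
    close to 1 for a narrow distribution and close to 0 for a uniform one.
    """
    phi = np.radians(phi_deg)
    x = np.cos(phi).mean()
    y = np.sin(phi).mean()
    # Note the argument order of arctan2: Y-coordinate first.
    return np.degrees(np.arctan2(y, x)), np.hypot(x, y)


# Hypothetical samples scattered around the +/-180 degree discontinuity.
samples = np.array([170.0, 175.0, -175.0, -170.0])
naive = samples.mean()
mean_angle, length = circular_mean_degrees(samples)
print("naive mean   ", naive)
print("circular mean", mean_angle, "vector length", length)

# Uniformly spread angles: the mean vector nearly vanishes.
uniform = np.linspace(-180.0, 180.0, 360, endpoint=False)
uniform_angle, uniform_length = circular_mean_degrees(uniform)
print("uniform case: vector length", uniform_length)
```

The naive mean of the samples is zero, which is the opposite side of the circle, whereas the circular mean lies at the discontinuity where the samples actually are. In the uniform case, the returned angle is dominated by rounding errors and should not be interpreted.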

## 5. Histogram and free energy

Histograms may also be convenient to characterize an internal coordinate. We will apply it here to the C-S-C angle, which is a relatively stiff mode. The angles are also [normally distributed](https://en.wikipedia.org/wiki/Normal_distribution), which can be shown by overlaying the histogram with the normal probability density.

For reference, the force-field parameters for this angle can be found here:

https://github.com/openmm/openmm/blob/master/wrappers/python/simtk/openmm/app/data/amber14/protein.ff14SB.xml#L3077

and read

```
<Angle angle="1.726130630222392" k="518.816" type1="protein-CT" type2="protein-S" type3="protein-CT"/>
```

The equilibrium angle in degrees is $98.9^\circ$.
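
The conversion is easy to verify, and the same parameters also give a rough idea of the width of the distribution one might expect. The sketch below assumes that the force constant is in kJ/mol/rad$^2$, that the energy term has the form $\frac{1}{2}k(\theta-\theta_0)^2$, that the temperature is 300 K, and that the angle is not influenced by other force-field terms. All variable names are illustrative.

```python
import numpy as np

theta0 = 1.726130630222392  # equilibrium angle in rad, from the XML line above
k_angle = 518.816           # force constant, assumed to be in kJ/mol/rad^2

theta0_deg = np.degrees(theta0)
print("equilibrium angle [deg]", theta0_deg)

# Assumption: an isolated harmonic term 0.5*k*(theta - theta0)**2 at 300 K.
kT = 1e-3 * 1.380649e-23 * 6.02214076e23 * 300  # in kJ/mol
sigma = np.sqrt(kT / k_angle)  # thermal standard deviation in rad
sigma_deg = np.degrees(sigma)
print("expected std. dev. [deg]", sigma_deg)
```

Under these assumptions, the standard deviation of the angle would be about 4 degrees.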

In [None]:
plt.close(5)  # This is needed to rerun the code cell correctly
plt.figure(5)
a = angles/np.pi*180  # for convenience
bins = np.arange(85, 115, 1)
plt.hist(a, bins, density=True)
ava = a.mean()
avs = a.std()
x = np.linspace(80, 120, 79)
# The following equation is the probability density of a
# univariate normal distribution.
y = np.exp(-(x - ava)**2 / (2*avs**2)) / np.sqrt(2*avs**2*np.pi)
plt.plot(x, y)
plt.axvline(ava, color='k')
plt.title("Average = {:.1f}, Std.Dev. = {:.1f}".format(ava, avs))
plt.xlabel("C-S-C angle [deg]")
plt.ylabel("Probability density [1/deg]")
plt.show()

The average is slightly different from the equilibrium value, which may be due to other force-field terms influencing the local geometry.

**<span style="color:#A03;font-size:14pt">
&#x270B; HANDS-ON! &#x1F528;
</span>**

> How can you judge if the average and the equilibrium are significantly different?

The empirical probabilities from the histogram can be translated into free energies (up to a constant) by employing the following relationship:

$$F(\theta) = -k_B T \log(p(\theta))$$

This relation can also be applied to the probability density to obtain a quadratic approximation of the free energy.
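
To spell out this step, assume that $p(\theta)$ is a normal probability density with mean $\mu$ and standard deviation $\sigma$ (symbols introduced here for illustration):

$$p(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\theta - \mu)^2}{2\sigma^2}\right)$$

Inserting this into the relation for the free energy gives

$$F(\theta) = -k_B T \log(p(\theta)) = \frac{k_B T}{2\sigma^2} (\theta - \mu)^2 + \text{constant}$$

Comparing with a harmonic model $\frac{1}{2} K (\theta - \mu)^2$ yields the force constant $K = k_B T / \sigma^2$. When $\sigma$ is expressed in degrees, $K$ is obtained per squared degree and must be multiplied by $(180/\pi)^2$ to convert it to units per squared radian.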

In [None]:
boltzmann = (1e-3 * 1.380649e-23 * 6.02214076e23)  # in kJ/(mol K)
temperature = 300
# Quadratic approximation
fc = boltzmann * temperature / avs**2  # force constant
f = 0.5 * fc * (x - ava)**2 
# Empirical model, using histogram
pe = np.histogram(a, bins, density=True)[0]
fe = -boltzmann * temperature * np.log(pe)
fe -= fe.min()

plt.close(6)  # This is needed to rerun the code cell correctly
plt.figure(6)
plt.plot(x, f, label='Quadratic free energy model')
plt.plot((bins[1:] + bins[:-1])/2, fe, label='Empirical free energy')
plt.xlabel("C-S-C angle [deg]")
plt.ylabel("Free energy [kJ/mol]")
plt.legend(loc=0)
plt.show()

# Finally, the force constant is printed in kJ/mol/rad^2, for comparison with the force-field parameter.
print(fc * (180/np.pi)**2)

Also here, the free-energy force constant deviates slightly from the AMBER force-field parameter. Again, this could be due to an influence from the environment of the C-S-C angle.

**<span style="color:#A03;font-size:14pt">
&#x270B; HANDS-ON! &#x1F528;
</span>**

> Also estimate the error on the force constant.