# Alignment and principal component analysis (PCA)

This notebook starts with alignment of a protein trajectory to a reference geometry, by minimizing the RMSD. This is needed to remove (random) global rotation and translation from the protein, before analyzing the principal components that describe the global motions of the protein.

This notebook shows how to perform PCA with plain NumPy operations, so that each step can be understood in detail. The MDTraj documentation also explains how to perform [PCA with scikit-learn](http://mdtraj.org/latest/examples/pca.html).

In [None]:
from sys import stdout

import matplotlib.pyplot as plt
import mdtraj
import nglview
import numpy as np
import pandas
from openmm import *
from openmm.app import *
from openmm.unit import *
from pdbfixer import PDBFixer

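The next cell loads the trajectory, retains only the protein atoms (which strips the water molecules) and reads the scalar time series, including the time axis, from `scalars.csv`.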

In [None]:
traj = mdtraj.load('traj.dcd', top='init.pdb')
traj.restrict_atoms(traj.topology.select("protein"))
df = pandas.read_csv("scalars.csv")

## 1. RMSD Alignment

We will use the [superpose](http://mdtraj.org/latest/api/generated/mdtraj.Trajectory.superpose.html) method from MDTraj to remove the global rotation and translation of the protein. All frames will be aligned with the first geometry of the trajectory, by setting the second argument to zero:

In [None]:
traj.superpose(traj, 0)

This was relatively easy. We can also (optionally) compute the actual RMSD value at each time step, with the [rmsd](http://mdtraj.org/latest/api/generated/mdtraj.rmsd.html) function:

In [None]:
rmsds = mdtraj.rmsd(traj, traj, 0)
plt.close(1)  # This is needed to rerun the code cell correctly
fig, ax = plt.subplots(num=1)
ax.plot(df["Time (ps)"], rmsds)
ax.set_xlabel("Time [ps]")
ax.set_ylabel("RMSD [nm]")

The rapid initial increase of the RMSD represents a sudden change in geometry caused by the energy minimization before the actual MD run. Over time, the protein geometry slowly drifts away from its initial structure. In the following visualization, the effect of the alignment is clear: the geometry does not slowly rotate as in the previous notebook. Instead it just seems to wiggle in place.

In [None]:
view = nglview.show_mdtraj(traj)
view.add_representation("ball+stick", selection="protein")
view
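
To make explicit what an RMSD-minimizing superposition does, the following sketch aligns one set of coordinates to another with the Kabsch algorithm, using only NumPy. It is an illustration on synthetic coordinates, not MDTraj's implementation (which may use a different algorithm internally). The helper names `kabsch_align` and `rmsd` are hypothetical and all atoms are assumed to have equal weight.

```python
import numpy as np


def kabsch_align(mobile, reference):
    """Return `mobile` rigidly superposed on `reference` (both (n_atoms, 3))."""
    # Remove the translation by moving both centroids to the origin.
    mob = mobile - mobile.mean(axis=0)
    ref_center = reference.mean(axis=0)
    ref = reference - ref_center
    # The SVD of the 3x3 correlation matrix yields the optimal rotation.
    u, _, vt = np.linalg.svd(mob.T @ ref)
    # Flip one axis if needed, so the result is a proper rotation (no mirror).
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return mob @ rot + ref_center


def rmsd(a, b):
    """Root-mean-square deviation between two (n_atoms, 3) arrays."""
    return np.sqrt(((a - b) ** 2).sum(axis=1).mean())


# Synthetic test: rotate and translate a random "molecule", then align it back.
rng = np.random.default_rng(0)
reference = rng.normal(size=(20, 3))
angle = 0.7
rot_z = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
mobile = reference @ rot_z + np.array([1.0, -2.0, 0.5])
aligned = kabsch_align(mobile, reference)
rmsd_before = rmsd(mobile, reference)
rmsd_after = rmsd(aligned, reference)
print(rmsd_before, rmsd_after)
```

Because the synthetic `mobile` coordinates differ from `reference` only by a rigid-body motion, the RMSD after alignment should vanish up to rounding errors. In a real trajectory, the remaining RMSD measures the internal deformation of the protein.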

## 2. The covariance matrix

For the remaining part of the analysis, the initialization phase will first be removed from the trajectory.

The first step in a principal component analysis is to compute the covariance matrix of the atomic positions, for which we will use NumPy's [cov](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html) function. In this notebook, all atoms are included, but one may also work on a subset, e.g. the alpha carbons, instead.

Because this is a very short MD simulation of a rather rigid protein, a direct visualization of the covariance does not show much.

In [None]:
# We first need to change the shape of the array with atomic positions.
print(traj.xyz.shape)
xyz = traj.xyz.reshape((-1, traj.n_atoms * 3))
print(xyz.shape)
# rowvar=False indicates that every column is one variable (a Cartesian
# coordinate) and every row is one observation (a time step or frame).
# The first 250 steps are skipped because they cover the initialization phase.
covar = np.cov(xyz[250:], rowvar=False)
print(covar.shape)
# Show the matrix elements corresponding to alpha carbon atoms
selection = traj.top.select("name CA")
# Select the correct Cartesian components of atomic coordinates
selection_xyz = np.array(
    [3*selection, 3*selection + 1, 3*selection + 2]).T.ravel()
plt.matshow(covar[selection_xyz][:, selection_xyz])
print(covar.min())
print(covar.max())
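
The reshaping and indexing conventions used above can be checked on a small synthetic array. The sketch below (random numbers instead of a trajectory, with hypothetical sizes) shows that column `3*i + c` of the flattened array holds Cartesian component `c` of atom `i`, that `np.cov(..., rowvar=False)` matches the explicit formula with the default `N - 1` normalization, and how the column indices of a subset of atoms are built.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_atoms = 50, 4
# Synthetic stand-in for traj.xyz with shape (n_frames, n_atoms, 3).
xyz3 = rng.normal(size=(n_frames, n_atoms, 3))
# Flatten to (n_frames, 3 * n_atoms); column 3*i + c is component c of atom i.
xyz = xyz3.reshape(n_frames, -1)

covar = np.cov(xyz, rowvar=False)

# The same matrix written out: center every column, then sum the outer
# products over frames and divide by N - 1 (the default in np.cov).
centered = xyz - xyz.mean(axis=0)
covar_manual = centered.T @ centered / (n_frames - 1)
print(np.allclose(covar, covar_manual))

# Column indices that belong to a subset of atoms (here atoms 0 and 2).
selection = np.array([0, 2])
selection_xyz = np.array(
    [3 * selection, 3 * selection + 1, 3 * selection + 2]).T.ravel()
print(selection_xyz)

# Sub-block of the covariance matrix for the selected atoms.
sub = covar[np.ix_(selection_xyz, selection_xyz)]
print(sub.shape)
```

`np.ix_` is used here as an alternative to the chained indexing `covar[selection_xyz][:, selection_xyz]`; both select the same sub-block.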

The diagonal of the covariance matrix contains the variances of the atomic positions. When taking the square root of the diagonal, one obtains the so-called root-mean-square fluctuation (RMSF) of the atomic positions. A high RMSF corresponds to more flexible atoms. To make the RMSF plot legible, the following example only visualizes the RMSF of the alpha carbon atoms.

In [None]:
# The diagonal elements corresponding to X, Y and Z coordinates
# of each atom are summed, such that one value per alpha
# carbon is retained.
rmsf = np.sqrt(np.diag(covar)[selection_xyz].reshape(-1, 3).sum(axis=1))
plt.close(2)  # This is needed to rerun the code cell correctly
fig, ax = plt.subplots(num=2)
ax.plot(rmsf)
ax.set_xlabel("Alpha carbon index")
ax.set_ylabel("RMSF [nm]")

A common result for this plot is that the termini are more flexible than the central part of the protein chain. Also flexible loops can be identified with this analysis. (Keep in mind that this example is based on a relatively short MD simulation, such that the results may not yet be converged.)

One can also compute the RMSF directly, without first computing the whole covariance matrix and then extracting only the diagonal elements, as shown in the following example.

In [None]:
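# ddof=1 selects the N - 1 normalization, which np.cov uses by default,
# so that both results agree exactly.
rmsf_alt = np.sqrt(np.var(xyz[250:], axis=0, ddof=1)[
                   selection_xyz].reshape(-1, 3).sum(axis=1))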
print(rmsf[:5])
print(rmsf_alt[:5])
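
When comparing the two routes to the RMSF, keep in mind that `np.cov` divides by `N - 1` by default, whereas `np.var` divides by `N` unless `ddof=1` is given. The following sketch, on synthetic data for two hypothetical atoms, shows the size of this difference. For long trajectories it is negligible, but it explains why the two results need not match to all digits.

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames = 20
# Two hypothetical atoms with three Cartesian coordinates each.
xyz = rng.normal(size=(n_frames, 6))

var_from_cov = np.diag(np.cov(xyz, rowvar=False))  # normalized by N - 1
var_default = np.var(xyz, axis=0)                  # normalized by N
var_ddof1 = np.var(xyz, axis=0, ddof=1)            # normalized by N - 1

# Sum the x, y and z variances per atom before taking the square root.
rmsf_cov = np.sqrt(var_from_cov.reshape(-1, 3).sum(axis=1))
rmsf_default = np.sqrt(var_default.reshape(-1, 3).sum(axis=1))

# The two RMSF estimates differ by a constant factor sqrt((N - 1) / N).
ratio = rmsf_default / rmsf_cov
print(ratio, np.sqrt((n_frames - 1) / n_frames))
```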

## 3. Diagonalization of the covariance matrix

The next step is to obtain the eigenvalues and eigenvectors, for which we use NumPy's [eigh](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eigh.html) function.

In [None]:
evals, evecs = np.linalg.eigh(covar)
# Print the 10 most significant eigenvalues.
print(evals[-10:])
plt.close(3)  # This is needed to rerun the code cell correctly
fig, ax = plt.subplots(num=3)
ax.semilogy(evals)
ax.set_xlabel("Eigenvalue index")
ax.set_ylabel("Eigenvalue [nm^2]")

The first six eigenvalues are practically zero and correspond to the frozen global rotation and translation. The remaining 1746 modes represent internal motions of the protein with increasing magnitude. Most of these modes also show hardly any motion, which is partially caused by the short simulation time.
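
The conventions of `np.linalg.eigh` matter for everything that follows: eigenvalues are returned in ascending order and the eigenvector belonging to `evals[i]` is the column `evecs[:, i]`. The sketch below illustrates this on a small synthetic data set (not the protein trajectory) and also computes the fraction of the total variance captured by each mode, a common way to judge how many modes are significant.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data: 200 "frames" of 5 coordinates with very different spreads.
data = rng.normal(size=(200, 5)) * np.array([3.0, 1.0, 0.3, 0.1, 0.01])
covar = np.cov(data, rowvar=False)

evals, evecs = np.linalg.eigh(covar)
# Reverse the order, so that index 0 is the most significant principal mode.
evals_desc = evals[::-1]
evecs_desc = evecs[:, ::-1]

# Fraction of the total variance captured by each mode.
explained = evals_desc / evals_desc.sum()
print(np.round(explained, 4))

# The eigenvectors are orthonormal and reconstruct the covariance matrix.
identity = evecs.T @ evecs
reconstructed = (evecs * evals) @ evecs.T
print(np.allclose(identity, np.eye(5)), np.allclose(reconstructed, covar))
```

The sum of all eigenvalues equals the trace of the covariance matrix, i.e. the total variance of all coordinates.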

## 3. Projection on a PCA mode

Once we have obtained the PCA modes (the eigenvectors), we can project the trajectory on each mode and follow the displacement along that mode as a function of time. The following cell plots the projections on the three most significant modes.

In [None]:
plt.close(4)  # This is needed to rerun the code cell correctly
fig, ax = plt.subplots(num=4)

# Compute the average geometry
av = xyz.mean(axis=0)
# Plot projections on the three most significant principal modes.
for i in range(3):
    proj = np.dot(xyz - av.reshape(1, -1), evecs[:, -i - 1])
    ax.plot(df["Time (ps)"], proj, label=f"Proj. {i}")
ax.set_xlabel("Time [ps]")
ax.set_ylabel("Displacement [nm]")
ax.legend(loc=0)

The [cosine content](https://doi.org/10.1103/PhysRevE.65.031910) is clearly visible in these projections, showing that the MD simulation is far too short to completely explore the conformational space. So far, the MD simulation essentially performed an undirected random walk in the conformational degrees of freedom of the protein.
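
The cosine content can also be quantified. The sketch below assumes the definition $c_i = \frac{2}{T} \left(\int_0^T \cos(i \pi t / T)\, p(t)\, dt\right)^2 \big/ \int_0^T p(t)^2\, dt$ for a projection $p(t)$, discretized on the midpoints of equally spaced frames. The function name `cosine_content` is hypothetical and the inputs are synthetic signals, not the projections computed above.

```python
import numpy as np


def cosine_content(proj, i=1):
    """Cosine content of a projection time series for cosine index i.

    The frames are assumed to be equally spaced in time, in which case
    the time step cancels out of the ratio.
    """
    proj = np.asarray(proj, dtype=float)
    n = len(proj)
    t = (np.arange(n) + 0.5) / n  # time rescaled to the interval (0, 1)
    overlap = np.dot(np.cos(i * np.pi * t), proj)
    return 2.0 / n * overlap**2 / np.dot(proj, proj)


n = 1000
t = (np.arange(n) + 0.5) / n
# A half cosine, the typical shape of the first mode of a random walk.
half_cosine = np.cos(np.pi * t)
# A fast oscillation, as expected for a well-sampled vibration.
fast_oscillation = np.cos(40 * np.pi * t)
# A one-dimensional random walk, centered on its mean.
rng = np.random.default_rng(4)
random_walk = np.cumsum(rng.normal(size=n))
random_walk -= random_walk.mean()

c_cos = cosine_content(half_cosine)
c_fast = cosine_content(fast_oscillation)
c_walk = cosine_content(random_walk)
print(c_cos, c_fast, c_walk)
```

A value close to one indicates that the projection looks like a half cosine, as expected for diffusive motion, whereas a value close to zero indicates that the mode is sampled repeatedly. For the projections in this notebook, one would call `cosine_content(proj, i + 1)` for the projection labeled `Proj. {i}`.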

## 4. Filtering the MD trajectory

One may visualize the motion in one or more modes by filtering the atomic positions, retaining only the displacements along the selected modes. To do so, one computes the projections on the selected modes and reconstructs the trajectory from these projections only.

In [None]:
# The filtering algorithm
xyz_filter = 0
# Motions along the three most significant modes are retained.
for i in 0, 1, 2:
    evec = evecs[:, -i - 1]
    proj = np.dot(xyz - av.reshape(1, -1), evec)
    xyz_filter += np.outer(proj, evec)
xyz_filter += av

# Finally, we cast the result back into an MDTraj trajectory and visualize it.
traj_filter = mdtraj.Trajectory(
    xyz_filter.reshape(-1, traj.n_atoms, 3), traj.topology)
view_filter = nglview.show_mdtraj(traj_filter)
view_filter.add_representation("ball+stick", selection="protein")
view_filter

A careful inspection of the filtered trajectory should reveal the following:

- The average structure (to which the filtered modes are added) does not always make chemical sense. For example, methyl hydrogens overlap when a methyl group is freely rotating. Conversely, this also means PCA becomes meaningless with extreme conformational changes, because then the average structure is no longer meaningful.

- Most of the motion is close to the termini and some side chains, while the majority of the protein remains locked in place.
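
The filtering loop is equivalent to an orthogonal projection on the subspace spanned by the selected modes. The sketch below writes this as two matrix products and checks two properties on synthetic data: retaining all modes recovers the original coordinates, and the variance along each mode equals the corresponding eigenvalue. The function `filter_modes` is a hypothetical helper, and the second property assumes that the average and the covariance are computed from the same frames.

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic stand-in for the flattened, aligned coordinates (frames x 3N).
xyz = rng.normal(size=(100, 6)) * np.array([2.0, 1.0, 0.5, 0.2, 0.1, 0.05])
av = xyz.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(xyz, rowvar=False))


def filter_modes(xyz, av, evecs, n_modes):
    """Keep only displacements along the n_modes most significant modes."""
    # Columns ordered from the largest to the smallest eigenvalue.
    modes = evecs[:, ::-1][:, :n_modes]
    # Projections on the selected modes, shape (n_frames, n_modes).
    proj = (xyz - av) @ modes
    # Back-transformation to Cartesian coordinates.
    return proj @ modes.T + av, proj


xyz_filter, proj = filter_modes(xyz, av, evecs, 2)
xyz_all, _ = filter_modes(xyz, av, evecs, evecs.shape[1])

# Variance along each retained mode versus the two largest eigenvalues.
proj_var = proj.var(axis=0, ddof=1)
print(proj_var, evals[::-1][:2])
print(np.allclose(xyz_all, xyz))
```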

## 5. Extrapolation of a PCA mode

One may also visualize individual PCA modes by simply performing a linear extrapolation of the mode relative to the average structure.

In [None]:
# The amplitude is taken from the maximum eigenvalue
# of the covariance matrix. This will always have the
# right order of magnitude.
amplitude = np.sqrt(evals.max())
# With evecs[:, -1], the last eigenmode,
# i.e. the first principal mode, is selected.
xyz_extra = np.outer(
    np.linspace(-amplitude, amplitude, 100), evecs[:, -1]) + av

# Again, we cast the result back into an MDTraj trajectory and visualize it.
traj_extra = mdtraj.Trajectory(
    xyz_extra.reshape(-1, traj.n_atoms, 3), traj.topology)
view_extra = nglview.show_mdtraj(traj_extra)
view_extra.add_representation("ball+stick", selection="protein")
view_extra

Also in this analysis, the individual frames are not necessarily realistic geometries.

## 6. Hinging mode in Adenylate Kinase

In the above examples, the principal modes were not informative because the MD simulation was too short. The villin headpiece is also not the best example to demonstrate global modes.

**<span style="color:#A03;font-size:14pt">
&#x270B; HANDS-ON! &#x1F528;
</span>**

> The protein structures [2RGX](https://www.rcsb.org/structure/2rgx) and [2RH5](https://www.rcsb.org/structure/2rh5) represent two different conformations of one and the same protein: Adenylate Kinase from *Aquifex aeolicus*. This is a phosphotransferase enzyme that catalyzes the interconversion of adenine nucleotides. Basically, it is able to convert two molecules of adenosine diphosphate (ADP) to one adenosine monophosphate (AMP) and one adenosine triphosphate (ATP), and vice versa. As such, it plays an important role in cellular energy homeostasis. 2RH5 describes the conformation of this enzyme in the absence of a substrate. 2RGX contains a substrate analogue (bis(adenosine)-5'-pentaphosphate) and hence can be considered to describe the conformation of the enzyme in the presence of a substrate. The main difference between both conformations resembles the action of Pac-Man: the enzyme opens up two lid regions in the absence of a ligand, and closes them once a ligand is present.
>
> ![pacman](pacman.png)
>
> 1. Load both structures with MDTraj, align one to the other and visualize both on top of each other with nglview.
>
> 2. Perform 2 nanoseconds of MD simulation using only one of the two states as the initial structure. Compute the RMSD as a function of time, using the two crystal structures as references. Also perform a PCA on this trajectory to identify and visualize the hinging mode.

When using PDB files from the RCSB Protein Data Bank, it is common that some minor parts (atoms, terminal residues, terminations, ...) are missing. These issues can be fixed with a tool called PDBFixer. General usage instructions can be found on the [PDBFixer](https://htmlpreview.github.io/?https://github.com/openmm/pdbfixer/blob/master/Manual.html) homepage.
It is also capable of a few other typical tasks, such as replacing non-standard residues or applying point mutations.

The following cell shows how to use PDBFixer in this case, with a for loop to fix both PDB files with the same code. PDBFixer downloads the PDB files, applies the requested fixes and writes out the corrected PDB files. All this will take a few seconds.

The fixed PDB files will be suitable for superposition (next code cell) and as starting points for the MD simulations, which you have to implement.

In [None]:
for pdbid in "2rgx", "2rh5":
    # Download the PDB file and load it into PDBFixer
    print("Loading and fixing", pdbid)
    fixer = PDBFixer(pdbid=pdbid)
    # Remove all chains except the first
    nchain = len(list(fixer.topology.chains()))
    fixer.removeChains(range(1, nchain))
    # Remove everything except water and protein
    fixer.removeHeterogens(True)
    # Fill in missing residues and atoms
    fixer.findMissingResidues()
    fixer.findMissingAtoms()
    fixer.addMissingAtoms()
    # Add hydrogens appropriate for pH 7.0
    fixer.addMissingHydrogens(7.0)
    with open(f"{pdbid}_fixed.pdb", "w") as f:
        PDBFile.writeFile(fixer.topology, fixer.positions, f)

The following code shows how to load the two PDB files as single-frame trajectories in MDTraj. They are aligned by minimizing the RMSD and then visualized with nglview.

In [None]:
# Load trajectories
traj1 = mdtraj.load('2rgx_fixed.pdb')
traj2 = mdtraj.load('2rh5_fixed.pdb')
# Superpose
traj2_aligned = traj2.superpose(traj1)
# View
view = nglview.show_mdtraj(traj1)
view.add_component(traj2_aligned)
view[0].clear_representations()
view[1].clear_representations()
view[0].add_cartoon(color='blue')
view[1].add_cartoon(color='red')
view