*To run cells with code, `shift + enter`* 

*To restart the session, `Kernel -> Restart and clear output`*

# PCA analysis
----

----
## Introduction 

-------


### <span style="color:DarkRed"> Some details on trajectory/topology input</span>

-------

### C$\alpha$ coordinate PCA
#### Simulation input

We will input several simulations, each run with a different ligand, into the same PCA:

<img src='z_images/PCA_input.png' width="400">
<br>

The trajectories that we input into the PCA must all have the same topology, so they must first be processed to remove all water, ligands, etc.

For C$\alpha$ coordinate PCA, it is also important to align every frame of each trajectory accurately to the **same starting structure**. Depending on the system, it may make sense to align on the whole structure or on only part of it. In this case, all C$\alpha$ atoms were used for the alignment.

The `0_TRAJECTORIES` folder contains a file `first_frame.pdb`, which in each case is the same structure of the protein with ligands, ions and waters removed. Each trajectory can then be aligned to it with cpptraj (this has already been done for the tutorial). There are some example scripts in the folder `/Scripts/Traj_processing_scripts`.
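To make the alignment step concrete, here is a minimal numpy sketch of a rigid-body least-squares fit (the Kabsch algorithm) of one frame onto a reference. It is only an illustration of what an alignment does, not the cpptraj procedure used to prepare the tutorial data; the function name `superpose` and the synthetic coordinates are hypothetical.

```python
import numpy as np

def superpose(frame, reference):
    """Rigidly fit one frame (n_atoms, 3) onto a reference (n_atoms, 3)."""
    # Remove translation by centring both coordinate sets on their centroids.
    frame_c = frame - frame.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)
    # Kabsch algorithm: SVD of the 3x3 cross-covariance between the two sets.
    u, _, vt = np.linalg.svd(frame_c.T @ ref_c)
    # Guard against an improper rotation (a reflection).
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    # Rotate the centred frame and move it onto the reference centroid.
    return frame_c @ rotation + reference.mean(axis=0)

# Synthetic check: rotate and shift a made-up structure, then fit it back.
rng = np.random.default_rng(0)
reference = rng.normal(size=(20, 3))
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
moved = reference @ rot_z + np.array([5.0, -2.0, 1.0])
aligned = superpose(moved, reference)
rmsd = np.sqrt(((aligned - reference) ** 2).sum(axis=1).mean())
```

Applying the same fit to every frame, always against the same reference, removes overall rotation and translation so that the PCA picks up internal motion only.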

For this tutorial, we will use every 5th snapshot of a 1 $\mu$s simulation (40,000 frames, taken from the original 200,000-frame simulation).

Four simulations will be used: 

* 0_system_A (PDK1 with allosteric inhibitor 1F8 bound. PDB ID 3ORX.)


* 1_system_B (PDK1 with allosteric activator 2A2 bound. PDB ID 3ORZ.)


* 2_system_C (PDK1 with no allosteric ligand bound.)


* 3_system_D (PDK1 with allosteric activator J30 bound. PDB ID 3OTU.)


### <span style="color:DarkRed">Some background on PCA</span>
    
From a set of data $X_t$, which we obtain from the trajectory, we compute the covariance matrix:

\begin{equation}
C = (X - \mu)^T (X - \mu)
\end{equation}

and solve the eigenvalue problem:
\begin{equation}
C r_i = \sigma_i r_i
\end{equation}

$r_i$ are the principal components and $\sigma_i$ are their respective variances.

The input data $X_t$ can be something like C$\alpha$ coordinates, or backbone or sidechain torsions.
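The two equations above can be sketched directly in numpy. This is an illustration on made-up coordinates, not the tutorial script: the array `coords` is a hypothetical stand-in for a trajectory, and the covariance is divided by $N - 1$, which rescales the eigenvalues but leaves the principal components unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for a trajectory: 500 frames, 4 atoms, xyz each,
# with more spread along x than along y or z.
coords = rng.normal(size=(500, 4, 3)) * np.array([3.0, 1.0, 0.3])
X = coords.reshape(len(coords), -1)         # (n_frames, 3 * n_atoms)

mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / (len(X) - 1)    # covariance matrix

# eigh returns eigenvalues in ascending order, eigenvectors as columns.
sigma, r = np.linalg.eigh(C)
order = np.argsort(sigma)[::-1]             # re-order by decreasing variance
sigma, r = sigma[order], r[:, order]

Y = (X - mu) @ r                            # per-frame projection onto each PC
```

Column `r[:, 0]` is PC1, `sigma[0]` is its variance, and `Y[:, 0]` is the PC1 value of every frame.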

To compare different simulations, we carry out the dimensionality reduction on all available trajectories of the system together. In this case, we have 4 different simulations and input all of them into the same PCA. Because every system is then projected onto the same principal components, we can compare the highest-variance motions between the different systems.
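A minimal numpy sketch of this idea follows: the frames of all systems are concatenated, one set of principal components is computed, and each system is then projected separately onto those shared components. The data are randomly generated placeholders (200 frames of 6 features per system, each with a different offset); only the folder names are taken from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(2)
names = ["0_system_A", "1_system_B", "2_system_C", "3_system_D"]
# Hypothetical data: one (n_frames, n_features) array per system.
trajs = {name: rng.normal(loc=i, size=(200, 6)) for i, name in enumerate(names)}

# One PCA over the frames of all systems together.
X = np.concatenate([trajs[name] for name in names])
mu = X.mean(axis=0)
sigma, r = np.linalg.eigh(np.cov(X, rowvar=False))
r = r[:, np.argsort(sigma)[::-1]]           # columns sorted by decreasing variance

# Project each system onto the shared PC1 and PC2, using the global mean.
projected = {name: (trajs[name] - mu) @ r[:, :2] for name in names}
```

Because the mean and the components are shared, differences between the per-system projections reflect differences between the systems rather than differences in the coordinate frame of each PCA.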

**Reduction of dimensionality**

PCA therefore allows a high-dimensional dataset (i.e. the x, y and z coordinates of all C$\alpha$ atoms) to be reduced to a smaller number of dimensions, where the new set of dimensions should still account for a large amount of the variance.

<img src='z_images/PCA_dimensions.png' width="400" >


The output is a set of principal components, with Principal Component 1 (PC1) having the highest variance and subsequent PCs having decreasing variance.

As a guideline, we usually calculate the first 10 principal components, and we can check how much of the variance these first 10 PCs account for. 
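The variance check can be sketched as follows, assuming the percentage for each PC is its eigenvalue divided by the sum of all eigenvalues. The helper name `variance_explained` and the example spectrum are hypothetical.

```python
import numpy as np

def variance_explained(eigenvalues, n_components=10):
    """Return (% variance per leading PC, % variance of those PCs combined)."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    fractions = eigenvalues / eigenvalues.sum()
    leading = fractions[:n_components]
    return 100.0 * leading, 100.0 * leading.sum()

# Hypothetical eigenvalue spectrum with five components.
example_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
per_pc, total = variance_explained(example_eigenvalues, n_components=2)
```

For this made-up spectrum the first two components carry 50% and 30% of the variance, 80% in total.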

The following script selects a subset of residues (it excludes the terminal regions of the model protein) and carries out a C$\alpha$ coordinate PCA. 



### <span style="color:DarkRed">Output from the PCA</span>
    
The PCA output will also be used in calculations of MI (Mutual Information) or KL (Kullback-Leibler) divergence, so we need to output the PC values per snapshot for each system, as well as the distributions of these values.
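As an illustration of how per-snapshot PC values could feed such a comparison, the sketch below estimates the KL divergence between two sets of PC values from histograms on shared bins. The function name, the bin count and the pseudocount used to avoid empty bins are all assumptions made for this example, and the result depends on those choices; this is not the tutorial's own MI/KL code.

```python
import numpy as np

def kl_divergence(samples_p, samples_q, bins=50, pseudocount=1e-10):
    """Histogram estimate of KL(P || Q) from two 1D sets of PC values."""
    # Shared bin edges so both histograms are defined on the same support.
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    p, _ = np.histogram(samples_p, bins=bins, range=(lo, hi))
    q, _ = np.histogram(samples_q, bins=bins, range=(lo, hi))
    # Pseudocount (an assumption) keeps the logarithm finite for empty bins.
    p = p + pseudocount
    q = q + pseudocount
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical PC1 values for two systems.
rng = np.random.default_rng(3)
pc1_a = rng.normal(loc=0.0, size=5000)
pc1_b = rng.normal(loc=1.0, size=5000)
kl_same = kl_divergence(pc1_a, pc1_a)
kl_diff = kl_divergence(pc1_a, pc1_b)
```

Identical distributions give zero, and the value grows as the two distributions separate.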

The script will output several things: 


* Frames corresponding to the minimum and maximum values of PC1 and PC2 for each system.


* Per atom contribution to PC1 and PC2.


* Per snapshot value of PC1 and PC2 for each system.


* Distribution of values of PC1 and PC2 for each system.
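A rough sketch of how the frame-level outputs in this list can be derived from the per-snapshot PC values is given below. The array `pc1` is randomly generated placeholder data for one system, and the bin count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical PC1 values for one system, one value per snapshot.
pc1 = rng.normal(size=4000)

# Indices of the snapshots at the extremes of PC1.
frame_min = int(np.argmin(pc1))
frame_max = int(np.argmax(pc1))

# Normalised distribution of PC1 values.
density, edges = np.histogram(pc1, bins=40, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
```

The frame indices identify which structures to extract from the trajectory, and `centres` against `density` gives the distribution that can be plotted for each system.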

    
| Folder name | Allosteric ligand |
| :--- | :-: | 
| 0_system_A | Inhibitor 1F8 |
| 1_system_B | Activator 2A2 |
| 2_system_C | No ligand |
| 3_system_D | Activator J30 |



#### The following packages are required: 

In [1]:
import pyemma
import pyemma.coordinates as coor
from pyemma.coordinates import pca
import mdtraj as md
from pyemma.coordinates import load
from pyemma.coordinates import source
import numpy as np
import os
import matplotlib.pyplot as plt



#### The following cell will run the PCA analysis. 

The atom selection can be changed by editing the variable `atom_sel` in the script `Scripts/7.PCA_tutorial_CA_coordinates.py`. It takes an atom selection string in the format accepted by [mdtraj](http://mdtraj.org/latest/atom_selection.html "mdtraj Atom Selection").

<br><br>
Run the cell below with `shift + enter`: 

In [2]:
!python Scripts/7.PCA_tutorial_CA_coordinates.py

<mdtraj.Topology with 1 chains, 285 residues, 4676 atoms, 4732 bonds>

Input trajectory locations: 
['0_TRAJECTORIES/0_system_A/short_traj_aligned.dcd', '0_TRAJECTORIES/1_system_B/short_traj_aligned.dcd', '0_TRAJECTORIES/2_system_C/short_traj_aligned.dcd', '0_TRAJECTORIES/3_system_D/short_traj_aligned.dcd']

Obtaining file info: 100% (4/4) [##################################] eta 00:01 |
Data input to PCA as follows: 
trajectory length =  4000
number of dimensions =  483
number of trajectories = 4
total number of frames =  16000

Running PCA
calc mean+cov: 100% (160/160) [####################################] eta 00:00 |
Generating per system output
getting output of PCA: 100% (160/160) [############################] eta 00:01 |

Saving frames corresponding to the minimum and maximum values of PCA for each system to folder 2_PCA/CA_COOR_OUTPUT

Calculating variance for calculated PCs
Percentage variance PC1:  29.76155028286423
Percentage variance PC2:  18.757485506498064
Percentage var

---

#### Move to the output folder:

In [2]:
cd 2_PCA/CA_COOR_OUTPUT/

/home/t702348/lisa/X_PDK1_tutorial/0_Analysis/2_PCA/CA_COOR_OUTPUT


### Loading the PCA atom contribution values into a pymol session
---

In the cell below, or from a terminal window in the folder `2_PCA/CA_COOR_OUTPUT`, run the pymol script `../../Scripts/8_alter_B_factor.pml`

This will launch a pymol session with many different structures:

* Two structures, one for PC1 and one for PC2, coloured on a scale that indicates the per atom contribution to that Principal Component (PC).


* For each system, the structures corresponding to the minimum and maximum values of PC1 and PC2. 
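One common way to obtain a per atom value from a coordinate PC is to take the length of that atom's x, y, z components in the eigenvector. The sketch below assumes this definition and assumes the features are ordered x1, y1, z1, x2, y2, z2, and so on; the tutorial script may compute the contribution differently. The function name and the example vector are hypothetical.

```python
import numpy as np

def per_atom_contribution(eigenvector):
    """Length of each atom's (x, y, z) block in a coordinate PC."""
    xyz = np.asarray(eigenvector, dtype=float).reshape(-1, 3)
    return np.sqrt((xyz ** 2).sum(axis=1))

# Hypothetical PC over 3 atoms: atom 1 moves along x, atom 2 along y,
# atom 3 does not move.
pc = np.array([0.6, 0.0, 0.0,
               0.0, 0.8, 0.0,
               0.0, 0.0, 0.0])
contrib = per_atom_contribution(pc)
```

Values like these, one per atom, are the kind of quantity that can be written into a B-factor column and shown as a colour scale on the structure.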

<br>

<img src="z_images/PCA_colour_scale.png" width="500">


In [None]:
!pymol ../../Scripts/8_alter_B_factor.pml

<img src="z_images/PCA_pymol.png">