*To run cells with code, `shift + enter`* 

*To restart the session, `Kernel -> Restart and clear output`*

# Mutual Information (MI) analysis

----

## Introduction 
-------

### <span style="color:DarkRed"> Some details on the input data
    

-------
    
From the previous analysis, we will have generated data for different descriptors: 

* Torsions


* C$\alpha$ PCA 


* C$\alpha$ distances (all pairwise C$\alpha$-C$\alpha$ distances)

Normally we also include analysis for:

* Energy decomposition (all pairwise non-bonded energies) for ligand or substrate


* Specific distances (particular distances relevant to system)


* Torsion PCA

-------

This means we already have a lot of data that we can use for the MI calculation. 

At the moment, the script just takes two variables and computes MI. 

The next step is to extend it to compute MI for all possible pairs of variables, depending on the analysis we have already run: 


e.g. 
<br>
- Calculate MI between all pairs of torsions to find motions which are correlated


- Calculate all pairwise interaction energies with all torsions to find motions correlated to particular interactions of ligand or substrate


### <span style="color:DarkRed"> Some background on Information Theory.

---

Both KL divergence and MI come from [information theory](https://en.wikipedia.org/wiki/Information_theory "Information Theory"), and both are based on [information entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory) "Information Entropy"):


**Information entropy _H_ :**

Information entropy tells us how much "information" is present in a random variable. 

For example, the more uncertain or random a variable is, the more information it will contain. The less random the variable, the less information it contains. 


The definition of entropy used in information theory is directly analogous to the definition used in [statistical thermodynamics](https://en.wikipedia.org/wiki/Entropy_in_thermodynamics_and_information_theory). 

\begin{equation}
{\displaystyle H=-\sum _{i}p_{i}\log (p_{i})}
\end{equation}
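
As a small illustration of the formula above, the sketch below evaluates $H$ for a few hand-written probability distributions and from raw discrete samples. The function names `shannon_entropy` and `entropy_from_samples` are hypothetical (they are not taken from the MI script), and the natural logarithm is assumed, so the result is in nats.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Return H = -sum(p_i * log(p_i)) in nats; zero-probability states add nothing."""
    return sum(-p * math.log(p) for p in probabilities if p > 0)


def entropy_from_samples(samples):
    """Estimate H from the observed frequencies of discrete (or pre-binned) samples."""
    counts = Counter(samples)
    total = len(samples)
    return shannon_entropy(count / total for count in counts.values())


uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # four equally likely states
peaked = shannon_entropy([0.97, 0.01, 0.01, 0.01])   # one state dominates
certain = shannon_entropy([1.0, 0.0, 0.0, 0.0])      # the outcome never changes

print(f"uniform: {uniform:.3f}  peaked: {peaked:.3f}  certain: {certain:.3f}")
```

The uniform distribution gives the largest entropy and the fully determined one gives zero, which matches the statement that a more random variable carries more information.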


**KL Divergence _D<sub>KL</sub>_ :**

The KL calculations highlighted differences between different systems (e.g. activator vs. inhibitor). KL divergence is also known as "Relative entropy":

\begin{equation}
D_{KL} (P\|Q)=\sum _{i}P(i)\,\log {\frac {P(i)}{Q(i)}}
\end{equation}
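
The KL formula can be sketched in a few lines for two discrete distributions defined on the same bins. The function name `kl_divergence` and the example distributions are illustrative assumptions, not output from the earlier KL analysis. The convention that empty bins of $P$ contribute nothing, and that the divergence is infinite when $Q$ is zero where $P$ is not, is the standard one.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions on the same bins."""
    if len(p) != len(q):
        raise ValueError("P and Q must be defined on the same bins")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if q_i == 0:
            return math.inf  # P has weight where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical binned distributions for two systems
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(f"D_KL(P||Q) = {d_pq:.3f}, D_KL(Q||P) = {d_qp:.3f}")
```

Note that the two directions give different values: KL divergence is not symmetric, so it matters which system is taken as the reference.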



MI calculations will determine whether two particular descriptors are correlated within one system. For example, in the activated system, do the interaction energies of the ligand correlate with particular torsions or particular principal components? 

<br>

**Mutual information _I(X;Y)_ :** 

\begin{equation}
I(X;Y)=\sum _{y\in Y}\sum _{x\in X}p(x,y)\log {\left({\frac {p(x,y)}{p(x)\,p(y)}}\right)}
\end{equation}

where: 

$p(x,y)$ is the joint probability distribution of the two variables X and Y 

$p(x)$ and $p(y)$ are the marginal probability distributions of the two variables X and Y

MI can also be defined as follows:

\begin{equation}
{\begin{aligned}I(X;Y)&{}\equiv \mathrm {H} (X)-\mathrm {H} (X|Y)\\&{}\equiv \mathrm {H} (Y)-\mathrm {H} (Y|X)\\&{}\equiv \mathrm {H} (X)+\mathrm {H} (Y)-\mathrm {H} (X,Y)\\&{}\equiv \mathrm {H} (X,Y)-\mathrm {H} (X|Y)-\mathrm {H} (Y|X)\end{aligned}}
\end{equation}
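
As a short check that the entropy form agrees with the double sum, split the logarithm and use the marginalisation identities $\sum_{y} p(x,y) = p(x)$ and $\sum_{x} p(x,y) = p(y)$:

\begin{equation}
{\begin{aligned}I(X;Y)&=\sum _{x,y}p(x,y)\log p(x,y)-\sum _{x,y}p(x,y)\log p(x)-\sum _{x,y}p(x,y)\log p(y)\\&=-\mathrm {H} (X,Y)+\mathrm {H} (X)+\mathrm {H} (Y)\end{aligned}}
\end{equation}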




The script will compute MI as:
\begin{equation}
I(X;Y)= H(X) + H(Y) - H(X,Y)
\end{equation}
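
A minimal sketch of this entropy-based calculation is given below. It is not the script itself: the function names, the equal-width binning between the minimum and maximum of each series, and the synthetic Gaussian test data are all assumptions made for illustration. Only the standard library is used; for long trajectories something like `numpy.histogram2d` would be the more usual choice.

```python
import math
import random
from collections import Counter


def bin_indices(values, n_bins):
    """Assign each value to one of n_bins equal-width bins spanning its range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant series
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Plug-in Shannon entropy (nats) from the frequencies of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y, bins_x, bins_y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two time series of equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of frames")
    bx = bin_indices(x, bins_x)
    by = bin_indices(y, bins_y)
    joint = list(zip(bx, by))  # one (bin_x, bin_y) label per frame
    return entropy(bx) + entropy(by) - entropy(joint)


# Synthetic data (assumption): y_coupled follows x, y_independent does not
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y_coupled = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]
y_independent = [rng.gauss(0.0, 1.0) for _ in range(5000)]

mi_coupled = mutual_information(x, y_coupled, 10, 10)
mi_independent = mutual_information(x, y_independent, 10, 10)
print(f"coupled: {mi_coupled:.3f}  independent: {mi_independent:.3f}")
```

The independent pair does not give exactly zero, because any finite sample produces a small positive MI.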

## <span style="color:DarkRed"> Overall workflow

Summary of the overall workflow:

<img src="z_images/MI_calculation.png" width="800" >

"**Step 2.**" can also be another geometric feature, if you want to test for correlation between two different distances/angles, for example.

### <span style="color:DarkRed">  Accounting for noise

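To check that our signal is "real" and not just an artefact of finite sampling, we can apply a correction. We take one of the two variables for which we want to compute MI, shuffle it in time, and compute MI again. Shuffling destroys any genuine correlation between the two variables but keeps each variable's own distribution, so the MI that remains estimates the spurious contribution from finite sampling. 

We can then subtract this "not true" MI from the MI we obtained from the original data. 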

<br>
\begin{equation}
\mathrm{MI}_{\text{corrected}} = \mathrm{MI}_{\text{raw data}} - \mathrm{MI}_{\text{randomised data}}
\end{equation}

<br>

<img src="z_images/MI_randomise.png" width="500" >
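
The correction can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual code: the helper names are hypothetical, the binning is equal-width, and the `n_shuffles` parameter (averaging the baseline over several shuffles) is an addition of this sketch. Setting `n_shuffles=1` corresponds to the single randomisation described above.

```python
import math
import random
from collections import Counter


def _bin(values, n_bins):
    """Equal-width bin index for each value (assumed binning scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y, bins_x, bins_y):
    bx, by = _bin(x, bins_x), _bin(y, bins_y)
    return _entropy(bx) + _entropy(by) - _entropy(list(zip(bx, by)))


def corrected_mi(x, y, bins_x, bins_y, n_shuffles=5, seed=0):
    """MI of the raw data minus the mean MI after shuffling y in time."""
    rng = random.Random(seed)
    raw = mutual_information(x, y, bins_x, bins_y)
    shuffled_y = list(y)  # copy so the caller's data is left untouched
    baseline = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled_y)  # breaks the time pairing, keeps the marginal
        baseline += mutual_information(x, shuffled_y, bins_x, bins_y)
    return raw - baseline / n_shuffles


# Synthetic data (assumption): one coupled pair and one independent pair
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y_coupled = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]
y_independent = [rng.gauss(0.0, 1.0) for _ in range(5000)]

corr_coupled = corrected_mi(x, y_coupled, 10, 10)
corr_independent = corrected_mi(x, y_independent, 10, 10)
print(f"coupled: {corr_coupled:.3f}  independent: {corr_independent:.3f}")
```

For the independent pair the corrected value should sit close to zero, while the coupled pair keeps most of its raw MI.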

<br>

### <span style="color:DarkRed">  Selecting the correct number of bins

An appropriate number of bins to use for calculating probability distributions for MI depends on the size of the dataset.

We can compute the MI between two variables for a range of bin numbers, and choose the number of bins which maximises the "real" MI we obtain. 

For the two sets of data above, MI<sub>(raw data)</sub> and MI<sub>(randomised data)</sub> are computed for a range of bin numbers, which gives MI<sub>(corr)</sub> as a function of the number of bins. We can then choose the number of bins which corresponds to the maximum value of MI<sub>(corr)</sub>.




<img src="z_images/MI_bins.png" width="600" >
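
The bin scan described above could look roughly like this. The sketch assumes the same number of bins for both variables (the script accepts them separately), a single shuffle shared across all bin numbers, and hypothetical function names and candidate values.

```python
import math
import random
from collections import Counter


def _bin(values, n_bins):
    """Equal-width bin index for each value (assumed binning scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y, bins_x, bins_y):
    bx, by = _bin(x, bins_x), _bin(y, bins_y)
    return _entropy(bx) + _entropy(by) - _entropy(list(zip(bx, by)))


def scan_bins(x, y, candidates, seed=0):
    """Return (best_n_bins, {n_bins: corrected MI}) over the candidate bin numbers."""
    shuffled_y = list(y)
    random.Random(seed).shuffle(shuffled_y)  # one time-randomised copy of y
    corrected = {}
    for n_bins in candidates:
        raw = mutual_information(x, y, n_bins, n_bins)
        noise = mutual_information(x, shuffled_y, n_bins, n_bins)
        corrected[n_bins] = raw - noise
    best = max(corrected, key=corrected.get)  # bin number with the largest corrected MI
    return best, corrected


# Synthetic coupled data (assumption)
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]

candidates = [2, 5, 10, 20, 50, 100]
best, corrected = scan_bins(x, y, candidates)
for n_bins in candidates:
    print(f"{n_bins:4d} bins  corrected MI = {corrected[n_bins]:.3f}")
print("selected number of bins:", best)
```

With very few bins the distributions are too coarse to capture the correlation, and with very many bins the randomised MI grows because each bin holds few samples, so the corrected MI is expected to peak somewhere in between.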

---

### To run the script

Pass the script the paths to the two data sets, followed by the number of bins to use for each data set.

`script , variableA , variableB , binsA , binsB = argv`

e.g.  `$ MI_calc.py PC1_raw_data_system_A.dat raw_data_psi_1.dat 100 100`

This tells us whether PC1 and the first psi torsion are correlated. 

In [None]:
!python Scripts/10_MI_calc.py **filepathA** **filepathB** **binsA** **binsB**
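
For orientation, the sketch below shows one way the four command-line arguments could be unpacked and a data series read in. It is a guess at the general shape, not the contents of the script: `parse_cli` and `read_series` are hypothetical names, and the assumed file layout (one frame per line, value in the last whitespace-separated column, comment lines starting with `#` or `@`) may not match the real `.dat` files.

```python
def parse_cli(argv):
    """Unpack argv in the order: script, variableA, variableB, binsA, binsB."""
    if len(argv) != 5:
        raise SystemExit("usage: MI_calc.py variableA variableB binsA binsB")
    _script, path_a, path_b, bins_a, bins_b = argv
    return path_a, path_b, int(bins_a), int(bins_b)  # bin numbers arrive as strings


def read_series(lines):
    """Read the last column of each data line as a float.

    Accepts any iterable of strings, including an open file handle.
    Blank lines and lines starting with '#' or '@' are skipped (assumed layout).
    """
    series = []
    for line in lines:
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        series.append(float(line.split()[-1]))
    return series


# Same arguments as the example command above
args = parse_cli(["MI_calc.py", "PC1_raw_data_system_A.dat", "raw_data_psi_1.dat", "100", "100"])
print(args)

# Inline stand-in for a data file (hypothetical contents)
demo_lines = ["# frame value", "0 1.5", "", "1 -2.25"]
print(read_series(demo_lines))
```

With a real file, `read_series(open(path_a))` would return the time series for the first variable under the same layout assumption.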

### Future work
We plan to calculate MI for all combinations of variables, for example each pairwise interaction energy against each dihedral, to find which regions show correlated motions or interactions. 

Input could be two different variables: 

* Torsions with interaction energies. 


Or both the same: 

* All torsions vs all torsions (to find motions which are correlated).

MI will then be visualised as a plot, as below.

<img src="z_images/MI_plan.png" width="500">
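
One possible shape for the all-pairs calculation is sketched below. Since this part is still planned, everything here is an assumption: the function name `pairwise_mi`, the use of a single bin number for every variable, the raw (uncorrected) MI, and the synthetic series labelled `phi_1`, `psi_1` and `phi_2`.

```python
import math
import random
from collections import Counter
from itertools import combinations_with_replacement


def _bin(values, n_bins):
    """Equal-width bin index for each value (assumed binning scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y, bins_x, bins_y):
    bx, by = _bin(x, bins_x), _bin(y, bins_y)
    return _entropy(bx) + _entropy(by) - _entropy(list(zip(bx, by)))


def pairwise_mi(series, n_bins):
    """Return {(name_a, name_b): MI} for every pair of named series, self-pairs included."""
    table = {}
    for a, b in combinations_with_replacement(list(series), 2):
        value = mutual_information(series[a], series[b], n_bins, n_bins)
        table[(a, b)] = table[(b, a)] = value  # MI is symmetric, so fill both entries
    return table


# Synthetic stand-ins for torsion time series (assumption)
rng = random.Random(1)
n_frames = 3000
phi_1 = [rng.gauss(0.0, 1.0) for _ in range(n_frames)]
series = {
    "phi_1": phi_1,
    "psi_1": [v + 0.2 * rng.gauss(0.0, 1.0) for v in phi_1],  # coupled to phi_1
    "phi_2": [rng.gauss(0.0, 1.0) for _ in range(n_frames)],  # independent
}

table = pairwise_mi(series, n_bins=10)
names = list(series)
for a in names:
    print(a, "  ".join(f"{table[(a, b)]:.3f}" for b in names))
```

The resulting table can be passed to any heat-map plotting routine. The diagonal entries equal the entropy of each binned variable, so they are usually the largest values in each row.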

This could be particularly useful for allostery, as it could help to find correlated motions or interactions which involve the orthosteric site and the allosteric site. 