# C$\alpha$ - C$\alpha$ distance Kullback-Leibler (KL) divergence
----

----
## Introduction 

-------



### <span style="color:DarkRed"> Some details on trajectory/topology input
    


-------

#### Topology format

For this tutorial, we will load the trajectory as a .dcd file, using a pdb as the topology. 

*Rename chain ID*

If the system has multiple chains with different chain IDs (e.g. Chain A, Chain B, Chain C), the script will only compute distances *within* each chain.

To include distances between chains, we need to give all the chains the same chain ID, which we can do with PyMOL. 

*Renumber residues*

It is also useful to ensure that every residue has a unique number. For example, you might have: 

* Chain A: residues 1-100


* Chain B: residues 1-100


* Chain C: residues 1-100


**Instead we want the labels to look like:**

* Chain A: residues 1-300


----

To change the chain ID: 

**Script to do below steps:**   `Scripts/Alter_chain_id.pml`

Open PyMOL and set `retain_order` and `pdb_retain_ids`: 

**`PyMOL>set retain_order, 1`**

**`PyMOL>set pdb_retain_ids, 1`**

then: 

**`PyMOL>alter all, chain='A'`**

Then save a pdb. 

-----

To renumber the chain: 

The PyMOL [renumber](https://raw.githubusercontent.com/Pymol-Scripts/Pymol-script-repo/master/renumber.py) script can be used to renumber residues: `Scripts/renumber.py`

Open PyMOL and run the script. Then:

**`PyMOL>renumber chain A, 1`**

`Renumber: range (1 to 200)`
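
If PyMOL is not available, the same two changes can be sketched in plain Python by editing the fixed-width PDB columns directly. This is only an illustration and not the tutorial's own method: `merge_chains` is a hypothetical helper, it touches `ATOM`/`HETATM` records only (`TER` and other records are passed through unchanged), and it assumes fewer than 10000 residues.

```python
def merge_chains(pdb_lines, new_chain="A", start=1):
    """Give every atom record the same chain ID and number residues consecutively."""
    merged = []
    new_resnum = start - 1
    previous_key = None
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # PDB columns: 22 = chain ID, 23-26 = residue number, 27 = insertion code
            key = (line[21], line[22:27])
            if key != previous_key:
                new_resnum += 1
                previous_key = key
            # Residue numbers above 9999 would overflow the 4-character field
            line = line[:21] + new_chain + "%4d" % new_resnum + line[26:]
        merged.append(line)
    return merged


example = [
    "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00  0.00           C",
    "ATOM      2  CA  GLY A   2      12.560   7.134  -6.504  1.00  0.00           C",
    "ATOM      3  CA  ALA B   1      14.104   8.134  -6.504  1.00  0.00           C",
    "ATOM      4  CB  ALA B   1      15.104   8.134  -6.504  1.00  0.00           C",
]
merged = merge_chains(example)
for line in merged:
    print(line)
```

Atoms that share the original chain ID, residue number and insertion code stay in the same residue, so the two atoms of chain B residue 1 both end up in residue 3 of chain A.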
    



### <span style="color:DarkRed"> Overall workflow

-----------

To summarise the overall workflow:

1. Run simulations. 

2. Calculate distributions of all CA - CA distances for 2 or more systems.

3. Compute KL between the different systems. 

4. Plot the highest KL distances onto the structure in pymol, colour coded to show the range of KL. 

#### PART 1

Make this a separate script

In [None]:
# coding: utf-8
import scipy as sp
#from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import mdtraj as md
import sys
from sys import argv
import math
import os
get_ipython().magic(u'pylab inline')

In [None]:
input_system = 0

In [None]:
# system_list holds the folder names of the different systems.
system_list = ["0_system_A","1_system_B"]
# Input one or more trajectory names into filename_list.
filename_list = ["short_traj_aligned.dcd"]

topology_filename = "first_frame.pdb"

md_data = ["0_TRAJECTORIES"]

filename_list_1_traj = []
filename_list_1_pdb = []

# Make a list with all file locations of trajectory data
all_files_list = []

for i in range(0,len(system_list)):
    for j in range(0,len(filename_list)):
        filename_traj = "%s/%s/%s" % (md_data[0],system_list[i],filename_list[j])
        filename_list_1_traj.append(filename_traj)
        filename_pdb = "%s/%s/%s" % (md_data[0],system_list[i],topology_filename)
        filename_list_1_pdb.append(filename_pdb)


In [None]:
# Make a list of lists to separate file locations for each simulation.
input_files = []
for i in range(0,len(system_list)):
    inside_list = []
    for j in range(0,len(filename_list)):
        filenames = "%s/%s/%s" % (md_data[0],system_list[i],filename_list[j])
        inside_list.append(filenames)
    input_files.append(inside_list)
print (input_files)

In [None]:
for i in system_list:
    # Create the output folders for each system if they do not already exist
    for subfolder in ("CA_dist", "CA_raw_data"):
        path = "4_CA_DISTANCES/%s/OUTPUT/%s" % (i, subfolder)
        if not os.path.exists(path):
            os.makedirs(path)

In [None]:
print(input_files[input_system])

In [None]:
print(filename_list_1_traj)
print(filename_list_1_pdb)

In [None]:
outfile = "4_CA_DISTANCES/%s/OUTPUT" % system_list[input_system]
traj_input = filename_list_1_traj[input_system]
pdb_input = filename_list_1_pdb[input_system]
print(outfile)
print(traj_input)
print(pdb_input)

In [None]:
test = md.load_pdb(pdb_input)
top = test.topology

In [None]:
print(top)

In [None]:
traj = md.load_dcd(traj_input,top=pdb_input)

In [None]:
print(traj)

In [None]:
CA_contacts = md.compute_contacts(traj, contacts='all', scheme="ca")

contacts : array-like, ndim=2 or ‘all’

An array containing pairs of indices (0-indexed) of residues to compute the contacts between, or ‘all’. The string ‘all’ will select all pairs of residues separated by two or more residues (i.e. the i to i+1 and i to i+2 pairs will be excluded).
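
To make the `'all'` selection concrete, the sketch below builds the same kind of pair list by hand, assuming only what the description above states (pairs `i` to `i+1` and `i` to `i+2` are skipped). `all_residue_pairs` is a hypothetical helper, not part of mdtraj, and it ignores any chain or residue-type filtering that mdtraj applies.

```python
import itertools


def all_residue_pairs(n_residues, min_separation=3):
    """Residue index pairs (i, j) with j - i >= min_separation."""
    return [(i, j)
            for i, j in itertools.combinations(range(n_residues), 2)
            if j - i >= min_separation]


pairs = all_residue_pairs(6)
print(pairs)
print(len(pairs))
```

For `n` residues in a single chain this gives `(n - 3) * (n - 2) / 2` pairs, which is why the number of distances grows so quickly with system size.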

>To compute the contact distance between residue 0 and 10 and
>residues 0 and 11
>md.compute_contacts(t, [[0, 10], [0, 11]])

>> the itertools library can be useful to generate the arrays of indices

>> `group_1 = [0, 1, 2]`

>> `group_2 = [10, 11]`

>> `pairs = list(itertools.product(group_1, group_2))`

>> `print(pairs)`

>> `[(0, 10), (0, 11), (1, 10), (1, 11), (2, 10), (2, 11)]`

>> `md.compute_contacts(t, pairs)`



In [None]:
# CA_contacts[0] holds the distances (in nm): one row per snapshot, one column per residue pair
# CA_contacts[1] holds the two residue indices that make up each pair

In [None]:
distance_per_snapshot = CA_contacts[0]
indices_per_snapshot = CA_contacts[1]

In [None]:
print(indices_per_snapshot.T)
print(distance_per_snapshot)
# Each column corresponds to one residue pair (e.g. 0-3) and holds its distance over all snapshots

In [None]:
pair_1_distance = CA_contacts[0][:,0]
pair_1_atoms = CA_contacts[1][0]

In [None]:
print(pair_1_distance)
print(len(pair_1_distance))
print(pair_1_atoms)
print(len(pair_1_atoms))

In [None]:
print(outfile)

In [None]:
print(distance_per_snapshot.shape)

In [None]:
# Output files with all atom pairs, and with all distances vs all snapshots

np.savetxt("%s/ALL_atom_pairs.dat" % outfile , CA_contacts[1], fmt='%s')
#np.savetxt("%s/ALL_distances_per_snapshot.dat" % outfile, distance_per_snapshot, fmt='%.20f')

In [None]:
min_list = []
max_list = []
for i in range(distance_per_snapshot.shape[1]):
    min_list.append(min(distance_per_snapshot[:,i]))
    max_list.append(max(distance_per_snapshot[:,i]))
#print "min value: ",  min(min_list)
#print "max value: " , max(max_list)

In [None]:
min_array = np.array(min_list)
max_array = np.array(max_list)
min_max_array = np.vstack((min_list,max_list)).T

In [None]:
min_max_array_angstrom = min_max_array * 10
np.savetxt("%s/min_max_rawdata.dat" % outfile,min_max_array_angstrom)
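
The per-pair minimum and maximum can also be taken column-wise with NumPy instead of looping over pairs. The sketch below uses a small synthetic array in place of `distance_per_snapshot` (the values are made up), with rows as snapshots and columns as pairs, and converts from nm to Angstrom as above.

```python
import numpy as np

# Synthetic stand-in for distance_per_snapshot: 4 snapshots x 3 pairs, in nm
distances_nm = np.array([[0.50, 1.20, 2.00],
                         [0.55, 1.10, 2.30],
                         [0.45, 1.30, 2.10],
                         [0.60, 1.25, 1.90]])

# One (min, max) row per pair, converted from nm to Angstrom
min_max_angstrom = np.column_stack((distances_nm.min(axis=0),
                                    distances_nm.max(axis=0))) * 10
print(min_max_angstrom)
```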

### Bin range

Run the script up to here for each case in order to output bin ranges for each distance. 

### After running system 1 - restart the kernel and run system 2. 


For the KL calculation, we need the same bin range for each distance in every system, so we must determine the range to use for each distance before saving the distributions.

#### PART2 

Make this a separate script


In [None]:
_0_system_A_min_max = np.loadtxt("4_CA_DISTANCES/0_system_A/OUTPUT/min_max_rawdata.dat")
_1_system_B_min_max = np.loadtxt("4_CA_DISTANCES/1_system_B/OUTPUT/min_max_rawdata.dat")

In [None]:
min_col_sys_A = _0_system_A_min_max[:,0]
max_col_sys_A = _0_system_A_min_max[:,1]

min_col_sys_B = _1_system_B_min_max[:,0]
max_col_sys_B = _1_system_B_min_max[:,1]

In [None]:
MIN_cols = np.vstack((min_col_sys_A,min_col_sys_B)).T

In [None]:
MAX_cols = np.vstack((max_col_sys_A,max_col_sys_B)).T

In [None]:
global_mins = []
for i in MIN_cols:
    if i[0] < i[1]:
        global_mins.append(i[0])
    else: 
        global_mins.append(i[1])
        
global_maxs = []
for i in MAX_cols:
    if i[0] > i[1]:
        global_maxs.append(i[0])
    else: 
        global_maxs.append(i[1])

In [None]:
#global_mins = np.array(global_mins)
#global_maxs = np.array(global_maxs)

In [None]:
global_mins_int = []
for i in range(0,len(global_mins)):
    global_mins_int.append(int(global_mins[i]))
global_mins_arr = np.array(global_mins_int).clip(min=0)
#global_mins_arr = global_mins_arr.clip(min=0)

global_maxs_int = []
for i in range(0,len(global_maxs)):
    global_maxs_int.append(int(global_maxs[i]))
global_maxs_arr = np.array(global_maxs_int).clip(min=0)
#global_maxs_arr = global_maxs_arr.clip(min=0)

In [None]:
col1 = (global_mins_arr - 3).clip(min=0)
col2 = (global_maxs_arr + 3).clip(min=0)
min_max_arr_margin_int = np.vstack((col1,col2)).T
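
The steps of PART 2 (take the smaller minimum and larger maximum of the two systems, truncate to integers, then add a margin of 3 Angstrom on each side without going below zero) can be written compactly with NumPy. This is a sketch on made-up numbers, and `shared_bin_ranges` is a hypothetical helper name.

```python
import numpy as np


def shared_bin_ranges(min_max_a, min_max_b, margin=3):
    """Common integer (min, max) bin range per distance for two systems."""
    lowest = np.minimum(min_max_a[:, 0], min_max_b[:, 0])
    highest = np.maximum(min_max_a[:, 1], min_max_b[:, 1])
    # Truncate to integers, widen by the margin, and keep the range non-negative
    lower = (lowest.astype(int) - margin).clip(min=0)
    upper = (highest.astype(int) + margin).clip(min=0)
    return np.column_stack((lower, upper))


sys_a = np.array([[4.5, 6.0], [11.0, 13.0]])
sys_b = np.array([[2.2, 6.7], [12.1, 15.9]])
ranges = shared_bin_ranges(sys_a, sys_b)
print(ranges)
```

Because the maximum is truncated downwards before the margin is added, the margin needs to stay above 1 for the upper edge to remain safely beyond the largest observed distance.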

In [None]:
np.savetxt("4_CA_DISTANCES/global_min_max_array.dat", min_max_arr_margin_int, fmt='%d')

#### PART 3

Make a new script with PART 1 and PART 3

In [None]:
min_max_arr_margin_int  = np.loadtxt("4_CA_DISTANCES/global_min_max_array.dat")

In [None]:
MINBIN = min_max_arr_margin_int[:,0]
MAXBIN = min_max_arr_margin_int[:,1]

In [None]:
print(input_system)

In [None]:
for i in range(0,len(distance_per_snapshot[0,:])):
    dist_angstrom = distance_per_snapshot[:,i] * 10
    # load bin ranges from min_max file
    min_bin = MINBIN[i]
    max_bin = MAXBIN[i]
    #np.savetxt("%s/CA_raw_data/distance_%s_raw_data.dat" % (outfile,i) , distance_per_snapshot[:,i], fmt=['%.20f'])
    (n, bins) = np.histogram(dist_angstrom, bins = 100, range = (min_bin, max_bin), density=True)
    n = n / (sum(n))
    bincentre = 0.5*(bins[1:]+bins[:-1])
    index = np.linspace(1, len(bincentre), num = len(bincentre), dtype = int)
    total_bin_addition = 0.000001
    all_bins = len(bincentre)
    non_zero = np.count_nonzero(n)
    zero_bins = all_bins - non_zero
    if zero_bins != 0:
        bin_addition = total_bin_addition/float(zero_bins)
        # Adds the bin_addition amount into all zero-count bins
        for j in range(len(n)):
            if n[j] == 0.0:
                n[j] = bin_addition
    data = np.vstack((index,n)).T
    np.savetxt("%s/CA_dist/distance_%s_distribution.dat" % (outfile,i) , data, fmt=['%d','%.20f'])        
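
The histogramming step inside the loop can be isolated into a small function, which makes it easier to test on its own. The sketch below uses synthetic data, and `binned_distribution` is a hypothetical name. As in the loop above, a total of 0.000001 is shared between the empty bins so that no bin has zero probability, and the result is not renormalised afterwards.

```python
import numpy as np


def binned_distribution(values, bin_range, n_bins=100, total_fill=1e-6):
    """Normalised histogram in which empty bins share a small total probability."""
    counts, _ = np.histogram(values, bins=n_bins, range=bin_range)
    # Assumes at least one value falls inside bin_range (otherwise this divides by zero)
    prob = counts / float(counts.sum())
    empty = prob == 0.0
    if empty.any():
        prob[empty] = total_fill / empty.sum()
    return prob


rng = np.random.RandomState(0)
dist_angstrom = rng.normal(loc=12.0, scale=0.5, size=5000)
prob = binned_distribution(dist_angstrom, (8, 18))
print(prob.min(), prob.sum())
```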

-----

-----

### If the first system has now run: Restart to compute second system
----------

#### To restart the session after running 0_system_A, use the drop down menu at the top: 

  --> `Kernel` --> `Restart and clear output`

Then change `input_system = 1` and rerun all cells up to here, for the second system. 

-----

-----


### Calculating KL between two systems 

Once we have run the script above for the two different systems, we can do the KL calculation. 

In the folders `4_CA_DISTANCES/0_system_A/OUTPUT/CA_dist` and `4_CA_DISTANCES/1_system_B/OUTPUT/CA_dist`  we have a distribution for every CA distance for each case. 

We should also have files `4_CA_DISTANCES/0_system_A/OUTPUT/ALL_atom_pairs.dat` and `4_CA_DISTANCES/1_system_B/OUTPUT/ALL_atom_pairs.dat` which have the atom indices for each system's output.

Check that each output has the same atom indices:

`$ vimdiff 4_CA_DISTANCES/0_system_A/OUTPUT/ALL_atom_pairs.dat 4_CA_DISTANCES/1_system_B/OUTPUT/ALL_atom_pairs.dat`
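
The same check can be done from Python, which is convenient when there are tens of thousands of pairs. This is a sketch: `same_pairs` is a hypothetical helper, and the temporary files below only stand in for the two `ALL_atom_pairs.dat` files.

```python
import os
import tempfile

import numpy as np


def same_pairs(path_a, path_b):
    """Return True if both pair files hold identical index pairs in the same order."""
    pairs_a = np.loadtxt(path_a, dtype=int, ndmin=2)
    pairs_b = np.loadtxt(path_b, dtype=int, ndmin=2)
    return bool(np.array_equal(pairs_a, pairs_b))


# Small demonstration with temporary files standing in for ALL_atom_pairs.dat
tmp_dir = tempfile.mkdtemp()
file_a = os.path.join(tmp_dir, "pairs_a.dat")
file_b = os.path.join(tmp_dir, "pairs_b.dat")
np.savetxt(file_a, np.array([[0, 3], [0, 4], [1, 4]]), fmt="%s")
np.savetxt(file_b, np.array([[0, 3], [0, 4], [1, 4]]), fmt="%s")
print(same_pairs(file_a, file_b))
```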

----------

In the folder `Scripts`, there is a script `7.0_script_run_CACOOR_KL.sh`. 

Run this with the two system names and the number of distances as arguments, as follows:

In [None]:
number_of_distances = distance_per_snapshot.shape[1]
print(number_of_distances)

In [None]:
cd 4_CA_DISTANCES/

In [None]:
!bash ../Scripts/10_CA_KL.sh 0_system_A 1_system_B 39903
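
The KL script itself is not shown in this notebook, so the following is only a sketch of the textbook definition for two binned distributions over the same bins, KL(P||Q) = sum over bins of P_i * ln(P_i / Q_i). `kl_divergence` is a hypothetical helper, and the use of the natural log and of this direction (P relative to Q) are assumptions rather than a description of what the script does.

```python
import numpy as np


def kl_divergence(p, q):
    """Textbook KL divergence of p relative to q; both must be non-zero in every bin."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.5])
print(kl_divergence(p, q))
print(kl_divergence(p, p))
```

The definition breaks down if any bin of `q` is zero, which is consistent with the small value added to empty bins when the distributions were saved.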

In [None]:
print (CA_contacts[1][:,1])

In [None]:
#col1 and col2 are the two columns of RESIDUE numbers - which make each pair
# Want to make the same array but with atom numbers of CA atoms

column1 = CA_contacts[1][:,0]
column2 = CA_contacts[1][:,1]

column_1_atom_num = []
column_2_atom_num = []

for i in column1:
    # select() returns an array of matching atom indices; take the CA atom of residue i
    column_1_atom_num.append(int(top.select("name CA and resid %s" % i)[0]))

for i in column2:
    column_2_atom_num.append(int(top.select("name CA and resid %s" % i)[0]))

column_1_atom_array = np.array(column_1_atom_num)
column_2_atom_array = np.array(column_2_atom_num)

atom_number_pair_array = np.vstack((column_1_atom_array,column_2_atom_array)).T

In [None]:
print(atom_number_pair_array)

In [None]:
print(top.atom(2))

In [None]:
print ("CA_contacts[1] is all the pairs of residue numbers: ")
print (CA_contacts[1])
print ("atom_number_pair_array is all the pairs of CA atom indices")
print (atom_number_pair_array) 

In [None]:
print(outfile)

In [None]:
np.savetxt("%s/atom_pairs_ATOM_NUMBERS.dat" % outfile, atom_number_pair_array, fmt='%s')

In [None]:
KL_values = np.loadtxt("4_CA_DISTANCES/KL_OUTPUT/0_system_A_1_system_B/ALL_KL_CA.dat")

In [None]:
KL_values = KL_values.reshape(len(KL_values),1)

In [None]:
atom_indices_KLs = np.concatenate((atom_number_pair_array,KL_values),axis=1)

In [None]:
print(atom_indices_KLs)

### Sizes of KL values

To choose a lower cutoff for visualisation, the plot below shows the KL value for each distance, so that we can see where most of the values lie. 

We will therefore use a lower cutoff of around 13, so that only the high-KL distances are visualised. 

In [None]:
plt.plot(KL_values)
plt.xlabel("distance")
plt.ylabel("KL value")
print(max(KL_values))
print(min(KL_values))

In [None]:
np.savetxt("4_CA_DISTANCES/KL_OUTPUT/0_system_A_1_system_B/atoms_indices_KLs.dat",atom_indices_KLs, fmt='%d  %d  %0.10f')
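
Once the table of atom indices and KL values exists, keeping only the rows above the chosen cutoff is a short NumPy operation. The sketch below uses made-up rows in the same three-column layout (atom index, atom index, KL), and `top_kl_pairs` is a hypothetical helper.

```python
import numpy as np


def top_kl_pairs(atom_indices_kls, cutoff):
    """Rows with KL >= cutoff, sorted from highest to lowest KL."""
    table = np.asarray(atom_indices_kls, dtype=float)
    selected = table[table[:, 2] >= cutoff]
    order = np.argsort(selected[:, 2])[::-1]
    return selected[order]


example_table = np.array([[4, 40, 13.5],
                          [4, 52, 2.0],
                          [10, 90, 16.5]])
high_kl = top_kl_pairs(example_table, cutoff=13.0)
print(high_kl)
```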

We now have a file `4_CA_DISTANCES/KL_OUTPUT/0_system_A_1_system_B/atoms_indices_KLs.dat` containing the atom indices and KL value for each pair.

We now just need to set up testdistance10..pml properly to load the structure and input the distances. 

### Visualising the output of the KL calculation 
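
One possible way to prepare the PyMOL input is to generate a `distance` command for every pair above the cutoff and colour it by KL value. The sketch below is an assumption about how this could be done, not the contents of the tutorial's own `.pml` file: `pymol_distance_commands` is a hypothetical helper, the three colours are arbitrary, and the atoms are selected by PDB serial number (`id`), assuming serial numbers start at 1 so that they equal the 0-based atom index plus one.

```python
def pymol_distance_commands(rows, cutoff, colours=("yellow", "orange", "red")):
    """Build PyMOL commands drawing one coloured distance per high-KL pair."""
    selected = [row for row in rows if row[2] >= cutoff]
    if not selected:
        return []
    highest = max(row[2] for row in selected)
    span = (highest - cutoff) or 1.0
    commands = []
    for n, (atom_i, atom_j, kl) in enumerate(selected):
        # Map KL linearly onto the colour list, from cutoff (first) to highest (last)
        level = min(int((kl - cutoff) / span * len(colours)), len(colours) - 1)
        name = "kl_%d" % n
        # Assumption: PDB serial number = 0-based atom index + 1
        commands.append("distance %s, id %d, id %d" % (name, atom_i + 1, atom_j + 1))
        commands.append("set dash_color, %s, %s" % (colours[level], name))
    return commands


rows = [(4, 40, 13.5), (4, 52, 2.0), (10, 90, 16.5)]
commands = pymol_distance_commands(rows, cutoff=13.0)
print("\n".join(commands))
```

The resulting lines can be written to a text file and run in PyMOL after the structure has been loaded. If the PDB serial numbers do not follow the assumption above, the selections need to be adapted.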



---

#### _TODO_

_**Bin ranges**_

_This script is usually used to calculate distributions for tens or hundreds of thousands of distances._

_Since KL between two simulations requires the same bin range for a particular descriptor, we need to ensure that for distance 1 (for example), all simulations have the same bin range, and likewise for distance 2 ... distance 200k._

_Since some distances are large and some are small, using the same bin range for **everything** is not a great idea._

_Currently we just set a bin range, but this can be improved to check the binning for each distance._

---

#### Plan to improve binning: 

Run every system, and output: 

`distance_per_snapshot = CA_contacts[0]`, the distance per snapshot for all pairs 

and 

`indices_per_snapshot = CA_contacts[1]`, the two indices involved in each pair

e.g. 

`pair_1_distance = CA_contacts[0][:,0]`

`pair_1_atoms = CA_contacts[1][0]`


Then, once we have `distance_per_snapshot` for all simulations, we can load them all together to get the min/max for each distance over all the simulations run.



#### Min max

Make lists of min / max for each distance and output to the file `2_CA_DISTANCES/1_SSB/1_UCB1478733/OUTPUT/min_max_array.dat`.

After running the first system, select bin ranges for each distance by taking a reasonable margin on either side. This means we can output the histograms for each (since we have to load everything to get the min/max anyway).

Then, once this has been run for each system, we have the min and max for each distance and can check that the initially selected bin ranges are OK. 

If not, repeat histogramming. 
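
The final check in this plan could look something like the sketch below, which reports the distances whose observed min/max fall outside the bin range that was chosen for them. `pairs_outside_range` is a hypothetical helper and the numbers are made up.

```python
import numpy as np


def pairs_outside_range(min_max, bin_ranges):
    """Indices of distances whose observed min or max lies outside its bin range."""
    min_max = np.asarray(min_max, dtype=float)
    bin_ranges = np.asarray(bin_ranges, dtype=float)
    too_low = min_max[:, 0] < bin_ranges[:, 0]
    too_high = min_max[:, 1] > bin_ranges[:, 1]
    return np.where(too_low | too_high)[0]


observed = [[4.5, 6.0], [11.0, 19.5]]
chosen = [[0, 9], [8, 18]]
to_redo = pairs_outside_range(observed, chosen)
print(to_redo)
```

Only the distances returned here would need their bin range widened and their histograms recomputed.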
