[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kamerlinlab/KIF/blob/main/tutorials/network_analysis_tutorial/Step1_Tutorial_PTP1B_Network_Analysis.ipynb)

### Tutorial for Performing Network Analysis on Simulations of PTP1B. 

In this Jupyter notebook we will use the network_analysis.py module on the dataset generated for the enzyme PTP1B to generate per-residue correlation and distance matrices, which can serve as inputs to many different graph theory methods. In our manuscript, we used these matrices as inputs for weighted implementation of suboptimal paths (WISP) calculations to study the allosteric communication pathways present in PTP1B. The R script used to perform these analyses and a Python script that generates a PyMOL-compatible visualisation of the WISP results are also included. 

Therefore after running this notebook you can:
- Run the R script (Step2_Run_WISP.r) to perform the WISP calculation.
- If you want to visualise the results in PyMOL instead of VMD, you can then run the Python script (Step3_Gen_Pymol_Visuals.py).

(Note that the outputs of this notebook are already provided in the folder "WISP_Inputs", so you can skip this notebook if you just want to look at WISP.)

**Some additional Points:**
- The dataset used herein is the same as what was used in the manuscript. 
- This approach does not require labels (i.e. unsupervised datasets are fine). 
- If you do have class labels and want to study the differences between classes (for example, to study alterations in allosteric networks in different conformational states), then you could first split your trajectory frames by class and perform the analysis for each conformation separately. Here, however, we had success combining snapshots from all states of PTP1B (so both the closed and open WPD-loop conformations). Please see the manuscript for further details. 
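
If you do want to analyse each conformational state separately, the frame splitting mentioned above can be done with plain pandas before any KIF class is used. The following is a minimal sketch on toy data; the `frame_labels` series and the state names are hypothetical and are not part of the tutorial dataset.

```python
import pandas as pd

# Toy stand-in for a per-frame contact dataframe (rows = frames, columns = contacts).
contacts = pd.DataFrame({
    "1Glu 4Lys Hbond bb-sc": [11.8, 11.6, 7.7, 7.0],
    "2Met 5Glu Hbond bb-bb": [5.0, 3.6, 4.1, 4.8],
})

# Hypothetical per-frame class labels, one per row of `contacts`.
frame_labels = pd.Series(["Closed", "Closed", "Open", "Open"], name="state")

# Build one dataframe per state; each can then be analysed on its own.
per_state = {
    state: frames.reset_index(drop=True)
    for state, frames in contacts.groupby(frame_labels)
}

for state, frames in per_state.items():
    print(state, frames.shape)
```

Each entry of `per_state` could then be passed through the same steps as the combined dataset used in this notebook.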


<center><img src="https://raw.githubusercontent.com/kamerlinlab/KIF/main/tutorials/miscellaneous/ptp1b_banner.png" alt="Drawing" style="width: 70%" /></center>

### Setup

Install and load the required modules, then download the dataset we'll be working on from Google Drive.

In [1]:
%pip install KIF

Note: you may need to restart the kernel to use updated packages.


In [2]:
from key_interactions_finder import pycontact_processing
from key_interactions_finder import data_preperation
from key_interactions_finder import network_analysis

We will first need to download the PTP1B dataset from Google Drive. 
The tutorial data will be saved in the relative path defined by "save_dir" in the cell block below.

You can change this as you see fit. If you want to use the current directory you can do:

save_dir=""

In [3]:
from key_interactions_finder.utils import download_prep_tutorial_dataset

drive_url = r"https://drive.google.com/file/d/1nbK3fw7z1hDXiGINZe-VT1SGdPbNhF07/view?usp=share_link"
save_dir = "tutorial_datasets"

download_prep_tutorial_dataset(drive_url=drive_url, save_dir=save_dir)

Downloading...
From: https://drive.google.com/uc?id=1nbK3fw7z1hDXiGINZe-VT1SGdPbNhF07
To: c:\Users\Rory Crean\Desktop\Github\key-interactions-finder\tutorials\network_analysis_tutorial\tutorial_datasets\tutorial_dataset.zip
100%|██████████| 158M/158M [00:05<00:00, 27.0MB/s] 


Tutorial files were successfully downloaded and unzipped.


### Step 1. Process PyContact files with the pycontact_processing.py module 

In this section we will work with the PyContact output files generated. 
Here we will merge our separate runs together and remove any false interactions that can be generated by the PyContact library. 
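
To give a feel for what merging runs "horizontally" could look like, here is a minimal pandas sketch. It assumes that a horizontal merge means each file holds different contact columns for the same set of frames; that reading is an assumption made for illustration, and the toy dataframes below are not the tutorial files.

```python
import pandas as pd

# Two toy "blocks": the same frames (rows) but different contacts (columns).
block1 = pd.DataFrame({"1Glu 4Lys Hbond bb-sc": [11.8, 11.6, 7.7]})
block2 = pd.DataFrame({"291Lys 296Glu Saltbr sc-sc": [4.1, 6.2, 10.4]})

# Blocks placed side by side must describe the same number of frames.
if len(block1) != len(block2):
    raise ValueError("All blocks must contain the same number of frames.")

# Column-wise concatenation: frames stay as rows, features accumulate as columns.
merged = pd.concat([block1, block2], axis=1)

print(merged.shape)
```

The real merging (and the removal of false interactions) is handled for you by the PyContactInitializer class in the next cell.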

In [4]:
pycontact_files_horizontal = ["PyContact_Per_Frame_Interactions_Block1.csv", "PyContact_Per_Frame_Interactions_Block2.csv",
                              "PyContact_Per_Frame_Interactions_Block3.csv", "PyContact_Per_Frame_Interactions_Block4.csv",
                              "PyContact_Per_Frame_Interactions_Block5.csv", "PyContact_Per_Frame_Interactions_Block6.csv",
                              "PyContact_Per_Frame_Interactions_Block7.csv", "PyContact_Per_Frame_Interactions_Block8.csv",
                              "PyContact_Per_Frame_Interactions_Block9.csv", "PyContact_Per_Frame_Interactions_Block10.csv",
                              "PyContact_Per_Frame_Interactions_Block11.csv", "PyContact_Per_Frame_Interactions_Block12.csv",
                              "PyContact_Per_Frame_Interactions_Block13.csv", "PyContact_Per_Frame_Interactions_Block14.csv",
                              "PyContact_Per_Frame_Interactions_Block15.csv", "PyContact_Per_Frame_Interactions_Block16.csv",
                              "PyContact_Per_Frame_Interactions_Block17.csv", "PyContact_Per_Frame_Interactions_Block18.csv",
                              "PyContact_Per_Frame_Interactions_Block19.csv", "PyContact_Per_Frame_Interactions_Block20.csv",
                              ]

# Define where the downloaded tutorial files are located. 
in_dir = save_dir + r"/PTP1B_data/example_horizontal_data"

pycontact_dataset = pycontact_processing.PyContactInitializer(
    pycontact_files=pycontact_files_horizontal,
    multiple_files=True,
    merge_files_method="horizontal",  
    remove_false_interactions=True,
    in_dir=in_dir,
)

Your PyContact file(s) have been succefully processed.
You have 1886 features and 10000 observations.
The fully processed dataframe is accesible from the '.prepared_df' class attribute.


In [5]:
# As outputted above, we can inspect the newly prepared dataset by accessing the '.prepared_df' class attribute as follows:
pycontact_dataset.prepared_df

Unnamed: 0,1Glu 241Ser Hbond sc-bb,1Glu 240Pro Hbond sc-bb,1Glu 4Lys Hbond bb-sc,1Glu 5Glu Hbond bb-bb,1Glu 6Phe Other bb-bb,2Met 240Pro Hbond sc-bb,2Met 5Glu Hbond bb-bb,2Met 245Ile Hbond sc-sc,2Met 270Tyr Other bb-sc,2Met 6Phe Hbond bb-bb,...,287Gln 296Glu Other sc-sc,286Val 293Leu Hydrophobic sc-sc,292Glu 298Leu Other sc-sc,290Trp 296Glu Other bb-sc,291Lys 298Leu Hbond sc-bb,288Asp 297Asp Other bb-sc,294Ser 298Leu Other bb-sc,288Asp 298Leu Other sc-sc,287Gln 298Leu Other bb-sc,289Gln 298Leu Other bb-sc
0,2.79062,2.22760,11.75476,3.85765,0.01413,0.91267,4.97278,0.83832,2.71548,3.32993,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.48888,0.15072,11.56569,2.01823,0.00000,1.89134,3.61045,0.08871,1.04626,3.49882,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2.65394,0.11149,7.74254,3.23310,0.00000,0.29681,4.05759,0.01007,1.47758,2.89084,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.67279,0.26855,7.00332,1.61461,0.00000,3.11450,4.76641,0.15005,1.11730,2.75300,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.73688,0.23481,5.10867,1.28357,0.00000,1.68206,3.98947,1.06603,2.91757,3.42521,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,5.88591,0.01018,9.95938,3.59306,0.00000,1.78412,5.69829,1.32305,1.11285,3.48554,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9996,8.34367,0.01030,9.86854,2.94537,0.00000,0.16410,4.74343,2.71557,1.36727,3.43449,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9997,9.10518,0.02668,6.73513,1.70116,0.00000,0.09681,3.34138,1.84982,0.99641,3.39886,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9998,2.33875,0.00000,8.18129,2.79464,0.00000,0.64758,5.16315,2.49588,1.48416,3.21917,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Step 2 (Optional). Prepare the Dataset for Network Analysis with the data_preperation.py module. 

In this step, we will generate an instance of the UnsupervisedFeatureData class to prepare the dataset for the network analysis. We will use this instance to perform some (optional) filtering to limit the features we make use of in our analysis. 

If you want to skip this step, you can simply take the dataframe produced in the prior step (pycontact_dataset.prepared_df) and use that in Step 3. I would, however, recommend filtering out features with a low % occupancy, as we will do in this step. 

In [6]:
# First we generate an instance of the UnsupervisedFeatureData class (because we are not using per frame class labels).
unsupervised_dataset = data_preperation.UnsupervisedFeatureData(
    input_df=pycontact_dataset.prepared_df,
)

# See the attributes available to the class
unsupervised_dataset.__dict__.keys()

dict_keys(['input_df', 'df_filtered', 'df_processed'])

As we can see from above, our instance of the UnsupervisedFeatureData class has three attributes: 'input_df', 'df_filtered' (currently empty) and 'df_processed'. Once we have used a filtering method in the next code block, 'df_filtered' will be populated and we will be able to use it in Step 3.
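
As a rough picture of what an occupancy filter does, the sketch below keeps only the contacts that are present in at least a given percentage of frames. It assumes a contact counts as "present" in a frame when its score is non-zero and that the threshold is inclusive; both are assumptions for illustration rather than a description of KIF's internals, and the column names are made up.

```python
import pandas as pd

# Toy per-frame contact scores (rows = frames, columns = contacts).
contacts = pd.DataFrame({
    "always_on": [2.8, 1.5, 2.7, 0.7],
    "half_on": [0.0, 0.0, 1.8, 3.1],
    "rarely_on": [0.0, 0.0, 0.0, 0.4],
})

min_occupancy = 50  # percent of frames in which a contact must be present

# Assumption: a contact is "present" in a frame when its score is non-zero.
occupancy = (contacts != 0).mean() * 100

# Keep only the columns whose occupancy reaches the threshold.
kept = contacts.loc[:, occupancy >= min_occupancy]

print(occupancy.to_dict())
print(list(kept.columns))
```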

In [7]:
# In this case, I am going to remove features/contacts that are not present for at least 50% of the simulation time.
unsupervised_dataset.filter_by_occupancy(min_occupancy=50)
unsupervised_dataset.filter_by_interaction_type(
    interaction_types_included=["Hbond", "Saltbr", "Hydrophobic", "Other"])

print(f"Number of features before filtering: {len(unsupervised_dataset.input_df.columns)}")
print(f"Number of features after filtering: {len(unsupervised_dataset.df_filtered.columns)}")
unsupervised_dataset.df_filtered

Number of features before filtering: 1886
Number of features after filtering: 986


Unnamed: 0,1Glu 241Ser Hbond sc-bb,1Glu 4Lys Hbond bb-sc,1Glu 5Glu Hbond bb-bb,2Met 240Pro Hbond sc-bb,2Met 5Glu Hbond bb-bb,2Met 245Ile Hbond sc-sc,2Met 270Tyr Other bb-sc,2Met 6Phe Hbond bb-bb,2Met 273Val Hydrophobic sc-sc,2Met 274Ile Hydrophobic sc-sc,...,289Gln 292Glu Hbond bb-bb,289Gln 293Leu Hbond bb-sc,289Gln 294Ser Other bb-bb,290Trp 293Leu Hbond bb-bb,290Trp 294Ser Hbond bb-sc,291Lys 294Ser Hbond bb-sc,291Lys 296Glu Saltbr sc-sc,291Lys 295Hip Hbond bb-bb,292Glu 295Hip Hbond bb-sc,290Trp 295Hip Other bb-bb
0,2.79062,11.75476,3.85765,0.91267,4.97278,0.83832,2.71548,3.32993,1.79010,2.63887,...,5.84398,4.90948,0.03020,1.37886,4.53196,3.51456,4.05396,1.88718,9.50276,0.00000
1,1.48888,11.56569,2.01823,1.89134,3.61045,0.08871,1.04626,3.49882,0.00771,3.87662,...,3.00541,3.82055,0.02262,0.25570,4.00946,2.86784,6.15452,0.60597,7.98836,0.00000
2,2.65394,7.74254,3.23310,0.29681,4.05759,0.01007,1.47758,2.89084,0.34961,2.88760,...,5.26235,4.83590,0.02475,1.56767,5.05774,3.85535,10.35693,1.07959,8.41847,0.00746
3,0.67279,7.00332,1.61461,3.11450,4.76641,0.15005,1.11730,2.75300,0.94718,1.10542,...,4.76141,4.33542,0.00764,3.04014,5.38331,4.73111,7.15285,0.59023,4.61401,0.00000
4,0.73688,5.10867,1.28357,1.68206,3.98947,1.06603,2.91757,3.42521,0.85554,1.12289,...,4.06021,5.05878,0.01925,1.95579,5.00841,5.58901,7.39404,1.36542,6.76236,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,5.88591,9.95938,3.59306,1.78412,5.69829,1.32305,1.11285,3.48554,0.50031,1.38936,...,3.79417,6.73365,0.01304,4.02895,5.40077,5.03374,12.96654,3.44472,6.49388,0.04974
9996,8.34367,9.86854,2.94537,0.16410,4.74343,2.71557,1.36727,3.43449,1.32015,0.93137,...,4.80578,4.53435,0.01939,2.27820,5.06846,3.85514,10.05204,3.62520,8.12842,0.05530
9997,9.10518,6.73513,1.70116,0.09681,3.34138,1.84982,0.99641,3.39886,1.74845,0.78892,...,5.39940,5.16106,0.00752,2.62120,4.78300,6.13537,11.73527,3.75943,6.05206,0.02272
9998,2.33875,8.18129,2.79464,0.64758,5.16315,2.49588,1.48416,3.21917,1.07795,3.58457,...,4.33339,5.89272,0.00983,1.45542,4.51928,3.08697,8.26397,2.79550,5.44399,0.01265


### Step 3. Generate the matrices needed with the network_analysis.py module. 

In this section, we will generate both a per residue correlation matrix and a per residue distance matrix and save these to disk for later use with the R script. 

Graph theory methods applied to biomolecular simulations tend to require two pieces of data:
1. A correlation matrix for every node (residue) in the network to every other node.
2. A distance matrix (sometimes called a distance/contact map) which can be used to filter the correlation matrix. Filtering is done by setting a max distance cut-off (effectively deciding whether two nodes/residues are in contact with each other).
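
The way these two pieces of data fit together can be sketched with NumPy: correlations between residue pairs that are further apart than the cut-off are simply discarded. The matrices and the cut-off value below are toy values chosen for illustration.

```python
import numpy as np

# Toy per-residue correlation matrix for three residues.
corr = np.array([
    [1.0, 0.6, -0.4],
    [0.6, 1.0, 0.2],
    [-0.4, 0.2, 1.0],
])

# Toy per-residue distance matrix (in Å) for the same three residues.
dist = np.array([
    [0.0, 3.5, 9.0],
    [3.5, 0.0, 4.2],
    [9.0, 4.2, 0.0],
])

d_cut = 6.0  # max distance (Å) at which two residues count as "in contact"

# Boolean contact map: True where the residue pair is within the cut-off.
in_contact = dist <= d_cut

# Keep correlations only for residue pairs that are in contact.
filtered_corr = np.where(in_contact, corr, 0.0)

print(filtered_corr)
```

Here the correlation between residues 1 and 3 is set to zero because they are 9 Å apart, so no edge would connect them directly in the resulting network.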



In [8]:
# First, let's generate an instance of the CorrelationNetwork class. 
correlation_network = network_analysis.CorrelationNetwork(
    dataset=unsupervised_dataset.df_filtered, 
)

# As before, let's take a look at the class attributes
correlation_network.__dict__.keys()

dict_keys(['dataset', 'feature_corr_matrix'])

In [9]:
# As we can see from the above, we have now generated a correlation matrix for each feature. 
correlation_network.feature_corr_matrix 

Unnamed: 0,1Glu 241Ser Hbond sc-bb,1Glu 4Lys Hbond bb-sc,1Glu 5Glu Hbond bb-bb,2Met 240Pro Hbond sc-bb,2Met 5Glu Hbond bb-bb,2Met 245Ile Hbond sc-sc,2Met 270Tyr Other bb-sc,2Met 6Phe Hbond bb-bb,2Met 273Val Hydrophobic sc-sc,2Met 274Ile Hydrophobic sc-sc,...,289Gln 292Glu Hbond bb-bb,289Gln 293Leu Hbond bb-sc,289Gln 294Ser Other bb-bb,290Trp 293Leu Hbond bb-bb,290Trp 294Ser Hbond bb-sc,291Lys 294Ser Hbond bb-sc,291Lys 296Glu Saltbr sc-sc,291Lys 295Hip Hbond bb-bb,292Glu 295Hip Hbond bb-sc,290Trp 295Hip Other bb-bb
1Glu 241Ser Hbond sc-bb,1.000000,-0.133671,-0.132070,0.030746,0.045689,-0.050800,-0.019976,0.087437,0.052699,0.004847,...,-0.013222,-0.030848,-0.037698,0.037795,0.030640,0.044554,0.037501,0.097782,-0.031415,0.032442
1Glu 4Lys Hbond bb-sc,-0.133671,1.000000,-0.105491,0.058863,0.039973,0.001571,-0.054994,0.054990,-0.068267,-0.023862,...,-0.018361,-0.020636,-0.018834,0.014355,-0.007887,-0.006891,0.020959,0.019222,-0.031855,0.006527
1Glu 5Glu Hbond bb-bb,-0.132070,-0.105491,1.000000,0.015241,0.010696,0.040757,0.166263,0.006134,0.106039,0.099689,...,0.028942,0.031006,0.016102,-0.019572,-0.008937,0.002708,-0.009021,-0.040930,0.027336,-0.032085
2Met 240Pro Hbond sc-bb,0.030746,0.058863,0.015241,1.000000,-0.078592,-0.358808,-0.282994,0.042091,-0.329885,0.067559,...,-0.002822,-0.003877,-0.005728,0.010464,0.032921,0.013054,0.026064,-0.001953,0.028502,-0.001974
2Met 5Glu Hbond bb-bb,0.045689,0.039973,0.010696,-0.078592,1.000000,0.029403,0.039402,0.118284,-0.071986,-0.148912,...,-0.000744,0.005742,-0.007793,0.009019,-0.012819,-0.012495,0.010474,-0.001330,0.022297,-0.007110
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
291Lys 294Ser Hbond bb-sc,0.044554,-0.006891,0.002708,0.013054,-0.012495,-0.031442,-0.030483,-0.005568,0.000765,0.028728,...,-0.017442,-0.004936,-0.105848,0.163385,0.114192,1.000000,0.063846,0.275186,-0.058191,-0.301174
291Lys 296Glu Saltbr sc-sc,0.037501,0.020959,-0.009021,0.026064,0.010474,-0.041401,-0.020969,0.026767,0.016570,-0.018065,...,-0.013034,-0.067814,-0.057211,0.069160,0.085219,0.063846,1.000000,0.346009,-0.033259,0.088169
291Lys 295Hip Hbond bb-bb,0.097782,0.019222,-0.040930,-0.001953,-0.001330,-0.049803,-0.063016,0.006742,0.034394,-0.017036,...,-0.020520,-0.103296,-0.175136,0.204561,0.119594,0.275186,0.346009,1.000000,-0.024513,0.349329
292Glu 295Hip Hbond bb-sc,-0.031415,-0.031855,0.027336,0.028502,0.022297,-0.008214,0.006927,-0.002716,-0.019070,0.012825,...,-0.068725,0.031098,0.050106,-0.026823,-0.014672,-0.058191,-0.033259,-0.024513,1.000000,-0.043682


Whilst the above correlation matrix could be used for network analysis, it may be more intuitive to represent the data at the per-residue level.
With the code block below we can do exactly that. 
Note that this will both write the file to disk and return the per-residue correlation matrix.

In [10]:
import os
os.makedirs("WISP_Inputs", exist_ok=True) # we'll save the results here. 

per_res_corr_matrix = correlation_network.gen_res_correl_matrix(
    out_file="WISP_Inputs/per_res_matrix.csv"
)
per_res_corr_matrix

WISP_Inputs/per_res_matrix.csv saved to disk.


array([[ 1.        ,  0.16626298, -0.24062193, ...,  0.08941919,
        -0.03380876, -0.03234933],
       [ 0.16626298,  1.        , -0.27489959, ...,  0.08118594,
        -0.0466597 ,  0.02954917],
       [-0.24062193, -0.27489959,  1.        , ..., -0.09910845,
         0.06740939,  0.03839995],
       ...,
       [ 0.08941919,  0.08118594, -0.09910845, ...,  1.        ,
         0.38142254,  0.31687698],
       [-0.03380876, -0.0466597 ,  0.06740939, ...,  0.38142254,
         1.        ,  0.22214761],
       [-0.03234933,  0.02954917,  0.03839995, ...,  0.31687698,
         0.22214761,  1.        ]])

Alongside a correlation matrix we will also want to generate some form of contact map to define which nodes (residues) are in close proximity to one another. 

Our program provides you with two options of how to generate this matrix:

1. Using the method "gen_res_contact_matrix". Here, we define residues as in contact with one another if they share an interaction in the dataframe. (Essentially using the column names to identify pairs of residues in contact). 
 

2. Using the method "heavy_atom_contact_map_from_pdb". Here, we calculate the minimum heavy atom distance between each residue pair and if the minimum distance is below the defined distance cut-off (d_cut), the two residues are considered in contact. 
    
    * Note that if you have multiple PDB files, you can use the method "heavy_atom_contact_map_from_multiple_pdbs" instead. Here, if in any of the frames provided, the two residues are below the contact distance cut-off (d_cut), they will be considered in contact. 

Option 2 is the more standard method to determine close contacts and is what we will use in this tutorial as well. 

In [11]:
# Option 1:
# per_res_contact_matrix = correlation_network.gen_res_contact_matrix(
#     out_file=r"WISP_Inputs/per_res_contact_matrix.csv"
# )
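
To illustrate the idea behind Option 1, the sketch below builds a contact matrix purely from column names of the form "1Glu 4Lys Hbond bb-sc". It is a simplified stand-in written for this explanation, not the code behind "gen_res_contact_matrix"; the feature list and the residue count are toy values.

```python
import re

import numpy as np

# Toy feature names following the "<res1> <res2> <type> <location>" pattern.
feature_names = [
    "1Glu 4Lys Hbond bb-sc",
    "2Met 5Glu Hbond bb-bb",
    "1Glu 5Glu Hbond bb-bb",
]
n_res = 5  # toy protein with five residues

contact = np.zeros((n_res, n_res), dtype=int)
for name in feature_names:
    res1, res2 = name.split()[:2]
    # Leading digits give the residue number; convert to a zero-based index.
    i = int(re.match(r"\d+", res1).group()) - 1
    j = int(re.match(r"\d+", res2).group()) - 1
    # A shared interaction marks the pair as in contact (symmetric matrix).
    contact[i, j] = contact[j, i] = 1

print(contact)
```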

In [12]:
# Option 2, single PDB file example.  
# heavy_atom_contact_map = network_analysis.heavy_atom_contact_map_from_pdb(
#     pdb_file=save_dir + r"/PTP1B_data/WT_PTP1B_Phospho_Enzyme_Closed.pdb",
#     first_res=1,
#     last_res=298,
#     d_cut=5,
#     out_file=r"WISP_Inputs/heavy_atom_contact_map.csv"
# )

# Option 2, multiple PDB files  
heavy_atom_contact_map = network_analysis.heavy_atom_contact_map_from_multiple_pdbs(
    pdb_files=[save_dir + r"/PTP1B_data/WT_PTP1B_Phospho_Enzyme_Closed.pdb",
               save_dir + r"/PTP1B_data/WT_PTP1B_Phospho_Enzyme_Open.pdb"],
    first_res=1,
    last_res=298,
    d_cut=6,
    out_file=r"WISP_Inputs/heavy_atom_contact_map_MultiPDB.csv"
)

# We can safely ignore the warning from MDAnalysis as this is not needed for our calculation. 



WISP_Inputs/heavy_atom_contact_map_MultiPDB.csv saved to disk.
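
The logic behind Option 2 can be sketched without any structure-parsing library: for each residue pair, take the smallest distance between any two of their heavy atoms and compare it with the cut-off. The coordinates below are made up, and the sketch is only meant to show the idea, not how the KIF functions are implemented.

```python
import numpy as np

# Hypothetical heavy-atom coordinates (Å) for three residues.
residue_coords = {
    1: np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]),
    2: np.array([[4.0, 0.0, 0.0]]),
    3: np.array([[12.0, 0.0, 0.0], [13.0, 1.0, 0.0]]),
}
d_cut = 6.0  # contact distance cut-off (Å)

res_ids = sorted(residue_coords)
n_res = len(res_ids)
contact_map = np.zeros((n_res, n_res), dtype=int)

for a, res_a in enumerate(res_ids):
    for b, res_b in enumerate(res_ids):
        # All pairwise atom-atom displacement vectors between the two residues.
        diff = residue_coords[res_a][:, None, :] - residue_coords[res_b][None, :, :]
        min_dist = np.sqrt((diff ** 2).sum(axis=-1)).min()
        # In contact when the closest pair of heavy atoms is below the cut-off.
        contact_map[a, b] = int(min_dist < d_cut)

print(contact_map)
```

With several structures, one such map per structure could be combined with an element-wise maximum, so that a pair counts as in contact if it is within the cut-off in any of them.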


### Step 4. Run the Network Analysis of your Choice. 

There are many possible ways to analyse the networks generated here, and several packages have been developed specifically for this purpose. For this reason, we chose not to build this into our program. Instead, an example is provided of how the datasets generated here can be used with Bio3D to perform WISP calculations. 
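
For a flavour of what a path-based analysis does with these two matrices, here is a small pure-Python sketch. It assumes the edge weight -log(|C_ij|) that is commonly used in WISP-style analyses and finds only the single shortest path between two residues using Dijkstra's algorithm; it does not enumerate suboptimal paths and is not the Bio3D implementation. All values are toy data.

```python
import heapq
import math

# Toy per-residue correlation matrix and contact map for four residues.
corr = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.8, 0.2],
    [0.1, 0.8, 1.0, 0.7],
    [0.0, 0.2, 0.7, 1.0],
]
contact = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
]


def shortest_path(corr, contact, source, sink):
    """Dijkstra's algorithm with edge weights -log(|correlation|)."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == sink:
            break
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt in range(len(corr)):
            # Edges exist only between distinct, contacting, correlated residues.
            if nxt == node or not contact[node][nxt] or corr[node][nxt] == 0:
                continue
            new_dist = dist - math.log(abs(corr[node][nxt]))
            if new_dist < best.get(nxt, math.inf):
                best[nxt] = new_dist
                previous[nxt] = node
                heapq.heappush(queue, (new_dist, nxt))
    if sink not in best:
        return None, math.inf
    path = [sink]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[sink]


path, length = shortest_path(corr, contact, source=0, sink=3)
print(path, round(length, 3))
```

Strongly correlated residue pairs get short edges, so the shortest path follows the chain of contacts with the highest combined correlation. Node indices here are zero-based positions in the toy matrices.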