### Tutorial for Performing Network Analysis on Simulations of PTP1B. 

In this Jupyter notebook we will use the network_analysis.py module on the PyContact dataset generated for PTP1B to produce per residue correlation and distance matrices, which can be used as inputs to many different graph theory methods. In our manuscript, we used these matrices as inputs for weighted implementation of suboptimal paths (WISP) calculations in order to study the allosteric communication pathways present in PTP1B. The R script used to perform these analyses and a Python script to generate a PyMOL-compatible visualisation of the WISP results are also included. 

Therefore after running this notebook you can:
- Run the R script (Step2_Run_WISP.r) to perform the WISP calculation.
- If you want to visualise the results in PyMOL instead of VMD, you can then run the Python script (Step3_Gen_Pymol_Visuals.py).

(Note that the WISP inputs and outputs are provided, so you could theoretically skip this notebook if desired.)


**Some additional Points:**
- The dataset used herein is the same PTP1B dataset that was used in the manuscript. 
- This approach does not require class labels (i.e. unsupervised datasets are fine). 
- If you do have class labels and want to study the differences between classes (for example, to study alterations in allosteric networks in different conformational states), then you could first split your trajectory frames and perform the analysis for each conformation separately. Here, however, we had success combining snapshots from all states of PTP1B (so both the closed and open WPD-loop conformations). Please see the manuscript for further details. 


<center><img src="https://raw.githubusercontent.com/kamerlinlab/key-interactions-finder/main/tutorials/miscellaneous/ptp1b_banner.png?token=GHSAT0AAAAAABU4IYQQAORIMUF26F3CWUZYYZNYZOA" alt="Drawing" style="width: 70%" /></center>

### Setup

Load modules, download the dataset 

In [1]:
import pandas as pd
import numpy as np

import sys # note temporary... 
sys.path.append("..") # note temporary... 

from key_interactions_finder import pycontact_processing
from key_interactions_finder import data_preperation
from key_interactions_finder import network_analysis

In [None]:
# TODO Update with new way to get the dataset. 

### Step 1. Process PyContact files with the pycontact_processing.py module 

In this section we will work with the PyContact output files generated. 
Here we will merge our separate runs together and remove any false interactions that can be generated by the PyContact library. 

In [2]:
pycontact_files_horizontal = ["PyContact_Per_Frame_Interactions_Block1.csv", "PyContact_Per_Frame_Interactions_Block2.csv",
                              "PyContact_Per_Frame_Interactions_Block3.csv", "PyContact_Per_Frame_Interactions_Block4.csv",
                              "PyContact_Per_Frame_Interactions_Block5.csv", "PyContact_Per_Frame_Interactions_Block6.csv",
                              "PyContact_Per_Frame_Interactions_Block7.csv", "PyContact_Per_Frame_Interactions_Block8.csv",
                              "PyContact_Per_Frame_Interactions_Block9.csv", "PyContact_Per_Frame_Interactions_Block10.csv",
                              "PyContact_Per_Frame_Interactions_Block11.csv", "PyContact_Per_Frame_Interactions_Block12.csv",
                              "PyContact_Per_Frame_Interactions_Block13.csv", "PyContact_Per_Frame_Interactions_Block14.csv",
                              "PyContact_Per_Frame_Interactions_Block15.csv", "PyContact_Per_Frame_Interactions_Block16.csv",
                              "PyContact_Per_Frame_Interactions_Block17.csv", "PyContact_Per_Frame_Interactions_Block18.csv",
                              "PyContact_Per_Frame_Interactions_Block19.csv", "PyContact_Per_Frame_Interactions_Block20.csv",
                              ]

pycontact_dataset = pycontact_processing.PyContactInitializer(
    pycontact_files=pycontact_files_horizontal,
    multiple_files=True,
    merge_files_method="horizontal",  
    remove_false_interactions=True,
    in_dir="datasets/PTP1B_Data/example_horizontal_data",
)

Your PyContact file(s) have been succefully processed.
You have 2790 features and 10000 observations.
The fully processed dataframe is accesible from the '.prepared_df' class attribute.


In [3]:
# As outputted above, we can inspect the newly prepared dataset by accessing the '.prepared_df' class attribute as follows:
pycontact_dataset.prepared_df

Unnamed: 0,1Glu 241Ser Hbond sc-bb,1Glu 240Pro Hbond sc-bb,1Glu 3Glu Hbond bb-sc,1Glu 4Lys Hbond bb-sc,1Glu 5Glu Hbond bb-bb,1Glu 6Phe Other bb-bb,2Met 240Pro Hbond sc-bb,2Met 4Lys Hbond bb-bb,2Met 5Glu Hbond bb-bb,2Met 245Ile Hbond sc-sc,...,290Trp 186Ser Other bb-sc,288Asp 298Leu Other sc-sc,298Leu 196Lys Hbond bb-sc,297Asp 152Tyr Other bb-sc,297Asp 151Tyr Other sc-bb,287Gln 298Leu Other bb-sc,298Leu 287Gln Other sc-bb,289Gln 298Leu Other bb-sc,298Leu 289Gln Other sc-bb,290Trp 284Ser Other sc-bb
0,2.79062,2.22760,6.19015,11.75476,3.85765,0.01413,0.91267,2.30422,4.97278,0.83832,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.48888,0.15072,3.75961,11.56569,2.01823,0.00000,1.89134,1.81649,3.61045,0.08871,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2.65394,0.11149,5.94822,7.74254,3.23310,0.00000,0.29681,2.29761,4.05759,0.01007,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.67279,0.26855,3.82198,7.00332,1.61461,0.00000,3.11450,3.07824,4.76641,0.15005,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.73688,0.23481,3.52825,5.10867,1.28357,0.00000,1.68206,2.29991,3.98947,1.06603,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,5.88591,0.01018,1.72647,9.95938,3.59306,0.00000,1.78412,3.08623,5.69829,1.32305,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9996,8.34367,0.01030,2.03138,9.86854,2.94537,0.00000,0.16410,2.92621,4.74343,2.71557,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9997,9.10518,0.02668,1.27821,6.73513,1.70116,0.00000,0.09681,1.98417,3.34138,1.84982,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9998,2.33875,0.00000,4.65840,8.18129,2.79464,0.00000,0.64758,3.07499,5.16315,2.49588,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Step 2 (Optional). Prepare the Dataset for Network Analysis with the data_preperation.py module. 

In this step, we will generate an instance of the UnsupervisedFeatureData class to prepare the dataset for the network analysis. We will use this instance to perform some (optional) filtering to limit the features we make use of in our analysis. 

If you want to skip this step, then you can simply take the dataframe produced in the prior step (pycontact_dataset.prepared_df) and use that in Step 3. I would recommend, though, filtering out features with a low %occupancy, as we will do in this step. 
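To make the idea of occupancy filtering concrete, here is a minimal pandas sketch. It assumes that "%occupancy" means the percentage of frames in which a contact has a non-zero score. That definition, the helper name `filter_by_min_occupancy` and the toy values are illustrative assumptions, not the library's implementation.

```python
import pandas as pd


def filter_by_min_occupancy(df: pd.DataFrame, min_occupancy: float) -> pd.DataFrame:
    """Keep columns that are non-zero in at least `min_occupancy` % of frames.

    Assumption: a contact counts as "present" in a frame if its score is non-zero.
    """
    occupancy = (df != 0).mean(axis=0) * 100  # % of frames with a non-zero score
    return df.loc[:, occupancy >= min_occupancy]


# Toy data: 4 frames (rows) and 3 contacts (columns).
toy_df = pd.DataFrame({
    "1Glu 3Glu Hbond bb-sc": [6.2, 3.8, 5.9, 3.8],      # non-zero in 100% of frames
    "2Met 4Lys Hbond bb-bb": [2.3, 0.0, 2.3, 0.0],      # non-zero in 50% of frames
    "290Trp 186Ser Other bb-sc": [0.0, 0.0, 0.0, 0.9],  # non-zero in 25% of frames
})

filtered = filter_by_min_occupancy(toy_df, min_occupancy=50)
print(list(filtered.columns))
```

With a 50% cut-off, only the first two toy contacts are retained.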

In [4]:
# First we generate an instance of the UnsupervisedFeatureData class (class labels are not needed for this analysis).
unsupervised_dataset = data_preperation.UnsupervisedFeatureData(
    input_df=pycontact_dataset.prepared_df,
)

# See the attributes available to the class
unsupervised_dataset.__dict__.keys()

dict_keys(['input_df', 'df_filtered', 'df_processed'])

As we can see from above, our instance of the UnsupervisedFeatureData class has three attributes, including the 'input_df' and the 'df_filtered' (currently empty). Once we have used a filtering method in the next code block, 'df_filtered' will be populated and we will be able to use it in Step 3.

In [5]:
# In this case, I am going to remove features/contacts that are not present for at least 50% of the simulation time.
unsupervised_dataset.filter_by_occupancy(min_occupancy=50)
unsupervised_dataset.filter_by_interaction_type(
    interaction_types_included=["Hbond", "Saltbr", "Hydrophobic", "Other"])

print(f"Number of features before filtering: {len(unsupervised_dataset.input_df.columns)}")
print(f"Number of features after filtering: {len(unsupervised_dataset.df_filtered.columns)}")
unsupervised_dataset.df_filtered

Number of features before filtering: 2790
Number of features after filtering: 1508


Unnamed: 0,1Glu 241Ser Hbond sc-bb,1Glu 3Glu Hbond bb-sc,1Glu 4Lys Hbond bb-sc,1Glu 5Glu Hbond bb-bb,2Met 240Pro Hbond sc-bb,2Met 4Lys Hbond bb-bb,2Met 5Glu Hbond bb-bb,2Met 245Ile Hbond sc-sc,2Met 270Tyr Other bb-sc,2Met 6Phe Hbond bb-bb,...,296Glu 149Lys Hbond bb-sc,296Glu 298Leu Hbond bb-bb,297Asp 150Ser Hbond bb-sc,298Leu 150Ser Hbond sc-bb,298Leu 149Lys Hbond sc-bb,287Gln 196Lys Hbond sc-bb,288Asp 284Ser Hbond sc-bb,290Trp 295Hip Other bb-bb,295Hip 151Tyr Other bb-sc,298Leu 148Ile Other sc-bb
0,2.79062,6.19015,11.75476,3.85765,0.91267,2.30422,4.97278,0.83832,2.71548,3.32993,...,4.41611,1.09097,0.25778,2.49891,0.02171,0.00000,0.00000,0.00000,0.00000,0.00000
1,1.48888,3.75961,11.56569,2.01823,1.89134,1.81649,3.61045,0.08871,1.04626,3.49882,...,5.23246,0.70584,0.00000,1.30223,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
2,2.65394,5.94822,7.74254,3.23310,0.29681,2.29761,4.05759,0.01007,1.47758,2.89084,...,4.10256,1.07889,0.00000,0.00000,0.00000,0.86248,0.02063,0.00746,0.00000,0.00000
3,0.67279,3.82198,7.00332,1.61461,3.11450,3.07824,4.76641,0.15005,1.11730,2.75300,...,5.27269,0.07609,0.03441,0.00000,0.00000,0.10617,0.02888,0.00000,0.02310,0.00000
4,0.73688,3.52825,5.10867,1.28357,1.68206,2.29991,3.98947,1.06603,2.91757,3.42521,...,4.10210,0.59820,0.00921,0.00000,0.00000,0.00000,0.86343,0.00000,0.00000,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,5.88591,1.72647,9.95938,3.59306,1.78412,3.08623,5.69829,1.32305,1.11285,3.48554,...,7.24663,0.81696,0.23105,0.50044,0.03031,0.49957,0.01327,0.04974,0.20528,0.00000
9996,8.34367,2.03138,9.86854,2.94537,0.16410,2.92621,4.74343,2.71557,1.36727,3.43449,...,3.30450,1.50024,1.56385,1.70037,4.28076,0.02343,0.01558,0.05530,0.56723,1.77844
9997,9.10518,1.27821,6.73513,1.70116,0.09681,1.98417,3.34138,1.84982,0.99641,3.39886,...,7.82822,1.30594,0.54968,2.62928,0.00000,0.29803,0.00954,0.02272,0.86606,0.00000
9998,2.33875,4.65840,8.18129,2.79464,0.64758,3.07499,5.16315,2.49588,1.48416,3.21917,...,8.61171,0.91257,0.98280,3.68056,0.00000,0.32268,0.00000,0.01265,0.36237,0.00000


### Step 3. Generate the matrices needed with the network_analysis.py module. 

In this section, we will generate both a per residue correlation matrix and a per residue distance matrix and save these to disk for later use with the R script. 

Graph theory methods applied to biomolecular simulations tend to require two pieces of data:
1. A correlation matrix for every node (residue) in the network to every other node.
2. A distance matrix (sometimes called a distance/contact map), which can be used to filter the correlation matrix. Filtering is done by setting a maximum distance cut-off (effectively stating whether two nodes/residues are in contact with each other).
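As a rough illustration of how the two matrices work together, the sketch below zeroes out correlations between residues that are not in contact. The toy matrices and variable names are made up for this example. In practice, this filtering is handled by whichever graph theory package you pass the matrices to.

```python
import numpy as np

# Toy per residue correlation matrix for 3 residues (symmetric, 1.0 on the diagonal).
corr = np.array([
    [1.0, 0.8, -0.6],
    [0.8, 1.0, 0.3],
    [-0.6, 0.3, 1.0],
])

# Toy binary contact map: 1 = in contact, 0 = not in contact.
contact = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
])

# Keep correlations only for residue pairs that are in contact.
filtered_corr = np.where(contact == 1, corr, 0.0)
print(filtered_corr)
```

Here the correlation between residues 1 and 3 is discarded because the pair is not in contact, even though its magnitude is fairly large.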



In [6]:
# First, let's generate an instance of the CorrelationNetwork class. 
correlation_network = network_analysis.CorrelationNetwork(
    dataset=unsupervised_dataset.df_filtered, 
)

# As before, let's take a look at the class attributes
correlation_network.__dict__.keys()

dict_keys(['dataset', 'out_dir', 'feature_corr_matrix'])

In [7]:
# As we can see from the above, we have now generated a correlation matrix for each feature. 
correlation_network.feature_corr_matrix 

Unnamed: 0,1Glu 241Ser Hbond sc-bb,1Glu 3Glu Hbond bb-sc,1Glu 4Lys Hbond bb-sc,1Glu 5Glu Hbond bb-bb,2Met 240Pro Hbond sc-bb,2Met 4Lys Hbond bb-bb,2Met 5Glu Hbond bb-bb,2Met 245Ile Hbond sc-sc,2Met 270Tyr Other bb-sc,2Met 6Phe Hbond bb-bb,...,296Glu 149Lys Hbond bb-sc,296Glu 298Leu Hbond bb-bb,297Asp 150Ser Hbond bb-sc,298Leu 150Ser Hbond sc-bb,298Leu 149Lys Hbond sc-bb,287Gln 196Lys Hbond sc-bb,288Asp 284Ser Hbond sc-bb,290Trp 295Hip Other bb-bb,295Hip 151Tyr Other bb-sc,298Leu 148Ile Other sc-bb
1Glu 241Ser Hbond sc-bb,1.000000,-0.199592,-0.133671,-0.132070,0.030746,-0.083083,0.045689,-0.050800,-0.019976,0.087437,...,-0.012931,-0.025843,0.026255,-0.032349,0.015572,0.053772,0.001384,0.032442,0.019854,-0.017273
1Glu 3Glu Hbond bb-sc,-0.199592,1.000000,-0.242108,0.183317,-0.210722,0.134615,-0.072549,0.135973,0.259842,-0.143439,...,0.002980,0.091657,-0.102785,0.006615,-0.090970,-0.081971,0.030043,-0.054030,-0.072974,-0.081103
1Glu 4Lys Hbond bb-sc,-0.133671,-0.242108,1.000000,-0.105491,0.058863,0.037338,0.039973,0.001571,-0.054994,0.054990,...,-0.026104,-0.035720,0.021652,0.009455,0.019120,0.011474,-0.002076,0.006527,0.006951,0.027677
1Glu 5Glu Hbond bb-bb,-0.132070,0.183317,-0.105491,1.000000,0.015241,-0.023047,0.010696,0.040757,0.166263,0.006134,...,-0.004709,0.028478,-0.033809,-0.014215,-0.026213,-0.029664,-0.024244,-0.032085,-0.012739,-0.016702
2Met 240Pro Hbond sc-bb,0.030746,-0.210722,0.058863,0.015241,1.000000,-0.162844,-0.078592,-0.358808,-0.282994,0.042091,...,0.000048,-0.005616,0.023345,0.012441,0.015343,-0.020956,-0.039566,-0.001974,-0.022163,0.026518
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287Gln 196Lys Hbond sc-bb,0.053772,-0.081971,0.011474,-0.029664,-0.020956,0.012538,0.002460,-0.008610,-0.036431,-0.009250,...,0.039076,-0.072798,0.041947,0.010746,0.044010,1.000000,-0.013302,0.080143,0.117952,0.047693
288Asp 284Ser Hbond sc-bb,0.001384,0.030043,-0.002076,-0.024244,-0.039566,0.043549,-0.005165,0.061010,0.014441,-0.040096,...,-0.078648,0.047412,0.000770,-0.025527,-0.021755,-0.013302,1.000000,0.014960,-0.019124,-0.030758
290Trp 295Hip Other bb-bb,0.032442,-0.054030,0.006527,-0.032085,-0.001974,0.000422,-0.007110,-0.010157,-0.007125,0.007823,...,0.012872,-0.047009,0.128804,0.043804,0.065238,0.080143,0.014960,1.000000,0.210723,0.056439
295Hip 151Tyr Other bb-sc,0.019854,-0.072974,0.006951,-0.012739,-0.022163,0.016174,0.005488,0.007680,-0.007800,0.012851,...,0.040392,-0.134748,0.260321,0.091806,0.017838,0.117952,-0.019124,0.210723,1.000000,0.038252


Whilst the above correlation matrix could be used for network analysis, it may be more intuitive to represent the data at the per residue level.
With the below code block we can do exactly that. 
Note that this will both write the file to disk and return the per residue correlation matrix.

In [8]:
per_res_corr_matrix = correlation_network.gen_res_correl_matrix(
    out_file="outputs/PTP1B_network_analysis/per_res_matrix.csv"
)
per_res_corr_matrix

outputs/PTP1B_network_analysis/per_res_matrix.csv saved to disk.


array([[ 1.        ,  0.29242761, -0.3325355 , ...,  0.16970353,
        -0.10278502,  0.0916571 ],
       [ 0.29242761,  1.        , -0.44949612, ...,  0.08118594,
         0.05962114, -0.04268972],
       [-0.3325355 , -0.44949612,  1.        , ...,  0.16970353,
        -0.10278502,  0.0916571 ],
       ...,
       [ 0.16970353,  0.08118594,  0.16970353, ...,  1.        ,
         0.38142254,  0.31687698],
       [-0.10278502,  0.05962114, -0.10278502, ...,  0.38142254,
         1.        , -0.25391092],
       [ 0.0916571 , -0.04268972,  0.0916571 , ...,  0.31687698,
        -0.25391092,  1.        ]])

Alongside a correlation matrix we will also want to generate some form of contact map to define which nodes (residues) are in close proximity to one another. 

Our program provides you with two options of how to generate this matrix:

1. Using the method "gen_res_contact_matrix". Here, we define residues as in contact with one another if they share an interaction in the dataframe. (Essentially using the column names to identify pairs of residues in contact). 
 

2. Using the method "heavy_atom_contact_map_from_pdb". Here, we calculate the minimum heavy atom distance between each residue pair, and if the minimum distance is below the defined distance cut-off (d_cut), the two residues are considered in contact. 
    
    * Note that if you have multiple PDB files, you can use the method "heavy_atom_contact_map_from_multiple_pdbs" instead. Here, if in any of the frames provided, the two residues are below the contact distance cut-off (d_cut), they will be considered in contact. 

Option 2 is the more standard method to determine close contacts and is what we will use in this tutorial as well. 
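The logic behind option 2 can be sketched with NumPy on toy coordinates. This is not the library's implementation: the function name `contact_map_from_coords` and the coordinates are hypothetical, and the heavy-atom positions are assumed to have already been extracted per residue (rather than read from a PDB file). Merging two frames with an element-wise maximum mimics the "in contact in any frame" rule described for multiple PDB files.

```python
import numpy as np


def contact_map_from_coords(residue_coords, d_cut):
    """Binary contact map from per-residue heavy-atom coordinates.

    residue_coords: list of (n_atoms, 3) arrays, one array per residue.
    Two residues are in contact if their minimum heavy-atom distance is below d_cut.
    """
    n_res = len(residue_coords)
    contact_map = np.zeros((n_res, n_res), dtype=int)
    for i in range(n_res):
        for j in range(i, n_res):
            # All pairwise atom-atom distances between residue i and residue j.
            diffs = residue_coords[i][:, None, :] - residue_coords[j][None, :, :]
            min_dist = np.sqrt((diffs ** 2).sum(axis=-1)).min()
            if min_dist < d_cut:
                contact_map[i, j] = contact_map[j, i] = 1
    return contact_map


# Two toy "frames" of three residues placed along the x-axis (units: Angstrom).
frame_a = [
    np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
    np.array([[4.0, 0.0, 0.0], [9.0, 0.0, 0.0]]),
    np.array([[20.0, 0.0, 0.0]]),
]
frame_b = [
    frame_a[0],
    frame_a[1],
    np.array([[12.0, 0.0, 0.0]]),  # third residue has moved closer to the second
]

map_a = contact_map_from_coords(frame_a, d_cut=5)
map_b = contact_map_from_coords(frame_b, d_cut=5)

# A pair counts as in contact if it is in contact in any of the frames.
combined = np.maximum(map_a, map_b)
print(combined)
```

In this toy case, residues 2 and 3 are only in contact in the second frame, but they are still marked as in contact in the combined map.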

In [9]:
# Option 1:
# per_res_contact_matrix = correlation_network.gen_res_contact_matrix(
#     out_file="outputs/PTP1B_network_analysis/per_res_contact_matrix.csv"
# )
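For intuition, the idea behind option 1 (using the column names to identify residue pairs in contact) can be sketched in plain Python. The regular expression assumes the column naming pattern seen in the dataframes above (e.g. "1Glu 241Ser Hbond sc-bb"). The function name `contact_matrix_from_columns` is hypothetical, and leaving the diagonal at 0 is an arbitrary choice for this sketch, not necessarily what the library does.

```python
import re

# Matches e.g. "1Glu 241Ser Hbond sc-bb" and captures the two residue numbers.
COLUMN_PATTERN = re.compile(r"^(\d+)[A-Za-z]+ (\d+)[A-Za-z]+ \w+ \w+-\w+$")


def contact_matrix_from_columns(columns, first_res, last_res):
    """Mark residue pairs as in contact if they share at least one feature column."""
    n_res = last_res - first_res + 1
    matrix = [[0] * n_res for _ in range(n_res)]
    for name in columns:
        match = COLUMN_PATTERN.match(name)
        if match is None:
            continue  # skip columns that do not follow the expected pattern
        res_a = int(match.group(1)) - first_res
        res_b = int(match.group(2)) - first_res
        matrix[res_a][res_b] = matrix[res_b][res_a] = 1
    return matrix


toy_columns = [
    "1Glu 3Glu Hbond bb-sc",
    "2Met 4Lys Hbond bb-bb",
    "1Glu 4Lys Hbond bb-sc",
]
toy_contacts = contact_matrix_from_columns(toy_columns, first_res=1, last_res=4)
for row in toy_contacts:
    print(row)
```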

In [10]:
# Option 2, single PDB file example.  
# heavy_atom_contact_map = network_analysis.heavy_atom_contact_map_from_pdb(
#     pdb_file="datasets/PTP1B_data/WT_PTP1B_Phospho_Enzyme_Closed.pdb",
#     first_res=1,
#     last_res=298,
#     d_cut=5,
#     out_file="outputs/PTP1B_network_analysis/heavy_atom_contact_map.csv"
# )

# Option 2, multiple PDB files  
network_analysis.heavy_atom_contact_map_from_multiple_pdbs(
    pdb_files=["datasets/PTP1B_data/WT_PTP1B_Phospho_Enzyme_Closed.pdb",
               "datasets/PTP1B_data/WT_PTP1B_Phospho_Enzyme_Open.pdb"],
    first_res=1,
    last_res=298,
    d_cut=6,
    out_file="outputs/PTP1B_network_analysis/heavy_atom_contact_map_MultiPDB.csv"
)

# We can safely ignore the warning from MDAnalysis as this is not needed for our calculation. 



outputs/PTP1B_network_analysis/heavy_atom_contact_map_MultiPDB.csv saved to disk.


array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1]])

### Step 4. Run the Network Analysis of your Choice. 

There are many possible ways to analyse the networks generated, and several packages have been developed specifically for this purpose. For this reason, we chose not to build this into our program. Instead, an example is provided of how the datasets generated here can be used with Bio3D to perform WISP calculations. 
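To give a feel for what a WISP-style calculation does with the two matrices, below is a minimal pure-Python sketch. It is not the Bio3D implementation and not the R script provided with this tutorial. It assumes the commonly used edge weight of -log(|correlation|) between residues in contact, and uses Dijkstra's algorithm to find only the single optimal path between a source and a sink residue (WISP additionally enumerates suboptimal paths). The function name and toy matrices are illustrative.

```python
import heapq
import math


def shortest_path(corr, contact, source, sink):
    """Dijkstra's algorithm with edge weights of -log(|corr|) for residues in contact."""
    n = len(corr)
    dist = [math.inf] * n
    prev = [None] * n
    dist[source] = 0.0
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry
        if node == sink:
            break
        for nbr in range(n):
            if nbr == node or not contact[node][nbr] or corr[node][nbr] == 0:
                continue  # no edge between these residues
            new_d = d - math.log(abs(corr[node][nbr]))
            if new_d < dist[nbr]:
                dist[nbr] = new_d
                prev[nbr] = node
                heapq.heappush(queue, (new_d, nbr))
    if math.isinf(dist[sink]):
        return None, math.inf  # sink is unreachable from source
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[sink]


# Toy network of 4 residues (0-indexed).
toy_corr = [
    [1.0, 0.9, 0.1, 0.95],
    [0.9, 1.0, 0.8, 0.0],
    [0.1, 0.8, 1.0, 0.9],
    [0.95, 0.0, 0.9, 1.0],
]
toy_contact = [
    [1, 1, 1, 0],  # residues 0 and 3 are strongly correlated but NOT in contact
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]

path, length = shortest_path(toy_corr, toy_contact, source=0, sink=3)
print(path, round(length, 4))
```

Strongly correlated residue pairs get short edges, so the optimal path runs through the chain of well-correlated contacts (0, 1, 2, 3) rather than through the weakly correlated 0-2 contact or the non-contacting 0-3 pair.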