[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kamerlinlab/KIF/blob/main/tutorials/network_analysis_tutorial/Step1_Tutorial_PTP1B_Network_Analysis.ipynb)

### Tutorial for Performing Network Analysis on Simulations of PTP1B. 

In this Jupyter notebook we will use the network_analysis.py module on the dataset generated for the enzyme PTP1B to produce per-residue correlation and distance matrices, which can serve as inputs for many different graph theory methods. In our manuscript, we used these matrices as inputs for weighted implementation of suboptimal paths (WISP) calculations to study the allosteric communication pathways present in PTP1B. The R script used to perform these analyses and a Python script that generates a PyMOL-compatible visualisation of the WISP results are also included.

Therefore after running this notebook you can:
- Run the R script (Step2_Run_WISP.r) to perform the WISP calculation.
- If you want to visualise the results in PyMOL instead of VMD, you can then run the Python script (Step3_Gen_Pymol_Visuals.py).

(Note that the outputs of this notebook are already provided in the folder "WISP_Inputs", so you can skip this notebook if you just want to look at WISP.)

**Some additional Points:**
- The dataset used herein is the same as what was used in the manuscript. 
- This approach does not require labels (i.e. unsupervised datasets are fine). 
- If you do have class labels and want to study the differences between classes (for example, alterations in allosteric networks between conformational states), you could first split your trajectory frames by class and perform the analysis for each conformation separately. Here, however, we had success combining snapshots from all states of PTP1B (so both the closed and open WPD-loop conformations). Please see the manuscript for further details.
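
If you do have per-frame labels, the split described in the last point could look something like the minimal sketch below. The toy dataframe, the `frame_labels` series and the label names "Closed" and "Open" are assumptions made purely for illustration; they are not part of the tutorial dataset.

```python
import pandas as pd

# Toy stand-in for a contacts dataframe (rows = frames, columns = contacts).
toy_contacts_df = pd.DataFrame({
    "1Glu 4Lys Saltbr": [11.8, 11.6, 7.8, 7.0],
    "2Met 5Glu Other": [5.0, 3.6, 4.1, 4.8],
})

# Hypothetical per-frame labels, one per row of the dataframe.
frame_labels = pd.Series(["Closed", "Closed", "Open", "Open"], name="state")

# One dataframe per conformational state; each can then be analysed separately.
per_state_dfs = {
    state: group.reset_index(drop=True)
    for state, group in toy_contacts_df.groupby(frame_labels)
}

for state, state_df in per_state_dfs.items():
    print(state, state_df.shape)
```

Each entry of `per_state_dfs` could then be passed through the remaining steps of this notebook on its own.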


<center><img src="https://raw.githubusercontent.com/kamerlinlab/KIF/main/tutorials/miscellaneous/ptp1b_banner.png" alt="Drawing" style="width: 70%" /></center>

### Setup

Install and load the required modules, then download the dataset we'll be working on from Google Drive.

In [1]:
%pip install KIF

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from key_interactions_finder import data_preperation
from key_interactions_finder import network_analysis

We will first need to download the PTP1B dataset from Google Drive.
The tutorial data will be saved in the relative path defined by "save_dir" in the cell block below.

You can change this as you see fit. If you want to use the current directory you can do:

save_dir=""

In [3]:
from key_interactions_finder.utils import download_prep_tutorial_dataset

drive_url = r"https://drive.google.com/file/d/1hJbwCCuTTgI4xglwu1vXyzo-yaZJbmUY/view?usp=share_link"
save_dir = "tutorial_datasets/"

download_prep_tutorial_dataset(drive_url=drive_url, save_dir=save_dir)

Downloading...
From: https://drive.google.com/uc?id=1hJbwCCuTTgI4xglwu1vXyzo-yaZJbmUY
To: c:\Users\Rory Crean\Desktop\Github\key-interactions-finder\tutorials\network_analysis_tutorial\tutorial_datasets\tutorial_dataset.zip
100%|██████████| 27.5M/27.5M [00:04<00:00, 6.25MB/s]


Tutorial files were successfully downloaded and unzipped.


At this point, we'll define the location of our downloaded input files and where we would like to save our output files to throughout this tutorial.

In [4]:
# Where all input data is stored. 
in_dir = save_dir + r"PTP1B_Tutorial/Input_data/"

### Preparation Step 1: Load the non-covalent interaction datasets

The contact identification calculation was split into 4 blocks covering different residue ranges. We will first need to load these blocks and merge them. Luckily, this is very easy with pandas.

Note this data was generated using the script: "identify_contacts.py" which is provided with KIF.

In [5]:
input_files = ["PTP1B_block1.csv", "PTP1B_block2.csv", "PTP1B_block3.csv", "PTP1B_block4.csv"]
dfs = []
for file_name in input_files:
    file_path = in_dir + file_name
    df = pd.read_csv(file_path)
    dfs.append(df)

all_contacts_df = pd.concat(dfs, join='outer', axis=1)
all_contacts_df.head(3)

Unnamed: 0,1Glu 4Lys Saltbr,1Glu 5Glu Other,1Glu 240Pro Other,1Glu 241Ser Hbond,1Glu 243Val Other,1Glu 244Asp Other,2Met 5Glu Other,2Met 6Phe Other,2Met 234Met Other,2Met 240Pro Other,...,289Gln 292Glu Hbond,289Gln 293Leu Other,290Trp 293Leu Other,290Trp 294Ser Hbond,291Lys 294Ser Hbond,291Lys 295Hip Other,291Lys 296Glu Saltbr,291Lys 297Asp Saltbr,292Glu 295Hip Hbond,295Hip 298Leu Hbond
0,11.7873,3.8771,2.2511,2.8257,0.0003,0.0,4.9883,3.331,0.5597,0.9231,...,5.8577,4.9207,1.3966,4.5413,3.548,1.8982,4.081,0.0004,9.5329,0.0021
1,11.5986,2.0279,0.1625,1.521,0.0036,0.0,3.6268,3.5072,0.0064,1.9025,...,3.0508,3.8434,0.267,4.0187,2.886,0.6077,6.1905,0.0027,7.9961,0.1484
2,7.7794,3.2396,0.1173,2.6724,0.0025,0.0,4.07,2.9013,0.1024,0.3117,...,5.2784,4.8548,1.5899,5.0631,3.8698,1.0813,10.3764,0.0021,8.4305,0.0001


We can see that we now have a dataframe containing all the contacts identified (989 columns) with 10000 rows, which matches the number of frames in the trajectory.

In [6]:
all_contacts_df.shape

(10000, 989)
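
Concatenating column-wise assumes that every block has one row per trajectory frame, in the same order, and that no contact appears in more than one block. A minimal sketch of how these assumptions could be checked is given below; the two toy blocks are made up for illustration and stand in for the real CSV files.

```python
import pandas as pd

# Toy stand-ins for two contact blocks (rows = frames, columns = contacts).
block1 = pd.DataFrame({"1Glu 4Lys Saltbr": [11.8, 11.6, 7.8]})
block2 = pd.DataFrame({"289Gln 292Glu Hbond": [5.9, 3.1, 5.3]})
blocks = [block1, block2]

# Every block should describe the same frames, so the row counts must match.
n_frames = {len(block) for block in blocks}
assert len(n_frames) == 1, f"Blocks have different numbers of frames: {n_frames}"

merged = pd.concat(blocks, axis=1)

# A contact should appear in exactly one block, so column names must be unique.
assert merged.columns.is_unique, "Duplicate contact columns found after merging."

print(merged.shape)
```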

### Step 2 (Optional). Prepare the Dataset for Network Analysis with the data_preperation.py module. 

In this step, we will generate an instance of the UnsupervisedFeatureData class to prepare the dataset for the network analysis. We will use this instance to perform some (optional) filtering to limit the features we make use of in our analysis.

If you want to skip this step, you can simply take the dataframe produced in the prior step, "all_contacts_df", and use that in Step 3. I would recommend, though, filtering out features with a low % occupancy, as we will do in this step.
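
To make the idea of occupancy filtering concrete, here is a minimal pandas sketch on toy data. It assumes that a contact counts as "present" in a frame whenever its score is greater than zero; this definition is an assumption made for illustration, and KIF's own implementation may differ.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "1Glu 4Lys Saltbr": [11.8, 11.6, 7.8, 7.0],   # present in 4 of 4 frames
    "1Glu 244Asp Other": [0.0, 0.0, 0.0, 0.3],    # present in 1 of 4 frames
    "2Met 5Glu Other": [5.0, 0.0, 4.1, 4.8],      # present in 3 of 4 frames
})

# Assumed definition: a contact is "present" in a frame if its score is > 0.
occupancy_percent = (toy_df > 0).mean() * 100

# Keep only the contacts that are present in at least min_occupancy % of frames.
min_occupancy = 50
kept_columns = occupancy_percent[occupancy_percent >= min_occupancy].index
filtered_df = toy_df[kept_columns]

print(occupancy_percent.to_dict())
print(list(filtered_df.columns))
```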

In [7]:
# First we generate an instance of the UnsupervisedFeatureData class (because we do not have per-frame class labels).
unsupervised_dataset = data_preperation.UnsupervisedFeatureData(
    input_df=all_contacts_df,
)

# See the attributes available to the class
unsupervised_dataset.__dict__.keys()

dict_keys(['input_df', 'df_filtered', 'df_processed'])

As we can see from above, our instance of the UnsupervisedFeatureData class has three attributes: 'input_df', 'df_filtered' and 'df_processed'. The 'df_filtered' attribute is currently empty. Once we have used a filtering method in the next code block, 'df_filtered' will be populated and we can use it in Step 3.

In [8]:
# In this case, I am going to remove features/contacts that are not present for at least 50% of the simulation time.
unsupervised_dataset.filter_by_occupancy(min_occupancy=50)
unsupervised_dataset.filter_by_interaction_type(
    interaction_types_included=["Hbond", "Saltbr", "Hydrophobic", "Other"])

print(f"Number of features before filtering: {len(unsupervised_dataset.input_df.columns)}")
print(f"Number of features after filtering: {len(unsupervised_dataset.df_filtered.columns)}")
unsupervised_dataset.df_filtered

Number of features before filtering: 989
Number of features after filtering: 973


Unnamed: 0,1Glu 4Lys Saltbr,1Glu 5Glu Other,1Glu 240Pro Other,1Glu 241Ser Hbond,1Glu 243Val Other,2Met 5Glu Other,2Met 6Phe Other,2Met 234Met Other,2Met 240Pro Other,2Met 243Val Other,...,289Gln 292Glu Hbond,289Gln 293Leu Other,290Trp 293Leu Other,290Trp 294Ser Hbond,291Lys 294Ser Hbond,291Lys 295Hip Other,291Lys 296Glu Saltbr,291Lys 297Asp Saltbr,292Glu 295Hip Hbond,295Hip 298Leu Hbond
0,11.7873,3.8771,2.2511,2.8257,0.0003,4.9883,3.3310,0.5597,0.9231,0.0020,...,5.8577,4.9207,1.3966,4.5413,3.5480,1.8982,4.0810,0.0004,9.5329,0.0021
1,11.5986,2.0279,0.1625,1.5210,0.0036,3.6268,3.5072,0.0064,1.9025,0.0282,...,3.0508,3.8434,0.2670,4.0187,2.8860,0.6077,6.1905,0.0027,7.9961,0.1484
2,7.7794,3.2396,0.1173,2.6724,0.0025,4.0700,2.9013,0.1024,0.3117,0.0070,...,5.2784,4.8548,1.5899,5.0631,3.8698,1.0813,10.3764,0.0021,8.4305,0.0001
3,7.0477,1.6177,0.2778,0.6766,0.0000,4.7946,2.7631,1.2100,3.1232,0.5819,...,4.7681,4.3495,3.0597,5.3966,4.7426,0.5909,7.1754,0.0120,4.6376,0.0347
4,5.1379,1.2913,0.2407,0.7466,0.0003,4.0074,3.4370,1.2473,1.6855,0.5643,...,4.0849,5.0764,1.9743,5.0200,5.5969,1.3683,7.4113,0.0044,6.7782,0.0006
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9.9967,3.5990,0.0120,5.9129,0.8203,5.7118,3.4871,0.7801,1.7868,3.7542,...,3.7995,6.7416,4.0403,5.4065,5.0490,3.4596,12.9961,0.0025,6.5109,0.0174
9996,9.9186,2.9649,0.0148,8.3658,2.6657,4.7486,3.4420,0.0823,0.1708,3.7551,...,4.8272,4.5485,2.2995,5.0739,3.8750,3.6292,10.0779,0.0040,8.1487,0.0010
9997,6.7540,1.7064,0.0295,9.1168,1.9575,3.3653,3.4084,0.0144,0.1013,3.0831,...,5.4347,5.1678,2.6428,4.7954,6.1449,3.7615,11.7561,0.0006,6.0633,0.0025
9998,8.2204,2.8096,0.0072,2.3583,0.0069,5.1797,3.2275,0.7801,0.6542,3.7186,...,4.3518,5.9081,1.4750,4.5312,3.1011,2.8072,8.3094,0.0016,5.4612,0.0588


### Step 3. Generate the matrices needed with the network_analysis.py module. 

In this section, we will generate both a per residue correlation matrix and a per residue distance matrix and save these to disk for later use with the R script. 

Graph theory methods applied to biomolecular simulations tend to require two pieces of data:
1. A correlation matrix for every node (residue) in the network to every other node.
2. A distance matrix (sometimes called a distance/contact map) which can be used to filter the correlation matrix. Filtering is done by setting a maximum distance cut-off (effectively deciding whether two nodes/residues are in contact with each other).
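
How these two matrices are combined depends on the graph theory method used. As a minimal sketch, the NumPy example below masks a toy correlation matrix with a toy distance matrix. The matrices, the 5 Å cut-off and the choice to set filtered-out correlations to zero are all assumptions made for illustration.

```python
import numpy as np

# Toy per-residue correlation matrix for 3 residues (symmetric, 1.0 on the diagonal).
corr = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, -0.4],
    [0.3, -0.4, 1.0],
])

# Toy distance matrix (in Angstrom) for the same 3 residues.
dist = np.array([
    [0.0, 3.5, 9.0],
    [3.5, 0.0, 4.2],
    [9.0, 4.2, 0.0],
])

d_cut = 5.0  # assumed cut-off for this toy example
in_contact = dist <= d_cut

# Keep correlations only between residue pairs that are in contact.
filtered_corr = np.where(in_contact, corr, 0.0)
print(filtered_corr)
```

Here residues 1 and 3 are 9 Å apart, so their correlation is removed, whereas the other pairs are kept.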



In [9]:
# First, let's generate an instance of the CorrelationNetwork class.
correlation_network = network_analysis.CorrelationNetwork(
    dataset=unsupervised_dataset.df_filtered, 
)

# As before, let's take a look at the class attributes.
correlation_network.__dict__.keys()

dict_keys(['dataset', 'feature_corr_matrix'])

In [10]:
# As we can see from the above, we have now generated a correlation matrix for each feature. 
correlation_network.feature_corr_matrix 

Unnamed: 0,1Glu 4Lys Saltbr,1Glu 5Glu Other,1Glu 240Pro Other,1Glu 241Ser Hbond,1Glu 243Val Other,2Met 5Glu Other,2Met 6Phe Other,2Met 234Met Other,2Met 240Pro Other,2Met 243Val Other,...,289Gln 292Glu Hbond,289Gln 293Leu Other,290Trp 293Leu Other,290Trp 294Ser Hbond,291Lys 294Ser Hbond,291Lys 295Hip Other,291Lys 296Glu Saltbr,291Lys 297Asp Saltbr,292Glu 295Hip Hbond,295Hip 298Leu Hbond
1Glu 4Lys Saltbr,1.000000,-0.106718,0.059235,-0.132924,-0.240734,0.039716,0.055230,-0.008314,0.059515,-0.124814,...,-0.018356,-0.020632,0.014332,-0.007949,-0.006919,0.019186,0.020903,-0.018033,-0.031990,-0.076899
1Glu 5Glu Other,-0.106718,1.000000,-0.092312,-0.133577,0.273373,0.010701,0.005492,0.086371,0.014361,0.133159,...,0.028922,0.031137,-0.019652,-0.009060,0.002746,-0.041306,-0.009276,0.049259,0.027462,0.098379
1Glu 240Pro Other,0.059235,-0.092312,1.000000,0.061964,-0.084220,0.048308,0.007319,0.062003,0.050415,-0.117449,...,-0.020422,-0.028110,0.023948,0.008424,0.019399,0.043583,0.024442,-0.003348,0.023709,0.007596
1Glu 241Ser Hbond,-0.132924,-0.133577,0.061964,1.000000,-0.125245,0.045482,0.087990,-0.067588,0.031637,-0.041886,...,-0.013134,-0.030851,0.037641,0.030728,0.044500,0.098008,0.037536,-0.075393,-0.031564,-0.049097
1Glu 243Val Other,-0.240734,0.273373,-0.084220,-0.125245,1.000000,-0.087042,-0.227936,0.015805,-0.239604,0.490719,...,0.012796,0.018456,-0.034840,-0.006984,-0.002372,-0.075501,-0.066725,0.022219,0.023398,0.245874
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
291Lys 295Hip Other,0.019186,-0.041306,0.043583,0.098008,-0.075501,-0.001305,0.006817,-0.062576,-0.001995,-0.010389,...,-0.020079,-0.103207,0.204838,0.120001,0.275255,1.000000,0.345861,-0.274152,-0.024060,0.017462
291Lys 296Glu Saltbr,0.020903,-0.009276,0.024442,0.037536,-0.066725,0.010450,0.026784,-0.056362,0.026018,-0.014295,...,-0.013045,-0.067694,0.069087,0.085581,0.063733,0.345861,1.000000,-0.028148,-0.033025,0.030808
291Lys 297Asp Saltbr,-0.018033,0.049259,-0.003348,-0.075393,0.022219,0.000975,-0.002924,0.021980,0.011054,0.012849,...,0.015990,0.078078,-0.121252,-0.013489,-0.012775,-0.274152,-0.028148,1.000000,0.174846,-0.012025
292Glu 295Hip Hbond,-0.031990,0.027462,0.023709,-0.031564,0.023398,0.022428,-0.002693,0.006501,0.028494,-0.005325,...,-0.069012,0.031103,-0.026607,-0.015098,-0.058035,-0.024060,-0.033025,0.174846,1.000000,0.010252


Whilst the above correlation matrix could be used for network analysis, it may be more intuitive to represent the data at the per-residue level.
With the code block below we can do exactly that.
Note that this will both write the file to disk and return the per-residue correlation matrix.

In [11]:
import os
os.makedirs("WISP_Inputs", exist_ok=True) # we'll save the results here. 

per_res_corr_matrix = correlation_network.gen_res_correl_matrix(
    out_file="WISP_Inputs/per_res_matrix.csv" 
)
per_res_corr_matrix

WISP_Inputs/per_res_matrix.csv saved to disk.


array([[ 1.        ,  0.49071944,  0.63354943, ...,  0.17903615,
        -0.07539259,  0.2458738 ],
       [ 0.49071944,  1.        ,  0.35425494, ...,  0.08135103,
        -0.04688685,  0.11536796],
       [ 0.63354943,  0.35425494,  1.        , ...,  0.14856428,
         0.06739234,  0.18996462],
       ...,
       [ 0.17903615,  0.08135103,  0.14856428, ...,  1.        ,
         0.4096029 , -0.33201737],
       [-0.07539259, -0.04688685,  0.06739234, ...,  0.4096029 ,
         1.        , -0.2346412 ],
       [ 0.2458738 ,  0.11536796,  0.18996462, ..., -0.33201737,
        -0.2346412 ,  1.        ]])

Alongside a correlation matrix we will also want to generate some form of contact map to define which nodes (residues) are in close proximity to one another. 

Our program provides you with two options of how to generate this matrix:

1. Using the method "gen_res_contact_matrix". Here, we define residues as in contact with one another if they share an interaction in the dataframe. (Essentially using the column names to identify pairs of residues in contact). 
 

2. Using the method "heavy_atom_contact_map_from_pdb". Here, we calculate the minimum heavy atom distance between each residue pair, and if the minimum distance is below the defined distance cut-off (d_cut), the two residues are considered in contact. 
    
    * Note that if you have multiple PDB files, you can use the method "heavy_atom_contact_map_from_multiple_pdbs" instead. Here, if in any of the frames provided, the two residues are below the contact distance cut-off (d_cut), they will be considered in contact. 

Option 2 is the more standard method to determine close contacts and is what we will use in this tutorial as well. 
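
To illustrate the idea behind option 1, the sketch below builds a small contact matrix directly from column names. The regular expression assumes names of the form "1Glu 4Lys Saltbr", as seen in the dataframe headers above; it is written for illustration only and is not KIF's actual implementation.

```python
import re

import numpy as np

columns = ["1Glu 4Lys Saltbr", "1Glu 5Glu Other", "2Met 5Glu Other"]
n_residues = 5  # toy system size

contact_matrix = np.zeros((n_residues, n_residues), dtype=int)

# Assumed column format: "<res1 number><res1 name> <res2 number><res2 name> <type>".
pattern = re.compile(r"^(\d+)[A-Za-z]+ (\d+)[A-Za-z]+ \w+$")

for column in columns:
    match = pattern.match(column)
    if match is None:
        raise ValueError(f"Unexpected column name: {column!r}")
    res1, res2 = (int(number) for number in match.groups())
    # Residue numbers are 1-based, array indices are 0-based.
    contact_matrix[res1 - 1, res2 - 1] = 1
    contact_matrix[res2 - 1, res1 - 1] = 1

print(contact_matrix)
```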

In [12]:
# Option 1:
per_res_contact_matrix = correlation_network.gen_res_contact_matrix(
    out_file=r"WISP_Inputs/per_res_contact_matrix.csv"
)

WISP_Inputs/per_res_contact_matrix.csv saved to disk.


In [13]:
# Option 2, single PDB file example.  
heavy_atom_contact_map = network_analysis.heavy_atom_contact_map_from_pdb(
    pdb_file=in_dir + "WT_PTP1B_Phospho_Enzyme_Closed.pdb",
    first_res=1,
    last_res=298,
    d_cut=5,
    out_file=r"WISP_Inputs/heavy_atom_contact_map.csv"
)

# Option 2, multiple PDB files  
heavy_atom_contact_map = network_analysis.heavy_atom_contact_map_from_multiple_pdbs(
    pdb_files=[in_dir + "WT_PTP1B_Phospho_Enzyme_Closed.pdb",
               in_dir + "WT_PTP1B_Phospho_Enzyme_Open.pdb"],
    first_res=1,
    last_res=298,
    d_cut=6,
    out_file=r"WISP_Inputs/heavy_atom_contact_map_MultiPDB.csv"
)

# We can safely ignore the warning from MDAnalysis as this is not needed for our calculation. 



WISP_Inputs/heavy_atom_contact_map.csv saved to disk.
WISP_Inputs/heavy_atom_contact_map_MultiPDB.csv saved to disk.
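
The criterion behind option 2 can be illustrated for a single residue pair with plain NumPy. The coordinates below are made up, and the sketch only shows the "minimum heavy atom distance below d_cut" test; it does not reproduce how KIF reads the PDB file or selects the heavy atoms.

```python
import numpy as np

# Toy heavy-atom coordinates (in Angstrom) for two residues.
residue_a = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
residue_b = np.array([[5.5, 0.0, 0.0], [9.0, 0.0, 0.0], [5.5, 3.0, 0.0]])

# All pairwise distances between atoms of residue A and atoms of residue B.
differences = residue_a[:, np.newaxis, :] - residue_b[np.newaxis, :, :]
pairwise_distances = np.linalg.norm(differences, axis=-1)

# The residue pair is in contact if its closest pair of heavy atoms is within d_cut.
min_distance = pairwise_distances.min()
d_cut = 5.0
in_contact = min_distance <= d_cut

print(f"Minimum heavy atom distance: {min_distance:.1f}")
print(f"In contact: {in_contact}")
```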


### Step 4. Run the Network Analysis of your Choice. 

There are many possible ways to analyse the networks generated here, and several packages have been developed specifically for this purpose. For this reason, we chose not to build this into our program. Instead, we provide an example of how the datasets generated here can be used with Bio3D to perform WISP calculations.
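
To give a flavour of what such an analysis does with the two matrices, the sketch below finds the single shortest path between two residues using Dijkstra's algorithm. It is not the Bio3D/WISP workflow itself. The toy matrices and the edge weight of -log(|correlation|), a common choice in correlation network analyses, are assumptions made for illustration.

```python
import heapq
import math

# Toy per-residue correlation matrix and contact map for 4 residues.
corr = [
    [1.0, 0.9, 0.2, 0.0],
    [0.9, 1.0, 0.8, 0.0],
    [0.2, 0.8, 1.0, 0.7],
    [0.0, 0.0, 0.7, 1.0],
]
contacts = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]


def shortest_path(source, target):
    """Dijkstra's algorithm with edge weights -log(|correlation|)."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour in range(len(corr)):
            # Only residues in contact with a non-zero correlation share an edge.
            if not contacts[node][neighbour] or corr[node][neighbour] == 0:
                continue
            new_cost = cost - math.log(abs(corr[node][neighbour]))
            if new_cost < best.get(neighbour, math.inf):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    if target not in best:
        raise ValueError("No path between the two residues.")
    # Walk back from the target to the source to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]


path, cost = shortest_path(0, 3)
print(path, round(cost, 3))
```

In this toy network, the strongly correlated route through residues 0, 1, 2 and 3 is preferred over the direct but weakly correlated edge between residues 0 and 2.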