## Tutorial for Network Analysis on Simulations of PTP1B. 

In this Jupyter notebook we will use the network_analysis.py module on the PTP1B PyContact dataset to generate inputs for various network analysis methods.

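Several tools already exist to perform network analysis, and therefore this notebook focuses on generating the inputs those tools require rather than on performing the network analysis itself.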




This notebook will also cover all the pre-processing steps required, alongside an example of how to use the program.

The PTP1B dataset used here is the same one we used in the manuscript.

This approach does not require class labels (i.e. unsupervised datasets are fine). 

If you do have class labels and want to study the differences between classes (for example, to study alterations in allosteric networks in different conformational states), then you can first split your trajectory frames by class and perform the analysis on each class separately.

Here, however, we shall combine the snapshots of PTP1B in both conformational states: (1) the closed WPD-loop conformation and (2) the open WPD-loop conformation.
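
The per-class split mentioned above could be sketched as follows. This is a minimal illustration with pandas, not part of key_interactions_finder: the toy dataframe, the `frame_labels` list, and the contact column names are all hypothetical.

```python
import pandas as pd

# Hypothetical per-frame contact scores (rows = frames, columns = contacts).
contacts_df = pd.DataFrame({
    "contact_A": [0.0, 1.2, 0.8, 0.0],
    "contact_B": [2.1, 0.0, 1.9, 2.3],
})

# Hypothetical per-frame class labels, one per row of contacts_df.
frame_labels = ["Closed", "Open", "Closed", "Open"]

# Group the frames by label; each sub-dataframe can then be analysed separately.
per_class_dfs = {
    label: group.reset_index(drop=True)
    for label, group in contacts_df.groupby(pd.Series(frame_labels, name="label"))
}

print({label: len(df) for label, df in per_class_dfs.items()})
```

Each entry of `per_class_dfs` could then be passed through the remaining steps of this tutorial on its own.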


<center><img src="miscellaneous/PTP1B_Stat_Analysis_Banner.png" alt="Drawing" style="width: 70%" /></center>

In [None]:
import pandas as pd
import numpy as np

import sys # note temporary... 
sys.path.append("..") # note temporary... 

from key_interactions_finder import pycontact_processing
from key_interactions_finder import data_preperation
from key_interactions_finder import network_analysis

### Step 1. Process PyContact files with the pycontact_processing.py module 

In this section we will work with the generated PyContact output files.
Here we will merge our separate runs together and remove any false interactions that can be generated by the PyContact library.

In [None]:
# Build the 20 block file names (Block1 ... Block20) rather than listing each by hand.
pycontact_files_horizontal = [
    f"PyContact_Per_Frame_Interactions_Block{block}.csv" for block in range(1, 21)
]

pycontact_dataset = pycontact_processing.PyContactInitializer(
    pycontact_files=pycontact_files_horizontal,
    multiple_files=True,
    merge_files_method="horizontal",  
    remove_false_interactions=True,
    in_dir="datasets/PTP1B_Data/example_horizontal_data",
)

In [None]:
# As outputted above, we can inspect the newly prepared dataset by accessing the '.prepared_df' class attribute as follows:
pycontact_dataset.prepared_df

### Step 2. (Optional) Prepare the Dataset for Network Analysis with the data_preperation.py module. 

In this step, we will generate an instance of the UnsupervisedFeatureData class to prepare the dataset for the network analysis. We will use this instance to perform some (optional) filtering to limit the features we make use of in our analysis.

If you want to skip this step, you can simply take the dataframe produced in the prior step (pycontact_dataset.prepared_df) and use that in Step 3. I would, however, recommend filtering out features with a low %occupancy, as we will do in this step.
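
To make the idea of %occupancy concrete, the sketch below computes it by hand with pandas. It assumes that a contact counts as "present" in a frame whenever its score is non-zero. This definition, the toy data, and the column names are assumptions for illustration and may differ from what `filter_by_occupancy` does internally.

```python
import pandas as pd

# Hypothetical per-frame contact scores (rows = frames, columns = contacts).
toy_df = pd.DataFrame({
    "contact_A": [1.0, 0.0, 2.0, 1.5],  # non-zero in 3 of 4 frames
    "contact_B": [0.0, 0.0, 0.0, 3.0],  # non-zero in 1 of 4 frames
    "contact_C": [0.5, 0.7, 0.0, 0.0],  # non-zero in 2 of 4 frames
})

min_occupancy = 50  # percent, mirroring the cut-off used in this tutorial

# Percentage of frames in which each contact has a non-zero score.
occupancy = (toy_df != 0).mean() * 100

# Keep only the contacts whose occupancy meets the cut-off (assumed inclusive).
kept_columns = occupancy[occupancy >= min_occupancy].index
toy_df_filtered = toy_df[kept_columns]

print(occupancy.to_dict())
print(list(toy_df_filtered.columns))
```

Contacts that are rarely formed contribute mostly noise to a correlation matrix, which is one reason to remove them before Step 3.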

In [None]:
# First we generate an instance of the UnsupervisedFeatureData class (because we are not using per-frame class labels).
unsupervised_dataset = data_preperation.UnsupervisedFeatureData(
    input_df=pycontact_dataset.prepared_df,
)

# See the attributes available to the class
unsupervised_dataset.__dict__.keys()

As we can see from above, our instance of the UnsupervisedFeatureData class has two attributes, the 'input_df' and the 'df_filtered' (currently empty). Once we use a filtering method in the next code block we will create a populated 'df_filtered' and be able to use it in the next step. 

In [None]:
# In this case, I am going to remove features/contacts that are: 
# 1. Not present for at least 50% of the simulation time.
# 2. Of type "Other" - In PyContact lingo this means a van der Waals interaction between two residues
# that are not both hydrophobic. 
unsupervised_dataset.filter_by_occupancy(min_occupancy=50)
unsupervised_dataset.filter_by_interaction_type(
    interaction_types_included=["Hbond", "Saltbr", "Hydrophobic"])

print(f"Number of features before filtering: {len(unsupervised_dataset.input_df.columns)}")
print(f"Number of features after filtering: {len(unsupervised_dataset.df_filtered.columns)}")
unsupervised_dataset.df_filtered

### Step 3. Perform the Network Analysis with the network_analysis.py module. 

Some nomenclature:

* Node - a single entity (point) in the network.
* Edge - a connection (link) between two nodes.

Relating the above to biomolecules, we can normally think of a residue as a node in a network and a
connection between two residues as an edge. 
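
As a small illustration of this picture, the sketch below stores a toy network as an adjacency dictionary in plain Python. The residue names and contacts are hypothetical and chosen only to show the node/edge bookkeeping.

```python
from collections import defaultdict

# Hypothetical residue-residue contacts; each pair is one edge.
edges = [("Res1", "Res2"), ("Res2", "Res3"), ("Res1", "Res4")]

# Build an undirected adjacency map: node -> set of neighbouring nodes.
adjacency = defaultdict(set)
for res_a, res_b in edges:
    adjacency[res_a].add(res_b)
    adjacency[res_b].add(res_a)

nodes = sorted(adjacency)
print(f"{len(nodes)} nodes, {len(edges)} edges")
for node in nodes:
    print(node, "->", sorted(adjacency[node]))
```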

Check out this paper if you're new to the topic --- LINK TODO. 


Network/correlation analyses of biomolecular simulations tend to require two sets of data:
1. A correlation matrix describing the correlation of every node in the network with every other node.
2. A distance matrix (sometimes called a distance/contact map), which says whether two nodes are in contact with each other (usually defined by whether they lie within a distance cut-off of one another). 

In this section, we will generate both of these and save them so that you can make use of the plethora of methods out there as you wish. 
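
One common way these two inputs are combined downstream is to keep only the correlations between residues that are in contact and convert them into edge lengths with $-\log|C_{ij}|$, so that strongly correlated pairs are "close". This is an assumption about typical usage rather than something this package does for you, and the matrices below are toy values.

```python
import numpy as np

# Hypothetical 3-residue correlation matrix (symmetric, ones on the diagonal).
corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, -0.5],
    [0.1, -0.5, 1.0],
])

# Hypothetical contact map: 1 if the two residues are in contact, else 0.
contact = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])

# Edge length = -log|correlation| for contacting pairs, infinity otherwise.
with np.errstate(divide="ignore"):
    lengths = -np.log(np.abs(corr))
edge_lengths = np.where(contact == 1, lengths, np.inf)

print(np.round(edge_lengths, 3))
```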

In [None]:
# First let's generate an instance of the CorrelationNetwork class. 
correlation_network = network_analysis.CorrelationNetwork(
    dataset=unsupervised_dataset.df_filtered, 
    out_dir="outputs/PTP1B_network_analysis"
)

# As before, let's take a look at the class attributes
correlation_network.__dict__.keys()

In [None]:
# As we can see from the above, we have now generated a correlation matrix between every pair of features. 
correlation_network.feature_corr_matrix 

Whilst the above correlation matrix could be used for network analysis, it may be more intuitive to represent the data at the per-residue level.
With the code block below we can do exactly that. 
Note that this will both write the matrix to disk and return it. 

In [None]:
per_res_corr_matrix = correlation_network.gen_res_correl_matrix(
    out_file="outputs/PTP1B_network_analysis/per_res_matrix.csv"
)
per_res_corr_matrix

Alongside a correlation matrix we will also want to generate some form of contact map to define which nodes (residues) are in close proximity to one another. 

Here two main options are provided:

1. Using the method "gen_res_contact_matrix". Here, we define residues as in contact with one another if they share an interaction in the dataframe (essentially using the column names to identify pairs of residues in contact).  
2. Using the method "heavy_atom_contact_map_from_pdb". Here, we calculate the minimum heavy atom distance between each residue pair, and if the minimum distance is below the defined distance cut-off (d_cut), the two residues are considered in contact. 
    
    * Note that if you have multiple PDB files, you can use the method "heavy_atom_contact_map_from_multiple_pdbs" instead. Here, if in any of the frames provided, the two residues are below the contact distance cut-off (d_cut), they will be considered in contact. 

Option 2 is the more standard method to determine close contacts and is what we will use in this tutorial as well. 
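
The minimum heavy-atom distance criterion from Option 2 can be sketched with NumPy as below. This illustrates the idea only and is not the package's implementation: the coordinates are made up, and a real calculation would read the heavy-atom positions from the PDB file(s).

```python
import numpy as np

# Hypothetical heavy-atom coordinates (in Angstrom) for three residues.
residue_coords = {
    1: np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]),
    2: np.array([[4.0, 0.0, 0.0], [5.5, 0.0, 0.0]]),
    3: np.array([[20.0, 0.0, 0.0]]),
}
d_cut = 5.0  # contact distance cut-off in Angstrom

res_ids = sorted(residue_coords)
n_res = len(res_ids)
contact_map = np.zeros((n_res, n_res), dtype=int)

for i, res_i in enumerate(res_ids):
    for j, res_j in enumerate(res_ids):
        if i == j:
            continue
        # All pairwise distances between the heavy atoms of the two residues.
        diffs = residue_coords[res_i][:, None, :] - residue_coords[res_j][None, :, :]
        min_dist = np.linalg.norm(diffs, axis=-1).min()
        # In contact if the closest pair of heavy atoms is below the cut-off.
        contact_map[i, j] = int(min_dist < d_cut)

print(contact_map)
```

With several structures, the per-structure maps could be merged with an element-wise maximum (for example `np.maximum(map_a, map_b)`), which matches the "in contact in any of the frames" rule described above.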

In [None]:

# Option 1:
per_res_contact_matrix = correlation_network.gen_res_contact_matrix(
    out_file="outputs/PTP1B_network_analysis/per_res_contact_matrix.csv"
)

In [None]:
# Option 2, single PDB file example.  
# heavy_atom_contact_map = network_analysis.heavy_atom_contact_map_from_pdb(
#     pdb_file="datasets/PTP1B_data/WT_PTP1B_Phospho_Enzyme_Closed.pdb",
#     first_res=1,
#     last_res=298,
#     d_cut=5,
#     out_file="outputs/PTP1B_network_analysis/heavy_atom_contact_map.csv"
# )

# Option 2, multiple PDB files  
network_analysis.heavy_atom_contact_map_from_multiple_pdbs(
    pdb_files=["datasets/PTP1B_data/WT_PTP1B_Phospho_Enzyme_Closed.pdb",
               "datasets/PTP1B_data/WT_PTP1B_Phospho_Enzyme_Open.pdb"],
    first_res=1,
    last_res=298,
    d_cut=5,
    out_file="outputs/PTP1B_network_analysis/heavy_atom_contact_map_MultiPDB.csv"
)

### Step 4. Run the Network Analysis of your Choice. 

There are many types of network analysis, and several packages have already been developed for them. For this reason, I chose not to build the analysis itself into this program. Instead, I provide an example of how to use Bio3D to analyse the results. Many other methods/programs require the same inputs as those we have produced here, so feel free to use whichever you prefer.
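
As one hedged illustration of such a downstream analysis, the sketch below finds the shortest path between two residues with Dijkstra's algorithm, using only the Python standard library. It is not the Bio3D workflow. The graph and its edge lengths are hypothetical; in practice they might be derived from the correlation matrix and contact map produced in Step 3.

```python
import heapq

# Hypothetical undirected graph: node -> {neighbour: edge length}.
graph = {
    "Res1": {"Res2": 0.2, "Res3": 1.5},
    "Res2": {"Res1": 0.2, "Res3": 0.4},
    "Res3": {"Res1": 1.5, "Res2": 0.4, "Res4": 0.3},
    "Res4": {"Res3": 0.3},
}


def shortest_path(graph, source, target):
    """Return (total length, path) of the shortest route from source to target."""
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, length in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (dist + length, neighbour, path + [neighbour]))
    # No route exists between the two nodes.
    return float("inf"), []


length, path = shortest_path(graph, "Res1", "Res4")
print(path, round(length, 2))
```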