## Tutorial for Statistical Analysis on Simulations of PTP1B. 

In this Jupyter notebook we will use the stat_modelling.py module to identify differences in the molecular interactions across PTP1B
when the WPD-loop of PTP1B is in the Closed state versus when the WPD-loop is in the Open state.
This notebook will also cover all the pre- and post-processing steps required to prepare, analyse and visualise the results.

The PTP1B dataset used here is the same one we used in the manuscript.

<center><img src="miscellaneous/PTP1B_Stat_Analysis_Banner.png" alt="Drawing" style="width: 70%" /></center>

In [1]:
import sys  # Temporary (only needed for the next line).
sys.path.append("..")  # Temporary: add the parent directory to the import path.

import pandas as pd
import numpy as np

from key_interactions_finder import pycontact_processing
from key_interactions_finder import data_preperation
from key_interactions_finder import stat_modelling
from key_interactions_finder import post_proccessing
from key_interactions_finder import pymol_projections

### Step 1. Process PyContact files with the pycontact_processing.py module 

In this section we will work with the output files generated by PyContact.
Here we will merge our separate runs together and remove any false interactions that can be generated by the PyContact library.

In [2]:
pycontact_files_horizontal = ["PyContact_Per_Frame_Interactions_Block1.csv", "PyContact_Per_Frame_Interactions_Block2.csv",
                              "PyContact_Per_Frame_Interactions_Block3.csv", "PyContact_Per_Frame_Interactions_Block4.csv",
                              "PyContact_Per_Frame_Interactions_Block5.csv", "PyContact_Per_Frame_Interactions_Block6.csv",
                              "PyContact_Per_Frame_Interactions_Block7.csv", "PyContact_Per_Frame_Interactions_Block8.csv",
                              "PyContact_Per_Frame_Interactions_Block9.csv", "PyContact_Per_Frame_Interactions_Block10.csv",
                              "PyContact_Per_Frame_Interactions_Block11.csv", "PyContact_Per_Frame_Interactions_Block12.csv",
                              "PyContact_Per_Frame_Interactions_Block13.csv", "PyContact_Per_Frame_Interactions_Block14.csv",
                              "PyContact_Per_Frame_Interactions_Block15.csv", "PyContact_Per_Frame_Interactions_Block16.csv",
                              "PyContact_Per_Frame_Interactions_Block17.csv", "PyContact_Per_Frame_Interactions_Block18.csv",
                              "PyContact_Per_Frame_Interactions_Block19.csv", "PyContact_Per_Frame_Interactions_Block20.csv",
                              ]

pycontact_dataset = pycontact_processing.PyContactInitializer(
    pycontact_files=pycontact_files_horizontal,
    multiple_files=True,
    merge_files_method="horizontal",  
    remove_false_interactions=True,
    in_dir="datasets/PTP1B_Data/example_horizontal_data/",
)

Your PyContact file(s) have been succefully processed.
You have 2790 features and 10000 observations.
The fully processed dataframe is accesible from the '.prepared_df' class attribute.
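Because the 20 file names in the cell above follow a regular pattern, the list could also be built programmatically rather than typed out. The snippet below is a minimal sketch of that idea, assuming the blocks are numbered consecutively from 1 to 20 as in the list above; the variable name `n_blocks` is ours and is not part of the package.

```python
# Build the list of PyContact output files from their shared naming pattern.
# Assumption: blocks are numbered consecutively, starting at 1.
n_blocks = 20

pycontact_files_horizontal = [
    f"PyContact_Per_Frame_Interactions_Block{block}.csv"
    for block in range(1, n_blocks + 1)
]

print(len(pycontact_files_horizontal))
print(pycontact_files_horizontal[0])
print(pycontact_files_horizontal[-1])
```

The resulting list can be passed to the `pycontact_files` argument exactly as the hand-written list was.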


In [3]:
# As outputted above, we can inspect the newly prepared dataset by accessing the '.prepared_df' class attribute as follows:
pycontact_dataset.prepared_df

Unnamed: 0,1Glu 241Ser Hbond sc-bb,1Glu 240Pro Hbond sc-bb,1Glu 3Glu Hbond bb-sc,1Glu 4Lys Hbond bb-sc,1Glu 5Glu Hbond bb-bb,1Glu 6Phe Other bb-bb,2Met 240Pro Hbond sc-bb,2Met 4Lys Hbond bb-bb,2Met 5Glu Hbond bb-bb,2Met 245Ile Hbond sc-sc,...,290Trp 186Ser Other bb-sc,288Asp 298Leu Other sc-sc,298Leu 196Lys Hbond bb-sc,297Asp 152Tyr Other bb-sc,297Asp 151Tyr Other sc-bb,287Gln 298Leu Other bb-sc,298Leu 287Gln Other sc-bb,289Gln 298Leu Other bb-sc,298Leu 289Gln Other sc-bb,290Trp 284Ser Other sc-bb
0,2.79062,2.22760,6.19015,11.75476,3.85765,0.01413,0.91267,2.30422,4.97278,0.83832,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.48888,0.15072,3.75961,11.56569,2.01823,0.00000,1.89134,1.81649,3.61045,0.08871,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2.65394,0.11149,5.94822,7.74254,3.23310,0.00000,0.29681,2.29761,4.05759,0.01007,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.67279,0.26855,3.82198,7.00332,1.61461,0.00000,3.11450,3.07824,4.76641,0.15005,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.73688,0.23481,3.52825,5.10867,1.28357,0.00000,1.68206,2.29991,3.98947,1.06603,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,5.88591,0.01018,1.72647,9.95938,3.59306,0.00000,1.78412,3.08623,5.69829,1.32305,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9996,8.34367,0.01030,2.03138,9.86854,2.94537,0.00000,0.16410,2.92621,4.74343,2.71557,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9997,9.10518,0.02668,1.27821,6.73513,1.70116,0.00000,0.09681,1.98417,3.34138,1.84982,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9998,2.33875,0.00000,4.65840,8.18129,2.79464,0.00000,0.64758,3.07499,5.16315,2.49588,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Step 2. Prepare the Dataset for Statistical Analysis with the data_preperation.py module. 

In this step, we take our dataframe and merge our per-frame classifications file with it.
We can also optionally apply several filters to select which types of interactions we
would like to study.  

In [4]:
# First we generate an instance of the SupervisedFeatureData class (because we have per frame class labels).

classifications_file = "datasets/PTP1B_Data/example_horizontal_data/WT_PTP1B_Class_Assingments.txt"

supervised_dataset = data_preperation.SupervisedFeatureData(
    input_df=pycontact_dataset.prepared_df,
    classifications_file=classifications_file,
    header_present=False # If your classifications_file has a header present, set to True.
)

Your features and class datasets have been succesufully merged.
You can access this dataset through the class attribute: '.df_feat_class'.
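To make the merge step more concrete, here is a minimal sketch of the idea in plain pandas. It is not the package's implementation. It assumes the classifications file holds one class label per line, in frame order, with no header (matching `header_present=False`); the toy data and variable names are hypothetical.

```python
import io

import pandas as pd

# Toy per-frame feature table (one row per frame); values are illustrative.
features = pd.DataFrame({"1Glu 241Ser Hbond sc-bb": [2.79, 1.49, 5.89]})

# Stand-in for the classifications file: one label per line, no header (assumed format).
labels_file = io.StringIO("Closed\nClosed\nOpen\n")
classes = pd.read_csv(labels_file, header=None, names=["Classes"])

# The merge only makes sense if there is exactly one label per frame.
if len(classes) != len(features):
    raise ValueError("Number of class labels does not match the number of frames.")

# Place the class labels in front of the feature columns.
merged = pd.concat([classes, features], axis=1)
print(merged)
```

Checking that the number of labels matches the number of frames before merging is a cheap way to catch a mismatched classifications file early.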


In [5]:
# As stated above, we can access the newly generated dataframe through the class attribute as follows:
supervised_dataset.df_feat_class 

Unnamed: 0,Classes,1Glu 241Ser Hbond sc-bb,1Glu 240Pro Hbond sc-bb,1Glu 3Glu Hbond bb-sc,1Glu 4Lys Hbond bb-sc,1Glu 5Glu Hbond bb-bb,1Glu 6Phe Other bb-bb,2Met 240Pro Hbond sc-bb,2Met 4Lys Hbond bb-bb,2Met 5Glu Hbond bb-bb,...,290Trp 186Ser Other bb-sc,288Asp 298Leu Other sc-sc,298Leu 196Lys Hbond bb-sc,297Asp 152Tyr Other bb-sc,297Asp 151Tyr Other sc-bb,287Gln 298Leu Other bb-sc,298Leu 287Gln Other sc-bb,289Gln 298Leu Other bb-sc,298Leu 289Gln Other sc-bb,290Trp 284Ser Other sc-bb
0,Closed,2.79062,2.22760,6.19015,11.75476,3.85765,0.01413,0.91267,2.30422,4.97278,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Closed,1.48888,0.15072,3.75961,11.56569,2.01823,0.00000,1.89134,1.81649,3.61045,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Closed,2.65394,0.11149,5.94822,7.74254,3.23310,0.00000,0.29681,2.29761,4.05759,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Closed,0.67279,0.26855,3.82198,7.00332,1.61461,0.00000,3.11450,3.07824,4.76641,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Closed,0.73688,0.23481,3.52825,5.10867,1.28357,0.00000,1.68206,2.29991,3.98947,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Open,5.88591,0.01018,1.72647,9.95938,3.59306,0.00000,1.78412,3.08623,5.69829,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9996,Open,8.34367,0.01030,2.03138,9.86854,2.94537,0.00000,0.16410,2.92621,4.74343,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9997,Open,9.10518,0.02668,1.27821,6.73513,1.70116,0.00000,0.09681,1.98417,3.34138,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9998,Open,2.33875,0.00000,4.65840,8.18129,2.79464,0.00000,0.64758,3.07499,5.16315,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##### Optional Feature Filtering

The above dataframe has 2791 columns: the 2790 features plus the "Classes" column. We can take all of these features forward for the statistical analysis or we can perform some filtering in advance (the choice is yours). 
There are four built-in filtering methods available to you:

1. filter_by_occupancy(min_occupancy) - Remove features that have a % occupancy less than the provided cut-off. % occupancy is the percentage of frames with a non-zero value, i.e. frames in which the interaction is present.

2. filter_by_interaction_type(interaction_types_included) - PyContact defines four types of interactions ("Hbond", "Saltbr", "Hydrophobic", "Other"). You select the interaction types you want to INCLUDE.

3. filter_by_main_or_side_chain(main_side_chain_types_included) - PyContact can also define whether each interaction is primarily from the backbone or the side chain of each residue. You select the interaction combinations you want to INCLUDE. Options are: "bb-bb", "sc-sc", "bb-sc", "sc-bb", where bb = backbone and sc = side chain.

4. filter_by_avg_strength(average_strength_cut_off) - PyContact calculates a per-frame contact score/strength for each interaction. You can filter features by their average score. Features whose average score is below the cut-off are removed. 

TODO: describe the method for resetting the filtering, so that you can start again if you are unhappy with the result.
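The occupancy, interaction type and average strength filters described above can be illustrated on a toy dataframe with plain pandas. This is only a sketch of the definitions given in the list, not the package's own code; the toy values, the cut-offs and the variable names are hypothetical, and the column names are assumed to follow the "residue1 residue2 type chain" pattern seen in the dataframes above.

```python
import pandas as pd

# Toy per-frame contact scores: 4 frames, 3 features.
toy = pd.DataFrame({
    "1Glu 241Ser Hbond sc-bb": [2.8, 1.5, 2.7, 0.7],
    "1Glu 6Phe Other bb-bb": [0.01, 0.0, 0.0, 0.0],
    "2Met 4Lys Hbond bb-bb": [0.0, 0.0, 3.1, 2.3],
})

# % occupancy: percentage of frames in which the score is non-zero.
occupancy = (toy != 0).mean() * 100
kept_by_occupancy = occupancy[occupancy >= 50].index.tolist()

# Interaction type: the third whitespace-separated field of the column name.
included_types = {"Hbond", "Saltbr", "Hydrophobic"}
kept_by_type = [col for col in toy.columns if col.split()[2] in included_types]

# Average strength: mean score over all frames.
average_strength = toy.mean()
kept_by_strength = average_strength[average_strength >= 0.5].index.tolist()

print(occupancy)
print(kept_by_occupancy, kept_by_type, kept_by_strength, sep="\n")
```

In this toy example the "Other" contact is present in only one of four frames (25 % occupancy) and has a very low average score, so all three filters discard it.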


In [6]:
# An example of filtering the dataset using the 4 available methods. 

# Note: this counts every column, so it includes the "Classes" column alongside the 2790 features.
print(f"Number of features before any filtering: {len(supervised_dataset.df_feat_class.columns)}")

# Features with a %occupancy of less than 25% are removed. 
supervised_dataset.filter_by_occupancy(min_occupancy=25)
print(f"Number of features after filtering by occupancy: {len(supervised_dataset.df_filtered.columns)}")

# Remove features with interaction type "Other".  
supervised_dataset.filter_by_interaction_type(
    interaction_types_included=["Hbond", "Saltbr", "Hydrophobic"]) 
print(f"Number of features after filtering by interaction type: {len(supervised_dataset.df_filtered.columns)}")

# No filtering performed here as all possible combinations are included. 
supervised_dataset.filter_by_main_or_side_chain(
    main_side_chain_types_included=["bb-bb", "sc-sc", "bb-sc", "sc-bb"]  
)
print(f"Number of features after NOT filtering by main or side chain: {len(supervised_dataset.df_filtered.columns)}")

# Features with average interaction scores less than 0.5 will be removed. 
supervised_dataset.filter_by_avg_strength(
    average_strength_cut_off=0.5,  
)
print(f"Number of features after filtering by average interaction scores: {len(supervised_dataset.df_filtered.columns)}")

Number of features before any filtering: 2791
Number of features after filtering by occupancy: 1745
Number of features after filtering by interaction type: 1076
Number of features after NOT filtering by main or side chain: 1076
Number of features after filtering by average interaction scores: 962


Now if we look at the attributes of our SupervisedFeatureData() instance (we called it: supervised_dataset) using the special "\_\_dict__" attribute, we can see two dataframes that we could use in the statistical analysis to follow. 

In [7]:
supervised_dataset.__dict__.keys()

dict_keys(['input_df', 'classifications_file', 'header_present', 'df_feat_class', 'df_filtered'])

They are: 
- 'df_feat_class' - The unfiltered dataframe, with 2791 columns.
- 'df_filtered' - The filtered dataframe, with fewer columns (962 after the filtering above). 

In the following section we will use the filtered dataframe but either dataframe could be justified based on your goals. 

### Step 3. Perform the Statistical Analysis with the stat_modelling.py module. 

Now we will perform the actual statistical modelling to compare the differences in the features between the closed and open WPD-loop. 

With this module we can calculate two different metrics to evaluate how different/similar each feature is when the protein is in the closed WPD-loop conformation or the open WPD-loop conformation. 

For this module you must only use two classes (i.e. binary classification). You can select what classes you want to use in the next code block.

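1. The Jensen-Shannon distance: This is a symmetric, bounded measure of how different two probability distributions are (it is the square root of the Jensen-Shannon divergence), with 0 meaning the two distributions are identical. After first generating probability distributions of the interaction strengths for each feature in each of the two states provided, it is calculated using the [implementation available in SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html).
2. Mutual information: This measures how much knowing the value of a feature reduces the uncertainty about which state the protein is in, with 0 meaning the feature and the state are independent. It is calculated using the [implementation available in Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html).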

In both cases, the higher the score, the more "different" the feature is when in the two different states.  
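As a rough illustration of the Jensen-Shannon distance, the sketch below compares synthetic interaction scores with SciPy's `jensenshannon` function. It is not the package's implementation: the probability distributions are approximated here with simple histograms on shared bins (an assumption of ours), and the data, the helper `js_distance` and the bin count are all hypothetical.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)

# Synthetic contact scores for one feature in each state (illustrative only).
closed_scores = rng.normal(loc=2.0, scale=0.5, size=5000)
open_scores = rng.normal(loc=4.0, scale=0.5, size=5000)      # clearly shifted
closed_like_scores = rng.normal(loc=2.0, scale=0.5, size=5000)  # same distribution


def js_distance(sample_a, sample_b, n_bins=50):
    """Jensen-Shannon distance between two samples, via histograms on shared bins."""
    lower = min(sample_a.min(), sample_b.min())
    upper = max(sample_a.max(), sample_b.max())
    bins = np.linspace(lower, upper, n_bins + 1)
    counts_a, _ = np.histogram(sample_a, bins=bins)
    counts_b, _ = np.histogram(sample_b, bins=bins)
    # jensenshannon normalises its inputs; base=2 bounds the result to [0, 1].
    return jensenshannon(counts_a, counts_b, base=2)


js_different = js_distance(closed_scores, open_scores)
js_similar = js_distance(closed_scores, closed_like_scores)
print(f"Shifted distributions: {js_different:.3f}")
print(f"Matching distributions: {js_similar:.3f}")
```

A feature whose score distribution shifts between the two states gives a distance close to 1, while a feature that behaves the same in both states gives a distance close to 0.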

In [8]:
stat_model = stat_modelling.ProteinStatModel(
    dataset=supervised_dataset.df_filtered, 
    class_names=["Closed", "Open"], # select the two class labels to compare. Has to be 2 labels. 
    out_dir="outputs/PTP1B_stat_analysis",
    interaction_types_included=["Hbond"]  # Other available types: "Saltbr", "Hydrophobic", "Other"
)

In [9]:
# First we can calculate the Jensen-Shannon distances (Took me about 20 seconds).
stat_model.calc_js_distances()

Jensen-Shannon (JS) distances calculated.
example_output_data/Jensen_Shannon_Per_Feature_Scores.csv written to disk.
You can also access these results via the class attribute: 'js_distances'.


In [10]:
# Now we can calculate the mutual information scores (Took me about a minute).
stat_model.calc_mutual_info_to_target()

Mutual information scores calculated.
example_output_data/Mutual_Information_Per_Feature_Scores.csv written to disk.
You can also access these results via the class attribute: 'mutual_infos'.
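To give a feel for what a mutual information score captures, the sketch below calls scikit-learn's `mutual_info_classif` directly on synthetic data. It is an illustration under our own assumptions rather than the package's code: the two toy features, the integer encoding of the states and all variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_frames = 1000

# Class labels: first half "Closed" (0), second half "Open" (1).
labels = np.array([0] * n_frames + [1] * n_frames)

# Feature 1 changes between the states; feature 2 does not.
informative = np.concatenate([
    rng.normal(loc=2.0, scale=0.5, size=n_frames),
    rng.normal(loc=4.0, scale=0.5, size=n_frames),
])
uninformative = rng.normal(loc=3.0, scale=0.5, size=2 * n_frames)
features = np.column_stack([informative, uninformative])

# One score per feature column; random_state makes the estimate reproducible.
mi_scores = mutual_info_classif(features, labels, random_state=0)
print(mi_scores)
```

The feature that shifts between the states receives a much higher score than the one that does not, which is the behaviour the per-feature scores above rely on.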


In [11]:
# As printed above we can access the results from these calculations from the class instance's (we called it stat_model) attributes. 
mi_results = stat_model.mutual_infos
js_results = stat_model.js_distances
# stat_model.__dict__.keys() # uncomment to see all attributes available. 

### Step 4. Work up the Statistical Analysis with the post_proccessing.py module. 

In this module we have access to 3 methods to analyse the results in more detail.

1. We can convert these results to the per-residue level by summing (and then normalising) the scores of the features that each residue is involved in.
This allows us to identify the residues with the largest overall difference in interactions between the two states. 

2. We can also try to predict the "direction" each feature favours. 

3. We can obtain the probability distributions of the interaction scores for a selected set of features to visually compare the distributions. 
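The per-residue idea in point 1 can be sketched in a few lines of plain Python. This is a simplified illustration, not the package's implementation: the feature scores are made up, the helper `per_residue_scores` is hypothetical, and normalising by the largest per-residue total is our assumption about what "normalising" means here.

```python
# Hypothetical per-feature scores, keyed by PyContact-style feature names.
feature_scores = {
    "1Glu 241Ser Hbond sc-bb": 0.40,
    "1Glu 4Lys Hbond bb-sc": 0.20,
    "2Met 4Lys Hbond bb-bb": 0.10,
}


def per_residue_scores(scores):
    """Sum each feature's score onto both of its residues, then scale so the max is 1."""
    totals = {}
    for feature, score in scores.items():
        # The first two whitespace-separated fields are the two residues.
        residue_1, residue_2 = feature.split()[:2]
        for residue in (residue_1, residue_2):
            totals[residue] = totals.get(residue, 0.0) + score
    largest = max(totals.values())
    return {residue: total / largest for residue, total in totals.items()}


per_res = per_residue_scores(feature_scores)
for residue, score in sorted(per_res.items(), key=lambda item: item[1], reverse=True):
    print(f"{residue}: {score:.3f}")
```

Here 1Glu takes part in the two highest-scoring features, so it ends up with the largest normalised score.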

In [12]:
# First generate an instance of the class. 
post_proc = post_proccessing.StatisticalPostProcessor(
    stat_model=stat_model,
    out_dir="outputs/PTP1B_stat_analysis"
)

# As you have already seen in the prior steps, we can take a look at class attributes as follows.
# Note that some of these attributes will be empty until we run the next few code blocks.
post_proc.__dict__.keys()

dict_keys(['stat_model', 'out_dir', 'per_residue_js_distances', 'per_residue_mutual_infos', 'feature_directions'])

In [13]:
# Now we can run the get_per_res_importance() method, changing the stat_method accordingly.
js_per_res_importances = post_proc.get_per_res_importance(
    stat_method="jenson_shannon")
mi_per_res_importances = post_proc.get_per_res_importance(
    stat_method="mutual_information")

example_output_data/Jenson_Shannon_Distances_Per_Residue.csv written to disk.
example_output_data/Mutual_Information_Scores_Per_Residue.csv written to disk.


In [14]:
post_proc.estimate_feature_directions()
# TODO - ESTIMATE RESIDUE DIRECTIONS TOO?



example_output_data/feature_direction_estimates.csv written to disk.
You can also access these predictions through the 'feature_directions' class attribute.


In [15]:
# Now we can obtain the probability distributions for the 10 most different features. 
x_values, selected_prob_distribs = post_proc.get_probability_distributions(
    number_features=10)

# TODO - DESCRIBE THE OUTPUT, MAKE A NICE PLOTLY GRAPH :) 

### Step 5. Project the Results onto Protein Structures with the pymol_projections.py module. 
 
Naturally, we may want to visualise some of the results we have generated above onto a protein structure. We can take advantage of
the functions provided in the pymol_projections.py module to do this. 

As the name suggests, this will output [PyMOL](https://pymol.org/) compatible Python scripts which can be run to represent the results
at the: 

1. Per feature level. Cylinders are drawn between the two residues of each feature, with the cylinder radii marking how strong the relative difference is. 
2. Per residue level. The alpha carbon of each residue is depicted as a sphere, with the sphere radii depicting how strong the relative difference is. 

In [16]:
# Write PyMOL compatible scripts for the per feature results.
# Simply swap between the two statistical methods as shown below. 
pymol_projections.project_pymol_top_features(
    per_feature_scores=stat_model.js_distances,
    model_name="jenson_shannon",
    numb_features=100,
    out_dir="outputs/PTP1B_stat_analysis"
)

pymol_projections.project_pymol_top_features(
    per_feature_scores=stat_model.mutual_infos,
    model_name="mutual_information",
    numb_features=100,
    out_dir="outputs/PTP1B_stat_analysis"
)

The file: example_output_data/jenson_shannon_Pymol_Per_Feature_Scores.py was written to disk.
The file: example_output_data/mutual_information_Pymol_Per_Feature_Scores.py was written to disk.


In [17]:
# Write PyMOL compatible scripts for the per residue results.
# Simply swap between the two statistical methods as shown below. 
pymol_projections.project_pymol_per_res_scores(
    per_res_scores=js_per_res_importances,
    model_name="jenson_shannon",
    out_dir="outputs/PTP1B_stat_analysis"
)

pymol_projections.project_pymol_per_res_scores(
    per_res_scores=mi_per_res_importances,
    model_name="mutual_information",
    out_dir="outputs/PTP1B_stat_analysis"
)

The file: example_output_data/jenson_shannon_Pymol_Per_Res_Scores.py was written to disk.
The file: example_output_data/mutual_information_Pymol_Per_Res_Scores.py was written to disk.


In [18]:

# TODO ADD Picture of the outputs here as an example. 
