### Tutorial: Regression ML and Statistical Analysis on the R1 and R4 KE07s

In this Jupyter notebook we will use the model_building.py and stat_modelling.py modules to perform machine learning (ML) and statistical analysis (both regression) on a Kemp eliminase enzyme (KE07). Our target variable is a continuous value (hence regression): the KE07's W50 Chi2 angle. 

This notebook will also cover all the pre- and post-processing steps required to prepare, analyse and visualise the results.

The dataset used here is for the R1 variant of KE07 and is the same data as was used in the manuscript. 


<center><img src="./miscellaneous/ke07_banner.png" style="width: 70%" /></center>




## Setup 

In [51]:
import sys # note temporary... 
sys.path.append("..") # note temporary...

import pandas as pd

from key_interactions_finder import pycontact_processing
from key_interactions_finder import data_preperation
from key_interactions_finder import stat_modelling
from key_interactions_finder import model_building
from key_interactions_finder import post_proccessing
from key_interactions_finder import pymol_projections


In [52]:
# TODO load in datasets...

In [53]:
# inputs
in_dir = r"C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\KE07_Workup\Raw_Datasets\R1_5d2w"
target_file = r"C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\KE07_Workup\Raw_Datasets\R1_5d2w\R1_5d2w_1in10_Trp50_Chi2.dat"

# Path to the variable we will use to filter frames with (optional addition for this system). 
w50_chi1_file = r"C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\KE07_Workup\Raw_Datasets\R1_5d2w\R1_5d2w_1in10_Trp50_Chi1.dat"

# output folders
stats_out_dir = r"C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_stat_analysis"
ml_out_dir = r"C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_ml"

### Preparation Step 1. Process PyContact files with the pycontact_processing.py module 

In this section we will work with the PyContact output files. Here we will merge our separate runs together and remove any false interactions that can be generated by the PyContact library. Note that these blocks cover the same frames but different ranges of residues, which is why they are merged horizontally. 

In [54]:
# List of PyContact files to process. 
pycontact_files_horizontal = [
    "PyContact_Per_Frame_Interactions_Block1.csv", "PyContact_Per_Frame_Interactions_Block2.csv",
    "PyContact_Per_Frame_Interactions_Block3.csv", "PyContact_Per_Frame_Interactions_Block4.csv",
    "PyContact_Per_Frame_Interactions_Block5.csv", "PyContact_Per_Frame_Interactions_Block6.csv",
    "PyContact_Per_Frame_Interactions_Block7.csv", "PyContact_Per_Frame_Interactions_Block8.csv",
    "PyContact_Per_Frame_Interactions_Block9.csv", "PyContact_Per_Frame_Interactions_Block10.csv",
    "PyContact_Per_Frame_Interactions_Block11.csv", "PyContact_Per_Frame_Interactions_Block12.csv",
    "PyContact_Per_Frame_Interactions_Block13.csv", "PyContact_Per_Frame_Interactions_Block14.csv",
    "PyContact_Per_Frame_Interactions_Block15.csv", "PyContact_Per_Frame_Interactions_Block16.csv",
    "PyContact_Per_Frame_Interactions_Block17.csv"
]


pycontact_dataset = pycontact_processing.PyContactInitializer(
    pycontact_files=pycontact_files_horizontal,
    multiple_files=True,
    merge_files_method="horizontal",  
    remove_false_interactions=True,
    in_dir=in_dir,
)

Your PyContact file(s) have been succefully processed.
You have 1699 features and 10001 observations.
The fully processed dataframe is accesible from the '.prepared_df' class attribute.


In [55]:
pycontact_dataset.prepared_df.head(3)

Unnamed: 0,1Leu 246Asn Hbond bb-sc,2Ala 246Asn Hbond bb-bb,2Ala 248Arg Hbond sc-sc,2Ala 247Val Other sc-bb,2Ala 218Asp Other bb-bb,3Lys 218Asp Hbond sc-bb,3Lys 246Asn Hbond bb-bb,3Lys 247Val Other bb-bb,3Lys 213Phe Hbond sc-sc,3Lys 245Val Other sc-sc,...,241Lys 247Val Hbond sc-sc,241Lys 244Gly Hbond bb-bb,241Lys 249Leu Hbond sc-sc,246Asn 0Ala Hbond sc-bb,241Lys 248Arg Hbond sc-bb,244Gly 0Ala Other bb-sc,241Lys 246Asn Hbond sc-sc,245Val 0Ala Other bb-sc,248Arg 0Ala Hbond sc-sc,247Val 0Ala Other bb-sc
0,1.41015,8.47489,0.8178,0.23936,0.19355,5.2061,2.61757,0.45713,6.19496,0.66955,...,1.43806,2.33001,5.25162,0.50716,0.0,0.0,0.0,0.0,0.0,0.0
1,0.60903,8.71948,0.15372,0.00953,0.10791,5.83633,3.65514,0.15333,6.69782,0.05481,...,0.82788,2.40022,3.90114,0.30295,0.0,0.0,0.0,0.0,0.0,0.0
2,4.12849,4.07022,0.01666,0.0,0.11559,6.32922,3.34465,0.38738,7.31453,0.9187,...,0.4246,2.86327,3.4879,1.09194,0.0,0.0,0.0,0.0,0.0,0.0


If we look at the dataframe above we can see an issue: there is a residue named 0Ala instead of 1Ala. This can sometimes happen with PyContact, depending on the type of parameter and trajectory file used (residues are renumbered from 0). 

If this happens to you, the function below will take care of the problem. In this case we just need to add 1 to each residue number in every column name to resolve the issue. 

In [56]:
prepared_renumberd_df = pycontact_processing.modify_column_residue_numbers(
    dataset=pycontact_dataset.prepared_df, 
    constant_to_add=1 # can be positive or negative
)

prepared_renumberd_df.head(3)

Unnamed: 0,2Leu 247Asn Hbond bb-sc,3Ala 247Asn Hbond bb-bb,3Ala 249Arg Hbond sc-sc,3Ala 248Val Other sc-bb,3Ala 219Asp Other bb-bb,4Lys 219Asp Hbond sc-bb,4Lys 247Asn Hbond bb-bb,4Lys 248Val Other bb-bb,4Lys 214Phe Hbond sc-sc,4Lys 246Val Other sc-sc,...,242Lys 248Val Hbond sc-sc,242Lys 245Gly Hbond bb-bb,242Lys 250Leu Hbond sc-sc,247Asn 1Ala Hbond sc-bb,242Lys 249Arg Hbond sc-bb,245Gly 1Ala Other bb-sc,242Lys 247Asn Hbond sc-sc,246Val 1Ala Other bb-sc,249Arg 1Ala Hbond sc-sc,248Val 1Ala Other bb-sc
0,1.41015,8.47489,0.8178,0.23936,0.19355,5.2061,2.61757,0.45713,6.19496,0.66955,...,1.43806,2.33001,5.25162,0.50716,0.0,0.0,0.0,0.0,0.0,0.0
1,0.60903,8.71948,0.15372,0.00953,0.10791,5.83633,3.65514,0.15333,6.69782,0.05481,...,0.82788,2.40022,3.90114,0.30295,0.0,0.0,0.0,0.0,0.0,0.0
2,4.12849,4.07022,0.01666,0.0,0.11559,6.32922,3.34465,0.38738,7.31453,0.9187,...,0.4246,2.86327,3.4879,1.09194,0.0,0.0,0.0,0.0,0.0,0.0
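
For intuition, the renumbering above amounts to shifting the number at the start of each residue token in every column name. The sketch below is not the library's implementation: `shift_residue_numbers` is a hypothetical helper, and it assumes that residue tokens always follow the `<number><three-letter residue>` pattern seen in the tables above.

```python
import re

# Hypothetical helper (not part of key_interactions_finder): shift every
# residue number in a PyContact-style column name by a constant.
RESIDUE_TOKEN = re.compile(r"\b(\d+)([A-Za-z]{3})\b")


def shift_residue_numbers(column_name: str, constant_to_add: int) -> str:
    """Return the column name with each residue number shifted."""
    return RESIDUE_TOKEN.sub(
        lambda match: f"{int(match.group(1)) + constant_to_add}{match.group(2)}",
        column_name,
    )


old_columns = ["1Leu 246Asn Hbond bb-sc", "246Asn 0Ala Hbond sc-bb"]
new_columns = [shift_residue_numbers(name, 1) for name in old_columns]
print(new_columns)
```

The interaction type ("Hbond") and the chain labels ("bb-sc") contain no leading digits, so they are left untouched.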


Now we will add one more column to our dataframe, which corresponds to the W50 Chi1 angle. This will allow us to filter the dataframe to remove those frames that belong to "Conformation B". The reasons for this are described in full in the paper, but this is essentially to help focus the analysis on the differences between the "A" and "C" states. 
- Note the target variable is the W50 Chi**2** angle, and we will filter data on the Chi**1** angle.

In [57]:
just_chi1_df = pd.read_csv(w50_chi1_file)
just_chi1_df = just_chi1_df.set_axis(["W50Chi1"], axis=1)

chi1_df = pd.concat([just_chi1_df, prepared_renumberd_df], axis=1)
chi1_df.head(3)

Unnamed: 0,W50Chi1,2Leu 247Asn Hbond bb-sc,3Ala 247Asn Hbond bb-bb,3Ala 249Arg Hbond sc-sc,3Ala 248Val Other sc-bb,3Ala 219Asp Other bb-bb,4Lys 219Asp Hbond sc-bb,4Lys 247Asn Hbond bb-bb,4Lys 248Val Other bb-bb,4Lys 214Phe Hbond sc-sc,...,242Lys 248Val Hbond sc-sc,242Lys 245Gly Hbond bb-bb,242Lys 250Leu Hbond sc-sc,247Asn 1Ala Hbond sc-bb,242Lys 249Arg Hbond sc-bb,245Gly 1Ala Other bb-sc,242Lys 247Asn Hbond sc-sc,246Val 1Ala Other bb-sc,249Arg 1Ala Hbond sc-sc,248Val 1Ala Other bb-sc
0,193.934,1.41015,8.47489,0.8178,0.23936,0.19355,5.2061,2.61757,0.45713,6.19496,...,1.43806,2.33001,5.25162,0.50716,0.0,0.0,0.0,0.0,0.0,0.0
1,188.05,0.60903,8.71948,0.15372,0.00953,0.10791,5.83633,3.65514,0.15333,6.69782,...,0.82788,2.40022,3.90114,0.30295,0.0,0.0,0.0,0.0,0.0,0.0
2,184.593,4.12849,4.07022,0.01666,0.0,0.11559,6.32922,3.34465,0.38738,7.31453,...,0.4246,2.86327,3.4879,1.09194,0.0,0.0,0.0,0.0,0.0,0.0


### Preparation Step 2. Prepare the Dataset for calculations with the data_preperation.py module. 

In this step, we take our dataframe and merge our per frame target file to it.

We can also optionally perform several forms of filtering on the PyContact features to select what types of interactions we would like to study.  

In [58]:
# Generate supervised dataset instance as we have a target variable.  
supervised_dataset = data_preperation.SupervisedFeatureData(
    input_df=chi1_df,
    target_file=target_file,
    is_classification=False,
    header_present=True 
)

supervised_dataset.df_processed.head(3)

Your PyContact features and target variable have been succesufully merged.
You can access this dataset through the class attribute: '.df_processed'.


Unnamed: 0,Target,W50Chi1,2Leu 247Asn Hbond bb-sc,3Ala 247Asn Hbond bb-bb,3Ala 249Arg Hbond sc-sc,3Ala 248Val Other sc-bb,3Ala 219Asp Other bb-bb,4Lys 219Asp Hbond sc-bb,4Lys 247Asn Hbond bb-bb,4Lys 248Val Other bb-bb,...,242Lys 248Val Hbond sc-sc,242Lys 245Gly Hbond bb-bb,242Lys 250Leu Hbond sc-sc,247Asn 1Ala Hbond sc-bb,242Lys 249Arg Hbond sc-bb,245Gly 1Ala Other bb-sc,242Lys 247Asn Hbond sc-sc,246Val 1Ala Other bb-sc,249Arg 1Ala Hbond sc-sc,248Val 1Ala Other bb-sc
0,85.8483,193.934,1.41015,8.47489,0.8178,0.23936,0.19355,5.2061,2.61757,0.45713,...,1.43806,2.33001,5.25162,0.50716,0.0,0.0,0.0,0.0,0.0,0.0
1,93.7505,188.05,0.60903,8.71948,0.15372,0.00953,0.10791,5.83633,3.65514,0.15333,...,0.82788,2.40022,3.90114,0.30295,0.0,0.0,0.0,0.0,0.0,0.0
2,102.094,184.593,4.12849,4.07022,0.01666,0.0,0.11559,6.32922,3.34465,0.38738,...,0.4246,2.86327,3.4879,1.09194,0.0,0.0,0.0,0.0,0.0,0.0


##### Optional Feature Filtering

In the above dataframe we have 1701 columns (so 1700 features, counting the W50Chi1 column we added, + 1 target). We can take all of these forward for the statistical analysis or we can perform some filtering in advance (the choice is yours). 
There are four built-in methods available to you to perform filtering:

1. **filter_by_occupancy(min_occupancy)** - Remove features that have a %occupancy less than the provided cut-off. %Occupancy is the % of frames with a non-zero value, i.e. the interaction is present in that frame.

2. **filter_by_interaction_type(interaction_types_included)** - PyContact defines four types of interactions ("Hbond", "Saltbr", "Hydrophobic", "Other"). You select the interactions you want to include.

3. **filter_by_main_or_side_chain(main_side_chain_types_included)** - PyContact can also define if each interaction is primarily from the backbone or side-chain for each residue. You select the interaction combinations you want to include. Options are: "bb-bb", "sc-sc", "bb-sc", "sc-bb", where bb = backbone and sc = sidechain.

4. **filter_by_avg_strength(average_strength_cut_off)** - PyContact calculates a per-frame contact score/strength for each interaction. You can filter features by their average score. Features with an average below the cut-off are removed. 

Finally, if at any point you want to reset any filtering you've already performed, you can use the following method: 

5. **reset_filtering()** 
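
To make the filtering criteria concrete, here is a minimal pandas sketch on a toy dataframe. It does not use the library's methods: the column names and values are made up, and treating the cut-offs as inclusive (`>=`) is an assumption, since the behaviour exactly at the cut-off is not stated above.

```python
import pandas as pd

# Toy data: three hypothetical features over four frames.
toy_df = pd.DataFrame({
    "2Leu 247Asn Hbond bb-sc": [1.4, 0.6, 4.1, 2.0],  # present in every frame
    "3Ala 249Arg Hbond sc-sc": [0.0, 0.0, 0.0, 0.9],  # present in 1 of 4 frames
    "3Ala 248Val Other sc-bb": [0.2, 0.1, 0.3, 0.2],  # always present but weak
})

# %occupancy: percentage of frames in which the feature is non-zero.
occupancy = (toy_df != 0).mean() * 100
after_occupancy = toy_df.loc[:, occupancy >= 50]

# Average strength: mean per-frame score of each remaining feature.
avg_strength = after_occupancy.mean()
after_strength = after_occupancy.loc[:, avg_strength >= 0.5]

# Interaction type: the third space-separated token of the column name.
hbond_only = [col for col in toy_df.columns if col.split(" ")[2] == "Hbond"]

print(list(after_occupancy.columns))
print(list(after_strength.columns))
print(hbond_only)
```

With a 50% occupancy cut-off the second feature is dropped, and with an average strength cut-off of 0.5 the weak third feature is dropped as well.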


In [59]:
supervised_dataset.reset_filtering() 
print(f"Number of features before any filtering: {len(supervised_dataset.df_processed.columns) - 1}")

# Features with a %occupancy of less than 25% are removed. 
supervised_dataset.filter_by_occupancy(min_occupancy=25)
print(f"Number of features after filtering by occupancy: {len(supervised_dataset.df_filtered.columns) - 1}")

# Features with an average interaction strength less than 0.5 will be removed. 
supervised_dataset.filter_by_avg_strength(
    average_strength_cut_off=0.5,  
)
print(f"Number of features after filtering by average interaction scores: {len(supervised_dataset.df_filtered.columns) - 1}")

Number of features before any filtering: 1700
Number of features after filtering by occupancy: 894
Number of features after filtering by average interaction scores: 674


Now if we look at the class attributes of our SupervisedFeatureData() instance (we called it: supervised_dataset) using the special "\_\_dict__" attribute, we can see two dataframes we could use in the statistical analysis to follow. 

In [60]:
supervised_dataset.__dict__.keys()

dict_keys(['input_df', 'is_classification', 'target_file', 'header_present', 'df_processed', 'df_filtered'])

They are: 
- 'df_processed' - The unfiltered dataframe, 1700 features
- 'df_filtered' - The filtered dataframe. Fewer than 1700 features. 

In the following sections we will use the filtered dataframe but either dataframe could be justified based on your goals. 

As described above, we will filter frames to remove those that belong to conformation B, using the W50 chi1 angle to define this.

In [61]:
# Remove all frames with W50 Chi1 <= 160 or >= 240. 
chi1_chi2_df = supervised_dataset.df_filtered
filtered_df = chi1_chi2_df[( (chi1_chi2_df.W50Chi1 > 160) & (chi1_chi2_df.W50Chi1 < 240) )]

# Now remove "W50Chi1" as its job is done. 
df_ready = (filtered_df.drop("W50Chi1", axis=1)).reset_index(drop=True) 

print(f"Rows before filtering by W50 Chi1: {len(supervised_dataset.df_filtered)}")
print(f"Rows after filtering by W50 Chi1: {len(df_ready)}")
df_ready.head(3)

Rows before filtering by W50 Chi1: 10001
Rows after filtering by W50 Chi1: 9281


Unnamed: 0,Target,2Leu 247Asn Hbond bb-sc,3Ala 247Asn Hbond bb-bb,3Ala 249Arg Hbond sc-sc,3Ala 248Val Other sc-bb,4Lys 219Asp Hbond sc-bb,4Lys 247Asn Hbond bb-bb,4Lys 248Val Other bb-bb,4Lys 214Phe Hbond sc-sc,4Lys 246Val Other sc-sc,...,241Leu 244His Hbond bb-bb,241Leu 246Val Hbond bb-bb,241Leu 248Val Hydrophobic sc-sc,241Leu 245Gly Hbond bb-bb,235Arg 250Leu Hbond sc-sc,242Lys 246Val Hbond bb-bb,242Lys 248Val Hbond sc-sc,242Lys 245Gly Hbond bb-bb,242Lys 250Leu Hbond sc-sc,247Asn 1Ala Hbond sc-bb
0,85.8483,1.41015,8.47489,0.8178,0.23936,5.2061,2.61757,0.45713,6.19496,0.66955,...,4.59161,6.67933,0.68318,1.87831,0.0,1.14998,1.43806,2.33001,5.25162,0.50716
1,93.7505,0.60903,8.71948,0.15372,0.00953,5.83633,3.65514,0.15333,6.69782,0.05481,...,5.36271,7.96774,1.33746,1.35949,0.0,1.19067,0.82788,2.40022,3.90114,0.30295
2,102.094,4.12849,4.07022,0.01666,0.0,6.32922,3.34465,0.38738,7.31453,0.9187,...,4.09565,5.89572,2.41967,2.42847,0.0,0.08954,0.4246,2.86327,3.4879,1.09194


As can be seen above, we have now filtered our dataframe as desired and removed the W50Chi1 column, so it is now ready for the stats and ML analysis.

## Analysis Time! 

Our dataset is now ready for either statistical analysis or ML or both! 

### Part 1.1. Perform Statistical Analysis with the stat_modelling.py module. 

Now we will perform the actual statistical modelling to assess how each feature relates to the target variable. 

With this module, we can calculate two different metrics to evaluate how dependent each feature is on the target variable's value. They are:

1. The [mutual information](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html) using the implementation available in Scikit-learn. The mutual information can capture any kind of dependency/relationship between variables and score their dependency.

2. The linear correlation. I assume this does not need an introduction. 

In both cases, the higher the absolute value, the more dependent the feature is on the target variable. The mutual information scores are in the range 0 to 1, whilst the linear correlation scores are in the range -1 to 1. 
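
The difference between the two metrics is easiest to see on a toy example. The sketch below assumes that "linear correlation" means the Pearson correlation coefficient (the module's exact implementation is not shown here) and computes it by hand with the standard library. A feature that is fully determined by the target, but not linearly, scores zero, which is the kind of relationship the mutual information is meant to pick up.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


target = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_feature = [2 * t + 1 for t in target]   # straight-line relationship
inverse_feature = [-t for t in target]         # perfectly anti-correlated
quadratic_feature = [t ** 2 for t in target]   # fully determined, but not linear

r_linear = pearson_correlation(linear_feature, target)
r_inverse = pearson_correlation(inverse_feature, target)
r_quadratic = pearson_correlation(quadratic_feature, target)
print(round(r_linear, 3), round(r_inverse, 3), round(r_quadratic, 3))
```

The first two features give correlations of (approximately) 1 and -1, while the quadratic feature gives 0 despite being perfectly predictable from the target.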

In [62]:
# Now time for stat regression analysis
stat_model = stat_modelling.RegressionStatModel(
    dataset=df_ready, # The dataframe we just made. 
    out_dir=stats_out_dir,
    interaction_types_included=["Hbond", "Saltbr", "Hydrophobic", "Other"] 
)

Now we can determine the values for our two metrics. These won't take long to run (maybe 2 mins in total).

In [63]:
stat_model.calc_mutual_info_to_target()

Mutual information scores calculated.
C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_stat_analysis/Mutual_Information_Per_Feature_Scores.csv written to disk.
You can also access these results via the class attribute: 'mutual_infos'.


In [64]:
stat_model.calc_linear_correl_to_target()

Linear correlations calculated.
C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_stat_analysis/Linear_Correlations_Per_Feature_Scores.csv written to disk.
You can also access these results via the class attribute: 'linear_correlations'.


In [65]:
# As printed above we can access the results from these calculations from the class instance's (we called it stat_model) attributes. 
mi_results = stat_model.mutual_infos
lc_results = stat_model.linear_correlations
# stat_model.__dict__.keys() # uncomment to see all attributes available. 

### Part 1.2. Work up the Statistical Analysis with the post_proccessing.py module. 

In this module we can convert the per-feature importances to per-residue importances, by summing (and then normalising) every feature importance score that each residue is involved in. This allows us to identify the residues which seem to differ the most between the states. 
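
As a rough illustration of the summing and normalising step, consider the sketch below. It is not the library's code: the feature scores are made up, and both the use of absolute values and the normalisation to the highest-scoring residue are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical per-feature scores (feature name -> score), for illustration only.
feature_scores = {
    "2Leu 247Asn Hbond bb-sc": 0.30,
    "3Ala 247Asn Hbond bb-bb": -0.10,
    "4Lys 219Asp Hbond sc-bb": 0.20,
}

per_residue = defaultdict(float)
for feature, score in feature_scores.items():
    # The first two tokens name the two residues taking part in the interaction.
    for token in feature.split(" ")[:2]:
        residue_number = int(re.match(r"\d+", token).group())
        per_residue[residue_number] += abs(score)

# Normalise so that the highest-scoring residue has a value of 1.
max_score = max(per_residue.values())
per_residue_normalised = {
    residue: total / max_score for residue, total in sorted(per_residue.items())
}
print(per_residue_normalised)
```

Residue 247 takes part in two of the three interactions, so it collects the largest summed score and ends up with a normalised value of 1.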


In [66]:
# First generate an instance of the class. 
post_proc = post_proccessing.StatRegressorPostProcessor(
    stat_model=stat_model,
    out_dir=stats_out_dir
)

In [67]:
# Now we can run the get_per_res_importance() method, changing the stat_method accordingly.
mi_per_res_importances = post_proc.get_per_res_importance(
    stat_method="mutual_information")

lin_correl_per_res_importances = post_proc.get_per_res_importance(
    stat_method="linear_correlation")

C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_stat_analysis/Mutual_Information_Scores_Per_Residue.csv written to disk.
C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_stat_analysis/Linear_Correlation_Scores_Per_Residue.csv written to disk.


The per-residue and per-feature importances are not just saved to disk but are also available as variables, so you can analyse them within Python using whatever graphing library you like. 

For inspiration feel free to take a look at the article or the tutorial "Tutorial_PTP1B_Classification_ML_Stats.ipynb". 

### Part 1.3. Project the Results onto Protein Structures with the pymol_projections.py module. 
 
Naturally, we may want to visualise some of the results we have generated above onto a protein structure. We can take advantage of
the functions provided in the pymol_projections.py module to do this. 

As the name suggests this will output [PyMOL](https://pymol.org/) compatible python scripts which can be run to represent the results
at the: 

1. Per feature level. Cylinders are drawn between the two residues of each feature, with the cylinder radii marking how strong the relative difference is. 
2. Per residue level. The carbon alpha of each residue will be depicted as a sphere, with the sphere radii depicting how strong the relative difference is. 
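
To give a feel for the per-residue representation, the sketch below turns a few made-up scores into PyMOL command-language lines that show each carbon alpha as a sphere and scale it by the score. This is only an illustration of the idea, not the script that pymol_projections.py writes: the scores, the `MAX_SPHERE_SCALE` constant and the linear scaling are all assumptions.

```python
# Hypothetical normalised per-residue scores (residue number -> score in [0, 1]).
per_res_scores = {50: 1.0, 101: 0.45, 202: 0.1}

MAX_SPHERE_SCALE = 2.0  # assumed sphere_scale for the highest-scoring residue

lines = []
for residue, score in sorted(per_res_scores.items()):
    selection = f"resi {residue} and name CA"
    # Show the carbon alpha as a sphere and scale it by the residue's score.
    lines.append(f"show spheres, {selection}")
    lines.append(f"set sphere_scale, {score * MAX_SPHERE_SCALE:.2f}, {selection}")

pymol_script = "\n".join(lines)
print(pymol_script)
```

Lines like these could be pasted into the PyMOL command line or saved to a `.pml` file after loading a structure.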

In [68]:
# Write PyMOL compatible scripts for the per feature results.
# Simply swap between the two statistical methods as shown below. 
pymol_projections.project_pymol_top_features(
    per_feature_scores=stat_model.linear_correlations,
    model_name="linear_correlation",
    numb_features=100, # top features to project, set to any integer or "all" for all features. 
    out_dir=stats_out_dir
)

pymol_projections.project_pymol_top_features(
    per_feature_scores=stat_model.mutual_infos,
    model_name="mutual_information",
    numb_features=100, # top features to project, set to any integer or "all" for all features. 
    out_dir=stats_out_dir
)

The file: C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_stat_analysis/linear_correlation_Pymol_Per_Feature_Scores.py was written to disk.
The file: C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_stat_analysis/mutual_information_Pymol_Per_Feature_Scores.py was written to disk.


In [69]:
# Write PyMOL compatible scripts for the per residue results.
# Simply swap between the two statistical methods as shown below. 
pymol_projections.project_pymol_per_res_scores(
    per_res_scores=lin_correl_per_res_importances,
    model_name="linear_correlation",
    out_dir=stats_out_dir
)  

pymol_projections.project_pymol_per_res_scores(
    per_res_scores=mi_per_res_importances,
    model_name="mutual_information",
    out_dir=stats_out_dir
)

The file: C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_stat_analysis/linear_correlation_Pymol_Per_Res_Scores.py was written to disk.
The file: C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_stat_analysis/mutual_information_Pymol_Per_Res_Scores.py was written to disk.


We are now finished with the stats module. Here is an example of the kind of figure you can make with the PyMOL projections generated:


<center><img src="miscellaneous/KE07_Example_Output.png" style="width: 70%" /></center>

### Part 2.1 Perform Machine Learning (ML) with the model_building.py module. 

Now we will use ML to generate models trained on our target variable. 

With this module, we use the feature importance scores from each ML model to evaluate how strongly each feature is related to the target variable.

**In the paper we used three ensemble based regression models:**

1. [Categorical Boosting](https://catboost.ai/) - (Referred to as: CatBoost)

2. [Extreme Gradient Boosting](https://xgboost.readthedocs.io/en/stable/) - (Referred to as: XGBoost)

3. [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) - (Referred to as: Random_Forest)

**In this tutorial, we will only use the CatBoost algorithm, to reduce the time required (from maybe an hour to less than 5 mins).**



In all cases, the higher the score, the more important the feature is to the model's prediction of the target variable.

We can use the same dataframe that we used for the statistical analysis in Part 1.1 below.

In [74]:
# Instantiate the RegressionModel class. 
# Clearly there are many parameters here, using your IDE you can hover over RegressionModel to see what each parameter does. 
ml_model = model_building.RegressionModel(
    dataset=df_ready,
    evaluation_split_ratio=0.15,
    models_to_use=["CatBoost"], # "XGBoost", "Random_Forest"] # You can add the other methods back in if you want. 
    scaling_method="min_max",
    out_dir=ml_out_dir, 
    cross_validation_splits=5, 
    cross_validation_repeats=3,
    search_approach="none",
)


Below is a summary of the machine learning you have planned.
You will use 5-fold cross validation and perform 3 repeats.
You will use up to 674 features to build each model, with 85.0% of your data used for training the model, which is 7888 observations. 
15.0% of your data will be used for evaluating the best models produced by the 5-fold cross validation, which is 1393 observations.
You have selected to build 1 machine learning model(s), with the following hyperparameters: 
 
A CatBoost model, with grid search parameters: 
{'iterations': [100]} 

If you're happy with the above, lets get model building!
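
The numbers in the summary above can be reproduced with a little arithmetic. The sketch below assumes that the size of the evaluation set is rounded up, as scikit-learn's `train_test_split` does; whether the module splits the data in exactly this way is an assumption.

```python
import math

n_observations = 9281          # rows left after the W50 Chi1 filtering
evaluation_split_ratio = 0.15
cv_splits, cv_repeats = 5, 3

# Assumption: the evaluation (hold-out) set size is rounded up.
n_evaluation = math.ceil(n_observations * evaluation_split_ratio)
n_training = n_observations - n_evaluation

# Each repeat of k-fold cross validation fits one model per fold.
n_cv_fits = cv_splits * cv_repeats

print(n_training, n_evaluation, n_cv_fits)  # 7888 1393 15
```

So for each set of hyperparameters, 15 models are fitted on the 7888 training observations, and the 1393 evaluation observations are only used afterwards.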


Now we can go ahead and build the models.

We have one optional parameter in the command below, which is to save the models generated. This can be useful if you ever want to come back and do the post-processing later.

If you set this to True, all the files required will be saved to a folder called "temporary_files" in your current working directory. 

With the current setup this calculation will not take long (just under 3 minutes in the run shown below). However, you could perform a very exhaustive calculation using grid search CV (possible by changing the "search_approach" parameter in model_building.RegressionModel()), in which case saving the models might be useful.

In [75]:
ml_model.build_models(save_models=True)

Model saved to disk at: temporary_files/CatBoost_Model.pickle
Model building complete, returning final results with train/test datasets to you.


Unnamed: 0,model,best_params,best_score,best_standard_deviation,Time taken to build model (minutes)
0,CatBoost,{'iterations': 100},0.862531,0.008974,2.77


With the model now built, we can see how long it took to build, along with the best cross-validation score and its standard deviation. 

We can now evaluate the quality of the model on the validation dataset (also sometimes referred to as the hold-out set).

In [76]:
reports = ml_model.evaluate_models()
reports

Unnamed: 0,Model,Explained Variance,Mean Absolute Error,MSE,RMSE,Mean Squared Log Error,r squared
0,CatBoost,0.8762,17.7497,685.7573,26.187,0.0414,0.8762


The report produced is a dataframe with 6 regression metrics to enable you to evaluate the quality of the model. If you had built more than one model, this dataframe would contain an additional row for each additional model.

MSE and RMSE stand for the mean squared error and the root mean squared error respectively.

Personally, I think the Mean Absolute Error (MAE) and RMSE are very useful metrics as they have the same units as your target variable, meaning it is quite easy to use them to think about how good your model is. 
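
For reference, here is how these metrics are defined, worked through by hand on four made-up frames. The values are hypothetical and only meant to show how MAE, MSE, RMSE and r squared relate to one another.

```python
import math

# Hypothetical true and predicted Chi2 angles for four frames.
y_true = [80.0, 100.0, 260.0, 280.0]
y_pred = [90.0, 100.0, 250.0, 300.0]

errors = [pred - true for true, pred in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # same units as the target

# r squared: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / len(y_true)
total_sum_squares = sum((t - mean_true) ** 2 for t in y_true)
r_squared = 1 - sum(e ** 2 for e in errors) / total_sum_squares

print(mae, mse, round(rmse, 2), round(r_squared, 4))
```

Because the errors are squared before averaging, the single 20-unit miss pulls the RMSE above the MAE, which is why comparing the two gives a quick hint about outliers.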

### Part 2.2. Work up the ML results with the post_proccessing.py module. 

In order to perform the analysis we will need to provide the models generated in Part 2.1. Shown below are the two possible ways to do this. 

In [77]:
# First we will make an instance of the SupervisedPostProcessor class.
ml_post_proc = post_proccessing.SupervisedPostProcessor(
    out_dir=ml_out_dir,
)

# Option 1 - Load models from the instance of the SupervisedModel class. 
ml_post_proc.load_models_from_instance(supervised_model=ml_model)

# Option 2 - Load models from disk. (If you've run the model building, shut down the kernel and now want to post-process).
#ml_post_proc.load_models_from_disk(models_to_use=["XGBoost", "CatBoost", "Random_Forest"]) 

In [78]:
# After preparing the class we can now determine the feature importances for each model made.
ml_post_proc.get_feature_importance()

C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_ml/CatBoost_Feature_Importances.csv written to disk.
All feature importances have now been saved to disk.


In [79]:
# We can also project these per feature importances onto the per-residue level. 
ml_post_proc.get_per_res_importance()

C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_ml/CatBoost_Per_Residue_Importances.csv written to disk.
All per residue importance scores have now been saved to disk.


In [80]:
print(ml_post_proc.__dict__.keys())

dict_keys(['out_dir', 'feat_names', 'best_models', 'all_feature_importances', 'all_per_residue_scores'])


If we take a look at the class attributes we can see the per feature and per residue importances were not just saved to disk, but are also stored inside the class.

This means you could easily analyse them within Python if you want. 

In [81]:
all_per_res_scores = ml_post_proc.all_per_residue_scores
all_feature_scores = ml_post_proc.all_feature_importances

For ideas on how to work up these results please see the manuscript or our other tutorial: "Tutorial_PTP1B_Classification_ML_Stats.ipynb"

### Part 2.3. Project the Results onto Protein Structures with the pymol_projections.py module. 
 
This section is essentially identical to Part 1.3, except that now we will output the ML results instead of the stats results.

In [83]:
# Here you do not need to specify what model you would like to output results for, all will be outputted simultaneously.
pymol_projections.project_multiple_per_res_scores(
    all_per_res_scores=ml_post_proc.all_per_residue_scores,
    out_dir=ml_out_dir
)

pymol_projections.project_multiple_per_feature_scores(
    all_feature_scores=ml_post_proc.all_feature_importances,
    numb_features=100, # top features to project, set to any integer or "all" for all features.  
    out_dir=ml_out_dir
)

The file: C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_ml/CatBoost_Pymol_Per_Res_Scores.py was written to disk.
The file: C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\MachLearnConformationalFeatures\Workup\Tutorial_outputs\KE07_ml/CatBoost_Pymol_Per_Feature_Scores.py was written to disk.


Done!