### Example of how to process a dataset generated using the PyContact GUI 

This Jupyter notebook gives a short example of how to use the "pycontact_processing.py" module to load a PyContact dataset generated via the PyContact GUI. Please note that we recommend using the ["run_pycontact.py"](https://github.com/kamerlinlab/key-interactions-finder/blob/main/key_interactions_finder/run_pycontact.py) script provided in this repo instead. The GUI offers many possible levels of detail when exporting contact data, which makes it hard to handle the output in a consistent way.


**Note:** The example data used in this tutorial (which we know works) was exported as a plain text file with all 6 check boxes ticked.

**Note:** Unlike the recommended approach, the GUI output does not preserve whether an interaction comes from the side chain (sc) or the backbone (bb), so every feature name will have "bb-bb" appended to it.
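To make the feature naming concrete, here is a minimal sketch of how a column name such as "27Asp 15Arg Saltbr bb-bb" can be split into its parts. The `ContactFeature` container and the `parse_feature_name` helper are hypothetical names introduced for illustration; they are not part of key_interactions_finder.

```python
from typing import NamedTuple


class ContactFeature(NamedTuple):
    """Hypothetical container for the four parts of one feature (column) name."""
    residue_1: str
    residue_2: str
    interaction_type: str
    atom_groups: str


def parse_feature_name(name: str) -> ContactFeature:
    """Split a name such as '15Arg 10Ile Other bb-bb' into its four fields."""
    parts = name.split()
    if len(parts) != 4:
        raise ValueError(f"Expected 4 space-separated fields, got: {name!r}")
    return ContactFeature(*parts)


feature = parse_feature_name("27Asp 15Arg Saltbr bb-bb")
print(feature.interaction_type)  # Saltbr
print(feature.atom_groups)       # bb-bb (always this value for GUI output)
```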


In [1]:
import sys  # Temporary workaround: needed only to extend the import path below.
sys.path.append("..")  # Temporary workaround: add the parent directory to the import path.

from key_interactions_finder import pycontact_processing
from key_interactions_finder import data_preperation

In [2]:
# First, define the input directory and the input file names.
in_dir="datasets/Example_PyContact_Session_Output"
pycontact_gui_file = "R1_5d2w_Contacts.csv"
classifications_file = "datasets/Example_PyContact_Session_Output/R1_5d2w_Classified.csv"

In [3]:
# Here, we need to define the "pycontact_output_type" as coming from the GUI. 
pycontact_dataset = pycontact_processing.PyContactInitializer(
    pycontact_files=pycontact_gui_file,
    in_dir=in_dir,
    pycontact_output_type="GUI",
    multiple_files=False,
    remove_false_interactions=True,
)

Your PyContact file(s) have been succefully processed.
You have 1456 features and 10001 observations.
The fully processed dataframe is accesible from the '.prepared_df' class attribute.


In [4]:
pycontact_dataset.prepared_df.head(3)

Unnamed: 0,15Arg 10Ile Other bb-bb,16Val 10Ile Other bb-bb,16Val 11Val Other bb-bb,16Val 9Leu Hydrophobic bb-bb,17Val 10Ile Hbond bb-bb,17Val 11Val Other bb-bb,17Val 12Lys Other bb-bb,27Asp 15Arg Saltbr bb-bb,28Ser 22Phe Hbond bb-bb,28Ser 20Ser Hbond bb-bb,...,200His 102Asn Other bb-bb,112Ile 103Thr Other bb-bb,148Thr 104Ala Other bb-bb,231Ile 5Ile Hydrophobic bb-bb,83His 52Thr Other bb-bb,231Ile 178Lys Other bb-bb,209Phe 6Ile Other bb-bb,167Ile 152Leu Hydrophobic bb-bb,233Val 7Ala Hydrophobic bb-bb,126Val 103Thr Other bb-bb
0,0.09703,2.57596,0.31338,0.17245,6.29711,1.86227,1.0542,5.73353,0.69356,9.6895,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.25009,2.59637,0.23426,0.05546,4.83585,2.41831,1.91265,7.03353,1.69754,7.36098,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.13662,2.20884,0.25849,0.2108,3.38742,0.36503,0.6635,6.49379,0.98932,6.09595,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
# Renumber the residues so that numbering starts from 1 instead of 0.
prepared_renumberd_df = pycontact_processing.modify_column_residue_numbers(
    dataset=pycontact_dataset.prepared_df, 
    constant_to_add=1
)
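As an illustration of what the renumbering does to each column name, the sketch below shifts every residue number in a feature name by a constant. The `shift_residue_numbers` function is a hypothetical stand-in written for this tutorial, not the library's implementation of `modify_column_residue_numbers`, and it assumes that every residue token is a number followed directly by letters.

```python
import re


def shift_residue_numbers(column_name: str, constant_to_add: int) -> str:
    """Add a constant to each number that begins a residue token (e.g. '15Arg')."""
    return re.sub(
        r"\b(\d+)(?=[A-Za-z])",  # digits at the start of a token, followed by a letter
        lambda match: str(int(match.group(1)) + constant_to_add),
        column_name,
    )


print(shift_residue_numbers("15Arg 10Ile Other bb-bb", 1))
# 16Arg 11Ile Other bb-bb
```

The interaction type and the "bb-bb" suffix contain no digits, so they pass through unchanged.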

In [6]:
supervised_dataset = data_preperation.SupervisedFeatureData(
    input_df=prepared_renumberd_df,
    target_file=classifications_file,
    is_classification=True,
    header_present=True 
)
supervised_dataset.df_processed

Your PyContact features and target variable have been succesufully merged.
You can access this dataset through the class attribute: '.df_processed'.


Unnamed: 0,Target,16Arg 11Ile Other bb-bb,17Val 11Ile Other bb-bb,17Val 12Val Other bb-bb,17Val 10Leu Hydrophobic bb-bb,18Val 11Ile Hbond bb-bb,18Val 12Val Other bb-bb,18Val 13Lys Other bb-bb,28Asp 16Arg Saltbr bb-bb,29Ser 23Phe Hbond bb-bb,...,201His 103Asn Other bb-bb,113Ile 104Thr Other bb-bb,149Thr 105Ala Other bb-bb,232Ile 6Ile Hydrophobic bb-bb,84His 53Thr Other bb-bb,232Ile 179Lys Other bb-bb,210Phe 7Ile Other bb-bb,168Ile 153Leu Hydrophobic bb-bb,234Val 8Ala Hydrophobic bb-bb,127Val 104Thr Other bb-bb
0,ConfA,0.09703,2.57596,0.31338,0.17245,6.29711,1.86227,1.05420,5.73353,0.69356,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ConfA,0.25009,2.59637,0.23426,0.05546,4.83585,2.41831,1.91265,7.03353,1.69754,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ConfA,0.13662,2.20884,0.25849,0.21080,3.38742,0.36503,0.66350,6.49379,0.98932,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ConfA,0.05721,1.82277,1.27105,0.76842,5.56519,1.76725,0.69602,6.99378,1.17269,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ConfA,0.52289,3.41862,0.49625,0.02106,5.31743,2.71661,2.13434,7.96917,2.61228,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9996,ConfA,0.19024,2.98221,0.87067,0.31181,5.82410,2.56616,1.54288,8.13798,2.21046,...,0.0,0.05753,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9997,ConfA,0.18190,3.85467,0.38426,0.08002,3.15277,0.97987,1.53554,7.72484,2.32024,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9998,ConfA,0.04451,3.32610,0.69906,0.01307,3.40833,1.41239,0.81152,7.97902,2.66831,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9999,ConfA,0.02395,1.39884,0.41597,1.17861,7.26684,2.96571,1.90169,7.49457,3.26276,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
supervised_dataset.df_processed.Target.value_counts()

ConfA    6927
ConfC    2059
ConfB     532
None      483
Name: Target, dtype: int64

In [8]:
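# Filter the features: reset any previous filtering, then filter by per-class occupancy and by average interaction strength.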
supervised_dataset.reset_filtering()
supervised_dataset.filter_by_occupancy_by_class(min_occupancy=25)
supervised_dataset.filter_by_avg_strength(average_strength_cut_off=1.0)
print(f"Number of features after filtering by average interaction scores: {len(supervised_dataset.df_filtered.columns)}")

Number of features after filtering by average interaction scores: 364
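To give a feel for what the two filters might be doing, here is a dependency-free sketch on toy data. It rests on assumptions that are not confirmed by this tutorial: occupancy is taken to be the percentage of frames with a non-zero score, a feature passes the occupancy filter if it reaches the threshold in at least one class, and average strength is the mean absolute score over all frames. The helper names and the toy scores are invented for illustration; check the library's documentation for the actual definitions.

```python
from statistics import mean


def occupancy_percent(scores):
    """Percentage of frames in which the contact score is non-zero."""
    return 100.0 * sum(1 for s in scores if s != 0) / len(scores)


def keep_feature(scores, labels, min_occupancy, average_strength_cut_off):
    """Assumed rule: occupied enough in at least one class AND strong enough on average."""
    per_class = {}
    for score, label in zip(scores, labels):
        per_class.setdefault(label, []).append(score)
    occupied = any(
        occupancy_percent(values) >= min_occupancy for values in per_class.values()
    )
    strong = mean(abs(s) for s in scores) >= average_strength_cut_off
    return occupied and strong


# Toy data: four frames, two classes (values invented for illustration).
labels = ["ConfA", "ConfA", "ConfB", "ConfB"]
toy_features = {
    "17Val 11Ile Other bb-bb": [2.5, 2.6, 2.2, 1.8],
    "201His 103Asn Other bb-bb": [0.0, 0.0, 0.0, 0.1],
}
kept = [
    name
    for name, scores in toy_features.items()
    if keep_feature(scores, labels, min_occupancy=25, average_strength_cut_off=1.0)
]
print(kept)  # ['17Val 11Ile Other bb-bb']
```

Under these assumptions, the second feature passes the occupancy check in ConfB (50% of frames) but is dropped because its average score is far below the cut-off.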


That's all for this tutorial. At this point, the dataset is ready for whatever form of analysis we would like to perform.