<div style="text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
<img src="https://www.cbi.cnptia.embrapa.br/SMS/images/logo_topo_centro_dir.gif" width="180px" align="right" style="padding: 20px">

<a id="introduction"></a>

<h1>STINGAlloBench - FB_Omage_etal_2023</h1>

<br>
<br>
<br>
<p  style="text-align: justify">STINGAlloBench: A benchmarking dataset for precise allosteric site prediction using experimentally validated data, tailored for computational biology and machine learning.</p>


<br>
The steps included in this data analysis and visualisation workflow are: 
<br>

1. <a href="#1">Import Packages</a><br>
2. <a href="#2">Load and Process Training Data</a><br>
3. <a href="#3">Extract X & Y</a><br>
4. <a href="#4">Split Data into Train & Test Set</a><br>
5. <a href="#5">Extract, Transform, & Scale X Data with Missing Values Imputed</a><br>
6. <a href="#6">Hyperparameter Optimisation</a><br>
    6.1. <a href="#6.1">Plot R² & Q²</a><br>
    6.2. <a href="#6.2">Plot Latent Projections: Full & CV</a><br>
7. <a href="#7">Build Model & Evaluate</a><br>
8. <a href="#8">Permutation Test</a><br>
9. <a href="#9">Bootstrap Resampling of the Model</a><br> 
10. <a href="#10">Model Evaluation using Bootstrap Resampling</a><br> 
11. <a href="#11">Model Visualisation</a><br> 
    11.1. <a href="#11.1">Plot Latent Projections: in-bag & out-of-bag</a><br>
    11.2. <a href="#11.2">Plot Weight Vectors</a><br>
12. <a href="#12">Variable Contribution Plots</a><br>  
13. <a href="#13">Export Results</a><br>

</div>


<a id="1"></a>
## 1. Import Packages

Packages extend Python's built-in functionality, and each one must be imported before it can be used. This computational workflow uses the following packages:

- numpy: A standard package for array manipulation.
- pandas: A standard package for manipulating data tables.
- matplotlib: A standard package for creating static, animated, and interactive visualisations.
- seaborn: A data visualisation library built on matplotlib that provides a high-level interface for statistical graphics.
- shap: A library for explaining the output of machine learning models using Shapley values.
- torch: PyTorch, an open-source machine learning framework.
- platform: A standard Python library for accessing the underlying platform's identifying data.
- psutil: A cross-platform library for accessing system details and process utilities.
- time: A standard Python library for time-related tasks.
- warnings: A standard Python library for issuing and filtering warning messages.
- sklearn: scikit-learn, a standard package with tools for machine learning.
- feature_engine: A library with multiple feature engineering techniques.
- xgboost: An optimised distributed gradient boosting library.
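
Before running the import cell, it can help to confirm that every package in the list is installed. The following is a minimal sketch, not part of the original workflow: the names `REQUIRED` and `find_missing` are hypothetical, and the check only looks up each import name without importing it.

```python
import importlib.util

# Import names taken from the package list above (assumed to be the
# top-level module names used in this workflow).
REQUIRED = [
    "numpy", "pandas", "matplotlib", "seaborn", "shap", "torch",
    "platform", "psutil", "time", "warnings", "sklearn",
    "feature_engine", "xgboost",
]


def find_missing(module_names):
    """Return the names that cannot be located in the current environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed")
```

Note that some import names differ from the names used to install them: `sklearn` is installed as `scikit-learn` and `feature_engine` as `feature-engine`.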


```python
import platform
import psutil
import time
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
# Apply a common seaborn style to every plot
sns.set(style='darkgrid', font_scale=1.4)

# Print CPU information
print('CPU:')
print(f'  Number of cores: {psutil.cpu_count(logical=False)}')
print(f'  Number of threads: {psutil.cpu_count(logical=True)}')
print(f'  Architecture: {platform.processor()}')

# Print RAM information (total physical memory in GB)
print('RAM:')
print(f'  Total: {psutil.virtual_memory().total / 1e9:.1f} GB')

print('All packages successfully loaded')
```

```
CPU:
  Number of cores: 40
  Number of threads: 80
  Architecture: x86_64
RAM:
  Total: 1081.8 GB
All packages successfully loaded
```


<div style="background-color:rgb(255, 250, 250); padding:20px;">

<a id="2"></a>
<h2>2. Load and Process Training Data</h2>

<p>This section loads the training dataset from CSV files and prepares it for modelling: data types are verified and converted, the data frames are merged, and missing values are handled. The steps are:</p>
<ul>
    <li><strong>Data Loading:</strong> Training data is loaded from CSV files using <code>pd.read_csv()</code>.</li>
    <li><strong>Data Cleaning:</strong> Rows with non-numeric values in the 'number' column are filtered out, and the column is converted to integer type.</li>
    <li><strong>Data Merging:</strong> The two datasets are merged on the key columns 'pdb_code', 'chain_name', and 'number' to combine relevant information.</li>
    <li><strong>Handling Missing Values:</strong> Columns and rows with excessive missing values are identified and removed to maintain data integrity.</li>
</ul>

</div>
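
The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the workflow's actual code: the in-memory CSV text stands in for the real training files, the feature and label column names are invented, and the missing-value thresholds (`max_col_missing`, `max_row_missing`) are hypothetical, since the text only says "excessive". Only `pd.read_csv()` and the merge keys 'pdb_code', 'chain_name', and 'number' come from the description.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two training CSV files.
features_csv = io.StringIO(
    "pdb_code,chain_name,number,feature_a,feature_b,feature_c\n"
    "1abc,A,10,0.5,,1.2\n"
    "1abc,A,11,0.7,,1.4\n"
    "1abc,A,12A,0.9,,1.6\n"
    "2xyz,B,5,,,\n"
)
labels_csv = io.StringIO(
    "pdb_code,chain_name,number,label\n"
    "1abc,A,10,1\n"
    "1abc,A,11,0\n"
    "2xyz,B,5,0\n"
)


def load_clean(source):
    """Read a CSV and keep only rows whose 'number' is a plain integer."""
    df = pd.read_csv(source, dtype={"number": str})
    is_integer = df["number"].str.fullmatch(r"-?\d+", na=False)
    df = df[is_integer].copy()
    df["number"] = df["number"].astype(int)
    return df


def drop_sparse(df, max_col_missing=0.5, max_row_missing=0.25):
    """Drop columns, then rows, whose missing fraction exceeds the limits."""
    df = df.loc[:, df.isna().mean(axis=0) <= max_col_missing]
    return df.loc[df.isna().mean(axis=1) <= max_row_missing]


features = load_clean(features_csv)   # the '12A' row is filtered out
labels = load_clean(labels_csv)
merged = features.merge(
    labels, on=["pdb_code", "chain_name", "number"], how="inner"
)
train = drop_sparse(merged)           # removes feature_b and the sparse row
print(train)
```

Reading 'number' as a string first makes the non-numeric filter explicit before the integer conversion. An inner merge keeps only residues present in both tables, which is one possible choice; the original workflow may use a different join type.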
