<div style="text-align: justify; padding:5px; background-color:rgb(252, 253, 255); border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">
<img src="https://www.cbi.cnptia.embrapa.br/SMS/images/logo_topo_centro_dir.gif" width="180px" align="right" style="padding: 20px">

<a id="introduction"></a>

<h1>STINGAlloBench - FB_Omage_etal_2023</h1>

<br>
<br>
<br>
<p  style="text-align: justify">STINGAlloBench: A benchmarking dataset for precise allosteric site prediction using experimentally validated data, tailored for computational biology and machine learning.</p>


<br>
The steps included in this data analysis and visualisation workflow are: 
<br>

1. <a href="#1">Import Packages</a><br>
2. <a href="#2">Load Data & Peak Sheet</a><br>
3. <a href="#3">Extract X & Y</a><br>
4. <a href="#4">Split Data into Train & Test Set</a><br>
5. <a href="#5">Extract, Transform, & Scale X Data with Missing Values Imputed</a><br>
6. <a href="#6">Hyperparameter Optimisation</a><br>
    6.1. <a href="#6.1">Plot R² & Q²</a><br>
    6.2. <a href="#6.2">Plot Latent Projections: Full & CV</a><br>
7. <a href="#7">Build Model & Evaluate</a><br>
8. <a href="#8">Permutation Test</a><br>
9. <a href="#9">Bootstrap Resampling of the Model</a><br> 
10. <a href="#10">Model Evaluation using Bootstrap Resampling</a><br> 
11. <a href="#11">Model Visualisation</a><br> 
    11.1. <a href="#11.1">Plot Latent Projections: in-bag & out-of-bag</a><br>
    11.2. <a href="#11.2">Plot Weight Vectors</a><br>
12. <a href="#12">Variable Contribution Plots</a><br>  
13. <a href="#13">Export Results</a><br>

</div>


<a id="1"></a>
## Packages

Packages provide additional tools that extend beyond the basic functionality of Python programming. Prior to usage, packages need to be imported into the environment. The following packages need to be imported for this computational workflow:

- numpy: A standard package primarily used for the manipulation of arrays.
- pandas: A standard package primarily used for the manipulation of data tables.
- matplotlib: A standard package primarily used for creating static, animated, and interactive visualizations in Python.
- seaborn: A Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- shap: A unified approach to explain the output of any machine learning model.
- torch: An open source machine learning framework that accelerates the path from research prototyping to production deployment.
- platform: A standard Python library to access underlying platform's identifying data.
- psutil: A cross-platform library used to access system details and process utilities.
- time: A standard Python library for time-related tasks.
- warnings: A standard Python library to warn the developer about changes that might affect their program.
- sklearn: A standard package with tools for machine learning.
- feature_engine: A Python library with multiple feature engineering techniques.
- xgboost: An optimized distributed gradient boosting library.
- requests: A package for sending HTTP requests, used later in this workflow to query the RCSB PDB REST API.


In [1]:
import platform
import psutil
import time
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


warnings.filterwarnings('ignore')
sns.set(style='darkgrid', font_scale=1.4)

# Print CPU information
print('CPU:')
print(f'  Number of cores: {psutil.cpu_count(logical=False)}')
print(f'  Number of threads: {psutil.cpu_count(logical=True)}')
print(f'  Architecture: {platform.machine()}')

# Print RAM information
print('RAM:')
print(f'  Total: {psutil.virtual_memory().total / 1e9:.1f} GB')

print('All packages successfully loaded')

CPU:
  Number of cores: 40
  Number of threads: 80
  Architecture: x86_64
RAM:
  Total: 1081.8 GB
All packages successfully loaded
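
Because the workflow depends on many third-party packages, it can help to record which versions are installed alongside the hardware summary. The sketch below is one possible way to do that with the standard library; the helper name `installed_versions` is hypothetical, and the list of distribution names is an assumption based on the packages described above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as published on PyPI. These can differ from the import
# name (sklearn -> scikit-learn, feature_engine -> feature-engine).
packages = ["numpy", "pandas", "matplotlib", "seaborn", "shap", "torch",
            "psutil", "scikit-learn", "feature-engine", "xgboost"]

for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Packages that are not installed are reported rather than raising an error, so the cell can run before the environment is complete.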


<div style="background-color:rgb(255, 250, 250); padding:20px;">

<h2>Analysis of the Allosteric Site Database (ASD) for Benchmark Creation</h2>

<p>This section details the methodology for loading, processing, and analyzing the Allosteric Site Database (ASD) to create a comprehensive benchmark for allosteric site modeling. The process involves several key steps, from initial data acquisition to the final preparation of the dataset for analysis. The steps are outlined below:</p>

<ul>
    <li><strong>Data Acquisition:</strong> The ASD dataset is downloaded from the official ASD website at <a href="https://mdl.shsmu.edu.cn/ASD/module/download/download.jsp?tabIndex=1" target="_blank">https://mdl.shsmu.edu.cn/ASD</a> as a .txt file, which is then converted to .csv for compatibility with data processing tools.</li>
    <li><strong>Data Loading:</strong> The converted .csv file is loaded into a pandas DataFrame with <code>pd.read_csv()</code>, and its column names are inspected to confirm that the import is complete.</li>
</ul>

<p>The resulting benchmark dataset from the ASD will serve as a foundation for rigorous evaluation and comparison of computational methods in predicting and analyzing allosteric sites, facilitating advancements in the field of allosteric modulation research.</p>

</div>
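
The conversion from .txt to .csv mentioned above is not shown in this notebook. A minimal sketch of one way to do it follows, assuming the downloaded file is tab-delimited with a header row; the helper name `txt_to_csv` and the demonstration file names are hypothetical, and the delimiter should be checked against the actual download.

```python
import csv
import os
import tempfile


def txt_to_csv(txt_path, csv_path, delimiter="\t"):
    """Rewrite a delimited text file as comma-separated CSV; return the row count."""
    rows_written = 0
    with open(txt_path, newline="", encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE keeps any quote characters in the raw text as literal data
        reader = csv.reader(src, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)  # quotes fields that contain commas
        for row in reader:
            writer.writerow(row)
            rows_written += 1
    return rows_written


# Demonstration on a tiny synthetic file (not ASD data)
with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "demo.txt")
    csv_path = os.path.join(tmp, "demo.csv")
    with open(txt_path, "w", encoding="utf-8") as fh:
        fh.write("target_id\tfunction\nX1\tbinds A, B\n")
    n_rows = txt_to_csv(txt_path, csv_path)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        demo_rows = list(csv.reader(fh))
```

Writing through `csv.writer` rather than replacing tabs with commas means that fields which themselves contain commas are quoted and survive the round trip through `pd.read_csv()`.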


In [2]:
df = pd.read_csv('ASD_Release_202309_AS.csv')

In [4]:
df.columns

Index(['target_id', 'target_gene', 'organism', 'pdb_uniprot', 'allosteric_pdb',
       'modulator_serial', 'modulator_alias', 'modulator_chain',
       'modulator_class', 'modulator_feature', 'modulator_name',
       'modulator_resi', 'function', 'position', 'pubmed_id', 'ref_title',
       'site_overlap', 'allosteric_site_residue'],
      dtype='object')

<div style="background-color:rgb(255, 250, 250); padding:20px;">

<h2>Step 1: Fetching X-ray Resolution Data</h2>

<p>This initial step in our methodology involves programmatically retrieving X-ray resolution data for protein structures identified by their PDB IDs. This process is crucial for assessing the quality of the crystallographic data associated with each allosteric site in the dataset. The following outlines the key components of this step:</p>

<ul>
    <li><strong>Function Definition:</strong> A Python function <code>fetch_resolution</code> is defined to automate the fetching of X-ray resolution data for a given PDB ID. This function utilizes the RCSB PDB REST API to access structural data.</li>
    <li><strong>API Request:</strong> For each PDB ID, an HTTP GET request is sent to the RCSB PDB REST API. The request URL is dynamically constructed to include the specific PDB ID of interest.</li>
    <li><strong>Data Extraction:</strong> Upon a successful API response, the function parses the returned JSON data to extract the X-ray resolution value, if available. This is accomplished by accessing nested fields within the JSON structure.</li>
    <li><strong>Error Handling:</strong> In cases where the API request is unsuccessful or the PDB ID does not exist in the database, the function returns <code>NaN</code> to indicate the absence of resolution data.</li>
    <li><strong>Applying the Function:</strong> The <code>fetch_resolution</code> function is applied across a DataFrame <code>df</code>, which contains a column <code>'allosteric_pdb'</code> with the PDB IDs. This application results in a new column <code>'resolution'</code> in the DataFrame, populated with the fetched X-ray resolution values.</li>
</ul>

<p>This procedure augments each entry in our dataset with its reported resolution where one is available, and marks the remaining entries as missing, facilitating a more informed analysis and selection of high-quality allosteric sites for further study and benchmarking.</p>

</div>
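
The data extraction and error handling steps described above can be illustrated without a network connection. The sketch below isolates the JSON parsing into a small function and runs it on hand-written payloads; the helper name `extract_resolution` and the example payloads are hypothetical, while the field names `rcsb_entry_info` and `resolution_combined` are the ones used in this workflow.

```python
import math


def extract_resolution(payload):
    """Return the first reported resolution (in angstroms), or NaN if absent."""
    entry_info = payload.get("rcsb_entry_info") or {}
    values = entry_info.get("resolution_combined") or []
    return float(values[0]) if values else math.nan


# Hand-written payloads for illustration, not real API responses
with_resolution = {"rcsb_entry_info": {"resolution_combined": [2.1]}}
field_missing = {"rcsb_entry_info": {}}
field_null = {"rcsb_entry_info": {"resolution_combined": None}}

print(extract_resolution(with_resolution))
print(extract_resolution(field_missing))
print(extract_resolution(field_null))
```

Using `or {}` and `or []` covers both a missing key and a key whose value is null, so every entry without a resolution ends up as `NaN` instead of raising an error.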


In [5]:
import requests

# Function to fetch X-ray resolution for a given PDB ID using the RCSB PDB REST API
def fetch_resolution(pdb_id, timeout=10):
    url = f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"
    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        # Treat network failures (timeouts, connection errors) as missing data
        return float('nan')

    if response.status_code != 200:
        # Return NaN if the request was unsuccessful or the PDB ID is not found
        return float('nan')

    data = response.json()
    # 'resolution_combined' is a list when present; it can be missing or null
    # for entries without a reported resolution
    resolutions = (data.get('rcsb_entry_info') or {}).get('resolution_combined') or []
    return resolutions[0] if resolutions else float('nan')

# Apply the function to fetch X-ray resolution for each PDB ID in the DataFrame
# (this sends one HTTP request per row)
df['resolution'] = df['allosteric_pdb'].apply(fetch_resolution)

# Display the updated DataFrame
df.head()
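
Applying the fetch function row by row sends one request per row, even when the same PDB ID appears more than once. If the dataset does contain repeated IDs (an assumption, not verified here), the lookup can be done once per distinct ID and then mapped back. The sketch below shows the idea with a stand-in fetcher; `fetch_once_per_id` and `fake_fetch` are hypothetical names, and the IDs are made up.

```python
def fetch_once_per_id(pdb_ids, fetch):
    """Call `fetch` once for each distinct ID and return {id: result}."""
    cache = {}
    for pdb_id in pdb_ids:
        if pdb_id not in cache:
            cache[pdb_id] = fetch(pdb_id)
    return cache


# Stand-in for the network call, recording which IDs were requested
calls = []

def fake_fetch(pdb_id):
    calls.append(pdb_id)
    return 2.0


ids = ["1ABC", "2XYZ", "1ABC", "1ABC"]  # made-up IDs with repeats
lookup = fetch_once_per_id(ids, fake_fetch)
resolutions = [lookup[pdb_id] for pdb_id in ids]
```

With the DataFrame used in this workflow, the same pattern would read `lookup = fetch_once_per_id(df['allosteric_pdb'], fetch_resolution)` followed by `df['resolution'] = df['allosteric_pdb'].map(lookup)`.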


In [7]:
df.to_csv('asd_step_1a.csv', index=False)

In [8]:
# Statistics on the number of entries before and after resolution filtering

# Total number of entries before removing entries with resolution >= 3.0 Å
total_entries_before = len(df)
print(f"Total entries before removal: {total_entries_before}")

# Keep entries with resolution better than 3.0 Å; rows with a missing (NaN)
# resolution also fail this comparison and are dropped
df_filtered = df[df['resolution'] < 3.0].copy()

# Total number of entries after removal
total_entries_after = len(df_filtered)
print(f"Total entries after removal: {total_entries_after}")

Total entries before removal: 3102
Total entries after removal: 2701
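
The filter removes 401 of the 3102 entries, but the two printed totals do not show how many were dropped for a resolution at or above the cutoff and how many for having no resolution at all, since a comparison against `NaN` is always false. The sketch below separates those cases in plain Python; the helper name `summarise_resolution_filter` is hypothetical and the demonstration values are made up.

```python
import math


def summarise_resolution_filter(resolutions, cutoff=3.0):
    """Count entries kept, dropped at or above the cutoff, and dropped as missing."""
    kept = at_or_above_cutoff = missing = 0
    for value in resolutions:
        if value is None or math.isnan(value):
            missing += 1
        elif value < cutoff:
            kept += 1
        else:
            at_or_above_cutoff += 1
    return {"kept": kept, "at_or_above_cutoff": at_or_above_cutoff, "missing": missing}


# Made-up values for illustration, not ASD data
demo = [1.8, 2.95, 3.0, 3.4, float("nan"), None]
summary = summarise_resolution_filter(demo)
print(summary)
```

On the real data, `summarise_resolution_filter(df['resolution'].tolist())` would report the same breakdown for the 3102 entries.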
