<a href="https://colab.research.google.com/github/ococrook/hdx-sFDR/blob/main/examples/hdx_sFDR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**HDX p-value adjustment and reweighting**

Welcome to the HDX-sFDR analysis notebook! This tool helps you analyze hydrogen-deuterium exchange mass spectrometry (HDX-MS) data using structural False Discovery Rate (sFDR) methodology. Even if you've never used Google Colab before, this guide will help you get started.

**What is Google Colab?**

Google Colab is an interactive notebook environment that runs in your browser. It allows you to run Python code, visualize data, and create reports all in one place.

**How to Use This Notebook**

**Running Code Cells**

1. Code cells contain Python code and have a play button (▶️) on the left.
2. To execute a code cell, click the play button or press Shift+Enter.
3. Wait for the cell to finish running (a spinning circle will appear and then change to a checkmark).
4. Results will appear below the cell.

**Running the Entire Notebook**

1. You can run all cells in order by selecting Runtime > Run all from the menu at the top.
2. It's recommended to run cells in order from top to bottom.

**Uploading Data**
When prompted, you'll need to upload:

1. A CSV file containing your HDX-MS data
2. A protein structure file (.cif format)

Note that the notebook will not continue until you upload a file. If the session sits idle for too long, it will time out and any uploaded files will be deleted.

Simply click the "Choose Files" button when it appears and select your files in the explorer.


**Common Questions**
**What if I get an error?**

1. Read the error message carefully - it often tells you what went wrong.
2. Check that your uploaded files have the correct format.
3. Try running the cell again.
4. If problems persist, try restarting the runtime (Runtime > Restart runtime).

**How do I save my results?**

1. The notebook will automatically offer to download results as a CSV file.
2. You can save the entire notebook by going to File > Save a copy in Drive.

**Can I modify the parameters?**

1. Yes, but be aware that changing them may significantly affect your results.
2. If you depart from the defaults in your analysis, make sure you can justify that choice.

**How long does the analysis take?**

1. Processing time depends on the size of your dataset and protein structure.
2. Most analyses complete within a few minutes.

**Getting Started**

Ready to begin? Start by running the first cell below to install the necessary packages. Then follow the step-by-step instructions throughout the notebook.

In [1]:
# Install package with all dependencies
!pip install --no-cache-dir --upgrade git+https://github.com/ococrook/hdx-sFDR.git

# Import modules
from hdx_sFDR import hdx_utils, hdx_plot, hdx_structure_utils, statistical_inference, evalutions

# For refreshing modules after changes
import importlib
def reload_modules():
    importlib.reload(hdx_utils)
    importlib.reload(hdx_plot)
    importlib.reload(hdx_structure_utils)
    importlib.reload(statistical_inference)
    importlib.reload(evalutions)
    print("All modules reloaded successfully!")


Collecting git+https://github.com/ococrook/hdx-sFDR.git
  Cloning https://github.com/ococrook/hdx-sFDR.git to /tmp/pip-req-build-fb1uhy2g
  Running command git clone --filter=blob:none --quiet https://github.com/ococrook/hdx-sFDR.git /tmp/pip-req-build-fb1uhy2g
  Resolved https://github.com/ococrook/hdx-sFDR.git to commit be991e34188a5f87fc9157f47839a84be21db4ea
  Preparing metadata (setup.py) ... done


The following code cell lets you upload your HDX results from statistical testing. You will be prompted to upload a file with a "Choose Files" button that should appear at the bottom of the cell.

In [2]:
# Upload CSV data file
from google.colab import files
import pandas as pd

print("Please upload your CSV data file:")
uploaded_csv = files.upload()

# Get the filename of the uploaded CSV
csv_filename = list(uploaded_csv.keys())[0]

# Read the CSV into a pandas DataFrame
df = pd.read_csv(csv_filename)
print(f"Successfully loaded CSV with {df.shape[0]} rows and {df.shape[1]} columns")

Please upload your CSV data file:


Saving MBP_ttest_results.csv to MBP_ttest_results (4).csv
Successfully loaded CSV with 460 rows and 7 columns


Check that your file is valid using the following function. If there are issues, the function will tell you what needs fixing, e.g. renaming columns.

In [3]:
# Check that the uploaded CSV has the expected HDX format

hdx_utils.validate_hdx_csv(df)

CSV validation successful!


True
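
The source does not show what `validate_hdx_csv` checks internally, so the following is only a rough sketch of the kind of pre-check you could run on your own file before uploading. The column names `start`, `end` and `p_value` and the helper `precheck_rows` are hypothetical and chosen for illustration; they are not taken from the package.

```python
import csv
import io

# Hypothetical column names, assumed for illustration only
REQUIRED = ("start", "end", "p_value")

def precheck_rows(rows):
    """Return a list of readable problems; an empty list means none were found."""
    if not rows:
        return ["file has no data rows"]
    missing = [c for c in REQUIRED if c not in rows[0]]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for n, row in enumerate(rows, start=1):
        start, end, p = float(row["start"]), float(row["end"]), float(row["p_value"])
        if start > end:
            problems.append(f"row {n}: start {start} is after end {end}")
        if not 0.0 <= p <= 1.0:
            problems.append(f"row {n}: p-value {p} is outside [0, 1]")
    return problems

# A tiny in-memory CSV with one valid row and one deliberately broken row
sample = "start,end,p_value\n19,30,0.63\n31,19,1.2\n"
rows = list(csv.DictReader(io.StringIO(sample)))
for problem in precheck_rows(rows):
    print(problem)
```

Collecting all problems into a list, rather than stopping at the first one, lets you fix a file in a single pass.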

The next step is to upload your protein structure in CIF format (e.g. from an AlphaFold 3 prediction). Scroll down if you do not see the button to upload files.

In [4]:
from Bio.PDB import MMCIFParser

print("Please upload your protein structure file (.cif):")
uploaded_files = files.upload()

# Stop here if nothing was uploaded; every later step needs a structure file
if not uploaded_files:
    raise ValueError("No file uploaded. Re-run this cell and choose a .cif file.")

cif_filename = list(uploaded_files.keys())[0]

# Check file extension
if not cif_filename.lower().endswith('.cif'):
    print(f"Warning: File '{cif_filename}' doesn't have .cif extension. The file might not be in mmCIF format.")

# Parse the CIF file (QUIET=True suppresses parser warnings)
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("structure", cif_filename)

# Basic validation checks
chains = list(structure.get_chains())
if not chains:
    print("Warning: Structure contains no chains.")


# Count models, chains, residues, atoms
model_count = len(structure)
chain_count = len(chains)
residue_count = sum(1 for _ in structure.get_residues())
atom_count = sum(1 for _ in structure.get_atoms())

print("Structure validation successful!")
print(f"Models: {model_count}")
print(f"Chains: {chain_count}")
print(f"Residues: {residue_count}")
print(f"Atoms: {atom_count}")

# Show first few chain IDs
chain_ids = [chain.id for chain in chains[:5]]
if len(chains) > 5:
    print(f"Chain IDs (first 5): {', '.join(chain_ids)}...")
else:
    print(f"Chain IDs: {', '.join(chain_ids)}")


Please upload your protein structure file (.cif):


Saving MBP.cif to MBP (3).cif
Structure validation successful!
Models: 1
Chains: 1
Residues: 381
Atoms: 2954
Chain IDs: A


The next step is to run the analysis and obtain the final results. The following default parameters are used:
- lambda_seq = 1 (decay of correlation along the sequence; higher values make correlations decay faster)
- lambda_struct = 1 (decay of correlation through the structure; higher values make correlations decay faster)
- alpha = 0.5 (structure versus sequence weighting, 0 = only sequence, 1 = only structure)
- transform_sum = True (how the transformations are performed)

We do not recommend changing the defaults unless you know what you are doing. Safe changes are increasing lambda_seq and lambda_struct and/or reducing alpha.
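
To build intuition for these parameters, here is a minimal sketch of how a decay rate and a blending weight could interact. It assumes an exponential decay in distance and a linear blend between the two terms; both are assumptions made for illustration, and the function `blended_correlation` and its distance arguments are hypothetical rather than the package's actual implementation.

```python
import math

def blended_correlation(seq_dist, struct_dist,
                        lambda_seq=1.0, lambda_struct=1.0, alpha=0.5):
    """Illustrative only: blend two exponentially decaying correlations.

    seq_dist and struct_dist are assumed non-negative distances
    (for example residues apart and spatial separation).
    """
    seq_corr = math.exp(-lambda_seq * seq_dist)
    struct_corr = math.exp(-lambda_struct * struct_dist)
    # alpha = 0 uses only the sequence term, alpha = 1 only the structure term
    return alpha * struct_corr + (1.0 - alpha) * seq_corr

# Larger lambda values shrink the assumed correlation at a fixed distance
for lam in (0.5, 1.0, 2.0):
    value = blended_correlation(3.0, 3.0, lambda_seq=lam, lambda_struct=lam)
    print(f"lambda={lam}: correlation={value:.4f}")
```

Under these assumptions, raising either lambda or lowering alpha reduces how much distant peptides are treated as related.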

In [5]:
# Run analysis with default parameters
results = statistical_inference.analyze_hdx_data(df, structure)

# Display the first few results
display(results.head())




Processing structure...
Processing peptide data...
Calculating weights...
Calculating corrected statistics...
Analysis complete! Processed 460 peptides.
Estimated effective number of tests: 48.12


| Unnamed: 0 | start | end | original_p | original_q | corrected_q | weighted_p | weighted_q |
|---|---|---|---|---|---|---|---|
| 0 | 19.0 | 30.0 | 0.628977 | 0.788882 | 0.330086 | 0.569976 | 0.298114 |
| 1 | 19.0 | 30.0 | 0.772529 | 0.880835 | 0.368561 | 0.806143 | 0.349463 |
| 2 | 19.0 | 30.0 | 0.105592 | 0.27598 | 0.115476 | 0.203132 | 0.179621 |
| 3 | 19.0 | 30.0 | 0.078803 | 0.236925 | 0.099135 | 0.389553 | 0.237275 |
| 4 | 19.0 | 31.0 | 0.49388 | 0.688438 | 0.288058 | 0.557609 | 0.294042 |
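
The results include q-value columns alongside the p-values. The source does not state which adjustment procedure produces them, so the sketch below shows the widely used Benjamini-Hochberg procedure purely as an assumed illustration of how p-values are commonly turned into q-values. The function name `bh_q_values` and the example p-values are hypothetical.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values never decrease with rank
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

example_p = [0.01, 0.04, 0.03, 0.20]
for p, q in zip(example_p, bh_q_values(example_p)):
    print(f"p={p:.2f} -> q={q:.4f}")
```

A q-value is never smaller than the p-value it was derived from, which is a quick sanity check to apply to any adjusted column.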


Run the code cell below to download the results. Note that this will start a file download in your browser.

In [6]:
# Download results
download_filename = 'hdx_analysis_results.csv'
results.to_csv(download_filename, index=False)
files.download(download_filename)

