This repository contains code, data, and figures for analyzing and classifying ion channel and receptor subtypes from MOD files using machine learning and simulation-derived features. For more information, please review the preprint: https://www.biorxiv.org/content/10.64898/2026.04.23.720371v1
## Repository Structure

```
├── annotations/   # Dataset (ModelDB annotations and labels)
├── code/          # All scripts and notebooks for data processing, modeling, and figures
├── figures/       # Output figures used in the manuscript
├── README         # This file
```
annotations/
- Contains the dataset used for training and evaluation.
- Includes original and processed annotation files (e.g., `model_db_annotations.xlsx`).
code/
- Contains all scripts used in the pipeline.
- Files are prefixed with numbers (`0-`, `1-`, `2-`, etc.) indicating approximate execution order.
- Note: some scripts are independent and can be run separately.
figures/
- Contains all generated figures (main and supplemental).
- Includes Sankey diagrams, panel plots, and performance visualizations.
## Workflow

The workflow consists of the following major steps:
Data Acquisition & Preparation
- `0-download_*.py`: Downloads and prepares MOD files and metadata.
Compilation & Simulation
- `0-compile.py`, `0-simulate.py`: Compile MOD files and extract simulation-based features.
Feature Engineering
- `_get_mod_dynamics.py`: Extracts dynamic features (e.g., time-to-peak, decay metrics).
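For intuition, features like these can be computed directly from a simulated trace. The sketch below is a minimal illustration only; the helper names are hypothetical and the actual logic lives in `_get_mod_dynamics.py`.

```python
import math

def time_to_peak(t, y):
    """Time at which the trace y reaches its maximum value."""
    i_peak = max(range(len(y)), key=lambda k: y[k])
    return t[i_peak]

def decay_time(t, y, frac=1 / math.e):
    """Time after the peak for the trace to fall below frac of its peak value."""
    i_peak = max(range(len(y)), key=lambda k: y[k])
    target = y[i_peak] * frac
    for k in range(i_peak, len(y)):
        if y[k] <= target:
            return t[k] - t[i_peak]
    return None  # trace never decayed below the threshold
```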
Modeling & Analysis
- `1-query-modeldb-mod-files-full.py`: Uses an LLM to classify biological mechanisms inside a MOD file and stores unique results in a database.
- `2-ml_pipeline.ipynb`: Trains machine learning models and evaluates predictions.
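The notebook's train-then-evaluate pattern can be sketched with a deliberately simple stand-in model (a nearest-centroid classifier on toy features). This is an illustration of the workflow only, not the models used in `2-ml_pipeline.ipynb`; all names here are hypothetical.

```python
def train_centroids(X, y):
    """Compute one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        s = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            s[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], x))
    return min(centroids, key=dist)
```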
Evaluation & Metrics
- `kappa.R`: Computes agreement, confusion matrices, and performance metrics.
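The agreement statistics are computed in R, but Cohen's kappa itself is simple enough to sketch in Python: observed agreement is corrected by the agreement expected from the raters' label frequencies alone.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa between two raters' labels for the same items."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance.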
Visualization
- `3-sankey.R`, `3-scatterpie.ipynb`, `3-heatmap.R`: Generate all manuscript figures.
## Getting Started

1. Clone the repository:

```bash
git clone https://github.com/innacohen/mod-annotation.git
cd mod-annotation
```

2. Set up the environment: Python (recommended: 3.9+) and R (for the visualization scripts).

```bash
pip install -r requirements.txt
```

3. Run the pipeline, executing scripts in approximate order:
```bash
# Data + preprocessing
python code/0-download_*.py
python code/0-compile.py
python code/0-simulate.py
python code/0-combine.py

# Feature extraction
python code/_get_mod_dynamics.py

# Modeling
jupyter notebook code/2-ml_pipeline.ipynb
```

## Notes

- Some scripts assume specific file paths (e.g., cluster environments); you may need to modify paths locally.
- Intermediate files (e.g., CSV outputs) are reused across steps.
- Not all scripts must be run sequentially. Figure scripts can often be run independently once data is prepared.
## License

This project is released under the BSD 3-Clause License, allowing reuse and modification with attribution.
## Acknowledgments

- ModelDB for providing MOD file data
- The Python and R open-source libraries used throughout the project
- Portions of the pipeline were developed iteratively with assistance from LLM-based tools (e.g., ChatGPT, Claude) for code structuring and debugging.