Genomic language models (gLMs) with evolutionary conservation supervision.
gamba contains genome language models trained on human DNA sequence together with Zoonomia 241-mammalian phyloP conservation scores. The released checkpoints include autoregressive ArGamba models and bidirectional BiGamba models trained with sequence-only, conservation-only, or dual sequence-plus-conservation objectives.
The released checkpoints can be loaded directly with Hugging Face transformers.
import torch
from transformers import AutoModel
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
REPO_ID = "micanonsens/bigamba-dual-step44000"
model = AutoModel.from_pretrained(
    REPO_ID,
    trust_remote_code=True,
).eval().to(DEVICE)
print(f"Loaded {REPO_ID} on {DEVICE}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
Example repositories:
| Checkpoint name | Architecture | Training task |
|---|---|---|
| ArGamba-dual | ArGamba (Jamba autoregressive) | NTP + CEP |
| ArGamba-seq_only | ArGamba (Jamba autoregressive) | NTP |
| ArGamba-cons_only | ArGamba (Jamba autoregressive) | CEP |
| BiGamba-dual | BiGamba (Mamba bidirectional) | MLM + MEM |
| BiGamba-seq_only | BiGamba (Mamba bidirectional) | MLM |
| BiGamba-cons_only | BiGamba (Mamba bidirectional) | MEM |
A Colab notebook (src/gamba_notebook.ipynb) is provided for loading the released Hugging Face models, scoring genomic intervals, comparing predictions with phyloP from a bigWig file, and exporting predictions as bedGraph files.
Recommended notebook workflow:
- Install the environment dependencies.
- Restart the Colab runtime once after installation.
- Load ArGamba and/or BiGamba from Hugging Face.
- Upload or define BED regions.
- Run tiled conservation prediction over those regions.
- Optionally compare predictions to true phyloP values from a bigWig file.
- Export predictions as bedGraph files.
For long regions, the notebook tiles windows differently for autoregressive and bidirectional models:
ArGamba / causal:
[upstream context | scored positions]
BiGamba / bidirectional:
[left context | scored positions | right context]
Context-only positions are discarded to reduce edge effects.
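The two window layouts above can be sketched as follows. This is an illustrative sketch and not the notebook's code: the function name `tile_region`, its parameters, and the window sizes are hypothetical, and coordinates are assumed to be 0-based and half-open as in BED.

```python
from typing import List, NamedTuple, Optional


class Tile(NamedTuple):
    window_start: int  # first base fed to the model
    window_end: int    # one past the last base fed to the model
    score_start: int   # first base whose prediction is kept
    score_end: int     # one past the last base whose prediction is kept


def tile_region(start: int, end: int, score_len: int, context_len: int,
                bidirectional: bool, chrom_len: Optional[int] = None) -> List[Tile]:
    """Cover [start, end) with tiles whose scored spans do not overlap."""
    tiles = []
    for score_start in range(start, end, score_len):
        score_end = min(score_start + score_len, end)
        # Both layouts use upstream (left) context; clamp at the chromosome start.
        window_start = max(0, score_start - context_len)
        # Only the bidirectional layout adds downstream (right) context.
        window_end = score_end + context_len if bidirectional else score_end
        if chrom_len is not None:
            window_end = min(window_end, chrom_len)
        tiles.append(Tile(window_start, window_end, score_start, score_end))
    return tiles


# Hypothetical sizes chosen only to keep the example readable.
causal = tile_region(1000, 1500, score_len=200, context_len=100, bidirectional=False)
bidir = tile_region(1000, 1500, score_len=200, context_len=100, bidirectional=True)
print(causal[0])   # window ends where the scored span ends
print(bidir[-1])   # window extends past the scored span on both sides
```

Predictions outside `score_start:score_end` would be dropped for each tile, which corresponds to discarding the context-only positions.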
Clone the repository and navigate to the project directory:
git clone ...
cd gamba/
Install dependencies in your preferred Python environment. The exact CUDA/PyTorch/Mamba versions may depend on your system. The released Hugging Face models require trust_remote_code=True.
Core dependencies include:
pip install torch transformers safetensors huggingface_hub accelerate einops
pip install mamba-ssm causal-conv1d
pip install pyfaidx pyBigWig pandas numpy scipy scikit-learn matplotlib seaborn tqdm
pip install evodiff
On managed GPU systems, install PyTorch and CUDA-compatible packages according to the cluster or Colab runtime.
To set up the main genome/phyloP data:
mkdir -p data_processing/data/240-mammalian/
# Download human chromosome sizes.
curl https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes \
> data_processing/data/240-mammalian/hg38.chrom.sizes
python data_processing/generate_human_bed.py
# Download full human genome FASTA.
curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz \
> data_processing/data/240-mammalian/hg38.ml.fa.gz
gunzip data_processing/data/240-mammalian/hg38.ml.fa.gz
# Download centromere locations.
curl https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/centromeres.txt.gz \
> data_processing/data/240-mammalian/centromeres.txt.gz
gunzip data_processing/data/240-mammalian/centromeres.txt.gz
# Download repeat locations from the UCSC Genome Browser RepeatMasker track.
# Save the file as data_processing/data/240-mammalian/repeats_hg38.bed.gz.
gunzip data_processing/data/240-mammalian/repeats_hg38.bed.gz
# Download Zoonomia 241-mammalian phyloP scores.
curl https://cgl.gi.ucsc.edu/data/cactus/241-mammalian-2020v2-hub/Homo_sapiens/241-mammalian-2020v2.bigWig \
> data_processing/data/240-mammalian/241-mammalian-2020v2.bigWig
Create chromosome splits:
cat > data_processing/data/240-mammalian/splits.json <<'EOF'
{
"train": [
"1", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"17", "18", "19", "20", "21", "X"
],
"valid": [
"3", "16"
],
"test": [
"2", "22"
]
}
EOF
Generate exclusion regions and clean phyloP arrays:
python data_processing/exclusion_regions.py
for chrom in {1..22} X; do
echo "Running chr${chrom}"
python data_processing/generate_clean_phyloP.py --chromosome "chr${chrom}"
done
Optionally generate FASTA files from the same cleaned regions:
python data_processing/generate_same_data_fasta.py
Uncompress .npz files and verify chromosome sizes:
python data_processing/uncompress_npz.py --type "small"
python assert_chromosome_sizes.py --type "small"
Expected structure:
data_processing/data/240-mammalian/
├── train/
│ ├── 1_conservation_small.npy
│ ├── 1_sequence_small.npy
│ ├── 1.npz
│ └── ...
├── valid/
│ ├── 3_conservation_small.npy
│ ├── 3_sequence_small.npy
│ ├── 3.npz
│ ├── 16_conservation_small.npy
│ ├── 16_sequence_small.npy
│ └── 16.npz
└── test/
├── 2_conservation_small.npy
├── 2_sequence_small.npy
├── 2.npz
├── 22_conservation_small.npy
├── 22_sequence_small.npy
└── 22.npz
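The layout above follows directly from the chromosome splits. As a rough check, the sketch below verifies that no chromosome appears in two splits and lists the file names the layout implies. The helper names `check_splits` and `expected_files` are hypothetical and are not part of the repository; the split contents are copied from splits.json.

```python
import json

# Same content as data_processing/data/240-mammalian/splits.json.
SPLITS_JSON = """
{
  "train": ["1", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
            "17", "18", "19", "20", "21", "X"],
  "valid": ["3", "16"],
  "test": ["2", "22"]
}
"""


def check_splits(splits):
    """Return all chromosomes in the splits; raise if one appears twice."""
    seen = {}
    for split, chroms in splits.items():
        for chrom in chroms:
            if chrom in seen:
                raise ValueError(f"chr{chrom} is in both {seen[chrom]} and {split}")
            seen[chrom] = split
    return set(seen)


def expected_files(splits, suffix="small"):
    """List the relative paths implied by the directory layout shown above."""
    paths = []
    for split, chroms in splits.items():
        for chrom in chroms:
            paths.append(f"{split}/{chrom}_conservation_{suffix}.npy")
            paths.append(f"{split}/{chrom}_sequence_{suffix}.npy")
            paths.append(f"{split}/{chrom}.npz")
    return paths


splits = json.loads(SPLITS_JSON)
chroms = check_splits(splits)
# Chromosomes 1-22 and X should each be assigned to exactly one split.
missing = ({str(i) for i in range(1, 23)} | {"X"}) - chroms
paths = expected_files(splits)
print(f"{len(chroms)} chromosomes, {len(paths)} expected files, missing: {sorted(missing)}")
```

To compare against a real data directory, each entry of `paths` could be joined to the data root and tested with `pathlib.Path.exists()`.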
Run a basic data sanity check:
python src/test_sequence.py
Training scripts load a JSON experiment config, construct the corresponding model/task wrapper, build ConservationDataset dataloaders, and save checkpoint directories during training.
All released Gamba checkpoints were trained with scripts in src/ using:
configs/jamba-small-240mammalian.json
The training scripts support a shared command-line interface:
out_fpath Output/checkpoint directory. Optional positional argument.
data_root Root directory containing prepared data. Optional positional argument.
--config_fpath Experiment config JSON.
--mini_run Run on a small subset for debugging.
--checkpoint_freq Save/validate every N steps.
--random_seed Random seed.
--run_type train or test.
--dtype float32, float16, or bfloat16.
--verbose Verbose logging.
--no_wandb Disable Weights & Biases logging.
--last_step Resume from latest checkpoint (-1) or a specific checkpoint step.
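For readers who want to wrap or script these commands, the interface above could be mirrored with argparse roughly as follows. This is a sketch and not the parser used by the training scripts: the argument names come from the list above, while the types and every default except the config path are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the shared training CLI")
    # Optional positional arguments (defaults here are placeholders).
    parser.add_argument("out_fpath", nargs="?", default="checkpoints/run")
    parser.add_argument("data_root", nargs="?", default="data_processing/data")
    parser.add_argument("--config_fpath", default="configs/jamba-small-240mammalian.json")
    parser.add_argument("--mini_run", action="store_true")
    parser.add_argument("--checkpoint_freq", type=int, default=2000)  # assumed default
    parser.add_argument("--random_seed", type=int, default=0)         # assumed default
    parser.add_argument("--run_type", choices=["train", "test"], default="train")
    parser.add_argument("--dtype", choices=["float32", "float16", "bfloat16"],
                        default="bfloat16")                           # assumed default
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--no_wandb", action="store_true")
    # -1 means "resume from the latest checkpoint"; the default is assumed.
    parser.add_argument("--last_step", type=int, default=-1)
    return parser


args = build_parser().parse_args([
    "checkpoints/argamba-dual",
    "data_processing/data",
    "--mini_run",
    "--checkpoint_freq", "100",
    "--last_step", "-1",
    "--no_wandb",
])
print(args)
```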
The default config path is:
configs/jamba-small-240mammalian.json
In the examples below, data_processing/data/ is used as the data root. The training scripts append the dataset name from the config, e.g. 240-mammalian.
Note: several BiGamba training scripts retain historical caduceus_train filenames.
| Model family | Task | Script | Config |
|---|---|---|---|
| ArGamba | dual: NTP + CEP | src/test_train.py | configs/jamba-small-240mammalian.json |
| ArGamba | sequence-only: NTP | src/test_train_noPHYLOP.py | configs/jamba-small-240mammalian.json |
| ArGamba | conservation-only: CEP | src/test_train_noALM.py | configs/jamba-small-240mammalian.json |
| BiGamba | dual: MLM + MEM | src/caduceus_train.py | configs/jamba-small-240mammalian.json |
| BiGamba | sequence-only: MLM | src/caduceus_train_noPHYLOP.py | configs/jamba-small-240mammalian.json |
| BiGamba | conservation-only: MEM | src/caduceus_train_noMLM.py | configs/jamba-small-240mammalian.json |
Use --mini_run to verify that the environment, data paths, model construction, and checkpoint writing work before launching a full run.
Example for ArGamba dual:
python src/test_train.py \
checkpoints/argamba-dual \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--mini_run \
--checkpoint_freq 100 \
--dtype bfloat16 \
--no_wandb
Example for BiGamba dual:
python src/caduceus_train.py \
checkpoints/bigamba-dual \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--mini_run \
--checkpoint_freq 100 \
--dtype bfloat16 \
--no_wandb
ArGamba dual:
python src/test_train.py \
checkpoints/argamba-dual \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--checkpoint_freq 2000 \
--dtype bfloat16
ArGamba sequence-only:
python src/test_train_noPHYLOP.py \
checkpoints/argamba-seq-only \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--checkpoint_freq 2000 \
--dtype bfloat16
ArGamba conservation-only:
python src/test_train_noALM.py \
checkpoints/argamba-cons-only \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--checkpoint_freq 2000 \
--dtype bfloat16
BiGamba dual:
python src/caduceus_train.py \
checkpoints/bigamba-dual \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--checkpoint_freq 2000 \
--dtype bfloat16
BiGamba sequence-only:
python src/caduceus_train_noPHYLOP.py \
checkpoints/bigamba-seq-only \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--checkpoint_freq 2000 \
--dtype bfloat16
BiGamba conservation-only:
python src/caduceus_train_noMLM.py \
checkpoints/bigamba-cons-only \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--checkpoint_freq 2000 \
--dtype bfloat16
Resume from the latest checkpoint in the output directory:
python src/test_train.py \
checkpoints/argamba-dual \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--last_step -1
Resume from a specific checkpoint step:
python src/test_train.py \
checkpoints/argamba-dual \
data_processing/data \
--config_fpath configs/jamba-small-240mammalian.json \
--last_step 44000
Training writes checkpoint directories into out_fpath. Most scripts save checkpoints as dcp_<step>/; some task-specific scripts may use a different prefix.
Checkpoints contain model weights, optimizer state, scheduler state, current epoch, step count, token count, and sequence count.
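To see which step `--last_step -1` would most plausibly resume from, the checkpoint directory names can be inspected directly. The sketch below is an assumption about how that lookup could be done, not the scripts' own logic; `latest_checkpoint_step` is a hypothetical helper, and the prefix argument covers scripts that do not use `dcp_`.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint_step(out_fpath, prefix: str = "dcp_") -> Optional[int]:
    """Return the highest step among <prefix><step>/ directories, or None."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    steps = []
    for child in Path(out_fpath).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            # Compare steps as integers: as strings, "2000" sorts after "10000".
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# Demonstration on a temporary directory with made-up checkpoint steps.
with tempfile.TemporaryDirectory() as tmp:
    for step in (2000, 10000, 44000):
        (Path(tmp) / f"dcp_{step}").mkdir()
    (Path(tmp) / "notes.txt").write_text("not a checkpoint")
    latest = latest_checkpoint_step(tmp)

with tempfile.TemporaryDirectory() as empty_dir:
    latest_in_empty = latest_checkpoint_step(empty_dir)

print(latest, latest_in_empty)
```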
After preparing the main genome/phyloP training data, additional scripts can be used to generate downstream evaluation regions and run representation-level evaluations.
The script data_processing/sample_regions.py creates BED files for biologically defined genomic categories used in downstream evaluation.
Some small region annotation files are included in:
data_processing/region_info/
Currently included:
data_processing/region_info/
├── experiments.tsv
├── hg38_UCNE_coordinates.bed
├── promoters.bed
└── ucne_paralogues.txt
These files are small enough to keep in the repository. See the paper for details on how these annotation files were derived.
Large annotation files are not included in the repository and should be downloaded or generated locally:
data_processing/region_info/
├── repeats_hg38.bed
├── UCSC_3UTR_exons.bed
└── UCSC_5UTR_exons.bed
The following inputs are required or optional depending on which categories you want to generate:
| Category | Source |
|---|---|
| coding_regions | GTF |
| noncoding_regions | inferred from GTF-derived annotated regions |
| exons | GTF |
| introns | inferred from GTF exon structure |
| upstream_TSS | inferred from GTF transcript boundaries |
| start_codon | GTF |
| stop_codon | GTF |
| promoters | data_processing/region_info/promoters.bed |
| UTR5 | UCSC 5′ UTR BED export |
| UTR3 | UCSC 3′ UTR BED export |
| repeats | UCSC RepeatMasker BED export |
| UCNE | data_processing/region_info/hg38_UCNE_coordinates.bed |
| vista_enhancer | data_processing/region_info/experiments.tsv |
| phyloP_positive | sampled from phyloP bigWig |
| phyloP_neutral | sampled from phyloP bigWig |
| phyloP_negative | sampled from phyloP bigWig |
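As an illustration of what "inferred from GTF exon structure" can mean for introns, the sketch below derives intron intervals as the gaps between the merged exons of one transcript. It is a generic approach and may differ from what sample_regions.py does; the function names are hypothetical. Intervals are 0-based and half-open, so GTF coordinates (1-based, inclusive) would first be converted with `start - 1, end`.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, half-open


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive merged exons of one transcript."""
    merged = merge_intervals(exons)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(merged, merged[1:])]


# Made-up exon coordinates; two of them overlap and are merged first.
exons = [(100, 200), (500, 650), (150, 220), (900, 1000)]
introns = introns_from_exons(exons)
print(introns)  # [(220, 500), (650, 900)]
```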
Prepare per-chromosome GTF files in:
data_processing/data/gtfs/
Expected structure:
data_processing/data/gtfs/
├── chr1.gtf
├── chr2.gtf
├── ...
└── chrX.gtf
These are derived from GENCODE.
Repeat annotations can be exported from the UCSC Genome Browser RepeatMasker track and saved locally as:
data_processing/region_info/repeats_hg38.bed
UTR annotations can be exported from UCSC as BED files and saved locally as:
data_processing/region_info/UCSC_5UTR_exons.bed
data_processing/region_info/UCSC_3UTR_exons.bed
Promoter annotations can be downloaded from EPD:
https://epd.expasy.org/ftp/epdnew/human/current/
By default, generated region BED files are written to:
data_processing/data/regions/
with one subdirectory per category and one BED file per chromosome:
data_processing/data/regions/
├── coding_regions/
│ ├── chr1.bed
│ ├── chr2.bed
│ └── ...
├── UCNE/
├── repeats/
├── UTR3/
├── UTR5/
└── vista_enhancer/
Example command:
python data_processing/sample_regions.py \
--bigwig_file data_processing/data/240-mammalian/241-mammalian-2020v2.bigWig \
--genome_fasta data_processing/data/240-mammalian/hg38.ml.fa \
--gtf_dir data_processing/data/gtfs/ \
--vista_tsv data_processing/region_info/experiments.tsv \
--promoters_bed data_processing/region_info/promoters.bed \
--utr5_bed data_processing/region_info/UCSC_5UTR_exons.bed \
--utr3_bed data_processing/region_info/UCSC_3UTR_exons.bed \
--repeats_bed data_processing/region_info/repeats_hg38.bed \
--ucne_bed data_processing/region_info/hg38_UCNE_coordinates.bed \
--ucne_paralogues data_processing/region_info/ucne_paralogues.txt \
--output_dir data_processing/data/regions \
--chromosomes auto \
--num_regions 10000 \
--region_length 2048 \
--limit_per_category 10000 \
--phylop_num_samples 10000 \
--seed 42
Small chr22 test:
python data_processing/sample_regions.py \
--chromosomes chr22 \
--num_regions 100 \
--phylop_num_samples 1000 \
--limit_per_category 100 \
--output_dir data_processing/data/regions_test
The script enforces non-overlap between categories using a priority order, so higher-priority feature classes are retained first.
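One way to picture priority-based non-overlap is the sketch below. It is not the script's implementation: the priority order shown is invented for the example, and dropping a lower-priority region entirely (instead of trimming it) is an assumption.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, half-open


def assign_by_priority(regions: Dict[str, List[Interval]],
                       priority: List[str]) -> Dict[str, List[Interval]]:
    """Keep a region only if it overlaps nothing kept by a higher-priority category."""
    kept: Dict[str, List[Interval]] = {}
    claimed: List[Interval] = []  # regions retained by earlier categories
    for category in priority:
        retained = [
            (start, end)
            for start, end in sorted(regions.get(category, []))
            if all(end <= c_start or start >= c_end for c_start, c_end in claimed)
        ]
        kept[category] = retained
        # Claim only after the category is done, so regions within one
        # category may still overlap each other.
        claimed.extend(retained)
    return kept


# Hypothetical regions on one chromosome and a hypothetical priority order.
regions = {
    "coding_regions": [(100, 200), (400, 500)],
    "UCNE": [(480, 520), (800, 900)],
    "repeats": [(150, 250), (600, 700)],
}
kept = assign_by_priority(regions, ["coding_regions", "UCNE", "repeats"])
print(kept)
```

Here the UCNE at 480-520 and the repeat at 150-250 are dropped because each overlaps a coding region that was retained first.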
The script data_processing/make_ATG_data.py generates a 5-way ATG benchmark. For each transcript with a valid ATG start codon, it identifies:
- the true start codon ATG
- a nearby noncoding ATG, 2–5 kb away by default
- a far noncoding ATG, at least 100 kb away by default
- an in-frame internal methionine from the same coding sequence
- an out-of-frame ATG motif from the same coding sequence
Default output directory:
data_processing/data/ATGs_simplified/
Each chromosome produces a TSV:
data_processing/data/ATGs_simplified/
├── chr1_atg_5way_labels.tsv
├── chr2_atg_5way_labels.tsv
└── ...
Example for one chromosome:
python data_processing/make_ATG_data.py \
--chrom chr22 \
--gtf_dir data_processing/data/gtfs \
--genome data_processing/data/240-mammalian/hg38.ml.fa \
--out data_processing/data/ATGs_simplified \
--n_sample 10000 \
--random_seed 42
Loop over autosomes:
for i in $(seq 1 22); do
chrom="chr${i}"
echo "[RUN] ${chrom}"
python data_processing/make_ATG_data.py \
--chrom "$chrom" \
--gtf_dir data_processing/data/gtfs \
--genome data_processing/data/240-mammalian/hg38.ml.fa \
--out data_processing/data/ATGs_simplified \
--n_sample 10000 \
--random_seed 42
done
Concatenate chromosome-level TSVs:
mkdir -p data_processing/data/ATGs
head -n 1 data_processing/data/ATGs_simplified/chr1_atg_5way_labels.tsv \
> data_processing/data/ATGs/all_chr_atg_5way.tsv
for f in data_processing/data/ATGs_simplified/chr*_atg_5way_labels.tsv; do
tail -n +2 "$f" >> data_processing/data/ATGs/all_chr_atg_5way.tsv
done
The script src/evaluation/ATG_reps.py loads the ATG 5-way TSV, extracts sequence contexts around each ATG, embeds them with Gamba/Caduceus or baseline features, and evaluates whether representations distinguish the five ATG classes using leave-one-out 1-nearest-neighbor classification.
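Leave-one-out 1-nearest-neighbor classification assigns each example the label of its closest other example, so no separate training step is needed. The sketch below shows the idea on toy 2-D vectors together with a balanced accuracy (mean per-class recall). Euclidean distance is an assumption, the function names are hypothetical, and the actual script may use a different metric or library.

```python
import math
from collections import defaultdict
from typing import List, Sequence


def loo_1nn_predictions(embeddings: Sequence[Sequence[float]],
                        labels: Sequence[int]) -> List[int]:
    """Predict each example's label from its nearest neighbor, excluding itself."""
    preds = []
    for i, query in enumerate(embeddings):
        best_j, best_dist = -1, math.inf
        for j, other in enumerate(embeddings):
            if i == j:
                continue  # leave the query example out
            dist = math.dist(query, other)
            if dist < best_dist:
                best_j, best_dist = j, dist
        preds.append(labels[best_j])
    return preds


def balanced_accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Average the recall of each class so that rare classes count equally."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred in zip(labels, preds):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy embeddings: the last class-2 point sits closer to the class-1 cluster.
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (1.5, 1.5)]
labels = [1, 1, 1, 2, 2, 2]
preds = loo_1nn_predictions(embeddings, labels)
score = balanced_accuracy(labels, preds)
print(preds, round(score, 4))
```

In this toy case class 1 has recall 3/3 and class 2 has recall 2/3, giving a balanced accuracy of 5/6.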
Default ATG input:
data_processing/data/ATGs/all_chr_atg_5way.tsv
Default output directory:
data_processing/data/240-mammalian/ATG_reps_5way/
Example Gamba dual-task evaluation:
python src/evaluation/ATG_reps.py \
--atg_tsv_path data_processing/data/ATGs/all_chr_atg_5way.tsv \
--bigwig_file data_processing/data/240-mammalian/241-mammalian-2020v2.bigWig \
--genome_fasta data_processing/data/240-mammalian/hg38.ml.fa \
--checkpoint_dir /home/mica/gamba/ \
--config_fpath configs/jamba-small-240mammalian.json \
--output_dir data_processing/data/240-mammalian/ATG_reps_5way \
--model_type gamba \
--training_task dual \
--last_step 44000 \
--batch_size 32 \
--n_examples 2000 \
--seed 42
Other trained model variants:
python src/evaluation/ATG_reps.py \
--model_type gamba \
--training_task seq_only \
--last_step 44000 \
--n_examples 2000
python src/evaluation/ATG_reps.py \
--model_type gamba \
--training_task cons_only \
--last_step 44000 \
--n_examples 2000
Random-initialized comparison:
python src/evaluation/ATG_reps.py \
--model_type gamba \
--training_task dual \
--last_step 0 \
--n_examples 2000
Baselines:
python src/evaluation/ATG_reps.py \
--baseline kmer6 \
--n_examples 2000
python src/evaluation/ATG_reps.py \
--baseline kmer6_flanked \
--n_examples 2000
python src/evaluation/ATG_reps.py \
--baseline phylop \
--n_examples 2000
Use a 6-nt ROI around each ATG instead of the default 3-nt ROI:
python src/evaluation/ATG_reps.py \
--model_type gamba \
--training_task dual \
--last_step 44000 \
--use_6mer_roi
This evaluation saves files such as:
reps_<model_tag>_ATG_5way_all_labels.npz
reps_<model_tag>_ATG_5way_all_labels_meta.parquet
reps_<model_tag>_ATG_5way_all_labels_full.npz
reps_<model_tag>_ATG_5way_all_labels_full_meta.parquet
knn_heatmap_<model_tag>_ATG5way_all_labels.png
knn_heatmap_<model_tag>_ATG1_vs_2.png
knn_heatmap_<model_tag>_ATG1_vs_3.png
knn_heatmap_<model_tag>_ATG1_vs_4.png
knn_heatmap_<model_tag>_ATG1_vs_5.png
balanced_accuracy_<model_tag>_ATG5way.json
sampled_examples_atg5.tsv
sampled_examples_atg5.meta.json
General representation evaluations can be run with:
python src/evaluation/run_eval.py \
--checkpoint_dir /home/mica/gamba/ \
--config_fpath configs/jamba-small-240mammalian.json \
--regions_dir data_processing/data/regions \
--bigwig_file data_processing/data/240-mammalian/241-mammalian-2020v2.bigWig \
--genome_fasta data_processing/data/240-mammalian/hg38.ml.fa \
--model_type gamba \
--training_task dual \
--last_step 44000
Adjust --training_task for different model variants:
--training_task dual
--training_task seq_only
--training_task cons_only
Use --last_step 0 for random-initialized comparisons where supported.
Training checkpoints are saved under the requested checkpoint/output directory. Evaluation outputs are written under the specified evaluation output directory and typically include representation arrays, metadata tables, plots, and metrics JSON files.
Common output types:
*.npz compressed embeddings and labels
*.parquet per-example metadata
*.png heatmaps and plots
*.json evaluation metrics
*.bed genomic regions
*.bedGraph genome-browser-compatible prediction tracks
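For reference, a bedGraph track is plain text with four tab-separated columns (chromosome, 0-based start, exclusive end, value). The sketch below writes per-base scores in that form and merges runs of identical values to keep the file small. It is a generic illustration and not the notebook's exporter; the function name, the rounding precision, and the example scores are assumptions.

```python
import io
from typing import Sequence, TextIO


def write_bedgraph(handle: TextIO, chrom: str, start: int,
                   scores: Sequence[float], decimals: int = 3) -> None:
    """Write per-base scores as bedGraph lines, merging runs of equal values."""
    run_start, run_value = None, None
    for offset, score in enumerate(scores):
        value = round(score, decimals)
        pos = start + offset
        if value != run_value:
            # Close the previous run before starting a new one.
            if run_value is not None:
                handle.write(f"{chrom}\t{run_start}\t{pos}\t{run_value}\n")
            run_start, run_value = pos, value
    if run_value is not None:
        handle.write(f"{chrom}\t{run_start}\t{start + len(scores)}\t{run_value}\n")


# Made-up predictions for four consecutive bases starting at chr22:1000 (0-based).
buffer = io.StringIO()
write_bedgraph(buffer, "chr22", 1000, [0.5, 0.5, 1.25, -0.3])
text = buffer.getvalue()
print(text, end="")
```

Writing to a real file only requires replacing the `io.StringIO` buffer with `open(path, "w")`.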