MHCflurry 2.3.0rc2
MHCflurry 2.3.0 (pre-release)
Release-candidate notes for 2.3.0. They are kept in this file, rather
than merged upstream into the single CHANGELOG.md, until the validation
training run completes and any last revisions land; they will move to
CHANGELOG.md at tag time.
Headline
Pan-allele training pipeline modernization. fit() now defaults to
device-resident tensors on CUDA — the inner training loop, random-
negative pool, and validation forward pass all stay on-GPU, closing the
GPU-starvation gap that the host-side batching path was working
around. The post-training pipeline (select → calibrate → eval → plot)
is unified into a single resumable script, and calibration runs on
device end-to-end (PercentRankTransform.fit_batch_torch,
motif_summary.motif_summary_chunk_gpu) for an additional per-worker speedup on
top of the legacy --gpu-batched allele batching. Recipe tightening
(min_delta=1e-7, max_epochs=500) kills the patience-reset
noise tail. Auto-resolvers pick workers/dataloader/compile settings
from hardware so per-box tuning stops being manual.
The orchestrator-as-locus-of-control architecture is documented in
docs/orchestrator.md — read that for the
"who owns what" picture across parallelism, tensor residency, and env
knobs.
No changes to the prediction interface. Saved 2.2.x model bundles
load and predict identically — the changes are entirely in how new
models are trained.
Performance
- ~2–3× per-task training speedup from device-resident affinity
  training tensors (closes 0–30% GPU utilization observed on the
  2026-04-25 8×A100 baseline run).
- ~10–20× calibration speedup from --gpu-batched, larger work
  chunks, and the affinity release wrapper calibrating at 50 K peptides
  per length (the --num-peptides-per-length CLI default is unchanged at
  100 K).
- 30–40% fewer wasted training epochs from the recipe changes
  (min_delta=1e-7, max_epochs=500) terminating noise-floor
  patience-reset trajectories.
New public API
- mhcflurry/class1_affinity_training_data.py: device-resident affinity
  training row space. AffinityDeviceTrainingData keeps real examples and
  random negatives as torch tensors on the active device for one fit()
  call.
- mhcflurry/training_benchmark.py: micro-benchmarks for the
  training inner loop (used for sweep_workers analysis).
Recipe changes (scripts/training/release_exact/generate_hyperparameters.py)
These produce different model weights from the published 2.2.x
release. Quantitative deltas vs the 2.2.0 ensemble on the
data_evaluation hit/decoy benchmark will be reported under
Validation results below once the 2.3.0
validation run completes.
| Hyperparameter | 2.2.x | 2.3.0 | Why |
|---|---|---|---|
| max_epochs | 5000 | 500 | Median observed was 67; max 174. The 5000 ceiling was theatrical and let pathological patience-reset tasks burn unbounded compute. 500 leaves comfortable headroom. |
| min_delta | 0.0 | 1e-7 | With min_delta=0, a 1e-9 RMSprop noise-floor improvement resets the 20-epoch patience counter, stretching some tasks to 174+ epochs at val_loss ~0.28 with no real signal. 1e-7 is two orders of magnitude above the observed noise rate; preserves real escape trajectories (typically ≥1e-3/epoch). |
| validation_interval | 1 (always validated) | 5 | Skip the validation forward pass on 4 of 5 epochs; saves ~150 ms/epoch + a GPU sync barrier. The final epoch and any patience-trigger epoch are always measured (the saved model reflects an up-to-date val_loss). |
| dataloader_num_workers (job-env default) | 0 | 1 | Applies to streaming pretraining batches only. Affinity fine-tuning no longer uses a per-fit DataLoader; it batches from device-resident tensors. One streaming worker per fit is the release wrapper default; tune upward only when CPU headroom and measurements justify it. |
| peptide_amino_acid_encoding_torch | n/a | true | Renamed replacement for the legacy peptide_amino_acid_encoding_gpu key, which is still accepted as an alias. Fixed peptide vector expansion moves from a numpy lookup at encode time to a frozen torch embedding table in the network's forward pass. peptides_to_network_input now returns int8 amino-acid indices by default; CUDA/MPS/CPU widens to the configured fixed vector encoding (BLOSUM62, one-hot, PMBEC, contact, physchem explicit descriptors, atchley factors, or composites such as BLOSUM62+physchem). Encodings may use a :minmax suffix, e.g. PMBEC:minmax+contact:minmax, to scale non-X values to [-1, 1] while preserving X as zero. Eliminates the ~17 sec/epoch CPU bottleneck in random-negative regeneration with random_negative_pool_epochs=1. Forward parity vs numpy path verified by test_peptide_amino_acid_encoding_torch_forward_parity. |
patience stays at 20.
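
The min_delta row can be illustrated with a minimal early-stopping sketch. This is not MHCflurry's training loop: epochs_until_stop and the synthetic loss trace are hypothetical, and the sketch assumes an epoch counts as an improvement only when val_loss beats the best value so far by more than min_delta.

```python
def epochs_until_stop(val_losses, patience=20, min_delta=0.0):
    """Return how many epochs run before patience is exhausted.

    Hypothetical helper, not MHCflurry code. An epoch counts as an
    improvement only if it beats the best loss by more than min_delta.
    """
    best = float("inf")
    epochs_since_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > min_delta:
            best = loss
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:
            return epoch
    # Patience never ran out: every available epoch was used.
    return len(val_losses)


# Synthetic trace: val_loss creeps down by 1e-9 per epoch near 0.28.
noise_floor_trace = [0.28 - 1e-9 * i for i in range(500)]

stop_old = epochs_until_stop(noise_floor_trace, min_delta=0.0)
stop_new = epochs_until_stop(noise_floor_trace, min_delta=1e-7)
print(stop_old, stop_new)
```

On this synthetic trace, min_delta=0.0 lets every 1e-9 step reset the counter, so the run uses all 500 epochs; min_delta=1e-7 stops at epoch 21, once 20 epochs pass without a qualifying improvement.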
CLI changes
- Unified mhcflurry parent command. Every tool is now reachable as
  mhcflurry <subcommand> (mhcflurry predict, mhcflurry downloads fetch,
  mhcflurry class1-train-pan-allele-models, …) under one mhcflurry --help
  surface. The historical mhcflurry-<subcommand> console scripts still work
  as compat shims (same entry points). Two tools are new and unified-only:
  mhcflurry compare-models and mhcflurry plot-model-comparison.
- mhcflurry-class1-train-pan-allele-models --max-workers-per-gpu
  default changed from 1000 (effectively unlimited per-GPU) to
  auto. Auto-detect picks
  min(num_jobs/num_gpus, 0.6×free_vram/per_worker_gb, hard_cap=4)
  without importing torch or initializing CUDA in the parent process.
  per_worker_gb defaults to 4 GB (the affinity-fit footprint).

  Cross-checks: 8 GPUs + 16 jobs → 2 (num-jobs-limited); 8 GPUs +
  32 jobs → 4 (hard cap, ample VRAM); CPU-only → 1.

  Pass --max-workers-per-gpu N to pin explicitly.
- mhcflurry-class1-train-pan-allele-models --dataloader-num-workers
  new flag, default auto. Orchestrator derives the per-fit-worker
  DataLoader prefetch child count from the box's vCPUs / RAM /
  resolved fit-worker plan via auto_dataloader_num_workers, capped
  at 4. The resolved value overrides any dataloader_num_workers set in
  component-model hyperparameters at planning time, so saved configs
  reflect the actual choice. On 8×A100-80GB Verda (176v / 16 fit /
  400 G) this resolves to 4 (the 2026-04-26 production benchmark) and
  steps down on tighter boxes (3 on 8×L40S, 1 on tight cluster nodes,
  0 on RAM-starved or CPU-oversubscribed configs). The release recipe
  passes DATALOADER_NUM_WORKERS=auto by default; pin a literal int
  only when re-benchmarking.

  The flag is added via shared add_local_parallelism_args so every
  train_*_command accepts it. Affinity (pan-allele, allele-specific)
  applies it via apply_dataloader_num_workers_to_work_items.
  Processing accepts the flag for argv uniformity but is a no-op
  until Class1ProcessingNeuralNetwork grows the same prefetch
  hyperparameter; presentation runs single-process and ignores it.
- mhcflurry-calibrate-percentile-ranks wrapper-default now
  passes --gpu-batched and uses larger chunk sizes. Bit-identical
  on CUDA per the existing flag's behavior (issue #272).
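
The --max-workers-per-gpu auto rule and its cross-checks can be restated as a short sketch. The function name, the rounding (ceil on jobs per GPU, floor on the VRAM term), and the 80 GB free-VRAM figure are assumptions made for illustration; the actual resolver may round or probe differently.

```python
import math


def auto_max_workers_per_gpu(num_jobs, num_gpus, free_vram_gb,
                             per_worker_gb=4.0, hard_cap=4):
    """Hypothetical restatement of the auto rule, not the real resolver."""
    if num_gpus == 0:
        # CPU-only boxes resolve to a single worker.
        return 1
    by_jobs = math.ceil(num_jobs / num_gpus)           # assumed rounding
    by_vram = int(0.6 * free_vram_gb / per_worker_gb)  # assumed rounding
    return max(1, min(by_jobs, by_vram, hard_cap))


# Cross-checks from the notes, assuming ~80 GB free per A100.
print(auto_max_workers_per_gpu(num_jobs=16, num_gpus=8, free_vram_gb=80))
print(auto_max_workers_per_gpu(num_jobs=32, num_gpus=8, free_vram_gb=80))
print(auto_max_workers_per_gpu(num_jobs=32, num_gpus=0, free_vram_gb=0))
```

Under these assumptions the three calls resolve to 2, 4, and 1, matching the cross-checks listed for the flag.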
Behavioral changes worth knowing
Training and calibration are reproducible by default (--random-seed)
Every CLI command that involves randomness — mhcflurry-class1-train-pan-allele-models,
-train-allele-specific-models, -train-processing-models,
-select-allele-specific-models, and mhcflurry-calibrate-percentile-ranks —
now takes a single --random-seed that controls all of its randomness:
fold/held-out assignment, weight initialization, example/batch shuffles,
random-negative sampling, random peptide universes, and genotype sampling.
The master seed is logged and, for the two-phase pan-allele/processing
pipelines, persisted into training_init_info.pkl so it survives an
--only-initialize / --continue-incomplete split.
The default is 42, not entropy — so a run reproduces bit-for-bit out of
the box (same data, folds, replicates, hyperparameters → identical models).
This is a change from 2.2.x, where each fit drew independent OS entropy and
runs were not reproducible. Pass --random-seed N for a different, still
reproducible run. Ensemble members and per-fit work stay decorrelated (each
derives a distinct sub-seed from the master), so seeding does not reduce
diversity. The neural-network fit() / fit_streaming_batches() and
Class1AffinityPredictor.fit_allele_specific_predictors() APIs gained a
matching seed= keyword (defaults to None = the prior stochastic behavior
for direct API callers).
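
One way to see how a single master seed can still give decorrelated per-fit work is a hash-based sub-seed derivation. This is only a sketch: derive_sub_seed and its label scheme are hypothetical, and the notes do not say how MHCflurry actually derives sub-seeds.

```python
import hashlib


def derive_sub_seed(master_seed, *labels):
    """Hash the master seed plus labels into a 32-bit sub-seed.

    Hypothetical scheme for illustration, not MHCflurry's derivation.
    """
    key = "/".join([str(master_seed), *map(str, labels)]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big")


# Same master seed and labels always give the same sub-seed, while
# different folds/replicates get different ones.
seeds = [
    derive_sub_seed(42, "fold", fold, "replicate", replicate)
    for fold in range(4)
    for replicate in range(2)
]
print(seeds)
```

Because each sub-seed depends only on the master seed and the work item's labels, the result does not depend on the order in which workers pick up tasks.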
Reproducibility caveats. "Bit-for-bit" is exact on CPU and for the default
(Linear/RMSprop) affinity/processing architecture. Two scope conditions are
worth knowing:
- Fixed effective minibatch size. fit() may shrink the minibatch to fit
  available VRAM, and that shrink depends on free GPU memory and how many
  workers share the card, so the same seed on a busier or smaller GPU can
  produce a different model. A warning is logged whenever the shrink fires
  under an explicit seed, and fit_info["effective_minibatch_size"] records
  the value actually used. Pin the minibatch (or run on matching hardware) for
  cross-machine bit-for-bit reproduction.
- CUDA kernel determinism. Seeding covers the RNGs, but mhcflurry does not
  force torch.use_deterministic_algorithms(True), and opting into
  MHCFLURRY_MATMUL_PRECISION enables cudnn.benchmark autotuning. The
  default MLP triggers no cuDNN kernels so it stays deterministic;
  convolutional locally_connected_layers variants are not guaranteed
  bit-identical run-to-run on CUDA.
mhcflurry-class1-train-presentation-models also accepts --random-seed for
uniformity (and logs the resolved value), though it has no stochastic step
today (the logistic-regression fit is deterministic and the parallel feature
path is pure inference).
Because the framework moved from TF/Keras to a Torch-resident loop, 2.3.0 does
not reproduce 2.2.x outputs at an equal seed even on CPU: the per-epoch
training shuffle moved from NumPy to torch.randperm, and scan/presentation
result="best" ties now break deterministically by peptide (a stable
secondary sort key), so the specific tied peptide reported can differ from
2.2.x. These changes are intentional; only exact-tie outputs and cross-version
seed-equality are affected.
--held-out-fraction-seed default is now None (allele-specific)
In mhcflurry-class1-train-allele-specific-models, the
--held-out-fraction-seed default changed from 0 to None. With no flag,
the held-out split is now derived from --random-seed (so the whole run
reproduces from one value) instead of the implicit seed=0 split 2.2.0 used.
The no-flag held-out partition therefore differs from 2.2.0; pass
--held-out-fraction-seed 0 to recover the previous split exactly.
Calibrate silently filters unsupported alleles
mhcflurry-calibrate-percentile-ranks now drops alleles from
predictor.supported_alleles that fail mhcgnomes.parse annotation
checks (pseudogenes, null, questionable) before iterating, with a
logged sample. Previously these would crash the calibration partway
through with ValueError("Unsupported annotation on MHC allele: ...").
User-visible asymmetry: the percent-rank table now lacks rows for
those alleles. Runtime predict() on a dropped allele still raises
the same ValueError it always did. To list the dropped alleles for
a specific predictor:
from mhcflurry import Class1AffinityPredictor
from mhcflurry.calibrate_percentile_ranks_command import (
    filter_canonicalizable_alleles,
)

# models_dir: path to the model bundle to inspect (set before running).
predictor = Class1AffinityPredictor.load(models_dir)
all_alleles = predictor.supported_alleles
kept = filter_canonicalizable_alleles(all_alleles)
dropped = sorted(set(all_alleles) - set(kept))
print(f"{len(dropped)} dropped:", dropped[:10])

validation_interval > 1 and the saved val_loss
When validation_interval > 1, fit_info["val_loss"] is still one
entry per epoch (the on-interval values get carried forward into the
intervening rows for plotting compatibility). Three triggers force a
real measurement:
- on the cadence (epoch % interval == 0),
- on the final epoch of the loop,
- when patience would trigger this epoch (so the saved val_loss
  reflects the actual stop state, not a stale carried-forward value).
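
The three triggers and the carry-forward behaviour can be sketched as follows. record_val_losses, the measure callback, and the patience_would_trigger predicate are hypothetical stand-ins, and epochs are assumed to be 0-based; the real fit() loop is more involved.

```python
def record_val_losses(num_epochs, interval, measure,
                      patience_would_trigger=lambda epoch: False):
    """Build a per-epoch val_loss list, measuring only when required.

    Hypothetical sketch, not MHCflurry code. Epochs are 0-based.
    """
    history = []
    measured_epochs = []
    for epoch in range(num_epochs):
        is_final = epoch == num_epochs - 1
        must_measure = (
            epoch % interval == 0
            or is_final
            or patience_would_trigger(epoch)
        )
        if must_measure:
            history.append(measure(epoch))
            measured_epochs.append(epoch)
        else:
            # Carry the last real measurement forward so the list
            # still has exactly one entry per epoch.
            history.append(history[-1])
    return history, measured_epochs


history, measured_epochs = record_val_losses(
    num_epochs=12, interval=5, measure=lambda epoch: 1.0 / (epoch + 1)
)
print(measured_epochs)
```

With 12 epochs and an interval of 5, real measurements land on epochs 0, 5, 10, and the final epoch 11; every other entry repeats the most recent measured value.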
Affinity fit is device-resident
Affinity fit() no longer routes minibatches through a per-fit
DataLoader. AffinityDeviceTrainingData owns the row space for one
fit call as torch tensors on the active backend, and the training loop
forms batches by index-selecting from those resident tensors. Random
negatives are refilled into the top slice of that row space each epoch.
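
The row-space idea can be shown with a plain-Python stand-in. This sketch uses lists instead of torch tensors and is not the AffinityDeviceTrainingData implementation; the class name, the 9-mer negatives, and the example peptides are illustrative assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class RowSpaceSketch:
    """List-based stand-in for a resident row space (hypothetical).

    Rows [0, num_negatives) hold random negatives; the remaining rows
    hold the real examples and are never rewritten.
    """

    def __init__(self, real_rows, num_negatives, seed=0):
        self.num_negatives = num_negatives
        self.rows = [None] * num_negatives + list(real_rows)
        self.rng = random.Random(seed)

    def refill_negatives(self, length=9):
        # Overwrite only the top slice with fresh random peptides.
        for i in range(self.num_negatives):
            self.rows[i] = "".join(self.rng.choices(AMINO_ACIDS, k=length))

    def batches(self, batch_size):
        # Shuffle row indices, then form each batch by index selection.
        order = list(range(len(self.rows)))
        self.rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield [self.rows[i] for i in order[start:start + batch_size]]


real_examples = ["SIINFEKL", "GILGFVFTL", "NLVPMVATV"]
space = RowSpaceSketch(real_examples, num_negatives=2)
for epoch in range(2):
    space.refill_negatives()
    epoch_batches = list(space.batches(batch_size=2))
```

The point of the layout is that nothing is copied or rebuilt per epoch: only the negative slice is rewritten in place, and batching is an index lookup into rows that already exist.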
New tools
| Tool | Purpose |
|---|---|
| mhcflurry compare-models | Compare two ensembles (run-vs-run or run-vs-public) across affinity, presentation, and training-stats components. Markdown to stdout, CSVs to --out. Each component runs only when both sides have the matching artifact. |
| mhcflurry plot-model-comparison | Render ROC/PR/scatter/delta plots from a compare-models output directory. |
| scripts/training/plot_loss_curves.py | Per-model train + val loss curves from manifest (no weight files needed). Three PNGs + summary CSV. |
When to use which:
- compare-models --b public: a single run vs the published 2.2.0
  baseline (--b defaults to public). The eval stage of
  pan_allele_release_affinity.sh runs this by default.
- compare-models --a run1 --b run2: any two runs against each other.
  Use when comparing recipe variants, hyperparameter sweeps, or 2.3.0
  candidates against each other.
- plot_loss_curves.py: diagnostic. Doesn't need a baseline.
Dev-workstation helper: scripts/dev/relocate_run_outputs.sh moves
brev_runs/ and results/ outside the repo (with symlinks) so runplz's
rsync_up doesn't ship 15+ GB of stale prior-run artifacts to the box on every
launch. Run with --apply once per workstation.
Pipeline orchestration
scripts/training/pan_allele_release_affinity.sh is now end-to-end:
fetch_pretrain_data → fetch_data_curated → train_combined
→ select_combined → calibrate_combined → fetch_eval_data
→ eval_compare_new_vs_public → plot_loss_curves
Each stage runs through run_logged_step with its own log file under
$MHCFLURRY_OUT/. Both new stages (eval + plot) skip cleanly via
SKIP_EVAL=1 / SKIP_PLOTS=1 env knobs for incremental reruns. CI
now runs bash -n over every scripts/**/*.sh to catch syntax
regressions before a multi-hour training run discovers them.
Validation results
TODO: filled in after the 2.3.0 validation training run completes.
Will include mhcflurry compare-models output comparing the 2.3.0
candidate vs the 2026-04-25 baseline run:
- End-to-end wall time delta.
- Per-task training time distribution shift.
- Per-allele eval metric deltas (mean + p25) on the data_evaluation
  benchmark.

Acceptance: existing-allele PR-AUC / ROC-AUC mean delta ≥ 0,
p25 ≥ −0.005. Not shipped to master until this passes.
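
The acceptance gate can be written down as a small check over per-allele metric deltas (candidate minus baseline). This is a sketch under assumptions: passes_acceptance is a hypothetical helper, the delta values are made up, and the quantile method used for p25 is a choice made here, not one the notes specify.

```python
import statistics


def passes_acceptance(deltas, p25_floor=-0.005):
    """Check mean delta >= 0 and 25th-percentile delta >= p25_floor.

    Hypothetical helper; deltas are per-allele metric differences for
    one metric (for example PR-AUC), candidate minus baseline.
    """
    mean_delta = statistics.fmean(deltas)
    # Assumed quantile convention: inclusive linear interpolation.
    p25 = statistics.quantiles(deltas, n=4, method="inclusive")[0]
    return mean_delta >= 0 and p25 >= p25_floor


# Made-up deltas, for illustration only.
mostly_better = [0.01, 0.002, -0.001, 0.004, 0.0]
regressing_tail = [0.02, 0.02, -0.02, -0.01, 0.02]

print(passes_acceptance(mostly_better))
print(passes_acceptance(regressing_tail))
```

The second list shows why the p25 term matters: its mean delta is positive, but a quarter of the alleles regress by more than 0.005, so it fails the gate.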
Dependencies
No required dependency version changes vs 2.2.x. PyTorch 2.0+ is already
required and is used for device-resident training and optional
torch.compile.
Migration notes
- Models trained with 2.3.0 will produce different weights from
  2.2.x even on identical seeds. Predictions on the same
  (peptide, allele) pair will differ; this is quantified in
  validation results. Contributing factors beyond the obvious
  framework switch:
  - RandomNegativesPool with random_negative_pool_epochs > 1
    generates one batch of random negatives and slices it across N
    epochs, rather than re-sampling fresh negatives every epoch as
    2.2.x did. Within a pool cycle consecutive epochs see distinct
    slices of the same pool; a new pool is drawn at each
    epoch // pool_epochs boundary. Set random_negative_pool_epochs=1
    to recover the pre-2.3.0 "fresh negatives every epoch" semantics
    (at the ~17 s/epoch encode cost).
  - The 1-batch-per-architecture warmup primes torch.compile's
    on-disk cache with one synthetic forward+backward; the
    compiled-graph cache it writes does not affect weights, but
    running it does advance the global RNG before training proper
    starts. Pin a per-arch seed if you need bit-equivalence across
    runs.
  - Device-resident random-negative sampling
    (encode_random_negatives_on_device) draws negative peptides as
    amino-acid indices via torch.multinomial rather than the host
    numpy random_peptides stream. Because this is a different RNG
    stream than 2.2.x used, even at an identical --random-seed the
    actual random-negative peptides differ (not just their row
    layout). This is an additional contributor, beyond the framework
    switch and the random_negative_pool_epochs slicing above, to why
    2.3.0 models differ from 2.2.x.
- Training ingestion now canonicalizes allele names, so retraining on
  data that contained aliased / retired / alternative spellings can change
  which rows are included and therefore the resulting weights. Previously the
  training commands exact-string-matched the allele column and assumed it was
  pre-normalized: non-canonical rows were silently dropped (pan-allele, no
  matching pseudosequence key) or fragmented into separate models
  (allele-specific). 2.3.0 maps each name to its canonical key no-alias-first
  (an allele keeps its own pseudosequence when it has one, otherwise its alias
  target), matching how prediction already resolves names. If your training
  CSVs were already fully normalized this is a no-op; otherwise expect more
  rows retained and previously-fragmented alleles merged. (Prediction and
  calibration behavior is unchanged.)
- Saved 2.2.x model bundles still work unchanged in 2.3.0 for
  prediction; no migration needed for downstream users running
  inference on existing bundles.
- Class1PresentationPredictor.save() keyword write_metdata renamed to
  write_metadata (the prior spelling was a typo). The misspelled form would
  have raised TypeError for in-tree callers, so this is a no-op for code that
  used the correct spelling; any external caller passing write_metdata= must
  update to write_metadata=.
- Deprecated: the dense-vector amino-acid encoding path. Peptides and
  processing-model sequences are now always index-encoded ((N, L) int8) and
  embedded on device. The peptide_amino_acid_encoding_torch=False /
  amino_acid_encoding_torch=False hyperparameters (and the
  peptide_amino_acid_encoding_gpu alias) no longer select a dense (N, L, V)
  path; they are accepted but coerced to index encoding with a one-time
  deprecation warning, so existing configs still load and predict identically.
  EncodableSequences.variable_length_to_fixed_length_vector_encoding and the
  network's defensive dense-input branch are retained only for tests and are
  marked for removal (grep DEPRECATED (scheduled for removal)). The shared
  vector-encoding table machinery stays: it backs the index embedding and the
  allele encoder.
- The pan-allele release training pipeline is the
  primary thing that's changed. Allele-specific and processing
  training paths inherit shared backend selection and worker sizing,
  but their wrapper scripts are unaffected.
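
The random_negative_pool_epochs slicing described in the first migration note can be sketched as below. negatives_for_epoch, the state dict, and the integer stand-ins for peptides are hypothetical; the sketch only mirrors the stated semantics (one pool per epoch // pool_epochs cycle, a distinct slice per epoch within the cycle).

```python
import itertools


def negatives_for_epoch(epoch, pool_epochs, per_epoch, draw_pool, state):
    """Return this epoch's slice of the current random-negative pool.

    Hypothetical sketch, not the RandomNegativesPool implementation.
    """
    cycle, offset = divmod(epoch, pool_epochs)
    if state.get("cycle") != cycle:
        # New pool cycle: draw enough negatives for pool_epochs epochs.
        state["cycle"] = cycle
        state["pool"] = draw_pool(per_epoch * pool_epochs)
    start = offset * per_epoch
    return state["pool"][start:start + per_epoch]


# Integers stand in for random peptides so each draw is identifiable.
counter = itertools.count()
draws = []


def draw_pool(n):
    draws.append(n)
    return [next(counter) for _ in range(n)]


state = {}
per_epoch_negatives = [
    negatives_for_epoch(epoch, 3, 2, draw_pool, state) for epoch in range(6)
]
print(per_epoch_negatives, draws)
```

With pool_epochs=3, six epochs trigger only two pool draws, and each epoch in a cycle sees a different slice. Setting pool_epochs=1 in the same sketch draws a fresh pool every epoch, which corresponds to the pre-2.3.0 behaviour described in the note.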