feat: multi-model ensemble separation with 9 community-curated presets #265

Merged: beveradb merged 25 commits into main from add-ensemble on Mar 16, 2026

@beveradb beveradb commented Mar 16, 2026

Summary

Adds multi-model ensemble separation to audio-separator, allowing users to combine the outputs of multiple separation models for higher-quality results. Includes a preset system with 9 community-curated configurations, stem labeling fixes, and a comprehensive multi-stem test framework.

Core ensemble support (by @makhlwf — thank you!)

  • New Ensembler class with 11 algorithms: avg_wave, median_wave, min_wave, max_wave, avg_fft, median_fft, min_fft, max_fft, uvr_max_spec, uvr_min_spec, ensemble_wav
  • Multi-model load_model() and _separate_ensemble() pipeline with temp directory management
  • CLI support via --extra_models for specifying additional models
  • Weighted ensembling support via --ensemble_weights
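
To make the waveform-domain algorithms concrete, here is a minimal sketch of what `avg_wave` (with optional weights) and `median_wave` could look like. It is not the `Ensembler` implementation from this PR: plain Python lists stand in for audio arrays, the function signatures are hypothetical, and normalising by the sum of the weights is an assumption.

```python
from statistics import median


def avg_wave(waves, weights=None):
    """Weighted sample-wise mean of equal-length mono waveforms (sketch)."""
    if weights is None:
        weights = [1.0] * len(waves)
    if len(weights) != len(waves):
        raise ValueError("need exactly one weight per model output")
    total = sum(weights)  # assumption: weights are normalised by their sum
    return [
        sum(w * wave[i] for w, wave in zip(weights, waves)) / total
        for i in range(len(waves[0]))
    ]


def median_wave(waves):
    """Sample-wise median across model outputs (sketch)."""
    return [median(samples) for samples in zip(*waves)]


# Three made-up model outputs for the same four-sample stem.
outputs = [
    [0.0, 0.5, -0.5, 1.0],
    [0.5, 0.5, -1.0, 1.0],
    [1.0, 0.5, 0.0, -0.5],
]
mean_mix = avg_wave(outputs, weights=[2.0, 1.0, 1.0])
median_mix = median_wave(outputs)
```

The median variant ignores a single outlier model at each sample (the last sample above), whereas the weighted mean lets a trusted model dominate without discarding the others.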

Ensemble presets

  • ensemble_presets.json with 9 presets sourced from deton24's community guide:
    • Instrumental: `instrumental_clean`, `instrumental_full`, `instrumental_balanced`, `instrumental_low_resource`
    • Vocal: `vocal_balanced`, `vocal_clean`, `vocal_full`, `vocal_rvc`
    • Karaoke: `karaoke` (3-model)
  • `--ensemble_preset <name>` CLI flag and `Separator(ensemble_preset=...)` Python API
  • `--list_presets` to show available presets
  • Preset algorithm/weights can be overridden by explicit user arguments
  • Contributors can add presets via PR to `ensemble_presets.json`
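
The override rule above (explicit arguments win over preset values, with `avg_wave` as the final default) can be sketched as follows. The JSON layout is an assumption: only the name/description/contributor fields are mentioned in this PR, so the `presets`, `models`, `algorithm` and `weights` keys, the model filenames, and the `resolve_preset` helper are all hypothetical.

```python
import json

# Hypothetical preset file contents; key names and model filenames are
# placeholders, not the real ensemble_presets.json schema.
PRESETS_JSON = """
{
  "presets": {
    "example_preset": {
      "name": "Example",
      "description": "Placeholder two-model preset",
      "contributor": "someone",
      "models": ["model_a.ckpt", "model_b.ckpt"],
      "algorithm": "avg_fft",
      "weights": [0.6, 0.4]
    }
  }
}
"""


def resolve_preset(presets, preset_name, algorithm=None, weights=None):
    """Merge a preset with explicit user arguments (sketch)."""
    if preset_name not in presets:
        raise ValueError(
            f"unknown preset {preset_name!r}; available: {sorted(presets)}"
        )
    preset = presets[preset_name]
    models = list(preset["models"])
    # Precedence: explicit argument, then preset value, then avg_wave.
    resolved_algorithm = algorithm or preset.get("algorithm") or "avg_wave"
    resolved_weights = weights if weights is not None else preset.get("weights")
    if resolved_weights is not None and len(resolved_weights) != len(models):
        raise ValueError("weights must have one entry per model")
    return {
        "models": models,
        "algorithm": resolved_algorithm,
        "weights": resolved_weights,
    }


presets = json.loads(PRESETS_JSON)["presets"]
defaults = resolve_preset(presets, "example_preset")
overridden = resolve_preset(presets, "example_preset", algorithm="median_wave")
```

Here `overridden` keeps the preset's models and weights but swaps in the user-supplied algorithm.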

Bug fixes

  • Stem label swap: Fixed `common_separator.py` to swap primary/secondary stem names when `target_instrument` doesn't match `instruments[0]` — fixes `bs_roformer_instrumental_resurrection_unwa` whose stems were backwards
  • Contextual stem mapping: In 2-stem models where one stem is vocal, "other" is now mapped to "Instrumental" (previously kept as a separate group, causing broken ensembles)
  • CLI backward compat: Reverted `-m` from `nargs="+"` to single value, added `--extra_models` for additional models (old syntax `audio-separator -m model audio.wav` preserved)
  • Missing init attributes: `model_filename`/`model_filenames` initialized in `__init__`
  • State mutation: `original_output_dir` captured outside per-model loop
  • List reference copy: `self.model_filenames = list(model_filename)` instead of shared reference
  • Mono preservation: Track and restore original channel count through ensemble
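
As an illustration of the stem label swap, the sketch below assumes a two-stem model config with an `instruments` list and an optional `target_instrument`. The function name and the example values are hypothetical; the actual logic lives in `common_separator.py`.

```python
def resolve_stem_names(instruments, target_instrument=None):
    """Return (primary, secondary) labels for a two-stem model (sketch)."""
    primary, secondary = instruments[0], instruments[1]
    # The model predicts target_instrument, so that label must be primary.
    if target_instrument is not None and target_instrument != primary:
        primary, secondary = secondary, primary
    return primary, secondary


# A config that lists Vocals first but actually predicts the instrumental.
swapped = resolve_stem_names(["Vocals", "Instrumental"], "Instrumental")
unchanged = resolve_stem_names(["Vocals", "Instrumental"], "Vocals")
```

Without the swap, the first case would write the instrumental prediction to a file labelled Vocals, which is the symptom described for `bs_roformer_instrumental_resurrection_unwa`.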

Output naming

  • Preset ensembles: `audio_(Vocals)_preset_vocal_balanced.flac`
  • Manual ensembles: `audio_(Vocals)_custom_ensemble_<slug1>_<slug2>.flac`
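
A rough sketch of how such filenames could be assembled is shown below. How the PR derives a slug from a model filename is not stated here, so `slugify` is purely an assumption, as are the helper names and the placeholder model filenames.

```python
import re


def slugify(model_filename):
    """Hypothetical slug: drop the extension, lowercase, keep [a-z0-9]."""
    stem = model_filename.rsplit(".", 1)[0]
    return re.sub(r"[^a-z0-9]+", "", stem.lower())


def ensemble_suffix(model_filenames, preset_name=None):
    """Preset ensembles use the preset name; manual ones use model slugs."""
    if preset_name:
        return f"preset_{preset_name}"
    return "custom_ensemble_" + "_".join(slugify(m) for m in model_filenames)


def output_name(base, stem, suffix, ext="flac"):
    return f"{base}_({stem})_{suffix}.{ext}"


preset_file = output_name("audio", "Vocals", ensemble_suffix([], "vocal_balanced"))
custom_file = output_name(
    "audio", "Vocals", ensemble_suffix(["Model_A.ckpt", "model-b.onnx"])
)
```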

Test framework

Unit tests (233 total, 37 new):

  • Ensembler algorithms (all 11)
  • Preset loading, validation, override, error handling
  • Stem name swap (target_instrument mismatch)
  • CLI flags (--extra_models, --ensemble_preset, --list_presets)
  • Output filename format (preset and custom slugs)

Ensemble preset integration tests (9 parametrized):

  • Run each preset on mardy20s.flac
  • Verify correct stem labels via correlation
  • Compare spectrograms against 36 committed reference images (SSIM > 0.80)

Multi-stem integration tests (5 test clips, 39 reference stems):

  • Test clips: Led Zeppelin (drums/guitar/vocals), Coldplay (piano/instruments), Benny Goodman (brass/wind), Enya (reverb), Queen (backing harmonies)
  • Vocal/instrumental, 4-stem, drumsep pipeline, karaoke, wind extraction, dereverb pipeline
  • Each test verifies output correlates > 0.70 with best-model reference stems
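
The correlation check can be sketched with a plain Pearson coefficient over two sample sequences. This is an illustrative stand-in rather than the code in `tests/utils_audio_verification.py`; the `pearson` helper, the synthetic signals, and the choice to return `nan` for a constant (silent) input are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two sample sequences; nan if either is constant."""
    n = min(len(a), len(b))
    if n == 0:
        return math.nan
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return math.nan  # e.g. a fully silent stem
    return cov / math.sqrt(var_a * var_b)


# Synthetic stand-ins for a reference stem and two candidate outputs.
reference = [math.sin(0.1 * i) for i in range(1000)]
close_output = [0.8 * s + 0.05 * math.sin(1.7 * i) for i, s in enumerate(reference)]
silent_output = [0.0] * 1000

passes = pearson(reference, close_output) > 0.70
silent_is_nan = math.isnan(pearson(reference, silent_output))
```

Because correlation is insensitive to gain, an output that is simply quieter than the reference still passes, while a stem containing the wrong content does not.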

Meaningful ensemble tests:

  • Vocal ensemble matches best single model (>0.90 correlation)
  • Karaoke ensemble extracts only lead vocals (<0.90 correlation with standard vocals on Under Pressure)
  • Karaoke on extracted vocals produces distinct lead/backing split (correlation <0.50)

On-demand regression test (163 models):

  • Verifies every supported model's output stems contain what their labels claim
  • Uses correlation against known vocal/instrumental references
  • Handles utility models (de-echo, de-noise, de-reverb), sub-stems, drumsep
  • Run locally: `pytest tests/regression/test_all_models_stem_verification.py -v -s`

Documentation

  • README updated with ensemble CLI examples, Python API, preset table
  • `ensemble_presets.json` self-documenting with name/description/contributor fields
  • `docs/deton24-model-mapping-and-ensemble-guide.md` — model naming lookup table, ensemble recommendations, and phase fix documentation

Version

Bumped to 0.42.0

Credits

The core ensemble functionality was originally implemented by @makhlwf in #261. This PR builds on that work with bug fixes, the preset system, stem labeling corrections, and comprehensive test coverage. Thank you @makhlwf for the foundational contribution!

Test plan

  • 233 unit tests passing (Python 3.10–3.13, macOS/Ubuntu/Windows)
  • All 11 ensemble algorithms produce valid, non-silent output
  • All 9 presets produce correctly-labeled stems (18/18 verified via correlation)
  • CLI backward compatibility verified (old `-m model audio.wav` syntax works)
  • Vocal ensemble output correlates >0.97 with best single model
  • Karaoke ensemble extracts lead-only (0.748 vs standard on Under Pressure)
  • Lead/backing split produces uncorrelated stems (0.052 correlation)
  • 163-model stem verification: 160 pass, 0 content mismatches, 3 download skips
  • Multi-stem tests pass across 5 clips × 6 model types
  • CI passes on all platforms

@coderabbitai ignore

🤖 Generated with Claude Code

makhlwf and others added 22 commits March 11, 2026 21:15
… coverage

- Revert -m to single value, add --extra_models for ensemble (fixes CLI breaking change)
- Initialize model_filename/model_filenames in __init__ (prevents AttributeError)
- Fix list reference copy in load_model (use list() instead of shared reference)
- Move original_output_dir capture outside per-model loop (state mutation fix)
- Extract stem name map to module-level STEM_NAME_MAP constant
- Preserve mono channel count through ensemble (avoid fake stereo)
- Add trailing newlines to all files
- Add 8 new unit tests: median/min/max_fft, uvr_max/min_spec, invalid algo, weight mismatch
- Add 3 CLI tests: --extra_models, single model string compat, old syntax backward compat
- Update README ensemble examples for new --extra_models flag

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a JSON-based ensemble preset system that lets users select known-good
model combinations by name instead of specifying every detail manually.

Presets are sourced from deton24's community-maintained audio separation guide
and cover instrumental (4), vocal (4), and karaoke (1) use cases.

New features:
- ensemble_presets.json with 9 presets (instrumental_clean/full/balanced/low_resource,
  vocal_balanced/clean/full/rvc, karaoke)
- --ensemble_preset CLI flag and Separator(ensemble_preset=...) Python API
- --list_presets CLI flag to show available presets
- Preset algorithm/weights can be overridden by explicit user args
- ensemble_algorithm parameter now accepts None (defaults to avg_wave)
- 10 new unit tests for preset loading, validation, override, JSON validity
- 2 new CLI tests for --ensemble_preset and --list_presets
- README updated with preset documentation and usage examples

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rument, map "other" to "Instrumental"

Three fixes for stem name handling in ensemble mode:

1. common_separator.py: When a model's target_instrument doesn't match
   instruments[0], swap primary/secondary stem names so the model's
   prediction gets the correct label. Fixes bs_roformer_instrumental_
   resurrection_unwa whose "vocals" output was actually instrumental.

2. separator.py: In _separate_ensemble, when a model produces exactly
   2 stems and one is vocal-like, map "other" to "Instrumental" instead
   of keeping it as a separate group. This ensures all 2-stem models
   contribute to the same Vocals/Instrumental ensemble regardless of
   whether they label their non-vocal stem "Instrumental" or "other".

3. separator.py: Use preset name in ensemble output filenames
   (preset_<name>) and descriptive slugs for manual ensembles
   (custom_ensemble_<slug1>_<slug2>).

Also adds tests/utils_audio_verification.py — a content verification
utility that correlates output stems against known references to detect
label mismatches programmatically.

Verified: all 9 presets now produce exactly 2 correctly-labeled stems
(18/18 OK, 0 mismatches).
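
Fix 2 can be illustrated with a small grouping function. This is a sketch under assumptions: the set of vocal-like names and the `group_stems` helper are hypothetical, and the real mapping in `_separate_ensemble` may differ.

```python
VOCAL_LIKE = {"vocals", "vocal"}  # hypothetical set of vocal-like stem names


def group_stems(stem_names):
    """Map one model's raw stem names to ensemble group names (sketch)."""
    lowered = [s.lower() for s in stem_names]
    two_stem_with_vocal = len(stem_names) == 2 and any(
        s in VOCAL_LIKE for s in lowered
    )
    groups = {}
    for raw, low in zip(stem_names, lowered):
        if two_stem_with_vocal and low == "other":
            # In a vocal/other pair, "other" is everything except the vocals.
            groups[raw] = "Instrumental"
        else:
            groups[raw] = raw
    return groups


two_stem = group_stems(["Vocals", "other"])
four_stem = group_stems(["drums", "bass", "other", "vocals"])
```

The mapping is contextual: in a 4-stem model, "other" remains its own group because it no longer means the full instrumental.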

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ence spectrograms

- 36 reference spectrogram/waveform PNGs for 9 presets × 2 stems each
- test_ensemble_integration.py: parametrized test that for each preset:
  1. Runs the preset separation on mardy20s.flac
  2. Verifies stems contain correct content (correlation-based)
  3. Compares spectrograms against committed references (SSIM)
- generate_reference_images_ensemble.py: script to regenerate references
- utils_audio_verification.py: content verification utility (already committed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…logic

- 5 tests for CommonSeparator stem name swap (target_instrument mismatch,
  no swap when matching, edge cases)
- 2 tests for STEM_NAME_MAP completeness and lowercase invariant
- 2 tests for ensemble output filename format (preset and custom slugs)
- 5 tests for preset validation edge cases (bad weights length, bad
  algorithm, single model, weights applied, weights override)

Total: 233 unit tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The nargs="+" change on -m was reverted in favor of --extra_models,
so the old CLI arg order (audio-separator -m model audio.wav) works
again. No need to change these tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… models

Runs every supported model on mardy20s.flac and verifies each output stem's
label matches its actual content using correlation against known vocal and
instrumental references.

Usage:
  pytest tests/regression/test_all_models_stem_verification.py -v -s
  pytest ... -k "VR" (single architecture)
  pytest ... -k "resurrection" (single model)
  STEM_VERIFY_REPORT_ONLY=1 pytest ... (report without failing)

Handles:
- Vocal/Instrumental stems: verified via Pearson correlation (>0.7 threshold)
- Sub-stems (drums, bass, guitar, piano): verified not-full-mix; near-silence OK
- Full mix detection: any stem with >0.95 correlation to original mix fails
- Demucs 6-stem models: sub-stems like Piano can be legitimately silent

Not run in CI — requires downloading all models.
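
The rules listed in this commit message could be combined roughly as below. The thresholds (0.7 and 0.95) come from the message; the `check_stem` function, its signature, and the decision to pass precomputed correlations are assumptions made for illustration.

```python
import math

LABEL_THRESHOLD = 0.7      # vocal/instrumental correlation threshold
FULL_MIX_THRESHOLD = 0.95  # above this, a stem is treated as the full mix


def check_stem(label, corr_vocal, corr_instrumental, corr_mix):
    """Return (ok, reason) for one output stem, given precomputed correlations."""
    if corr_mix > FULL_MIX_THRESHOLD:
        return False, "stem is nearly identical to the original mix"
    expected = {
        "vocals": corr_vocal,
        "instrumental": corr_instrumental,
    }.get(label.lower())
    if expected is None:
        # Sub-stems (drums, bass, ...) only need to differ from the full mix;
        # near-silence, which yields nan correlations, is acceptable.
        return True, "sub-stem: only the full-mix check applies"
    if math.isnan(expected) or expected <= LABEL_THRESHOLD:
        return False, f"{label} does not correlate with its reference"
    return True, "label matches content"


good = check_stem("Vocals", 0.92, 0.10, 0.60)
mislabelled = check_stem("Vocals", 0.10, 0.92, 0.60)
passthrough = check_stem("Instrumental", 0.20, 0.90, 0.97)
quiet_piano = check_stem("Piano", math.nan, math.nan, math.nan)
```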

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Utility models (de-echo, de-noise, de-reverb, BVE) get relaxed
  verification — their stems don't follow standard vocal/instrumental
  patterns on clean source audio
- Sub-stems (drums, bass, guitar, "No X" variants) skip the full-mix
  check since "No X" is legitimately ≈ the mix when X isn't present
- Partial vocal stems (backing/lead vocals) skip full-vocal correlation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…piration, etc.)

Full 163-model run revealed stem types not yet in SUB_STEMS or UTILITY_STEMS:
- Drumsep: kick, snare, toms, hh, ride, crash
- Gender split: male, female
- Specialized: aspiration, bleed, no bleed
- Utility: noreverb

160 passed, 0 real failures, 3 skipped (download failures).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New test input audio clips with diverse instrumentation for testing
instrument-specific separation models:

- levee_drums.flac (20s, 24-bit) — Led Zeppelin, drums+guitar+vocals
- clocks_piano.flac (20s, 16-bit) — Coldplay, piano+instruments+vocals
- sing_sing_sing_brass.flac (25s, 16-bit) — Benny Goodman, drums+brass+wind
- only_time_reverb.flac (25s, 16-bit) — Enya, reverb-heavy vocal+synths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New integration test suite verifying instrument-specific separation models
across 4 test clips with diverse instrumentation:

Test matrix:
  - Vocal/Instrumental: resurrection model on all 4 clips
  - 4-stem (drums/bass/other/vocals): htdemucs_ft on levee + clocks
  - DrumSep pipeline: mix → htdemucs_ft drums → drumsep kit parts
  - Karaoke: aufr33/viperx model on levee + clocks
  - Wind/Brass: 17_HP-Wind on sing_sing_sing
  - De-reverb pipeline: mix → resurrection vocals → dereverb

30 reference stems generated by best-in-class models, committed as
tests/inputs/reference/ref_*.flac. Tests verify new model outputs
correlate > 0.70 with references.

Includes generate_multi_stem_references.py for regenerating references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Karaoke models remove lead vocals while preserving backing vocals.
The test now additionally checks that karaoke vocal output differs
from standard vocal output (correlating < 0.95), confirming the model
is doing karaoke-specific extraction, not just a generic split.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Queen & David Bowie — Under Pressure 1:35-1:55 (20s, 16-bit).
Section has clear lead vocal over dense backing harmonies, making
karaoke vs standard vocal separation measurably different (0.740
correlation vs 0.961 for Clocks which lacks strong backing vocals).

Karaoke test now runs on 3 clips: levee, clocks, under_pressure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New minor version for:
- Multi-model ensemble separation
- 9 community-curated ensemble presets
- Stem label fixes (target_instrument swap, contextual "other" mapping)
- New CLI flags: --extra_models, --ensemble_preset, --list_presets
- Multi-stem integration test framework

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three tests verifying ensemble presets produce semantically correct output:

1. test_vocal_ensemble_matches_best_single_model: vocal_balanced ensemble
   output should correlate >0.90 with the best single model (Resurrection),
   confirming ensemble doesn't degrade quality.

2. test_karaoke_ensemble_extracts_lead_only: On Under Pressure (prominent
   backing harmonies), karaoke ensemble vocals should differ from standard
   vocal extraction (<0.90 correlation), confirming it extracts only lead.

3. test_karaoke_on_vocals_produces_lead_backing_split: Pipeline test —
   mix → vocal model → karaoke model should produce distinct lead and
   backing vocal stems (both non-silent, correlation <0.50).

Includes 9 new reference stems for these tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@beveradb beveradb enabled auto-merge (squash) March 16, 2026 05:58
@beveradb beveradb mentioned this pull request Mar 16, 2026
beveradb and others added 3 commits March 16, 2026 18:10
…on tests

find_stem() matched the first _(StemName) group in filenames, which broke
pipeline tests where the input filename already contained a parenthesized
stem from a prior step. Now uses the last match. Also handle near-silent
stems (e.g. vocals from instrumental-only audio) returning nan correlation.
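
The first-match versus last-match difference is easy to see with a regular expression. The sketch below is not the test suite's `find_stem()`; the pattern, the `last_stem_label` helper, and the example filename are assumptions.

```python
import re

# Matches "_(StemName)" groups such as "_(Vocals)" in an output filename.
STEM_PATTERN = re.compile(r"_\(([^)]+)\)")


def last_stem_label(filename):
    """Return the last parenthesised stem label in a filename, or None."""
    matches = STEM_PATTERN.findall(filename)
    return matches[-1] if matches else None


# Pipeline output: the input to step two already carried a "(Vocals)" label.
pipeline_file = "mix_(Vocals)_model_a_(No Reverb)_model_b.flac"
first = STEM_PATTERN.search(pipeline_file).group(1)
last = last_stem_label(pipeline_file)
```

Taking the first match would report the stem from the earlier pipeline step, so the second-stage output would be compared against the wrong reference.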

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@beveradb beveradb merged commit adc5539 into main Mar 16, 2026
21 of 27 checks passed
@beveradb commented

Would love to get y'all's opinions/input on the ensemble presets, which are now live in the latest audio-separator, if you're interested @Politrees @Eddycrack864 😄
