# Vocal Separation Model Training (BS-RoFormer)

This notebook trains a state-of-the-art vocal separation model using the BS-RoFormer architecture.

**Features:**
- Band-Split RoPE Transformer for frequency-domain processing
- State-of-the-art SDR (9.8+ dB on MUSDB18)
- Can be fine-tuned for lead/backing vocal separation

**Dataset:** MUSDB18-HQ (150 tracks, about 10 hours; 100 for training, 50 for testing)

**Estimated time:** 24-48 hours on A100, 3-5 days on T4

In [None]:
# Check GPU
!nvidia-smi

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Install dependencies
!pip install -q torch torchaudio einops rotary_embedding_torch wandb musdb museval soundfile
!pip install -q git+https://github.com/facebookresearch/demucs.git

In [None]:
# Clone MVSEP training framework
!git clone https://github.com/ZFTurbo/Music-Source-Separation-Training.git /content/mss_training
%cd /content/mss_training

In [None]:
# Download MUSDB18-HQ dataset
import os
import musdb

MUSDB_PATH = '/content/musdb18hq'

if not os.path.exists(MUSDB_PATH):
    # Caution: musdb's download=True fetches the 7-second preview excerpts of
    # MUSDB18, not the full MUSDB18-HQ wav dataset. For full-length training,
    # place the MUSDB18-HQ train/ and test/ folders in MUSDB_PATH beforehand.
    print("MUSDB18-HQ not found, downloading via musdb...")
    mus = musdb.DB(root=MUSDB_PATH, download=True, is_wav=True)
    print(f"Downloaded {len(mus)} tracks")
else:
    print("MUSDB18-HQ already downloaded")
    mus = musdb.DB(root=MUSDB_PATH, is_wav=True)
    print(f"Found {len(mus)} tracks")

In [None]:
import torch
import yaml

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Create config for BS-RoFormer vocal separation
config = {
    'audio': {
        'chunk_size': 131072,  # ~3 seconds at 44.1kHz
        'sample_rate': 44100,
        'num_channels': 2,
        'min_mean_abs': 0.001
    },
    'model': {
        'type': 'bs_roformer',
        'dim': 384,
        'depth': 12,
        'stereo': True,
        'num_stems': 1,  # Vocals only (other = residual)
        'time_transformer_depth': 1,
        'freq_transformer_depth': 1,
        'num_bands': 60,
        'dim_head': 64,
        'heads': 8,
        'attn_dropout': 0.1,
        'ff_dropout': 0.1,
        'flash_attn': True,
        'stft_n_fft': 2048,
        'stft_hop_length': 512,
    },
    'training': {
        'batch_size': 4,
        'gradient_accumulation_steps': 4,
        'num_epochs': 100,
        'num_steps': 1000,
        'lr': 5e-5,
        'instruments': ['vocals', 'other'],
        'target_instrument': 'vocals',
        'use_amp': True,
        'optimizer': 'adamw',
    },
    'augmentations': {
        'enable': True,
        'loudness': True,
        'loudness_min': 0.5,
        'loudness_max': 1.5,
        'mixup': True,
        'mixup_alpha': 0.4,
    }
}

# Save config
config_path = '/content/mss_training/configs/config_vocals_bsroformer.yaml'
with open(config_path, 'w') as f:
    yaml.dump(config, f, default_flow_style=False)

print("Config saved to:", config_path)
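
As a quick sanity check on the config values, the quantities they imply can be computed directly. The sketch below is only an illustration: the helper name `derived_settings` is hypothetical, the frame count assumes a centered STFT (the `torch.stft` default), and the effective batch size assumes the training loop honors `gradient_accumulation_steps`.

```python
def derived_settings(chunk_size, sample_rate, n_fft, hop_length,
                     batch_size, grad_accum_steps):
    """Derive human-readable quantities from the raw config numbers."""
    return {
        # Length of one training chunk in seconds
        'chunk_seconds': chunk_size / sample_rate,
        # Number of frequency bins in a one-sided STFT
        'freq_bins': n_fft // 2 + 1,
        # Number of STFT frames per chunk, assuming center=True padding
        'stft_frames': chunk_size // hop_length + 1,
        # Samples contributing to each optimizer step
        'effective_batch_size': batch_size * grad_accum_steps,
    }

settings = derived_settings(
    chunk_size=131072, sample_rate=44100,
    n_fft=2048, hop_length=512,
    batch_size=4, grad_accum_steps=4,
)
for name, value in settings.items():
    print(f"{name}: {value}")
```

With the values used in this notebook, each chunk is just under 3 seconds long and each optimizer step sees 16 chunks.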

In [None]:
# Prepare dataset in expected format
import soundfile as sf
from pathlib import Path

TRAIN_DIR = Path('/content/musdb_training')
TRAIN_DIR.mkdir(parents=True, exist_ok=True)

print("Organizing dataset for training...")

mus = musdb.DB(root=MUSDB_PATH, is_wav=True, subsets='train')

for track in mus:
    track_dir = TRAIN_DIR / track.name
    track_dir.mkdir(exist_ok=True)
    
    # Save stems
    # Vocals
    sf.write(str(track_dir / 'vocals.wav'), track.targets['vocals'].audio, track.rate)
    
    # Other (everything except vocals)
    other = track.targets['drums'].audio + track.targets['bass'].audio + track.targets['other'].audio
    sf.write(str(track_dir / 'other.wav'), other, track.rate)
    
    # Mixture
    sf.write(str(track_dir / 'mixture.wav'), track.audio, track.rate)

print(f"Prepared {len(mus)} tracks for training")
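
Before starting a long training run, it can help to confirm that every track folder contains the three files written by the preparation step. The following is a minimal sketch of such a check. `find_incomplete_tracks` and `REQUIRED_FILES` are hypothetical names introduced here, not part of the training framework.

```python
from pathlib import Path

# File names written by the preparation step for each track
REQUIRED_FILES = ('vocals.wav', 'other.wav', 'mixture.wav')

def find_incomplete_tracks(train_dir, required=REQUIRED_FILES):
    """Return {track_name: [missing file names]} for incomplete track folders."""
    incomplete = {}
    for track_dir in sorted(Path(train_dir).iterdir()):
        if not track_dir.is_dir():
            continue  # Ignore stray files at the top level
        missing = [name for name in required
                   if not (track_dir / name).is_file()]
        if missing:
            incomplete[track_dir.name] = missing
    return incomplete

# Only run the check if the prepared dataset folder is present
dataset_dir = Path('/content/musdb_training')
if dataset_dir.exists():
    problems = find_incomplete_tracks(dataset_dir)
    print(f"{len(problems)} incomplete track folders")
```

An empty result means every folder has all three files. It does not verify that the audio itself is valid.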

In [None]:
# Start training
RESULTS_DIR = '/content/drive/MyDrive/vocal_model_results'
!mkdir -p {RESULTS_DIR}

!python train.py \
    --model_type bs_roformer \
    --config_path {config_path} \
    --data_path {TRAIN_DIR} \
    --results_path {RESULTS_DIR} \
    --dataset_type 1 \
    --device_ids 0 \
    --num_workers 0 \
    --pin_memory

In [None]:
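# Run the trained model on the test set
print("Running model on MUSDB18 test set...")

# Pick the most recently written checkpoint (the latest, not necessarily the best-scoring)
checkpoints = list(Path(RESULTS_DIR).glob('*.ckpt'))
if not checkpoints:
    raise FileNotFoundError(f"No .ckpt files found in {RESULTS_DIR}; run training first")
latest_ckpt = max(checkpoints, key=lambda p: p.stat().st_mtime)
print(f"Using checkpoint: {latest_ckpt}")

# Separate the test mixtures; results are written to /content/eval_output
# If the script rejects a flag, check `python inference.py --help` for your checkout
!python inference.py \
    --model_type bs_roformer \
    --config_path {config_path} \
    --checkpoint {latest_ckpt} \
    --input_folder {MUSDB_PATH}/test \
    --output_folder /content/eval_output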
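
The cell above runs `inference.py`, which writes separated audio to the output folder; no SDR value is computed in the notebook itself. As a rough illustration of the metric quoted here, the sketch below computes a plain signal-to-distortion ratio, 10·log10(‖s‖² / ‖s − ŝ‖²), over flat sequences of samples. This is the simple textbook definition, not the BSSEval procedure that `museval` implements, so its numbers are not directly comparable to published MUSDB18 scores. `simple_sdr` is a hypothetical helper name.

```python
import math

def simple_sdr(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB for two equal-length sample sequences."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps keeps the ratio finite for silent references or perfect estimates
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))

# Toy example: an estimate whose amplitude is 10% too low
reference = [math.sin(0.01 * n) for n in range(1000)]
estimate = [0.9 * r for r in reference]
print(f"SDR: {simple_sdr(reference, estimate):.1f} dB")
```

In the toy example the error has 1% of the signal's energy, which corresponds to 20 dB. For real audio, a vectorized NumPy version of the same formula would be much faster than this pure-Python loop.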

In [None]:
# List saved models
print("Saved models:")
!ls -la {RESULTS_DIR}/*.ckpt 2>/dev/null || echo "No checkpoints yet"

## Training Complete!

Your trained vocal separation model is saved to the `vocal_model_results/` folder in your Google Drive (`/content/drive/MyDrive/vocal_model_results` in Colab).

**Expected Results:**
- SDR (vocals): 9.5-10.0 dB on MUSDB18 test set
- About 1.0-1.5 dB higher than default Demucs (~8.5 dB)

**Next Steps:**
1. Copy model to StemScribe backend
2. Update `enhanced_separator.py` to use it
3. For lead/backing separation, fine-tune on songs with known stereo panning