Author: Moisés Horta Valenzuela, hexorcismos
Date: May 2026
A generative model that synthesises audio in CoDiCodec's continuous latent space using Conditional Flow Matching (CFM) on a block-causal DiT architecture.
The model targets musical continuation / improvising accompaniment: given a short audio prompt, it generates an arbitrarily long continuation in a chunk-causal, streaming fashion on the codec's ~11.7 Hz, 64-channel latent sequence.
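As a rough illustration of the Conditional Flow Matching objective mentioned above, the sketch below builds one training pair for a single 64-channel latent frame. It assumes the common linear path between Gaussian noise and a data latent, with the constant velocity as regression target. The path and loss actually used in flow/model/cfm.py may differ, and the names cfm_training_pair and mse are hypothetical.

```python
import random


def cfm_training_pair(x1, t, rng):
    """Build (x_t, target_velocity) for one latent frame.

    Assumption: linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    so the regression target is the constant velocity x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]            # noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]          # d x_t / d t
    return x_t, target


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(64)]         # stand-in for one latent frame
t = rng.random()                                      # t ~ U(0, 1)
x_t, target = cfm_training_pair(x1, t, rng)

# A perfect velocity prediction carries x_t to the data point in the remaining time.
x_end = [a + (1.0 - t) * v for a, v in zip(x_t, target)]
```

In training, a network would predict the velocity from x_t, t, and the preceding context, and mse(prediction, target) would be minimised.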
- CoDiCodec (Pasini et al., 2025) encodes 48 kHz stereo audio to summary embeddings at ~11.7 Hz with 64 channels (128x compression) and exposes a streaming decode_next() API. The continuous latents, after the codec's atanh / sigma_rescale=0.8 transform, are approximately unit-Gaussian, which makes them a direct fit for flow matching.
- Block-causal Flow Matching DiT is the simplest architecture that:
- respects the codec's chunk structure,
- supports KV-caching for efficient streaming inference,
- supports classifier-free guidance at no extra cost, through conditioning dropout during training.
- The whole pipeline is MPS-friendly so it can be trained and run in real-time on a 36 GB Apple Silicon laptop.
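The block-causal attention pattern described above can be sketched as a boolean mask. This is a minimal illustration, assuming that attention is bidirectional within a chunk, causal across chunks, and that each chunk holds 8 latent tokens. The function name block_causal_mask is hypothetical and is not taken from flow/utils.py.

```python
def block_causal_mask(n_tokens, block_size=8):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption: tokens attend freely inside their own chunk and to every
    earlier chunk, but never to a later chunk.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


mask = block_causal_mask(n_tokens=24, block_size=8)

# Token 0 sees the whole of chunk 0 but nothing from chunk 1.
print(mask[0][7], mask[0][8])
# Token 8 (first token of chunk 1) sees chunks 0 and 1 but not chunk 2.
print(mask[8][0], mask[8][15], mask[8][16])
```

Because earlier chunks never attend to later ones, their keys and values do not change as new chunks are generated, which is what makes KV-caching possible during streaming inference.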
Try CoDiCodec-Flow directly in your browser using the Google Colab notebook in the colab/ directory. The notebook provides a step-by-step guide for:
- Cloning the repository and setting up the environment
- Preprocessing audio data to latents
- Training a model on your data
- Generating audio continuations
Click the badge at the top of this README to open the notebook in Colab.
codicodec-flow/
    codicodec/                  upstream codec package (do not modify)
    flow/                       this project
        __init__.py
        codec_wrapper.py        MPS-safe wrapper around codicodec.EncoderDecoder
        config.py               dataclass-based config
        utils.py                device, masks, logging
        data/
            preencode.py        audio dir -> per-file .pt latent shards
            latent_dataset.py
        model/
            dit.py              block-causal flow-matching DiT
            cfm.py              CFM loss + Euler/Heun samplers
            ema.py
        train.py                training loop
        sample.py               offline sampling
        smoke_test.py           end-to-end sanity check
    requirements.txt
    README.md                   (this file)
# Create conda environment
conda create -n codicodec-flow python=3.10
conda activate codicodec-flow
# Install dependencies
pip install -r requirements.txt
# Install the upstream CoDiCodec package
pip install -e ./codicodec
# Verify codec works on your machine (downloads checkpoint on first run)
python -m flow.smoke_test --device mps # Use 'cuda' for NVIDIA GPUs
codicodec-flow provides a user-friendly CLI wrapper that simplifies training, preprocessing, and generation without requiring python -m flow... commands.
# Preprocess audio data
python cli.py preprocess --in-dir ~/music/training --out-dir ./data/latents --device mps
# Train a model (TUI monitoring enabled by default)
python cli.py train --data-dir ./data/latents --out-dir ./runs/v0 --device mps
# Generate audio
python cli.py sample --ckpt ./runs/v0/ema.pt --prompt-wav ./prompt.wav --out ./out.wav --device mps
Before training, you need to convert your audio files into latent shards using the CoDiCodec encoder.
python cli.py preprocess \
--in-dir /path/to/your/audio \
--out-dir ./data/latents \
--device mps \
--max-seconds 60
Arguments:
- --in-dir: Directory containing audio files (WAV, MP3, FLAC, etc.)
- --out-dir: Output directory for latent shards (.pt files)
- --device: mps for Apple Silicon, cuda for NVIDIA GPUs, cpu as fallback
- --max-seconds: Maximum duration per file (default: 300s). Longer files are split.
Output:
- Each audio file produces a .pt file containing the encoded latent representation
- Latents are stored as [T, 8, 64] tensors (T = number of 0.683s chunks)
- Files are stored with metadata for the dataset loader
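The figures quoted for the latent format can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in this README (48 kHz stereo, 0.683 s chunks, [T, 8, 64] shards). The helper shard_shape is hypothetical, and it assumes that a trailing partial chunk is dropped, whereas the real encoder may pad instead.

```python
SAMPLE_RATE = 48_000      # Hz, from the text
AUDIO_CHANNELS = 2        # stereo
CHUNK_SECONDS = 0.683     # duration of one chunk, from the text
TOKENS_PER_CHUNK = 8      # second dimension of [T, 8, 64]
LATENT_CHANNELS = 64      # third dimension of [T, 8, 64]

token_rate = TOKENS_PER_CHUNK / CHUNK_SECONDS          # latent frames per second
audio_values_per_s = SAMPLE_RATE * AUDIO_CHANNELS
latent_values_per_s = token_rate * LATENT_CHANNELS
compression = audio_values_per_s / latent_values_per_s


def shard_shape(duration_s):
    """Shape of the latent tensor for a clip, counting only whole chunks.

    Assumption: a trailing partial chunk is dropped.
    """
    n_chunks = int(duration_s // CHUNK_SECONDS)
    return (n_chunks, TOKENS_PER_CHUNK, LATENT_CHANNELS)


print(f"{token_rate:.1f} Hz, {compression:.0f}x compression")
print(shard_shape(60.0))
```

The computed rate and ratio agree with the ~11.7 Hz and 128x figures given for CoDiCodec earlier in this README.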
Tips:
- Use diverse audio for better generalization (different styles, instruments, tempos)
- 48 kHz stereo audio is recommended (CoDiCodec's native rate)
- Aim for several hours of audio for reasonable training
- Train for at least 100K steps for meaningful results; the v3_okachihuali model was trained for 6,860,000 steps
Train a block-causal Flow Matching DiT model on the preprocessed latents.
python cli.py train \
--data-dir ./data/latents \
--out-dir ./runs/v0 \
--device mps \
--batch-size 4 \
--grad-accum 2 \
--crop-tokens 512 \
--max-steps 200000
Key Arguments:
- --data-dir: Directory containing preprocessed latent shards
- --out-dir: Output directory for checkpoints and logs
- --device: mps, cuda, or cpu
- --batch-size: Batch size per GPU (default: 8, use 4 on MPS)
- --grad-accum: Gradient accumulation steps (effective batch = batch_size × grad_accum)
- --crop-tokens: Random crop length in tokens (default: 768, must be multiple of 8)
- --max-steps: Total training steps (default: 200000)
- --dtype: bf16 for bfloat16 (faster, less memory) or fp32 for float32
- --lr: Learning rate (default: 1e-4 with cosine decay)
- --ema-decay: EMA decay rate (default: 0.9999)
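To get a feel for how the batch and crop arguments interact, the following sketch converts them into an effective batch size and an amount of audio per optimiser step. It relies on the 8-tokens-per-0.683 s chunk figure from the preprocessing section. The function training_budget is a hypothetical helper, not part of cli.py, and it assumes one optimiser step consumes batch_size × grad_accum crops.

```python
def training_budget(batch_size, grad_accum, crop_tokens,
                    tokens_per_chunk=8, chunk_seconds=0.683):
    """Summarise what one optimiser step sees, under the figures quoted above."""
    if crop_tokens % tokens_per_chunk != 0:
        raise ValueError("crop_tokens must be a multiple of tokens_per_chunk")
    effective_batch = batch_size * grad_accum
    crop_seconds = crop_tokens // tokens_per_chunk * chunk_seconds
    return {
        "effective_batch": effective_batch,
        "crop_seconds": crop_seconds,
        "audio_seconds_per_step": effective_batch * crop_seconds,
    }


# Settings from the example command: --batch-size 4 --grad-accum 2 --crop-tokens 512
budget = training_budget(batch_size=4, grad_accum=2, crop_tokens=512)
print(budget)
```

With these settings each crop covers roughly 44 seconds of audio, so lowering --crop-tokens is one way to reduce memory use at the cost of shorter training context.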
Model Size Configuration:
Default (~97M params, recommended for 36GB+ RAM):
python cli.py train --data-dir ./data/latents --out-dir ./runs/v0 \
--device mps --batch-size 4 --grad-accum 2 --crop-tokens 512 \
--dtype bf16 --max-steps 200000
Smaller (~20M params, faster iteration):
python cli.py train --data-dir ./data/latents --out-dir ./runs/v0 \
--device mps --batch-size 8 --grad-accum 2 --crop-tokens 512 \
--dtype bf16 --max-steps 200000 \
--dim 384 --n-layers 8 --n-heads 6 --cond-dim 384
Training Details:
- Checkpoints are saved every 50 steps: last.pt (latest) and ema.pt (EMA copy)
- Periodic audio samples are generated during training (unconditional by default)
- Use --audio-continuation to enable continuation sampling during training
- Use --audio-sample-every N to control sampling frequency (0 to disable)
- Logs include loss, learning rate, and sample metrics
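The EMA copy saved as ema.pt is an exponential moving average of the training weights. A minimal sketch of the standard update rule follows, using a plain dictionary of floats in place of real model parameters. The function ema_update is hypothetical, and flow/model/ema.py may add details such as warm-up that are not shown here.

```python
def ema_update(ema_params, params, decay=0.9999):
    """In-place EMA step: ema <- decay * ema + (1 - decay) * param."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value


# Toy example: a single scalar "weight" that has jumped from 0 to 1.
params = {"w": 1.0}
ema = {"w": 0.0}
for _ in range(10_000):
    ema_update(ema, params)

# After 1 / (1 - decay) = 10,000 steps the EMA has covered about 63% of the gap.
print(round(ema["w"], 3))
```

With the default decay of 0.9999 the average therefore looks back over roughly the last 10,000 steps, which is why ema.pt changes slowly compared with last.pt.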
Generate audio continuations using a trained checkpoint.
python cli.py sample \
--ckpt ./runs/v0/ema.pt \
--prompt-wav ./prompt.wav \
--duration-s 20 \
--nfe 8 \
--solver heun \
--out ./out.wav \
--device mps
Arguments:
- --ckpt: Path to checkpoint (use ema.pt for best quality, last.pt for latest)
- --prompt-wav: Audio prompt file (WAV, 48 kHz stereo recommended)
- --duration-s: Duration of continuation in seconds (default: 20)
- --nfe: Number of function evaluations (sampling steps, default: 8)
- --solver: ODE solver: euler (faster) or heun (better quality)
- --out: Output audio file path
- --device: mps, cuda, or cpu
- --temperature: Sampling temperature (default: 1.0, higher = more diverse)
- --n-steps: Number of diffusion steps (default: 32)
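The relationship between --duration-s, --nfe, and the amount of work done can be estimated as below. This sketch assumes the model is sampled chunk by chunk, with nfe velocity evaluations spent on each 0.683 s chunk. The helper sampling_plan is hypothetical, and the real sampler may schedule its evaluations differently.

```python
import math


def sampling_plan(duration_s, nfe, chunk_seconds=0.683, tokens_per_chunk=8):
    """Estimate the work needed for a continuation of duration_s seconds.

    Assumption: every chunk costs nfe model evaluations.
    """
    n_chunks = math.ceil(duration_s / chunk_seconds)
    return {
        "chunks": n_chunks,
        "tokens": n_chunks * tokens_per_chunk,
        "model_calls": n_chunks * nfe,
        # Time budget per model call for generation to keep up with playback.
        "realtime_budget_ms": 1000.0 * chunk_seconds / nfe,
    }


# Settings from the example command: --duration-s 20 --nfe 8
plan = sampling_plan(duration_s=20, nfe=8)
print(plan)
```

Under these assumptions, real-time streaming at --nfe 8 requires each model evaluation to finish in well under 100 ms on the target device.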
python cli.py sample \
--ckpt ./runs/v0/ema.pt \
--duration-s 20 \
--nfe 8 \
--solver heun \
--out ./out_uncond.wav \
--device mps
Omit --prompt-wav for unconditional generation (no prompt context).
Higher quality with more sampling steps:
python cli.py sample --ckpt ./runs/v0/ema.pt --prompt-wav ./prompt.wav --duration-s 30 --nfe 16 --solver heun --out ./out_high_quality.wav --device mps
Faster generation with fewer steps:
python cli.py sample --ckpt ./runs/v0/ema.pt --prompt-wav ./prompt.wav --duration-s 20 --nfe 4 --solver euler --out ./out_fast.wav --device mps
Adjust temperature for diversity:
python cli.py sample --ckpt ./runs/v0/ema.pt --prompt-wav ./prompt.wav --duration-s 20 --nfe 8 --solver heun --temperature 1.5 --out ./out_diverse.wav --device mps
Sampling Trade-offs:
- NFE (steps): More steps = better quality but slower. 4-8 is real-time, 16+ is high quality.
- Solver: Heun is more accurate than Euler but ~2x slower.
- Temperature: Higher values increase diversity but may reduce coherence.
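The Euler versus Heun trade-off can be seen on a toy ODE with a known solution. The sketch below is a generic fixed-step integrator, not the sampler in flow/model/cfm.py. The function integrate and the toy velocity field are illustrative assumptions, and a learned velocity network would take the place of field in practice.

```python
import math


def integrate(velocity, x0, n_steps, solver="euler"):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1; return (x1, n_evals)."""
    x, dt, n_evals = x0, 1.0 / n_steps, 0
    for i in range(n_steps):
        t = i * dt
        v = velocity(x, t)
        n_evals += 1
        if solver == "heun":
            # Predictor-corrector: average the slope at both ends of the step.
            v_next = velocity(x + dt * v, t + dt)
            n_evals += 1
            v = 0.5 * (v + v_next)
        x = x + dt * v
    return x, n_evals


# Toy field with a known solution: dx/dt = x, x(0) = 1, so x(1) = e.
field = lambda x, t: x
x_euler, evals_euler = integrate(field, 1.0, n_steps=8, solver="euler")
x_heun, evals_heun = integrate(field, 1.0, n_steps=4, solver="heun")

print(evals_euler, abs(x_euler - math.e))
print(evals_heun, abs(x_heun - math.e))
```

Each Heun step costs two evaluations, so for the same number of steps it is about twice as slow. On this toy problem it is still more accurate than Euler when both are given the same budget of 8 evaluations.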
Sample audio generated by codicodec-flow is available in the examples/ directory, demonstrating the progression of the v3_okachihuali model during training:
- okachihuali_v3_step_000000.wav - Generated at 0 training steps (initialization)
- okachihuali_v3_step_100000.wav - Generated at 100,000 training steps
- okachihuali_v3_step_200000.wav - Generated at 200,000 training steps
- okachihuali_v3_step_300000.wav - Generated at 300,000 training steps
- okachihuali_v3_step_400000.wav - Generated at 400,000 training steps
- okachihuali_v3_step_500000.wav - Generated at 500,000 training steps
- okachihuali_v3_step_600000.wav - Generated at 600,000 training steps
The v3_okachihuali model was trained for approximately 700,000 steps on the Okachihuali dataset - a 60-track album by hexorcismos available at https://hexorcismos.bandcamp.com/album/--2. This dataset provides a diverse collection of musical material for training the generative model.
These examples were produced by unconditional generation and demonstrate the model's ability to generate coherent musical material.
CoDiCodec-Flow Architecture
- Moisés Horta Valenzuela, 2026
CoDiCodec
- The upstream CoDiCodec encoder/decoder is released by Sony CSL Paris under CC BY-NC 4.0
- Paper: Pasini et al., 2025 - CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio
- Original repository: https://github.com/sony/codicodec
License
- This repository is licensed under CC BY-NC 4.0
- Code under codicodec/ is released under CC BY-NC 4.0 by Sony CSL Paris
- The flow/ code is under the same license unless stated otherwise