MLX port of Mistral's Voxtral-4B-TTS-2603 text-to-speech model for on-device inference on Apple Silicon.
Converts the HuggingFace model (~8GB) into MLX format with optional quantization (Q2–Q8) for efficient local generation.
macOS — Original Model (fp16)
Text: "Kimi Antonelli took his second pole position in a row as he beat Mercedes team-mate George Russell in qualifying at the Japanese Grand Prix..."
https://github.com/lbj96347/Mistral-TTS-iOS/raw/main/audio_sample/voxtral_output.mp3
macOS — Q4 Model
Same text as above
https://github.com/lbj96347/Mistral-TTS-iOS/raw/main/audio_sample/voxtral_output_q4.mp3
iPhone 15 Pro — Q2 Model
Text: "Good morning! Nice to see you again!"
https://github.com/lbj96347/Mistral-TTS-iOS/raw/main/audio_sample/voxtral_output_iPhone.wav
Three-stage pipeline:
Text → LLM Decoder → Flow-Matching Transformer → Codec → 24kHz WAV
| Component | Params | Description |
|---|---|---|
| Transformer Decoder | 3.4B | Mistral/Ministral-3B based LLM; text tokens → hidden states |
| Acoustic Transformer | 390M | Flow-matching with Euler ODE + CFG; hidden states → audio codes |
| Voxtral Codec | 300M | Conv-transformer autoencoder; codes → 24kHz waveform |
Each audio frame consists of 37 discrete tokens (1 semantic + 36 acoustic) at 12.5 Hz frame rate.
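As a sanity check on this framing, the arithmetic works out as follows (constants taken from the numbers above; the helper names are illustrative, not part of the package API):

```python
# Back-of-envelope token/sample math for the codec framing described above:
# 37 discrete tokens per frame at 12.5 Hz, decoded to 24 kHz audio.

FRAME_RATE_HZ = 12.5      # audio frames per second
TOKENS_PER_FRAME = 37     # 1 semantic + 36 acoustic
SAMPLE_RATE_HZ = 24_000   # codec output sample rate

def frames_for(seconds: float) -> int:
    """Number of codec frames covering `seconds` of audio."""
    return int(seconds * FRAME_RATE_HZ)

def tokens_for(seconds: float) -> int:
    """Total discrete tokens the model must generate for `seconds` of audio."""
    return frames_for(seconds) * TOKENS_PER_FRAME

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

print(frames_for(10.0))   # 125 frames for 10 s of speech
print(tokens_for(10.0))   # 4625 tokens
print(samples_per_frame)  # 1920.0 waveform samples per frame
```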
Requires Python 3.10+ and Apple Silicon (M1/M2/M3/M4).
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

python3 -m voxtral_tts.convert --inspect

Use the build script to convert and prepare models with voice embeddings in one step:
# Build all three variants (fp16, q4, q2)
./scripts/build_models.sh
# Build specific variants
./scripts/build_models.sh q4 # Uniform Q4 only
./scripts/build_models.sh q4 q2 # Q4 and mixed Q4+Q2
./scripts/build_models.sh fp16 # Full precision only
# Use a local HF model directory (skip download)
./scripts/build_models.sh q4 --local-dir /path/to/Voxtral-4B-TTS-2603
# Preview commands without running
./scripts/build_models.sh --dry-run

This produces:
| Directory | Quantization | Size | Use case |
|---|---|---|---|
| `mlx_model/` | None (fp16) | ~8 GB | Development / best quality |
| `mlx_model_q4/` | Uniform Q4 | ~2.1 GB | Mac (8GB+ RAM) |
| `mlx_model_q2/` | Q4 LLM/acoustic + Q2 codec | ~1.6 GB | iPhone / iPad |
# Full precision
python3 -m voxtral_tts.convert --output-dir mlx_model
# With uniform quantization (q2, q4, q6, q8)
python3 -m voxtral_tts.convert --output-dir mlx_model_q4 --quantize q4
# Mixed quantization for iOS (Q4 LLM/acoustic, Q2 codec for size savings)
python3 -m voxtral_tts.convert --output-dir mlx_model_q2 \
--quantize-llm q4 --quantize-acoustic q4 --quantize-codec q2
# Convert voice embeddings (.pt → .safetensors) for iOS compatibility
python3 -m voxtral_tts.convert_voices mlx_model_q4

| Target Device | RAM | Recommended Command | Estimated Size |
|---|---|---|---|
| Mac (M1/M2/M3/M4) | 16GB+ | `--quantize q4` | ~2.1 GB |
| Mac (8GB) | 8GB | `--quantize q4` | ~2.1 GB |
| iPhone 15 Pro / iPad Pro | 8GB | `--quantize-llm q4 --quantize-acoustic q4 --quantize-codec q2` | ~1.6 GB |
| iPhone 16 Pro | 8GB | `--quantize-llm q4 --quantize-acoustic q4 --quantize-codec q2` | ~1.6 GB |
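The uniform-Q4 estimate can be roughly cross-checked with back-of-envelope arithmetic (a sketch that ignores per-group quantization scales and any layers left unquantized, so it slightly undershoots the on-disk size):

```python
# Rough size estimate for uniform Q4: parameter count * 4 bits.
# Parameter counts are taken from the component table above.

params = {"llm": 3.4e9, "acoustic": 390e6, "codec": 300e6}
bits = 4

total_bytes = sum(params.values()) * bits / 8
print(f"{total_bytes / 1e9:.1f} GB")  # ≈ 2.0 GB, in line with the ~2.1 GB reported
```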
Mixed quantization applies different bit widths per component:
- `--quantize-llm` — Language model (3.4B params, minimum Q4 enforced)
- `--quantize-acoustic` — Acoustic transformer (390M params, minimum Q4 enforced)
- `--quantize-codec` — Codec (300M params, tolerates Q2 since it is not in the autoregressive loop)
Note: The LLM and acoustic transformer require Q4 minimum for intelligible speech. Values below Q4 are automatically clamped with a warning.
Per-component flags override --quantize when both are specified.
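The clamping behavior can be sketched as follows (a hypothetical helper for illustration, not the converter's actual code; the Q4/Q2 floors come from the note above):

```python
import warnings

# Hypothetical sketch of the per-component clamping rule described above:
# the LLM and acoustic transformer are floored at Q4; the codec may go to Q2.
MIN_BITS = {"llm": 4, "acoustic": 4, "codec": 2}

def clamp_bits(component: str, requested_bits: int) -> int:
    """Return the effective bit width, warning when the request is clamped."""
    floor = MIN_BITS[component]
    if requested_bits < floor:
        warnings.warn(
            f"{component}: q{requested_bits} is below the q{floor} minimum; "
            f"clamping to q{floor} to keep speech intelligible."
        )
        return floor
    return requested_bits

print(clamp_bits("llm", 2))    # 4 (clamped, with a warning)
print(clamp_bits("codec", 2))  # 2 (allowed)
```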
# Construction test (no weights needed)
python3 -m voxtral_tts.test_model --test construction
# Weight loading test
python3 -m voxtral_tts.test_model --model-path mlx_model --test loading
# Weight analysis
python3 -m voxtral_tts.test_model --model-path mlx_model --test weights

The `VoxtralTTS/` directory contains a SwiftUI iOS app that runs the model on-device using MLX-Swift.
The iOS app includes several optimizations to fit within iOS jetsam memory limits:
- Quantized embeddings — Both `Linear` and `Embedding` layers (including the 131K-vocab token embedding table) are properly loaded as quantized modules, saving ~800MB compared to unquantized embedding loading.
- Quantized output projection — The tied embedding output uses `quantizedMM` instead of dequantizing the full weight matrix.
- GPU cache limit — The MLX buffer cache is capped at 20MB on iOS (per MLX recommendations) to prevent unbounded memory growth during autoregressive generation.
- Periodic cache clearing — `Memory.clearCache()` is called every 50 frames during generation.
- Mixed quantization — The codec (not in the autoregressive loop) can use Q2 while the LLM and acoustic transformer stay at Q4 minimum.
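The periodic cache-clearing pattern looks roughly like this (a Python sketch with a stubbed `clear_cache` callback; the actual app does this in Swift against the MLX buffer cache, and only the 50-frame interval comes from the list above):

```python
# Sketch of periodic cache clearing during autoregressive generation.
# `clear_cache` is a stand-in callback; frame decoding is stubbed out.

CLEAR_INTERVAL = 50  # clear every 50 generated frames

def generate_frames(num_frames: int, clear_cache) -> list[int]:
    frames = []
    for i in range(1, num_frames + 1):
        frames.append(i)          # stand-in for decoding one audio frame
        if i % CLEAR_INTERVAL == 0:
            clear_cache()         # bound memory growth between frames
    return frames

calls = []
generate_frames(125, lambda: calls.append(1))
print(len(calls))  # 2 cache clears for 125 frames (after frames 50 and 100)
```

Bounding the cache matters most in the autoregressive loop, where thousands of small intermediate buffers would otherwise accumulate.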
- Build the iOS-optimized model: `./scripts/build_models.sh q2`
- Copy `mlx_model_q2/` to your device
- Open the Xcode project and build for your device (the simulator is not supported — MLX requires a Metal GPU)
- Select the model directory in the app and generate
scripts/
└── build_models.sh # Build all model variants (fp16, q4, q2)
voxtral_tts/
├── voxtral_tts.py # Main model: generate() + load()
├── transformer_decoder.py # LLM decoder (Mistral-based)
├── acoustic_transformer.py # Flow-matching acoustic model
├── codec.py # Audio codec (codes → waveform)
├── config.py # Model configuration dataclasses
├── convert.py # HF → MLX weight converter
├── convert_voices.py # .pt → .safetensors voice converter
└── test_model.py # Construction/loading/weight tests
- MLX / mlx-lm — Apple ML framework
- mistral-common — Tekken tokenizer + voice embeddings
- safetensors — Weight file format
- `torch` (dev only) — For `.pt` voice embedding conversion
Apache-2.0

