MLX port of Mistral's Voxtral-4B-TTS-2603 text-to-speech model for on-device inference on Apple Silicon.
Converts the HuggingFace model (~8GB) into MLX format with optional quantization (Q2–Q8) for efficient local generation.
macOS — Original Model (fp16)
Text: "Kimi Antonelli took his second pole position in a row as he beat Mercedes team-mate George Russell in qualifying at the Japanese Grand Prix..."
https://github.com/lbj96347/Mistral-TTS-iOS/raw/main/audio_sample/voxtral_output.mp3
macOS — Q4 Model
Same text as above
https://github.com/lbj96347/Mistral-TTS-iOS/raw/main/audio_sample/voxtral_output_q4.mp3
iPhone 15 Pro — Q2 Model
Text: "Good morning! Nice to see you again!"
https://github.com/lbj96347/Mistral-TTS-iOS/raw/main/audio_sample/voxtral_output_iPhone.wav
Three-stage pipeline:
Text → LLM Decoder → Flow-Matching Transformer → Codec → 24kHz WAV
| Component | Params | Description |
|---|---|---|
| Transformer Decoder | 3.4B | Mistral/Ministral-3B based LLM; text tokens → hidden states |
| Acoustic Transformer | 390M | Flow-matching with Euler ODE + CFG; hidden states → audio codes |
| Voxtral Codec | 300M | Conv-transformer autoencoder; codes → 24kHz waveform |
Each audio frame consists of 37 discrete tokens (1 semantic + 36 acoustic) at 12.5 Hz frame rate.
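As a sanity check on this framing, the arithmetic works out as follows (constants taken from the numbers above; the helper names are illustrative, not part of the package API):

```python
# Back-of-envelope token/sample math for the codec framing described above:
# 37 discrete tokens per frame at 12.5 Hz, decoded to 24 kHz audio.

FRAME_RATE_HZ = 12.5      # audio frames per second
TOKENS_PER_FRAME = 37     # 1 semantic + 36 acoustic
SAMPLE_RATE_HZ = 24_000   # codec output sample rate

def frames_for(seconds: float) -> int:
    """Number of codec frames covering `seconds` of audio."""
    return int(seconds * FRAME_RATE_HZ)

def tokens_for(seconds: float) -> int:
    """Total discrete tokens the model must generate for `seconds` of audio."""
    return frames_for(seconds) * TOKENS_PER_FRAME

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

print(frames_for(10.0))   # 125 frames for 10 s of speech
print(tokens_for(10.0))   # 4625 tokens
print(samples_per_frame)  # 1920.0 waveform samples per frame
```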
Requires Python 3.10+ and Apple Silicon (M1/M2/M3/M4).
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

python3 -m voxtral_tts.convert --inspect

Use the build script to convert and prepare models with voice embeddings in one step:
# Build all three variants (fp16, q4, q2)
./scripts/build_models.sh
# Build specific variants
./scripts/build_models.sh q4 # Uniform Q4 only
./scripts/build_models.sh q4 q2 # Q4 and mixed Q4+Q2
./scripts/build_models.sh fp16 # Full precision only
# Use a local HF model directory (skip download)
./scripts/build_models.sh q4 --local-dir /path/to/Voxtral-4B-TTS-2603
# Preview commands without running
./scripts/build_models.sh --dry-run

This produces:
| Directory | Quantization | Size | Use case |
|---|---|---|---|
| `mlx_model/` | None (fp16) | ~8 GB | Development / best quality |
| `mlx_model_q4/` | Uniform Q4 | ~2.1 GB | Mac (8GB+ RAM) |
| `mlx_model_q2/` | Q4 LLM/acoustic + Q2 codec | ~1.6 GB | iPhone / iPad |
# Full precision
python3 -m voxtral_tts.convert --output-dir mlx_model
# With uniform quantization (q2, q4, q6, q8)
python3 -m voxtral_tts.convert --output-dir mlx_model_q4 --quantize q4
# Mixed quantization for iOS (Q4 LLM/acoustic, Q2 codec for size savings)
python3 -m voxtral_tts.convert --output-dir mlx_model_q2 \
--quantize-llm q4 --quantize-acoustic q4 --quantize-codec q2
# Convert voice embeddings (.pt → .safetensors) for iOS compatibility
python3 -m voxtral_tts.convert_voices mlx_model_q4

| Target Device | RAM | Recommended Command | Estimated Size |
|---|---|---|---|
| Mac (M1/M2/M3/M4) | 16GB+ | `--quantize q4` | ~2.1 GB |
| Mac (8GB) | 8GB | `--quantize q4` | ~2.1 GB |
| iPhone 15 Pro / iPad Pro | 8GB | `--quantize-llm q4 --quantize-acoustic q4 --quantize-codec q2` | ~1.6 GB |
| iPhone 16 Pro | 8GB | `--quantize-llm q4 --quantize-acoustic q4 --quantize-codec q2` | ~1.6 GB |
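The uniform-Q4 estimate can be roughly cross-checked with back-of-envelope arithmetic (a sketch that ignores per-group quantization scales and any layers left unquantized, so it slightly undershoots the on-disk size):

```python
# Rough size estimate for uniform Q4: parameter count * 4 bits.
# Parameter counts are taken from the component table above.

params = {"llm": 3.4e9, "acoustic": 390e6, "codec": 300e6}
bits = 4

total_bytes = sum(params.values()) * bits / 8
print(f"{total_bytes / 1e9:.1f} GB")  # ≈ 2.0 GB, in line with the ~2.1 GB reported
```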
Mixed quantization applies different bit widths per component:
- `--quantize-llm` — Language model (3.4B params, minimum Q4 enforced)
- `--quantize-acoustic` — Acoustic transformer (390M params, minimum Q4 enforced)
- `--quantize-codec` — Codec (300M params, tolerates Q2 since it is not in the autoregressive loop)
Note: The LLM and acoustic transformer require Q4 minimum for intelligible speech. Values below Q4 are automatically clamped with a warning.
Per-component flags override --quantize when both are specified.
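The clamping behavior can be sketched as follows (a hypothetical helper for illustration, not the converter's actual code; the Q4/Q2 floors come from the note above):

```python
import warnings

# Hypothetical sketch of the per-component clamping rule described above:
# the LLM and acoustic transformer are floored at Q4; the codec may go to Q2.
MIN_BITS = {"llm": 4, "acoustic": 4, "codec": 2}

def clamp_bits(component: str, requested_bits: int) -> int:
    """Return the effective bit width, warning when the request is clamped."""
    floor = MIN_BITS[component]
    if requested_bits < floor:
        warnings.warn(
            f"{component}: q{requested_bits} is below the q{floor} minimum; "
            f"clamping to q{floor} to keep speech intelligible."
        )
        return floor
    return requested_bits

print(clamp_bits("llm", 2))    # 4 (clamped, with a warning)
print(clamp_bits("codec", 2))  # 2 (allowed)
```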
# Construction test (no weights needed)
python3 -m voxtral_tts.test_model --test construction
# Weight loading test
python3 -m voxtral_tts.test_model --model-path mlx_model --test loading
# Weight analysis
python3 -m voxtral_tts.test_model --model-path mlx_model --test weights

The `VoxtralTTS/` directory contains a SwiftUI iOS app that runs the model on-device using MLX-Swift.
The iOS app includes several optimizations to fit within iOS jetsam memory limits:
- Quantized embeddings — Both `Linear` and `Embedding` layers (including the 131K-vocab token embedding table) are properly loaded as quantized modules, saving ~800MB compared to unquantized embedding loading.
- Quantized output projection — The tied embedding output uses `quantizedMM` instead of dequantizing the full weight matrix.
- GPU cache limit — The MLX buffer cache is capped at 20MB on iOS (per MLX recommendations) to prevent unbounded memory growth during autoregressive generation.
- Periodic cache clearing — `Memory.clearCache()` is called every 50 frames during generation.
- Mixed quantization — The codec (not in the autoregressive loop) can use Q2 while the LLM and acoustic transformer stay at Q4 minimum.
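The periodic cache-clearing pattern looks roughly like this (a Python sketch with a stubbed `clear_cache` callback; the actual app does this in Swift against the MLX buffer cache, and only the 50-frame interval comes from the list above):

```python
# Sketch of periodic cache clearing during autoregressive generation.
# `clear_cache` is a stand-in callback; frame decoding is stubbed out.

CLEAR_INTERVAL = 50  # clear every 50 generated frames

def generate_frames(num_frames: int, clear_cache) -> list[int]:
    frames = []
    for i in range(1, num_frames + 1):
        frames.append(i)          # stand-in for decoding one audio frame
        if i % CLEAR_INTERVAL == 0:
            clear_cache()         # bound memory growth between frames
    return frames

calls = []
generate_frames(125, lambda: calls.append(1))
print(len(calls))  # 2 cache clears for 125 frames (after frames 50 and 100)
```

Bounding the cache matters most in the autoregressive loop, where thousands of small intermediate buffers would otherwise accumulate.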
- Build the iOS-optimized model: `./scripts/build_models.sh q2`
- Copy `mlx_model_q2/` to your device
- Open the Xcode project and build for your device (the simulator is not supported — MLX requires a Metal GPU)
- Select the model directory in the app and generate
scripts/
└── build_models.sh # Build all model variants (fp16, q4, q2)
voxtral_tts/
├── voxtral_tts.py # Main model: generate() + load()
├── transformer_decoder.py # LLM decoder (Mistral-based)
├── acoustic_transformer.py # Flow-matching acoustic model
├── codec.py # Audio codec (codes → waveform)
├── config.py # Model configuration dataclasses
├── convert.py # HF → MLX weight converter
├── convert_voices.py # .pt → .safetensors voice converter
└── test_model.py # Construction/loading/weight tests
- MLX / mlx-lm — Apple ML framework
- mistral-common — Tekken tokenizer + voice embeddings
- safetensors — Weight file format
- `torch` (dev only) — For `.pt` voice embedding conversion
Apache-2.0

