lbj96347/Mistral-TTS-iOS

Voxtral TTS MLX

MLX port of Mistral's Voxtral-4B-TTS-2603 text-to-speech model for on-device inference on Apple Silicon.

Converts the HuggingFace model (~8 GB) into MLX format with optional quantization (Q2–Q8) for efficient local generation.

Screenshots

macOS app

iPhone 15 Pro app

Audio Samples

macOS — Original Model (fp16)

Text: "Kimi Antonelli took his second pole position in a row as he beat Mercedes team-mate George Russell in qualifying at the Japanese Grand Prix..."

https://github.com/lbj96347/Mistral-TTS-iOS/raw/main/audio_sample/voxtral_output.mp3

macOS — Q4 Model

Same text as above

https://github.com/lbj96347/Mistral-TTS-iOS/raw/main/audio_sample/voxtral_output_q4.mp3

iPhone 15 Pro — Q2 Model

Text: "Good morning! Nice to see you again!"

https://github.com/lbj96347/Mistral-TTS-iOS/raw/main/audio_sample/voxtral_output_iPhone.wav

Architecture

Three-stage pipeline:

Text → LLM Decoder → Flow-Matching Transformer → Codec → 24kHz WAV
| Component | Params | Description |
|---|---|---|
| Transformer Decoder | 3.4B | Ministral-3B-based LLM (Mistral); text tokens → hidden states |
| Acoustic Transformer | 390M | Flow matching with Euler ODE + CFG; hidden states → audio codes |
| Voxtral Codec | 300M | Conv-transformer autoencoder; codes → 24 kHz waveform |

Each audio frame consists of 37 discrete tokens (1 semantic + 36 acoustic) at 12.5 Hz frame rate.
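
The frame layout above implies a simple token budget: every second of audio costs 12.5 frames × 37 tokens. A quick back-of-the-envelope check (plain Python, no model required):

```python
# Token budget for Voxtral audio frames: 37 tokens per frame at 12.5 Hz.
FRAME_RATE_HZ = 12.5
TOKENS_PER_FRAME = 37  # 1 semantic + 36 acoustic

def tokens_for_duration(seconds: float) -> int:
    """Total discrete tokens the pipeline must produce for `seconds` of audio."""
    frames = int(seconds * FRAME_RATE_HZ)
    return frames * TOKENS_PER_FRAME

print(tokens_for_duration(10))  # 125 frames -> 4625 tokens
```

This is why generation length matters on mobile: a 10-second clip already requires 125 autoregressive frame steps.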

Setup

Requires Python 3.10+ and Apple Silicon (M1/M2/M3/M4).

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Usage

1. Inspect HuggingFace model weights

python3 -m voxtral_tts.convert --inspect

2. Convert model to MLX format

Quick build (recommended)

Use the build script to convert and prepare models with voice embeddings in one step:

# Build all three variants (fp16, q4, q2)
./scripts/build_models.sh

# Build specific variants
./scripts/build_models.sh q4           # Uniform Q4 only
./scripts/build_models.sh q4 q2        # Q4 and mixed Q4+Q2
./scripts/build_models.sh fp16         # Full precision only

# Use a local HF model directory (skip download)
./scripts/build_models.sh q4 --local-dir /path/to/Voxtral-4B-TTS-2603

# Preview commands without running
./scripts/build_models.sh --dry-run

This produces:

| Directory | Quantization | Size | Use case |
|---|---|---|---|
| `mlx_model/` | None (fp16) | ~8 GB | Development / best quality |
| `mlx_model_q4/` | Uniform Q4 | ~2.1 GB | Mac (8 GB+ RAM) |
| `mlx_model_q2/` | Q4 LLM/acoustic + Q2 codec | ~1.6 GB | iPhone / iPad |

Manual conversion

# Full precision
python3 -m voxtral_tts.convert --output-dir mlx_model

# With uniform quantization (q2, q4, q6, q8)
python3 -m voxtral_tts.convert --output-dir mlx_model_q4 --quantize q4

# Mixed quantization for iOS (Q4 LLM/acoustic, Q2 codec for size savings)
python3 -m voxtral_tts.convert --output-dir mlx_model_q2 \
    --quantize-llm q4 --quantize-acoustic q4 --quantize-codec q2

# Convert voice embeddings (.pt → .safetensors) for iOS compatibility
python3 -m voxtral_tts.convert_voices mlx_model_q4

Quantization Guide

| Target Device | RAM | Recommended Command | Estimated Size |
|---|---|---|---|
| Mac (M1/M2/M3/M4) | 16 GB+ | `--quantize q4` | ~2.1 GB |
| Mac (8 GB) | 8 GB | `--quantize q4` | ~2.1 GB |
| iPhone 15 Pro / iPad Pro | 8 GB | `--quantize-llm q4 --quantize-acoustic q4 --quantize-codec q2` | ~1.6 GB |
| iPhone 16 Pro | 8 GB | `--quantize-llm q4 --quantize-acoustic q4 --quantize-codec q2` | ~1.6 GB |

Mixed quantization applies different bit widths per component:

  • --quantize-llm — Language model (3.4B params, minimum Q4 enforced)
  • --quantize-acoustic — Acoustic transformer (390M params, minimum Q4 enforced)
  • --quantize-codec — Codec (300M params, tolerates Q2 since it's not in autoregressive loop)

Note: The LLM and acoustic transformer require Q4 minimum for intelligible speech. Values below Q4 are automatically clamped with a warning.

Per-component flags override --quantize when both are specified.

3. Test the model

# Construction test (no weights needed)
python3 -m voxtral_tts.test_model --test construction

# Weight loading test
python3 -m voxtral_tts.test_model --model-path mlx_model --test loading

# Weight analysis
python3 -m voxtral_tts.test_model --model-path mlx_model --test weights

iOS App

The VoxtralTTS/ directory contains a SwiftUI iOS app that runs the model on-device using MLX-Swift.

iOS Memory Optimizations

The iOS app includes several optimizations to fit within iOS jetsam memory limits:

  • Quantized embeddings — Both Linear and Embedding layers (including the 131K-vocab token embedding table) are properly loaded as quantized modules, saving ~800MB compared to unquantized embedding loading.
  • Quantized output projection — Tied embedding output uses quantizedMM instead of dequantizing the full weight matrix.
  • GPU cache limit — MLX buffer cache is capped at 20MB on iOS (per MLX recommendations) to prevent unbounded memory growth during autoregressive generation.
  • Periodic cache clearing — Memory.clearCache() is called every 50 frames during generation.
  • Mixed quantization — The codec (not in autoregressive loop) can use Q2 while LLM and acoustic transformer stay at Q4 minimum.
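
The periodic cache-clearing pattern above is simple to reproduce. Here is a language-agnostic sketch in Python, where `clear_cache` stands in for MLX-Swift's `Memory.clearCache()` and `decode_frame` is a hypothetical per-frame decode step:

```python
CLEAR_EVERY = 50  # frames between cache flushes

def generate_frames(num_frames, decode_frame, clear_cache):
    """Autoregressive loop that flushes the GPU buffer cache every
    CLEAR_EVERY frames to keep peak memory under the jetsam limit."""
    frames = []
    for i in range(1, num_frames + 1):
        frames.append(decode_frame(i))
        if i % CLEAR_EVERY == 0:
            clear_cache()
    return frames

# Count how often the cache would be flushed over 125 frames (~10 s of audio).
calls = []
generate_frames(125, decode_frame=lambda i: i, clear_cache=lambda: calls.append(1))
print(len(calls))  # 2 flushes, at frames 50 and 100
```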

Running on iOS

  1. Build the iOS-optimized model: ./scripts/build_models.sh q2
  2. Copy mlx_model_q2/ to your device
  3. Open the Xcode project and build for your device (simulator is not supported — MLX requires Metal GPU)
  4. Select the model directory in the app and generate

Project Structure

scripts/
└── build_models.sh              # Build all model variants (fp16, q4, q2)
voxtral_tts/
├── voxtral_tts.py               # Main model: generate() + load()
├── transformer_decoder.py       # LLM decoder (Mistral-based)
├── acoustic_transformer.py      # Flow-matching acoustic model
├── codec.py                     # Audio codec (codes → waveform)
├── config.py                    # Model configuration dataclasses
├── convert.py                   # HF → MLX weight converter
├── convert_voices.py            # .pt → .safetensors voice converter
└── test_model.py                # Construction/loading/weight tests

Dependencies

  • MLX / mlx-lm — Apple ML framework
  • mistral-common — Tekken tokenizer + voice embeddings
  • safetensors — Weight file format
  • torch (dev only) — For .pt voice embedding conversion

License

Apache-2.0
