**ALPHA — EXPERIMENTAL SOFTWARE.** This is an early-stage, community-built inference engine. Expect rough edges, missing features, and breaking changes. Not production-ready.
Diverges from rodrigomatta/s2.cpp upstream in the following ways:

- Separate `s2-server` binary — the HTTP server is split out of the `s2` CLI into its own executable. The `--server`, `-H`, and `-P` flags have been removed from `s2`.
- OpenAI-compatible API — `s2-server` exposes `/v1/audio/speech`, `/v1/audio/voices`, and `/v1/models` endpoints, modeled on the qwen3-tts.cpp server surface. Voice cloning is ICL-only (reference audio + `ref_text`).
- HuggingFace auto-download — `s2-server` accepts `-hf <repo[:quant]>` and resolves the GGUF via the `hf` CLI.
- Deterministic sampling — `GenerateParams::seed` (and a `sampler_set_seed()` entry point) allow reproducible generation; `seed=0` preserves the prior non-deterministic behavior.
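The seed semantics described above can be sketched in plain C++. This is an illustration of the stated behavior, not the engine's actual sampler code; the helper name `make_sampler_rng` is hypothetical:

```cpp
#include <cstdint>
#include <random>

// Sketch: seed == 0 -> non-deterministic (seeded from the OS),
// any other value -> reproducible sampling from run to run.
std::mt19937 make_sampler_rng(uint64_t seed) {
    if (seed == 0) {
        std::random_device rd;  // preserves the prior non-deterministic behavior
        return std::mt19937(rd());
    }
    return std::mt19937(static_cast<uint32_t>(seed));  // reproducible stream
}
```

With a fixed nonzero seed, two runs draw identical random sequences, which is what makes generation reproducible end to end.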
s2.cpp — Fish Audio's S2 Pro Dual-AR text-to-speech model running locally via a pure C++/GGML inference engine with CPU, Vulkan, and CUDA GPU backends. No Python runtime required after build.
Built on Fish Audio S2 Pro The model weights are licensed under the Fish Audio Research License, Copyright © 39 AI, INC. All Rights Reserved. See LICENSE.md for full terms. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.
This repository contains:
- `s2.cpp` — a self-contained C++17 inference engine built on ggml (v0.9.11), handling tokenization, Dual-AR generation, audio codec encode/decode, and WAV output with no Python dependency
- `tokenizer.json` — Qwen3 BPE tokenizer with ByteLevel pre-tokenization
- GGUF model files are not included here — see Model variants below
The engine runs the full pipeline: text → tokens → Slow-AR transformer (with KV cache) → Fast-AR codebook decoder → audio codec → WAV file.
GGUF files are available at rodrigomt/s2-pro-gguf on Hugging Face.
| File | Size | Notes |
|---|---|---|
| `s2-pro-f16.gguf` | 9.9 GB | Full precision — reference quality |
| `s2-pro-q8_0.gguf` | 5.6 GB | Near-lossless — recommended for 8+ GB VRAM |
| `s2-pro-q6_k.gguf` | 4.5 GB | Good quality/size balance — recommended for 6+ GB VRAM |
| `s2-pro-q5_k_m.gguf` | 4.0 GB | Smaller with still-good quality |
| `s2-pro-q4_k_m.gguf` | 3.6 GB | Best compact variant so far in quick RU validation |
| `s2-pro-q3_k.gguf` | 3.0 GB | Usable, but starts stretching short words |
| `s2-pro-q2_k.gguf` | 2.6 GB | Lowest-size experimental variant |
All variants include both the transformer weights and the audio codec in a single file.
The quantized variants above were regenerated with the codec tensors (`c.*`) kept in F16, so only the AR transformer is quantized.
- CMake ≥ 3.14
- C++17 compiler (GCC ≥ 10, Clang ≥ 11, MSVC 2019+)
- For Vulkan GPU support: Vulkan SDK and `glslc`
- For CUDA/NVIDIA GPU support: CUDA Toolkit ≥ 12.4
- MSVC 2019+ note: MSVC 2019 and later require CUDA ≥ 12.4 when building GGML. Older CUDA versions will produce compiler compatibility errors; upgrade to 12.4+ to resolve them.
```bash
# Ubuntu / Debian
sudo apt install cmake build-essential

# Vulkan (optional, for AMD/Intel GPU acceleration)
sudo apt install vulkan-tools libvulkan-dev glslc

# CUDA (optional, for NVIDIA GPU acceleration)
# Install from https://developer.nvidia.com/cuda-downloads
```

No Python or PyTorch required. The binary links only against the ggml shared libraries built alongside it.
Clone with submodules (ggml is a submodule):

```bash
git clone --recurse-submodules https://github.com/rodrigomatta/s2.cpp.git
cd s2.cpp
```

CPU-only build:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc)
```

Vulkan build:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_VULKAN=ON
cmake --build build --parallel $(nproc)
```

CUDA build:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_CUDA=ON
cmake --build build --parallel $(nproc)
```

Metal build:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_METAL=ON
cmake --build build --parallel $(nproc)
```

The binary is produced at `build/s2`.
```bash
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "The quick brown fox jumps over the lazy dog." \
  -o output.wav
```

`tokenizer.json` is searched automatically in the same directory as the model file, then in its parent directory. If not found in either, it falls back to `tokenizer.json` in the current working directory.
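The lookup order can be expressed as a small helper. The sketch below mirrors the search logic just described; it is illustrative, not the engine's actual implementation, and the `exists` predicate is injected so the logic can be exercised without touching the disk:

```cpp
#include <filesystem>
#include <functional>
#include <string>

namespace fs = std::filesystem;

// Sketch of the tokenizer.json search order: model dir -> parent dir -> CWD.
std::string resolve_tokenizer(const std::string& model_path,
                              const std::function<bool(const fs::path&)>& exists) {
    fs::path model(model_path);
    for (const fs::path& dir : {model.parent_path(),
                                model.parent_path().parent_path()}) {
        fs::path candidate = dir / "tokenizer.json";
        if (exists(candidate)) return candidate.string();
    }
    return "tokenizer.json";  // fall back to the current working directory
}
```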
Provide a short reference clip (5–30 seconds, WAV or MP3) and a transcript of it:

```bash
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -pa reference.wav \
  -pt "Transcript of what the reference speaker says." \
  -text "Now synthesize this text in that voice." \
  -o output.wav
```

By default, the engine uses fish-speech-aligned sampling defaults: `--min-tokens-before-end 0`, no trailing-silence trim, no peak normalization, and no dynamic loudness normalization. All of these behaviors are optional and can be enabled from the CLI.
```bash
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Text to synthesize." \
  -v 0 \
  -o output.wav
```

`-v 0` selects the first Vulkan device. All model weights are loaded into GPU VRAM. The audio codec always runs on CPU (it executes only twice per synthesis).
```bash
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Text to synthesize." \
  -c 0 \
  -o output.wav
```

`-c 0` selects the first CUDA device.

CUDA + quantized models: The CUDA backend supports `ggml_get_rows` for F16, F32, BF16, Q4_0, Q4_1, Q5_0, Q5_1, and Q8_0. Models in these formats (including Q8_0) run fully on GPU with no fallback. K-quant variants (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) are not supported by `get_rows`; for these, the engine automatically dequantizes the embedding tables to F16 on GPU while keeping the layer weights quantized, so compute still benefits from CUDA-accelerated `mul_mat`.
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_METAL=ON
cmake --build build --parallel $(nproc)

./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Text to synthesize." \
  -M \
  -o output.wav
```

`-M` enables the Metal backend. Tested and functional on Apple Silicon. Generation is still slower than with Vulkan/CUDA — the pipeline runs correctly, but expect higher latency per token compared to AMD or NVIDIA GPUs.
| Flag | Default | Description |
|---|---|---|
| `-m, --model` | `model.gguf` | Path to GGUF model file |
| `-t, --tokenizer` | `tokenizer.json` | Path to tokenizer.json |
| `-text` | `"Hello world"` | Text to synthesize |
| `-pa, --prompt-audio` | — | Reference audio file for voice cloning (WAV/MP3) |
| `-pt, --prompt-text` | — | Transcript of the reference audio |
| `-o, --output` | `out.wav` | Output WAV file path |
| `-v, --vulkan` | `-1` (CPU) | Vulkan device index (-1 = CPU only) |
| `-c, --cuda` | `-1` (CPU) | CUDA device index (-1 = CPU only) |
| `-M, --metal` | — | Use Metal (Apple GPU, macOS only) |
| `-threads N` | `4` | Number of CPU threads |
| `-max-tokens N` | `1024` | Max tokens to generate |
| `--min-tokens-before-end N` | `0` | Minimum generated tokens before EOS is allowed; 0 matches fish-speech default behavior |
| `-temp F` | `0.8` | Sampling temperature |
| `-top-p F` | `0.8` | Top-p nucleus sampling |
| `-top-k N` | `30` | Top-k sampling |
| `--dynamic-normalize / --no-dynamic-normalize` | disabled | Enable or disable dynamic RMS normalization |
| `--trim-silence / --no-trim-silence` | disabled | Enable or disable trailing silence trimming on the saved WAV |
| `--normalize / --no-normalize` | disabled | Enable or disable peak normalization to 0.95 on the saved WAV |
| `--server` | — | Start HTTP server instead of CLI synthesis |
| `-H, --host` | `127.0.0.1` | Server bind address |
| `-P, --port` | `3030` | Server port |
Setting --min-tokens-before-end 0 matches the upstream fish-speech behavior. Non-zero values deliberately bias the model away from early EOS.
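A common way to implement such a constraint is to mask the EOS logit until enough tokens have been produced. The sketch below illustrates that idea; it is an assumption about the mechanism, not the engine's actual code:

```cpp
#include <limits>
#include <vector>

// Sketch: before sampling, suppress EOS while fewer than
// `min_tokens_before_end` tokens have been generated. With the
// default of 0, EOS is never masked (fish-speech behavior).
void mask_eos_if_too_early(std::vector<float>& logits, int eos_id,
                           int n_generated, int min_tokens_before_end) {
    if (n_generated < min_tokens_before_end) {
        logits[eos_id] = -std::numeric_limits<float>::infinity();
    }
}
```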
Start the server:

```bash
./build/s2 -m s2-pro-q6_k.gguf --server
# or with custom host/port:
./build/s2 -m s2-pro-q6_k.gguf --server -H 0.0.0.0 -P 8080
```

`POST /generate` — synthesize audio (`multipart/form-data`)
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | string | yes | Text to synthesize |
| `reference` | file | no | Reference audio file for voice cloning (WAV or MP3). Aliases: `reference_audio`, `prompt_audio`, `ref_audio` |
| `reference_text` | string | if reference audio is provided | Transcript of the reference audio. Aliases: `ref_text`, `prompt_text` |
| `params` | JSON string | no | Generation params: `max_new_tokens`, `temperature`, `top_p`, `top_k`, `min_tokens_before_end`, `n_threads`, `verbose` |
Returns `audio/wav`.
```bash
# Basic
curl -X POST http://127.0.0.1:3030/generate \
  --form "text=Hello world" \
  --form 'params={"max_new_tokens":512,"temperature":0.58,"top_p":0.88,"top_k":40}' \
  -o output.wav

# With voice cloning
curl -X POST http://127.0.0.1:3030/generate \
  --form "reference=@reference.wav" \
  --form "reference_text=Transcript of the reference." \
  --form "text=Text to synthesize in that voice." \
  --form 'params={"max_new_tokens":512,"temperature":0.58,"top_p":0.88,"top_k":40}' \
  -o output.wav

# Same request using the accepted aliases
curl -X POST http://127.0.0.1:3030/generate \
  --form "reference_audio=@reference.wav" \
  --form "ref_text=Transcript of the reference." \
  --form "text=Text to synthesize in that voice." \
  -o output.wav
```

The engine can be built as a shared library (.so / .dll) with a C-compatible export API for integration into other applications:
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_BUILD_SHARED_LIBRARIES=ON
cmake --build build --parallel $(nproc)
```

This produces `s2_dll` (shared) and `s2_lib` (static) alongside the CLI executable. When built as a shared library on Windows, non-essential console output is suppressed automatically via `s2_config.h`.
The export API (`s2_export_api.h` / `s2_export_api.cpp`, contributed by @subspecs) exposes a modular C interface with the following capabilities:
- Separate lifecycle management — model, tokenizer, codec, and pipeline can be loaded independently and reused across multiple synthesis calls without reloading
- Text-to-speech with or without voice cloning — synthesize from text alone, or provide a reference audio file and transcript for voice cloning
- Audio output flexibility — results can be saved to a WAV file, retrieved as a raw `float*` sample buffer in memory, or both simultaneously
- Precomputed reference codes — reference audio can be encoded once and reused across generations, avoiding repeated encoding overhead when synthesizing multiple utterances with the same voice
- Cross-platform visibility — `S2_Export` uses `__declspec(dllexport)` on Windows and `__attribute__((visibility("default")))` on GCC/Clang, so the same headers work across platforms
Exported functions cover allocation/release (`AllocS2*` / `ReleaseS2*`), initialization (`InitializeS2*`), tokenizer-model config sync (`SyncS2TokenizerConfigFromS2Model`), reference audio pre-processing (`InitializeAudioPromptCodes`), raw audio buffer access (`AllocS2AudioBuffer` / `GetS2AudioBufferDataPointer`), and the main synthesis entry point (`S2Synthesize`).
A community-maintained C# wrapper targeting .NET Standard 2.1 (Unity-compatible) is available at subspecs/FishS2Sharp.
| VRAM available | Recommended model |
|---|---|
| ≥ 10 GB | q8_0 — near-lossless quality |
| 6–9 GB | q6_k — good quality/size balance |
| 5–7 GB | q4_k_m — best compact variant in current quick validation |
| < 5 GB | q3_k or q2_k — experimental, quality drops faster |
VRAM usage at runtime is roughly on the order of the model size, but actual usage depends on backend buffers, KV cache length, and allocator overhead. The audio codec executes on CPU during inference.
Note for CUDA users: Quantized models (`q4_k_m`, `q3_k`, `q2_k`) fall back to CPU compute on CUDA due to the `ggml_get_rows` limitation. Use Vulkan (`-v`) for GPU acceleration with these models, or use `q8_0`/`f16`, which work natively on CUDA.
S2 Pro uses a Dual-AR architecture:
- Slow-AR — a 36-layer Qwen3-based transformer (4.13B params) that processes the full token sequence with GQA (32 heads, 8 KV heads), RoPE at 1M base, QK norm, and a persistent KV cache
- Fast-AR — a 4-layer transformer (0.42B params) that autoregressively generates 10 acoustic codebook tokens from the Slow-AR hidden state for each semantic step
- Audio codec — a convolutional encoder/decoder with residual vector quantization (RVQ, 10 codebooks × 4096 entries) that converts between audio waveforms and discrete codes
Total: ~4.56B parameters.
The C++ engine (src/) is built on ggml v0.9.11 (pinned as a submodule). Key design decisions:
- Backend-aware weight allocation — the engine detects the active backend at load time. Vulkan and Metal can store all weights on GPU. CUDA falls back to CPU for quantized models to avoid the `ggml_get_rows` unsupported-type crash. For non-CUDA backends, all weights go to GPU for maximum throughput.
- Separate persistent `gallocr` allocators for Slow-AR and Fast-AR — each path keeps its own compute buffer, avoiding memory re-planning per token
- Temporary prefill allocator — freed immediately after prefill, so the large compute buffer does not persist into the generation loop
- Codec on CPU — the audio codec executes once per synthesis (decode only) or twice when a reference audio is provided (encode reference + decode output), so running it on CPU has zero impact on generation throughput
- Graceful CUDA fallback — if CUDA initialization fails, the engine falls back to CPU automatically with a warning instead of aborting
- RAS (Rejection-Augmented Sampling) — the generation loop includes a sliding-window repetition detector that resamples with higher temperature when tokens repeat within a 10-token window, reducing stuck-loop failures
- `posix_fadvise(DONTNEED)` after loading the weights (Linux only) — advises the kernel to drop the GGUF file from the page cache once the tensors are already in the backend buffer, reducing duplicate RAM use
- Correct ByteLevel tokenization — the GPT-2 byte-to-unicode table is applied before BPE, producing token IDs identical to the HuggingFace reference tokenizer
- Thread-safe pipeline — `Pipeline::synthesize_raw` is guarded by a mutex, allowing safe concurrent access from the HTTP server or external callers
- HTTP server — built-in REST server (`--server`) with multipart form support for synthesis and voice cloning, powered by `cpp-httplib`
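The RAS repetition check described above can be sketched as a small sliding-window detector. The 10-token window comes from the description; the class and its interface are illustrative, not the engine's actual implementation:

```cpp
#include <cstdint>
#include <deque>

// Sliding-window repetition detector: if a candidate token already
// appears in the last `window` tokens, the caller resamples it at a
// higher temperature instead of accepting it.
class RepetitionWindow {
public:
    explicit RepetitionWindow(size_t window = 10) : window_(window) {}

    bool is_repeat(int32_t token) const {
        for (int32_t t : recent_) if (t == token) return true;
        return false;
    }

    void push(int32_t token) {
        recent_.push_back(token);
        if (recent_.size() > window_) recent_.pop_front();  // evict oldest
    }

private:
    size_t window_;
    std::deque<int32_t> recent_;
};
```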
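The ByteLevel step relies on the standard GPT-2 byte-to-unicode table, which can be reproduced in a few lines. This is the well-known GPT-2 construction, shown for reference rather than taken from src/s2_tokenizer.cpp:

```cpp
#include <array>
#include <cstdint>

// GPT-2 byte-to-unicode table: printable/latin bytes map to their own
// code points, and the remaining 68 bytes are remapped to 256, 257, ...
// so every byte becomes a visible character before BPE runs
// (e.g. space 0x20 -> U+0120 "Ġ", newline 0x0A -> U+010A "Ċ").
std::array<uint32_t, 256> bytes_to_unicode() {
    std::array<uint32_t, 256> table{};
    uint32_t next = 256;
    for (int b = 0; b < 256; ++b) {
        bool kept = (b >= 33 && b <= 126) ||
                    (b >= 161 && b <= 172) ||
                    (b >= 174 && b <= 255);
        table[b] = kept ? static_cast<uint32_t>(b) : next++;
    }
    return table;
}
```

Getting this table exactly right is what makes the C++ token IDs match the HuggingFace reference tokenizer byte for byte.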
| File | Purpose |
|---|---|
| `src/main.cpp` | CLI argument parsing and entry point |
| `src/s2_model.cpp` | Dual-AR model loading, weight allocation, prefill/step/fast_decode with KV cache |
| `src/s2_codec.cpp` | Audio codec (encoder + decoder) with snake activations, ConvNext blocks, RVQ |
| `src/s2_generate.cpp` | Generation loop with RAS, semantic masking, fast codebook decoding |
| `src/s2_pipeline.cpp` | Orchestration: tokenizer + model + codec, audio I/O, mutex-guarded synthesis |
| `src/s2_tokenizer.cpp` | Qwen3 BPE tokenizer with ByteLevel pre-tokenization |
| `src/s2_sampler.cpp` | Temperature, top-p, top-k sampling |
| `src/s2_prompt.cpp` | Prompt tensor construction for semantic and codebook tokens |
| `src/s2_audio.cpp` | WAV read/write, silence trimming, normalization |
| `src/s2_server.cpp` | HTTP server with /generate endpoint |
| `src/s2_export_api.cpp` | C-exported shared library API |
Voice quality and amplitude tend to degrade after ~800 tokens (~37 s of audio). For longer texts, split into sentences and concatenate the resulting WAV files. Optional post-processing flags such as `--dynamic-normalize`, `--normalize`, and `--trim-silence` can help clean up the result, but splitting remains the most reliable approach.
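A minimal sketch of the sentence-splitting workaround. This naive punctuation-based splitter is a hypothetical helper, not part of the engine; real text may need smarter handling of abbreviations and decimals:

```cpp
#include <string>
#include <vector>

// Split long input into sentences so each synthesis call stays well
// under the ~800-token degradation point. Splits on . ! ? and trims
// leading spaces from each resulting sentence.
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> out;
    std::string cur;
    auto flush = [&] {
        size_t start = cur.find_first_not_of(' ');
        if (start != std::string::npos) out.push_back(cur.substr(start));
        cur.clear();
    };
    for (char c : text) {
        cur += c;
        if (c == '.' || c == '!' || c == '?') flush();
    }
    flush();  // keep any trailing fragment without end punctuation
    return out;
}
```

Each returned sentence can then be synthesized separately and the WAV files concatenated.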
- Voice cloning quality depends heavily on reference audio length and SNR
- CUDA backend falls back to CPU compute for quantized models (Q2_K through Q6_K) — Vulkan works natively with all quantization types
- Windows: CUDA and Vulkan backends are supported; when using MSVC 2019+, ensure CUDA ≥ 12.4 is installed before building
- macOS: Metal backend is tested and works on Apple Silicon, but generation is noticeably slower than Vulkan/CUDA equivalents
The model weights and associated materials are licensed under the Fish Audio Research License. Key points:
- Research and non-commercial use: free, under the terms of this Agreement
- Commercial use: requires a separate written license from Fish Audio
- When distributing, you must include a copy of the license and the attribution notice
- Attribution: "This model is licensed under the Fish Audio Research License, Copyright © 39 AI, INC. All Rights Reserved."
Full license: LICENSE.md
Commercial licensing: https://fish.audio · business@fish.audio
The inference engine source code (src/) is a Derivative Work of the Fish Audio Materials as defined in the Agreement and is distributed under the same Fish Audio Research License terms.