**ALPHA — EXPERIMENTAL SOFTWARE.** This is an early-stage, community-built inference engine. Expect rough edges, missing features, and breaking changes. Not production-ready.
Diverges from rodrigomatta/s2.cpp upstream in the following ways:

- Separate `s2-server` binary — the HTTP server is split out of the `s2` CLI into its own executable. The `--server`, `-H`, and `-P` flags have been removed from `s2`.
- OpenAI-compatible API — `s2-server` exposes `/v1/audio/speech`, `/v1/audio/voices`, and `/v1/models` endpoints, modeled on the qwen3-tts.cpp server surface. Voice cloning is ICL-only (reference audio + `ref_text`).
- HuggingFace auto-download — `s2-server` accepts `-hf <repo[:quant]>` and resolves the GGUF via the `hf` CLI.
- Deterministic sampling — `GenerateParams::seed` (and a `sampler_set_seed()` entry point) allow reproducible generation; `seed=0` preserves the prior non-deterministic behavior.
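The seed semantics described above can be sketched in plain C++. This is an illustration of the stated behavior, not the engine's actual sampler code; the helper name `make_sampler_rng` is hypothetical:

```cpp
#include <cstdint>
#include <random>

// Sketch: seed == 0 -> non-deterministic (seeded from the OS),
// any other value -> reproducible sampling from run to run.
std::mt19937 make_sampler_rng(uint64_t seed) {
    if (seed == 0) {
        std::random_device rd;  // preserves the prior non-deterministic behavior
        return std::mt19937(rd());
    }
    return std::mt19937(static_cast<uint32_t>(seed));  // reproducible stream
}
```

With a fixed nonzero seed, two runs draw identical random sequences, which is what makes generation reproducible end to end.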
s2.cpp — Fish Audio's S2 Pro Dual-AR text-to-speech model running locally via a pure C++/GGML inference engine with CPU, Vulkan, and CUDA GPU backends. No Python runtime required after build.
Built on Fish Audio S2 Pro The model weights are licensed under the Fish Audio Research License, Copyright © 39 AI, INC. All Rights Reserved. See LICENSE.md for full terms. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.
This repository contains:
- `s2.cpp` — a self-contained C++17 inference engine built on ggml (v0.9.11), handling tokenization, Dual-AR generation, audio codec encode/decode, and WAV output with no Python dependency
- `tokenizer.json` — Qwen3 BPE tokenizer with ByteLevel pre-tokenization
- GGUF model files are not included here — see Model variants below
The engine runs the full pipeline: text → tokens → Slow-AR transformer (with KV cache) → Fast-AR codebook decoder → audio codec → WAV file.
GGUF files are available at rodrigomt/s2-pro-gguf on Hugging Face.
| File | Size | Notes |
|---|---|---|
| `s2-pro-f16.gguf` | 9.9 GB | Full precision — reference quality |
| `s2-pro-q8_0.gguf` | 5.6 GB | Near-lossless — recommended for 8+ GB VRAM |
| `s2-pro-q6_k.gguf` | 4.5 GB | Good quality/size balance — recommended for 6+ GB VRAM |
| `s2-pro-q5_k_m.gguf` | 4.0 GB | Smaller with still-good quality |
| `s2-pro-q4_k_m.gguf` | 3.6 GB | Best compact variant so far in quick RU validation |
| `s2-pro-q3_k.gguf` | 3.0 GB | Usable, but starts stretching short words |
| `s2-pro-q2_k.gguf` | 2.6 GB | Lowest-size experimental variant |
All variants include both the transformer weights and the audio codec in a single file.
The quantized variants above were regenerated with the codec tensors (`c.*`) kept in F16, so only the AR transformer is quantized.
- CMake ≥ 3.14
- C++17 compiler (GCC ≥ 10, Clang ≥ 11, MSVC 2019+)
- For Vulkan GPU support: Vulkan SDK and `glslc`
- For CUDA/NVIDIA GPU support: CUDA Toolkit ≥ 12.4
- MSVC 2019+ note: MSVC 2019 and later require CUDA ≥ 12.4 when building GGML. Older CUDA versions will produce compiler compatibility errors; upgrade to 12.4+ to resolve them.
```bash
# Ubuntu / Debian
sudo apt install cmake build-essential

# Vulkan (optional, for AMD/Intel GPU acceleration)
sudo apt install vulkan-tools libvulkan-dev glslc

# CUDA (optional, for NVIDIA GPU acceleration)
# Install from https://developer.nvidia.com/cuda-downloads
```

No Python or PyTorch required. The binary links only against the ggml shared libraries built alongside it.
Clone with submodules (ggml is a submodule):

```bash
git clone --recurse-submodules https://github.com/rodrigomatta/s2.cpp.git
cd s2.cpp
```

CPU-only build:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc)
```

Vulkan build:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_VULKAN=ON
cmake --build build --parallel $(nproc)
```

CUDA build:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_CUDA=ON
cmake --build build --parallel $(nproc)
```

Metal build:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_METAL=ON
cmake --build build --parallel $(nproc)
```

The binary is produced at `build/s2`.
```bash
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "The quick brown fox jumps over the lazy dog." \
  -o output.wav
```

`tokenizer.json` is searched automatically in the same directory as the model file, then in its parent directory. If not found in either, it falls back to `tokenizer.json` in the current working directory.
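The lookup order can be expressed as a small helper. The sketch below mirrors the search logic just described; it is illustrative, not the engine's actual implementation, and the `exists` predicate is injected so the logic can be exercised without touching the disk:

```cpp
#include <filesystem>
#include <functional>
#include <string>

namespace fs = std::filesystem;

// Sketch of the tokenizer.json search order: model dir -> parent dir -> CWD.
std::string resolve_tokenizer(const std::string& model_path,
                              const std::function<bool(const fs::path&)>& exists) {
    fs::path model(model_path);
    for (const fs::path& dir : {model.parent_path(),
                                model.parent_path().parent_path()}) {
        fs::path candidate = dir / "tokenizer.json";
        if (exists(candidate)) return candidate.string();
    }
    return "tokenizer.json";  // fall back to the current working directory
}
```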
Provide a short reference clip (5–30 seconds, WAV or MP3) and a transcript of it:

```bash
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -pa reference.wav \
  -pt "Transcript of what the reference speaker says." \
  -text "Now synthesize this text in that voice." \
  -o output.wav
```

By default, the engine uses fish-speech-aligned sampling defaults: `--min-tokens-before-end 0`, no trailing-silence trim, no peak normalization, and no dynamic loudness normalization. All of these behaviors are optional and can be enabled from the CLI.
```bash
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Text to synthesize." \
  -v 0 \
  -o output.wav
```

`-v 0` selects the first Vulkan device. All model weights are loaded into GPU VRAM. The audio codec always runs on CPU (it executes only twice per synthesis).
```bash
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Text to synthesize." \
  -c 0 \
  -o output.wav
```

`-c 0` selects the first CUDA device.

CUDA + quantized models: The CUDA backend supports `ggml_get_rows` for F16, F32, BF16, Q4_0, Q4_1, Q5_0, Q5_1, and Q8_0. Models in these formats (including Q8_0) run fully on GPU with no fallback. K-quant variants (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) are not supported by `get_rows`; for these, the engine automatically dequantizes the embedding tables to F16 on GPU while keeping the layer weights quantized, so compute still benefits from CUDA-accelerated `mul_mat`.
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_METAL=ON
cmake --build build --parallel $(nproc)

./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Text to synthesize." \
  -M \
  -o output.wav
```

`-M` enables the Metal backend. Tested and functional on Apple Silicon. Generation is still slower than with Vulkan/CUDA — the pipeline runs correctly, but expect higher latency per token compared to AMD or NVIDIA GPUs.
| Flag | Default | Description |
|---|---|---|
| `-m, --model` | `model.gguf` | Path to GGUF model file |
| `-t, --tokenizer` | `tokenizer.json` | Path to tokenizer.json |
| `-text` | `"Hello world"` | Text to synthesize |
| `-pa, --prompt-audio` | — | Reference audio file for voice cloning (WAV/MP3) |
| `-pt, --prompt-text` | — | Transcript of the reference audio |
| `-o, --output` | `out.wav` | Output WAV file path |
| `-v, --vulkan` | `-1` (CPU) | Vulkan device index (-1 = CPU only) |
| `-c, --cuda` | `-1` (CPU) | CUDA device index (-1 = CPU only) |
| `-M, --metal` | — | Use Metal (Apple GPU, macOS only) |
| `-threads N` | `4` | Number of CPU threads |
| `-max-tokens N` | `1024` | Max tokens to generate |
| `--min-tokens-before-end N` | `0` | Minimum generated tokens before EOS is allowed; 0 matches fish-speech default behavior |
| `-temp F` | `0.8` | Sampling temperature |
| `-top-p F` | `0.8` | Top-p nucleus sampling |
| `-top-k N` | `30` | Top-k sampling |
| `--dynamic-normalize / --no-dynamic-normalize` | disabled | Enable or disable dynamic RMS normalization |
| `--trim-silence / --no-trim-silence` | disabled | Enable or disable trailing silence trimming on the saved WAV |
| `--normalize / --no-normalize` | disabled | Enable or disable peak normalization to 0.95 on the saved WAV |
| `--server` | — | Start HTTP server instead of CLI synthesis |
| `-H, --host` | `127.0.0.1` | Server bind address |
| `-P, --port` | `3030` | Server port |
Setting --min-tokens-before-end 0 matches the upstream fish-speech behavior. Non-zero values deliberately bias the model away from early EOS.
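A common way to implement such a constraint is to mask the EOS logit until enough tokens have been produced. The sketch below illustrates that idea; it is an assumption about the mechanism, not the engine's actual code:

```cpp
#include <limits>
#include <vector>

// Sketch: before sampling, suppress EOS while fewer than
// `min_tokens_before_end` tokens have been generated. With the
// default of 0, EOS is never masked (fish-speech behavior).
void mask_eos_if_too_early(std::vector<float>& logits, int eos_id,
                           int n_generated, int min_tokens_before_end) {
    if (n_generated < min_tokens_before_end) {
        logits[eos_id] = -std::numeric_limits<float>::infinity();
    }
}
```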
Start the server:

```bash
./build/s2 -m s2-pro-q6_k.gguf --server
# or with custom host/port:
./build/s2 -m s2-pro-q6_k.gguf --server -H 0.0.0.0 -P 8080
```

`POST /generate` — synthesize audio (`multipart/form-data`)
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | string | yes | Text to synthesize |
| `reference` | file | no | Reference audio file for voice cloning (WAV or MP3). Aliases: `reference_audio`, `prompt_audio`, `ref_audio` |
| `reference_text` | string | if reference audio is provided | Transcript of the reference audio. Aliases: `ref_text`, `prompt_text` |
| `params` | JSON string | no | Generation params: `max_new_tokens`, `temperature`, `top_p`, `top_k`, `min_tokens_before_end`, `n_threads`, `verbose` |
Returns `audio/wav`.
```bash
# Basic
curl -X POST http://127.0.0.1:3030/generate \
  --form "text=Hello world" \
  --form 'params={"max_new_tokens":512,"temperature":0.58,"top_p":0.88,"top_k":40}' \
  -o output.wav

# With voice cloning
curl -X POST http://127.0.0.1:3030/generate \
  --form "reference=@reference.wav" \
  --form "reference_text=Transcript of the reference." \
  --form "text=Text to synthesize in that voice." \
  --form 'params={"max_new_tokens":512,"temperature":0.58,"top_p":0.88,"top_k":40}' \
  -o output.wav

# Same request using the accepted aliases
curl -X POST http://127.0.0.1:3030/generate \
  --form "reference_audio=@reference.wav" \
  --form "ref_text=Transcript of the reference." \
  --form "text=Text to synthesize in that voice." \
  -o output.wav
```

The engine can be built as a shared library (.so / .dll) with a C-compatible export API for integration into other applications:
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_BUILD_SHARED_LIBRARIES=ON
cmake --build build --parallel $(nproc)
```

This produces `s2_dll` (shared) and `s2_lib` (static) alongside the CLI executable. When built as a shared library on Windows, non-essential console output is suppressed automatically via `s2_config.h`.
The export API (`s2_export_api.h` / `s2_export_api.cpp`, contributed by @subspecs) exposes a modular C interface with the following capabilities:
- Separate lifecycle management — model, tokenizer, codec, and pipeline can be loaded independently and reused across multiple synthesis calls without reloading
- Text-to-speech with or without voice cloning — synthesize from text alone, or provide a reference audio file and transcript for voice cloning
- Audio output flexibility — results can be saved to a WAV file, retrieved as a raw `float*` sample buffer in memory, or both simultaneously
- Precomputed reference codes — reference audio can be encoded once and reused across generations, avoiding repeated encoding overhead when synthesizing multiple utterances with the same voice
- Cross-platform visibility — `S2_Export` uses `__declspec(dllexport)` on Windows and `__attribute__((visibility("default")))` on GCC/Clang, so the same headers work across platforms
Exported functions cover allocation/release (`AllocS2*` / `ReleaseS2*`), initialization (`InitializeS2*`), tokenizer-model config sync (`SyncS2TokenizerConfigFromS2Model`), reference audio pre-processing (`InitializeAudioPromptCodes`), raw audio buffer access (`AllocS2AudioBuffer` / `GetS2AudioBufferDataPointer`), and the main synthesis entry point (`S2Synthesize`).
A community-maintained C# wrapper targeting .NET Standard 2.1 (Unity-compatible) is available at subspecs/FishS2Sharp.
| VRAM available | Recommended model |
|---|---|
| ≥ 10 GB | q8_0 — near-lossless quality |
| 6–9 GB | q6_k — good quality/size balance |
| 5–7 GB | q4_k_m — best compact variant in current quick validation |
| < 5 GB | q3_k or q2_k — experimental, quality drops faster |
VRAM usage at runtime is roughly on the order of the model size, but actual usage depends on backend buffers, KV cache length, and allocator overhead. The audio codec executes on CPU during inference.
Note for CUDA users: Quantized models (`q4_k_m`, `q3_k`, `q2_k`) fall back to CPU compute on CUDA due to the `ggml_get_rows` limitation. Use Vulkan (`-v`) for GPU acceleration with these models, or use `q8_0`/`f16`, which work natively on CUDA.
S2 Pro uses a Dual-AR architecture:
- Slow-AR — a 36-layer Qwen3-based transformer (4.13B params) that processes the full token sequence with GQA (32 heads, 8 KV heads), RoPE at 1M base, QK norm, and a persistent KV cache
- Fast-AR — a 4-layer transformer (0.42B params) that autoregressively generates 10 acoustic codebook tokens from the Slow-AR hidden state for each semantic step
- Audio codec — a convolutional encoder/decoder with residual vector quantization (RVQ, 10 codebooks × 4096 entries) that converts between audio waveforms and discrete codes
Total: ~4.56B parameters.
The C++ engine (src/) is built on ggml v0.9.11 (pinned as a submodule). Key design decisions:
- Backend-aware weight allocation — the engine detects the active backend at load time. Vulkan and Metal can store all weights on GPU. CUDA falls back to CPU for quantized models to avoid the `ggml_get_rows` unsupported-type crash. For non-CUDA backends, all weights go to GPU for maximum throughput.
- Separate persistent `gallocr` allocators for Slow-AR and Fast-AR — each path keeps its own compute buffer, avoiding memory re-planning per token
- Temporary prefill allocator — freed immediately after prefill, so the large compute buffer does not persist into the generation loop
- Codec on CPU — the audio codec executes once per synthesis (decode only) or twice when a reference audio is provided (encode reference + decode output), so running it on CPU has zero impact on generation throughput
- Graceful CUDA fallback — if CUDA initialization fails, the engine falls back to CPU automatically with a warning instead of aborting
- RAS (Rejection-Augmented Sampling) — the generation loop includes a sliding-window repetition detector that resamples with higher temperature when tokens repeat within a 10-token window, reducing stuck-loop failures
- `posix_fadvise(DONTNEED)` after loading the weights (Linux only) — advises the kernel to drop the GGUF file from the page cache once the tensors are already in the backend buffer, reducing duplicate RAM use
- Correct ByteLevel tokenization — the GPT-2 byte-to-unicode table is applied before BPE, producing token IDs identical to the HuggingFace reference tokenizer
- Thread-safe pipeline — `Pipeline::synthesize_raw` is guarded by a mutex, allowing safe concurrent access from the HTTP server or external callers
- HTTP server — built-in REST server (`--server`) with multipart form support for synthesis and voice cloning, powered by `cpp-httplib`
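The RAS repetition check described above can be sketched as a small sliding-window detector. The 10-token window comes from the description; the class and its interface are illustrative, not the engine's actual implementation:

```cpp
#include <cstdint>
#include <deque>

// Sliding-window repetition detector: if a candidate token already
// appears in the last `window` tokens, the caller resamples it at a
// higher temperature instead of accepting it.
class RepetitionWindow {
public:
    explicit RepetitionWindow(size_t window = 10) : window_(window) {}

    bool is_repeat(int32_t token) const {
        for (int32_t t : recent_) if (t == token) return true;
        return false;
    }

    void push(int32_t token) {
        recent_.push_back(token);
        if (recent_.size() > window_) recent_.pop_front();  // evict oldest
    }

private:
    size_t window_;
    std::deque<int32_t> recent_;
};
```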
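The ByteLevel step relies on the standard GPT-2 byte-to-unicode table, which can be reproduced in a few lines. This is the well-known GPT-2 construction, shown for reference rather than taken from src/s2_tokenizer.cpp:

```cpp
#include <array>
#include <cstdint>

// GPT-2 byte-to-unicode table: printable/latin bytes map to their own
// code points, and the remaining 68 bytes are remapped to 256, 257, ...
// so every byte becomes a visible character before BPE runs
// (e.g. space 0x20 -> U+0120 "Ġ", newline 0x0A -> U+010A "Ċ").
std::array<uint32_t, 256> bytes_to_unicode() {
    std::array<uint32_t, 256> table{};
    uint32_t next = 256;
    for (int b = 0; b < 256; ++b) {
        bool kept = (b >= 33 && b <= 126) ||
                    (b >= 161 && b <= 172) ||
                    (b >= 174 && b <= 255);
        table[b] = kept ? static_cast<uint32_t>(b) : next++;
    }
    return table;
}
```

Getting this table exactly right is what makes the C++ token IDs match the HuggingFace reference tokenizer byte for byte.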
| File | Purpose |
|---|---|
| `src/main.cpp` | CLI argument parsing and entry point |
| `src/s2_model.cpp` | Dual-AR model loading, weight allocation, prefill/step/fast_decode with KV cache |
| `src/s2_codec.cpp` | Audio codec (encoder + decoder) with snake activations, ConvNext blocks, RVQ |
| `src/s2_generate.cpp` | Generation loop with RAS, semantic masking, fast codebook decoding |
| `src/s2_pipeline.cpp` | Orchestration: tokenizer + model + codec, audio I/O, mutex-guarded synthesis |
| `src/s2_tokenizer.cpp` | Qwen3 BPE tokenizer with ByteLevel pre-tokenization |
| `src/s2_sampler.cpp` | Temperature, top-p, top-k sampling |
| `src/s2_prompt.cpp` | Prompt tensor construction for semantic and codebook tokens |
| `src/s2_audio.cpp` | WAV read/write, silence trimming, normalization |
| `src/s2_server.cpp` | HTTP server with /generate endpoint |
| `src/s2_export_api.cpp` | C-exported shared library API |
Voice quality and amplitude tend to degrade after ~800 tokens (~37 s of audio). For longer texts, split into sentences and concatenate the resulting WAV files. Optional post-processing flags such as `--dynamic-normalize`, `--normalize`, and `--trim-silence` can help clean up the result, but splitting remains the most reliable approach.
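A minimal sketch of the sentence-splitting workaround. This naive punctuation-based splitter is a hypothetical helper, not part of the engine; real text may need smarter handling of abbreviations and decimals:

```cpp
#include <string>
#include <vector>

// Split long input into sentences so each synthesis call stays well
// under the ~800-token degradation point. Splits on . ! ? and trims
// leading spaces from each resulting sentence.
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> out;
    std::string cur;
    auto flush = [&] {
        size_t start = cur.find_first_not_of(' ');
        if (start != std::string::npos) out.push_back(cur.substr(start));
        cur.clear();
    };
    for (char c : text) {
        cur += c;
        if (c == '.' || c == '!' || c == '?') flush();
    }
    flush();  // keep any trailing fragment without end punctuation
    return out;
}
```

Each returned sentence can then be synthesized separately and the WAV files concatenated.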
- Voice cloning quality depends heavily on reference audio length and SNR
- CUDA backend falls back to CPU compute for quantized models (Q2_K through Q6_K) — Vulkan works natively with all quantization types
- Windows: CUDA and Vulkan backends are supported; when using MSVC 2019+, ensure CUDA ≥ 12.4 is installed before building
- macOS: Metal backend is tested and works on Apple Silicon, but generation is noticeably slower than Vulkan/CUDA equivalents
The model weights and associated materials are licensed under the Fish Audio Research License. Key points:
- Research and non-commercial use: free, under the terms of this Agreement
- Commercial use: requires a separate written license from Fish Audio
- When distributing, you must include a copy of the license and the attribution notice
- Attribution: "This model is licensed under the Fish Audio Research License, Copyright © 39 AI, INC. All Rights Reserved."
Full license: LICENSE.md
Commercial licensing: https://fish.audio · business@fish.audio
The inference engine source code (src/) is a Derivative Work of the Fish Audio Materials as defined in the Agreement and is distributed under the same Fish Audio Research License terms.