T5Gemma-TTS


A Japanese version of this README (日本語版 README) is also available.

[Architecture diagram]

Training and inference code for T5Gemma-TTS, a multilingual Text-to-Speech (TTS) model built on an encoder-decoder LLM architecture. This repository provides scripts for data preprocessing, training (including LoRA fine-tuning), and inference.

For model details, audio samples, and technical information, please refer to the model card.

Features

  • Multilingual TTS: Supports English, Chinese, and Japanese
  • Voice Cloning: Zero-shot voice cloning from reference audio
  • Duration Control: Explicit control over generated audio length (auto-estimation when not specified)
  • Flexible Training: Full training, fine-tuning, and LoRA fine-tuning support
  • Multiple Inference Options: Command-line, HuggingFace format, and Gradio UI

Installation

git clone https://github.com/Aratako/T5Gemma-TTS.git
cd T5Gemma-TTS
pip install -r requirements.txt

Note: For GPU support, install PyTorch with CUDA before running pip install:

pip install "torch<=2.8.0" torchaudio --index-url https://download.pytorch.org/whl/cu128

Known Issues

  • Windows: On some native Windows environments, inference may exhibit unstable behavior such as inconsistent generation times or occasional hangs. This issue has been observed in my testing, but the root cause is still under investigation. If you experience similar problems, consider using WSL2 or Docker as a workaround.

Tested Environments

The following environments have been tested by the developer. Other configurations may work but are not guaranteed.

  • Linux + CUDA: ✅ Tested
  • Windows + CUDA (Docker): ✅ Tested (native Windows has known issues)
  • Apple Silicon (MPS): ✅ Tested (M4 Max MacBook Pro)

Quick Start

Basic Inference (HuggingFace Format)

python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "Hello, this is a test of the text to speech system."

Voice Cloning

python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "Hello, this is a test of the text to speech system." \
    --reference_text "This is a reference." \
    --reference_speech path/to/reference.wav

Duration Control

# Specify target duration in seconds
python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "Hello, this is a test of the text to speech system." \
    --target_duration 5.0

Note: If --target_duration is not specified, the system automatically calculates an appropriate duration based on phoneme count and language-specific pacing rules. This calculation is approximate, so if the result isn't as expected, try specifying the duration manually.

Inference

Using HuggingFace Format

python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "The quick brown fox jumps over the lazy dog." \
    --output_dir ./generated_tts

Using .pth Checkpoint

python inference_commandline.py \
    --model_root . \
    --model_name trained \
    --target_text "The quick brown fox jumps over the lazy dog."

For LoRA checkpoints:

python inference_commandline.py \
    --model_root . \
    --model_name lora \
    --target_text "The quick brown fox jumps over the lazy dog."

Gradio Web UI

python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --port 7860

Batch generation (single GPU)

  • Use --max_batch to set the UI slider limit (default: 4); see the launch example below.
  • When batch_count > 1, the demo uses inference_tts_batch if available (or auto-patches it for some older remote-code checkpoints). If you still see "Model lacks inference_tts_batch", it falls back to sequential generation.
  • Note: batch generation primarily improves throughput (multiple samples per request); it does not guarantee lower latency per sample.
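For example, a launch that raises the batch slider limit might look like this (the value 8 is illustrative; --max_batch and the other flags are documented above):

python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --max_batch 8 \
    --port 7860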

Sentence segmentation (parallel segments)

  • In the UI, enable "Enable Sentence Segmentation / 文分割を有効化" to split Target Text by sentence delimiters (Japanese/English punctuation + newlines).
  • If the model has inference_tts_batch_multi_text (or can be auto-patched for some older remote-code checkpoints), segments are generated in parallel; otherwise it falls back to sequential per-segment generation.
  • --max_segments controls how many segments are shown in the UI (display limit only). When batch_count > 1, only the first batch's per-segment audio is displayed; concatenated outputs are provided for each batch.
  • --inter_segment_silence controls the default silence (seconds) inserted between segments when concatenating (default: 1.0); see the launch example below. This can also be adjusted via the UI slider (0.0–2.0s in 0.1s steps). Leading/trailing near-silence is trimmed from each segment (and the final output) before/after concatenation, then a fixed 0.5s silence is prepended to the beginning.
  • If the input text contains only a single sentence, segmentation is automatically disabled and standard inference is used instead.
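For example, segmentation-related defaults can be set when launching the demo (the values shown are illustrative; both flags are documented above):

python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --max_segments 10 \
    --inter_segment_silence 0.5 \
    --port 7860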

Performance toggles (Gradio)

  • T5GEMMA_DETERMINISTIC=1: enable deterministic cuDNN settings (usually slower). Default is speed-oriented.
  • T5GEMMA_EMPTY_CACHE=1: call torch.cuda.empty_cache() after inference (helps low-VRAM setups, often slower). Both toggles can be combined, as shown below.
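For example, a deterministic, cache-clearing launch might look like this (a sketch using only the environment variables and flags documented in this README):

T5GEMMA_DETERMINISTIC=1 T5GEMMA_EMPTY_CACHE=1 python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --port 7860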

By default, an XCodec2 variant (NandemoGHS/Anime-XCodec2-44.1kHz-v2) is used for audio decoding to better support Japanese voices. For English and Chinese voices, I recommend using the original XCodec2 model.

# You must use the original xcodec2 library when using the original XCodec2 model
pip install xcodec2==0.1.5 --no-deps

python inference_gradio.py \
    --model_dir t5gemma_voice_hf \
    --xcodec2_model_name HKUSTAudio/xcodec2 \
    --xcodec2_sample_rate 16000 \
    --port 7860

Low-VRAM options (Gradio / HF inference)

  • --cpu_codec: run XCodec2 tokenizer on CPU. Reduces VRAM use by roughly 3.5 GB; audio encode/decode becomes slower.
  • --cpu_whisper: run Whisper (auto-transcribe path) on CPU. Reduces VRAM use by roughly 5 GB; transcription slows down.
  • --low_vram: preset that enables both flags and disables torch.compile.

These switches don’t change model quality; they only trade GPU memory for a bit of latency on first runs.
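For example, the preset can be enabled with a single flag (a sketch using only the documented options):

python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --low_vram \
    --port 7860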

Docker (Recommended for Windows users)

If you experience issues on Windows, Docker provides a stable Linux environment:

# Using docker-compose (recommended)
docker compose up --build

# Or build and run manually
docker build -t t5gemma-tts .
docker run --gpus all -p 7860:7860 t5gemma-tts

Docker Configuration Options

You can customize the Docker setup using environment variables:

# Specify CUDA version (cu118, cu121, cu124, cu128)
CUDA_VERSION=cu121 docker compose up --build

# Use a different model
MODEL_DIR=your-org/your-model docker compose up

# Change the port
PORT=8080 docker compose up

# Pass additional arguments
EXTRA_ARGS="--no_compile --share" docker compose up

Inference Parameters

  • --target_text (required): Text to synthesize
  • --reference_speech (default: None): Path to reference audio for voice cloning
  • --reference_text (default: None): Transcript of the reference audio (auto-transcribed via Whisper if not provided)
  • --target_duration (default: None): Target audio duration in seconds (auto-estimated if not provided)
  • --top_k (default: 30): Top-k sampling parameter
  • --top_p (default: 0.9): Top-p (nucleus) sampling parameter
  • --temperature (default: 0.8): Sampling temperature
  • --seed (default: 1): Random seed for reproducibility
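For example, the sampling parameters can be combined with any of the inference commands above (the values here are illustrative, not tuned recommendations):

python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "The quick brown fox jumps over the lazy dog." \
    --top_k 50 \
    --top_p 0.95 \
    --temperature 0.7 \
    --seed 42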

Training

Data Preprocessing

Prepare training data using the preprocessing scripts. Example with Emilia-YODAS English subset:

python examples/data_preprocess/prepare_emilia_en.py \
    --output-dir datasets/emilia-yodas-en_0-9 \
    --data-files '{"train": "Emilia-YODAS/**/EN-B00000*.tar"}' \
    --encoder-devices auto \
    --valid-ratio 0.005 \
    --hf-num-proc 8

This generates:

  • text/ - Text transcripts
  • xcodec2_1cb/ - XCodec2 audio tokens
  • manifest_final/ - Train/validation manifests
  • neighbors/ - Neighbor files for voice prompting

Training from Scratch

NUM_GPUS=8 examples/training/t5gemma_2b-2b.sh

Fine-tuning a Pretrained Model

Full fine-tuning:

NUM_GPUS=8 examples/training/t5gemma_2b-2b-ft.sh

LoRA fine-tuning:

NUM_GPUS=1 examples/training/t5gemma_2b-2b-ft-lora.sh

Training Configuration

Key training parameters (see training scripts for full configuration):

  • --t5gemma_model_name: Base T5Gemma model (e.g., google/t5gemma-2b-2b-ul2)
  • --xcodec2_model_name: Audio codec model
  • --lr: Learning rate (default: 0.035 for ScaledAdam)
  • --gradient_accumulation_steps: Gradient accumulation steps
  • --use_lora: Enable LoRA training (1 to enable)
  • --lora_rank: LoRA rank (default: 8)
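As a rough illustration only (the provided shell scripts are the supported entry points; this sketch assumes main.py, the training entry point, accepts these flags directly, and the gradient accumulation value is arbitrary), a LoRA fine-tuning configuration could combine the parameters above like this:

# Illustrative sketch, not a supported invocation
python main.py \
    --t5gemma_model_name google/t5gemma-2b-2b-ul2 \
    --xcodec2_model_name NandemoGHS/Anime-XCodec2-44.1kHz-v2 \
    --lr 0.035 \
    --gradient_accumulation_steps 4 \
    --use_lora 1 \
    --lora_rank 8

In practice, adjusting the values inside examples/training/t5gemma_2b-2b-ft-lora.sh is likely the simpler route.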

Model Conversion

Convert .pth to HuggingFace Format

Standard checkpoint:

python scripts/export_t5gemma_voice_hf.py \
    --ckpt trained.pth \
    --out t5gemma_voice_hf \
    --base_repo google/t5gemma-2b-2b-ul2

LoRA checkpoint (merge adapters):

python scripts/export_t5gemma_voice_hf_lora.py \
    --ckpt lora.pth \
    --out t5gemma_voice_hf_lora_merged \
    --base_repo google/t5gemma-2b-2b-ul2 \
    --save_adapter_dir lora-adapter

Project Structure

T5Gemma-TTS/
├── main.py                      # Training entry point
├── inference_commandline.py     # CLI inference (.pth format)
├── inference_commandline_hf.py  # CLI inference (HuggingFace format)
├── inference_gradio.py          # Gradio web demo
├── config.py                    # Configuration and arguments
├── requirements.txt             # Dependencies
│
├── models/                      # Model architecture
│   └── t5gemma.py               # T5GemmaVoiceModel with PM-RoPE
│
├── data/                        # Data loading
│   ├── combined_dataset.py      # Multi-dataset loader
│   └── tokenizer.py             # AudioTokenizer (XCodec2)
│
├── steps/                       # Training infrastructure
│   ├── trainer.py               # Distributed trainer
│   └── optim.py                 # ScaledAdam optimizer
│
├── scripts/                     # Utility scripts
│   ├── export_t5gemma_voice_hf.py      # Export to HF format
│   └── export_t5gemma_voice_hf_lora.py # Export LoRA to HF format
│
├── hf_export/                   # HuggingFace model wrapper
│   ├── configuration_t5gemma_voice.py
│   └── modeling_t5gemma_voice.py
│
└── examples/
    ├── training/                # Training shell scripts
    │   ├── t5gemma_2b-2b.sh           # Train from scratch
    │   ├── t5gemma_2b-2b-ft.sh        # Full fine-tuning
    │   └── t5gemma_2b-2b-ft-lora.sh   # LoRA fine-tuning
    └── data_preprocess/         # Data preprocessing
        └── prepare_emilia_en.py       # Emilia English preparation

Limitations

  • Inference Speed: The model is not optimized for real-time TTS applications. Autoregressive generation of audio tokens takes significant time, making it unsuitable for low-latency use cases.
  • Duration Control: While the model supports explicit duration specification, control is not perfect. Generated audio may differ from the specified duration, and even when the duration matches, the speech pacing or naturalness may not always be optimal.
  • Audio Quality: Quality depends on training data characteristics. Performance may vary for voices, accents, or speaking styles underrepresented in the training data.

License

This project is released under the MIT License.
Acknowledgments

This project builds upon the following works:

Citation

@misc{t5gemma-tts,
  author = {Aratako},
  title = {T5Gemma-TTS: An Encoder-Decoder LLM-based TTS Model},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Aratako/T5Gemma-TTS}}
}
