T5Gemma-TTS


A Japanese version of this README (日本語版 README) is also available.

[Architecture diagram]

Training and inference code for T5Gemma-TTS, a multilingual Text-to-Speech (TTS) model built on an encoder-decoder LLM architecture. This repository provides scripts for data preprocessing, training (including LoRA fine-tuning), and inference.

For model details, audio samples, and technical information, please refer to the model card.

Features

  • Multilingual TTS: Supports English, Chinese, and Japanese
  • Voice Cloning: Zero-shot voice cloning from reference audio
  • Duration Control: Explicit control over generated audio length (auto-estimation when not specified)
  • Flexible Training: Full training, fine-tuning, and LoRA fine-tuning support
  • Multiple Inference Options: Command-line, HuggingFace format, and Gradio UI

Installation

git clone https://github.com/Aratako/T5Gemma-TTS.git
cd T5Gemma-TTS
pip install -r requirements.txt

Note: For GPU support, install PyTorch with CUDA before running pip install:

pip install "torch<=2.8.0" torchaudio --index-url https://download.pytorch.org/whl/cu128

Known Issues

  • Windows: On some native Windows environments, inference may exhibit unstable behavior such as inconsistent generation times or occasional hangs. This issue has been observed in my testing, but the root cause is still under investigation. If you experience similar problems, consider using WSL2 or Docker as a workaround.

Tested Environments

The following environments have been tested by the developer. Other configurations may work but are not guaranteed.

  • Linux + CUDA: ✅ Tested
  • Windows + CUDA (Docker): ✅ Tested (native Windows has known issues)
  • Apple Silicon (MPS): ✅ Tested (M4 Max MacBook Pro)

Quick Start

Basic Inference (HuggingFace Format)

python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "Hello, this is a test of the text to speech system."

Voice Cloning

python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "Hello, this is a test of the text to speech system." \
    --reference_text "This is a reference." \
    --reference_speech path/to/reference.wav

Duration Control

# Specify target duration in seconds
python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "Hello, this is a test of the text to speech system." \
    --target_duration 5.0

Note: If --target_duration is not specified, the system automatically calculates an appropriate duration based on phoneme count and language-specific pacing rules. This calculation is approximate, so if the result isn't as expected, try specifying the duration manually.

Inference

Using HuggingFace Format

python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "The quick brown fox jumps over the lazy dog." \
    --output_dir ./generated_tts

Using .pth Checkpoint

python inference_commandline.py \
    --model_root . \
    --model_name trained \
    --target_text "The quick brown fox jumps over the lazy dog."

For LoRA checkpoints:

python inference_commandline.py \
    --model_root . \
    --model_name lora \
    --target_text "The quick brown fox jumps over the lazy dog."

Gradio Web UI

python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --port 7860

Batch generation (single GPU)

  • Use --max_batch to set the UI slider limit (default: 4); see the launch example below.
  • When batch_count > 1, the demo uses inference_tts_batch if available (or auto-patches it for some older remote-code checkpoints). If you still see "Model lacks inference_tts_batch", it falls back to sequential generation.
  • Note: batch generation primarily improves throughput (multiple samples per request); it does not guarantee lower latency per sample.
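For example, a launch that raises the batch slider limit might look like this (the value 8 is illustrative; --max_batch and the other flags are documented above):

python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --max_batch 8 \
    --port 7860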

Sentence segmentation (parallel segments)

  • In the UI, enable "Enable Sentence Segmentation / 文分割を有効化" to split Target Text by sentence delimiters (Japanese/English punctuation + newlines).
  • If the model has inference_tts_batch_multi_text (or can be auto-patched for some older remote-code checkpoints), segments are generated in parallel; otherwise it falls back to sequential per-segment generation.
  • --max_segments controls how many segments are shown in the UI (display limit only). When batch_count > 1, only the first batch's per-segment audio is displayed; concatenated outputs are provided for each batch.
  • --inter_segment_silence controls the default silence (seconds) inserted between segments when concatenating (default: 1.0); see the launch example below. This can also be adjusted via the UI slider (0.0–2.0s in 0.1s steps). Leading/trailing near-silence is trimmed from each segment (and the final output) before/after concatenation, then a fixed 0.5s silence is prepended to the beginning.
  • If the input text contains only a single sentence, segmentation is automatically disabled and standard inference is used instead.
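For example, segmentation-related defaults can be set when launching the demo (the values shown are illustrative; both flags are documented above):

python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --max_segments 10 \
    --inter_segment_silence 0.5 \
    --port 7860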

Performance toggles (Gradio)

  • T5GEMMA_DETERMINISTIC=1: enable deterministic cuDNN settings (usually slower). Default is speed-oriented.
  • T5GEMMA_EMPTY_CACHE=1: call torch.cuda.empty_cache() after inference (helps low-VRAM setups, often slower). Both toggles can be combined, as shown below.
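For example, a deterministic, cache-clearing launch might look like this (a sketch using only the environment variables and flags documented in this README):

T5GEMMA_DETERMINISTIC=1 T5GEMMA_EMPTY_CACHE=1 python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --port 7860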

By default, an XCodec2 variant (NandemoGHS/Anime-XCodec2-44.1kHz-v2) is used for audio decoding to better support Japanese voices. For English and Chinese voices, I recommend using the original XCodec2 model.

# You must use the original xcodec2 library when using the original XCodec2 model
pip install xcodec2==0.1.5 --no-deps

python inference_gradio.py \
    --model_dir t5gemma_voice_hf \
    --xcodec2_model_name HKUSTAudio/xcodec2 \
    --xcodec2_sample_rate 16000 \
    --port 7860

Low-VRAM options (Gradio / HF inference)

  • --cpu_codec: run XCodec2 tokenizer on CPU. Reduces VRAM use by roughly 3.5 GB; audio encode/decode becomes slower.
  • --cpu_whisper: run Whisper (auto-transcribe path) on CPU. Reduces VRAM use by roughly 5 GB; transcription slows down.
  • --low_vram: preset that enables both flags and disables torch.compile.

These switches don’t change model quality; they only trade GPU memory for a bit of latency on first runs.
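For example, the preset can be enabled with a single flag (a sketch using only the documented options):

python inference_gradio.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --low_vram \
    --port 7860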

Docker (Recommended for Windows users)

If you experience issues on Windows, Docker provides a stable Linux environment:

# Using docker-compose (recommended)
docker compose up --build

# Or build and run manually
docker build -t t5gemma-tts .
docker run --gpus all -p 7860:7860 t5gemma-tts

Docker Configuration Options

You can customize the Docker setup using environment variables:

# Specify CUDA version (cu118, cu121, cu124, cu128)
CUDA_VERSION=cu121 docker compose up --build

# Use a different model
MODEL_DIR=your-org/your-model docker compose up

# Change the port
PORT=8080 docker compose up

# Pass additional arguments
EXTRA_ARGS="--no_compile --share" docker compose up

Inference Parameters

  • --target_text (required): Text to synthesize
  • --reference_speech (default: None): Path to reference audio for voice cloning
  • --reference_text (default: None): Transcript of the reference audio (auto-transcribed via Whisper if not provided)
  • --target_duration (default: None): Target audio duration in seconds (auto-estimated if not provided)
  • --top_k (default: 30): Top-k sampling parameter
  • --top_p (default: 0.9): Top-p (nucleus) sampling parameter
  • --temperature (default: 0.8): Sampling temperature
  • --seed (default: 1): Random seed for reproducibility
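For example, the sampling parameters can be combined with any of the inference commands above (the values here are illustrative, not tuned recommendations):

python inference_commandline_hf.py \
    --model_dir Aratako/T5Gemma-TTS-2b-2b \
    --target_text "The quick brown fox jumps over the lazy dog." \
    --top_k 50 \
    --top_p 0.95 \
    --temperature 0.7 \
    --seed 42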

Training

Data Preprocessing

Prepare training data using the preprocessing scripts. Example with Emilia-YODAS English subset:

python examples/data_preprocess/prepare_emilia_en.py \
    --output-dir datasets/emilia-yodas-en_0-9 \
    --data-files '{"train": "Emilia-YODAS/**/EN-B00000*.tar"}' \
    --encoder-devices auto \
    --valid-ratio 0.005 \
    --hf-num-proc 8

This generates:

  • text/ - Text transcripts
  • xcodec2_1cb/ - XCodec2 audio tokens
  • manifest_final/ - Train/validation manifests
  • neighbors/ - Neighbor files for voice prompting

Training from Scratch

NUM_GPUS=8 examples/training/t5gemma_2b-2b.sh

Fine-tuning a Pretrained Model

Full fine-tuning:

NUM_GPUS=8 examples/training/t5gemma_2b-2b-ft.sh

LoRA fine-tuning:

NUM_GPUS=1 examples/training/t5gemma_2b-2b-ft-lora.sh

Training Configuration

Key training parameters (see training scripts for full configuration):

  • --t5gemma_model_name: Base T5Gemma model (e.g., google/t5gemma-2b-2b-ul2)
  • --xcodec2_model_name: Audio codec model
  • --lr: Learning rate (default: 0.035 for ScaledAdam)
  • --gradient_accumulation_steps: Gradient accumulation steps
  • --use_lora: Enable LoRA training (1 to enable)
  • --lora_rank: LoRA rank (default: 8)
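As a rough illustration only (the provided shell scripts are the supported entry points; this sketch assumes main.py, the training entry point, accepts these flags directly, and the gradient accumulation value is arbitrary), a LoRA fine-tuning configuration could combine the parameters above like this:

# Illustrative sketch, not a supported invocation
python main.py \
    --t5gemma_model_name google/t5gemma-2b-2b-ul2 \
    --xcodec2_model_name NandemoGHS/Anime-XCodec2-44.1kHz-v2 \
    --lr 0.035 \
    --gradient_accumulation_steps 4 \
    --use_lora 1 \
    --lora_rank 8

In practice, adjusting the values inside examples/training/t5gemma_2b-2b-ft-lora.sh is likely the simpler route.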

Model Conversion

Convert .pth to HuggingFace Format

Standard checkpoint:

python scripts/export_t5gemma_voice_hf.py \
    --ckpt trained.pth \
    --out t5gemma_voice_hf \
    --base_repo google/t5gemma-2b-2b-ul2

LoRA checkpoint (merge adapters):

python scripts/export_t5gemma_voice_hf_lora.py \
    --ckpt lora.pth \
    --out t5gemma_voice_hf_lora_merged \
    --base_repo google/t5gemma-2b-2b-ul2 \
    --save_adapter_dir lora-adapter

Project Structure

T5Gemma-TTS/
├── main.py                      # Training entry point
├── inference_commandline.py     # CLI inference (.pth format)
├── inference_commandline_hf.py  # CLI inference (HuggingFace format)
├── inference_gradio.py          # Gradio web demo
├── config.py                    # Configuration and arguments
├── requirements.txt             # Dependencies
│
├── models/                      # Model architecture
│   └── t5gemma.py               # T5GemmaVoiceModel with PM-RoPE
│
├── data/                        # Data loading
│   ├── combined_dataset.py      # Multi-dataset loader
│   └── tokenizer.py             # AudioTokenizer (XCodec2)
│
├── steps/                       # Training infrastructure
│   ├── trainer.py               # Distributed trainer
│   └── optim.py                 # ScaledAdam optimizer
│
├── scripts/                     # Utility scripts
│   ├── export_t5gemma_voice_hf.py      # Export to HF format
│   └── export_t5gemma_voice_hf_lora.py # Export LoRA to HF format
│
├── hf_export/                   # HuggingFace model wrapper
│   ├── configuration_t5gemma_voice.py
│   └── modeling_t5gemma_voice.py
│
└── examples/
    ├── training/                # Training shell scripts
    │   ├── t5gemma_2b-2b.sh           # Train from scratch
    │   ├── t5gemma_2b-2b-ft.sh        # Full fine-tuning
    │   └── t5gemma_2b-2b-ft-lora.sh   # LoRA fine-tuning
    └── data_preprocess/         # Data preprocessing
        └── prepare_emilia_en.py       # Emilia English preparation

Limitations

  • Inference Speed: The model is not optimized for real-time TTS applications. Autoregressive generation of audio tokens takes significant time, making it unsuitable for low-latency use cases.
  • Duration Control: While the model supports explicit duration specification, control is not perfect. Generated audio may differ from the specified duration, and even when the duration matches, the speech pacing or naturalness may not always be optimal.
  • Audio Quality: Quality depends on training data characteristics. Performance may vary for voices, accents, or speaking styles underrepresented in the training data.

License

This project is released under the MIT License.
Acknowledgments

This project builds upon the following works:

Citation

@misc{t5gemma-tts,
  author = {Aratako},
  title = {T5Gemma-TTS: An Encoder-Decoder LLM-based TTS Model},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Aratako/T5Gemma-TTS}}
}
