Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as Bark, MMS, VITS and SpeechT5.

You can easily generate audio using the "text-to-audio" pipeline (or its alias - "text-to-speech"). Some models, like Bark, can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.

Here’s an example of how you would use the "text-to-speech" pipeline with Bark:

```
from transformers import pipeline
from IPython.display import Audio

pipe = pipeline("text-to-speech", model="suno/bark-small")
text = "[clears throat] This is a test ... and I just took a long pause."
output = pipe(text)
Audio(output["audio"], rate=output["sampling_rate"])
```

If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers are SpeechT5 and FastSpeech2Conformer, though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings.

The remainder of this guide illustrates how to:
1. Fine-tune SpeechT5, which was originally trained on English speech, on the Dutch (nl) language subset of the VoxPopuli dataset.
2. Use your fine-tuned model for inference in one of two ways: using a pipeline or directly.

# Libraries

In [1]:
!pip install -U datasets soundfile speechbrain accelerate

# Install 🤗 Transformers from source as not all the SpeechT5 features have been merged into an official release
!pip install git+https://github.com/huggingface/transformers.git


In [2]:
# To follow this guide you will need a GPU

# If you’re working in a notebook, uncomment the following line to check whether an NVIDIA GPU is available
#!nvidia-smi

# Or, for an AMD GPU
#!rocm-smi



In [3]:
import os
import torch
import matplotlib.pyplot as plt
from IPython.display import Audio
from dataclasses import dataclass
from collections import defaultdict
from typing import Any, Dict, List, Union
from accelerate.test_utils.testing import get_backend
from speechbrain.inference.classifiers import EncoderClassifier
from transformers import (
    pipeline,
    SpeechT5Processor,
    SpeechT5ForTextToSpeech,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,  # used in the Training section
)


# Data

VoxPopuli is a large-scale multilingual speech corpus consisting of data sourced from 2009-2020 European Parliament event recordings. It contains labelled audio-transcription data for 15 European languages. This guide uses the Dutch language subset, but feel free to pick another subset.

Note that VoxPopuli, or any other automatic speech recognition (ASR) dataset, may not be the most suitable option for training TTS models. The features that make it beneficial for ASR, such as excessive background noise, are typically undesirable in TTS. However, finding top-quality, multilingual, and multi-speaker TTS datasets can be quite challenging.

In [None]:
# Load data set
from datasets import load_dataset, Audio

# Check len == 20968 examples
dataset = load_dataset("facebook/voxpopuli", "nl", split="train")
len(dataset)

In [None]:
# SpeechT5 expects audio data to have a sampling rate of 16 kHz
# Make sure the examples in the dataset meet the requirement of 16kHz sampling rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Preprocessing

## Text Preprocessing

In [None]:
# Load appropriate tokenizer and clean up text
checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)
tokenizer = processor.tokenizer

The dataset examples contain ```raw_text``` and ```normalized_text``` features. When deciding which feature to use as the text input, consider that the SpeechT5 tokenizer doesn’t have any tokens for numbers. In ```normalized_text``` the numbers are written out as text. Thus, it is a better fit, and it is recommended to use ```normalized_text``` as input text.

Because SpeechT5 was trained on the English language, it may not recognize certain characters in the Dutch dataset. If left as is, these characters will be converted to ```<unk>``` tokens. However, in Dutch, certain characters like à are used to stress syllables. In order to preserve the meaning of the text, we can replace this character with a regular a.

To identify unsupported tokens, extract all unique characters in the dataset using the SpeechT5Tokenizer which works with characters as tokens. To do this, write the ```extract_all_chars``` mapping function that concatenates the transcriptions from all examples into one string and converts it to a set of characters. Make sure to set ```batched=True``` and ```batch_size=-1``` in ```dataset.map()``` so that all transcriptions are available at once for the mapping function.

In [None]:
def extract_all_chars(batch):
    all_text = " ".join(batch["normalized_text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}


vocabs = dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,
    keep_in_memory=True,
    remove_columns=dataset.column_names,
)

dataset_vocab = set(vocabs["vocab"][0])
tokenizer_vocab = {k for k, _ in tokenizer.get_vocab().items()}

In [None]:
# Identify unrecognised characters
print(dataset_vocab - tokenizer_vocab)

# Handle the unsupported characters identified in the previous step (manually in this case)
replacements = [
    ("à", "a"),
    ("ç", "c"),
    ("è", "e"),
    ("ë", "e"),
    ("í", "i"),
    ("ï", "i"),
    ("ö", "o"),
    ("ü", "u"),
]


def cleanup_text(inputs):
    for src, dst in replacements:
        inputs["normalized_text"] = inputs["normalized_text"].replace(src, dst)
    return inputs


dataset = dataset.map(cleanup_text)
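
To see both steps on a small scale, here is a minimal sketch in plain Python. The toy_vocab set and the sample sentences are made up for illustration; they only stand in for the real tokenizer vocabulary and the dataset transcriptions.

```python
# Hypothetical stand-in for the tokenizer vocabulary (not the real SpeechT5 vocabulary).
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ,.'")

# Made-up transcriptions standing in for the normalized_text column.
texts = ["de financiële crisis", "geïnspireerd door coördinatie", "à la carte"]

# Same replacement pairs as in the cell above.
replacements = [
    ("à", "a"), ("ç", "c"), ("è", "e"), ("ë", "e"),
    ("í", "i"), ("ï", "i"), ("ö", "o"), ("ü", "u"),
]


def find_unsupported(samples, vocab):
    """Return the characters that occur in the samples but not in the vocabulary."""
    return set("".join(samples)) - vocab


def cleanup(text):
    """Replace each unsupported character with its plain counterpart."""
    for src, dst in replacements:
        text = text.replace(src, dst)
    return text


unsupported_before = find_unsupported(texts, toy_vocab)
cleaned = [cleanup(t) for t in texts]
unsupported_after = find_unsupported(cleaned, toy_vocab)

print(sorted(unsupported_before))
print(cleaned)
print(unsupported_after)
```

After cleanup, the set of unsupported characters should be empty. If it is not, the replacement list is missing an entry for your data.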

## Audio Preprocessing

### Multiple speaker identification

In [None]:
# How many speakers are represented in the dataset?
speaker_counts = defaultdict(int)

for speaker_id in dataset["speaker_id"]:
    speaker_counts[speaker_id] += 1

In [None]:
# How many examples are there for each speaker?
plt.figure()
plt.hist(speaker_counts.values(), bins=20)
plt.ylabel("Speakers")
plt.xlabel("Examples")
plt.show()

The histogram reveals that approximately one-third of the speakers in the dataset have fewer than 100 examples, while around ten speakers have more than 500 examples. To improve training efficiency and balance the dataset, we can limit the data to speakers with between 100 and 400 examples.

You should be left with just under 10,000 examples from approximately 40 unique speakers, which should be sufficient.
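
The filtering rule can be sketched on made-up data as follows. The speaker IDs and counts are invented for illustration, and the sketch uses collections.Counter instead of the defaultdict from the notebook cell, which is an equivalent way to count.

```python
from collections import Counter

# Made-up speaker IDs, one entry per example (not taken from VoxPopuli).
toy_speaker_ids = ["spk_a"] * 50 + ["spk_b"] * 150 + ["spk_c"] * 400 + ["spk_d"] * 600

# Count how many examples each speaker contributes.
toy_counts = Counter(toy_speaker_ids)


def keep_speaker(speaker_id, low=100, high=400):
    """Keep speakers whose example count lies in the inclusive range [low, high]."""
    return low <= toy_counts[speaker_id] <= high


kept = [s for s in toy_speaker_ids if keep_speaker(s)]
print(len(kept), sorted(set(kept)))  # 550 ['spk_b', 'spk_c']
```

Speakers with too few examples (spk_a) and the dominant speaker (spk_d) are both dropped, which is what balances the data.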

Note that some speakers with few examples may actually have more audio available if the examples are long. However, determining the total amount of audio for each speaker requires scanning through the entire dataset, which is a time-consuming process that involves loading and decoding each audio file. As such, we have chosen to skip this step here.

In [None]:
def select_speaker(speaker_id):
    return 100 <= speaker_counts[speaker_id] <= 400

dataset = dataset.filter(select_speaker, input_columns=["speaker_id"])


In [None]:
# Check how many speakers are left
len(set(dataset["speaker_id"]))

In [None]:
# Check how many examples are left
len(dataset)

### Multiple speaker embeddings

To enable the TTS model to differentiate between multiple speakers, you’ll need to create a speaker embedding for each example. The speaker embedding is an additional input into the model that captures a particular speaker’s voice characteristics. 

To generate these speaker embeddings, use the pre-trained ```spkrec-xvect-voxceleb``` model from SpeechBrain. It’s important to note that the ```spkrec-xvect-voxceleb``` model was trained on English speech from the VoxCeleb dataset, whereas the training examples in this guide are in Dutch. While we believe that this model will still generate reasonable speaker embeddings for our Dutch dataset, this assumption may not hold true in all cases.

For optimal results, we recommend training an X-vector model on the target speech first. This will ensure that the model is better able to capture the unique voice characteristics present in the Dutch language.

In [None]:

spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
speaker_model = EncoderClassifier.from_hparams(
    source=spk_model_name,
    run_opts={"device": device},
    savedir=os.path.join("/tmp", spk_model_name),
)


def create_speaker_embedding(waveform):
    """
    Function input: audio waveform. 
    Function output: 512-element vector containing the corresponding speaker embedding.
    """
    with torch.no_grad():
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings
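
The normalization step in the function above rescales each embedding to unit L2 length. A minimal plain-Python sketch of that operation, assuming the default behaviour of torch.nn.functional.normalize (division by the L2 norm, clamped from below by a small epsilon), looks like this. The three-element vector is a toy input, not a real 512-element x-vector.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Divide a vector by its L2 norm, guarding against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


raw = [3.0, 4.0, 0.0]  # toy vector with L2 norm 5
unit = l2_normalize(raw)
print(unit)  # [0.6, 0.8, 0.0]
```

Normalizing keeps the direction of the embedding, which encodes the voice characteristics, while removing differences in magnitude between recordings.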

## Final preprocessing step

Finally, let’s process the data into the format the model expects. Create a ```prepare_dataset``` function that takes in a single example and uses the SpeechT5Processor object to tokenise the input text and load the target audio into a log-mel spectrogram. It should also add the speaker embeddings as an additional input.

In [None]:
def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        text=example["normalized_text"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )

    # strip off the batch dimension
    example["labels"] = example["labels"][0]

    # use SpeechBrain to obtain x-vector
    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])

    return example

In [None]:
# Inspect an example
processed_example = prepare_dataset(dataset[0])
print(list(processed_example.keys()))

# Speaker embeddings should be a 512-element vector
print(processed_example["speaker_embeddings"].shape)

In [None]:
# The labels should be a log-mel spectrogram with 80 mel bins
plt.figure()
plt.imshow(processed_example["labels"].T)
# After transposing, mel bins run along the y-axis and frames along the x-axis
plt.xlabel("Frames")
plt.ylabel("Mel bins")
plt.show()

In [None]:
# If all looks good, apply the processing function to the entire dataset
# This will take between 5 and 10 minutes
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

In [None]:
def is_not_too_long(input_ids):
    """
    Keep only the examples whose tokenized input is short enough for the model.
    The model allows up to 600 tokens. Here, we only allow up to 200 tokens.
    """
    input_length = len(input_ids)
    return input_length < 200


# The input_ids column only exists after prepare_dataset has been applied,
# so this filter must run after the mapping step above
dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"])
len(dataset)

In [None]:
# Create train-test split
dataset = dataset.train_test_split(test_size=0.1)

## Data collator

In order to combine multiple examples into a batch, you need to define a custom data collator. This collator will pad shorter sequences with padding tokens, ensuring that all examples have the same length. For the spectrogram labels, the padded portions are replaced with the special value -100. This special value instructs the model to ignore that part of the spectrogram when calculating the spectrogram loss.
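
As a rough illustration of label padding, the sketch below pads toy spectrograms (lists of frames) directly with -100. This is a simplification: the collator in this guide pads through the processor first and then overwrites the padded positions. The helper name pad_labels and the two-bin frames are made up for the example.

```python
IGNORE_VALUE = -100.0  # value the loss is told to skip


def pad_labels(spectrograms, n_mels):
    """Pad every spectrogram to the longest one, filling extra frames with IGNORE_VALUE."""
    max_len = max(len(spec) for spec in spectrograms)
    padded = []
    for spec in spectrograms:
        n_pad = max_len - len(spec)
        padded.append(spec + [[IGNORE_VALUE] * n_mels for _ in range(n_pad)])
    return padded


# Two toy spectrograms with 2 mel bins each: 3 frames and 1 frame.
toy = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8]],
]
batch = pad_labels(toy, n_mels=2)
print([len(spec) for spec in batch])  # [3, 3]
print(batch[1])  # [[0.7, 0.8], [-100.0, -100.0], [-100.0, -100.0]]
```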

Note that in SpeechT5, the input to the decoder part of the model is reduced by a factor of 2. In other words, it throws away every other timestep from the target sequence. The decoder then predicts a sequence that is twice as long. Since the original target sequence length may be odd, the data collator makes sure to round the maximum length of the batch down to be a multiple of 2.
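
The rounding itself is simple arithmetic. The sketch below applies it to made-up target lengths; the helper name round_down_to_multiple is illustrative and not part of any library.

```python
def round_down_to_multiple(length, reduction_factor):
    """Round a length down to the nearest multiple of the reduction factor."""
    return length - length % reduction_factor


target_lengths = [101, 64, 37]  # made-up spectrogram lengths in frames
reduction_factor = 2

rounded = [round_down_to_multiple(n, reduction_factor) for n in target_lengths]
max_length = max(rounded)
print(rounded, max_length)  # [100, 64, 36] 100
```

The labels in the batch are then truncated to max_length frames, so the longest sequence loses at most one frame.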


In [None]:

@dataclass
class TTSDataCollatorWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
        label_features = [{"input_values": feature["labels"]} for feature in features]
        speaker_features = [feature["speaker_embeddings"] for feature in features]

        # collate the inputs and targets into a batch
        batch = self.processor.pad(input_ids=input_ids, labels=label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        batch["labels"] = batch["labels"].masked_fill(batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100)

        # not used during fine-tuning
        del batch["decoder_attention_mask"]

        # round down target lengths to multiple of reduction factor
        if model.config.reduction_factor > 1:
            target_lengths = torch.tensor([len(feature["input_values"]) for feature in label_features])
            target_lengths = target_lengths.new(
                [length - length % model.config.reduction_factor for length in target_lengths]
            )
            max_length = max(target_lengths)
            batch["labels"] = batch["labels"][:, :max_length]

        # also add in the speaker embeddings
        batch["speaker_embeddings"] = torch.tensor(speaker_features)

        return batch

In [None]:
data_collator = TTSDataCollatorWithPadding(processor=processor)

# Training

In [None]:
# Load pre-trained model
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)

In [None]:
# use_cache is incompatible with gradient checkpointing
# Disable for training
model.config.use_cache = False

In [None]:
# Define training args
# Note we're not defining eval metrics here; we'll only use the loss
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_voxpopuli_nl",  # change to a repo name of your choice
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",
    per_device_eval_batch_size=2,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    greater_is_better=False,
    label_names=["labels"],
    push_to_hub=False,
)

In [None]:
# Instantiate the Trainer object
# Pass training args, model, dataset, data collator to Trainer object
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    processing_class=processor,
)

In [None]:
# Call train on trainer object to fine-tune the model
trainer.train()

Depending on your GPU, it is possible that you will encounter a CUDA “out-of-memory” error when you start training. In this case, you can reduce the ```per_device_train_batch_size``` incrementally by factors of 2 and increase ```gradient_accumulation_steps``` by 2x to compensate.
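
The reason this compensation works is that the effective batch size per optimizer step is the product of the two settings. A small sketch of the arithmetic, assuming a single device (the helper name and the num_devices parameter are illustrative):

```python
def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * accumulation_steps * num_devices


# Starting point from the training arguments above, then two halving steps.
configs = [(4, 8), (2, 16), (1, 32)]
sizes = [effective_batch_size(bs, acc) for bs, acc in configs]
print(sizes)  # [32, 32, 32]
```

Each halving of the per-device batch size roughly halves activation memory, while the optimizer still sees 32 examples per update.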

In [None]:
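# Save the fine-tuned model to the output directory so it can be loaded for inference below.
# The processor passed as processing_class should be saved alongside it.
trainer.save_model()

# Uncomment below if you want to save the processor to a different directory or account
#processor.save_pretrained("DIR_OR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl")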

# Inference

## Inference using a pipeline

In [None]:
pipe = pipeline("text-to-speech", model="speecht5_finetuned_voxpopuli_nl")

In [None]:
# Assign a piece of text in Dutch that you want to synthesize
text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"

In [None]:
# SpeechT5 pipeline requires a speaker embedding
# Get it from an example in the data set (arbitrarily chosen here)
example = dataset["test"][304]
speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

In [None]:
# Pass the text and speaker embeddings to the pipeline
forward_params = {"speaker_embeddings": speaker_embeddings}
output = pipe(text, forward_params=forward_params)
output

In [None]:
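# Listen to the resulting audio
# Re-import Audio from IPython, because the earlier datasets import shadowed it
from IPython.display import Audio

Audio(output['audio'], rate=output['sampling_rate'])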

## Manual inference 

In [None]:
model = SpeechT5ForTextToSpeech.from_pretrained("speecht5_finetuned_voxpopuli_nl")

In [None]:
# Get a speaker embedding from a test example
example = dataset["test"][304]
speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

In [None]:
# Define some input text and tokenize it
text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"
inputs = processor(text=text, return_tensors="pt")

In [None]:
# Sanity check: create and visualise spectrogram 
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)

plt.figure()
plt.imshow(spectrogram.T)
plt.show()

In [None]:
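# Load the HiFi-GAN vocoder and use it to turn the spectrogram into sound
from transformers import SpeechT5HifiGan

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

with torch.no_grad():
    speech = vocoder(spectrogram)

from IPython.display import Audio

Audio(speech.numpy(), rate=16000)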

# Final Note

From experience, obtaining satisfactory results from this model can be challenging. The quality of the speaker embeddings appears to be a significant factor. Since SpeechT5 was pre-trained with English x-vectors, it performs best when using English speaker embeddings. That said, the speech clearly is Dutch instead of English, and it does capture the voice characteristics of the speaker (compare to the original audio in the example). If the synthesized speech sounds poor, you can try the following:

1. Try using a different speaker embedding example.
2. Increase the training duration, which is also likely to enhance the quality of the results.
3. Experiment with the model’s configuration. For example, try using config.reduction_factor = 1 to see if this improves the results.

Finally, it is essential to keep ethical considerations in mind. Although TTS technology has numerous useful applications, it may also be used for malicious purposes, such as impersonating someone’s voice without their knowledge or consent. Please use TTS judiciously and responsibly.