Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as Bark, MMS, VITS and SpeechT5.

You can easily generate audio using the "text-to-audio" pipeline (or its alias, "text-to-speech"). Some models, like Bark, can also be conditioned to generate non-verbal sounds such as laughing, sighing and crying, or even to add music.

Here’s an example of how you would use the "text-to-speech" pipeline with Bark:

In [None]:
from transformers import pipeline
from IPython.display import Audio

pipe = pipeline("text-to-speech", model="suno/bark-small")
text = "[clears throat] This is a test ... and I just took a long pause."
output = pipe(text)
Audio(output["audio"], rate=output["sampling_rate"])


If you are looking to fine-tune a TTS model, the only text-to-speech models in 🤗 Transformers that currently support fine-tuning are SpeechT5 and FastSpeech2Conformer, though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings.

The remainder of this guide illustrates how to:
1. Fine-tune SpeechT5, which was originally trained on English speech, on the Dutch (nl) language subset of the VoxPopuli dataset.
2. Use your fine-tuned model for inference in one of two ways: using a pipeline or directly.

# Libraries

In [None]:
pip install datasets soundfile speechbrain accelerate

# Install 🤗 Transformers from source, as not all SpeechT5 features have been merged into an official release
pip install git+https://github.com/huggingface/transformers.git

In [None]:
# To follow this guide you will need a GPU

# If you’re working in a notebook, run the following line to check whether an NVIDIA GPU is available
!nvidia-smi

# Or, for an AMD GPU
!rocm-smi

In [None]:
from transformers import SpeechT5Processor

# Data

VoxPopuli is a large-scale multilingual speech corpus consisting of data sourced from 2009-2020 European Parliament event recordings. It contains labelled audio-transcription data for 15 European languages. This guide uses the Dutch language subset, but feel free to pick another one.

Note that VoxPopuli or any other automatic speech recognition (ASR) dataset may not be the most suitable option for training TTS models. The features that make it beneficial for ASR, such as excessive background noise, are typically undesirable in TTS. However, finding top-quality, multilingual, and multi-speaker TTS datasets can be quite challenging.

In [None]:
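# Load the dataset
# Note: this import rebinds the name `Audio`, shadowing the earlier `IPython.display.Audio` import
from datasets import load_dataset, Audio

# The Dutch train split should contain 20968 examples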
dataset = load_dataset("facebook/voxpopuli", "nl", split="train")
len(dataset)

In [None]:
# SpeechT5 expects audio data to have a sampling rate of 16 kHz
# Make sure the examples in the dataset meet the requirement of 16kHz sampling rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Preprocessing

## Text Preprocessing

In [None]:
# Load the processor for the checkpoint and get its tokenizer
checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)
tokenizer = processor.tokenizer

The dataset examples contain ```raw_text``` and ```normalized_text``` features. When deciding which feature to use as the text input, consider that the SpeechT5 tokenizer doesn’t have any tokens for numbers. In ```normalized_text``` the numbers are written out as text, so it is the better fit and the recommended input text.
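
If you pick a different subset or dataset, it may be worth verifying that the chosen text column really is free of digits. The following is a minimal sketch in plain Python: `find_texts_with_digits` is a hypothetical helper introduced here for illustration, and the sample sentences are made up rather than taken from VoxPopuli.

```python
def find_texts_with_digits(texts):
    """Return the texts that contain at least one digit character."""
    return [text for text in texts if any(ch.isdigit() for ch in text)]


# Made-up sample transcriptions, not taken from VoxPopuli
samples = [
    "het debat duurde 3 uur",
    "het debat duurde drie uur",
]

print(find_texts_with_digits(samples))  # ['het debat duurde 3 uur']
```

On the real dataset, the same helper could be applied to a slice of the text column; an empty result suggests the column is safe to use as input.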

Because SpeechT5 was trained on English, its tokenizer may not recognize certain characters in the Dutch dataset. If left as is, these characters will be converted to ```<unk>``` tokens. In Dutch, however, characters such as à are used to stress syllables, so to preserve the meaning of the text you can replace such a character with a regular a.
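
To illustrate why the replacement helps, here is a toy character-level tokenizer. It is only a sketch of the general idea and not the SpeechT5Tokenizer: `toy_char_tokenize` and `toy_vocab` are hypothetical names, and the vocabulary is a made-up set of lowercase ASCII letters plus a space.

```python
UNK = "<unk>"


def toy_char_tokenize(text, vocab):
    """Map each character to itself if it is in the vocabulary, otherwise to the unknown token."""
    return [ch if ch in vocab else UNK for ch in text]


# Hypothetical, tiny vocabulary: lowercase ASCII letters and a space
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")

tokens_raw = toy_char_tokenize("voilà", toy_vocab)
tokens_clean = toy_char_tokenize("voilà".replace("à", "a"), toy_vocab)

print(tokens_raw)    # ['v', 'o', 'i', 'l', '<unk>']
print(tokens_clean)  # ['v', 'o', 'i', 'l', 'a']
```

Without the replacement, the accented character is lost entirely; with it, the model still sees the underlying letter.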

To identify unsupported tokens, extract all unique characters in the dataset and compare them with the vocabulary of the SpeechT5Tokenizer, which works with characters as tokens. To do this, write the ```extract_all_chars``` mapping function that concatenates the transcriptions from all examples into one string and converts it to a set of characters. Make sure to set ```batched=True``` and ```batch_size=-1``` in ```dataset.map()``` so that all transcriptions are available at once for the mapping function.

In [None]:
def extract_all_chars(batch):
    all_text = " ".join(batch["normalized_text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}


vocabs = dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,
    keep_in_memory=True,
    remove_columns=dataset.column_names,
)

dataset_vocab = set(vocabs["vocab"][0])
tokenizer_vocab = set(tokenizer.get_vocab())

In [None]:
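# Identify characters that occur in the dataset but not in the tokenizer vocabulary
# (a plain space may also be listed; it is not part of the replacements below)
print(dataset_vocab - tokenizer_vocab)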

# Handle the unsupported characters identified in the previous step (manually in this case)
replacements = [
    ("à", "a"),
    ("ç", "c"),
    ("è", "e"),
    ("ë", "e"),
    ("í", "i"),
    ("ï", "i"),
    ("ö", "o"),
    ("ü", "u"),
]


def cleanup_text(inputs):
    for src, dst in replacements:
        inputs["normalized_text"] = inputs["normalized_text"].replace(src, dst)
    return inputs


dataset = dataset.map(cleanup_text)
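
As a quick sanity check, the replacement step can be exercised on a plain dictionary without loading the dataset. The sketch below re-declares the same replacement pairs and applies them with `str.maketrans` and `str.translate`, which is an alternative formulation and not the loop used above. `cleanup_text_translate` is a hypothetical name and the sample sentence is made up.

```python
# Same replacement pairs as in the guide, re-declared so the snippet is self-contained
replacements = [
    ("à", "a"),
    ("ç", "c"),
    ("è", "e"),
    ("ë", "e"),
    ("í", "i"),
    ("ï", "i"),
    ("ö", "o"),
    ("ü", "u"),
]

# Build a single translation table so each string is scanned only once
translation_table = str.maketrans(dict(replacements))


def cleanup_text_translate(example):
    """Return a copy of the example with unsupported characters replaced."""
    cleaned = dict(example)
    cleaned["normalized_text"] = example["normalized_text"].translate(translation_table)
    return cleaned


# Made-up sample, not taken from VoxPopuli
sample = {"normalized_text": "financiële coördinatie à la carte"}
print(cleanup_text_translate(sample)["normalized_text"])
# financiele coordinatie a la carte
```

Unlike `cleanup_text`, this variant returns a new dictionary instead of modifying its input in place, which makes it easier to compare the text before and after cleaning.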