# SpeechT5 with Hugging Face

Also check out the blog post: [hf.co/blog/speecht5](http://hf.co/blog/speecht5)

And the online demos:

- [Speech Synthesis (TTS)](https://huggingface.co/spaces/Matthijs/speecht5-tts-demo)
- [Voice Conversion](https://huggingface.co/spaces/Matthijs/speecht5-vc-demo)
- [Automatic Speech Recognition](https://huggingface.co/spaces/Matthijs/speecht5-asr-demo)

First install Transformers, Datasets, and sentencepiece.

**Note:** It's important to restart the notebook after installing sentencepiece, or the demos won't work!

In [1]:
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-oc7bflab
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-oc7bflab
  Resolved https://github.com/huggingface/transformers.git to commit 6824461f2a35546a3d781fe60576e00f6db7bedf
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers==4.34.0.dev0)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers==4.34.0.dev0)
  Downloading tokenizers-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2.14.

In [3]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99
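
After restarting the notebook, it can be useful to confirm that the three packages are actually visible to the new kernel. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of any of these libraries. It only reports what is installed, it cannot tell whether the kernel was restarted.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages this notebook installs.
report = installed_versions(["transformers", "datasets", "sentencepiece"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any line prints `NOT INSTALLED`, rerun the corresponding `pip install` cell before continuing.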


## Text-to-speech

Load the model:

In [4]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")


Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Preprocess the text input:

In [5]:
inputs = processor(text="Don't count the days, make the days count.", return_tensors="pt")

Load a speaker embedding:

In [6]:
from datasets import load_dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

import torch
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


In [7]:
speaker_embeddings.shape

torch.Size([1, 512])
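
Index 7306 above is a fixed row number. To choose a voice by speaker instead, one option is to filter rows by filename. The sketch below uses plain Python lists in place of the dataset, and it assumes each record carries a `filename` field next to `xvector` and that the filename embeds a speaker code such as `slt`. The sample records and both helper names are hypothetical, so check `embeddings_dataset.column_names` and a few real rows before relying on this.

```python
def find_speaker_rows(records, speaker_code):
    """Return indices of records whose filename mentions the given speaker code."""
    marker = f"_{speaker_code}_"
    return [i for i, rec in enumerate(records) if marker in rec["filename"]]


def as_batch(xvector):
    """Wrap one embedding in a batch dimension: length 512 -> shape (1, 512)."""
    return [list(xvector)]


# Hypothetical stand-in rows; the real dataset holds one x-vector per utterance.
records = [
    {"filename": "cmu_us_bdl_arctic-wav-arctic_a0001", "xvector": [0.0] * 512},
    {"filename": "cmu_us_slt_arctic-wav-arctic_a0001", "xvector": [0.1] * 512},
    {"filename": "cmu_us_slt_arctic-wav-arctic_a0002", "xvector": [0.2] * 512},
]

slt_rows = find_speaker_rows(records, "slt")
batch = as_batch(records[slt_rows[0]]["xvector"])
print(slt_rows, len(batch), len(batch[0]))
```

The nested list plays the same role as `unsqueeze(0)` in the cell above: the model expects a leading batch dimension, which is why the embedding has shape `[1, 512]` rather than `[512]`.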

Load a vocoder:

In [28]:
from transformers import SpeechT5HifiGan
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

Generate the speech from the input text:

In [9]:
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

In [10]:
speech.shape

torch.Size([35840])
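
The output is a one-dimensional waveform, so its length in samples divided by the sampling rate gives the clip duration. A small sketch of that arithmetic, using the 16 kHz rate passed to `Audio` and `sf.write` in this notebook (the function name is illustrative):

```python
def duration_seconds(num_samples: int, sampling_rate: int = 16000) -> float:
    """Convert a sample count to a duration in seconds."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


# 35840 samples at 16 kHz
print(duration_seconds(35840))  # 2.24
```

So the generated sentence is a little over two seconds long.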

In [11]:
from IPython.display import Audio

Audio(speech, rate=16000)

In [12]:
import soundfile as sf
sf.write("tts_example.wav", speech.numpy(), samplerate=16000)
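
If `soundfile` is not available, a mono 16-bit WAV file can also be written with Python's standard `wave` module. This is a sketch under two assumptions: the waveform values lie in the range [-1.0, 1.0] (values outside are clipped), and the audio is mono at 16 kHz. The function name `write_wav_pcm16` is illustrative, and a synthetic 440 Hz tone stands in for the model output so the block runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_pcm16(path, samples, sampling_rate=16000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for value in samples:
        clipped = max(-1.0, min(1.0, float(value)))
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in signal: 0.1 s of a 440 Hz sine at half amplitude.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
out_path = os.path.join(tempfile.gettempdir(), "speecht5_tone_example.wav")
write_wav_pcm16(out_path, tone)
```

With the tensor from this notebook, the call would be `write_wav_pcm16("tts_example.wav", speech.tolist())`.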

## Speech-to-speech for voice conversion

Load the model:

In [13]:
from transformers import SpeechT5ForSpeechToSpeech

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Some weights of SpeechT5ForSpeechToSpeech were not initialized from the model checkpoint at microsoft/speecht5_vc and are newly initialized: ['speecht5.encoder.prenet.pos_sinusoidal_embed.weights']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Load an input speech example:

In [14]:
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
example = dataset[40]


In [15]:
Audio(example["audio"]["array"], rate=16000)

Preprocess the speech input:

In [16]:
sampling_rate = dataset.features["audio"].sampling_rate
inputs = processor(audio=example["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
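
The rest of this notebook plays and saves audio at 16 kHz, so the input recording should use the same rate. The guard below is a minimal sketch for checking this before preprocessing your own recordings. The constant and function name are illustrative, and the 16 kHz expectation is taken from the rate used throughout this notebook.

```python
EXPECTED_SAMPLING_RATE = 16000  # rate used for all audio in this notebook


def check_sampling_rate(actual, expected=EXPECTED_SAMPLING_RATE):
    """Return the rate unchanged, or raise if the audio needs resampling first."""
    if actual != expected:
        raise ValueError(
            f"Audio is sampled at {actual} Hz but {expected} Hz is expected; "
            "resample the audio before preprocessing."
        )
    return actual


print(check_sampling_rate(16000))
```

In the cell above this would be used as `check_sampling_rate(sampling_rate)` just before the `processor(...)` call.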

Load the speaker embedding for the target speaker's voice:

In [17]:
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

Generate the speech:

In [18]:
speech = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)

In [19]:
Audio(speech, rate=16000)

In [20]:
import soundfile as sf
sf.write("speech_converted.wav", speech.numpy(), samplerate=16000)

## Automatic speech recognition (using pipeline)

In [21]:
from transformers import pipeline
generator = pipeline(task="automatic-speech-recognition", model="microsoft/speecht5_asr")


Some weights of SpeechT5ForSpeechToText were not initialized from the model checkpoint at microsoft/speecht5_asr and are newly initialized: ['speecht5.encoder.prenet.pos_sinusoidal_embed.weights']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [22]:
transcription = generator(example["audio"]["array"])

spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.


In [23]:
transcription["text"]

'a man said to the universe sir i exist'

## Automatic speech recognition (using the model)

Load the model:

In [24]:
from transformers import SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of SpeechT5ForSpeechToText were not initialized from the model checkpoint at microsoft/speecht5_asr and are newly initialized: ['speecht5.encoder.prenet.pos_sinusoidal_embed.weights']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Preprocess the input speech example:

In [25]:
sampling_rate = dataset.features["audio"].sampling_rate
inputs = processor(audio=example["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

Generate text from the speech input:

In [26]:
predicted_ids = model.generate(**inputs, max_length=100)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

In [27]:
transcription[0]

'a man said to the universe sir i exist'
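
To judge a transcription against a known reference, a common measure is the word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python version for illustration and is not part of Transformers. The uppercase reference string is an assumed ground truth for this example, and both strings are lowercased before comparison because the model outputs lowercase text without punctuation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row dynamic programming over the edit-distance table.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "A MAN SAID TO THE UNIVERSE SIR I EXIST"  # assumed ground truth
exact = word_error_rate(reference, "a man said to the universe sir i exist")
one_missing = word_error_rate(reference, "a man said to the universe sir exist")
print(exact, one_missing)
```

A WER of 0 means every reference word was recovered; dropping one of the nine words gives 1/9.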