# **EASY IMPLEMENTATION OF SPEECHT5 WITH HUGGING FACE**

---

📎 README

This notebook provides a straightforward implementation of text-to-speech, speech-to-speech, and speech-to-text with the SpeechT5 model, using the Transformers library from Hugging Face.

About SpeechT5, which covers three kinds of tasks:

*   Speech-to-text for automatic speech recognition or speaker identification.
*   Text-to-speech to synthesize audio.
*   Speech-to-speech for converting between different voices or performing speech enhancement.
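
The three task families above correspond to the model classes and checkpoints used later in this notebook. The lookup table below is a small sketch of that mapping. The class and checkpoint names are taken from the notebook's own cells; the `SPEECHT5_TASKS` table and the `load_task` helper are hypothetical names introduced only for illustration.

```python
# Task -> (model class name, checkpoint), as used in the cells of this notebook.
SPEECHT5_TASKS = {
    "text-to-speech": ("SpeechT5ForTextToSpeech", "microsoft/speecht5_tts"),
    "speech-to-speech": ("SpeechT5ForSpeechToSpeech", "microsoft/speecht5_vc"),
    "speech-to-text": ("SpeechT5ForSpeechToText", "microsoft/speecht5_asr"),
}


def load_task(task):
    """Return (processor, model) for one of the tasks above (hypothetical helper)."""
    if task not in SPEECHT5_TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(SPEECHT5_TASKS)}")
    class_name, checkpoint = SPEECHT5_TASKS[task]
    # Imported lazily so the table can be inspected without Transformers installed.
    import transformers

    model_class = getattr(transformers, class_name)
    processor = transformers.SpeechT5Processor.from_pretrained(checkpoint)
    model = model_class.from_pretrained(checkpoint)
    return processor, model
```

Calling `load_task("text-to-speech")` would download the checkpoint on first use, so it is best run after the installation cell.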

References:

1. [Hugging Face, SeamlessM4T Large](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t).
2. [Kaggle, SeamlessM4T Usage in Transformers](https://www.kaggle.com/code/yoachlcmb/seamlessm4t-usage-in-transformers).


In [None]:
# @title #1. ✨ Installing dependencies.
!pip install --quiet git+https://github.com/huggingface/transformers.git &> /dev/null
#!pip install --quiet git+https://github.com/google/sentencepiece &> /dev/null
!pip install sentencepiece &> /dev/null
# soundfile is used in later cells to write WAV files.
!pip install datasets soundfile &> /dev/null

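# Import the libraries once to confirm that the installation succeeded.
import transformers
import sentencepiece
import torch

# Prefer the first GPU when one is available; otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"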
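
The `device` string is computed above, but the later cells never move anything onto it, so everything runs on the CPU. Below is a minimal sketch of how the model and its inputs could be moved, assuming the processor output can be iterated like a dictionary of tensors. `move_to_device` is a hypothetical helper, not part of Transformers.

```python
def move_to_device(batch, device):
    """Move every tensor-like value in a dict to `device`; leave other values untouched."""
    return {
        key: value.to(device) if hasattr(value, "to") else value
        for key, value in batch.items()
    }


# Intended use in the cells below (not executed here):
#   model = model.to(device)
#   vocoder = vocoder.to(device)
#   inputs = move_to_device(processor(text="...", return_tensors="pt"), device)
#   speech = model.generate_speech(
#       inputs["input_ids"], speaker_embeddings.to(device), vocoder=vocoder
#   )
#   speech = speech.cpu()  # move back to the CPU before saving or playing
```

The model, the vocoder, the inputs, and the speaker embeddings all have to sit on the same device, which is why each of them appears in the sketch.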




In [None]:
# @title #2.1 ✨ Text to Speech.

import soundfile as sf
import torch
from datasets import load_dataset
from IPython.display import Audio
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Load the text-to-speech checkpoint and its processor (tokenizer + feature extractor).
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Tokenize the input text.
inputs = processor(text="Don't count the days, make the days count.", return_tensors="pt")

# Load a speaker embedding (x-vector) that sets the voice of the output.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# The HiFi-GAN vocoder turns the predicted spectrogram into a waveform.
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# Save the result as a 16 kHz WAV file, then play it inline.
sf.write("tts_example.wav", speech.numpy(), samplerate=16000)

Audio(speech.numpy(), rate=16000)
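
The cell above synthesizes one short sentence. For longer passages, one possible workaround (an assumption here, not something this notebook does) is to split the text into sentence-sized chunks, synthesize each chunk separately, and join the waveforms. The `split_into_chunks` helper and its 200-character budget are hypothetical choices for illustration.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then go through `processor(text=chunk, return_tensors="pt")` and `model.generate_speech(...)`, with the resulting one-dimensional tensors joined by `torch.cat`.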


In [None]:
# @title #2.2 ✨ Speech to Speech.
from datasets import load_dataset
from IPython.display import Audio
from transformers import SpeechT5ForSpeechToSpeech, SpeechT5Processor

# Load the voice-conversion checkpoint and its processor.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")

# Pick a source utterance from a small LibriSpeech demo set.
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
example = dataset[20]

sampling_rate = dataset.features["audio"].sampling_rate

# Play the original recording.
Audio(example["audio"]["array"], rate=sampling_rate)


Some weights of the model checkpoint at microsoft/speecht5_vc were not used when initializing SpeechT5ForSpeechToSpeech: ['speecht5.encoder.prenet.pos_conv_embed.conv.weight_v', 'speecht5.encoder.prenet.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing SpeechT5ForSpeechToSpeech from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing SpeechT5ForSpeechToSpeech from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of SpeechT5ForSpeechToSpeech were not initialized from the model checkpoint at microsoft/speecht5_vc and are newly initialized: ['speecht5.encoder.prenet.pos_sinusoidal_embed.weights', 'speecht5.encoder.prenet.pos_conv_embed.conv.parametrizations.weight.original1', 's

In [None]:
import soundfile as sf
import torch

# Convert the source waveform into model inputs.
inputs = processor(audio=example["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

# Load the target speaker embedding (x-vector) that defines the output voice.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# `vocoder` is the SpeechT5HifiGan instance loaded in section 2.1.
speech = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)

# Save the converted speech, then play it inline.
sf.write("speech_converted.wav", speech.numpy(), samplerate=sampling_rate)

Audio(speech.numpy(), rate=sampling_rate)


In [None]:
# @title #2.3 ✨ Speech to Text.

from datasets import load_dataset
from transformers import SpeechT5ForSpeechToText, SpeechT5Processor

# Load the speech-recognition checkpoint and its processor.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
# This cell reuses `example` from section 2.2; uncomment the next line to transcribe another utterance.
#example = dataset[40]

sampling_rate = dataset.features["audio"].sampling_rate
inputs = processor(audio=example["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

# Generate token ids, then decode them into text.
predicted_ids = model.generate(**inputs, max_length=100)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)


# Show the decoded text for the single input utterance.
print(transcription[0])
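
To judge a transcription, it can be compared against the reference text. Below is a minimal sketch of word error rate (WER) computed from word-level edit distance. `word_error_rate` is a hypothetical helper, and the usage note assumes that each dataset example exposes its reference transcript under a `text` key.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)
```

In the cell above this might be called as `word_error_rate(example["text"], transcription[0])`, where 0.0 means the two texts match word for word.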
