# Creating a speech-to-speech translation app with a cascaded approach

This notebook demonstrates how you can build a speech-to-speech translation (STST) application using open-source models from the 🤗 Hub.

We illustrate STST with a cascaded approach: a speech translation (ST) system first turns the source speech into text in the target language, then a text-to-speech (TTS) system generates speech in the target language from the translated text.

Depending on your use case, take note of the following considerations:
- Cascaded STST is a data- and compute-efficient way of developing an STST application, since existing speech recognition and text-to-speech systems can be quickly coupled without any need for additional training.
- However, having more than one model in the pipeline lends itself to error propagation, where the errors introduced by one model are compounded as they flow through the remaining model(s). This also increases latency, since inference has to be conducted for more than one model.
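
To make this trade-off concrete, here is a minimal sketch of a cascade with per-stage timing. The two stage functions are hypothetical stand-ins written in plain Python, not real models. The sketch only illustrates that the second stage sees nothing but the first stage's output, and that the total latency is the sum of the stage latencies.

```python
import time
from typing import Callable, Dict, Tuple


def run_cascade(
    source_speech: str,
    st_stage: Callable[[str], str],
    tts_stage: Callable[[str], str],
) -> Tuple[str, Dict[str, float]]:
    """Run ST then TTS, recording how long each stage takes."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    translated_text = st_stage(source_speech)
    timings["st"] = time.perf_counter() - start

    start = time.perf_counter()
    # The TTS stage only ever sees the ST output, so any ST error is carried over.
    target_speech = tts_stage(translated_text)
    timings["tts"] = time.perf_counter() - start

    timings["total"] = timings["st"] + timings["tts"]
    return target_speech, timings


# Hypothetical stand-ins for real models, used purely for illustration.
def fake_st(speech: str) -> str:
    # Deliberately imperfect: only "Hallo" is translated, "Welt" is left as-is.
    return speech.replace("Hallo", "Hello")


def fake_tts(text: str) -> str:
    return f"<audio of '{text}'>"


speech_out, stage_timings = run_cascade("Hallo Welt", fake_st, fake_tts)
print(speech_out)
print(stage_timings)
```

The untranslated word from the first stand-in ends up in the "audio" produced by the second, which is the error propagation described above in miniature.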

In this notebook, we'll create an application for translating from **any language** * to English.

`*` By any language we mean one of the 96 languages supported by the Whisper model.


Start by installing the required dependencies:

In [1]:
!pip install -q transformers datasets gradio sentencepiece

We'll use the Whisper model for the speech translation part of the application since it's capable of translating from over 96 languages to English. Specifically, we'll load the `whisper-base` checkpoint, which clocks in at 74M parameters. It's by no means the most performant Whisper model (the largest Whisper checkpoint is over 20x larger), but since we're chaining two models together (ST + TTS), we want each one to generate relatively quickly so that overall inference speed stays reasonable.
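
As a rough sanity check on the size comparison, the sketch below uses the approximate parameter counts commonly reported for the multilingual Whisper checkpoints. Apart from the 74M figure for `whisper-base`, these numbers do not come from this notebook and are rounded, so treat them as assumptions.

```python
# Approximate parameter counts in millions, as commonly reported for Whisper.
# Assumption: rounded figures, not measured in this notebook.
WHISPER_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def size_relative_to_base(checkpoint: str) -> float:
    """Return how many times larger a checkpoint is than whisper-base."""
    return WHISPER_PARAMS_M[checkpoint] / WHISPER_PARAMS_M["base"]


for name in WHISPER_PARAMS_M:
    print(f"{name:>6}: {size_relative_to_base(name):.1f}x base")
```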

## Speech-to-text pipeline with Whisper


The "automatic-speech-recognition" pipeline can cover both converting the speech recording into text and translating it. You could do this with two separate models, but that would increase latency and the chance of error propagation.

In [2]:
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
asr_pipe = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base", device=device
)


Let's create the translation function using the pipeline we defined earlier. By setting the "task" generation keyword argument to "translate", we ensure that Whisper performs speech translation, not speech recognition.

In [3]:
def translate(audio):
    outputs = asr_pipe(audio, max_new_tokens=256, generate_kwargs={"task": "translate"})
    return outputs["text"]

**Note**: In this example, we create an app that translates from any language to English. However, you can also make Whisper translate speech in any language X into text in any language Y.

To do so, set the "task" to "transcribe" and the "language" to your target language in the generation keyword arguments. For example, for Spanish, one would set `generate_kwargs={"task": "transcribe", "language": "es"}`.

If you choose to translate from any language to any language, you'll need to pick a different model for the text-to-speech part of this guide.
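
The two modes described above can be captured in a small helper. This is only a sketch: `build_generate_kwargs` and its `target_language` parameter are hypothetical names introduced here for illustration and are not part of 🤗 Transformers.

```python
from typing import Dict, Optional


def build_generate_kwargs(target_language: Optional[str] = None) -> Dict[str, str]:
    """Build Whisper generation kwargs for the two modes described in the note.

    Hypothetical helper: `target_language` is a language code such as "es".
    """
    if target_language is None or target_language == "en":
        # Speech translation into English, as used in this notebook.
        return {"task": "translate"}
    # The approach from the note: "transcribe" with the target language set.
    return {"task": "transcribe", "language": target_language}


english_kwargs = build_generate_kwargs()
spanish_kwargs = build_generate_kwargs("es")
print(english_kwargs)
print(spanish_kwargs)
```

The returned dictionary would then be passed as `generate_kwargs=` when calling `asr_pipe`. Output quality for non-English targets may vary, so it is worth checking a few samples by ear or by eye.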

## Text-to-speech with MMS

Now that you have the first half of the cascaded STST app, let's add the second half that involves mapping from English text to English speech.
For this, we'll use the pre-trained MMS checkpoint for English TTS.

You can achieve higher quality audio results as well as multi-lingual speech generation with other models, e.g. Bark. Here we use `facebook/mms-tts-eng` as it is small and fast. It is fine-tuned for English language text-to-speech and generates acceptable results.

In [4]:
from transformers import VitsModel, VitsTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")

Let's create the speech synthesising function:

In [5]:
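def synthesise(text):
    # Tokenise the text into the input IDs expected by the VITS model
    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs["input_ids"]

    # Inference only, so there is no need to track gradients
    with torch.no_grad():
        outputs = model(input_ids)

    # The waveform has shape (batch, samples); return the single item as a
    # 1-D NumPy array, which both IPython's Audio and Gradio can consume
    return model.config.sampling_rate, outputs["waveform"][0].numpy()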
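
The `synthesise` function returns a sampling rate together with a waveform. If you want to keep the generated speech on disk, one option is to write it out as a 16-bit PCM WAV file. The sketch below uses only the Python standard library. `write_wav` is a hypothetical helper, and a generated sine tone stands in for real synthesised speech so that the example runs without any model.

```python
import math
import struct
import wave
from typing import Iterable


def write_wav(path: str, sampling_rate: int, samples: Iterable[float]) -> int:
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM; return the frame count."""
    pcm = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        pcm.append(int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        # WAV stores PCM as little-endian signed 16-bit integers.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)


# Stand-in for synthesised speech: one second of a 440 Hz tone at 16 kHz.
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
frames_written = write_wav("tone.wav", rate, tone)
print(frames_written)
```

With real model output, `samples` would be a flat sequence of floats, for example `outputs["waveform"][0].tolist()`.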

## Bring STST parts together

Once you have both parts of the cascaded STST, you can bring them together in a single function:

In [6]:
def speech_to_speech_translation(audio):
    translated_text = translate(audio)
    sampling_rate, synthesised_speech = synthesise(translated_text)
    return sampling_rate, synthesised_speech

To test the results, let's take an audio recording from the German validation split of the `facebook/voxpopuli` dataset. Feel free to try other languages too!

In [7]:
from datasets import load_dataset

dataset = load_dataset("facebook/voxpopuli", "de", split="validation", streaming=True)
sample = next(iter(dataset))

Listen to the audio itself:

In [8]:
from IPython.display import Audio

Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

Check out the English translation:

In [9]:
text = translate(sample["audio"]["array"])
text

' The second month is a new president in the field.'

Let's hear the result of the speech-to-speech translation:

In [10]:
sampling_rate, synthesised_speech = speech_to_speech_translation(sample["audio"])

Audio(synthesised_speech, rate=sampling_rate)

Not bad!

## Demo app

To illustrate the STST app in action, let's build a demo with Gradio:

In [None]:
import gradio as gr

demo = gr.Blocks()

mic_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(sources="microphone", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

file_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(sources="upload", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

with demo:
    gr.TabbedInterface([mic_translate, file_translate], ["Microphone", "Audio File"])

demo.launch(debug=True)

