# MMS: Scaling Speech Technology to 1000+ languages with OpenVINO™

The Massively Multilingual Speech (MMS) project expands speech technology from about 100 languages to over 1,000 by building a single multilingual speech recognition model supporting over 1,100 languages (more than 10 times as many as before), language identification models able to identify over 4,000 languages (40 times more than before), pretrained models supporting over 1,400 languages, and text-to-speech models for over 1,100 languages.
The MMS model was proposed in [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516). The models and code were originally released in the [fairseq repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms).
The MMS project open-sources several kinds of models: Automatic Speech Recognition (ASR), Language Identification (LID), and Speech Synthesis (TTS).  
In this example we consider ASR and LID. We will use the LID model to identify the language, and then use the ASR model with the adapter for that language to transcribe the speech. A simple diagram of this flow is below.
![LID and ASR flow](https://github.com/openvinotoolkit/openvino_notebooks/assets/76171391/0e7fadd6-29a8-4fac-bd9c-41d66adcb045)


<a id="0"></a>
### Table of contents:
- [Prerequisites](#1)
- [Prepare an example audio](#2)
- [Language Identification (LID)](#3)
  - [Download pretrained model and processor](#4)
  - [Use the original model to run an inference](#5)
  - [Convert to OpenVINO IR model and run an inference](#6)
- [Automatic Speech Recognition (ASR)](#7)
  - [Download pretrained model and processor](#8)
  - [Use the original model to run an inference](#9)
  - [Convert to OpenVINO IR model and run an inference](#10)
- [Interactive demo with Gradio](#11)


<a name='1'></a>
## Prerequisites

In [None]:
%pip install -q --upgrade pip
%pip install -q "transformers>=4.33.1" "openvino>=2023.1.0" "numpy>=1.21.0,<=1.24" datasets accelerate torch soundfile librosa gradio

In [None]:
from pathlib import Path

import torch

import openvino as ov

<a name='2'></a>
## Prepare an example audio
Read an audio file and process the audio data. Make sure that the audio data is sampled at 16 kHz (16,000 Hz).
For this example we will use [a streamable version of the Multilingual LibriSpeech (MLS) dataset](https://huggingface.co/datasets/multilingual_librispeech). It contains examples in 7 languages: `'german', 'dutch', 'french', 'spanish', 'italian', 'portuguese', 'polish'`.
Choose one of them.

In [None]:
import ipywidgets as widgets


SAMPLE_LANG = widgets.Dropdown(
    options=['german', 'dutch', 'french', 'spanish', 'italian', 'portuguese', 'polish'],
    value='german',
    description='Dataset language:',
    disabled=False,
)

SAMPLE_LANG

Specify `streaming=True` to avoid downloading the entire dataset.

In [None]:
from datasets import load_dataset


mls = load_dataset("facebook/multilingual_librispeech", SAMPLE_LANG.value, split="test", streaming=True)
mls = iter(mls)  # create an iterator over the streamed dataset

example = next(mls)  # get one example

Each example is a dictionary that contains the audio data and its text transcription.

In [None]:
print(example)  # look at structure

In [None]:
import IPython.display as ipd

print(example['text'])
ipd.Audio(example['audio']['array'], rate=16_000)
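
The MLS samples are used here at 16 kHz, but your own recordings may have a different sampling rate and would need resampling first. The block below is only a minimal sketch of the idea, using linear interpolation in NumPy; `resample_linear` is a hypothetical helper, not part of any library used in this notebook. Linear interpolation applies no anti-aliasing filter, so a dedicated resampler such as `librosa.resample` is the better choice for real audio.

```python
import numpy as np

TARGET_RATE = 16_000  # sampling rate expected by the MMS models, in Hz


def resample_linear(audio: np.ndarray, orig_rate: int, target_rate: int = TARGET_RATE) -> np.ndarray:
    """Naively resample a mono signal with linear interpolation (hypothetical helper)."""
    if orig_rate == target_rate:
        return audio.astype(np.float32)
    duration = len(audio) / orig_rate
    n_target = int(round(duration * target_rate))
    # Time stamps (in seconds) of the original and of the resampled samples
    t_orig = np.arange(len(audio)) / orig_rate
    t_target = np.arange(n_target) / target_rate
    return np.interp(t_target, t_orig, audio).astype(np.float32)


# One second of a 440 Hz tone "recorded" at 44.1 kHz
tone = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)
resampled = resample_linear(tone, orig_rate=44_100)
print(resampled.shape)
```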

<a name='3'></a>
## Language Identification (LID) 

<a name='4'></a>
### Download pretrained model and processor
Several LID models are available, differing in the number of languages they can recognize: 126, 256, 512, 1024, 2048, and 4017. We will use the 126-language model.

In [None]:
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor

model_id = "facebook/mms-lid-126"

lid_processor = AutoFeatureExtractor.from_pretrained(model_id)
lid_model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

<a name='5'></a>
### Use the original model to run an inference

In [None]:
inputs = lid_processor(example['audio']['array'], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = lid_model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = lid_model.config.id2label[lang_id]
print(detected_lang)
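
The cell above keeps only the single best class. When the prediction looks doubtful, it can help to turn the logits into probabilities and inspect the runners-up. The sketch below shows this with NumPy on toy values; `top_k_languages`, `toy_logits` and `toy_id2label` are illustrative names, and the numbers are not real model output.

```python
import numpy as np


def top_k_languages(logits: np.ndarray, id2label: dict, k: int = 3) -> list:
    """Return the k most probable (label, probability) pairs for one utterance."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()  # softmax
    best = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(id2label[int(i)], float(probs[i])) for i in best]


# Toy values standing in for `outputs[0]` and `lid_model.config.id2label`
toy_id2label = {0: "deu", 1: "fra", 2: "nld", 3: "pol"}
toy_logits = np.array([1.0, 3.0, 0.5, -1.0])

top2 = top_k_languages(toy_logits, toy_id2label, k=2)
for label, prob in top2:
    print(f"{label}: {prob:.3f}")
```

With the real model, the same function could be called as `top_k_languages(outputs[0].numpy(), lid_model.config.id2label)`.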

<a name='6'></a>
### Convert to OpenVINO IR model and run an inference

In [None]:
MAX_SEQ_LENGTH = 30480  # number of audio samples in the dummy input used for tracing

# Dummy waveform that lets ov.convert_model trace the PyTorch model
input_values = torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float)
lid_model_xml_path = Path('models/ov_lid_model.xml')

# Convert only once; later runs reuse the IR files saved on disk
if not lid_model_xml_path.exists():
    lid_model_xml_path.parent.mkdir(parents=True, exist_ok=True)
    converted_model = ov.convert_model(lid_model, example_input={'input_values': input_values})
    ov.save_model(converted_model, lid_model_xml_path)

Compile the model. Select a device from the dropdown list to run inference with OpenVINO.

In [None]:
core = ov.Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

In [None]:
compiled_lid_model = core.compile_model(lid_model_xml_path, device_name=device.value)

Now it is possible to run an inference. 

In [None]:
def detect_lang(compiled_model, audio_data):
    inputs = lid_processor(audio_data, sampling_rate=16_000, return_tensors="pt")
    
    outputs = compiled_model(inputs['input_values'])[0]
    
    lang_id = torch.argmax(torch.from_numpy(outputs), dim=-1)[0].item()
    detected_lang = lid_model.config.id2label[lang_id]
    
    return detected_lang

In [None]:
detect_lang(compiled_lid_model, example['audio']['array'])

Let's check another language.

In [None]:
SAMPLE_LANG = widgets.Dropdown(
    options=['german', 'dutch', 'french', 'spanish', 'italian', 'portuguese', 'polish'],
    value='french',
    description='Dataset language:',
    disabled=False,
)

SAMPLE_LANG

In [None]:
mls = load_dataset("facebook/multilingual_librispeech", SAMPLE_LANG.value, split="test", streaming=True)
mls = iter(mls)

example = next(mls)
print(example['text'])
ipd.Audio(example['audio']['array'], rate=16_000)

In [None]:
detect_language_id = detect_lang(compiled_lid_model, example['audio']['array'])
print(detect_language_id)
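
The MMS labels are ISO 639-3 language codes, while the MLS dataset configurations use English language names. To check a prediction against the dataset language programmatically, a small lookup table is enough. The names `MLS_TO_ISO` and `is_expected_language` below are introduced only for this sketch.

```python
# MLS configuration names mapped to ISO 639-3 codes
MLS_TO_ISO = {
    "german": "deu",
    "dutch": "nld",
    "french": "fra",
    "spanish": "spa",
    "italian": "ita",
    "portuguese": "por",
    "polish": "pol",
}


def is_expected_language(dataset_language: str, detected_code: str) -> bool:
    """Check whether the detected code matches the dataset configuration."""
    return MLS_TO_ISO[dataset_language] == detected_code


print(is_expected_language("french", "fra"))  # True
print(is_expected_language("french", "deu"))  # False
```

In the notebook this would be called as `is_expected_language(SAMPLE_LANG.value, detect_language_id)`.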

<a name='7'></a>
## Automatic Speech Recognition (ASR)

<a name='8'></a>
### Download pretrained model and processor
Download the pretrained model and processor. By default, MMS loads adapter weights for English. To load adapter weights for another language, specify `target_lang=<your-chosen-target-lang>` together with `ignore_mismatched_sizes=True`. The `ignore_mismatched_sizes=True` keyword allows the language model head to be resized according to the vocabulary of the specified language. Similarly, the processor should be loaded with the same target language. 
It is also possible to change the target language later.

In [None]:
from transformers import Wav2Vec2ForCTC, AutoProcessor
model_id = "facebook/mms-1b-all"

asr_processor = AutoProcessor.from_pretrained(model_id)
asr_model = Wav2Vec2ForCTC.from_pretrained(model_id)

You can look at all supported languages:

In [None]:
asr_processor.tokenizer.vocab.keys()

Switch the language adapter by calling `load_adapter()` on the model and `set_target_lang()` on the tokenizer. Pass the target language as the argument: here, the `detect_language_id` variable that holds the language detected in the previous step.

In [None]:
asr_processor.tokenizer.set_target_lang(detect_language_id)
asr_model.load_adapter(detect_language_id)

<a name='9'></a>
### Use the original model to run an inference

In [None]:
inputs = asr_processor(example['audio']['array'], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = asr_model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = asr_processor.decode(ids)
print(transcription)
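
The ASR model is trained with CTC, so the frame-wise `argmax` ids still contain repeated symbols and blank tokens; turning them into text is the job of `asr_processor.decode`. The sketch below illustrates the greedy CTC rule (collapse consecutive repeats, then drop blanks) on a toy vocabulary. `ctc_greedy_decode` and `toy_vocab` are hypothetical names, and the real tokenizer additionally handles details such as word delimiters.

```python
import numpy as np


def ctc_greedy_decode(logits: np.ndarray, vocab: dict, blank_id: int = 0) -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    ids = logits.argmax(axis=-1)
    chars = []
    prev = None
    for i in ids:
        i = int(i)
        # Emit a symbol only when it differs from the previous frame and is not blank
        if i != prev and i != blank_id:
            chars.append(vocab[i])
        prev = i
    return "".join(chars)


# Toy vocabulary; id 0 plays the role of the CTC blank token
toy_vocab = {0: "<pad>", 1: "h", 2: "i"}
# One-hot "logits" for the frame sequence h h _ i i _ i
frame_ids = [1, 1, 0, 2, 2, 0, 2]
toy_logits = np.eye(len(toy_vocab))[frame_ids]

decoded = ctc_greedy_decode(toy_logits, toy_vocab)
print(decoded)  # hii
```

The blank between the two runs of `i` is what allows a genuinely doubled letter to survive the collapsing step.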

<a name='10'></a>
### Convert to OpenVINO IR model and run an inference
Convert the model to the OpenVINO IR format directly with the `ov.convert_model` function, and serialize the result with `ov.save_model`. For convenience, we wrap these steps in a function that also compiles the model for the selected device.

In [None]:
MAX_SEQ_LENGTH = 30480


def get_asr_model(detected_language_id):
    asr_model_xml_path = Path(f'models/ov_asr_{detected_language_id}_model.xml')

    # Always switch the tokenizer so that decoding uses the vocabulary of this
    # language, even when the converted model is already on disk
    asr_processor.tokenizer.set_target_lang(detected_language_id)

    if not asr_model_xml_path.exists():
        # Load the language adapter into the PyTorch model before conversion
        asr_model.load_adapter(detected_language_id)

        input_values = torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float)
        asr_model_xml_path.parent.mkdir(parents=True, exist_ok=True)
        converted_model = ov.convert_model(asr_model, example_input={'input_values': input_values})
        ov.save_model(converted_model, asr_model_xml_path)

    compiled_asr_model = core.compile_model(asr_model_xml_path, device_name=device.value)

    return compiled_asr_model


compiled_asr_model = get_asr_model(detect_language_id)
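
`get_asr_model` reuses the IR files on disk, but it compiles the model again on every call. If the same languages come up repeatedly, an in-memory cache can avoid that. The sketch below shows the pattern with `functools.lru_cache` and a stand-in loader; `get_model_cached` and `load_calls` are hypothetical names, and the returned string merely represents a compiled model.

```python
from functools import lru_cache

load_calls = []  # records which languages were actually "compiled"


@lru_cache(maxsize=4)  # keep at most four models in memory
def get_model_cached(language_id: str) -> str:
    """Stand-in for an expensive convert-and-compile step."""
    load_calls.append(language_id)
    return f"compiled-model-for-{language_id}"


get_model_cached("fra")
get_model_cached("deu")
get_model_cached("fra")  # served from the cache, no second load
print(load_calls)  # ['fra', 'deu']
```

The `maxsize` limit matters here because each compiled ASR model occupies a significant amount of memory.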

Run an inference.

In [None]:
def recognize_audio(compiled_model, src_audio):
    inputs = asr_processor(src_audio, sampling_rate=16_000, return_tensors="pt")
    outputs = compiled_model(inputs['input_values'])[0]
    
    ids = torch.argmax(torch.from_numpy(outputs), dim=-1)[0]
    transcription = asr_processor.decode(ids)

    return transcription


transcription = recognize_audio(compiled_asr_model, example['audio']['array'])
print(transcription)

<a name='11'></a>
## Interactive demo with Gradio
In this demo you can try your own examples. The models expect audio sampled at 16 kHz (16,000 Hz).

In [None]:
import gradio as gr
import librosa


# Use the top-level components; the gr.inputs / gr.outputs namespaces are deprecated
src_audio = gr.Audio(label="Source Audio", type='filepath')
outputs = [
    gr.Textbox(label="Detected language ID"),
    gr.Textbox(label="Transcription"),
]
title = 'MMS with Gradio'
description = 'Gradio Demo for MMS and OpenVINO™. Upload a source audio file, then click the "Submit" button to detect the language ID and get a transcription. The models expect audio sampled at 16 kHz. If this language has not been used before, it may take some time to prepare the ASR model.'


def infer(src_audio_path):
    # librosa resamples to 22050 Hz by default, so request 16 kHz explicitly
    src_audio, _ = librosa.load(src_audio_path, sr=16_000)
    detected_lang_id = detect_lang(compiled_lid_model, src_audio)

    yield detected_lang_id, None
  
    compiled_asr_model = get_asr_model(detected_lang_id)
    transcription = recognize_audio(compiled_asr_model, src_audio)

    yield detected_lang_id, transcription


demo = gr.Interface(infer, src_audio, outputs, title=title, description=description)

try:
    demo.queue().launch(debug=True)
except Exception:
    demo.queue().launch(share=True, debug=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
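
The `infer` function above is a generator: it first yields the detected language with an empty transcription, and then yields both values once recognition has finished, so the language can be shown while the ASR model is still being prepared. The plain-Python sketch below mimics this pattern without Gradio; `fake_detect`, `fake_transcribe` and `infer_stub` are hypothetical stand-ins for the real functions.

```python
def fake_detect(audio) -> str:
    """Stand-in for the fast language identification step."""
    return "fra"


def fake_transcribe(audio, lang: str) -> str:
    """Stand-in for the slower recognition step."""
    return "bonjour"


def infer_stub(audio):
    lang = fake_detect(audio)
    yield lang, None  # partial result: language only

    text = fake_transcribe(audio, lang)
    yield lang, text  # final result: language and transcription


# Each yielded tuple corresponds to one update of the two output boxes
updates = list(infer_stub(audio=None))
for update in updates:
    print(update)
```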