<a href="https://colab.research.google.com/github/psyrtsov/semse/blob/master/Infer_Whisper_%F0%9F%A4%97transformers_edition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A whirlwind tour of Whispering via 🤗transformers

by: [Vaibhav (VB) Srivastav](https://twitter.com/reach_vb)

There are multiple ways to run inference with a Whisper model, depending on your use case. We'll take a quick look at each of them.

### Set up the environment

Let's begin by installing the packages we'll need to process audio datasets. We require the Unix package `ffmpeg` version 4. We'll also need `transformers` and some other popular Hugging Face libraries like `datasets` and `huggingface_hub` for our ASR pipeline.

*Note*: Do make sure to select a GPU runtime if you haven't already!

In [None]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4 && apt update && apt install -y ffmpeg
!pip install --quiet datasets git+https://github.com/huggingface/transformers evaluate huggingface_hub pytube


We'll test our ASR pipeline on the Common Voice 11 (CV11) dataset. Since the CV11 dataset requires us to accept its terms and conditions, we need to authenticate via `huggingface_hub`.

Make sure to accept the T&C before you run the next cell: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0

In [None]:
!git config --global credential.helper store
from huggingface_hub import login

login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


## Inference via `pipeline`

The `pipeline` class in `transformers` provides a neat abstraction over a data preprocessor, decoder and postprocessor. It comes with all the bells and whistles included, plus extras like long-form transcription, which can help you go the extra mile with Whisper models.

Best part, we can instantiate the entire pipeline with just one line of code.

In [None]:
from transformers import pipeline

whisper_asr = pipeline(
    "automatic-speech-recognition", model="openai/whisper-medium"
)

To test our pipeline, let's stream a record from the Common Voice 11 dataset and perform zero-shot inference with our pipeline.

We'll load the dataset in streaming mode so that we don't have to wait for the entire dataset to download to our local disk and can start running inference at lightning-fast speed! ⚡️

In [None]:
from datasets import load_dataset

common_voice_es = load_dataset("mozilla-foundation/common_voice_11_0", "es", revision="streaming", split="test", streaming=True, use_auth_token=True)

The CV11 dataset is sampled at 48 kHz, while the Whisper model expects inputs sampled at 16 kHz. To fix that, we'll cast the audio column to a 16 kHz sampling rate.

Note: This operation takes place on-the-fly when we stream a record. This helps prototype faster!

In [None]:
from datasets import Audio

common_voice_es = common_voice_es.cast_column("audio", Audio(sampling_rate=16000))
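To get a feel for what going from 48 kHz to 16 kHz means, here is a deliberately crude sketch that averages every three samples into one. This is only an illustration: `downsample_by_averaging` is a hypothetical helper, and the resampling performed by `datasets` uses proper filtering rather than plain averaging.

```python
def downsample_by_averaging(samples, factor=3):
    """Crude illustrative downsampler: average each group of `factor` samples.

    NOTE: hypothetical helper for intuition only; real resamplers apply a
    proper low-pass filter instead of a plain average.
    """
    usable = len(samples) - len(samples) % factor  # drop the incomplete tail group
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


# One second of silence at 48 kHz becomes one second at 16 kHz.
one_second_48k = [0.0] * 48_000
one_second_16k = downsample_by_averaging(one_second_48k)
print(len(one_second_48k), "->", len(one_second_16k))
```

The takeaway is simply that the resampled clip has a third as many samples for the same duration.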

Great! We have the dataset ready to stream records. Let's check out the first sample:

In [None]:
print(next(iter(common_voice_es)))

Reading metadata...: 15520it [00:00, 71629.10it/s]


{'client_id': '0003b969350f5308dc7347c574bc291834f38fdd92a2863b6059349b4be9738133855f3c49c0a6f7b7b01aff4be4573a1b066fbc7e1c26b1b9383c9d787acc5d', 'path': 'common_voice_es_19698530.mp3', 'audio': {'path': 'common_voice_es_19698530.mp3', 'array': array([ 0.0000000e+00, -8.5687837e-13,  6.4558742e-13, ...,
        2.2153064e-04,  1.2165759e-04, -7.3081144e-05], dtype=float32), 'sampling_rate': 48000}, 'sentence': 'Habita en aguas poco profundas y rocosas.', 'up_votes': 2, 'down_votes': 1, 'age': 'thirties', 'gender': 'male', 'accent': 'México', 'locale': 'es', 'segment': ''}
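One thing worth knowing: every `next(iter(common_voice_es))` call builds a fresh iterator, so in this notebook it hands back the same first record each time. If you want to look at a handful of records instead, `itertools.islice` over a single iterator is one option. The sketch below uses a hypothetical `fake_stream` generator as a stand-in for the streamed dataset, so it runs without downloading anything.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    for i in range(1_000_000):
        yield {"sentence": f"record {i}"}


# Take the first three records without materializing the rest of the stream.
first_three = list(islice(fake_stream(), 3))
print([record["sentence"] for record in first_three])
```

With the real dataset you would pass `common_voice_es` to `islice` in place of `fake_stream()`.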


Brilliant! Now we can listen to the audio and print the reference text:

In [None]:
import IPython.display as ipd

sample = next(iter(common_voice_es))
audio = sample["audio"]

print(sample["sentence"])
ipd.Audio(data=audio["array"], autoplay=True, rate=audio["sampling_rate"])

Reading metadata...: 15520it [00:00, 63824.22it/s]


Habita en aguas poco profundas y rocosas.


On to the fun part, let's transcribe this audio via our `whisper_asr` pipeline.

To make sure we use the correct decoder prompt IDs, we'll force the decoder to start from the `es`-specific IDs while performing the transcription task.

In [None]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

whisper_asr.model.config.forced_decoder_ids = (
    whisper_asr.tokenizer.get_decoder_prompt_ids(
        language="es", task="transcribe"
    )
)
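For intuition, `get_decoder_prompt_ids` returns a list of `(position, token_id)` pairs that pin the first decoder steps to the language and task tokens. The sketch below mirrors that structure with token strings instead of integer IDs. `decoder_prompt_tokens` is a hypothetical helper written for illustration and is not part of `transformers`.

```python
def decoder_prompt_tokens(language="es", task="transcribe", timestamps=False):
    """Hypothetical helper: list the special tokens forced at the start of decoding.

    Position 0 is taken by <|startoftranscript|>, so forced positions start at 1.
    """
    tokens = [f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return list(enumerate(tokens, start=1))


for position, token in decoder_prompt_tokens("es", "transcribe"):
    print(position, token)
```

Swapping `task="transcribe"` for `task="translate"` changes only the second forced token, which is why the same checkpoint can do both jobs.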

And... perfect! Running the pipeline on our sample gives a transcription that closely matches the reference. Woohoo!

In [None]:
whisper_asr(next(iter(common_voice_es))["audio"]["array"])["text"]

Reading metadata...: 15520it [00:00, 35237.19it/s]


' Habitan aguas poco profundas y rocosas.'
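If you'd like to put a number on "close", word error rate (WER) is the usual metric for ASR. The `evaluate` library installed earlier ships a WER metric. The sketch below is a small pure Python version with a deliberately simple normalizer (lowercase, strip punctuation), so treat it as an approximation rather than a reference implementation. `normalize` and `word_error_rate` are hypothetical helpers, and the reference is assumed to be non-empty.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and split into words (simple on purpose)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Habita en aguas poco profundas y rocosas."
hypothesis = " Habitan aguas poco profundas y rocosas."
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

For this sample the model swaps "Habita" for "Habitan" and drops "en", which is two errors over seven reference words.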

The ASR `pipeline` comes with certain frills attached. One of the most widely used is long-form transcription. By default, the Whisper model only processes audio in windows of up to 30 seconds.

With the `pipeline` object we can auto-magically chunk long audio files and generate reasonably accurate transcriptions.

To make the pipeline perform long-form transcription, we'll need to reload it with an additional chunking parameter: `chunk_length_s`.

This will allow us to chunk the audio, transcribe each chunk and then stitch the chunked transcriptions together into one unified transcription.

In [None]:
whisper_asr = pipeline(
    "automatic-speech-recognition", 
    model="openai/whisper-medium", 
    chunk_length_s=30
)
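To make the chunking idea concrete, here is a minimal sketch of how a long recording could be split into overlapping 30-second windows. The 5-second overlap on each side and the `chunk_bounds` helper are assumptions for illustration. The real pipeline also merges the overlapping text between chunks, which this sketch leaves out.

```python
def chunk_bounds(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) times in seconds for overlapping chunks.

    Hypothetical helper: consecutive chunks share 2 * stride_s seconds of audio
    so that words cut at a boundary appear whole in at least one chunk.
    """
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than twice stride_s")
    step = chunk_s - 2 * stride_s  # how far the window advances each time
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


# A 70-second recording split into 30-second windows.
for start, end in chunk_bounds(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Each window fits within Whisper's 30-second limit, and the overlap gives the stitching step some shared context to align on.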

To test this, let's try and transcribe a Spanish YouTube video.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("mlBZeNKCbSI")  # YouTubeVideo expects the video ID, not the full URL

We'll use `pytube` to download the video and fetch its corresponding audio file.

In [None]:
import pytube as pt

yt = pt.YouTube("https://www.youtube.com/watch?v=mlBZeNKCbSI")
stream = yt.streams.filter(only_audio=True)[0]
stream.download(filename="audio.mp3")

'/content/audio.mp3'

Brilliant! Now we'll pass along this audio file to our `whisper_asr` pipeline to extract the transcriptions!

In [None]:
whisper_asr("audio.mp3")["text"]



' Hola, hola, hola, ¿cómo estás? Hola, hola, hola, ¿cómo estás? Estoy bien, estoy estupendo Estoy maravilloso Estoy bien, estoy estupendo maravillosocómo estás?'

Worked like a charm! You can now use the `pipeline` for longer transcriptions :)

## Processor + Model

Often, it is desirable to have more fine-grained control over the generation. In those cases, it is better to use the processor plus the Whisper model provided in `transformers`.

Let's load up the `WhisperForConditionalGeneration` and `WhisperProcessor` classes. The processor helps us convert the input speech into log-mel spectrograms.

WhisperForConditionalGeneration takes in the input from the Processor and performs a forward pass on the Whisper model.

In [None]:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

torch.cuda.empty_cache()

device = "cuda" if torch.cuda.is_available() else "cpu"

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
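To give a rough idea of the shapes involved: Whisper works on fixed 30-second windows at 16 kHz, so shorter clips are padded and longer ones are cut before the spectrogram is computed. The sketch below shows only that padding and trimming arithmetic, not the spectrogram itself. The hop length of 160 samples is my understanding of Whisper's standard configuration, so treat it as an assumption, and `pad_or_trim` is a hypothetical helper.

```python
SAMPLING_RATE = 16_000   # Hz, as used in the cast above
CHUNK_SECONDS = 30       # Whisper's fixed input window
HOP_LENGTH = 160         # assumed hop between spectrogram frames, in samples
N_SAMPLES = SAMPLING_RATE * CHUNK_SECONDS


def pad_or_trim(samples, length=N_SAMPLES):
    """Hypothetical helper: zero-pad or cut a waveform to exactly `length` samples."""
    samples = list(samples[:length])
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLING_RATE)   # a 5-second clip
padded = pad_or_trim(short_clip)
n_frames = N_SAMPLES // HOP_LENGTH

print(len(short_clip), "->", len(padded), "samples,", n_frames, "frames")
```

This is why a short Common Voice clip and a full 30-second chunk end up as input features of the same size.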

After loading the model and the processor, we pass our input data from `datasets` through both. As with the pipeline, we force the model to use the `es` language and the `transcribe` task instead of `translate`.

In [None]:
# Move the input features to the same device as the model (falls back to CPU without a GPU)
inputs = processor.feature_extractor(next(iter(common_voice_es))["audio"]["array"], return_tensors="pt", sampling_rate=16_000).input_features.to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="es", task="transcribe")

Reading metadata...: 15520it [00:00, 96336.73it/s]


Now for the easy part: we ask the model to generate from the inputs the processor returned.

In [None]:
predicted_ids = model.generate(inputs, max_length=448, forced_decoder_ids=forced_decoder_ids)
processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)[0]

' Habitan aguas poco profundas y rocosas.'

Yay! We can see that the output from the processor + model matches the pipeline output. Great!