# A whirlwind tour of Whispering via 🤗transformers

by: [Vaibhav (VB) Srivastav](https://twitter.com/reach_vb)

There are multiple ways of running inference with a Whisper model, depending on your specific use case. We'll take a quick look at each of them.

### Set up the environment

Let's begin by installing the packages we'll need to process audio datasets. We require the Unix package `ffmpeg` version 4. We'll also need `transformers` and some other popular Hugging Face libraries like `datasets` and `huggingface_hub` for our ASR pipeline.

*Note*: Do make sure to select a GPU runtime if you haven't already!

In [1]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4 && apt update && apt install -y ffmpeg
!pip install --quiet datasets git+https://github.com/huggingface/transformers evaluate huggingface_hub pytube


We'll test our ASR pipeline on the Common Voice 11 (CV11) dataset. Since the CV11 dataset requires us to accept its terms and conditions, we need to authenticate via `huggingface_hub`.

Make sure to accept the T&C before you run the next cell: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0

In [2]:
!git config --global credential.helper store
from huggingface_hub import login

login()


## Inference via `pipeline`

The `pipeline` API within transformers provides a neat abstraction over a data preprocessor, decoder and post-processor. It comes with all the bells and whistles included, such as long-form transcription, which can help you go the extra mile with Whisper models.

Best of all, we can instantiate the entire pipeline with a single call.

In [3]:
from transformers import pipeline

whisper_asr = pipeline(
    "automatic-speech-recognition", model="openai/whisper-medium"
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.99k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/3.06G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/3.69k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/805 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/836k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

Downloading (…)main/normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.08k [00:00<?, ?B/s]

Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Downloading (…)rocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

To test our pipeline, let's stream a record from the Common Voice 11 dataset and perform zero-shot inference with our pipeline.

We'll load the dataset in streaming mode so that we don't have to wait for the entire dataset to download to our local disk and can start running inference at lightning-fast speed! ⚡️

In [7]:
from datasets import load_dataset

# Note: despite the `_es` suffix in the variable name, this loads the German ("de") test split.
common_voice_es = load_dataset("mozilla-foundation/common_voice_11_0", "de", revision="streaming", split="test", streaming=True, use_auth_token=True)

The CV11 dataset is sampled at 48 kHz, while the Whisper model expects inputs sampled at 16 kHz. To fix that, we'll cast the audio column to a 16 kHz sampling rate.

Note: This operation takes place on-the-fly when we stream a record. This helps prototype faster!

In [8]:
from datasets import Audio

common_voice_es = common_voice_es.cast_column("audio", Audio(sampling_rate=16000))
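
To get a feel for what going from 48 kHz to 16 kHz means for the raw samples, here is a minimal sketch of a naive linear-interpolation resampler. The helper `resample_linear` is hypothetical and is not what `datasets` runs under the hood; real resamplers also apply an anti-aliasing filter, which this illustration skips.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Naive linear-interpolation resampler (illustration only, no anti-alias filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * target_sr / orig_sr)
    ratio = orig_sr / target_sr  # how many input samples one output sample spans
    out = []
    for i in range(n_out):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Blend the two neighbouring input samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# One second of fake audio at 48 kHz (a simple ramp, so values are easy to check).
one_second_48k = [float(i) for i in range(48_000)]
one_second_16k = resample_linear(one_second_48k, 48_000, 16_000)

print(len(one_second_48k), "->", len(one_second_16k))
```

The clip keeps its duration, but it is now described by a third as many samples, which is the rate Whisper was trained on.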

Great! We have the dataset ready to stream records. Let's check out the first sample:

In [9]:
print(next(iter(common_voice_es)))

Reading metadata...: 16082it [00:00, 30313.75it/s]


{'client_id': '0052c07533a6976233ad5926d950b523002c4d8cdd9ae8726dbfec385951bd22aa707a742c49afe20c7d6cb9515dbaddac5b4d6fe8ebddcfbec46a2d3180a3a1', 'path': 'common_voice_de_17922420.mp3', 'audio': {'path': 'common_voice_de_17922420.mp3', 'array': array([-3.55271368e-14,  4.79616347e-14, -2.13162821e-14, ...,
        1.40587009e-09, -3.40389184e-09,  1.92177829e-10]), 'sampling_rate': 16000}, 'sentence': 'Zieht euch bitte draußen die Schuhe aus.', 'up_votes': 2, 'down_votes': 0, 'age': '', 'gender': '', 'accent': '', 'locale': 'de', 'segment': ''}
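
A quick aside on `next(iter(...))`: every call creates a fresh iterator, so it starts again from the first record each time. The toy class below is a hypothetical stand-in for a streamed dataset (it is not the `datasets` implementation) that sketches this behaviour, along with `itertools.islice` for taking the first few records instead.

```python
from itertools import islice


class StreamingRecords:
    """Toy stand-in for a streamed dataset: records are produced lazily on iteration."""

    def __init__(self, sentences):
        self._sentences = sentences
        self.reads = 0  # counts how many records were actually produced

    def __iter__(self):
        for sentence in self._sentences:
            self.reads += 1
            yield {"sentence": sentence}


records = StreamingRecords(["first", "second", "third"])

# Each iter() call restarts the stream, so both lookups return the first record.
first = next(iter(records))
again = next(iter(records))
reads_after_two_lookups = records.reads

# islice pulls only as many records as requested from a single pass.
batch = list(islice(records, 2))

print(first == again)
print([record["sentence"] for record in batch])
```

This is why the same German sentence shows up whenever the first sample is fetched in the cells that follow.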


Brilliant! Now we can listen to the audio and print the reference text:

In [10]:
import IPython.display as ipd

sample = next(iter(common_voice_es))
audio = sample["audio"]

print(sample["sentence"])
ipd.Audio(data=audio["array"], autoplay=True, rate=audio["sampling_rate"])

Reading metadata...: 16082it [00:00, 48586.54it/s]


Zieht euch bitte draußen die Schuhe aus.


On to the fun part: let's transcribe this audio via our `whisper_asr` pipeline.

To make sure the model uses the correct language and task tokens, we'll set `forced_decoder_ids` to the German (`de`) prompt ids for the `transcribe` task.

In [12]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

whisper_asr.model.config.forced_decoder_ids = (
    whisper_asr.tokenizer.get_decoder_prompt_ids(
        language="de", task="transcribe"
    )
)
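
As a rough picture of what the forced ids above encode, the sketch below builds the decoder prompt using token strings rather than real vocabulary ids. The helpers `decoder_prompt` and `forced_positions` are hypothetical, and the layout (a start token followed by language, task and no-timestamps tokens, with forced positions counted from 1) is my understanding of the Whisper prompt format rather than something taken from this notebook.

```python
def decoder_prompt(language, task, timestamps=False):
    """Return the special tokens that open a Whisper decoding sequence (as strings)."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


def forced_positions(tokens):
    """Pair each token after the start token with the position it is forced at."""
    # Position 0 is the start token itself, so forced entries begin at position 1.
    return [(position, token) for position, token in enumerate(tokens) if position > 0]


prompt = decoder_prompt(language="de", task="transcribe")
forced = forced_positions(prompt)

print(prompt)
print(forced)
```

Swapping `task="transcribe"` for `task="translate"` changes only the third token, which is what would steer the model towards producing English text instead.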

And... perfect! The transcription closely matches the reference, differing only in casing and punctuation. Woohoo!

In [13]:
whisper_asr(next(iter(common_voice_es))["audio"]["array"])["text"]

Reading metadata...: 16082it [00:00, 41955.25it/s]


' zieht euch bitte draußen die Schuhe aus'
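
To put a number on how close the prediction is to the reference, here is a minimal sketch of a word error rate (WER) computed before and after a simple text normalization. Both `normalize` and `word_error_rate` are hypothetical helpers written for illustration; they are simplified stand-ins, not the normalizer shipped with `transformers` or the metric from `evaluate`.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (a simple stand-in normalizer)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


reference = "Zieht euch bitte draußen die Schuhe aus."
hypothesis = " zieht euch bitte draußen die Schuhe aus"

raw_wer = word_error_rate(reference, hypothesis)
normalized_wer = word_error_rate(normalize(reference), normalize(hypothesis))

print(f"raw WER: {raw_wer:.2f}")
print(f"normalized WER: {normalized_wer:.2f}")
```

Without normalization, the capital `Z` and the trailing full stop each count as a word substitution; once both strings are normalized, the two transcriptions are identical.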

The ASR `pipeline` comes with certain frills attached. One of the most widely used is long-form transcription: by default, the Whisper model only handles audio segments of up to 30 seconds.

With the `pipeline` object we can auto-magically chunk long audio files and generate reasonably accurate transcriptions.

To make the pipeline perform long-form transcription, we'll need to reload it with an additional chunking parameter: `chunk_length_s`.

This allows us to split the audio into chunks, transcribe each chunk, and then stitch the chunked transcriptions together into one unified transcription.
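
The chunking idea can be sketched in a few lines. The helper `chunk_bounds` below is hypothetical and much simpler than what the pipeline actually does: it only computes overlapping window boundaries and leaves out the step that merges the overlapping transcriptions. The default overlap of one sixth of the chunk length on each side is an assumption on my part.

```python
def chunk_bounds(n_samples, sampling_rate=16_000, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6  # assumed default overlap per side
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    # Consecutive chunks overlap by 2 * stride samples.
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must be more than twice stride_length_s")
    bounds = []
    for start in range(0, n_samples, step):
        end = min(start + chunk_len, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # the last chunk reached the end of the audio
    return bounds


# A 70-second clip at 16 kHz.
bounds = chunk_bounds(n_samples=70 * 16_000)
for start, end in bounds:
    print(f"{start / 16_000:.0f}s -> {end / 16_000:.0f}s")
```

The overlap gives each chunk some shared context with its neighbours, so words cut at a boundary in one chunk appear whole in the next.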

In [14]:
whisper_asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    chunk_length_s=30
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


To test this, let's try and transcribe a Spanish YouTube video.

In [16]:
from IPython.display import YouTubeVideo
# YouTubeVideo expects the video ID rather than the full URL.
YouTubeVideo("mlBZeNKCbSI")

We'll use `pytube` to download the video's audio stream and save it as a local file.

In [17]:
import pytube as pt

yt = pt.YouTube("https://www.youtube.com/watch?v=mlBZeNKCbSI")
stream = yt.streams.filter(only_audio=True)[0]
stream.download(filename="audio.mp3")

'/content/audio.mp3'

Brilliant! Now we'll pass along this audio file to our `whisper_asr` pipeline to extract the transcriptions!

In [18]:
whisper_asr("audio.mp3")["text"]

' Hola, hola, hola, ¿cómo estás? Hola, hola, hola, ¿cómo estás? Estoy bien, estoy estupendo Estoy maravilloso Estoy bien estoy estupendo estoy maravilloso Hola, hola, hola, ¿cómo estás? Estoy cansado, estoy hambriento, no estoy muy bien. Hola, hola, hola, ¿cómo estás?'

Worked like a charm! You can now use the `pipeline` for longer transcriptions :)

## Processor + Model

Oftentimes, it is desirable to have more fine-grained control over generation. In those cases, it is better to use the processor plus the Whisper model provided in `transformers`.

Let's load `WhisperForConditionalGeneration` and `WhisperProcessor`. The processor converts the input speech into log-mel spectrograms.

`WhisperForConditionalGeneration` takes the input features from the processor and runs them through the Whisper model.
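
To make the shape of the processor's output more concrete, here is a back-of-the-envelope sketch. It assumes the usual Whisper feature-extractor settings for this checkpoint (16 kHz audio, 30-second windows, a hop of 160 samples and 80 mel bins); the helpers `pad_or_trim` and `feature_shape` are hypothetical and only do the bookkeeping, not the actual spectrogram computation.

```python
SAMPLING_RATE = 16_000  # samples per second expected by Whisper
CHUNK_SECONDS = 30      # every input is padded or trimmed to this duration
HOP_LENGTH = 160        # samples between consecutive spectrogram frames (assumed)
N_MELS = 80             # mel frequency bins (assumed for this checkpoint)


def pad_or_trim(samples, target_len=SAMPLING_RATE * CHUNK_SECONDS):
    """Zero-pad short clips and cut long ones so every input has the same length."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


def feature_shape(n_samples):
    """Shape of the log-mel feature matrix for a clip with n_samples samples."""
    return (N_MELS, n_samples // HOP_LENGTH)


clip = [0.0] * (5 * SAMPLING_RATE)  # a silent 5-second clip
padded = pad_or_trim(clip)

print(len(clip), "->", len(padded))
print(feature_shape(len(padded)))
```

Under these assumptions, every clip reaches the model as an 80 x 3000 feature matrix (plus a batch dimension), no matter how short the original recording was.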

In [19]:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

torch.cuda.empty_cache()

device = "cuda" if torch.cuda.is_available() else "cpu"

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-medium")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


After loading the model and the processor, we pass our input data from `datasets` through both. As with the pipeline, we force the model to decode German (`de`) with the `transcribe` task rather than `translate`.

In [20]:
inputs = processor.feature_extractor(next(iter(common_voice_es))["audio"]["array"], return_tensors="pt", sampling_rate=16_000).input_features.to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")

Reading metadata...: 16082it [00:00, 42078.86it/s]


Now for the easy part: we ask the model to generate from the input features returned by the processor.

In [21]:
predicted_ids = model.generate(inputs, max_length=448, forced_decoder_ids=forced_decoder_ids)
processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)[0]

' zieht euch bitte draußen die Schuhe aus'

Yay! We can see that the output from the processor + model matches the pipeline output. Great!