# Inference code for Whisper (example with Whisper Medium in Portuguese)

- Autor: [Pierre GUILLOU](https://www.linkedin.com/in/pierreguillou)
- Date: 09/12/2022
- Credit: this notebook copy most of the code and text of the notebook [Whisper Large inference in 8-bit mode](https://colab.research.google.com/drive/1EMOwwfm1V1fHxH7eT1LLg7yBjhTooB6j?usp=sharing) from [Vaibhav Srivastav](https://www.linkedin.com/in/vaibhavs10/)
- Blog post: [Speech-to-Text & IA | Transcreva qualquer áudio para o português com o Whisper (OpenAI)... sem nenhum custo!](https://medium.com/@pierre_guillou/speech-to-text-ia-transcreva-qualquer-%C3%A1udio-para-o-portugu%C3%AAs-com-o-whisper-openai-sem-ad0c17384681)

In [None]:
# check GPU or CPU
!nvidia-smi

## Language & Whisper model

In [None]:
# whisper model
lang = "pt"
model_name = "pierreguillou/whisper-medium-portuguese"

We'll first install the necessary packages. We need ffmpeg to decode mp3 files from the CV11 dataset and transformers.

In [None]:
%%capture
!add-apt-repository -y ppa:jonathonf/ffmpeg-4 && apt update && apt install -y ffmpeg
!pip install --quiet datasets git+https://github.com/huggingface/transformers evaluate huggingface_hub jiwer 

Since we will be running inference on CV11 dataset, we'd need to authenticate ourselves (since, CV11 requires accepting its Terms and Conditions).

In [None]:
!git config --global credential.helper store
from huggingface_hub import login

login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


To reduce the memory and time overhead, we'll load the dataset in streaming fashion. During the time of inference we'll stream one data point at a time. This is specially useful for larger datasets.

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "mozilla-foundation/common_voice_11_0", lang, revision="streaming", split="test", streaming=True, use_auth_token=True
)

Downloading builder script:   0%|          | 0.00/8.30k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/12.2k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/60.9k [00:00<?, ?B/s]

Loading the model and processor.

In [6]:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# load model and processor
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

Downloading:   0%|          | 0.00/185k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/830 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/494k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.11k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.06k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.01k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.06G [00:00<?, ?B/s]

Preprocess the dataset to be sampled at 16KHz, since Whisper expects 16KHz input.

In [7]:
from datasets import Audio

num_audios = 10
dataset = dataset.take(num_audios)

# resample to 16kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

Voila! Time to run inference loop!

In [8]:
from time import perf_counter
total = 0

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

for data in dataset:
  
    inputs = processor.feature_extractor(data["audio"]["array"], return_tensors="pt", sampling_rate=16_000).input_features.to(device)
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")

    start = perf_counter()
    predicted_ids = model.generate(inputs, forced_decoder_ids=forced_decoder_ids)
    text = processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)[0]
    end = perf_counter()

    diff = end - start
    total += diff
    print(text)

print(f"\ntotal prediction time for 10 audios: {round(total,2)}s")
print(f"average prediction time: {round(total/num_audios,2)}s")

Reading metadata...: 8693it [00:01, 7015.14it/s]


Cheque match
É necessário fornecer, quando formulado, uma vazação.
Buteá
Se esta primeira condição for satisfeita, sensata, forte e ágil.
Most of us tempos are pistolas.
For more information, visit the site of the Fedora Project.
Bem, digitalizar Don Quixote é um passo para levar a cultura a todos.
Arivém
Ele é advogado do comando vermelho.
Curuçá

total prediction time for 10 audios: 18.28s
average prediction time: 1.83s


# END