# Whisper Tutorial 

[Whisper](https://openai.com/research/whisper) is a speech recognition model, first released by OpenAI in September 2022. It uses a transformer-based, encoder-decoder architecture (paper [here](https://cdn.openai.com/papers/whisper.pdf)), and was trained using 680,000 hours of audio from a range of sources. Along with transcription, it can translate from over 50 languages to English - including Māori.

It is also available in a variety of model 'sizes', from `tiny` (39 million parameters) to `large` (1.5 billion). This allows the user to trade inference speed for accuracy, depending on available hardware and desired application. GPUs are required to run the `large` model efficiently.

Despite the name, OpenAI has been reluctant to release model weights for most of their other models (GPT-$x$, DALL-E etc.), preferring to serve them from behind their APIs; Whisper is an exception.

## Transcription using local model

Using Whisper is surprisingly easy. This tutorial will show you a couple of ways to do so.

First, we must install the necessary Python libraries. If you are running this notebook inside the Docker image provided, they will already be installed. If not, I have provided a `requirements.txt` file which will help you out. You will also need to install [FFmpeg](https://ffmpeg.org/). 

To install from the requirements file using pip, uncomment the shell command below and run it (this will take a while if PyTorch needs to be installed):


In [None]:
# ! pip install -r requirements.txt
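
Since FFmpeg is installed separately from the Python packages, it can be worth checking that it is actually visible before going further. The following is a minimal sketch using only the standard library; the helper name `ffmpeg_available` is my own and not part of any package.

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("FFmpeg not found: install it before loading audio files.")
```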

We will start by testing transcription with a simple audio clip. Here it is, with playback handily embedded in our notebook:

In [None]:
from IPython.display import Audio
Audio('data/example.wav')

Now run the following commands to load the `tiny` version of Whisper (you are welcome to try the `large` version if you like, but be warned it is 2.7GB and very slow on a plain old CPU!)

In [None]:
import whisper
tiny_model = whisper.load_model('tiny')

To transcribe our example WAV file, simply run the following command:

In [None]:
tiny_model.transcribe('data/example.wav')

As you will see, even the `tiny` model takes a while to run. In fact, let's time it. The result will vary depending on your machine. Note also that I am suppressing the annoying warning message by supplying the argument `fp16=False`.

In [None]:
%%timeit
result=tiny_model.transcribe('data/example.wav', fp16=False)
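
Note that `%%timeit` runs the cell several times and does not keep variables assigned inside it, so `result` will not be available afterwards. If you want a single timed run that also keeps the output, a rough alternative is sketched below. The helper `time_call` is hypothetical, and the example uses a stand-in workload so that it runs without a model loaded.

```python
import time

def time_call(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in this notebook the call could instead be
# time_call(tiny_model.transcribe, 'data/example.wav', fp16=False)
result, elapsed = time_call(sum, range(1_000_000))
print(f"Took {elapsed:.3f} s")
```

A single run is noisier than the average reported by `%%timeit`, so treat the number as indicative only.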

For comparison's sake, let's try a larger model, `small` (the third smallest, after `tiny` and `base`)...

In [None]:
small_model=whisper.load_model('small')
small_model.transcribe('data/example.wav', fp16=False)

In [None]:
%%timeit
result=small_model.transcribe('data/example.wav', fp16=False)

Not very promising. Thankfully for those of us without expensive hardware, OpenAI have recently begun to offer Whisper as part of their managed API service. This (allegedly) hosts version 2 of their `large` model.

## Transcription using API

In order to use the Whisper API (and other OpenAI products), you will need to create an account. More instructions can be found here: [https://platform.openai.com/signup/api-key](https://platform.openai.com/signup/api-key)

Once you have done that, generate an [API secret key](https://platform.openai.com/account/api-keys), copy it to the `credentials.template` file, save it as `credentials` (note: this file will be ignored by git!) and run the following:

In [None]:
import openai

# read the secret key, stripping any trailing newline added by the editor
with open('credentials', 'r') as f:
    secret_key = f.read().strip()

openai.api_key = secret_key
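
If you would rather not keep the key in a file, one option is to read it from an environment variable first and fall back to the file. This is only a sketch: the helper `load_api_key` is hypothetical, and I am assuming the conventional variable name `OPENAI_API_KEY`.

```python
import os
from pathlib import Path

def load_api_key(path="credentials", env_var="OPENAI_API_KEY"):
    """Return the API key from the environment if set, otherwise from a file."""
    key = os.environ.get(env_var)
    if key and key.strip():
        return key.strip()
    # Fall back to the credentials file; strip the trailing newline
    key = Path(path).read_text().strip()
    if not key:
        raise ValueError(f"No API key found in {path!r}")
    return key
```

It could then be used as `openai.api_key = load_api_key()`.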

Transcribing via API in Python is as simple as this:

In [None]:
# openai.Audio is the interface of openai package versions before 1.0
with open("data/example.wav", "rb") as audio_file:
    transcript = openai.Audio.transcribe("whisper-1", audio_file)
transcript

Note the result is the same as above. Before you think "this isn't worth it", let's move on to "real world data" that will make you appreciate the benefits of the `large` model...

## Downloading New Zealand English dataset (RNZ)

A recent project I worked on involved building a tool for transcribing [New Zealand English](https://en.wikipedia.org/wiki/New_Zealand_English). Speech recognition models have traditionally struggled with this dialect - compared to, say, General American English - not to mention the various borrowed words from te reo Māori ("kia ora", "whānau" etc.) and place names.

To get a feel for how well Whisper does, I have prepared some clips from RNZ's programme Morning Report. Each clip can be downloaded in MP3 format from the supplied URL, and the content between the specified start/end timestamps is transcribed (by a human) in the `Text` field.

Let's have a look at this data, using Pandas:

In [None]:
import pandas as pd
rnz_metadata = pd.read_csv('data/morning_report.csv')
rnz_metadata.head()

In [None]:
rnz_metadata.loc[1,'Text']

We can use the `wget` and `os` Python packages to download these MP3s directly and save them in a directory called `data/rnz/raw` (not tracked by git). Here is an example, using one of the rows above:

In [None]:
import wget
import os
rnz_data_dir = 'data/rnz'
rnz_raw_dir = os.path.join(rnz_data_dir, 'raw')
os.makedirs(rnz_raw_dir, exist_ok=True)
filepath=wget.download(rnz_metadata.loc[1,'URL'],out=rnz_raw_dir)
Audio(filepath)

If we want to extract only the transcribed segment, we can use the `pydub` library. We will use a helper function to extract raw millisecond timings from the timestamps supplied.

In [None]:
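# helper function: converts 'minutes:seconds.milliseconds' to milliseconds
# (assumes the part after the '.' is a whole number of milliseconds)
def timestamp_to_ms(timestamp):
    m, s = timestamp.split(':')
    s, ss = s.split('.')
    m, s, ss = [int(t) for t in [m, s, ss]]
    return m*60*1000 + s*1000 + ss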

from pydub import AudioSegment
import re

clip_filepath = re.sub('/raw/','/clips/',filepath)
rnz_clips_dir = os.path.join(rnz_data_dir, 'clips')
os.makedirs(rnz_clips_dir, exist_ok=True)

start_ms = timestamp_to_ms(rnz_metadata.loc[1,'Start'])
end_ms = timestamp_to_ms(rnz_metadata.loc[1,'End'])

sound = AudioSegment.from_mp3(filepath)
sound[start_ms:end_ms]
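
The helper above expects timestamps with exactly one colon and a fractional part. If your own data is less regular, a more tolerant parser might look like the sketch below. The function `parse_timestamp_ms` is hypothetical; it assumes timestamps of the form `[h:]m:s[.fraction]` and treats the fraction as a decimal (so `.5` means 500 ms).

```python
def parse_timestamp_ms(timestamp: str) -> int:
    """Convert '[h:]m:s[.fraction]' to milliseconds."""
    parts = timestamp.strip().split(":")
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"Unrecognised timestamp: {timestamp!r}")
    seconds_part, _, fraction = parts[-1].partition(".")
    # Pad or truncate the fraction to three digits: '5' -> 500, '4567' -> 456
    ms = int(fraction.ljust(3, "0")[:3]) if fraction else 0
    minutes = int(parts[-2])
    hours = int(parts[0]) if len(parts) == 3 else 0
    return ((hours * 60 + minutes) * 60 + int(seconds_part)) * 1000 + ms

print(parse_timestamp_ms("01:23.456"))  # 83456
```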

We will save this trimmed clip in the directory `data/rnz/clips`. Let's see how our `tiny` model handles it...

In [None]:
sound[start_ms:end_ms].export(clip_filepath, format="mp3")

tiny_model.transcribe(clip_filepath, fp16=False)['text']

Well, that could have been better! It seems Whisper is confused about which language is being spoken, and has "split the difference" in some way. We can give it a hint by specifying that the language is English (`en`):

In [None]:
tiny_model.transcribe(clip_filepath, fp16=False, language='en')['text']

This is better, but some words are still causing issues. Let's try the API/`large` model for comparison. We will write a handy function to make these calls easier to construct going forward:

In [None]:
def transcribe_api(mp3_path):
    with open(mp3_path, "rb") as audiofile:
        transcript=openai.Audio.transcribe("whisper-1", audiofile, response_format='text', language='en')
    return transcript.strip()

In [None]:
transcribe_api(clip_filepath)

Much better! Now let's run these models over the entire dataset. I will be interested to see what you find among the results.

## Whisper accuracy

First, we will consolidate some of the steps above into a handy function that takes the supplied URL and start/end times, and downloads the MP3, trims the relevant selection and saves it in the right place.

In [None]:
def mp3_download_trim(row, raw_dir=rnz_raw_dir, clips_dir=rnz_clips_dir):
    basename = row.URL.split('/')[-1]
    raw_filepath = os.path.join(raw_dir, basename)
    # avoid downloading the same file more than once
    if not os.path.exists(raw_filepath):
        wget.download(row.URL, out=raw_dir)

    # include start/end in the clip's filename
    clip_filepath = os.path.join(clips_dir, f"{row.Start}_{row.End}_{basename}")
    if not os.path.exists(clip_filepath):
        start_ms = timestamp_to_ms(row.Start)
        end_ms = timestamp_to_ms(row.End)
        # load the file downloaded for this row, not the global `filepath`
        sound = AudioSegment.from_mp3(raw_filepath)
        sound[start_ms:end_ms].export(clip_filepath, format="mp3")
    return clip_filepath
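
One caveat with putting the start/end timestamps in the filename: they contain colons, which are not allowed in Windows filenames. If that affects you, the name could be sanitised first. A minimal sketch follows; the helper `safe_clip_name` and the example filename are made up for illustration.

```python
import re

def safe_clip_name(start: str, end: str, basename: str) -> str:
    """Build a clip filename, replacing unsafe characters in the timestamps with '-'."""
    def clean(part: str) -> str:
        # Keep only letters, digits, dots, underscores and hyphens
        return re.sub(r"[^0-9A-Za-z._-]", "-", part)
    return f"{clean(start)}_{clean(end)}_{basename}"

print(safe_clip_name("01:23.456", "01:45.000", "example-clip.mp3"))
# 01-23.456_01-45.000_example-clip.mp3
```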

We can then exploit the Pandas `apply` method to run this across our entire dataset.

In [None]:
rnz_metadata['Filepath']=rnz_metadata.apply(mp3_download_trim,axis=1)

The filepath to the relevant clip is now a new field in the data frame:

In [None]:
rnz_metadata['Filepath'][0]

Now let's use a similar `apply` call to transcribe the relevant clip, using our local/`tiny` model. Be patient - this could take a while. I have used a sample here to save testing time, but feel free to vary `N` below to suit your device.

In [None]:
# set seed
import numpy as np
np.random.seed(0)
N=3

rnz_metadata=rnz_metadata.sample(N)

In [None]:
rnz_metadata

In [None]:
rnz_metadata['Tiny transcription'] = rnz_metadata.Filepath.apply(lambda x: tiny_model.transcribe(x, fp16=False, language='en')['text'].strip())

And the same with the API/`large` model...

In [None]:
rnz_metadata['API transcription'] = rnz_metadata.Filepath.apply(transcribe_api)

In [None]:
rnz_metadata

Let's pick a random example and see how the transcriptions compare:

In [None]:
example = rnz_metadata.sample(1)
Audio(example.Filepath.tolist()[0])

In [None]:
example.Text.tolist()[0]

In [None]:
example['Tiny transcription'].tolist()[0]

In [None]:
example['API transcription'].tolist()[0]
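
Spotting differences between long transcriptions by eye is tedious. As a rough aid, the standard library's `difflib` can list the words that differ. The helper `word_diff` below is a hypothetical sketch, and the example sentences are invented rather than taken from the dataset.

```python
import difflib

def word_diff(reference: str, hypothesis: str):
    """List words that differ: '- word' is only in the reference, '+ word' only in the hypothesis."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    # ndiff also emits unchanged ('  ') and hint ('? ') lines, which we drop
    return [d for d in difflib.ndiff(ref_words, hyp_words) if d[0] in "+-"]

print(word_diff("kia ora and good morning", "key ora and good morning"))
```

In the notebook this could be called as `word_diff(example.Text.tolist()[0], example['Tiny transcription'].tolist()[0])`.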

For a more comprehensive comparison of accuracy, we need metrics. A common one for comparing transcriptions is the [Word Error Rate](https://en.wikipedia.org/wiki/Word_error_rate), or WER. This counts the word substitutions, insertions and deletions needed to turn the inferred transcription into the ground truth one, and divides that count by the number of words in the ground truth. Lower is better, and the score can exceed 1 if the inferred transcription contains many extra words.
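
To make the definition concrete, here is a from-scratch sketch of WER as a word-level edit distance divided by the length of the ground truth. It is for illustration only and is not how any particular library implements it; the function name `word_error_rate` is my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edits needed to turn the first j hypothesis words
    # into the reference words processed so far
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word missing (deletion)
                            curr[j - 1] + 1,      # extra hypothesis word (insertion)
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One missing word out of six reference words
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```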

We can compute this in Python using the `jiwer` package. Note that, to make the comparison fair, we may want to first "normalise" both texts so that unimportant differences, such as how numbers are rendered, do not influence the score.

In [None]:
from jiwer import wer

# a single example: jiwer expects the reference (ground truth) first, then the hypothesis
wer(example.Text.tolist()[0], example['Tiny transcription'].tolist()[0])

In [None]:
# over entire dataset
tiny_wer = wer(rnz_metadata.Text.tolist(), rnz_metadata['Tiny transcription'].tolist())
api_wer = wer(rnz_metadata.Text.tolist(), rnz_metadata['API transcription'].tolist())
print(f"WER (Tiny model): {tiny_wer}, WER (API model): {api_wer}")
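
The scores above are computed on the raw text, so differences in case and punctuation count as errors. A very simple normaliser, applied to both texts before scoring, might look like the sketch below. The function `normalise` is hypothetical and deliberately basic: it does not reconcile numbers written as digits versus words, for example.

```python
import re
import unicodedata

def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    # Compose characters so that macronised vowels (ā, ē, ...) are single letters
    text = unicodedata.normalize("NFC", text).lower()
    # \w is Unicode-aware for str patterns, so macronised letters are kept
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalise("Kia ora, whānau!  It's 7 o'clock."))
# kia ora whānau it's 7 o'clock
```

It could be applied with, for example, `rnz_metadata.Text.apply(normalise)` before calling `wer`.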

## Ethical considerations

After seeing that Whisper can (at least attempt to) transcribe and translate te reo Māori, you may be wondering: where/how did it learn to do that?

The paper claims that training data originates from [Fleurs](https://huggingface.co/datasets/google/fleurs), a multilingual collection of recordings of a common set of phrases, translated into 102 languages. The method Google use to select and approach the various speakers involved is somewhat opaque.

This raises many interesting ethical questions - several of which have probably already occurred to you!

Te Hiku Media have written an article which discusses these issues quite comprehensively, and which I highly recommend reading: [https://blog.papareo.nz/whisper-is-another-case-study-in-colonisation/](https://blog.papareo.nz/whisper-is-another-case-study-in-colonisation/) 