### Evaluate Whisper Medium on ATCO2

* this time we evaluate the vanilla model, with no fine-tuning on ATCO2
* we see that the WER is higher (worse), which demonstrates the importance of fine-tuning

In [1]:
from tqdm import tqdm
from transformers import pipeline
from datasets import load_from_disk

from evaluate import load

# these two files come from OpenAI Whisper and provide the text normalization
from basic import *
from english import *

In [2]:
# load the metric definition
wer = load("wer")

# apply the same text normalization rules as Whisper
normalizer = EnglishTextNormalizer()
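To make the metric concrete, here is a minimal pure-Python sketch of what WER measures: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is hypothetical and this is not the implementation behind `load("wer")`; it is only an illustration. The example pair is the first one from the comparison printed at the end of this notebook.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)


# illustrative pair taken from the comparison output of this notebook
reference = "tango 335 frequency change approved good bye good bye"
hypothesis = "tango 335 frequency change approved goodbye goodbye"

score = word_error_rate(reference, hypothesis)
print(f"{score:.2f}")  # 4 errors over 9 reference words -> 0.44
```

Note that a harmless spelling difference ("goodbye" vs "good bye") already costs four word errors here, which is why consistent normalization matters. The sketch scores a single pair; a corpus-level score is normally computed from the total errors over the total reference words, not as a mean of per-sentence scores.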

In [3]:
TASK = "automatic-speech-recognition"
MODEL_LABEL = "openai/whisper-medium"

HF_DIR = "atco2_hf"

In [4]:
# load the dataset from local disk
atco2_hf = load_from_disk(HF_DIR)

In [5]:
ds_test = atco2_hf["test"]

ds_test

Dataset({
    features: ['path', 'audio', 'sentence'],
    num_rows: 56
})

In [6]:
# define the pipeline and a utility function
pipe = pipeline(task=TASK, model=MODEL_LABEL)


def transcribe(audio):
    text = pipe(audio)["text"]

    return text

#### Loop over the whole test set and compute transcriptions

In [7]:
predicted = []
expected = []

# loop over the whole test set
for row in tqdm(ds_test):
    # to get the right WER we need to apply the same normalization rules as Whisper:
    # in the local HF dataset the text is NOT normalized
    expected.append(normalizer(row["sentence"]))

    text_predicted = transcribe(row["audio"])

    predicted.append(normalizer(text_predicted))

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 56/56 [05:28<00:00,  5.87s/it]


#### Compute WER

In [14]:
# strict normalization: both lists were already normalized in the loop above,
# so this second pass only guards against anything the first pass left behind
new_predicted = [normalizer(text) for text in predicted]
new_expected = [normalizer(text) for text in expected]

wer_score = wer.compute(predictions=new_predicted, references=new_expected)

print(f"WER computed on test set is {round(wer_score, 2)}.")

WER computed on test set is 0.64.


Without fine-tuning, the WER is poor: 0.64 corresponds to about 64 word errors for every 100 reference words.

ATCO2 is based on a very specialized domain language, with words that are uncommon in spoken English.

More importantly, these words are also uncommon in the datasets Whisper has been trained on.
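One rough way to see which domain words the model misses is to count the reference words that have no counterpart in the corresponding prediction. The sketch below is only a bag-of-words diagnostic (it ignores word order and alignment), and the helper name `missed_words` is hypothetical. The two pairs are copied from the comparison output of this notebook, in the same (predicted, expected) order.

```python
from collections import Counter

# (predicted, expected) pairs copied from the comparison output of this notebook
pairs = [
    ("c est la papa passion granello",
     "sierra alpha papa sion ground hello"),
    ("sydney tower corner 642 corner 642 sydney tower g day",
     "sydney tower qantas 642 qantas 642 sydney tower good day"),
]


def missed_words(pairs):
    """Count reference words that do not appear in the matching prediction."""
    missed = Counter()
    for predicted_text, expected_text in pairs:
        # Counter subtraction keeps only the positive counts, i.e. the words
        # that occur more often in the reference than in the prediction
        missed += Counter(expected_text.split()) - Counter(predicted_text.split())
    return missed


missed = missed_words(pairs)
print(missed.most_common(3))
```

On these two pairs the most frequently missed word is the callsign "qantas", followed by phonetic alphabet words such as "sierra" and "alpha". Run over the full `predicted` and `expected` lists, the same count gives a quick list of the vocabulary that fine-tuning would need to teach the model.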

In [15]:
for pair in zip(predicted, expected):
    print(f"{normalizer(pair[0])} ---  {normalizer(pair[1])}")

tango 335 frequency change approved goodbye goodbye ---  tango 335 frequency change approved good bye good bye
c est la papa passion granello ---  sierra alpha papa sion ground hello
continue approach runway 16 r 108.7791 ---  continue approach runway 16 right china eastern 77 niner one
telgo papa make a 180 stay clear of the runway taxi to holding point foxrot ---  hotel golf papa make a 180 stay clear of the runway taxi to holding point foxtrot
cqh air canada 7216 heavy with you we are established 6 miles 14 air canada 7216 tower good morning wind 35 30 degrees 3 knots runway 14 cleared to land ---  tower guten morgen air canada 7216 heavy with you we are established 6 miles 14 air canada 7216 tower good morning wind 350 degrees 3 knots runway 14 cleared to land
sydney tower corner 642 corner 642 sydney tower g day ---  sydney tower qantas 642 qantas 642 sydney tower good day
am 139 taxi via foxtrot cross runway 12 g juliet stand 14 a cross runway 12 g juliet stand 14 a 139 ---  emir