# Explore Whisper

In this notebook, we explore whether the model (and package) 'Whisper' can be used for Automatic Speech Recognition (ASR) as part of the mexca pipeline. Whisper detects the language of speech in an audio file and then transcribes the speech to text. It has been trained to detect 98 languages besides English and can also translate speech from other languages to English. See the [model card](https://github.com/openai/whisper/blob/main/model-card.md) and [paper](https://cdn.openai.com/papers/whisper.pdf) for more details.

In [35]:
import jiwer
import numpy as np
import librosa
import whisper
from datasets import load_dataset
from IPython.display import Audio
from whisper.normalizers import EnglishTextNormalizer

First, we load a data set to apply the model to. We choose a relatively difficult test case from the [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) corpus.

In [2]:
ami = load_dataset("ami", "headset-single", split=['test'])[0]

Found cached dataset ami (C:/Users/MalteLuken/.cache/huggingface/datasets/ami/headset-single/1.6.2/2accdf810f7c0585f78f4bcfa47684fbb980e35d29ecf126e6906dbecb872d9e)
100%|██████████| 1/1 [00:00<00:00, 77.69it/s]


In mexca, Whisper needs to transcribe audio segments that pyannote.audio has attributed to a single speaker. We therefore need to test how Whisper performs on shorter segments. Because it was trained on 30 s audio chunks, it should be suited to this task (most audio segments from pyannote are shorter than 30 s). We cut the audio of the AMI test file into speech segments.

In [3]:
SAMPLE_RATE = 16_000

def segment_audio(batch):
    new_batch = {
        "audio": [],
        "words": [],
        "speaker": [],
        "lengths": [],
        "word_start_times": [],
        "segment_start_times": [],
    }

    audio, _ = librosa.load(batch["file"], sr=SAMPLE_RATE)

    word_idx = 0
    num_words = len(batch["words"])
    for segment_idx in range(len(batch["segment_start_times"])):
        words = []
        word_start_times = []
        start_time = batch["segment_start_times"][segment_idx]
        end_time = batch["segment_end_times"][segment_idx]

        # go back and forth with word_idx since segments overlap with each other
        while (word_idx > 1) and (start_time < batch["word_end_times"][word_idx - 1]):
            word_idx -= 1

        while word_idx < num_words and (start_time > batch["word_start_times"][word_idx]):
            word_idx += 1

        new_batch["audio"].append(audio[int(start_time * SAMPLE_RATE): int(end_time * SAMPLE_RATE)])

        while word_idx < num_words and batch["word_start_times"][word_idx] < end_time:
            words.append(batch["words"][word_idx])
            word_start_times.append(batch["word_start_times"][word_idx])
            word_idx += 1

        new_batch["lengths"].append(end_time - start_time)
        new_batch["words"].append(words)
        new_batch["speaker"].append(batch["segment_speakers"][segment_idx])
        new_batch["word_start_times"].append(word_start_times)

        new_batch["segment_start_times"].append(batch["segment_start_times"][segment_idx])

    return new_batch

In [4]:
# Choose the first file of the test set
ami_segments = segment_audio(ami[0])
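
The slicing step inside `segment_audio` converts segment boundaries from seconds to sample indices. A minimal sketch on a synthetic signal; the boundaries below are made up for illustration and are not taken from the AMI annotations:

```python
import numpy as np

SAMPLE_RATE = 16_000

# Synthetic 5 s mono signal standing in for a loaded audio file
audio = np.zeros(5 * SAMPLE_RATE, dtype=np.float32)

# Hypothetical (start, end) segment boundaries in seconds; they may overlap
boundaries = [(0.5, 2.0), (1.5, 4.25)]

# Convert seconds to sample indices and slice, as in segment_audio
segments = [
    audio[int(start * SAMPLE_RATE): int(end * SAMPLE_RATE)]
    for start, end in boundaries
]

# Segment durations in seconds, recovered from the number of samples
durations = [len(seg) / SAMPLE_RATE for seg in segments]
print(durations)
```

The same calculation on the real segments gives a quick check that most of them are shorter than the 30 s window Whisper was trained on.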

Now, we load the pretrained 'small' Whisper model. This model has relatively few parameters, and larger models are available. The larger models perform better, although the difference between the small, medium, and large models is relatively small (see Appendix D of the paper). On CPU, inference time increases substantially with model size.

In [6]:
model = whisper.load_model("small")

Whisper can also provide timestamps for different speech segments. However, we won't need them in mexca because we get them from pyannote, which provides more precise timestamps.

In [7]:
options = whisper.DecodingOptions(
    without_timestamps=True, # we don't need timestamps
    fp16=False # only available on GPU
)

For every audio segment of the test data, we let Whisper pad or trim the input to 30 s, compute the log-Mel spectrogram that serves as the model input, and decode it. Finally, we add the transcribed text from the output to a list.
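
The pad-or-trim step can be sketched with NumPy alone. `pad_or_trim_sketch` is a hypothetical helper written for illustration, not the function from the whisper package; it assumes that short inputs are zero-padded at the end and long inputs are cut off after 30 s:

```python
import numpy as np

SAMPLE_RATE = 16_000
N_SAMPLES = 30 * SAMPLE_RATE  # number of samples in a 30 s window


def pad_or_trim_sketch(array, length=N_SAMPLES):
    """Trim to `length` samples or zero-pad at the end (illustrative only)."""
    if len(array) > length:
        return array[:length]
    return np.pad(array, (0, length - len(array)))


short = np.ones(2 * SAMPLE_RATE, dtype=np.float32)    # 2 s segment
long = np.ones(45 * SAMPLE_RATE, dtype=np.float32)    # 45 s segment

padded = pad_or_trim_sketch(short)
trimmed = pad_or_trim_sketch(long)
print(padded.shape, trimmed.shape)
```

Under these assumptions, both results have the same fixed length, so every segment yields a spectrogram of the same size regardless of its original duration.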

In [10]:
hypothesis = []

for i, segment in enumerate(ami_segments['audio']):
    print(i)  # progress indicator
    # pad or trim the segment to 30 s
    audio = whisper.pad_or_trim(segment.flatten())
    # compute the log-Mel spectrogram as model input
    mel = whisper.log_mel_spectrogram(audio)
    output = whisper.decode(model, mel, options)
    hypothesis.append(output.text)

To compare the transcriptions to the reference, we collate the words from the reference segments.

In [12]:
reference = []

for segment in ami_segments['words']:
    reference.append(' '.join(segment))

Let's look at a few examples.

In [16]:
def print_example(k):
    print('Whisper: ', hypothesis[k], '\n\n', 'Reference: ', reference[k])

In [32]:
print_example(5)

Whisper:  Am I supposed to be standing up there? 

 Reference:  Am I supposed to be standing up there ?


In [40]:
Audio(ami_segments['audio'][5], rate=SAMPLE_RATE)

First of all, Whisper recognizes the spoken language as English. Here, Whisper does a perfect job of transcribing the audio.

In [33]:
print_example(20)

Whisper:  We will do some stuff, get to know each other a bit better, feel more comfortable with each other. Then we'll go do tool training, talk about the project plan, discuss our own ideas and everything. And we've got 25 minutes to do that as far as I can understand. 

 Reference:  we will do some stuff get , to know each other a bit better to feel more comfortable with each other . Um then we'll go do tool training , talk about the project plan , discuss our own ideas and everything um and we've got twenty five minutes to do that , as far as I can understand .


In [41]:
Audio(ami_segments['audio'][20], rate=SAMPLE_RATE)

For this one, Whisper also provides a very accurate transcription. However, it omits filler words ('um') and breathing, and it produces a more coherent and grammatically sound text than what was actually said in the audio.

In [34]:
print_example(100)

Whisper:  Hopefully that will stay on, 200 version. 

 Reference:  Hopefully Hmm that'll . stay on , two-handed version Okay . Okay . , uh


In [42]:
Audio(ami_segments['audio'][100], rate=SAMPLE_RATE)

In this example, Whisper confuses 'two-handed' with '200' and omits the 'Okay's.

Now, we can check how good the transcription is for the entire test file. For both reference and transcription, we first remove punctuation from the text and normalize it using a normalizer for English from the whisper package.

In [43]:
normalizer = EnglishTextNormalizer()

In [62]:
hypothesis_norm = [normalizer(text) for text in hypothesis]
reference_norm = [normalizer(text) for text in reference]
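
As a rough illustration of what such a normalization does, here is a much-simplified stand-in. `simple_normalize` and its short filler list are assumptions made for this sketch; the actual `EnglishTextNormalizer` applies many more rules:

```python
import re

# Hypothetical, hand-picked filler list for illustration only
FILLERS = re.compile(r"\b(hmm|mm|mhm|mmm|uh|um)\b")


def simple_normalize(text):
    """Lower-case, drop fillers and punctuation, collapse whitespace."""
    text = text.lower()
    text = FILLERS.sub("", text)              # drop filler words
    text = re.sub(r"[^\w\s']", " ", text)     # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


print(repr(simple_normalize("Am I supposed to be standing up there ?")))
print(repr(simple_normalize("Mm-hmm .")))
```

In this sketch, a segment that consists only of fillers and punctuation, such as 'Mm-hmm .', is reduced to an empty string.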

Now we calculate the overall Word Error Rate (WER). However, the normalization leaves some segments with no text, which we first need to remove.

In [67]:
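# Keep only pairs with a non-empty reference. Filtering into new lists avoids
# skipping elements, which happens when popping from a list while iterating over it.
pairs = [(ref, hyp) for ref, hyp in zip(reference_norm, hypothesis_norm) if ref != '']
reference_norm = [ref for ref, _ in pairs]
hypothesis_norm = [hyp for _, hyp in pairs]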

In [68]:
jiwer.wer(
    reference_norm,
    hypothesis_norm
)

0.3684640522875817

This results in a WER of 0.37 for this test file. In Appendix D.1.1 of the paper, an average WER of 0.19 is reported for a similar version of the entire AMI test set (with a slightly different preprocessing).
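
For intuition, the WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below implements this standard definition with a word-level edit distance. It is not jiwer's implementation, and the toy strings are hand-written, loosely based on the third example above:

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer_sketch(references, hypotheses):
    """Total word errors divided by total number of reference words."""
    errors = sum(
        word_edit_distance(ref.split(), hyp.split())
        for ref, hyp in zip(references, hypotheses)
    )
    n_ref_words = sum(len(ref.split()) for ref in references)
    return errors / n_ref_words


# Hand-written toy strings (assumed to be already normalized)
refs = ["hopefully that will stay on two handed version okay okay"]
hyps = ["hopefully that will stay on 200 version"]
print(wer_sketch(refs, hyps))
```

For this toy pair, the sketch counts four errors (one substitution and three deletions) against ten reference words, which gives 0.4.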

It is also interesting to look at segments where Whisper transcribed text that does not occur in the reference. We can find such segments by looking for reference segments that are empty after normalization.

In [70]:
for i, ref in enumerate(reference):
    if normalizer(ref) == '':
        print('Reference: ', reference[i])
        print('Whisper: ', hypothesis[i])

Reference:  Hmm hmm hmm .
Whisper:  No.
Reference:  hmm hmm
Whisper:  I
Reference:  
Whisper:  Thank you.
Reference:  Mm .
Whisper:  Thank you.
Reference:  Mm-hmm .
Whisper:  Mm hmm.
Reference:  .
Whisper:  Okay.
Reference:  
Whisper:  Yeah.
Reference:  .
Whisper:  Good.
Reference:  
Whisper:  Thank you.
Reference:  .
Whisper:  Thank you.
Reference:  Uh Uh .
Whisper:  Uh...
Reference:  Uh .
Whisper:  Uh
Reference:  Mm-hmm .
Whisper:  Dear.
Reference:  
Whisper:  Thank you.
Reference:  Mm .
Whisper:  you
Reference:  Mm-hmm . . Um
Whisper:  Mm hmm.
Reference:  Mm-hmm .
Whisper:  Mm hmm.
Reference:  Mm-hmm .
Whisper:  Mm hmm.
Reference:  Mm-hmm .
Whisper:  obvious.
Reference:  Mm .
Whisper:  Hmm.
Reference:  Mm-hmm .
Whisper:  Mm hmm.
Reference:  Mm-hmm .
Whisper:  Mm hmm.
Reference:  Mm-hmm .
Whisper:  Mm hmm.
Reference:  Mm .
Whisper:  Mm.


It seems like Whisper occasionally adds filler words, 'Thank you.', or affirmative words ('Okay', 'Good', 'Yeah').
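
To quantify this impression, the inserted phrases can be tallied. A minimal sketch using the first ten Whisper outputs from the listing above, copied by hand; in the notebook, one would instead collect `hypothesis[i]` for every reference that is empty after normalization:

```python
from collections import Counter

# First ten Whisper outputs from the listing above (copied by hand)
outputs = [
    "No.", "I", "Thank you.", "Thank you.", "Mm hmm.",
    "Okay.", "Yeah.", "Good.", "Thank you.", "Thank you.",
]

# Lower-case and strip trailing periods so identical phrases are grouped
counts = Counter(text.strip().lower().rstrip(".") for text in outputs)
print(counts.most_common(3))
```

In this subset, 'thank you' is by far the most frequent insertion.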

## Conclusion

Overall, Whisper seems like a very good model for ASR on English audio. Even for a difficult test case such as the AMI corpus, it mostly provides quite accurate transcriptions for speech segments (see also the benchmarks in the paper). The model card claims that it provides strong results for at least 10 languages (including most European ones). Most importantly, it meets the requirements of mexca, i.e., the transcription can be applied to speech segments. The models are also not too large (the small model is 500 MB). Another big advantage is the automatic language detection: a single model can transcribe different languages without any user input. The transcriptions are more grammatically sound and coherent than the raw speech. In some cases, this can be a disadvantage, but for mexca it is actually a bonus: it enables sentiment analysis via dictionary look-up and will most likely also improve the predictions of the sentiment language model.