# ASR evaluation

Copyright CLTL, VU Amsterdam

Prepared by Lieke Gelderloos, April 2021



## Installing prerequisites

### SpeechRecognition

The Python package [SpeechRecognition](https://github.com/Uberi/speech_recognition) provides access to several different ASR systems. You can install it via the terminal using pip.

`pip install SpeechRecognition`

If you are using Anaconda, you can install it using conda.

`conda install -c conda-forge SpeechRecognition`

The package gives access to some of the major commercially available systems, which can be used at scale (you could use them to build ASR into an app, for example). For this reason, you usually have to sign up for an account (not necessarily a paid one: there are usually different options depending on what you want to use the system for and how intensively). There are two exceptions: Google Speech Recognition, for which a 'default' API key is included in the SpeechRecognition library, and CMU Sphinx, which is an offline system. Since the latter runs on the user's own machine, it requires downloading the models involved and a rather complicated installation process (if you want to give it a try, see the [information about PocketSphinx in the README of SpeechRecognition](https://github.com/Uberi/speech_recognition#pocketsphinx-python-for-sphinx-users)). For the purpose of this tutorial we will use Google Speech Recognition, as it should work out of the box. Note that it is subject to a 'fair use' policy, so if you want to use ASR more intensively (say, as input for an app), you will need to obtain your own API key.

### PyAudio

Since we are going to be working with our own speech, we need to be able to use the microphone to record it. To be able to use the microphone for input, install the package PyAudio using either pip or conda.

`pip install PyAudio`

`conda install -c conda-forge PyAudio`

### Testing if the abovementioned packages work

On the command line, run 

`python -m speech_recognition`

and say something when prompted.

### JiWER

This is a package we will use for evaluation.

`pip install jiwer`

It is not readily available through conda; either you will have to use `conda-build` or you can install it using pip.

## Import SpeechRecognition

In [1]:
import speech_recognition

## Recording some data

Let's record the following sentences (note, insert your (nick)name into the second sentence!)

In [2]:
test_sentences = ["Hello world!",
            "Hello, my name is ###.", # insert your (nick)name in place of the ###
            "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
            "Bilbo Baggins was a hobbit who lived in the Shire during the Third Age.",
            "Wreck a nice beach."]

We will loop through the sentences, prompt the user (you!) to say each one, and record 5 seconds of audio per sentence. We will store the results in `recordings`, a list of dictionaries with one dictionary per sentence. Each dictionary has an `'audio'` field and a `'ground_truth'` field.

In [3]:
rec = speech_recognition.Recognizer() # instantiate the Recognizer object

def record_sentences(sentences, rec):
    recorded = []
    for sentence in sentences:
        print(f"Please say:\n{sentence}")
        with speech_recognition.Microphone() as source:
            audio_data = rec.record(source, duration=5) # record 5 seconds of speech
        recorded.append({'audio': audio_data, 'ground_truth': sentence})
    print('Done recording!')
    return recorded

recordings = record_sentences(test_sentences, rec)

Please say:
Hello world!
Please say:
Hello, my name is ###.
Please say:
How much wood would a woodchuck chuck if a woodchuck could chuck wood?
Please say:
Bilbo Baggins was a hobbit who lived in the Shire during the Third Age.
Please say:
Wreck a nice beach.
Done recording!


## Transcribe speech

In [19]:
def transcribe_sentences(recordings):
    for sentence in recordings:
        try:
            transcription = rec.recognize_google(sentence['audio']) # use the google ASR to transcribe the recording
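        except (speech_recognition.UnknownValueError, speech_recognition.RequestError): # speech not understood, or the API could not be reached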
            print(f"transcription failed for {sentence['ground_truth']}")
            transcription = "" # if speech recognition fails, return an empty string instead of a transcription
        print(transcription)
        sentence['google_transcript'] = transcription # write the transcriptions to a transcription field in the dictionary
        
transcribe_sentences(recordings)

hello world
hello my name is Luka
how much wood would a woodchuck chuck if a woodchuck could chuck wood
Google meghan's Mother Hubbard who lived in a shower during the Third Age
recognise speech


How well did the transcription go? Do you notice any mistakes?

# Evaluation

## Word Error Rate
ASR systems are typically evaluated using the Word Error Rate (WER) between the ASR output and ground-truth transcriptions. To calculate WER, you count the number of words in the ASR output that are inserted, deleted, and substituted with respect to the ground truth transcription, and divide this by the total number of words in the ground truth transcription. 

$WER = \frac{S + D + I}{N}$

where:<br>
$S$ is the number of substituted words<br>
$D$ is the number of deleted words<br>
$I$ is the number of inserted words<br>
$N$ is the length of the ground truth in words<br>


You always count the minimum required edits; e.g. if a word is replaced by another word, you count that as one edit (a substitution), rather than as two edits (one insertion and one deletion). The lower the word error rate, the better; if there are no errors, WER will be 0.
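As a worked illustration, take the last test sentence as it was transcribed in the run shown earlier in this notebook: the ground truth `wreck a nice beach` ($N = 4$) came out as `recognise speech`. Assuming the alignment in which `wreck` and `beach` are substituted by `recognise` and `speech`, and `a` and `nice` are deleted, no alignment needs fewer edits, because the two sentences share no words:

$WER = \frac{S + D + I}{N} = \frac{2 + 2 + 0}{4} = 1$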

Note that it is possible for the WER to exceed 1. $S+D$ will never exceed $N$, since substitutions and deletions are tied to the words in the ground truth: you can only substitute or delete words that are actually in the sentence. $I$, however, is not tied to $N$, since the number of insertions does not depend on the number of words in the ground truth. Therefore, $S+D+I$ may exceed $N$, meaning $WER$ may exceed 1!
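To make this concrete, here is a minimal sketch that simply evaluates the formula from edit counts. The function name `wer_from_counts` and the example sentences are made up for illustration; they are not part of any library or of the recordings above.

```python
def wer_from_counts(substitutions, deletions, insertions, reference_length):
    """Evaluate WER = (S + D + I) / N from already-counted edits."""
    if reference_length <= 0:
        raise ValueError("the ground truth must contain at least one word")
    return (substitutions + deletions + insertions) / reference_length

# Hypothetical example: ground truth "hello world" (N = 2),
# ASR output "hello there my dear world" (three inserted words).
print(wer_from_counts(substitutions=0, deletions=0, insertions=3, reference_length=2))  # 1.5
```

Because the three insertions are divided by a ground truth of only two words, the result is above 1.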

### Word Error Rate by hand
For one of the test sentences in which the ASR made some mistakes, find the insertions, deletions and substitutions, and then calculate the WER. Ignore punctuation and capitalization.

In the unlikely event that ASR went perfectly for every sentence, work through the following example:

Ground truth: 
`Bilbo Baggins was a hobbit who lived in the Shire during the Third Age.` <br>
Transcription: 
`double bangs with a hobbit who lived in the shower during the Third Age.`

Insertions: <br>
Deletions: <br>
Substitutions: <br>

Number of words in the ground truth:
    
WER:

### Word Error Rate automatically
While the above calculation may look simple, that is because your brain has done the difficult work intuitively: identifying the minimum required insertions, deletions, and substitutions. Implementing this is not quite as straightforward, as it requires an alignment between the ground truth transcription and the recognized speech. If the ASR omits a word at the start of the sentence, but correctly identifies the rest of the words, we want to punish it only for omitting the first word, and not for the rest of the words, as we can align them to the ground truth. 

In order to calculate the WER, then, we need to find the optimal alignment between the word sequence in the ground truth and our ASR output. Perhaps you have heard about Levenshtein distance before, a metric to calculate the distance between two sequences. This is what the WER is based on: WER is essentially the word-level Levenshtein distance divided by the length of the ground truth sentence. It can be implemented in different ways (see the Wikipedia page on Levenshtein distance if you are interested in the algorithm).
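For intuition, the sketch below shows one common way to compute a word-level Levenshtein distance with dynamic programming, and a WER on top of it. It is an illustration only and not necessarily how JiWER implements it; the names `word_edit_distance` and `simple_wer` are made up here, and the inputs are assumed to be already lowercased and free of punctuation.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn the reference into the hypothesis."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # previous[j]: distance between the reference words seen so far
    # and the first j hypothesis words (initially: j insertions)
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)  # free if the words match
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

def simple_wer(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    reference_length = len(reference.split())
    if reference_length == 0:
        raise ValueError("the ground truth must contain at least one word")
    return word_edit_distance(reference, hypothesis) / reference_length

# Omitting only the first word costs a single deletion,
# because the remaining words can still be aligned.
print(word_edit_distance("how much wood", "much wood"))  # 1
print(simple_wer("wreck a nice beach", "recognise speech"))  # 1.0
```

Keeping only two rows of the table is a design choice that saves memory; it returns the distance but not the alignment itself, so it cannot tell you which words were inserted, deleted or substituted.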

We will use the function `wer()` from [JiWER](https://pypi.org/project/jiwer/) to calculate word error rate.

In [1]:
import jiwer
for sentence in recordings:
    print(jiwer.wer(sentence['ground_truth'], sentence['google_transcript']))


Are these the values you expected?

Perhaps not!

JiWER's `wer()` is case-sensitive, meaning that it punishes the ASR for transcribing 'Hello' as 'hello'. Also, `wer()` only strips out `.` and `,`; any other punctuation is left in. ASR systems are thus punished for not transcribing e.g. exclamation marks.
Let's clean the punctuation from both sentences before calculating WER, and also lowercase them so that our WER becomes case-insensitive. Note that this is a choice: perhaps you care whether your model differentiates between the word 'chase' and the given name 'Chase'; and perhaps you want your ASR to recognize question-intonation. You might make different choices then. 
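The effect of this cleaning can be seen on the first test sentence. The sketch below is plain Python and does not use JiWER; the helper names `normalize` and `count_mismatches` are made up for illustration, and the position-by-position comparison only makes sense here because both sentences have the same number of words.

```python
import string

def normalize(sentence):
    """Strip punctuation, lowercase, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table).lower().split()

def count_mismatches(reference_words, hypothesis_words):
    """Count positions where two equally long word lists differ."""
    return sum(r != h for r, h in zip(reference_words, hypothesis_words))

reference = "Hello world!"
hypothesis = "hello world"

raw = count_mismatches(reference.split(), hypothesis.split())
cleaned = count_mismatches(normalize(reference), normalize(hypothesis))
print(raw, cleaned)  # 2 0
```

Without cleaning, both words count as errors ('Hello' vs. 'hello', 'world!' vs. 'world'); after cleaning, none do. The same cleaning also makes 'Chase' and 'chase' identical, which is exactly the trade-off described above.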

As a side note, JiWER also contains tools to do these types of normalization before calculating WER; you can check the description of the different transformations in the [documentation](https://pypi.org/project/jiwer/) if you want to learn more.

In [8]:
import numpy as np
import string

def clean_sentence(sentence):
    # remove punctuation
    exclude = set(string.punctuation) # we want to remove all characters from the string.punctuation class
    clean_string = ''.join(ch for ch in sentence if ch not in exclude) # if the character is not in string.punctuation, we keep it
    # return lowercased
    return clean_string.lower()
    
def wer_clean(ground_truth, asr_result):
    wer = jiwer.wer(clean_sentence(ground_truth), clean_sentence(asr_result))
    return wer

def calculate_wers(recordings):
    wers = []
    for sentence in recordings:
        wer = wer_clean(sentence['ground_truth'], sentence['google_transcript'])
        print(wer)
        wers.append(wer)
    return wers
    
print("Average WER:", np.mean(calculate_wers(recordings)))

0.0
0.25
0.0
0.5
1.0
Average WER: 0.35


### Adversarial conditions

Let's make the ASR's task a bit more challenging. Think of some suboptimal conditions for ASR: e.g. sit next to a noisy fan; speak very fast or slow, or in a non-standard accent. Record the sentences again.

In [20]:
# Uncomment the next line to record the sentences again under the new conditions
#noisy_recordings = record_sentences(test_sentences, rec)
transcribe_sentences(noisy_recordings)

Heywood
transcription failed for Hello, my name is ###.

transcription failed for How much wood would a woodchuck chuck if a woodchuck could chuck wood?

women's onesie
speech


In [21]:
# evaluate
print("Average WER on noisy speech:", np.mean(calculate_wers(noisy_recordings)))

1.0
1.0
1.0
1.0
1.0
Average WER on noisy speech: 1.0


## Qualitative evaluation

This part of the assignment is meant to be done together with your group members.

A) Compare the average WER between the members of your group, in the clean and the adversarial conditions separately. Does the ASR perform equally well for all members of your group? If not, can you think of reasons why that might be? Consider different factors such as characteristics of your voices, accents, and recording conditions.

B) An HMM-based system consists of a phonetic model, a pronunciation lexicon, and a language model. Try to find a mistake that is likely attributable to each of these components, and explain how it might have occurred.

C) Try to find a phonetic error and a lexical error.