### Evaluate Word Error Rate (WER)

Imagine that you have an ASR model (for example, Whisper) and you want to evaluate its performance on an evaluation dataset (for example, LibriSpeech).
* Which metric are you going to use?
* What code can you use?

In this notebook we show how to compute **Word Error Rate (WER)** with the Hugging Face **Evaluate** library.

More details [Here](https://huggingface.co/spaces/evaluate-metric/wer)
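
For reference, WER is the word-level edit distance between the prediction and the reference, normalized by the length of the reference. The usual definition is:

$$
\mathrm{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C}
$$

where $S$, $D$ and $I$ are the numbers of substituted, deleted and inserted words, $C$ is the number of correct words, and $N = S + D + C$ is the number of words in the reference.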

In [1]:
# jiwer must be installed: the WER metric depends on it (pip install jiwer)
from evaluate import load

In [2]:
wer = load("wer")

In [3]:
wer

EvaluationModule(name: "wer", module_type: "metric", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Compute WER score of transcribed segments against references.

Args:
    references: List of references for each speech input.
    predictions: List of transcriptions to score.
    concatenate_texts (bool, default=False): Whether to concatenate all input texts or compute WER iteratively.

Returns:
    (float): the word error rate

Examples:

    >>> predictions = ["this is the prediction", "there is an other sample"]
    >>> references = ["this is the reference", "there is another one"]
    >>> wer = evaluate.load("wer")
    >>> wer_score = wer.compute(predictions=predictions, references=references)
    >>> print(wer_score)
    0.5
""", stored examples: 0)

#### the perfect case

In [4]:
# you need to create two lists of strings:
# predictions: the transcriptions produced by the model
# references: the expected transcriptions

preds = ["sentence1", "sentence2"]
expected = ["sentence1", "sentence2"]

wer_score = wer.compute(predictions=preds, references=expected)

# 0 is a perfect score and lower is better; WER can exceed 1 when predictions contain many extra words
wer_score

0.0

#### a bad case: every word is wrong

In [5]:
preds = ["sentence3", "sentence4"]
expected = ["sentence1", "sentence2"]

wer_score = wer.compute(predictions=preds, references=expected)

wer_score

1.0

As we can see, WER can indeed be a **punishing metric**. In the last example each sentence is a single word, so with just two wrong characters (one per sentence) every word counts as an error and we get WER = 1 (100%).