This notebook calculates the word error rate (WER) of our model outputs against the test set. As in HW 2, WER is computed as (S + D + I) / N, where S is the number of substitutions (a source word replaced by a different word in the output), D is the number of deletions (a source word missing from the output), I is the number of insertions (an output word with no corresponding source word), and N is the total number of words in the source.
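
As a worked example (the sentences are made up for illustration), take the source `the cat sat on the mat`, so N = 6, and the output `the black cat sit on mat`. The shortest alignment inserts `black`, substitutes `sit` for `sat`, and deletes the second `the`:

```latex
\mathrm{WER} = \frac{S + D + I}{N} = \frac{1 + 1 + 1}{6} = 0.5
```

Because insertions count toward the numerator, WER can exceed 1 when the output is much longer than the source.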

We import the libraries needed to read and process the input files.

In [None]:
from pathlib import Path
import re
from datasets import load_dataset


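The `wer` function calculates the WER between a source text and the output generated by one of our models. It splits both strings on whitespace, fills a word-level edit-distance table using dynamic programming, and then backtracks through the table to count each type of edit. It returns the WER, the number of substitutions (S), the number of deletions (D), and the number of insertions (I). If the source is empty, the WER is returned as infinity.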

In [None]:
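def wer(source, ours):
    """Return (WER, S, D, I) for a source string and a model output string.

    Tokens are whitespace-separated and compared exactly, so case and
    punctuation differences count as errors.
    """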

    # Tokenize
    r = source.split()
    h = ours.split()
    len_r = len(r)
    len_h = len(h)

    # DP matrix: rows = source, cols = model output (ours)
    dp = [[0] * (len_h + 1) for _ in range(len_r + 1)]

    # Operation matrix to help backtrack edits
    backtrace = [[None] * (len_h + 1) for _ in range(len_r + 1)]

    # Initialize boundaries
    for i in range(1, len_r + 1):
        dp[i][0] = i
        backtrace[i][0] = "D"

    for j in range(1, len_h + 1):
        dp[0][j] = j
        backtrace[0][j] = "I"

    # Fill DP table
    for i in range(1, len_r + 1):
        for j in range(1, len_h + 1):
            if r[i - 1] == h[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
                backtrace[i][j] = "OK"
            else:
                sub = dp[i - 1][j - 1] + 1
                ins = dp[i][j - 1] + 1
                dele = dp[i - 1][j] + 1

                dp[i][j] = min(sub, ins, dele)

                if dp[i][j] == sub:
                    backtrace[i][j] = "S"
                elif dp[i][j] == ins:
                    backtrace[i][j] = "I"
                else:
                    backtrace[i][j] = "D"

    # Backtrack to count S, D, I
    i, j = len_r, len_h
    S = D = I = 0

    while i > 0 or j > 0:
        op = backtrace[i][j]

        if op == "OK":
            i -= 1
            j -= 1
        elif op == "S":
            S += 1
            i -= 1
            j -= 1
        elif op == "D":
            D += 1
            i -= 1
        elif op == "I":
            I += 1
            j -= 1

    wer_value = (S + D + I) / len_r if len_r > 0 else float("inf")
    return wer_value, S, D, I
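
One way to sanity-check `wer` is to compare its `S + D + I` against an independent edit-distance computation. The sketch below is not part of the original notebook: `edit_distance` is a hypothetical helper that keeps only two rows of the table and returns the total number of edits, without the S/D/I breakdown. The example sentences are made up.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Minimum number of word substitutions, deletions, and insertions."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, start=1):
        curr = [i]  # i deletions turn i reference tokens into nothing
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
            ))
        prev = curr
    return prev[-1]


source = "the cat sat on the mat".split()
output = "the black cat sit on mat".split()
distance = edit_distance(source, output)
print(distance, distance / len(source))
```

For the same pair of strings, the `S + D + I` returned by `wer` should equal `distance`, although the split between S, D, and I can depend on how ties are broken during backtracking.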


The next two cells upload our data:

* `modern_test.tsv`: one column of original Modern English from our test set, and a second column of Modern English that one of our models generated by translating from Early Modern English.
* `shakespeare_test.tsv`: one column of original Early Modern English from our test set, and a second column of Early Modern English that one of our models generated by translating from Modern English.



In [None]:
from google.colab import files
uploaded = files.upload()

# file modern_test.tsv

Saving modern_test.tsv to modern_test.tsv


In [None]:
from google.colab import files
uploaded = files.upload()

# file shakespeare_test.tsv

Saving shakespeare_test.tsv to shakespeare_test.tsv


We then load each file into a `DatasetDict` with a single `full` split and two columns, `original` and `translated`.

In [None]:
shakes_testset = load_dataset(
    "csv",
    data_files={"full": str(Path("./shakespeare_test.tsv"))},
    delimiter="\t",
    column_names=["original", "translated"]
)

modern_testset = load_dataset(
    "csv",
    data_files={"full": str(Path("./modern_test.tsv"))},
    delimiter="\t",
    column_names=["original", "translated"]
)
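
Since both files are expected to have exactly two tab-separated fields per line, it can help to check that before scoring. The sketch below is an assumption-laden aid, not part of the original notebook: `malformed_rows` is a hypothetical helper built on the standard-library `csv` module, and it treats quote characters as ordinary text, which may differ from how `load_dataset` parses the same file.

```python
import csv
import io


def malformed_rows(lines, expected_fields=2):
    """Return (line_number, field_count) for rows with the wrong field count."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=1)
        if len(row) != expected_fields
    ]


# Made-up sample: the second line is missing its second column.
sample = io.StringIO("thou art kind\tyou are kind\nonly one column\n")
bad = malformed_rows(sample)
print(bad)  # [(2, 1)]
```

To check a real file, pass an open file object instead of the sample, for example `open("shakespeare_test.tsv", newline="", encoding="utf-8")`.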


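This function aggregates results over an entire test set by applying the `wer` function to each pair of rows. It returns the mean of the per-sentence WER values together with the total S, D, and I counts. Note that a single row with an empty source makes the mean infinite, because `wer` returns infinity for that row.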

In [None]:
def error_calc(testset):
    error = 0
    subs = 0
    dels = 0
    ins = 0
    for row in testset["full"]:
        original = str(row["original"])
        translated = str(row["translated"])
        w, S, D, I = wer(original, translated)
        subs += S
        dels += D
        ins += I
        error += w
    average_wer = error / len(testset["full"])
    return average_wer, subs, dels, ins
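
`error_calc` averages the per-sentence WER values, so a short sentence weighs as much as a long one. A common alternative is the corpus-level WER, which divides the total number of edits by the total number of source words. The sketch below contrasts the two on made-up counts. `macro_and_micro_wer` is a hypothetical helper, and it assumes each row also carries N, the source length, which `wer` does not currently return.

```python
def macro_and_micro_wer(counts):
    """Compare two ways of aggregating WER.

    counts: iterable of (S, D, I, N) tuples, one per sentence pair, with N > 0.
    Returns (mean of per-sentence WER, total edits / total source words).
    """
    counts = list(counts)
    macro = sum((s + d + i) / n for s, d, i, n in counts) / len(counts)
    total_edits = sum(s + d + i for s, d, i, _ in counts)
    total_words = sum(n for _, _, _, n in counts)
    micro = total_edits / total_words
    return macro, micro


# Made-up example: a 2-word sentence with 1 error, a 10-word sentence with 3.
counts = [(1, 0, 0, 2), (2, 1, 0, 10)]
macro, micro = macro_and_micro_wer(counts)
print(round(macro, 3), round(micro, 3))
```

Here the per-sentence mean is (0.5 + 0.3) / 2, while the corpus-level value is 4 / 12, so the two can differ noticeably when sentence lengths vary.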

Finally, the next two cells apply the WER calculations to the two translation directions: Modern English to Early Modern English (`shakes_testset`) and Early Modern English to Modern English (`modern_testset`).

In [None]:
w, S, D, I = error_calc(shakes_testset)

print("WER:", w)
print("Substitutions:", S)
print("Deletions:", D)
print("Insertions:", I)

WER: 0.6145503711925923
Substitutions: 8834
Deletions: 1509
Insertions: 1967


In [None]:
w, S, D, I = error_calc(modern_testset)

print("WER:", w)
print("Substitutions:", S)
print("Deletions:", D)
print("Insertions:", I)

WER: 0.5694496803509015
Substitutions: 8838
Deletions: 2780
Insertions: 987
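
To compare the two directions, it may help to express each error type as a share of the total edits. The sketch below copies the counts printed above; `error_shares` and the dictionary labels are hypothetical names introduced for illustration.

```python
# Counts copied from the two output cells above.
results = {
    "modern -> early modern": {"S": 8834, "D": 1509, "I": 1967},
    "early modern -> modern": {"S": 8838, "D": 2780, "I": 987},
}


def error_shares(counts):
    """Return each error type's fraction of the total edit count."""
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


for direction, counts in results.items():
    shares = error_shares(counts)
    print(direction, {kind: round(share, 3) for kind, share in shares.items()})
```

Substitutions dominate in both directions. Insertions outnumber deletions when translating into Early Modern English, and deletions outnumber insertions when translating into Modern English.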
