# Calculation of Word Error Rate for Mangled Segment Corrections

One aspect worth studying in more detail is how well the OCR text corrections match the text that is visible in the original document. To measure this, we highlighted the mangled segments in the original PDF and recorded the corrected segments in an accompanying CSV file, which we use to calculate the word error rate (WER). To get an idea of the types of mistakes made, we classify each mistake as either an OCR error (confusing l for 1, l for i, etc.) or a hallucination error, where a word is changed into a different valid Dutch word.

## Loading in our annotated data

We have detected the mangled segments in 10 documents and manually evaluated the corrections made by three systems: Tesseract only, ChatGPT only, and the combination of Tesseract and ChatGPT. We will load in our dataframes and show some basic statistics of the dataset.

In [30]:
import os
import pandas as pd
from glob import glob

def collect_annotated_documents(annotation_root_folder: str):
    """
    Read every annotation CSV in a folder and concatenate them into a single dataframe.
    """
    annotation_files = sorted(glob(os.path.join(annotation_root_folder, '*.csv')))
    # Missing cells are treated as zero counts.
    annotation_dataframe = pd.concat([pd.read_csv(csv_file, sep=';') for csv_file in annotation_files]).fillna(0)

    # Columns that contained missing values are read as floats, so cast the counts back to integers.
    annotation_dataframe['Number of Hallucinations'] = annotation_dataframe['Number of Hallucinations'].astype(int)
    annotation_dataframe['Number of OCR errors'] = annotation_dataframe['Number of OCR errors'].astype(int)

    return annotation_dataframe

Next up we load in the dataframes with annotations for all three corrections.

In [31]:
tesseract_dataframe = collect_annotated_documents('../data/WER_annotation_data/tesseract/')

In [32]:
tesseract_dataframe.head()

| Unnamed: 0 | text | Number of Tokens | Number of Mistakes | Number of Hallucinations | Number of OCR errors |
|---|---|---|---|---|---|
| 0 | ministerie\n | 1 | 0 | 0 | 0 |
| 1 | Voedselkwaliteit (hierna: LNV) heeft umet een ... | 18 | 2 | 0 | 2 |
| 2 | krijgen\n | 1 | 0 | 0 | 0 |
| 3 | adviezen, memo’s en andere documenten \n | 5 | 0 | 0 | 0 |
| 4 | verzoek (ongevraagd) verkregen, uitgedaan en/o... | 22 | 0 | 0 | 0 |


In [33]:
chatgpt_dataframe = collect_annotated_documents('../data/WER_annotation_data/chatgpt')

In [34]:
combination_dataframe = collect_annotated_documents('../data/WER_annotation_data/tesseract+chatgpt')
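
Since every mistake is meant to be labelled as either an OCR error or a hallucination, one check that could be run on each loaded dataframe is whether the two category counts add up to the mistake count in every row. The helper below is a hypothetical addition, not part of the original analysis, and it assumes the two categories are exhaustive and mutually exclusive. The example rows are made up for illustration.

```python
import pandas as pd

def find_inconsistent_rows(annotation_dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return rows where hallucinations + OCR errors differ from the mistake count."""
    category_total = (
        annotation_dataframe["Number of Hallucinations"]
        + annotation_dataframe["Number of OCR errors"]
    )
    return annotation_dataframe[category_total != annotation_dataframe["Number of Mistakes"]]

# Made-up rows for illustration only; the second row is deliberately inconsistent.
toy_dataframe = pd.DataFrame({
    "text": ["voorbeeld een", "voorbeeld twee"],
    "Number of Tokens": [18, 5],
    "Number of Mistakes": [2, 1],
    "Number of Hallucinations": [0, 0],
    "Number of OCR errors": [2, 0],
})

inconsistent_rows = find_inconsistent_rows(toy_dataframe)
print(inconsistent_rows)
```

An empty result for `tesseract_dataframe`, `chatgpt_dataframe` and `combination_dataframe` would indicate that the per-category counts can be compared directly against the mistake totals.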

Now let's look at the total size of the dataset.

In [35]:
print("The annotated dataset contains %d mangled segments, totalling %d mangled tokens" % (tesseract_dataframe.shape[0], tesseract_dataframe['Number of Tokens'].sum()))

The annotated dataset contains 227 mangled segments, totalling 672 mangled tokens


## Calculating WER for the different corrections

Now that we have these dataframes, we can calculate the WER by dividing the total number of mistakes by the total number of tokens. We print this for all three correction strategies.
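
Written out, the pooled calculation described above looks as follows. The symbols are introduced here for illustration and do not appear in the annotation files: $m_i$ is the `Number of Mistakes` and $n_i$ the `Number of Tokens` of segment $i$, summed over all $S$ annotated segments.

```latex
\mathrm{WER} = \frac{\sum_{i=1}^{S} m_i}{\sum_{i=1}^{S} n_i}
```

Because the sums are taken over all segments before dividing, long segments weigh more heavily than short ones. This is not the same as averaging per-segment error rates.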

In [36]:
def calculate_WER(annotation_dataframe: pd.DataFrame):
    # In our case, WER is simply the number of mistakes divided by the number of tokens.
    # Note that in principle this score can exceed 1.
    return annotation_dataframe['Number of Mistakes'].sum() / annotation_dataframe['Number of Tokens'].sum()

In [40]:
# Only chatgpt
calculate_WER(chatgpt_dataframe)

0.10714285714285714

In [41]:
# Only tesseract
calculate_WER(tesseract_dataframe)

0.06547619047619048

In [43]:
# The combination of both
calculate_WER(combination_dataframe)

0.02976190476190476

Comparing the correction strategies, ChatGPT on its own has the highest word error rate (about 10.7%), Tesseract on its own comes second (about 6.5%), and the combination of Tesseract and ChatGPT has the lowest WER (about 3.0%).
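
As a quick sanity check, the absolute number of mistakes behind each rate can be recovered from the printed WER values and the 672-token total reported earlier. The sketch below hard-codes those printed numbers and assumes all three strategies were scored on the same 672 tokens. The dictionary and variable names are illustrative.

```python
# Values copied from the cell outputs above.
TOTAL_TOKENS = 672  # assumption: the same token total applies to all three strategies

reported_wer = {
    "chatgpt": 0.10714285714285714,
    "tesseract": 0.06547619047619048,
    "tesseract+chatgpt": 0.02976190476190476,
}

# WER = mistakes / tokens, so mistakes = WER * tokens (rounded to absorb float noise).
mistake_counts = {name: round(wer * TOTAL_TOKENS) for name, wer in reported_wer.items()}

for name, count in mistake_counts.items():
    print(f"{name}: {count} mistakes on {TOTAL_TOKENS} tokens")
```

Under that assumption this gives 72, 44 and 20 mistakes respectively, which is consistent with the hallucination shares reported in the next section (36 of 72 and 12 of 20).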

## Number of Hallucinations

To investigate the differences in word error rate between the strategies, we look at the number and percentage of hallucinations versus OCR mistakes for each of the three strategies.

In [44]:
# First for Tesseract
tesseract_dataframe['Number of Hallucinations'].sum().astype(int)

0

In [45]:
# Now for ChatGPT
chatgpt_dataframe['Number of Hallucinations'].sum().astype(int)

36

In [47]:
# Also print the share of ChatGPT mistakes that are hallucinations
chatgpt_dataframe['Number of Hallucinations'].sum().astype(int) / chatgpt_dataframe['Number of Mistakes'].sum()

0.5

In [46]:
# And now for the combination
combination_dataframe['Number of Hallucinations'].sum().astype(int)

12

In [48]:
# And now also the share of hallucinations for the combination
combination_dataframe['Number of Hallucinations'].sum().astype(int) / combination_dataframe['Number of Mistakes'].sum()

0.6

We can see that Tesseract only makes OCR-type errors, while ChatGPT's mistakes are split evenly between hallucinations and OCR mistakes. Although the share of errors that are hallucinations goes up in the combination (from 50% to 60%), the absolute number drops drastically, from 36 to 12: a reduction of two thirds.
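
The arithmetic behind this comparison can be reproduced from the printed counts alone. The sketch below uses exact fractions. The mistake totals are not printed directly in the cells above but are implied by the printed shares (36 / 0.5 = 72 and 12 / 0.6 = 20). The OCR error counts are derived under the assumption that every mistake is either a hallucination or an OCR error.

```python
from fractions import Fraction

# Hallucination counts as printed above; mistake totals implied by the printed shares.
hallucinations = {"chatgpt": 36, "tesseract+chatgpt": 12}
mistakes = {"chatgpt": 72, "tesseract+chatgpt": 20}

# Share of mistakes that are hallucinations, per strategy.
share = {name: Fraction(hallucinations[name], mistakes[name]) for name in hallucinations}

# Relative drop in hallucinations when going from ChatGPT alone to the combination.
relative_reduction = Fraction(
    hallucinations["chatgpt"] - hallucinations["tesseract+chatgpt"],
    hallucinations["chatgpt"],
)

# Assumption: mistakes = hallucinations + OCR errors, so the remainder is OCR errors.
ocr_errors = {name: mistakes[name] - hallucinations[name] for name in hallucinations}

print(share, relative_reduction, ocr_errors)
```

This confirms the shares of 1/2 and 3/5 and the reduction of 2/3. Under the stated assumption, the remaining OCR errors also fall, from 36 for ChatGPT alone to 8 for the combination.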