## 1. Part: OCR
Import `Image` from the PIL library to load images, and import the pytesseract library. Don't forget to specify the path to your Tesseract executable.

In [None]:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Users/venglaro/Tesseract-OCR/tesseract.exe'

Load the image you want to OCR. We'll start with the Little Prince image, which is in Czech:

In [None]:
img = Image.open('C:/Users/venglaro/Desktop/OCR_teaching/images/maly_princ.jpeg')

Specify which model you want to use and run the recognition.

Hint: Not sure which model to use? At https://tesseract-ocr.github.io/tessdoc/Data-Files you'll find the list of languages and their Tesseract models.

Hint 2: Getting an error while OCRing? Make sure that you have downloaded the language model you want to use from https://github.com/tesseract-ocr/tessdata and moved it into your Tesseract-OCR/tessdata folder. You only need to do this once per model.

In [None]:
ocr_model = 'ces'
text = pytesseract.image_to_string(img, lang=ocr_model)
print(text)

## 2. Part: How good is my result?

Load the file containing the ground truth.

In [None]:
ground_truth_path = 'C:/Users/venglaro/Desktop/OCR_teaching/transcription/maly_princ.txt'
with open(ground_truth_path, 'r', encoding='utf-8') as file:
    ground_truth = file.read()

Import the Levenshtein module (if not installed: pip install Levenshtein) and calculate the distance.

Brief recap: Levenshtein distance = the minimum number of single-character edit operations (insert, delete, replace) needed to turn one string into the other.
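
To make the recap concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation. It is for illustration only: the function name `levenshtein_distance` is made up for this example, and it is not how the Levenshtein package is implemented internally.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and replaces turning a into b."""
    # previous[j] = distance between the already processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # keep (cost 0) or replace (cost 1)
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("princ", "prinz"))     # 1 (one replacement)
print(levenshtein_distance("kitten", "sitting"))  # 3
```

For real texts, prefer the Levenshtein package used in this notebook, since a plain Python loop like this gets slow on long strings.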

Note: Does OCR bore you? Invest some effort in these evaluation metrics anyway! Levenshtein distance, CER and WER are useful in many other applications, for example to evaluate speech-to-text results or to detect plagiarism through text similarity.

In [None]:
import Levenshtein

distance = Levenshtein.distance(text, ground_truth)
print(distance)

What does this number say about the similarity of the texts? Is it a lot or a little? Is every second character misrecognized, or every hundredth? Let's calculate the text similarity by taking the length of the texts into account.

In [None]:
def count_text_similarity(distance, text1, text2):
    if max(len(text1), len(text2)) > 0:
        similarity = 1 - (distance / max(len(text1), len(text2)))
    else:
        similarity = None
    return similarity

similarity = count_text_similarity(distance, text, ground_truth)
print(similarity)

Besides Levenshtein distance, there are other metrics such as Character Error Rate (CER; the share of misrecognized characters, the lower the better) and Word Error Rate (WER; the same on the word level).
Both are based on Levenshtein distance and are defined as (S + D + I) / N, where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the number of characters (CER) or words (WER) in the ground truth. Because N comes from the ground truth only, both metrics can exceed 1 when the OCR output contains many insertions.
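
The formula can be sketched in a few lines of plain Python. This is a simplified illustration, not the jiwer implementation: the helper names `edit_distance`, `simple_cer` and `simple_wer` are made up for this example, words are split on whitespace only, and no text normalization is applied.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def simple_cer(reference: str, hypothesis: str) -> float:
    """(S + D + I) / N on the character level, N = characters in the reference."""
    if not reference:
        raise ValueError("The reference must not be empty.")
    return edit_distance(reference, hypothesis) / len(reference)


def simple_wer(reference: str, hypothesis: str) -> float:
    """(S + D + I) / N on the word level, N = words in the reference."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("The reference must contain at least one word.")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


reference = "the little prince"
hypothesis = "the litle prince"  # one character is missing
print(simple_cer(reference, hypothesis))  # 1 error / 17 characters
print(simple_wer(reference, hypothesis))  # 1 wrong word / 3 words
```

A single missing character costs about 6 % CER here, but a full third of the WER, because the whole word counts as wrong.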

Luckily, we only need to understand the concepts: the metrics have already been implemented by other people and are available in Python packages.

If not installed: **pip install jiwer**

Note: How do I choose an appropriate metric?
It depends on your use case! Generally, CER or Levenshtein distance fits well when OCRing running text; WER fits cases such as insurance numbers, where each token must be correct as a whole.

In [None]:
import jiwer

# jiwer expects the reference (ground truth) first, then the hypothesis (OCR output)
wer = jiwer.wer(ground_truth, text)
print(f'WER: {wer}')

In [None]:
cer = jiwer.cer(ground_truth, text)  # reference first, then hypothesis
print(f'CER: {cer}')

## 3. Part: Post-processing

Post-processing (or post-correction) serves to improve the OCR result. There are many ways to do that, from rule-based or dictionary-based approaches to BERT and other machine-learning models. The right approach depends on the type of errors, the OCR quality, the amount of data, and the availability of pre-trained post-correction models, or of training data if we want to train a post-correction model ourselves.

In our OCRed text, one source of errors is the additional empty lines. Replacing multiple consecutive newlines with a single one could improve our results. (By the way, this is one of the rule-based approaches.)

Did the distance/similarity change after this step?

In [None]:
def filter_empty_lines(input_string):
    while '\n\n' in input_string:
        input_string = input_string.replace('\n\n', '\n')
    return input_string

filtered_text = filter_empty_lines(text)

distance = Levenshtein.distance(filtered_text, ground_truth)
similarity = count_text_similarity(distance, filtered_text, ground_truth)
print(f'New distance is: {distance}, text similarity is now {similarity}.')

# Reference (ground truth) first, then hypothesis (OCR output)
wer = jiwer.wer(ground_truth, filtered_text)
cer = jiwer.cer(ground_truth, filtered_text)
print(f'New WER is: {wer}, new CER is: {cer}.')
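
The rule above only catches newlines that follow each other directly. A possible variant, sketched here with a regular expression, also removes lines that contain nothing but spaces or tabs. Whether your OCR output contains such lines is an assumption worth checking, and the function name `collapse_blank_lines` is made up for this example.

```python
import re


def collapse_blank_lines(input_string: str) -> str:
    """Replace a newline followed by one or more empty or whitespace-only lines with one newline."""
    return re.sub(r'\n(?:[ \t]*\n)+', '\n', input_string)


sample = "first line\n\n\nsecond line\n \t\nthird line"
print(collapse_blank_lines(sample))
```

Indentation at the start of a non-empty line is left untouched, because the pattern only consumes whitespace that is followed by another newline.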

## 4. Part: Your turn 😊

1. Load the ibn_18640702_010.jpg image:

2. Define (and download) an appropriate model and run the OCR.

Hint: The data is in German and the typeface is Fraktur. There are several models; feel free to use any of them.

3. Load the ground truth file (in the transcriptions folder, named ONB_ibn_18640702_010.txt).

4. Select one or several evaluation metrics and find out how good your OCR result is.

## 5. Part: Post-correction with a ByT5-based model

Depending on the model you used, the OCR errors will differ. What they probably have in common is that they are too complex to fix with hand-crafted rules. For such cases, machine-learning approaches can be used.

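We will use a pre-trained model fine-tuned on our data, available at https://huggingface.co/Var3n/hmByT5_anno. To use Hugging Face models, you need to install the transformers library first. The code below requests PyTorch tensors (`return_tensors="pt"`), so PyTorch must be installed as well.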

pip install transformers

First, we import and define the tokenizer and the model we want to use. The first time you run this, the model will be downloaded. On every later run, it will only be loaded from the local cache.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Var3n/hmByT5_anno")
model = AutoModelForSeq2SeqLM.from_pretrained("Var3n/hmByT5_anno")

One issue is that the data needs to be divided into batches with a maximum length so that the model can process them. For our model, the limit is 350 characters (see the documentation). As our text is already separated by newlines, it is convenient to use these lines as inputs. At the end, we check that the longest line is still shorter than the maximum length.

In [None]:
import re

def split_by_new_line(text):
    tokens = re.split("\\n", text)
    tokens = list(filter(None, tokens))
    return tokens

lines = split_by_new_line(text)
print(f'The maximum length of a line is {max(len(line) for line in lines)}.')
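
If a line turns out to be longer than the limit, it has to be cut into smaller pieces first. The following is one possible sketch, assuming the limit is counted in characters as stated above. The helper `split_long_line` and its `max_chars` parameter are made up for this example and are not part of the model's documentation.

```python
def split_long_line(line: str, max_chars: int = 350) -> list:
    """Cut a line into chunks of at most max_chars characters, preferably at spaces."""
    chunks = []
    remaining = line
    while len(remaining) > max_chars:
        # Look for the last space that still keeps the chunk within the limit
        cut = remaining.rfind(' ', 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no usable space found: cut in the middle of the word
        chunks.append(remaining[:cut])
        remaining = remaining[cut:].lstrip(' ')
    if remaining:
        chunks.append(remaining)
    return chunks


# Tiny limit so the effect is visible
print(split_long_line("aaa bbb ccc", max_chars=7))  # ['aaa bbb', 'ccc']
```

Applied to the notebook, something like `[chunk for line in lines for chunk in split_long_line(line)]` would give a list in which no element exceeds the limit. Keep in mind that cutting in the middle of a sentence removes context the model might have used.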

We will process the lines one by one. The tokenizer turns the text into numbers (token IDs) that the model can process. At the end, we decode the model output back into text.

In [None]:
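result_postprocessed = []

for line in lines:
    # Encode the line as token IDs in a PyTorch tensor
    inp = tokenizer(line, return_tensors="pt").input_ids
    # Beam search combined with sampling (do_sample=True): the output can differ between runs
    output = model.generate(inp, max_new_tokens=len(inp[0]), num_beams=4, do_sample=True)
    # Decode the generated sequence back into text, dropping special tokens
    postprocessed_text = tokenizer.decode(output[0], skip_special_tokens=True)
    result_postprocessed.append(postprocessed_text)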

postprocessed = '\n'.join(result_postprocessed)
print(postprocessed)

When we look at the postprocessed text and the ground truth, we can see some characters we don't want, namely the long s ('ſ') instead of 's'. Let's replace them to obtain results more similar to our ground truth.

In [None]:
postprocessed = postprocessed.replace('ſ', 's')

Let's compare the metrics for the text before and after the post-processing:

In [None]:
distance = Levenshtein.distance(text, ground_truth)
similarity = count_text_similarity(distance, text, ground_truth)
# Reference (ground truth) first, then hypothesis (OCR output)
wer = jiwer.wer(ground_truth, text)
cer = jiwer.cer(ground_truth, text)

print('Results for the raw OCR:')
print(f'Distance is: {distance}, text similarity is {similarity}.')
print(f'The WER is: {wer}, CER is: {cer}.\n')

distance = Levenshtein.distance(postprocessed, ground_truth)
similarity = count_text_similarity(distance, postprocessed, ground_truth)
# Reference (ground truth) first, then hypothesis (post-corrected text)
wer = jiwer.wer(ground_truth, postprocessed)
cer = jiwer.cer(ground_truth, postprocessed)

print('Results for the post-corrected text:')
print(f'New distance is: {distance}, text similarity is now {similarity}.')
print(f'New WER is: {wer}, new CER is: {cer}.')

Note: Your results can actually be worse after post-processing. In that case, it's a good idea to compare the types of errors in the original OCRed text and in the post-processed text, and decide which is better for your use case.
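
One way to look at the types of errors is to list the differing spans between the ground truth and a text. Below is a minimal sketch using `difflib` from the Python standard library. The function name `list_character_errors` is made up for this example. Note also that `difflib` aligns the texts with its own heuristic, so the listed operations are not guaranteed to match the minimal edit script behind the Levenshtein distance.

```python
import difflib
from collections import Counter


def list_character_errors(reference: str, hypothesis: str) -> Counter:
    """Count the differing spans as (operation, reference span, hypothesis span)."""
    matcher = difflib.SequenceMatcher(None, reference, hypothesis, autojunk=False)
    errors = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':  # tag is 'replace', 'delete' or 'insert'
            errors[(tag, reference[i1:i2], hypothesis[j1:j2])] += 1
    return errors


# Toy strings for illustration, not taken from the workshop data
errors = list_character_errors("Haus und Hof", "Hans nnd Hof")
for (operation, expected, found), count in errors.most_common():
    print(f'{count}x {operation}: {expected!r} -> {found!r}')
```

Running this once on the raw OCR text and once on the post-corrected text shows which confusions the model fixed and which new ones it introduced.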

There are alternatives to Tesseract, e.g. EasyOCR (https://github.com/JaidedAI/EasyOCR). The choice depends on your application, the available language models, and personal preference.