# Improving OCR with GenAI

As the data basis, I used a page from the [Aberdeen Bestiary (MS 24)](https://www.abdn.ac.uk/bestiary/), which we had looked at more closely in the seminar. Folio 8r serves as the worked example, and folio 8v is the page used for a small experiment.

Three OCR approaches are compared below:
- GPT-4o as the AI-based method
- Transkribus as the traditional OCR method, using the model [Mittelalterliche_Schriften_M2.4](https://readcoop.eu/de/modelle/charter-scripts-german-latin-french/)
- Combination: GPT-4o with the Transkribus result supplied as additional input


### Example page 8r:
<img src="../data/folio8r_ground_truth/folio8r.png" alt="Folio 8r" width="300"/>

Example of structured output with GPT-4o:
```
{
    "transcription": {
        "text": "inde per singulos numerum decoquunt annis in sequentibus.\ Et postremo cum ad unum pervenerint, materna fecunditas\ reciditur, sterilescunt in eternum. Leo cibum fastidit hester\num, et ipsas sue esce reliquias aversatur. Que autem ei se cire fere\ audeat; cuius voci tantus naturaliter inest terror ut multa animan\tium que per celeritatem possunt evadere eius impetum, rugitus eius\ sonitum velud quadam vi attonita atque victa deficiant?\ Leo eger simiam querit ut devoret, quo possit sanari. Leo \ gallum et maxime album veretur. Leo quidem rex ferarum,\ exiguo scorpionis aculeo exagitatur, et veneno serpentis\ occiditur. Leontophones vocari accipimus modicas bestias.\ Que capte exuruntur ut earum cineres [A: cineris] aspergine carnes pol\lute iacteque carnes pita [A:per compita] concurrentium semitarum leones ne\cent, si quantulumcumque ex illis sumpserint. Propterea leones\ naturali eas primunt odio atque ubi facultas data est morsu\ quidem abstinent, sed dilaniatas exanimant pedum nisibus.\ Tigris vocata propter volucrem fugam ita eum nominant\ perse greci et medi sagittam. Est enim bestia variis\"
    },
    "translation": {
        "text": "In the years which follow, they reduce the number by one at a time. Afterwards, when they are down to one cub, the fertility of the mother is diminished; they become sterile for ever. The lion disdains to eat the Previous day's meat and turns away from the remains of its own meal. Which beast dares to rouse the lion, whose voice, by its nature, inspires such terror, that many living things which could evade its attack by their speed, grow faint at the sound of its roar as if dazed and overcome by force. A sick lion seeks out an ape to devour it, in order to be cured. The lion fears the cock, especially the white one. King of the beasts, it is tormented by the tiny sting of the scorpion and is killed by the venom of the snake. We learn of small beasts called leontophones, lion-killers. When captured, they are burnt; meat contaminated by a sprinkling of their ashes and thrown down at crossroads kills lions, even if they eat only a small an amount. For this reason, lions pursue leontophones with an instinctive hatred and, when they have the opportunity, they refrain from biting them but kill them by rending them to pieces under their paws. The tiger is named for its swiftness in flight; the Persians and Greeks call it 'arrow'. It is a beast distinguished by its varied"
    },
    "illustration": {
        "description": "The horseman has stolen a cub and has been pursued by the tiger. The thief can stop the tiger by a trick: he throws down a glass sphere and the tiger, seeing its own reflection, stops to nurse the sphere like a cub. She ends by losing both her revenge and her child."
    }
}
```

### Test page 8v:
<img src="../data/folio8v_ground_truth/folio8v.png" alt="Folio 8v" width="300"/>

# Loading Dependencies and Defining Pydantic Model

In [2]:
import base64
import os

from dotenv import load_dotenv
from openai import OpenAI
from pydantic import BaseModel, Field


load_dotenv() 
# A .env file containing OPENAI_API_KEY must exist in the notebooks folder
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

class Transcription(BaseModel):
    text: str = Field(..., description="The transcribed text in the original language")

class Translation(BaseModel):
    text: str = Field(..., description="The translated text in English")

class Illustration(BaseModel):
    description: str = Field(..., description="Description of the illustration on the folio.")


class FolioResponse(BaseModel):
    transcription: Transcription
    translation: Translation
    illustration: Illustration


# Transcription

## Transcription with GPT-4o only

### API Call

In [3]:
def get_folio_info(prompt_image: str) -> FolioResponse:
    # image path for folio 8r
    with open('../data/folio8r_ground_truth/folio8r.png', "rb") as image_file_1:
        image_data_1 = image_file_1.read()
    # Ground Truth for 8r
    with open('../data/folio8r_ground_truth/folio8r_ground_truth.txt', 'r') as file:
        example_output = file.read()
    # Prompt Image 
    with open(prompt_image, "rb") as image_file_2:
        image_data_2 = image_file_2.read()
    
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06", 
        messages=[
            {"role": "system", "content": "You are an expert in medieval manuscripts. Provide information about folio 8v. The following is an example for the folio 8r and the corresponding image."},
            {"role": "user", "content": [
                {"type": "text", "text": example_output},
                # The source files are PNGs, so the data URL is labelled image/png
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64.b64encode(image_data_1).decode()}"}
                },
                {"type": "text", "text": "This is the image of folio 8v:"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64.b64encode(image_data_2).decode()}"}
                }
            ]}
        ],
        response_format=FolioResponse
    )
    
    return completion.choices[0].message.parsed



result = get_folio_info('../data/folio8v_ground_truth/folio8v.png')
result

FolioResponse(transcription=Transcription(text='dissimula maculis. uni uero crudelitate manus uult. eius \nomine flumen cygnus appellant. qui et rapidissim us siue \nomnium fluminum. has magnit hirmana gignit. Tigri \nu uacuum rapte solus reperit cubile. ubi rapo \nr vt se ignis institar. Ar uile quod uccowell fuga ca. un \ndem si nuloc orat fere se spueran nec euaden chuilum sub \npere sibi posse subditsime ueram in huiusmodi fraude mo \ntra ur. Vbi se congnitu .midera .perendi storon pierc. \nAc illa ymagine sic luditur: et sololem piratic. Leo vat \nimpecum colliget seram dehiderunt. Rursus man spre ptem. \nreterna. tosti sead comprehendi umn equum turinche \nfundit eric aucindu esthumulo udo et fuga impericums. \nlium niu spere obtencu sequenteru eraurat. nec en fodus \ntaem maurm neumova fraudes gedulct. Osatlan uer stat. \nymaginem. ac ci lacarauta fci em modoec. Sicqc penprerer \nsue rubido decapt. es tumuda comet et exium. De \nuterno leene nasciatur et pardi. et terumronignem 
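The API call above passes both images inline as base64 data URLs. The following is a minimal sketch of that encoding step using only the standard library. The helper name `build_data_url` is hypothetical and not part of the notebook, and the MIME type is guessed from the file extension instead of being hard-coded. Dummy bytes stand in for the real image content.

```python
import base64
import mimetypes


def build_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL (hypothetical helper, not from the notebook)."""
    # Guess the MIME type from the extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"Cannot guess MIME type for {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Dummy bytes are used here; in the notebook this would be the content of the PNG file.
url = build_data_url(b"abc", "folio8v.png")
print(url)
```

Deriving the MIME type from the file name keeps the label in the data URL consistent with the actual file format.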

### Saving the Result

In [3]:
import json
import os

result_json = json.dumps(result.model_dump(), ensure_ascii=False, indent=2)


os.makedirs('../output', exist_ok=True)

with open('../output/result_with_gpt4o.json', 'w', encoding='utf-8') as json_file:
    json.dump(result.model_dump(), json_file, ensure_ascii=False, indent=2)

print(result_json)


{
  "transcription": {
    "text": "distincra maculis. uterque crucelocitate immatuit. qui ex\r\nnomine flumen cygnus appellatur quia rapidissimus sit\r\nomnium fluminum. has magnit humane gignit. Tigris \r\nut vacuum rapte solius repercit cubile: suo rapo \r\nrus reliquis insistit. Ac ulver qui quo vacua fugac, un\r\ndem ordine locaturi se se spuent: nec evadere ullum sub\r\nflit pree sibi posse subjectum: uternam huh modum fraude in can\r\ntenur. Ubi se conspirum invaderet; feamine diverso peric.\r\nAc illum imaginem suis ludicatur: et sololem punit. Leo vact\r\nimpetum collegit feram decludant. Rursus nam sperate \r\nneema dotes eam comprehendimus equorum vitum suspra\r\nfundit er cracundem stimulo ulodor fugac summu innit\r\nkam inc sper obdutes sequentem retarai. nec vi sedule\r\ntantum maneum ova fraude venduc. Casitan quot latet ease volen\r\nimaginem. ac ci acaturatur ferem residoc. sioc ripariter\r\nrue subido despat’ er tundem se cancrum. \r\nDO\r\npardo.\r\nUterus e\t\r\ngen

### Comparison with Ground Truth

In [4]:
from difflib import SequenceMatcher
import pandas as pd
import json
import Levenshtein

with open('../data/folio8v_ground_truth/folio8v_ground_truth.json', 'r') as json_file:
    ground_truth = json.load(json_file)

ground_truth = ground_truth["transcription"]["text"]

def similarity_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def normalized_levenshtein(a, b):
    levenshtein_distance = Levenshtein.distance(a, b)
    max_len = max(len(a), len(b))
    return 1 - (levenshtein_distance / max_len) if max_len > 0 else 1

preprocessed_result = ' '.join(result.transcription.text.split())
preprocessed_ground_truth = ' '.join(ground_truth.split())

similarity = similarity_ratio(preprocessed_result, preprocessed_ground_truth)
levenshtein_distance = Levenshtein.distance(preprocessed_result, preprocessed_ground_truth)
normalized_levenshtein_similarity = normalized_levenshtein(preprocessed_result, preprocessed_ground_truth)

df = pd.DataFrame({
    "model": ["gpt-4o"], 
    "sequence_similarity": [similarity],
    "normalized_levenshtein_similarity": [normalized_levenshtein_similarity]
})

print(f"Similarity ratio between the result and ground truth: {similarity:.4f}")
print(f"Normalized Levenshtein similarity between the result and ground truth: {normalized_levenshtein_similarity:.4f}")
df

Similarity ratio between the result and ground truth: 0.0487
Normalized Levenshtein similarity between the result and ground truth: 0.5970


| | model | sequence_similarity | normalized_levenshtein_similarity |
|---|---|---|---|
| 0 | gpt-4o | 0.04866 | 0.596991 |
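The sequence similarity is far lower than the normalized Levenshtein similarity for the same pair of strings. One plausible cause, stated here as an assumption that has not been checked against the folio data, is the `autojunk` heuristic of `difflib.SequenceMatcher`: when the second sequence has 200 or more items, items that make up more than about 1% of it are treated as junk for matching. In running text almost every character is that frequent. The sketch below shows the effect on synthetic strings, not on the transcriptions.

```python
from difflib import SequenceMatcher

# Synthetic strings (not the folio data): b is 300 characters long and every
# character in it is frequent; a is the same text with one extra leading character.
b = "abcde " * 50
a = "x" + b

# Default behaviour: the autojunk heuristic is active for sequences of 200+ items.
ratio_default = SequenceMatcher(None, a, b).ratio()

# Same comparison with the heuristic switched off.
ratio_no_autojunk = SequenceMatcher(None, a, b, autojunk=False).ratio()

print(f"autojunk=True (default): {ratio_default:.4f}")
print(f"autojunk=False:          {ratio_no_autojunk:.4f}")
```

If this is what happens here, passing `autojunk=False` in `similarity_ratio` would likely yield values closer to the Levenshtein-based score. The numbers reported in this notebook were produced with the default setting.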


## Transcription with Transkribus only

### Loading Transkribus Text

In [5]:
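# The Transkribus output contains non-ASCII characters, so read it explicitly as UTF-8
with open('../output/transkribus_text.txt', 'r', encoding='utf-8') as file: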
    transkribus_text = file.read()

transkribus_text


'clistintamaculis .uirtute eruelocitate mirabilis excui\nnomine flumentẏgris appellatur qal is rapiclissimꝰ sit\nommimlluiorum has magis huramagignit. Tigs\nuubiuacuum rapte sobolis repperit cubile:ilico rapto\nrisuestigns insistit.At ille qmuis equo uectus fugaci ui\nclens inuelocitate fere se puerci. nec euaclencli ullum sub\npete sibi posse subsichunr ternam hummocli fraucle mo\nlitur Abi se contiguum uiclerit:speram cle uitropicit.\nEt illa ẏmagine suiluclitur et sobolem putat.Neuosat\nimpetum collige fecum clesiclerans.Nursus manispecie\nrecenta. totis se aclcomphenclenclum equitemuiribz\nfunclit. et iracunclie stimulo uelocit fugientiimminet.\nfrum ille spere obieccu sequentemrecamat.nec tn secluli\ntatem matrismemoria frauclis exrcluolit. qassamuersat\nymaginem.et qilaccatura fecum resiclec. Sicq: pietatis\nsue stuclio clecepta:er uinclictam amittit etplem.De\nAAR ASatatsachv\nparclo\narclus e\ngenuari\num ac uelo\ncissimu er p\ncepsaclsang\nnem saltu\nemimaclmor\ntem rint • lo=

### Comparison with Ground Truth

In [6]:
from difflib import SequenceMatcher
import pandas as pd
import json
import Levenshtein

with open('../data/folio8v_ground_truth/folio8v_ground_truth.json', 'r') as json_file:
    ground_truth = json.load(json_file)

ground_truth = ground_truth["transcription"]["text"]

def similarity_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def normalized_levenshtein(a, b):
    levenshtein_distance = Levenshtein.distance(a, b)
    max_len = max(len(a), len(b))
    return 1 - (levenshtein_distance / max_len) if max_len > 0 else 1

preprocessed_result = ' '.join(transkribus_text.split())
preprocessed_ground_truth = ' '.join(ground_truth.split())

similarity = similarity_ratio(preprocessed_result, preprocessed_ground_truth)
levenshtein_distance = Levenshtein.distance(preprocessed_result, preprocessed_ground_truth)
normalized_levenshtein_similarity = normalized_levenshtein(preprocessed_result, preprocessed_ground_truth)

# Only needed if you have not run the GPT-4o code above
# df = pd.DataFrame({
#     "model": ["gpt-4o"], 
#     "sequence_similarity": [similarity],
#     "normalized_levenshtein_similarity": [normalized_levenshtein_similarity]
# })

transkribus_row = {
    "model": "transkribus", 
    "sequence_similarity": similarity,
    "normalized_levenshtein_similarity": normalized_levenshtein_similarity
}

df.loc[len(df)] = transkribus_row

print(f"Similarity ratio between the result and ground truth: {similarity:.4f}")
print(f"Normalized Levenshtein similarity between the result and ground truth: {normalized_levenshtein_similarity:.4f}")
df

Similarity ratio between the result and ground truth: 0.0635
Normalized Levenshtein similarity between the result and ground truth: 0.7181


| | model | sequence_similarity | normalized_levenshtein_similarity |
|---|---|---|---|
| 0 | gpt-4o | 0.04866 | 0.596991 |
| 1 | transkribus | 0.063453 | 0.718131 |
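To make the normalized Levenshtein similarity concrete: it is one minus the edit distance divided by the length of the longer string. The sketch below uses a plain dynamic-programming edit distance in place of the `Levenshtein` package so that it runs without extra dependencies. The function names are illustrative, and the word pair is the first token of the ground truth and of the GPT-4o output.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1 - distance / length of the longer string; two empty strings count as identical."""
    max_len = max(len(a), len(b))
    return 1 - levenshtein_distance(a, b) / max_len if max_len > 0 else 1.0


distance = levenshtein_distance("distincta", "distincra")
similarity = normalized_similarity("distincta", "distincra")
print(distance, round(similarity, 4))
```

One substituted character in a nine-character word gives a distance of 1 and a similarity of 1 - 1/9, roughly 0.889.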


## Transcription with GPT-4o and Transkribus

### API Call

In [15]:
def get_folio_info(prompt_image: str) -> FolioResponse:
    # image path for folio 8r
    with open('../data/folio8r_ground_truth/folio8r.png', "rb") as image_file_1:
        image_data_1 = image_file_1.read()
    # Ground Truth for 8r
    with open('../data/folio8r_ground_truth/folio8r_ground_truth.txt', 'r') as file:
        example_output = file.read()
    # Prompt Image 
    with open(prompt_image, "rb") as image_file_2:
        image_data_2 = image_file_2.read()
    # Transkribus Text
    with open('../output/transkribus_text.txt', 'r', encoding='utf-8') as file:
        transkribus_text = file.read()
    
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06", 
        messages=[
            {"role": "system", "content": "You are an expert in medieval manuscripts. Provide information about folio 8v. The following is an example for the folio 8r and the corresponding image."},
            {"role": "user", "content": [
                {"type": "text", "text": example_output},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64.b64encode(image_data_1).decode()}"}
                },
                {"type": "text", "text": f"""The following is the image of folio 8v. This is the attempt from Transkribus to transcribe the text from the image. Note that it can be incorrect, but still be helpful for the transcription.
                 ### Transkribus Start ###
                 {transkribus_text}
                 ### Transkribus End ###
                 Think step by step when you are doing the transcription. You must not hallucinate.

"""},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64.b64encode(image_data_2).decode()}"}
                }
            ]}
        ],
        response_format=FolioResponse,
    )
    
    return completion.choices[0].message.parsed



result = get_folio_info('../data/folio8v_ground_truth/folio8v.png')
result

FolioResponse(transcription=Transcription(text='distincta maculis. virtute et velocitate mirabilis. ex cui\nnomine flumen Tigris appellatur quia rapidissimus sit\nomnium fluminum. has magis humana gignit. Tigris\nubi vacuum rapte sobolis repperit cubile: ilico rapto\nris vestigiis insistit. At ille quamvis equo vectus fugaci, vi\ndens in velocitate fere se superari, nec evadendi ullum sub\npetere sibi posse subsidium: ternam huiusmodi fraude mo\nlitur. Ubi se contiguum viderit: speram de vitro iacit.\nEt illa imagine sui luditur et sobolem putat. Revocat\nimpetum colligit secum desiderans. Rursus nam specie\nrecenta, totis se ad comprehendendum equitem viribus\nfundit, et iracundie stimulo velociter fugienti imminet.\nFrum ille spere obiectu sequentem retardat, nec tamen seclu\ntatem matris memoria fraudis excludit. Cassata versat\nymaginem, et quia captura fecum residet. Sicque pietatis\nsue studio decepta: et vindictam amittit et pignus. De\npardo. Pardus est\ngenuitate varius et\nve

### Saving the Result

In [16]:
import json
import os

result_json = json.dumps(result.model_dump(), ensure_ascii=False, indent=2)

os.makedirs('../output', exist_ok=True)

with open('../output/result_with_gpt4o_and_transkribus.json', 'w', encoding='utf-8') as json_file:
    json.dump(result.model_dump(), json_file, ensure_ascii=False, indent=2)

print(result_json)


{
  "transcription": {
    "text": "distincta maculis. virtute et velocitate mirabilis. ex cui\nnomine flumen Tigris appellatur quia rapidissimus sit\nomnium fluminum. has magis humana gignit. Tigris\nubi vacuum rapte sobolis repperit cubile: ilico rapto\nris vestigiis insistit. At ille quamvis equo vectus fugaci, vi\ndens in velocitate fere se superari, nec evadendi ullum sub\npetere sibi posse subsidium: ternam huiusmodi fraude mo\nlitur. Ubi se contiguum viderit: speram de vitro iacit.\nEt illa imagine sui luditur et sobolem putat. Revocat\nimpetum colligit secum desiderans. Rursus nam specie\nrecenta, totis se ad comprehendendum equitem viribus\nfundit, et iracundie stimulo velociter fugienti imminet.\nFrum ille spere obiectu sequentem retardat, nec tamen seclu\ntatem matris memoria fraudis excludit. Cassata versat\nymaginem, et quia captura fecum residet. Sicque pietatis\nsue studio decepta: et vindictam amittit et pignus. De\npardo. Pardus est\ngenuitate varius et\nvelocissimus. 

### Comparison with Ground Truth

In [22]:
from difflib import SequenceMatcher
import pandas as pd
import json
import Levenshtein

with open('../data/folio8v_ground_truth/folio8v_ground_truth.json', 'r') as json_file:
    ground_truth = json.load(json_file)

ground_truth = ground_truth["transcription"]["text"]

def similarity_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def normalized_levenshtein(a, b):
    levenshtein_distance = Levenshtein.distance(a, b)
    max_len = max(len(a), len(b))
    return 1 - (levenshtein_distance / max_len) if max_len > 0 else 1

preprocessed_result = ' '.join(result.transcription.text.split())
preprocessed_ground_truth = ' '.join(ground_truth.split())

similarity = similarity_ratio(preprocessed_result, preprocessed_ground_truth)
levenshtein_distance = Levenshtein.distance(preprocessed_result, preprocessed_ground_truth)
normalized_levenshtein_similarity = normalized_levenshtein(preprocessed_result, preprocessed_ground_truth)

# Only needed if you have not run the GPT-4o code above
# df = pd.DataFrame({
#     "model": ["gpt-4o"], 
#     "sequence_similarity": [similarity],
#     "normalized_levenshtein_similarity": [normalized_levenshtein_similarity]
# })

# Row for the combined approach (GPT-4o prompted with the Transkribus output)
combined_row = {
    "model": "gpt-4o_with_transkribus", 
    "sequence_similarity": similarity,
    "normalized_levenshtein_similarity": normalized_levenshtein_similarity
}

df.loc[len(df)] = combined_row

print(f"Similarity ratio between the result and ground truth: {similarity:.4f}")
print(f"Normalized Levenshtein similarity between the result and ground truth: {normalized_levenshtein_similarity:.4f}")
df

Similarity ratio between the result and ground truth: 0.3642
Normalized Levenshtein similarity between the result and ground truth: 0.8662


| | model | sequence_similarity | normalized_levenshtein_similarity |
|---|---|---|---|
| 0 | gpt-4o | 0.04866 | 0.596991 |
| 1 | transkribus | 0.063453 | 0.718131 |
| 3 | gpt-4o_with_transkribus | 0.364155 | 0.866192 |


## Collatex Comparison

In [23]:
from collatex import Collation, collate


collation = Collation()


In [24]:
# The result files were written as UTF-8, so read them back with the same encoding
with open('../output/result_with_gpt4o.json', 'r', encoding='utf-8') as json_file:
    result = json.load(json_file)
    gpt4o_transcription = result["transcription"]["text"]

with open('../output/result_with_gpt4o_and_transkribus.json', 'r', encoding='utf-8') as json_file:
    result = json.load(json_file)
    gpt4o_with_transkribus_transcription = result["transcription"]["text"]

with open('../output/transkribus_text.txt', 'r', encoding='utf-8') as file:
    transkribus_text = file.read()

In [25]:
collation.add_plain_witness("ground_truth", preprocessed_ground_truth)
collation.add_plain_witness("transcribus_result", transkribus_text)
collation.add_plain_witness("gpt-4o", gpt4o_transcription)
collation.add_plain_witness("gpt-4o_with_transkribus", gpt4o_with_transkribus_transcription)

alignment_table = collate(collation, near_match=True, segmentation=False, layout="vertical")

alignment_table


<collatex.core_classes.AlignmentTable at 0x12bc62fc0>

In [26]:
print(alignment_table)

+--------------+--------------------+----------------+-------------------------+
| ground_truth | transcribus_result |     gpt-4o     | gpt-4o_with_transkribus |
+--------------+--------------------+----------------+-------------------------+
|  distincta   |         -          |   distincra    |        distincta        |
+--------------+--------------------+----------------+-------------------------+
|   maculis    |  clistintamaculis  |    maculis     |         maculis         |
+--------------+--------------------+----------------+-------------------------+
|      ,       |         .          |       .        |            .            |
+--------------+--------------------+----------------+-------------------------+
|   virtute    |      uirtute       |    uterque     |         virtute         |
+--------------+--------------------+----------------+-------------------------+
|      et      |    eruelocitate    | crucelocitate  |            et           |
+--------------+------------

In [27]:
with open('../output/alignment_output.txt', 'w') as txtfile:
    txtfile.write(str(alignment_table))

In [28]:
csv_output = collate(collation, output="csv", near_match=True, segmentation=False)

with open('../output/alignment_output.csv', 'w', encoding='utf-8') as csvfile:
    csvfile.write(csv_output)
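CollateX aligns the tokens of several witnesses and reports them row by row. As a rough illustration of the idea for two witnesses only, here is a sketch built on `difflib` from the standard library. It is not CollateX's algorithm (it has no near-matching and handles just two witnesses), and `align_tokens` is a hypothetical helper. The tokens are taken from the first rows of the alignment table above.

```python
from difflib import SequenceMatcher


def align_tokens(a_tokens, b_tokens, gap="-"):
    """Return (a_token, b_token) rows; `gap` marks a position with no counterpart."""
    rows = []
    matcher = SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a_part = a_tokens[i1:i2]
        b_part = b_tokens[j1:j2]
        # Pad the shorter side so both sides contribute the same number of rows.
        width = max(len(a_part), len(b_part))
        a_part = a_part + [gap] * (width - len(a_part))
        b_part = b_part + [gap] * (width - len(b_part))
        rows.extend(zip(a_part, b_part))
    return rows


ground_truth_tokens = "distincta maculis , virtute et".split()
gpt4o_tokens = "distincra maculis . uterque crucelocitate".split()

rows = align_tokens(ground_truth_tokens, gpt4o_tokens)
for left, right in rows:
    print(f"{left:<12} | {right}")
```

Only exact token matches (here `maculis`) anchor the alignment. CollateX's `near_match=True` option additionally pairs similar but non-identical tokens, which this sketch does not attempt.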