# Testing OCR for the Timbuktu Chronicles
This notebook tests the accuracy of OCR output from HathiTrust and eScriptorium for samples of Tarikh al-Fattash and Tarikh al-Soudan. Each sample is scored against a manually corrected version.

The code uses CAMeL Tools' pre-processing utilities, such as de-diacritization. For more information, see CAMeL's [Guided Tour on Google Colab](https://colab.research.google.com/drive/1Y3qCbD6Gw1KEw-lixQx1rI6WlyWnrnDS?authuser=3).


Accuracy is measured with Character Error Rate (CER) and Word Error Rate (WER). More information on these error rates can be found here:
- https://sites.google.com/site/textdigitisation/qualitymeasures/computingerrorrates
- https://towardsdatascience.com/evaluating-ocr-output-quality-with-character-error-rate-cer-and-word-error-rate-wer-853175297510#5aec
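
Both metrics rest on the same edit-distance calculation: count the substitutions, deletions, and insertions needed to turn the reference into the OCR output, then divide by the reference length (in characters for CER, in words for WER). The following is a minimal pure-Python sketch of that idea. The function names `levenshtein`, `cer`, and `wer` are illustrative assumptions and are not the implementations used by the modules in this notebook.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed part of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate as a fraction of the reference length."""
    return levenshtein(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word Error Rate as a fraction of the reference word count."""
    ref_words = ref.split()
    return levenshtein(ref_words, hyp.split()) / len(ref_words)


print(cer("abcd", "abxd"))        # one substituted character out of four
print(wer("a b c d", "a x c d"))  # one substituted word out of four
```

Because the denominator is the reference length, both rates can exceed 1 when the OCR output contains many spurious insertions.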

Four Python modules that compute these metrics:
- https://pypi.org/project/fastwer/
- https://pypi.org/project/jiwer/
- https://pypi.org/project/asr_evaluation/
- https://pypi.org/project/asrtoolkit/




In [None]:
from google.colab import drive
drive.mount("/content/gdrive/", force_remount=True)

Mounted at /content/gdrive/


In [None]:
# Fattash Arabic only
fattashCorrect = open("/content/gdrive/My Drive/OCR_data/fattash-arabic-only-corrected.txt", "r").read()
fattashHT = open("/content/gdrive/My Drive/OCR_data/fattash-arabic-only.txt", "r").read()
fattashES = open("/content/gdrive/My Drive/OCR_data/fattash-simorgh-arabic-only.txt", "r").read()

# Soudan Arabic Only
soudanCorrect = open("/content/gdrive/My Drive/OCR_data/tarikh-as-soudan-pgs9-26-corrected.txt", "r").read()
soudanHT = open("/content/gdrive/My Drive/OCR_data/tarikh-as-soudan-ara-pgs9-26.txt", "r").read()
soudanES = open("/content/gdrive/My Drive/OCR_data/soudan-simorgh-arabic.txt", "r").read()
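
The `open(...).read()` calls above rely on the platform's default encoding and leave the file handles to be closed by the garbage collector. A slightly more defensive variant, sketched below, reads each file as UTF-8 and closes it immediately. The helper name `load_texts` and the assumption that the files are UTF-8 encoded are mine, not taken from the original notebook.

```python
from pathlib import Path


def load_texts(data_dir, filenames):
    """Read several text files from data_dir and return {label: contents}.

    Assumes the files are UTF-8 encoded (an assumption, not verified here).
    """
    data_dir = Path(data_dir)
    return {
        label: (data_dir / name).read_text(encoding="utf-8")
        for label, name in filenames.items()
    }


# Hypothetical usage with the Fattash files named in the cell above:
# fattash = load_texts(
#     "/content/gdrive/My Drive/OCR_data",
#     {
#         "correct": "fattash-arabic-only-corrected.txt",
#         "HT": "fattash-arabic-only.txt",
#         "ES": "fattash-simorgh-arabic-only.txt",
#     },
# )
```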

In [None]:
# Pre-process the corrected, HathiTrust, and eScriptorium texts with the following CAMeL Tools functions: Unicode normalization, orthographic normalization, and de-diacritization
%pip install camel-tools
from camel_tools.utils.normalize import normalize_unicode
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar
from camel_tools.utils.dediac import dediac_ar

Collecting camel-tools
  Downloading https://files.pythonhosted.org/packages/06/23/331ce904926a8d53a527aac34bfe03fffb9fd1d4597924cbcd65432a53ef/camel_tools-1.1.0.tar.gz (56kB)
Collecting transformers==3.0.2
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting camel-kenlm
  Downloading https://files.pythonhosted.org/packages/e0/4e/147d258c7168b8f538e6aa7c4dc602b2bb696452502af8af35876a28de78/camel-kenl

In [None]:
# Pre-process Arabic text before scoring.
# Note: it is not yet confirmed that every CAMeL Tools step below is necessary.
def preprocess(text):
  # Apply Unicode normalization, which expands ligatures such as "ﷺ" into their full text
  text = normalize_unicode(text)

  # Normalize alef variants to 'ا' 
  text = normalize_alef_ar(text)

  # Normalize alef maksura 'ى' to yeh 'ي'
  text = normalize_alef_maksura_ar(text)

  # Normalize teh marbuta 'ة' to heh 'ه'
  text = normalize_teh_marbuta_ar(text)

  # Remove diacritization (tashkeel / harakat)
  text = dediac_ar(text)

  return text
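
To make the effect of `preprocess` concrete without installing CAMeL Tools, the sketch below approximates the same five steps with the standard library only. It is an approximation under stated assumptions, not CAMeL's implementation: the name `preprocess_approx` is hypothetical, and the exact character sets (in particular, treating tatweel U+0640 as removable alongside the diacritics) are assumed and may differ from what `camel_tools` does.

```python
import re
import unicodedata

# Alef variants with madda, hamza above, hamza below, and wasla (assumed set)
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625\u0671]")
# Tanween, short vowels, shadda, sukun, dagger alef, and tatweel (assumed set)
_DIACRITICS = re.compile("[\u064b-\u0652\u0670\u0640]")


def preprocess_approx(text):
    """Approximate the notebook's preprocess() using only the standard library."""
    # Compatibility normalization expands presentation forms and ligatures
    text = unicodedata.normalize("NFKC", text)
    # Alef variants -> bare alef
    text = _ALEF_VARIANTS.sub("\u0627", text)
    # Alef maksura -> yeh
    text = text.replace("\u0649", "\u064a")
    # Teh marbuta -> heh
    text = text.replace("\u0629", "\u0647")
    # Strip diacritics
    return _DIACRITICS.sub("", text)


sample = "\u0625\u0650\u0644\u064e\u0649 \u0627\u0644\u0645\u064e\u062f\u0652\u0631\u064e\u0633\u064e\u0629"
print(sample, "->", preprocess_approx(sample))
```

Applying the same normalization to both the reference and the OCR output means that differences in vocalization or alef spelling are not counted as OCR errors.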



In [None]:
# fattash arabic only:
fattashCorrect = preprocess(fattashCorrect)
fattashHT = preprocess(fattashHT)
fattashES = preprocess(fattashES)

# soudan arabic only:
soudanCorrect = preprocess(soudanCorrect)
soudanHT = preprocess(soudanHT)
soudanES = preprocess(soudanES)


In [None]:
soudanCorrect

'<pb n="9"/>"\n(1) بسم الله الرحمان الرحيم وصلي الله علي\nسيدنا محمد نبيه واله وصحبه وسلم\n\\\nالحمد لله المنفرد بالملك والبقاء والقدره والثناء المحيط بعلمه بجميع الاشياء\nيعلم ما كان وما يكون وان لو كان كيف يكون لا يعزب عنه مثقال ذره في الارض\nولا في السماء يؤتي الملك من يشاء وينزع الملك ممن يشاء سبحانه من ملك\nقادر وعزيز قاهر الذي قهر1 عباده بالموت والفناء وهو الاول بلا ابتداء والاخر\nبلا انتهاء والصلاه والسلام علي سيد الاولين والاخرين سيدنا ومولانا محمد\nخاتم الرسل والانبياء وعلي اله واصحابه الطيبين الطاهرين من اهل الصفوه\nوالاعتناء صلي الله عليه وعليم اجمعين وسلم 2 صلاه وسلاما بلا انقطاع3 ولا\nانقضاء و بعد\nفقد ادركنا اسلافنا المتقدمين 4 اكثر ما يتوانسون به في مجالسهم ذكر\nالصحابه والصالحين رضي الله عنهم ورحمهم ثم ذكر 5 اشياخ بلادهم وملوكها\nوسيرهم وقصصهم وانبائهم 6 وايامهم ووفياتهم وهو احلي ما يرون واشهي\nما يتذاكرون حتي انقرض ذلك الجيل ومضي رحمه الله تعالي عليهم واما الجيل\n<pb n="10"/>"\n ٢ \nالثاني ما كان فيهم من له الاعتناء بذلك ولا من يقتدي بطريق السلف الماضين\nولا من له همه 

### FastWER
The following code is adapted from https://github.com/kennethleungty/OCR-Metrics-CER-WER/blob/main/OCR_Metrics_CER_WER_Colab.ipynb

In [None]:
!pip install pybind11
!pip install fastwer

Collecting pybind11
  Downloading https://files.pythonhosted.org/packages/00/30/57fe5b30b484277a5db2d0098465e2665b171162dba7afa87a7f326647c9/pybind11-2.7.0-py2.py3-none-any.whl (199kB)

In [None]:
import fastwer

In [None]:
# fastwer.score_sent(output, ref) returns the sentence-level Word Error Rate (WER) as a percentage.
# fastwer.score_sent(output, ref, char_level=True) returns the Character Error Rate (CER) instead.
# Each whole sample is passed here as a single "sentence".

# Word Error Rate

print("Fattash Sample OCR WER:")
print("HathiTrust: " + str(fastwer.score_sent(fattashHT, fattashCorrect)))
print("eScriptorium: " + str(fastwer.score_sent(fattashES, fattashCorrect)))

print()

print("Soudan Sample OCR WER:")
print("HathiTrust: " + str(fastwer.score_sent(soudanHT, soudanCorrect)))
print("eScriptorium: " + str(fastwer.score_sent(soudanES, soudanCorrect)))

Fattash Sample OCR WER:
HathiTrust: 27.5211
eScriptorium: 26.4233

Soudan Sample OCR WER:
HathiTrust: 18.277
eScriptorium: 22.2555


In [None]:
print("Fattash Sample OCR CER:")
print("HathiTrust: " + str(fastwer.score_sent(fattashHT, fattashCorrect, char_level=True)))
print("eScriptorium: " + str(fastwer.score_sent(fattashES, fattashCorrect, char_level=True)))

print()

print("Soudan Sample OCR CER:")
print("HathiTrust: " + str(fastwer.score_sent(soudanHT, soudanCorrect, char_level=True)))
print("eScriptorium: " + str(fastwer.score_sent(soudanES, soudanCorrect, char_level=True)))

Fattash Sample OCR CER:
HathiTrust: 6.3078
eScriptorium: 5.5518

Soudan Sample OCR CER:
HathiTrust: 5.1128
eScriptorium: 5.5535


### JIWER

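The version of jiwer installed here (2.2.0) does not provide a CER function, so only WER is computed. Unlike fastwer, which reports percentages, jiwer's `wer` returns a fraction.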

In [None]:
# Install jiwer as an alternative WER implementation (originally added when fastwer did not install).
%pip install jiwer

Collecting jiwer
  Downloading https://files.pythonhosted.org/packages/8c/cc/fb9d3132cba1f6d393b7d5a9398d9d4c8fc033bc54668cf87e9b197a6d7a/jiwer-2.2.0-py3-none-any.whl
Collecting python-Levenshtein
  Downloading https://files.pythonhosted.org/packages/2a/dc/97f2b63ef0fa1fd78dcb7195aca577804f6b2b51e712516cc0e902a9a201/python-Levenshtein-0.12.2.tar.gz (50kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp37-cp37m-linux_x86_64.whl size=149828 sha256=d7fae63022c3ab0f9299665f262f4256b2ab839358e3357627753ae84001a0ab
  Stored in directory: /root/.cache/pip/wheels/b3/26/73/4b48503bac73f01cf18e52cd250947049a7f339e940c5df8fc
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein, jiwer
Successfully installed jiwer-2.2.0 python-Levenshtein-0.12.

In [None]:
from jiwer import wer

In [None]:
# jiwer's wer(ground_truth, hypothesis) takes the reference first, the reverse of fastwer's argument order
print("Fattash Sample OCR WER:")
error1 = wer(fattashCorrect, fattashHT)
print("HathiTrust: " + str(error1))

error2 = wer(fattashCorrect, fattashES)
print("eScriptorium: " + str(error2))

print()
print("Soudan Sample OCR WER:")
error3 = wer(soudanCorrect, soudanHT)
print("HathiTrust: " + str(error3))
error4 = wer(soudanCorrect, soudanES)
print("eScriptorium: " + str(error4))



Fattash Sample OCR WER:
HathiTrust: 0.27388535031847133
eScriptorium: 0.25961783439490443

Soudan Sample OCR WER:
HathiTrust: 0.18258006584854833
eScriptorium: 0.2167015863513918
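
The fastwer and jiwer WER figures are on different scales (percent versus fraction), so a direct comparison needs a conversion. The sketch below copies the values printed in this notebook into two dictionaries (the dictionary names and keys are illustrative) and reports the gap in percentage points. The two libraries agree to within one point on every sample; the small remaining gaps may come from differences in tokenization, which is an assumption and has not been verified here.

```python
# WER values copied from the fastwer output above (percent)
fastwer_wer_percent = {
    "fattash_HT": 27.5211,
    "fattash_ES": 26.4233,
    "soudan_HT": 18.277,
    "soudan_ES": 22.2555,
}

# WER values copied from the jiwer output above (fraction)
jiwer_wer_fraction = {
    "fattash_HT": 0.27388535031847133,
    "fattash_ES": 0.25961783439490443,
    "soudan_HT": 0.18258006584854833,
    "soudan_ES": 0.2167015863513918,
}

# Gap in percentage points: jiwer (converted to percent) minus fastwer
gaps = {
    key: jiwer_wer_fraction[key] * 100 - fastwer_wer_percent[key]
    for key in fastwer_wer_percent
}

for key, gap in gaps.items():
    print(f"{key}: fastwer {fastwer_wer_percent[key]:.2f}%  "
          f"jiwer {jiwer_wer_fraction[key] * 100:.2f}%  gap {gap:+.2f}")
```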
