# Benchmarking and Improving OCR Systems for Southeast Asian Languages

## Data Collection

This section describes the process of collecting the data from Wikipedia.

In [11]:
from data_collection.article_pdf import ArticlePDF
from data_collection.download import download_article_texts, download_article_pdfs, get_articles_by_language

### Collecting Articles in English

I chose 100 Wikipedia articles as my dataset: the 20 most viewed articles in each of 5 categories (people, countries, cities, life, and structures) listed on [Wikipedia:Popular pages](https://en.wikipedia.org/wiki/Wikipedia:Popular_pages). For each article, I recorded the title as it appears in its English URL path.

In [2]:
people = ['Donald_Trump', 'Elizabeth_II', 'Barack_Obama', 'Christiano_Ronaldo', 'Michael_Jackson', 'Elon_Musk', 'Lady_Gaga', 'Adolf_Hitler', 'Eminem', 'Lionel_Messi', 'Justin_Bieber', 'Freddie_Mercury', 'Kim_Kardashian', 'Johnny_Depp', 'Steve_Jobs', 'Dwayne_Johnson', 'Michael_Jordan', 'Taylor_Swift', 'Stephen_Hawking', 'Kanye_West']
countries = ['United_States', 'India', 'United_Kingdom', 'Canada', 'Australia', 'China', 'Russia', 'Japan', 'Germany', 'France', 'Singapore', 'Israel', 'Pakistan', 'Philippines', 'Brazil', 'Italy', 'Netherlands', 'New_Zealand', 'Ukraine', 'Spain']
cities = ['New_York_City', 'London', 'Hong_Kong', 'Los_Angeles', 'Dubai', 'Washington,_D.C.', 'Paris', 'Chicago', 'Angelsberg', 'Mumbai', 'San_Francisco', 'Rome', 'Monaco', 'Toronto', 'Tokyo', 'Philadelphia', 'Machu_Picchu', 'Jerusalem', 'Amsterdam', 'Boston'] # Excluding Singapore since listed for countries
life = ['Cat', 'Dog', 'Animal', 'Lion', 'Coronavirus', 'Tiger', 'Human', 'Dinosaur', 'Elephant', 'Virus', 'Horse', 'Photosynthesis', 'Evolution', 'Apple', 'Bird', 'Mammal', 'Potato', 'Polar_bear', 'Shark', 'Snake']
structures = ['Taj_Mahal', 'Burj_Khalifa', 'Statue_of_Liberty', 'Great_Wall_of_China', 'Eiffel_Tower', 'Berlin_Wall', 'Stonehenge', 'Mount_Rushmore', 'Colosseum', 'Auschwitz_concentration_camp', 'Great_Pyramid_of_Giza', 'One_World_Trade_Center', 'Empire_State_Building', 'White_House', 'Petra', 'Large_Hadron_Collider', 'Hagia_Sophia', 'Golden_Gate_Bridge', 'Panama_Canal', 'Angkor_Wat'] # Excluding Machu Picchu since listed for cities

english_titles = people + countries + cities + life + structures

english_articles = []
for english_title in english_titles:
    url = f'https://en.wikipedia.org/wiki/{english_title}'
    article = ArticlePDF(english_title, english_title, url, 'en')
    english_articles.append(article)

### Collecting Articles in SEA Languages

Using the [MediaWiki Action API](https://www.mediawiki.org/wiki/API:Main_page), I then collected the articles' URLs and names in the following languages:
- Thai
- Vietnamese
- Bahasa Indonesia (Indonesian)


I referred to the [Wikimedia Language Codes](https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all) to identify the mapping of languages to language codes used by the API.

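The API returned no matching article in any of the three languages for two of the English titles. `Christiano_Ronaldo` is a misspelling of `Cristiano_Ronaldo`, which may explain that miss. The missing articles are: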
- Thai:
    - Christiano_Ronaldo
    - Angelsberg
- Vietnamese:
    - Christiano_Ronaldo
    - Angelsberg
- Bahasa:
    - Christiano_Ronaldo
    - Angelsberg
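
The language lookup described above can be sketched with the `langlinks` property of the MediaWiki Action API. This is a minimal sketch, not the implementation behind `get_articles_by_language`: the function names and the `User-Agent` string are hypothetical, and only the standard library is assumed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = 'https://en.wikipedia.org/w/api.php'


def build_langlink_url(english_title, lang_code):
    """Build a query asking for one article's counterpart in one target language."""
    params = {
        'action': 'query',
        'prop': 'langlinks',
        'titles': english_title,
        'lllang': lang_code,  # Wikimedia language code, e.g. 'th', 'vi', 'id'
        'llprop': 'url',      # also return the full URL of the linked article
        'format': 'json',
    }
    return f'{API_URL}?{urlencode(params)}'


def parse_langlink(response):
    """Return (title, url) of the linked article, or None when there is no link."""
    pages = response.get('query', {}).get('pages', {})
    for page in pages.values():
        for link in page.get('langlinks', []):
            return link['*'], link.get('url')
    return None


def fetch_langlink(english_title, lang_code):
    """Query the live API (needs network access); hypothetical helper."""
    request = Request(build_langlink_url(english_title, lang_code),
                      headers={'User-Agent': 'ocr-benchmark-sketch/0.1'})
    with urlopen(request) as resp:
        return parse_langlink(json.load(resp))
```

A title with no counterpart simply yields `None`, which is one way the missing articles listed above could be detected.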

In [None]:
thai_articles = get_articles_by_language(english_articles, 'th')
vietnamese_articles = get_articles_by_language(english_articles, 'vi')
bahasa_articles = get_articles_by_language(english_articles, 'id')

### Downloading Article PDFs and Text

At this point, I had collected the following metadata for each article in each language:
- Article title
- Article title in English
- Article URL
- Language of article

Using [Selenium](https://selenium-python.readthedocs.io/), I then downloaded the articles in PDF format. These PDFs serve as my dataset of images to perform OCR on.

Lastly, using the [MediaWiki Action API](https://www.mediawiki.org/wiki/API:Main_page), I downloaded the text of the articles into `.txt` files. These files serve as my ground truth.

Some things to note:
- Downloading the articles takes around 10-15 minutes.
- You can change the destination folder for the downloaded files by modifying the arguments passed to `download_article_pdfs` and `download_article_texts`. Currently, it is set to `../data/<language>`.
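
For the ground-truth text, one possible approach is the `extracts` property (TextExtracts) with `explaintext`, which returns the article body as plain text. The sketch below is an assumption about how such a download could look, not the code in `download_article_texts`; the helper names, the `User-Agent` string, and the `<out_dir>/<title>.txt` layout are hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(title, lang_code):
    """Build a query for the plain-text body of one article on <lang_code>.wikipedia.org."""
    params = {
        'action': 'query',
        'prop': 'extracts',
        'explaintext': 1,  # plain text instead of HTML
        'titles': title,
        'format': 'json',
    }
    return f'https://{lang_code}.wikipedia.org/w/api.php?{urlencode(params)}'


def parse_extract(response):
    """Return the article text from the response, or '' if none is present."""
    pages = response.get('query', {}).get('pages', {})
    for page in pages.values():
        return page.get('extract', '')
    return ''


def save_article_text(title, lang_code, out_dir):
    """Download one article (needs network access) and write it as UTF-8 text."""
    request = Request(build_extract_url(title, lang_code),
                      headers={'User-Agent': 'ocr-benchmark-sketch/0.1'})
    with urlopen(request) as resp:
        text = parse_extract(json.load(resp))
    out_path = Path(out_dir) / f'{title}.txt'  # hypothetical file layout
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding='utf-8')
    return out_path
```

Writing with an explicit `utf-8` encoding matters here because the Thai and Vietnamese articles contain non-ASCII characters.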

In [4]:
download_article_pdfs(english_articles, '../data/english')
download_article_pdfs(thai_articles, '../data/thai')
download_article_pdfs(vietnamese_articles, '../data/vietnamese')
download_article_pdfs(bahasa_articles, '../data/bahasa')

download_article_texts(english_articles, '../data/english')
download_article_texts(thai_articles, '../data/thai')
download_article_texts(vietnamese_articles, '../data/vietnamese')
download_article_texts(bahasa_articles, '../data/bahasa')

## Data Preprocessing

This section describes the steps taken to preprocess the collected data.

### Converting PDFs to Images

As most OCR tools do not read documents in PDF format, I converted the collected PDFs into PNG images. If a PDF has multiple pages, one PNG image is stored per page.
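
A per-page conversion of this kind could be sketched with the third-party `pdf2image` package, which wraps Poppler. This is an assumption for illustration, not the code in `convert_pdfs_to_pngs`; the helper names, the `dpi` default, and the `<article>/page-<n>.png` layout are hypothetical.

```python
from pathlib import Path


def page_png_path(pdf_path, page_index):
    """Place page images in a folder named after the PDF: <stem>/page-<n>.png."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_suffix('') / f'page-{page_index}.png'


def convert_pdf_to_pngs(pdf_path, dpi=200):
    """Render every page of one PDF to its own PNG and return the output paths."""
    # Imported lazily: pdf2image is third-party and needs Poppler installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of PIL images
    out_paths = []
    for index, page in enumerate(pages):
        out_path = page_png_path(pdf_path, index)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        page.save(out_path, 'PNG')
        out_paths.append(out_path)
    return out_paths
```

A higher `dpi` gives the OCR tools sharper glyphs at the cost of larger files and slower recognition.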

In [2]:
from helper.preprocess import convert_pdfs_to_pngs

convert_pdfs_to_pngs('../data/english')
convert_pdfs_to_pngs('../data/thai')
convert_pdfs_to_pngs('../data/vietnamese')
convert_pdfs_to_pngs('../data/bahasa')

## Benchmarking

This section describes the steps taken to benchmark OCR tools across different languages using the collected data.

### Iteration v1: Preliminary Results

In this iteration, I benchmarked on the 20 shortest English articles to get preliminary results quickly.

In order of increasing length, the 20 shortest English articles are as follows:

In [1]:
shortest_articles = ['Polar_bear', 
                     'Mount_Rushmore',
                     'Potato',
                     'Burj_Khalifa',
                     'Machu_Picchu',
                     'Petra',
                     'Animal',
                     'Great_Wall_of_China',
                     'Angkor_Wat',
                     'Taj_Mahal',
                     'Colosseum',
                     'Japan',
                     'Elizabeth_II',
                     'Apple',
                     'Photosynthesis',
                     'Coronavirus',
                     'Lionel_Messi',
                     'Eiffel_Tower',
                     'Large_Hadron_Collider',
                     'Monaco']

#### EasyOCR

[EasyOCR](https://github.com/JaidedAI/EasyOCR) relies on a text detection model using the CRAFT algorithm and a recognition model using a Convolutional Recurrent Neural Network (CRNN). EasyOCR supports over 80 languages, including the ones I wish to test.

Refer to EasyOCR's [supported languages](https://www.jaided.ai/easyocr/) for the mappings of language to EasyOCR language codes. 

In [8]:
import easyocr

# This needs to run only once to load the model into memory
en_reader = easyocr.Reader(['en']) 
th_reader = easyocr.Reader(['th'])
vi_reader = easyocr.Reader(['vi'])
id_reader = easyocr.Reader(['id'])

In [None]:
from ocr.easyocr import run_easy_ocr

# run_easy_ocr(shortest_articles, '../data/thai', th_reader)
# run_easy_ocr(shortest_articles, '../data/english', en_reader)
# run_easy_ocr(shortest_articles, '../data/vietnamese', vi_reader)
# run_easy_ocr(shortest_articles, '../data/bahasa', id_reader)
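
For reference, recognizing the pages of one article with an EasyOCR reader could look like the sketch below. It is not the code in `run_easy_ocr`; the helper names are hypothetical, and the choice of `detail=0` and `paragraph=True` is an assumption about which output shape is convenient for evaluation.

```python
def easyocr_page_text(reader, png_path):
    """Run EasyOCR on one page image and join the recognized text blocks."""
    # detail=0 returns only the recognized strings (no boxes or confidences);
    # paragraph=True asks EasyOCR to merge nearby lines into paragraphs.
    blocks = reader.readtext(str(png_path), detail=0, paragraph=True)
    return '\n'.join(blocks)


def easyocr_article_text(reader, png_paths):
    """Concatenate the text of all pages, in the order the paths are given."""
    return '\n'.join(easyocr_page_text(reader, path) for path in png_paths)
```

Because the reader is passed in, the same helper works with `en_reader`, `th_reader`, `vi_reader`, or `id_reader`. Page paths should be ordered numerically before the call, since a plain string sort puts `page-10.png` before `page-2.png`.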

#### Tesseract

[Tesseract](https://github.com/tesseract-ocr/tesseract) uses a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) architecture.

Running the `pytesseract` library requires installing Tesseract. On macOS, I recommend using Homebrew:

```bash
brew install tesseract 

# To enable support for extra languages
brew install tesseract-lang 
```

In [None]:
from ocr.tesseract import run_tesseract

run_tesseract(shortest_articles, '../data/thai', 'tha')
run_tesseract(shortest_articles, '../data/english', 'eng')
run_tesseract(shortest_articles, '../data/vietnamese', 'vie')
run_tesseract(shortest_articles, '../data/bahasa', 'ind')

#### Evaluation

To evaluate the accuracy of the OCR tools, I computed the Character Error Rate (CER) and Word Error Rate (WER) with the [jiwer](https://jitsi.github.io/jiwer/) library. The values are rounded to 4 decimal places and exported to a CSV file.

I previously tried [torchmetrics](https://lightning.ai/docs/torchmetrics/stable/), but it took 10-20 minutes to evaluate the results for a single article. jiwer evaluates the same results in milliseconds.
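
To make the two metrics concrete, here is a minimal pure-Python sketch of their usual definitions: the edit distance between reference and OCR output, divided by the reference length, counted in characters for CER and in whitespace-separated words for WER. It illustrates the definitions only and is not jiwer's implementation; the function names are hypothetical, and a non-empty reference is assumed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate, rounded to 4 decimal places."""
    return round(edit_distance(reference, hypothesis) / len(reference), 4)


def wer(reference, hypothesis):
    """Word Error Rate over whitespace-separated words, rounded to 4 decimal places."""
    ref_words = reference.split()
    return round(edit_distance(ref_words, hypothesis.split()) / len(ref_words), 4)
```

For example, `cer('abcd', 'abed')` is 0.25 (one substitution in four characters). Both rates can exceed 1 when the OCR output contains many insertions. Since Thai is written without spaces between words, a whitespace-based WER is likely less informative for Thai than CER.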


In [2]:
from evaluate import evaluate_to_csv

languages = ['english', 'thai', 'bahasa', 'vietnamese']

for language in languages:
    evaluate_to_csv(f'../data/{language}', 'text-clean.txt', 'easy-ocr-results-clean.txt', 'easy-ocr-evaluation')
    evaluate_to_csv(f'../data/{language}', 'text-clean.txt', 'tesseract-results-clean.txt', 'tesseract-evaluation')

### Iteration v2: Clean Up Results

Evaluating the raw results directly yielded very high error rates. After examining a few articles, I found that the OCR output contains a lot of noise that is absent from the ground truth, such as table contents and text embedded in images.

Some preprocessing of both the OCR results and the ground truth is therefore needed. These are the preprocessing steps I applied to reduce the error caused by noise and formatting issues:

#### Clean Up EasyOCR Output

- Remove in-text citations
- Remove "WIKIPEDIA The Free Encyclopedia" header
- Remove references

#### Clean Up Tesseract Output

- Remove alternating space
- Remove in-text citations
- Remove "WIKIPEDIA The Free Encyclopedia" header
- Remove references
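
The steps shared by both tools could be sketched with regular expressions as follows. This is a hedged illustration, not the code in `helper.clean_text`: it assumes citations look like `[12]` and that the reference list starts at a line reading `References`, which only holds for English pages. The Tesseract-specific alternating-space fix is left out because its exact rule is not described here.

```python
import re

HEADER = 'WIKIPEDIA The Free Encyclopedia'
CITATION = re.compile(r'\[\d+\]')  # in-text citations such as [12]
REFERENCES_HEADING = re.compile(r'^References[ \t]*$', re.MULTILINE)


def clean_ocr_text(text):
    """Strip the page header, in-text citations, and the trailing reference list."""
    text = text.replace(HEADER, '')
    text = CITATION.sub('', text)
    match = REFERENCES_HEADING.search(text)
    if match:
        # Keep only the text before the references heading.
        text = text[:match.start()]
    # Collapse runs of spaces or tabs left behind by the removals.
    text = re.sub(r'[ \t]{2,}', ' ', text)
    return text.strip()
```

Applying the same citation and reference rules to the ground truth keeps both sides of the comparison aligned, so the error rates reflect recognition quality rather than formatting differences.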

#### Clean Up Ground Truth

In [6]:
from helper.clean_text import clean_easy_ocr, clean_tesseract, clean_ground_truth

for language in languages:
    clean_easy_ocr(f'../data/{language}', language)
    clean_tesseract(f'../data/{language}', language)
    clean_ground_truth(f'../data/{language}', language)

### Iteration v2.1: Running on Clean Data



### Iteration v3: Expanding Text Corpus

In [4]:
from ocr.easyocr import run_easy_ocr_on_all

run_easy_ocr_on_all('../data/thai', th_reader)
run_easy_ocr_on_all('../data/english', en_reader)
run_easy_ocr_on_all('../data/vietnamese', vi_reader)
run_easy_ocr_on_all('../data/bahasa', id_reader)

Running at ../data/thai
Running on article: Apple
Results exist
Running on article: Elon_Musk
Results exist
Running on article: Cat
Results exist
Running on article: Polar_bear
Results exist
Running on article: Paris
Results exist
Running on article: Mumbai
Results exist
Running on article: Mount_Rushmore
Results exist
Running on article: Justin_Bieber
Results exist
Running on article: Russia
Results exist
Running on article: New_York_City
Results exist
Running on article: Panama_Canal
Results exist
Running on article: Lady_Gaga
Results exist
Running on article: Brazil
Results exist
Running on article: Berlin_Wall
Results exist
Running on article: Eminem
Results exist
Running on article: Barack_Obama
Results exist
Running on article: Jerusalem
Results exist
Running on article: Stephen_Hawking
Results exist
Running on article: Ukraine
Results exist
Running on article: Washington,_D.C.
Results exist
Running on article: France
Results exist
Running on article: Great_Pyramid_of_Giza
Result

In [None]:
from ocr.tesseract import run_tesseract_on_all

run_tesseract_on_all('../data/thai', 'tha')
run_tesseract_on_all('../data/english', 'eng')
run_tesseract_on_all('../data/vietnamese', 'vie')
run_tesseract_on_all('../data/bahasa', 'ind')

In [None]:
from helper.clean_text import clean_easy_ocr, clean_tesseract, clean_ground_truth

for language in languages:
    # clean_tesseract(f'../data/{language}', language)
    clean_easy_ocr(f'../data/{language}', language)

In [3]:
from evaluate import evaluate_to_csv

for language in languages:
    evaluate_to_csv(f'../data/{language}', 'text-clean.txt', 'easy-ocr-results-clean.txt', 'easy-ocr-evaluation')
    # evaluate_to_csv(f'../data/{language}', 'text-clean.txt', 'tesseract-results-clean.txt', 'tesseract-evaluation')

easy-ocr-evaluation.csv does not have results
tesseract-evaluation.csv does not have results
easy-ocr-evaluation.csv does not have results
tesseract-evaluation.csv does not have results
easy-ocr-evaluation.csv does not have results
tesseract-evaluation.csv does not have results
easy-ocr-evaluation.csv does not have results
tesseract-evaluation.csv does not have results


### Iteration v4: Transformer-Based Tools

In [None]:
from PIL import Image

from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load processor and model
processor = TrOCRProcessor.from_pretrained('openthaigpt/thai-trocr')
model = VisionEncoderDecoderModel.from_pretrained('openthaigpt/thai-trocr')

Config of the encoder: <class 'transformers.models.vit.modeling_vit.ViTModel'> is overwritten by shared encoder config: ViTConfig {
  "attention_probs_dropout_prob": 0.0,
  "encoder_stride": 16,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "image_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "model_type": "vit",
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "patch_size": 16,
  "qkv_bias": false,
  "transformers_version": "4.47.1"
}

Config of the decoder: <class 'transformers.models.electra.modeling_electra.ElectraForCausalLM'> is overwritten by shared decoder config: ElectraConfig {
  "_name_or_path": "/project/lt200324-optmul/pluem/model/huggingface_electra-small-25000-no-grad-small",
  "add_cross_attention": true,
  "architectures": [
    "ElectraModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act":

In [9]:
image_path = '../sample-data/thai/sample-article/page-0.png'
image = Image.open(image_path).convert("RGB")
print('image loaded')

# Preprocess the page image, generate token IDs, and decode them back to text
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(generated_text)

image loaded
การทํางาน


### Iteration v5: Difficulties

In [10]:
from ocr.easyocr import run_easy_ocr

articles = ['Polar_bear']

run_easy_ocr(articles, '../artificial_data/english', en_reader)

Running on article: Polar_bear
Finished! Time taken: 0.34470582008361816 seconds
