Kannada-English dictionary OCR

I have a PDF of an excellent Kannada-English etymological dictionary, a gem to any linguist or student studying Kannada. Unfortunately, the pdf is corrupted and therefore cannot be searched. For an Indian American trying to improve my Kannada in the year 2020, this was unacceptable! What good is an unsearchable dictionary? Surely, I couldn't be expected to scroll to the letter everytime I wanted to find a word! And thus, this project to OCR the entire dictionary was born!

An excerpt page from the dictionary (Upper left region of pg. 188)

Challenges and Approach

Approach 1: Infeasible to reconstruct the corrupted font table

At first, I tried to reconstruct the corrupted font table but quickly saw this was infeasible (if a pdf has multiple fonts, like this 1200 page pdf does, means you have to reconstruct the font table for potentially hundereds of fonts, made even more difficult by the prescence of Latin, extended Latin, and Kannada characters).

My next best bet was OCR'ing the dictionary. And while this may sound like a trivial and easy task when the document only has one script on it, mixed language text recognition is a whole other beast altogether! As an example, let's look at a single dictionary entry:

Here we see there is adjacent Kannada (U+0C80-U+0CFF), Kannada Transliteration (ISO 15919), IPA (International Phonetic Alphabet - uses many different unicode blocks), and English text (Basic Latin). That's four different adjacent scripts in one dictionary entry! This is truly a difficult problem to solve.

Approach 2: Perform OCR on the dictionary using Google Vision

I decided to use Google Vision to ocr, as it is probably the most state of the art mixed language text recognition at this point, and works even for languages often neglected in OCR (such as Kannada).

Hyperrecognition of Latin Characters

When I tried to OCR the above dictionary entry (ಅನಾಯಕತ್ವ), without providing any language hints, here was the output:

//without language hints
"'o SD o anāyakatva [əne:jakatve] n. [pol.)] 1\nleaderless state, utter lack of leadership 2 chaos,\nan-\narchy [Sk.].\n'"

As we can see here, it hyperrecognizes Latin text (which is to be expected). Next, I tried providing only Kannada as a language hint:

//with Kannada language hint
"'OJO AJIF3, anāyakatva [ǝne:jəkətve] n. (pol.) 1\nleaderless state, utter lack of leadership 2 chaos,\narchy [Sk.]\nan-\n'"

IPA characters messing up Kannada text recognition

Not looking too good, even with the language hint. What could be going wrong for it to fail the Kannada text recognition so miserably? So I tried recognizing a simpler text with just Kannada and English (basic latin), meaning no ipa (or transliteration).

mixed text with Kannada in one column and English in the other

I tried the above excerpt with Kannada language hint added (no language hint failed miserably), and the results are reproduced below (I just added in the columnar formatting for easier reading).

Left (Kannada)	Right (text)
ಮನೋಜ್ : ನೀವು ಬೆಳಿಗ್ಗೆ ಎಷ್ಟು	At what time do you get up in the
ಗಂಟೆಗೆ ಏಳೀರಿ ?	morning ?
ಮೋಹನ್ : ಆರು ಗಂಟೆಗೆ ಏಳೀನಿ.	I get up at six o'clock.
ಮನೋಜ್ : ಎದ್ದು ಏನು ಮಾಡ್ತೀರಿ ?	After getting up, what do you do?
ಮೋಹನ್ : ಎದ್ದು ವಾಕಿಂಗ್	I go for a walk. After the walk, I
ಹೋಗ್ತಿನಿ. ವಾಕಿಂಗ್	brush my teeth and I drink coffee.
ಹೋಗಿ ಬಂದು ಹಲ್ಲು
ಉಜ್ಜಿಕೊಳ್ಳಿನಿ. ಹಲ್ಲು
ಉಜ್ಜಿ ಕೊಂಡು ಕಾಫಿ
ಕುಡೀತೀನಿ.

So it works quite well without the IPA! It must that which is screwing up my recognition! Fortunately for me, all the IPA text in the dictionary is contained within brackets (bracket notation is used to represent narrow transcription in Phonetics). Which means that a simple regex expression can be used to remove the IPA text for each dictionary entry.

The same entry with the IPA text removed

After OCR'ing the above image (with Kannada language hint), both the Kannada text and the English definition have been recognized correctly:

"n.\n೧ಅನಾಯಕತ್ವ anayakatva\n(pol.) 1\nleaderless state, utter lack of leadership 2 chaos,\narchy [Sk.]\nan-\n"

The Kannada transliteration is not as much of a problem due to the relatively limited number of extra latin characters used in the Kannada Transliteration Standard (as opposed to IPA which has all sorts of funky characters). And if the Kannada text is read correctly, then the Transliteration can simply be generated from that.

Poor Archaic Character Recognition

But not all challenges are out of the way yet:

Because this is an etymological dictionary, it contains words from old Kannada, which includes letters that are no longer used in the modern Kannada alphabet.

Kannada literary works employed the letters ಱ (transliterated 'ṟ' or 'rh') and ೞ (transliterated 'ḻ', 'lh' or 'zh'), whose manner of articulation most plausibly could be > akin to those in present-day Malayalam and Tamil. The letters dropped out of use in the 12th and 18th centuries, respectively. Later Kannada works replaced 'rh' and 'lh' with > ರ (ra) and ಳ (la) respectively. Obsolete Kannada Letters

Since google vision presumambly is trained on data from modern Kannada, it cannot recognize ಱ or ೞ! When I tried the above excerpt, words where the second letter is ೞ, GCP failed miserably. (As we see below, it either hyperrecognizes the Kannada text as Latin Text, or simply reads the archaic character incorrectly, in this case ೞ as ಬಿ)

"wego\n?weg 3 erata\n? ಎಬಿವು' eravu\nn. [move) pulling (C. (Kitt.)) [Ka.\nD504(a)]\nn. (move] 1 pulling, dragging\n2 spasm, cramp (My. (Kitt.)) [Ka. D504(a)].\n•ಎಬಿವು' eravu\n283) vt. (move] 1 to rub;\nto rub off or out; to smear gently 2 to stroke gently\n[Ka. D505].\n"

Obviously something for google to improve on! However, Google Vision API doesn't accept custom characters. So for now, I have decided to try and use Tesseract to train a neaural net to recognize the IPA and Obsolete Kannada letters. This page will be updated as I go along.

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
demo		demo
img_dir1		img_dir1
README.md		README.md
capture.ipynb		capture.ipynb
capture_mac.ipynb		capture_mac.ipynb
footnote_detection.ipynb		footnote_detection.ipynb
line_detect.ipynb		line_detect.ipynb
ocr_clean.ipynb		ocr_clean.ipynb
qt_utils.py		qt_utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Kannada-English dictionary OCR

Challenges and Approach

Approach 1: Infeasible to reconstruct the corrupted font table

Approach 2: Perform OCR on the dictionary using Google Vision

Hyperrecognition of Latin Characters

IPA characters messing up Kannada text recognition

Poor Archaic Character Recognition

About

Languages

p-acharya/opencv_ocr

Folders and files

Latest commit

History

Repository files navigation

Kannada-English dictionary OCR

Challenges and Approach

Approach 1: Infeasible to reconstruct the corrupted font table

Approach 2: Perform OCR on the dictionary using Google Vision

Hyperrecognition of Latin Characters

IPA characters messing up Kannada text recognition

Poor Archaic Character Recognition

About

Topics

Resources

Stars

Watchers

Forks

Languages