# Optical Character Recognition (OCR) using ImageMagick and Tesseract

Creating usable digital text is the precondition to text analysis. Much of computational text analysis relies on Optical Character Recognition (OCR), that is, on the possibility of converting scanned images to machine-encoded text. Yet OCR quality remains an issue and fundamentally affects the possibilities of analysis.

> “OCR techniques still favor a very particular type of Roman-based font, which omits non-Western print traditions like Chinese woodblocks, non-print traditions like medieval manuscripts, or even regionally eclectic print traditions like German Fraktur.” (Piper, Andrew, et al. "The Page Image: Towards a Visual History of Digital Documents." Book History, vol. 23, 2020, p. 365-397., p. 366)

> “OCR was largely developed to process typewritten, English-language, mid-twentieth-century business documents. With that kind of input, OCR is remarkably reliable, transcribing with accuracy in the upper 90 percents. Turn an OCR engine toward historical documents, however, with distinct typography, complex layouts, torn pages, smeared ink, and any number of features those OCR engines were not trained to discern, then the reliability of OCR transcription declines precipitously.” (Cordell, Ryan. ["Why You (A Humanist) Should Care About Optical Character Recognition"](https://ryancordell.org/research/why-ocr/))


### Is my PDF an image or a text?  

It may be the case that your PDFs already contain a layer of text, in which case you don’t need to OCR them; you can convert them directly into .txt files. You can check with the `pdftotext` tool that ships with Xpdf, which outputs .txt files from PDFs: if there is no text layer, the output .txt file will be blank.

Note that Xpdf is not available on the JupyterHub environment. You’d have to install [Xpdf](https://www.xpdfreader.com/download.html) onto your own devices to use it.

In [None]:
!pdftotext PATH/TO/FILENAME.pdf PATH/TO/OUTPUTFILENAME.txt
!open PATH/TO/OUTPUTFILENAME.txt

In [None]:
!pdftotext ocr-corpus/1998_Location.pdf ocr-corpus/1998_Location.txt
!open ocr-corpus/1998_Location.txt
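The same check can be sketched in Python rather than by opening the .txt file by hand. This is a minimal sketch, assuming `pdftotext` is installed and on your PATH; the function names `looks_blank` and `pdf_has_text_layer` are made up for illustration and are not part of Xpdf.

```python
import subprocess


def looks_blank(extracted_text):
    """Return True if the extracted text contains nothing but whitespace."""
    # str.strip() also removes form-feed characters, which pdftotext
    # typically emits as page separators even when a page has no text.
    return extracted_text.strip() == ""


def pdf_has_text_layer(pdf_path):
    """Run pdftotext on a PDF and report whether any text came out."""
    # Passing "-" as the output name asks pdftotext to write to standard output.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return not looks_blank(result.stdout)


# Example call (requires pdftotext and the sample file):
# print(pdf_has_text_layer("ocr-corpus/1998_Location.pdf"))
```

If the function returns `False`, the PDF is most likely a scanned image and needs OCR.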

## Image Preprocessing: Improve image quality with ImageMagick

Tesseract needs high-quality images to work well; this tutorial uses TIFF (Tagged Image File Format) files. Some images will be easier to OCR than others, and a good-quality image will improve the accuracy of the OCR. Some images have a lot of “noise”: distracting variations in brightness, differences in fonts and sizes, errant markings, speckled pages, skewed pages, or damage to the document. Things you can do to improve image quality include cropping the picture to remove excess border space, straightening the image (deskewing), and removing noise.

We can do some preprocessing of the image to try and minimize noise and its impact on OCR quality.
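As one possible way to apply the clean-up steps listed above (deskew, remove noise, crop borders), the sketch below builds an ImageMagick command from Python. It assumes ImageMagick is installed and exposes the `convert` executable; `build_cleanup_command` and the output file name are hypothetical, and the 40% deskew threshold is just the value commonly suggested in the ImageMagick documentation, not a setting tuned for this corpus.

```python
import subprocess


def build_cleanup_command(input_path, output_path, deskew_threshold="40%"):
    """Build an ImageMagick command that deskews, despeckles, and trims an image."""
    return [
        "convert", input_path,
        "-deskew", deskew_threshold,  # straighten a slightly rotated scan
        "-despeckle",                 # reduce speckle noise
        "-trim", "+repage",           # crop uniform borders, then reset canvas offsets
        output_path,
    ]


# Example call (requires ImageMagick; the output name is a made-up example):
# subprocess.run(
#     build_cleanup_command("ocr-corpus/1998_Location.tiff",
#                           "ocr-corpus/1998_Location_clean.tiff"),
#     check=True,
# )
```

Whether these steps help depends on the scan, so it is worth comparing the OCR output with and without them.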

In [None]:
#convert pdf to tiff and improve image quality
!convert -density 300 PATH/TO/INPUT_FILENAME.pdf -depth 8 -strip -background white -alpha off OUTPUT_FILENAME.tiff

In [None]:
#convert pdf to tiff and improve image quality
!convert -density 300 ocr-corpus/1998_Location.pdf -depth 8 -strip -background white -alpha off ocr-corpus/1998_Location.tiff

N.B. The JupyterHub environment seems to have some security patches that prevent ImageMagick from converting PDFs. You may need to install [ImageMagick](https://imagemagick.org/script/download.php) on your own devices and run these commands in the command line if it's not working.

`-density *value*`
sets the resolution, in dots per inch (DPI), at which the PDF is rasterized; it must come before the input file name

`-depth *value*`
sets the bit depth of the image, i.e. the number of bits used per color channel in each pixel

`-strip`
strips the image of profiles, comments, and other extraneous information

`-background *color*`
sets the background color

`-alpha *type*`
controls the alpha (transparency) channel; `off` disables it, so the image is treated as fully opaque

The `-density` option makes sure the image has enough dots per inch (DPI) for OCR, and `-depth` fixes the number of bits per channel. The `-strip`, `-background`, and `-alpha` options remove metadata and transparency so that the output is a plain, opaque image.
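To make the effect of the density setting concrete, here is a small worked calculation. It assumes a US Letter page (8.5 × 11 inches), which is an assumption for illustration and not a statement about the sample file; the helper name `pixels_at_density` is likewise made up. ImageMagick's default density is 72 DPI, which is why the commands above raise it to 300.

```python
def pixels_at_density(width_in, height_in, dpi):
    """Return the pixel dimensions of a page rasterized at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# ImageMagick's default density (72 DPI) versus the 300 DPI used above,
# for an assumed 8.5 x 11 inch page.
print(pixels_at_density(8.5, 11, 72))   # → (612, 792)
print(pixels_at_density(8.5, 11, 300))  # → (2550, 3300)

# A bit depth of 8 gives 2**8 distinct levels per color channel.
print(2 ** 8)  # → 256
```

At 72 DPI each character is only a few pixels tall, which is usually too coarse for reliable recognition.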

## OCR with Tesseract and Pytesseract

Tesseract supports over 110 languages, including non-Western languages and writing systems. It is free and open-source software whose development was sponsored by Google for many years. It can be a good alternative to commercial software such as ABBYY FineReader.

Note that Tesseract is not supported in the JupyterHub environment. You’d have to install [Tesseract and its language packages](https://tesseract-ocr.github.io/tessdoc/Installation.html) on your own devices.

Now that you have converted your files into TIFF and preprocessed your images you can use Tesseract to recognize and extract the text from the image. 

At its simplest, a Tesseract command has the following syntax:
`tesseract imagename outputbase`

In [None]:
!tesseract ocr-corpus/1998_Location.tiff ocr-corpus/1998_Location -l eng txt

A number of options can be added:
- `-l *LANG*`

Add `-l LANG` to the command, where LANG is the language code (usually three letters) from the list of supported languages below.

English is the default language.

Languages in Tesseract:
afr (Afrikaans), amh (Amharic), ara (Arabic), asm (Assamese), aze (Azerbaijani), aze_cyrl (Azerbaijani - Cyrillic), bel (Belarusian), ben (Bengali), bod (Tibetan), bos (Bosnian), bre (Breton), bul (Bulgarian), cat (Catalan; Valencian), ceb (Cebuano), ces (Czech), chi_sim (Chinese simplified), chi_tra (Chinese traditional), chr (Cherokee), cos (Corsican), cym (Welsh), dan (Danish), deu (German), div (Dhivehi), dzo (Dzongkha), ell (Greek, Modern, 1453-), eng (English), enm (English, Middle, 1100-1500), epo (Esperanto), equ (Math / equation detection module), est (Estonian), eus (Basque), fas (Persian), fao (Faroese), fil (Filipino), fin (Finnish), fra (French), frk (Frankish), frm (French, Middle, ca.1400-1600), fry (West Frisian), gla (Scottish Gaelic), gle (Irish), glg (Galician), grc (Greek, Ancient, to 1453), guj (Gujarati), hat (Haitian; Haitian Creole), heb (Hebrew), hin (Hindi), hrv (Croatian), hun (Hungarian), hye (Armenian), iku (Inuktitut), ind (Indonesian), isl (Icelandic), ita (Italian), ita_old (Italian - Old), jav (Javanese), jpn (Japanese), kan (Kannada), kat (Georgian), kat_old (Georgian - Old), kaz (Kazakh), khm (Central Khmer), kir (Kirghiz; Kyrgyz), kmr (Kurdish Kurmanji), kor (Korean), kor_vert (Korean vertical), lao (Lao), lat (Latin), lav (Latvian), lit (Lithuanian), ltz (Luxembourgish), mal (Malayalam), mar (Marathi), mkd (Macedonian), mlt (Maltese), mon (Mongolian), mri (Maori), msa (Malay), mya (Burmese), nep (Nepali), nld (Dutch; Flemish), nor (Norwegian), oci (Occitan post 1500), ori (Oriya), osd (Orientation and script detection module), pan (Panjabi; Punjabi), pol (Polish), por (Portuguese), pus (Pushto; Pashto), que (Quechua), ron (Romanian; Moldavian; Moldovan), rus (Russian), san (Sanskrit), sin (Sinhala; Sinhalese), slk (Slovak), slv (Slovenian), snd (Sindhi), spa (Spanish; Castilian), spa_old (Spanish; Castilian - Old), sqi (Albanian), srp (Serbian), srp_latn (Serbian - Latin), sun (Sundanese), swa (Swahili), swe (Swedish), syr (Syriac), tam (Tamil), tat (Tatar), tel (Telugu), tgk (Tajik), tha (Thai), tir (Tigrinya), ton (Tonga), tur (Turkish), uig (Uighur; Uyghur), ukr (Ukrainian), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek - Cyrillic), vie (Vietnamese), yid (Yiddish), yor (Yoruba)

If the language you are using is not included but you already have a trained model for it, you can use your own model for OCR: place its `.traineddata` file in Tesseract's `tessdata` directory and pass the model's name to `-l`.

- `-l *SCRIPT*`  

Specifies the script to use. 

Scripts in Tesseract: Arabic, Armenian, Bengali, Canadian_Aboriginal, Cherokee, Cyrillic, Devanagari, Ethiopic, Fraktur, Georgian, Greek, Gujarati, Gurmukhi, HanS (Han simplified), HanS_vert (Han simplified, vertical), HanT (Han traditional), HanT_vert (Han traditional, vertical), Hangul, Hangul_vert (Hangul vertical), Hebrew, Japanese, Japanese_vert (Japanese vertical), Kannada, Khmer, Lao, Latin, Malayalam, Myanmar, Oriya (Odia), Sinhala, Syriac, Tamil, Telugu, Thaana, Thai, Tibetan, Vietnamese.

- Specify output file  

Tesseract can write several kinds of output files (for example txt, pdf, tsv, and hocr, an HTML-based format). To produce more than one, list the formats you want at the end of the command. Plain text (.txt) is the default output format.



For more parameters you can consult the [Tesseract documentation](https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc).
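The options above can also be assembled from Python and passed to the command-line tool. The following is a rough sketch, assuming the `tesseract` executable is installed; `build_tesseract_command` is a hypothetical helper, and the `fra` language pack in the commented example is only an illustration that would need to be installed separately. Several languages can be combined with `+`, as in `-l eng+fra`.

```python
import subprocess


def build_tesseract_command(image_path, output_base,
                            languages=("eng",), formats=("txt",)):
    """Build a tesseract command line from language codes and output formats."""
    return [
        "tesseract", image_path, output_base,
        "-l", "+".join(languages),  # e.g. "eng" or "eng+fra"
        *formats,                   # e.g. "txt", "pdf", "tsv", "hocr"
    ]


# Example call (requires Tesseract and the listed language packs):
# subprocess.run(
#     build_tesseract_command(
#         "ocr-corpus/1998_Location.tiff",
#         "ocr-corpus/1998_Location",
#         languages=("eng", "fra"),
#         formats=("txt", "pdf"),
#     ),
#     check=True,
# )
```

Tesseract appends the matching extension to the output base for each requested format, so no extension is given in `output_base`.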

### Pytesseract

Pytesseract is a Python wrapper for Google’s Tesseract OCR engine.

Remember to change the location of the example file (`"ocr-corpus/1998_Location.tiff"`) to the path of your own file.

In [None]:
import pytesseract
from PIL import Image

# Opens the image
image_of_text = Image.open("ocr-corpus/1998_Location.tiff")

# Converts image to string
string_from_image = pytesseract.image_to_string(image_of_text, lang="eng")

# Print first 150 chars of string
print(string_from_image[:150])

In [None]:
# Write string to a new file
with open("ocr-corpus/1998_Location.txt", 'w', encoding="utf-8") as f:
    f.write(string_from_image)

In [None]:
# List the codes of the languages available to Pytesseract
print(pytesseract.get_languages(config=''))
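A PDF with several pages is converted into a multi-page TIFF, and passing an opened PIL image to `image_to_string` may only OCR the frame that is currently selected (normally the first page). That behavior is an assumption worth checking against your own files. The sketch below walks through every frame explicitly; `ocr_frames` and `ocr_multipage_tiff` are hypothetical helper names, and joining pages with a form feed is just one possible convention.

```python
def ocr_frames(frames, ocr_page):
    """OCR each frame in turn and join the page texts with form feeds."""
    return "\f".join(ocr_page(frame) for frame in frames)


def ocr_multipage_tiff(path, lang="eng"):
    """OCR every page of a (possibly multi-page) TIFF file."""
    # Imported here so that ocr_frames can be used without these packages.
    import pytesseract
    from PIL import Image, ImageSequence

    with Image.open(path) as image:
        # ImageSequence.Iterator yields the image positioned at each frame.
        return ocr_frames(
            ImageSequence.Iterator(image),
            lambda frame: pytesseract.image_to_string(frame, lang=lang),
        )


# Example call (requires Pillow, pytesseract, and Tesseract):
# text = ocr_multipage_tiff("ocr-corpus/1998_Location.tiff")
```

Comparing the length of this output with the single-call result is a quick way to see whether pages were being skipped.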

### Batch processing

Loop over the files in a directory and OCR every .tiff file:

In [None]:
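import os

import pytesseract
from PIL import Image


def ocr_tiff_file(filepath, lang="eng"):
    """OCR one TIFF file and save the text next to it as a .txt file."""
    with Image.open(filepath) as text_as_image:
        text_as_string = pytesseract.image_to_string(text_as_image, lang=lang)
    # Swap only the file extension, leaving the rest of the path untouched
    new_filepath = os.path.splitext(filepath)[0] + ".txt"
    with open(new_filepath, "w", encoding="utf-8") as f:
        f.write(text_as_string)
    print(f"Converted {filepath} to {new_filepath}")


# OCR every .tiff file in the corpus directory
corpus_dir = "ocr-corpus"
for filename in os.listdir(corpus_dir):
    if filename.endswith(".tiff"):
        ocr_tiff_file(os.path.join(corpus_dir, filename))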
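For larger corpora it can help to skip files that have already been processed and to accept both `.tif` and `.tiff` extensions. The following is one possible variation on the loop above, not a replacement for it; the helper names are made up, and the `ocr` argument is assumed to be any function that takes a path and returns text.

```python
from pathlib import Path

TIFF_SUFFIXES = {".tif", ".tiff"}


def find_tiffs(directory):
    """Return the TIFF files in a directory, sorted, ignoring extension case."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in TIFF_SUFFIXES
    )


def pending_jobs(directory, overwrite=False):
    """Pair each TIFF with its .txt output, skipping outputs that already exist."""
    jobs = []
    for tiff_path in find_tiffs(directory):
        txt_path = tiff_path.with_suffix(".txt")
        if overwrite or not txt_path.exists():
            jobs.append((tiff_path, txt_path))
    return jobs


def ocr_pending(directory, ocr, overwrite=False):
    """Run the given OCR function on every pending TIFF and save the text."""
    written = []
    for tiff_path, txt_path in pending_jobs(directory, overwrite):
        txt_path.write_text(ocr(tiff_path), encoding="utf-8")
        written.append(txt_path)
    return written


# Example call (requires pytesseract and Tesseract):
# import pytesseract
# ocr_pending("ocr-corpus", lambda p: pytesseract.image_to_string(str(p), lang="eng"))
```

Passing the OCR step in as a function keeps the file handling separate from Tesseract, so the loop can be tried out on a folder before any OCR is run.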