<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Hannah Jacobs](http://hannahlangstonjacobs.com/) for the [2021 Text Analysis Pedagogy Institute](https://nkelber.github.io/tapi2021/book/intro.html).

Adapted by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
____

# A Gentle Introduction to Optical Character Recognition with PyTesseract

These [notebooks](https://docs.constellate.org/key-terms/#jupyter-notebook) describe how to turn images and/or pdf documents into plain text using Tesseract [optical character recognition](https://docs.constellate.org/key-terms/#ocr).

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion time:** 90 minutes

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:**

**Data Format:** 
* image files (.jpg, .png)
* document files (.pdf)
* plain text (.txt)

**Libraries Used:**
* [Tesseract](https://tesseract-ocr.github.io/) for performing [optical character recognition](https://docs.constellate.org/key-terms/#ocr).

**Learning Objectives:**
By the end of this lesson, students will be able to
1. Define "OCR"
2. Explain the importance of OCR for computer-aided reading and analysis
3. Perform basic OCR operations using Python, Tesseract, and Jupyter Notebooks

**Research Pipeline:**

1. Convert images/pdfs to text using OCR (this process)
2. Tokenize your texts
3. Perform analysis

## What is OCR? Why is it important?

In order to do text analysis (or [natural language processing](https://docs.constellate.org/key-terms/#nlp)), we need to have our text in a machine-readable format such as plaintext. In practice, this usually means converting an image file (e.g. a file ending in .png or .jpg) into a plaintext file (.txt). Text is machine-readable if you are able to select, copy, and paste its individual characters.

The difference can be illustrated by a digital image (.png) of the print edition of Dr. Faust.

![An image of the German print edition of Dr. Faust](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/faust.png)

and the [text version found on Project Gutenberg](https://www.gutenberg.org/files/2229/2229-0.txt). While a human can read the text of the digital image, a computer is not able to manipulate the individual characters of the text. The text in the image cannot be easily copied and pasted for manipulation in other applications.

![Image of the text "Blackwell's" showing the pixels.](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/blackwell-pixels.jpeg)

While we might see this as the word `Blackwell's`, the computer understands the above as a series of squares, **pixels**, containing information about which color the pixel should be--*not* which character to display. If we want the computer to be able to work with this text *as* text, we need to convert the image above into this:

`01000010 01101100 01100001 01100011 01101011 01110111 01100101 01101100 01101100 00100111 01110011`

...which the computer will then display for human readers as `Blackwell's`. We can then use our computers to search for instances of this word and analyze its frequency, patterns in occurrence, collocation, and so on. We can also ask the computer to read this and any other words in the page aloud if we need to hear them instead of viewing them on a screen.
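
As a small illustration of this encoding, the sketch below converts the word into its 8-bit binary codes and back again. It assumes plain ASCII characters, and the function names `to_binary` and `from_binary` are made up for this example.

```python
def to_binary(text):
    """Return the 8-bit binary code of each ASCII character, separated by spaces."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


def from_binary(bits):
    """Turn a space-separated string of 8-bit codes back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")


encoded = to_binary("Blackwell's")
print(encoded)               # the binary codes, one per character
print(from_binary(encoded))  # the same codes displayed as text again
```

The first line printed should match the binary string shown above, one group of eight digits per character.
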
___

## What is OCR?

OCR, or "Optical Character Recognition," is **a computational process that converts digital images of text into computer-readable text**. OCR is both a noun and a verb.

More specifically:

>**OCR software attempts to replicate the combined functions of the human eye and brain, which is why it is referred to as artificial intelligence software.** A human can quickly and easily recognise text of varying fonts and of various print qualities on a newspaper page, and will apply their language and cognitive abilities to correctly translate this text into meaningful words. Humans can recognise, translate and interpret the text on a newspaper page very rapidly, even text on an old poor quality newspaper page from the 1800s. We can quickly scan layout, sections and headings, and read the text of articles in the right order (which is much more difficult than reading the page of a book). **OCR software can now do all these things too, but not to the same level of perfection as a human can.** - ([Holley, "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs"](http://www.dlib.org/dlib/march09/holley/03holley.html)).

    
>"Optical character recognition (OCR) software is **a type of artificial intelligence software designed to mimic the functions of the human eye and brain and discern which marks within an image represent letterforms or other markers of written language.** OCR scans an image for semantically-meaningful material and transcribes what language it finds into text data." - [Cordell, "Why You (A Humanist) Should Care About Optical Character Recognition"](https://ryancordell.org/research/why-ocr/).
    

## OCR Tools

If you have [Adobe Acrobat](https://acrobat.adobe.com/us/en/) on your computer, then you have probably already been using software that contains OCR functionality. Acrobat's OCR is designed to help users [edit scanned PDFs or PDFs created by others](https://helpx.adobe.com/acrobat/using/edit-scanned-pdfs.html). It can also be used to export editable text versions (e.g. Microsoft Word documents), or to ask the computer to read aloud the text contained in the PDF. However, *at scale* and working with *older printed documents, perhaps with irregular printing patterns*, Acrobat may not give you the best results.

### Questions when considering an OCR tool

* [Proprietary or open source?](#proprietary-or-open)
* [GUI (graphical user interface) or script-based?](#gui-or-script)
* [File types supported?](#file-types)
* [Languages supported?](#languages)
* [Which printed scripts can it read?](#print-scripts)
* [Preprocessing features?](#preprocessing)
* [Accuracy and error assessment?](#accuracy)

There may be other questions you'll need to add to this list, but it will get you started. Likewise, you may wish to reorder these questions based on your project's priorities.

#### Proprietary or open source? <a id="proprietary-or-open"></a>

Is the tool proprietary, meaning that you need to purchase a license to use it? Knowing the resources you have or need to start your OCR project is key to how you make your decision. You may wish to work with a program such as [ABBYY FineReader](https://pdf.abbyy.com/pricing/), which includes a number of graphical features for preprocessing that you'd like to use. But you'll need to be prepared to pay $200-300 for it. If you don't have those funds, you may wish to work with a free tool.

Although *free* software is not necessarily the same as *open source* software, [open source](https://en.wikipedia.org/wiki/Open-source_model) software is generally free to use. **Open source, in the software world, refers to software whose creators have made the underlying code available for others to edit and build upon.** You may opt to choose an open source OCR tool so that you have more access to the codebase, and therefore a better understanding of the computation that goes into performing OCR on your corpora.

#### GUI (graphical user interface) or script-based?<a id="gui-or-script"></a>

If you are working on a project alone with no coding experience, you may be thinking that a GUI that provides the ease of clicking a button is the best way to go--and it may be if you have a small set of documents with modern typefaces. 

On the other hand, you may wish to learn some coding if there are a significant number of documents and/or those documents contain unusual features (typefaces, language, text layouts, etc.). If so, learning how to run OCR with Python is a great opportunity. Even if you're collaborating with a programmer who will write most of your OCR code, you may want to learn some of the concepts and basic steps behind the OCR to ensure you have a good understanding of this project phase and to aid communications with your collaborator.

#### File types supported?<a id="file-types"></a>

Does the OCR tool work only with PDFs, or can it also read image files? Which file type(s) are you working with? This may seem a small point, but if you have image files, and you purchase a license for OCR software that works only with PDFs, you may be a bit surprised. There are tools out there that can help you convert images to PDFs, but you may risk degrading the scanned text with these conversions.

#### Languages supported?<a id="languages"></a>

If you are working with texts that are not in English, it's a good idea to check whether the tool supports their language(s). At this point, most OCR tools work with multiple languages.

#### Which printed scripts can it read?<a id="print-scripts"></a>

If you're working with a language written in a script no longer commonly in use, you may need to seek out some specific tools to assist you. Even if you're working with [late-nineteenth- and early-twentieth-century American non-English newspapers](https://chroniclingamerica.loc.gov/lccn/sn93060356/1917-01-18/ed-1/seq-1/#date1=1880&index=11&date2=1917&searchType=advanced&language=&sequence=0&words=son+sonille&proxdistance=5&state=Missouri&rows=20&ortext=son&proxtext=&phrasetext=&andtext=&dateFilterType=yearRange&page=1), you may need to find out which tools handle specific scripts.

#### Preprocessing features?<a id="preprocessing"></a>

**Preprocessing is a set of steps that we can use to try to minimize issues such as a skewed page, faded text, or smudges on a page *before* performing OCR.** Some OCR tools offer some preprocessing tools. Others don't. Even if a tool can run preprocessing, though, you may find you have a specific need that must be met with another tool. 
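
As one hedged illustration of preprocessing, the sketch below uses the Pillow library to convert a page to grayscale, stretch its contrast (which can help with faded text), and reduce it to pure black and white. The function name `preprocess` and the threshold of 128 are assumptions made for this example, not recommended settings, and the "page" is a tiny synthetic image, not a real scan. Real projects often need other steps, such as deskewing, that are not shown here.

```python
from PIL import Image, ImageOps


def preprocess(image, threshold=128):
    """Return a black-and-white copy of a page image.

    `threshold` is an illustrative default, not a recommended value.
    """
    gray = ImageOps.grayscale(image)          # drop color information
    stretched = ImageOps.autocontrast(gray)   # spread faded tones across the full range
    # Binarize: pixels brighter than the threshold become white, the rest black.
    return stretched.point(lambda value: 255 if value > threshold else 0)


# A tiny synthetic "page": light gray background with one darker gray row of "ink".
page = Image.new("RGB", (4, 4), (200, 200, 200))
for x in range(4):
    page.putpixel((x, 1), (90, 90, 90))

clean = preprocess(page)
print(clean.getpixel((0, 0)), clean.getpixel((0, 1)))  # background pixel, ink pixel
```
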

#### Accuracy and error assessment?<a id="accuracy"></a>

Can the tool help you evaluate how well the process has gone and where there may be errors to correct? Are there tools to support both automated and manual error correction? How will you know if the OCRed corpus you've produced is of a high enough quality?
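
One common way to put a number on OCR quality is the character error rate: the number of single-character edits needed to turn the OCR output into a hand-checked transcription, divided by the length of that transcription. The sketch below is a minimal, unoptimized version of that idea. The function names and the sample strings are made up for illustration, and a hand-checked reference text is assumed to exist.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: fewest single-character insertions, deletions, or substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the hand-checked reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the OCR output misreads "ll" as "1l" and drops the apostrophe.
truth = "Blackwell's"
ocr_output = "Blackwe1ls"
print(character_error_rate(truth, ocr_output))
```

A lower rate means fewer corrections are needed. What counts as "good enough" depends on the project and the kind of analysis planned.
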

### Popular OCR Tools

*This list is not comprehensive!*

#### [ABBYY FineReader](https://pdf.abbyy.com/)
Perhaps at the opposite end of the OCR spectrum from Pytesseract, ABBYY FineReader is another powerful OCR tool. It has a GUI (graphical user interface) in which users can make adjustments (preprocessing), and it also has an SDK (software development kit) that programmers can use to run ABBYY tools in their own programs. ABBYY even has a cloud service. Like Tesseract, ABBYY supports many languages and a number of file formats. ABBYY is, however, proprietary--you'll need to be prepared to pay a minimum of $200 if your institution does not provide a license.

#### [Adobe Acrobat](https://acrobat.adobe.com/us/en/acrobat.html)
A common PDF reader, Acrobat can do a lot of things including OCR. It comes in DC and Pro DC versions, and both are paid. DC includes OCR functionality in the "Enhance PDF" menu.

#### [Amazon Textract](https://aws.amazon.com/textract/resources/?blog-posts-cards.sort-by=item.additionalFields.createdDate&blog-posts-cards.sort-order=desc)
Like Pytesseract, this tool from Amazon runs in Python. Like ABBYY FineReader, it's proprietary code, which means we don't know what's happening in Textract itself when we use it--it's a black box. There is a free tier to get started if you're working with fewer than 1,000 pages, and you can run your Textract code in Amazon's cloud environment. The cost to use it, if you are planning to learn a little programming or are working with a programmer, is significantly lower than the cost of an ABBYY license.

#### [Google Cloud Vision](https://cloud.google.com/vision/docs)
A competitor of Amazon's, Google's Cloud Vision API (application programming interface) is likewise proprietary, is free only up to a certain number of uses, requires programming knowledge, and can be used in the cloud. This same tool can be used to perform computer vision tasks such as facial recognition. Because we don't know what's happening in Cloud Vision's code when we use it, we might not be able to explain unexpected results--it's another [black box](01-AlgorithmsOfResistance-WhatIsAnAlgorithm.ipynb#algorithms).

#### [Tesseract](https://tesseract-ocr.github.io/tessdoc/Home.html)
An OCR engine (basically, a collection of algorithms and training data) originally developed by Hewlett-Packard and later maintained by Google. Tesseract is open source and supports many languages and scripts. It also offers possibilities to customize OCR outputs in ways that may or may not be possible with proprietary software. The ability to add your own training data is also a big feature, though training is a resource-intensive process. Programmers have taken advantage of Tesseract being open source and have created [a number of tools based on Tesseract](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty) (some with GUIs).

#### [Pytesseract](https://pypi.org/project/pytesseract/)
Pytesseract (or Python-tesseract) is a powerful OCR tool made for the programming language Python using the Tesseract OCR Engine. It can work with many file formats and (human) languages, and, like [Tesseract](https://github.com/tesseract-ocr/tesseract), is open source. Since Pytesseract is used in a larger programming ecosystem, it can be combined with a variety of other Python packages to perform many different tasks. Furthermore, Python is both widely used and popular with beginning programmers, making it possible for users to move quickly from the basics of Python into working with Pytesseract.

## Introduction to Tesseract

[Tesseract](https://github.com/tesseract-ocr/tesseract) was initially developed by Hewlett-Packard between 1985 and 1994. HP made it open source in 2005. [Google developed it](https://opensource.google/projects/tesseract) between 2006 and 2018. It is still open source and is maintained by Zdenko Podobny. There is an [active user forum](https://groups.google.com/g/tesseract-ocr).

Tesseract supports over [100 languages](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) and can be [run in the command line](https://github.com/tesseract-ocr/tesseract#running-tesseract) on Windows, MacOS, and Linux. Its outputs can be stored in several interoperable file formats. There are a number of [third party GUIs available](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html).
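
For readers curious what a command-line run looks like, the basic pattern is `tesseract IMAGE OUTPUTBASE -l LANG`, which writes the recognized text to `OUTPUTBASE.txt`. The sketch below builds that command in Python and only runs it if Tesseract is installed and the image exists. The file names and the helper `build_tesseract_command` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG"""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


# Hypothetical file names, used only for illustration.
command = build_tesseract_command("page_0001.jpg", "page_0001")
print(" ".join(command))

# Only run the command when Tesseract is installed and the image actually exists.
if shutil.which("tesseract") and Path("page_0001.jpg").exists():
    subprocess.run(command, check=True)  # Tesseract writes page_0001.txt
```
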

Versions 4.x and later of Tesseract incorporate [LSTM (Long Short-Term Memory)](https://en.wikipedia.org/wiki/Long_short-term_memory), a type of artificial recurrent neural network. LSTM is a set of algorithms that computers can run to process lots of data, "remember" that data, and apply what they "learn" from that data to other data as they process it.

Because Tesseract is free and open source, it's [possible to retrain Tesseract in order to OCR a specific corpus](https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html). This requires a large and specific dataset, some expertise, and some time. But it's a key feature that you won't get from proprietary or closed-source software.

[PyTesseract](https://pypi.org/project/pytesseract/) is a "wrapper" -- basically it makes Tesseract legible to Python so that it can be incorporated into various Python environments and functionalities. This means that if you're already working in Python, you don't need to leave your environment to build a dataset. You could also build PyTesseract into a Python application and/or into a code base that you plan to reuse. It has been [developed and maintained](https://github.com/madmaze/pytesseract) since 2014 by a group of programmers led by Matthias Lee.

### Input Files

In order to perform OCR on a text corpus, we need the following:

- A **single file folder** containing all of the corpus files. If the corpus is small enough (e.g. 1 book), this could be simply a single file (e.g. a .pdf).
- All corpus files should be of the **same file format**.
- The chosen file format should be **interoperable** (usable by many software programs and operating systems) and stable (changes rarely if ever).

- For our work with Python and Tesseract, the files should be **images**, which means that each file will correspond to 1 single-sided page (recto or verso, assuming a book format).

![First page of the 1955 North Carolina Session Laws](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sessionlaw-example.jpeg)

**To keep image files organized,** it is helpful to create a file structure where every book is within a unique folder. Each book's folder then contains a series of numbered images for each page.

![Screenshot of a file structure for image files to be OCR'ed.](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/folder-structure.jpeg)

Note that the file naming structure identifies *both* which volume the images are part of *and* which scanned page they correspond to, which helps us maintain the order of the volume. These numbers *may not* correspond to page numbers because book scanning usually includes the outer and inner covers, title pages, and other book pages that are not usually numbered.

Note that we are working with .jpg files here. The process we'll be using, though, can also be run with .png, .tiff, .jp2, and other common interoperable image formats.
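
Because page order depends on the numbers in the file names, it helps to sort files by scan number, not alphabetically (alphabetical order puts `10` before `2` when numbers are not zero-padded). The sketch below is one way to do that. The file names and the helpers `page_number` and `list_pages` are hypothetical, and the code assumes the scan number is the last run of digits in each file name.

```python
import re
from pathlib import Path


def page_number(path):
    """Return the last run of digits in a file name as an integer (-1 if there is none)."""
    digits = re.findall(r"\d+", Path(path).stem)
    return int(digits[-1]) if digits else -1


def list_pages(folder, extensions=(".jpg", ".jpeg", ".png", ".tif", ".tiff", ".jp2")):
    """List the image files in one volume's folder, ordered by scan number."""
    files = [p for p in Path(folder).iterdir() if p.suffix.lower() in extensions]
    return sorted(files, key=page_number)


# Hypothetical file names with unpadded numbers.
names = ["volume1_10.jpg", "volume1_2.jpg", "volume1_1.jpg"]
print(sorted(names))                   # alphabetical order
print(sorted(names, key=page_number))  # scan-number order
```
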

### Output Files

For each folder of files (whether .jpg, .png, or .pdf), we will create a single plaintext file (.txt) that contains the full text. The plain text file format is interoperable, stable, and fully computer readable, meaning it will be ready for performing computational analysis and for storing in repositories and databases.
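
The folder-to-text step can be sketched as follows. This is a minimal outline, not the lesson's final code: the helper `folder_to_text` is hypothetical, it assumes zero-padded .jpg file names so that alphabetical order matches page order, and it takes the OCR step as a function argument so the example can run with a stand-in in place of a real OCR call.

```python
import tempfile
from pathlib import Path


def folder_to_text(folder, ocr_page, output_path):
    """OCR every .jpg in `folder` (in file-name order) and save the result as one .txt file."""
    pages = sorted(Path(folder).glob("*.jpg"))
    text = "\n".join(ocr_page(page) for page in pages)
    Path(output_path).write_text(text, encoding="utf-8")
    return len(pages)


# Demonstration with empty placeholder files and a stand-in for the OCR step.
with tempfile.TemporaryDirectory() as tmp:
    volume = Path(tmp) / "volume1"
    volume.mkdir()
    for name in ("page_0002.jpg", "page_0001.jpg"):
        (volume / name).touch()  # placeholders, not real scans

    def fake_ocr(page):
        return f"[text of {page.name}]"

    count = folder_to_text(volume, fake_ocr, Path(tmp) / "volume1.txt")
    result = (Path(tmp) / "volume1.txt").read_text(encoding="utf-8")

print(count)
print(result)
```

With PyTesseract installed, the stand-in could be replaced by a function that returns `pytesseract.image_to_string(Image.open(page), lang="eng")` for each page.
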

## PyTesseract Basics

Here we will describe the basic process of OCRing using PyTesseract. The first step is to install Tesseract on your machine using the command line.

In [1]:
# Install Tesseract in the Constellate Analytics Lab.
# The exclamation mark runs the command as a terminal command.
# This may take 1-2 minutes.
!conda install -c conda-forge -y tesseract


We will also install `Pillow`, a library for opening and processing images, `pytesseract`, and the English Tesseract training data.

In [2]:
# Install additional libraries.
# Pillow opens image files; Pytesseract is a Python wrapper for Tesseract.
!pip install pillow pytesseract

# Install Tesseract training data in the Constellate Analytics Lab.
# The exclamation mark runs the command as a terminal command.

!wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
!mv eng.traineddata /srv/conda/envs/notebook/share/tessdata/eng.traineddata


We will convert a [sample .jpg image](./data/ocr_sample.jpeg) to text. The sample comes from the Session Laws of the State of North Carolina. The material was OCRed for the [NEH-funded](https://www.neh.gov/), [Collections as Data](https://collectionsasdata.github.io/) project [On the Books: Jim Crow and Algorithms of Resistance](https://onthebooks.lib.unc.edu/).

In [None]:
# Import the Image module from the Pillow Library, which will help us access the image.
from PIL import Image

# Import the pytesseract library, which will run the OCR process.
import pytesseract

# Open a specific image file, convert the text in the image to computer-readable text (OCR),
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("./data/ocr_sample.jpeg"), lang="eng"))