<img align="left" src="../All-sample-files/CC_BY.png"><br />

Created by [Hannah Jacobs](http://hannahlangstonjacobs.com/) for the [2021 Text Analysis Pedagogy Institute](https://nkelber.github.io/tapi2021/book/intro.html).

Adapted by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
____

# Optical Character Recognition Basics

These notebooks describe how to turn images and/or pdf documents into plain text using Tesseract optical character recognition.

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion time:** 90 minutes

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](../Python-basics/python-basics-1.ipynb))

**Knowledge Recommended:**
* [Python Intermediate 2](../Python-intermediate/python-intermediate-2.ipynb)

**Data Format:** 
* image files (.jpg, .png)
* document files (.pdf)
* plain text (.txt)

**Libraries Used:**
* [Tesseract](https://tesseract-ocr.github.io/) for performing optical character recognition.

**Learning Objectives:**
By the end of this lesson, students will be able to
1. Define "OCR"
2. Explain the importance of OCR for computer-aided reading and analysis
3. Perform basic OCR operations using Python, Tesseract, and Jupyter Notebooks

**Research Pipeline:**

1. Digitize documents
2. **Optical Character Recognition**
3. Tokenize your texts
4. Perform analysis

## What is OCR? Why is it important?

In order to do text analysis (or natural language processing), we need to have our text in a machine-readable format such as plaintext. In practice, this usually means converting an image file (e.g. a file ending in .png or .jpg) into a plaintext file (.txt). Text is machine-readable if you are able to select, copy, and paste its individual characters.

The difference can be illustrated by a digital image (.png) of the print edition of Dr. Faust.

![An image of the German print edition of Dr. Faust](../All-sample-files/faust.png)

and the [text version found on Project Gutenberg](https://www.gutenberg.org/files/2229/2229-0.txt). While a human can read the text in the digital image, a computer is not able to manipulate its individual characters. The text in the image cannot be easily copied and pasted for manipulation in other applications.

![Image of the text "Blackwell's" showing the pixels.](../All-sample-files/blackwell-pixels.jpeg)

While we might see this as the word `Blackwell's`, the computer understands the above as a series of squares, **pixels**, containing information about which color the pixel should be, *not* which character to display. If we want the computer to be able to work with this text *as* text, we need to convert the image above into this:

`01000010 01101100 01100001 01100011 01101011 01110111 01100101 01101100 01101100 00100111 01110011`

...which the computer will then display for human readers as `Blackwell's`. We can then use our computers to search for instances of this word, analyze its frequency, patterns in occurrence, collocation, and so on. We can also ask the computer to read this and any other words in the page aloud if we need to hear them instead of viewing them on a screen.
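
The conversion between characters and the binary digits shown above can be reproduced in a few lines of Python. This is a minimal sketch that assumes plain ASCII encoding (one byte per character); the variable names are illustrative only.

```python
# Encode the word as ASCII bytes, then show each byte as 8 binary digits.
word = "Blackwell's"
bits = " ".join(format(byte, "08b") for byte in word.encode("ascii"))
print(bits)

# Reverse the process: turn each group of 8 bits back into a character.
decoded = bytes(int(group, 2) for group in bits.split()).decode("ascii")
print(decoded)
```

The first `print` should show the same eleven groups of digits as above, one group per character, and the second should show the original word again.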
___

## What is OCR?

OCR, or "Optical Character Recognition," is **a computational process that converts digital images of text into computer-readable text**. OCR is both a noun and a verb.

More specifically:

>**OCR software attempts to replicate the combined functions of the human eye and brain, which is why it is referred to as artificial intelligence software.** A human can quickly and easily recognise text of varying fonts and of various print qualities on a newspaper page, and will apply their language and cognitive abilities to correctly translate this text into meaningful words. Humans can recognise, translate and interpret the text on a newspaper page very rapidly, even text on an old poor quality newspaper page from the 1800s. We can quickly scan layout, sections and headings, and read the text of articles in the right order (which is much more difficult than reading the page of a book). **OCR software can now do all these things too, but not to the same level of perfection as a human can.** - ([Holley, "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs"](http://www.dlib.org/dlib/march09/holley/03holley.html)).

    
>"Optical character recognition (OCR) software is **a type of artificial intelligence software designed to mimic the functions of the human eye and brain and discern which marks within an image represent letterforms or other markers of written language.** OCR scans an image for semantically-meaningful material and transcribes what language it finds into text data." - [Cordell, "Why You (A Humanist) Should Care About Optical Character Recognition"](https://ryancordell.org/research/why-ocr/).
    

## OCR Tools

If you have [Adobe Acrobat](https://acrobat.adobe.com/us/en/) on your computer, then you have probably already been using software that contains OCR functionality. Acrobat's OCR is designed to help users [edit scanned PDFs or PDFs created by others](https://helpx.adobe.com/acrobat/using/edit-scanned-pdfs.html). It can also be used to export editable text versions (e.g. Microsoft Word documents), or to ask the computer to read aloud the text contained in the PDF. However, *at scale* and working with *older printed documents, perhaps with irregular printing patterns*, Acrobat may not give you the best results.

### Questions when considering an OCR tool

* [Proprietary or open source?](#proprietary-or-open)
* [GUI (graphical user interface) or script-based?](#gui-or-script)
* [File types supported?](#file-types)
* [Languages supported?](#languages)
* [Which printed scripts can it read?](#print-scripts)
* [Preprocessing features?](#preprocessing)
* [Accuracy and error assessment?](#accuracy)

There may be other questions you'll need to add to this list, but it will get you started. Likewise, you may wish to reorder these questions based on your project's priorities.

#### Proprietary or open source? <a id="proprietary-or-open"></a>

Is the tool proprietary, meaning that you need to purchase a license? Knowing the resources you have or need to start your OCR project is key to making your decision. You may wish to work with a program such as [ABBYY FineReader](https://pdf.abbyy.com/pricing/), which includes a number of graphical preprocessing features that you'd like to use. But you'll need to be prepared to pay $200-300 for it. If you don't have those funds, you may wish to work with a free tool.

Although *free* software is not necessarily the same as *open source* software, [open source](https://en.wikipedia.org/wiki/Open-source_model) software is usually free. **Open source, in the software world, refers to software whose creators have made the underlying code available for others to edit and build upon.** You may opt to choose an open source OCR tool so that you have more access to the codebase, and therefore better understanding of the computation that goes into performing OCR on your corpora.

#### GUI (graphical user interface), local script, or API?<a id="gui-or-script"></a>

If you are working on a project alone with no coding experience, you may be thinking that a GUI that provides the ease of clicking a button is the best way to go—and it may be if you have a small set of documents with modern typefaces. 

On the other hand, you may wish to learn some coding if there are a significant number of documents and/or those documents contain unusual features (typefaces, language, text layouts, etc.). If so, learning how to run OCR with Python is a great opportunity. Even if you're collaborating with a programmer who will write most of your OCR code, you may want to learn some of the concepts and basic steps behind the OCR to ensure you have a good understanding of this project phase and to aid communications with your collaborator.

Python has two kinds of solutions: running a script locally or calling an API. A local script could use a Python library like [Tesseract](https://github.com/tesseract-ocr/tesseract?tab=readme-ov-file#tesseract-ocr), [EasyOCR](https://github.com/JaidedAI/EasyOCR), [OpenCV](https://opencv.org/), etc. You can also use Python with an API from [Google (Document AI or Cloud Vision AI)](https://cloud.google.com/document-ai/docs/enterprise-document-ocr) or [Amazon (Textract)](https://aws.amazon.com/textract/). These API services are often very accurate, but they incur a small cost for processing documents that depends on factors like number of documents, document complexity, and processing time.

#### File types supported?<a id="file-types"></a>

Does the OCR tool work only with PDFs, or can it also read image files? Which file type(s) are you working with? This may seem a small point, but if you have image files, and you purchase a license for OCR software that works only with PDFs, you may be a bit surprised. There are tools out there that can help you convert images to PDFs, but you may risk degrading the scanned text with these conversions.

#### Languages supported?<a id="languages"></a>

If you are working with texts that are not in English, it's a good idea to check. At this point, most OCR tools work with multiple languages, but not all languages have robust support.

#### Which printed scripts can it read?<a id="print-scripts"></a>

If you're working with a language written in a script no longer commonly in use, you may need to seek out some specific tools to assist you. Even if you're working with [late-nineteenth- and early-twentieth-century American non-English newspapers](https://chroniclingamerica.loc.gov/lccn/sn93060356/1917-01-18/ed-1/seq-1/#date1=1880&index=11&date2=1917&searchType=advanced&language=&sequence=0&words=son+sonille&proxdistance=5&state=Missouri&rows=20&ortext=son&proxtext=&phrasetext=&andtext=&dateFilterType=yearRange&page=1), you may need to find out which tools handle specific scripts.

#### Preprocessing features?<a id="preprocessing"></a>

**Preprocessing is a set of steps that we can use to try to minimize issues such as a skewed page, faded text, or smudges on a page *before* performing OCR.** Some OCR tools offer some preprocessing tools. Others don't. Even if a tool can run preprocessing, though, you may find you have a specific need that must be met with another tool. 
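
One common preprocessing step is binarization: setting every pixel to either pure black or pure white so that light smudges drop out and faded ink is darkened. The sketch below illustrates the idea on a tiny hand-made grid of grayscale values (0 is black, 255 is white). The `binarize` function, the sample grid, and the threshold of 128 are all assumptions chosen for illustration, not part of any OCR tool.

```python
def binarize(pixels, threshold=128):
    """Return a copy of a grayscale pixel grid with every value set to 0 or 255."""
    return [
        [0 if value < threshold else 255 for value in row]
        for row in pixels
    ]

# A made-up 3x4 "image": dark ink (low numbers), a light smudge (200),
# and paper that is not quite white (230-250).
faded_scan = [
    [250, 90, 240, 230],
    [245, 70, 200, 235],
    [240, 60, 245, 250],
]

for row in binarize(faded_scan):
    print(row)
```

In practice the same idea would normally be applied to whole image files with an imaging library such as Pillow or OpenCV rather than to hand-typed lists, and the best threshold depends on the scans.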

#### Accuracy and error assessment?<a id="accuracy"></a>

Can the tool help you evaluate how well the process has gone and where there may be errors to correct? Are there tools to support both automated and manual error correction? How will you know if the OCRed corpus you've produced is of a high enough quality?
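
One way to put a number on OCR quality is the character error rate (CER): the number of single-character edits needed to turn the OCR output into a hand-corrected transcription, divided by the length of that transcription. The sketch below is a plain-Python illustration of this measure, not a feature of any particular OCR tool; the function names and the sample strings are assumptions.

```python
def edit_distance(reference, candidate):
    """Count the single-character insertions, deletions, and substitutions
    needed to turn `candidate` into `reference` (Levenshtein distance)."""
    previous = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, cand_char in enumerate(candidate, start=1):
            cost = 0 if ref_char == cand_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the candidate
                current[j - 1] + 1,      # extra character in the candidate
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the corrected reference text."""
    if not reference:
        raise ValueError("The reference text must not be empty.")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical example: the OCR engine misread two letters as digits.
corrected = "Blackwell's"
ocr_result = "Blackwe11's"
print(f"CER: {character_error_rate(corrected, ocr_result):.2%}")
```

A lower CER means fewer corrections are needed. Measuring it requires transcribing a sample of pages by hand, so it is usually computed on a small, representative subset of the corpus.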

### Popular OCR Tools

#### [ABBYY Fine Reader](https://pdf.abbyy.com/)
Perhaps at the opposite end of the OCR spectrum from Pytesseract, ABBYY is another powerful OCR tool. It has a GUI (graphical user interface) in which users can make adjustments (preprocessing), and it also has an SDK (software development kit) that programmers can use to run ABBYY tools in their own programs. ABBYY even has a cloud service. Like Tesseract, ABBYY supports many languages and a number of file formats. ABBYY is, however, proprietary--you'll need to be prepared to pay a minimum of $200 if your institution does not provide a license.

#### [Adobe Acrobat](https://acrobat.adobe.com/us/en/acrobat.html)
A common PDF reader, Acrobat can do a lot of things including OCR. It comes in DC and Pro DC versions, and both are paid. DC includes OCR functionality in the "Enhance PDF" menu.

#### [Amazon Textract](https://aws.amazon.com/textract/resources/?blog-posts-cards.sort-by=item.additionalFields.createdDate&blog-posts-cards.sort-order=desc)
Like Pytesseract, this tool from Amazon runs in Python. Like ABBYY Fine Reader, it's proprietary code, which means we don't know what's happening in Textract itself when we use it--it's a black box. There is a free tier to get started if you're working with fewer than 1,000 pages, and you can run your Textract code in Amazon's cloud environment. The cost to use it, if you are planning to learn a little programming or are working with a programmer, is significantly lower than the cost of an ABBYY license.

#### [Google Cloud Vision](https://cloud.google.com/vision/docs)
A competitor of Amazon's, Google's Cloud Vision API (application programming interface) is likewise proprietary after a certain number of uses, requires programming knowledge, and can be used in the cloud. This same tool can be used to perform computer vision tasks such as facial recognition. Because we don't know what's happening in Cloud Vision's code when we use it, we might not be able to explain unexpected results--it's another black box.

#### [Tesseract](https://tesseract-ocr.github.io/tessdoc/Home.html)
An OCR engine (basically, a collection of algorithms and training data) originally developed by Hewlett-Packard and later developed by Google. Tesseract is open source and supports many languages and scripts. It also offers possibilities to customize OCR outputs in ways that may or may not be possible with proprietary software. The ability to add your own training data is also a big feature, though a resource-intensive process. Programmers have taken advantage of Tesseract being open source and have created [a number of tools based on Tesseract](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty) (some with GUIs).

#### [Pytesseract](https://pypi.org/project/pytesseract/)
Pytesseract (or Python-tesseract) is a powerful OCR tool made for the programming language Python using the Tesseract OCR Engine. It can work with many file formats and (human) languages, and, like [Tesseract](https://github.com/tesseract-ocr/tesseract), is open source. Since Pytesseract is used in a larger programming ecosystem, it can be combined with a variety of other Python packages to perform many different tasks. Furthermore, Python is both highly used and a popular computer language for beginning programmers, making it possible for users to move quickly from the basics of Python into working with Pytesseract.

## Introduction to Tesseract

[Tesseract](https://github.com/tesseract-ocr/tesseract) was initially developed by Hewlett-Packard between 1985 and 1994. HP made it open source in 2005. [Google developed it](https://opensource.google/projects/tesseract) between 2006 and 2018. It is still open source and is maintained by Zdenko Podobny. There is an [active user forum](https://groups.google.com/g/tesseract-ocr).

Tesseract supports over [100 languages](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) and can be [run in the command line](https://github.com/tesseract-ocr/tesseract#running-tesseract) on Windows, MacOS, and Linux. Its outputs can be stored in several interoperable file formats. There are a number of [third party GUIs available](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html).

Tesseract 4.0 and later versions, including 5.0, incorporate [LSTM (Long Short-Term Memory)](https://en.wikipedia.org/wiki/Long_short-term_memory), a type of artificial recurrent neural network. An LSTM is a set of algorithms that a computer can run to process lots of data, "remember" that data, and apply what it "learns" from that data to other data as it's processing.

Because Tesseract is free and open source, it's [possible to retrain Tesseract in order to OCR a specific corpus](https://tesseract-ocr.github.io/tessdoc/). This requires a large and specific dataset, some expertise, and some time. But it's a key feature that you won't get from proprietary or closed-source software.

[PyTesseract](https://pypi.org/project/pytesseract/) is a "wrapper" -- basically it makes Tesseract legible to Python so that it can be incorporated into various Python environments and functionalities. This means that if you're already working in Python, you don't need to leave your environment to build a dataset. You could also build PyTesseract into a Python application and/or into a code base that you plan to reuse. It has been [developed and maintained](https://github.com/madmaze/pytesseract) since 2014 by a group of programmers led by Matthias Lee.

### Input Files

In order to perform OCR on a text corpus, we need the following:

- A **single file folder** containing all of the corpus files. If the corpus is small enough (e.g. 1 book), this could be simply a single file (e.g. a .pdf).
- All corpus files should be of the **same file format**.
- The chosen file format should be **interoperable** (usable by many software and operating systems) and stable (changes rarely if ever).

- For our work with Python and Tesseract, the files should be **images**, which means that each file will correspond to 1 single-sided page (recto or verso, assuming a book format).
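
A quick way to check the "same file format" requirement is to count the file extensions in a corpus folder before running OCR. This is a minimal sketch using only Python's standard library; the function name `count_file_formats` and the folder path `data/my-corpus` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_file_formats(folder):
    """Count the files in `folder` by lowercase extension, e.g. {'.jpg': 120}."""
    return Counter(
        path.suffix.lower()
        for path in Path(folder).iterdir()
        if path.is_file()
    )


# Hypothetical corpus location; replace it with your own folder.
corpus_folder = Path("data/my-corpus")
if corpus_folder.is_dir():
    formats = count_file_formats(corpus_folder)
    if len(formats) > 1:
        print("Mixed file formats found:", dict(formats))
    else:
        print("All files share one format:", dict(formats))
```

Extensions are lowercased so that `.JPG` and `.jpg` are counted as the same format.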

![First page of the 1955 North Carolina Session Laws](../All-sample-files/sessionlaw-example.jpeg)

**To keep image files organized,** it is helpful to create a consistent structure. Here is an example from the On the Books project. Every book is a unique folder in the file structure. Each book's folder contains a series of numbered images for each page.

![Screenshot of a file structure for image files to be OCR'ed.](../All-sample-files/folder-structure.jpeg)

Note that the file naming structure identifies *both* which volume the images are part of *and* which scanned page they correspond to, which helps maintain the order of the volume. These numbers *may not* correspond to page numbers because bookscanning usually includes the outer and inner covers, title pages, and other book pages that are not usually numbered.
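
Numbered filenames only keep a volume in order if the numbers are zero-padded, because file listings are usually sorted alphabetically. The sketch below shows the difference; the `volume_page` naming pattern and the `page_filename` helper are assumptions for illustration and are not necessarily the scheme used by On the Books.

```python
# Without zero-padding, alphabetical order puts page 10 before page 2.
unpadded = ["1955_2.jpg", "1955_10.jpg", "1955_1.jpg"]
print(sorted(unpadded))


def page_filename(volume, page_number, extension=".jpg", width=4):
    """Build a filename such as '1955_0002.jpg' with a zero-padded page number."""
    return f"{volume}_{page_number:0{width}d}{extension}"


# With zero-padding, alphabetical order matches the scanning order.
padded = [page_filename("1955", number) for number in (2, 10, 1)]
print(sorted(padded))
```

A width of 4 digits leaves room for volumes of up to 9,999 scanned images; choose a width that fits the longest volume in the corpus.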

### Output Files

For each folder of files (whether .jpg, .png, or .pdf), we will create a single plaintext file (.txt) that contains the full text. The plain text file format is interoperable, stable, and fully computer readable, meaning it will be ready for performing computational analysis and for storing in repositories and databases.
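
The folder-in, text-file-out workflow described above might be sketched as follows. The function names `ocr_with_tesseract` and `ocr_folder_to_text` are hypothetical, and the sketch assumes every page is a `.jpg` whose filename sorts into reading order. The OCR step is passed in as a parameter so that the file-handling logic can be tried out separately from Tesseract itself.

```python
from pathlib import Path


def ocr_with_tesseract(image_path, lang="eng"):
    """OCR one image file with PyTesseract and return its text."""
    # Imported here so that the helper below can also be tried out
    # with a stand-in OCR function on a machine without Tesseract.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def ocr_folder_to_text(folder, ocr_page=ocr_with_tesseract, pattern="*.jpg"):
    """OCR every matching image in `folder`, in filename order, and write
    the combined text to a .txt file named after the folder."""
    folder = Path(folder)
    page_paths = sorted(folder.glob(pattern))
    page_texts = [ocr_page(path) for path in page_paths]

    # For a folder called "volume1", create "volume1.txt" beside it.
    output_path = folder.parent / (folder.name + ".txt")
    output_path.write_text("\n".join(page_texts), encoding="utf-8")
    return output_path
```

Calling `ocr_folder_to_text("path/to/one/book")` with a real folder of page images would then produce one text file for that book. Large volumes can take several minutes, since each page is OCR'ed in turn.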

## PyTesseract Basics

Here we will describe the basic process of OCRing using PyTesseract. PyTesseract is installed by default on Constellate. Let's download two sample `.jpg` examples.

We will convert a sample `.jpg` image to text. The sample comes from the Session Laws of the State of North Carolina. The material was OCRed for the [NEH-funded](https://www.neh.gov/), [Collections as Data](https://collectionsasdata.github.io/) project [On the Books: Jim Crow and Algorithms of Resistance](https://onthebooks.lib.unc.edu/).

In [None]:
# Import the Image module from the Pillow Library, which will help us access the image.
from PIL import Image

# Import the pytesseract library, which will run the OCR process.
import pytesseract

# Open a specific image file, convert the text in the image to computer-readable text (OCR),
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("../All-sample-files/ocr_sample.jpg"), lang="eng"))

Let's break down the above code, from the inside out:

1. `Image.open("../All-sample-files/ocr_sample.jpg")` - Open the image file `ocr_sample.jpg`. The `Image.open()` function comes from [Pillow](https://pillow.readthedocs.io/en/stable/index.html), the maintained fork of the Python Imaging Library (PIL).

2. `pytesseract.image_to_string(..., lang="eng")` - Using PyTesseract's `image_to_string` function, detect alphanumeric characters in the image and convert them into computer-readable text. The `lang="eng"` argument sets the language to English.


3. `print()` - Display the computer-readable text output.

## Tesseract Options 

Tesseract offers a number of different modes, or settings, that we can use to customize output. There are two types of modes: OEMs (OCR Engine Modes), which specify which OCR tools are available to Tesseract to use, and PSMs (Page Segmentation Modes), which specify how the OCR tools should read the image files--how to separate and order sections of text in the image file.

### OCR Engine Modes (OEMs)

Run the following command to view the list of OEMs.

In [None]:
# List the Tesseract OCR Engine Modes (OEMs)
# Run a terminal command using an exclamation point
!tesseract --help-oem

Here's more of an explanation of OCR Engine Modes (OEMs):

- *0 - Original Tesseract only.* - This mode runs only the original Tesseract engine (labeled "Legacy" in recent releases).
  
- *1 - Cube only.* - In older Tesseract releases, this mode ran only Cube, [according to Google](https://code.google.com/archive/p/tesseract-ocr-extradocs/wikis/Cube.wiki), "an alternative recognition mode for Tesseract. It is slower than the original recognition engine, but often produces better results." Cube was removed in Tesseract 4, where mode 1 instead runs only the LSTM neural network engine, as [a Nanonets tutorial explains](https://nanonets.com/blog/ocr-with-tesseract/). The labels printed by the command above depend on the installed version.
  
- *2 - Tesseract + Cube.* - Both the original ("Legacy") engine and the second engine (Cube in older releases, LSTM in Tesseract 4 and later) are used.

- *3 - Default, based on what is available.* - Tesseract will choose an engine based on what is available for the language data in use. If we don't specify an OEM, Tesseract will run in OEM 3.

___
<h3 style="color:red; display:inline">Try it! &lt; / &gt; </h3>

Run the following script, trying each of the different OEMs in turn. Replace the number in the `custom_oem_config` line to change the OEM.
___

In [None]:
# Change the OEM number below to try
# running another OCR mode.
# 3 is the default setting.
custom_oem_config = r'--oem 3'

# Open a specific image file, convert the text in the image to computer-readable text (OCR)
# following the language and mode configuration we specify,
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("../All-sample-files/ocr_sample.jpg"), lang="eng", config=custom_oem_config))

### Page Segmentation Modes (PSM)

Run the following command to view all of the PSMs:

In [None]:
!tesseract --help-psm

By default (option 3), PyTesseract will not try to detect the script or orientation of the text. It *will* try to detect page segments and divide them from one another. If you're not analyzing paged documents (or you're not interested in segmentation), then another PSM may be more appropriate. 

You can also select a PSM based on whether you'd like automatic Orientation and Script Detection (OSD). Again, depending on your image source material, an alternative PSM may be a better choice. 

Let's try another PSM.

In [None]:
# Change the PSM number below to try
# running another page segmentation mode.
# 3 is the default setting.
custom_psm_config = r'--psm 3'

# Open a specific image file, convert the text in the image to computer-readable text (OCR)
# following the language and mode configuration we specify,
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("../All-sample-files/ocr_sample.jpg"), lang="eng", config=custom_psm_config))

Many of the PSMs are meant for images that have little text in them -- such as images that include road or store signs. [See Tesseract's documentation on improving OCR quality](https://tesseract-ocr.github.io/tessdoc/ImproveQuality).

**Most of the time, the default OEM and PSM are best.** There may be times when you are working with materials for which experimenting with these options may be useful. For help choosing a PSM, read ["Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy"](https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/).

Note that it's possible to customize the `oem` and `psm` together. Here's how:

In [None]:
# Change the numbers below to try
# running other modes together.
custom_oem_psm_config = r'--oem 3 --psm 4'

# Open a specific image file, convert the text in the image to computer-readable text (OCR)
# following the language and mode configuration we specify,
# and then print the results for us to see here.
print(
    pytesseract.image_to_string(
        Image.open("../All-sample-files/ocr_sample.jpg"),
        lang="eng",
        config=custom_oem_psm_config)
)
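
When trying many combinations, it can help to build the configuration string with a small helper that catches out-of-range numbers before Tesseract runs. This is a sketch only: `tesseract_config` is a hypothetical function, and the accepted ranges (0-3 for OEM, 0-13 for PSM) are assumed from the lists that recent Tesseract releases print with `--help-oem` and `--help-psm`, so check them against your own installation.

```python
def tesseract_config(oem=3, psm=3):
    """Return a config string such as '--oem 3 --psm 4' for PyTesseract."""
    if oem not in range(0, 4):
        raise ValueError(f"OEM must be between 0 and 3, not {oem}")
    if psm not in range(0, 14):
        raise ValueError(f"PSM must be between 0 and 13, not {psm}")
    return f"--oem {oem} --psm {psm}"


# Build the same configuration used in the cell above.
print(tesseract_config(oem=3, psm=4))
```

The returned string can be passed as the `config` argument of `pytesseract.image_to_string()` in place of a hand-typed one.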

### Saving OCR Output to Documents (.txt, .pdf, and .html)

Tesseract can convert OCR'ed images into text, searchable PDF, and [hOCR (HTML)](https://en.wikipedia.org/wiki/HOCR).

___
<h3 style="color:red; display:inline">Try it! &lt; / &gt; </h3>

The scripts below output various file formats.
___

In [None]:
# Output to text file (.txt)

from pathlib import Path

# File location
# You can change the filename in quotes below to OCR a different file.
inputfile_path = Path("../All-sample-files/ocr_sample.jpg")

# Open the file named above. 
# While it's open, do several things:
with open(inputfile_path, 'rb') as f:
        
    # Read the file using PIL's Image module.
    img = Image.open(f)
    
    # Run OCR on the open file.
    ocrText = pytesseract.image_to_string(img)

# The image file above will be closed before moving on to this line.
# The OCR'ed text has been pulled from the image and stored in
# a Python variable for us to continue to use.

# Create and open a new text file, name it to match its input file,
# declare its encoding to be UTF-8 so that it correctly outputs
# non-ASCII characters.

# Define the output file path and name
outputfile_path = inputfile_path.with_suffix('.txt')

with open(outputfile_path, "w", encoding="utf-8") as outFile:
        
    # and write the OCR'ed text to the file.
    outFile.write(ocrText)

# Display a message to let us know the file has been created
# and the script successfully completed.
print(outputfile_path, "created.")

In [None]:
# Output to pdf file (.pdf)

# File location
# You can change the filename in quotes below to OCR a different file.
inputfile_path = Path("../All-sample-files/ocr_sample.jpg")

# Run OCR on an image file and save it as a PDF object (not file)
# within Python.
pdf = pytesseract.image_to_pdf_or_hocr(inputfile_path.as_posix(), extension='pdf')

# Define the output file path and name
outputfile_path = inputfile_path.with_suffix('.pdf')

# Create a new empty pdf.
with open(outputfile_path, 'w+b') as f:
    
    # Save the PDF object to the new empty PDF file.
    f.write(pdf)

# Display a message to let us know the file has been created
# and the script successfully completed.
print(outputfile_path, "created successfully.")

In [None]:
# Output to html file (.html)

# File location
# You can change the filename in quotes below to OCR a different file.
inputfile_path = Path("../All-sample-files/ocr_sample.jpg")

# Run OCR on an image file and save it as an HTML object (not file)
# within Python.
hocr = pytesseract.image_to_pdf_or_hocr(inputfile_path.as_posix(), extension='hocr')

# Define the output file path and name
outputfile_path = inputfile_path.with_suffix('.html')

# Create a new empty HTML file. Open it in "w+b" mode.
# "w+b" is a mode that tells Python to write whatever
# data we give to a file in binary mode--meaning that 
# it will not apply any encoding or try to translate
# a non-ASCII character to an ASCII character.
with open(outputfile_path, 'w+b') as f:
    
    # Save the hOCR data to the new empty HTML file.
    f.write(hocr)

# Display a message to let us know the file has been created
# and the script successfully completed.
print(outputfile_path, "created successfully.")
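
Unlike plain text, hOCR records where each word sits on the page and how confident the engine was about it. The sketch below pulls that information out with Python's built-in HTML parser. The sample markup is hand-written to resemble Tesseract's word-level output (a `bbox` of pixel coordinates and an `x_wconf` confidence score) and is not taken from the sample image; `HocrWordParser` and `parse_title` are hypothetical names.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Turn 'bbox 36 92 96 116; x_wconf 96' into a bounding box and confidence."""
    properties = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            properties[fields[0]] = fields[1:]
    bbox = tuple(int(number) for number in properties.get("bbox", []))
    confidence = properties.get("x_wconf")
    return {
        "bbox": bbox,
        "confidence": float(confidence[0]) if confidence else None,
    }


class HocrWordParser(HTMLParser):
    """Collect (word, properties) pairs from hOCR word elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current_title = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            self._current_title = attributes.get("title") or ""

    def handle_data(self, data):
        if self._current_title is not None and data.strip():
            self.words.append((data.strip(), parse_title(self._current_title)))

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_title = None


# Hand-written sample resembling two words of hOCR output.
sample_hocr = """
<span class='ocr_line' title='bbox 36 92 210 116'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Session</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 110 92 210 116; x_wconf 58'>Laws</span>
</span>
"""

parser = HocrWordParser()
parser.feed(sample_hocr)
for word, info in parser.words:
    print(word, info["bbox"], info["confidence"])
```

Because `image_to_pdf_or_hocr` returns bytes, real output would need to be decoded first, for example `parser.feed(hocr.decode("utf-8"))`. Words with low confidence scores are good candidates for manual checking.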

### Languages

If we do not include `lang="eng"` when we run the above code, Tesseract will *assume* English. By default, Constellate lab sessions also include support for English, Spanish, and Italian. We do not include all the languages available for Tesseract because that would negatively impact loading speed of the lab. Tesseract documentation contains a list of all the languages available with their language codes. [A table of these is available here.](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) If there is another language you would like to work with, this notebook can be run on your local machine and/or you can reach out to us about the possibility of including it by default in the lab.

In [None]:
# Display a list of languages in their 3-letter codes supported by Tesseract.
print(pytesseract.get_languages(config=''))

In [None]:
# Open a specific image file, convert the text in the image to computer-readable text (OCR),
# and then print the results for us to see here.

# Try changing the language parameter to 'eng'. Does it change the output?
print(pytesseract.image_to_string(Image.open("CHANGE-THE-FILENAME-HERE"), lang="CHANGE-LANGUAGE-CODE"))

___
<h3 style="color:red; display:inline">Try it! &lt; / &gt; </h3>

Try OCR'ing the first page from Gabriel García Márquez's *Cien Años de Soledad* (*One Hundred Years of Solitude*): 

![The first page from one hundred years of solitude](../All-sample-files/cien-años-de-soledad.png)

---
Try changing the [3-letter language code](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) to match the language in the text (Spanish).
___

It is also possible to use multiple languages. The syntax will be `lang="lan+gua"` -- replace `lan` and `gua` with the correct language codes.
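
As a small illustration of the multi-language syntax, the sketch below joins language codes with `+` and checks them against a list of installed languages. Both helper functions are hypothetical, and the `installed` list is typed by hand here so the example runs anywhere; in a lab session it would normally come from `pytesseract.get_languages(config='')`.

```python
def language_argument(*codes):
    """Join Tesseract language codes into one `lang` value, e.g. 'eng+spa'."""
    return "+".join(codes)


def missing_languages(lang, installed):
    """Return the codes in `lang` that are not in the `installed` list."""
    return [code for code in lang.split("+") if code not in installed]


lang = language_argument("eng", "spa")
print(lang)

# Hand-typed stand-in for the list of installed languages.
installed = ["eng", "ita", "spa"]
print(missing_languages("eng+fra", installed))
```

The resulting string can be passed directly as the `lang` argument of `pytesseract.image_to_string()`. Checking for missing languages first avoids an error partway through a long batch.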

## Practice: Apply to your own files

Use the following code blocks to try OCR'ing various texts. You could use your own files containing digitized texts or locate files to try via [JSTOR](https://www.jstor.org/), the [Internet Archive](https://archive.org/), [Chronicling America](https://chroniclingamerica.loc.gov/) or other resources. Try texts in different languages, fonts or types, formats, layouts, etc. 

### Upload your selected text(s) to the `data/` folder in your space in the Constellate Analytics Lab:

- Make sure that the texts you select are stored in an image (.jpg, .png, .tiff) format. If you have selected a text with multiple pages, make sure each page is stored in a separate file. *If you have PDF files and are not sure how to generate images from them, bring them to Lesson 02. We'll be looking at how to generate image files together during the lesson.*

1. Navigate to the `data/` folder.
2. Select the up arrow above the file explorer to upload.
3. If the filename is long, rename it to something simple.

### Perform OCR on your image file. 

Use the code blocks above or start fresh below. Change the language attribute to match the text's language. Try out the various settings we looked at above.


In [None]:
# Output to text file (.txt)

# Select the oem and psm
custom_oem_psm_config = r'--oem 3 --psm 3'

# File location
# You can change the filename in quotes below to OCR a different file.
inputfile_path = Path("../All-sample-files/ocr_sample.jpg")

# Open the file named above. 
# While it's open, do several things:
with open(inputfile_path, 'rb') as f:
        
    # Read the file using PIL's Image module.
    img = Image.open(f)
    
    # Run OCR on the open file using the language and modes selected above.
    ocrText = pytesseract.image_to_string(img, lang="eng", config=custom_oem_psm_config)

# The image file above will be closed before moving on to this line.
# The OCR'ed text has been pulled from the image and stored in
# a Python variable for us to continue to use.

# Create and open a new text file, name it to match its input file,
# declare its encoding to be UTF-8 so that it correctly outputs
# non-ASCII characters.

# Define the output file path and name
outputfile_path = inputfile_path.with_suffix('.txt')

with open(outputfile_path, "w", encoding="utf-8") as outFile:
        
    # and write the OCR'ed text to the file.
    outFile.write(ocrText)

# Display a message to let us know the file has been created
# and the script successfully completed.
print(outputfile_path, "created.")

## Resources <a class="anchor" id="resources"></a>
---


* [PyTesseract documentation](https://github.com/madmaze/pytesseract)
* [Tesseract documentation](https://tesseract-ocr.github.io/)

### Jupyter Notebooks Tutorials & Reference

* [Jupyter Notebooks documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html)
* [Markdown Cheat Sheet](https://www.markdownguide.org/cheat-sheet/)

### Readings on OCR

* Algun, Selcuk. 2018. "Review for Tesseract and Kraken OCR for text recognition." *Data Driven Investor.*
* Bakker, Rebecca. ["OCR for Digital Collections."](https://digitalcommons.fiu.edu/cgi/viewcontent.cgi?article=1047&context=glworks) *FIU Digital Commons.*
* Baumann, Ryan. ["Automatic evaluation of OCR quality."](https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html) */etc.*
* Cordell, Ryan. 2017. ["'Q i-jtb the Raven': Taking Dirty OCR Seriously."](https://ryancordell.org/research/qijtb-the-raven/) *Book History*, 20, 188-225.
* Cordell, Ryan. 2019. ["Why You (A Humanist) Should Care About Optical Character Recognition."](https://ryancordell.org/research/why-ocr/) *Ryan Cordell.* 
* Coyle, Karen. ["Digital Urtext."](https://kcoyle.blogspot.com/2012/04/digital-urtext.html) *Coyle's InFormation.*
* Hawk, Brandon W. ["OCR and Medieval Manuscripts: Establishing a Baseline."](https://brandonwhawk.net/2015/04/20/ocr-and-medieval-manuscripts-establishing-a-baseline/) *Brandon W. Hawk.* (This post is a comparison of ABBYY FineReader & Adobe Acrobat OCR technologies as applied to medieval texts.)
* Holley, Rose. 2009. ["How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs."](http://www.dlib.org/dlib/march09/holley/03holley.html) *D-Lib Magazine* 15, no. 3/4.
* Milligan, Ian. 2013. ["Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010."](https://uwspace.uwaterloo.ca/handle/10012/11748) *The Canadian Historical Review* 94(4), 540-569.
* Smith, David, and Ryan Cordell. 2018. "A Research Agenda for Historical and Multilingual Optical Character Recognition."
* Smith, Ray. 2007. ["An Overview of the Tesseract OCR Engine."](https://tesseract-ocr.github.io/docs/tesseracticdar2007.pdf)
* Smith, Ray, Daria Antonova, and Dar-Shyang Lee. 2009. ["Adapting the Tesseract open source OCR engine for multilingual OCR."](https://dl.acm.org/doi/10.1145/1577802.1577804) MOCR '09: Proceedings of the International Workshop on Multilingual OCR.

### Additional Reading

* Rockwell, Geoffrey, and Stéfan Sinclair. 2016. [*Hermeneutica: Computer Assisted Interpretation in the Humanities.*](http://hermeneuti.ca/)
* Underwood, Ted. ["The challenges of digital work on early-19c collections."](https://tedunderwood.com/2011/10/07/the-challenges-of-digital-work-on-early-19c-collections/) *The Stone and the Shell.*
* [TranScriptorium's handwritten text recognition project results.](https://cordis.europa.eu/project/id/600707/results)
* ["How to Transcribe Documents with Transkribus - Introduction."](https://readcoop.eu/transkribus/howto/how-to-transcribe-documents-with-transkribus-introduction/) *Read Coop.*
* **[Basics of Fair Use](https://copyright.columbia.edu/basics/fair-use.html)** from Columbia University.

### OCR Tutorials & Reference

The following is a list of tutorials that include different scholars' approaches to OCR. Some also use Tesseract, but most use different scripting or programming languages. There is no single best way to do OCR, so if you have the time, they are worth trying to see which works best for your project.

* Aidan. ["OCR with Python."](https://medhieval.com/classes/hh2019/blog/ocr-with-python/) *Hacking the Humanities 2019.* 
* Akhlaghi, Andrew. ["OCR and Machine Translation."](http://programminghistorian.org/en/lessons/OCR-and-Machine-Translation) *The Programming Historian.* (Note that this tutorial uses Tesseract but works with the bash scripting language instead of Python.)
* Baumann, Ryan. ["Command-Line OCR with Tesseract on Mac OS X."](https://ryanfb.github.io/etc/2014/11/13/command_line_ocr_on_mac_os_x.html) */etc.*
* Dull, Joshua. ["Text Recognition with Adobe Acrobat and ABBYY FineReader."](https://github.com/JoshuaDull/Text-Recognition-Introduction/)
* Graham, Shawn. ["Extracting Text from PDFs; Doing OCR; all within R."](https://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/) *Electric Archaeology.* (This blog post describes a method for OCR using the R programming language.)
* Mähr, Moritz. ["Working with batches of PDF files."](https://programminghistorian.org/en/lessons/working-with-batches-of-pdf-files) *The Programming Historian.* (Note that this tutorial uses Tesseract and works in the command line without Python.)
* Rosebrock, Adrian. ["Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy."](https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/) *PyImageSearch.*
* Shperber, Gidi. ["A gentle introduction to OCR."](https://towardsdatascience.com/a-gentle-introduction-to-ocr-ee1469a201aa) *Towards Data Science.* October 22, 2018.
* Tarnopol, Rebecca. ["How to OCR Documents for Free in Google Drive."](https://business.tutsplus.com/tutorials/how-to-ocr-documents-for-free-in-google-drive--cms-20460) *TutsPlus.*
