# Images to Text: A Gentle Introduction to Optical Character Recognition with PyTesseract

***Lesson 01: What is OCR? Why is it important?*** 

Monday, June 14 2021 - [13:00-14:30 UTC](https://savvytime.com/converter/utc/jun-14-2021/16-00)

## Table of Contents

- [Welcome](#welcome)
    - [Introductions](#intros)
    - [Course Format & Expectations](#expectations)
    - [Course Materials](#materials)
- [Lesson 01: What is OCR? Why is it important?](#lesson-01)
    - [Computers "Reading"](#reading)
    - [What is OCR?](#what-is-ocr)
    - [Why is OCR important?](#why-ocr)
    - [How is OCR used in teaching & research?](#teaching-research)
    - [How to OCR: Basics](#basics)
- [Homework](#homework)
- [Resources](#resources)

## Welcome
---

Welcome to Images to Text: A Gentle Introduction to OCR with PyTesseract! I'm looking forward to meeting you all and learning more about your work. <mark style="background-color:pink;">Please read through [this notebook](00-CoursePreparation.ipynb) and complete the ["things to do" list](00-CoursePreparation.ipynb#to-do) before our first lesson.</mark> See you soon on Slack and Zoom!

### Land Acknowledgment

These materials were prepared and are presented on the ancestral homelands of the Haliwa-Saponi, Sappony, and Occaneechi Band of the Saponi nations, whose lands are now known as Durham, North Carolina. This acknowledgement reminds us of the significance of place even in a virtual space, and of our ongoing need to build a more inclusive and equitable society. 

Learn more about [land acknowledgments](https://nativegov.org/a-guide-to-indigenous-land-acknowledgment/). Learn about the [Occaneechi Band of the Saponi Nation Homeland Preservation project](https://obsn.org/homeland-preservation-project/).

### About this Course

This course will introduce the concept of “Optical Character Recognition” (OCR), various tools available for performing OCR, and important considerations for successfully OCRing digitized text. Using Tesseract in Python, we’ll walk through the entire process using a variety of examples to show the range of challenges scholars can face when performing OCR. [This webpage contains the full course description and schedule.](http://labs.jstor.org/tapi-courses/)

### Learning Objectives

By the end of the course, participants should be able to 

- define "OCR";
- explain the importance of OCR for computer-aided reading and analysis;
- point to use cases for OCR in teaching & research;
- use the course’s Jupyter Notebooks to perform OCR on their own;
- describe and implement an OCR workflow;
- identify possible technical challenges presented by specific texts and propose potential solutions; 
- assess the degree of accuracy they have achieved in performing OCR.

### Introductions<a class="anchor" id="intros"></a>

Who are we?

- Name
- [Pronouns](https://www.mypronouns.org/what-and-why/)
- Tweet length: Why are you here?

### Course Format & Expectations <a class="anchor" id="expectations"></a>

#### Format

- **We're using at least 7 different tools/platforms--that's a lot.** I will probably get lost in them at some point, so don't worry if you do, too. Generally speaking, 
    - **Zoom** will be used for our 3 synchronous sessions and office hours. 
    - **Slack** is available for asynchronous and synchronous communication--it's optional, but it's a great place to network outside of class. You can also **email** me if you prefer.
    - **Jupyter Notebooks** (where you're probably reading this) hosted on **Constellate's Binder** and **GitHub** will be our workspace, where all of the workshop content is stored and where you can experiment with your own content.
    

- **We'll use a mixture of presentation, small group discussion, and large group discussion formats.** There will be some small (or as big as you make them) optional homework assignments. I will also be available for office hours after each session and the Friday prior to our workshop.


- **I (Hannah) am committed to making this workshop as accessible as possible.** All work will be done in your browser, so you won't need to download anything to participate. Automatic captions will be provided by Zoom (with some mistakes inevitable). These Jupyter Notebooks should be optimized for screen readers. If you notice any issues with these, or if there are other ways I can make these workshops more accessible, please let me know.


- **I love questions, especially when I don't know the answer.** This helps me learn, too. If I, or fellow participants, can't answer a question immediately, I'll work to help you find the answer.


- **This workshop has been created as an open educational resource.** You are welcome to share and reuse this content for your own teaching and research. Please do credit the TAP Institute and me.


#### Expectations

- **This virtual space is open and welcoming to all people.** I will endeavor to treat everyone with respect and dignity, and by participating in this workshop, you agree to do the same.


- **We're located around the world -- do what you need to do.** If this course occurs during a mealtime, don't hesitate to eat. If you prefer, or if your internet is unstable, feel free to leave your video off. You can use either audio or chat to ask questions.


- **We won't cover everything,** but after this workshop you will have the resources you need to learn more about OCR.


- **All of the necessary code will be provided.** There will be opportunities to modify code and, for those who wish, to write your own code.


- **Any "homework" exercises are optional,** though, if you have the time to do them, they can greatly enhance your experience and understanding.


- **Use Slack as much or as little as you like.** Just be mindful that we are located in a variety of time zones, so we may not answer immediately. 


- **I'm available Monday, Wednesday, and Friday after our workshop sessions to chat on Zoom or Slack.** Tuesday and Thursday, I'll be checking Slack intermittently.


#### What else?


### Course Materials<a class="anchor" id="materials"></a>

[Everything you need to know to get set up for this course is available in this Jupyter Notebook.](00-CoursePreparation.ipynb)

[You can return to these materials here.](https://nkelber.github.io/tapi2021/book/courses/ocr.html)

Have you:

- successfully opened this notebook?
- edited this notebook or added your own notes? (Here's [a reference for Markdown](https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet).)
- saved [a link to this notebook](https://nkelber.github.io/tapi2021/book/courses/ocr.html) to come back to later?
- run the following line of code?

In [None]:
print("Hello, world!")

### Questions?

## Lesson 01: What is OCR? Why is it important?<a class="anchor" id="lesson-01"></a>
---

### Learning Objectives

<div class="alert alert-block alert-info">
    <strong>By the end of this lesson, you should be able to</strong>
    <ul>
        <li>define "OCR";</li>
        <li>explain the importance of OCR for computer-aided reading and analysis;</li>
        <li>perform basic OCR operations using Python, Tesseract, and Jupyter Notebooks.</li>
    </ul>
</div>

*Examples for today's exercises are drawn from the [On the Books: Algorithms of Resistance](https://onthebooks.lib.unc.edu/) project.*

### Computers "Reading"<a class="anchor" id="reading"></a>

#### How is text represented by computers?

#### An Example:

What is the difference between [this text](https://www.gutenberg.org/files/2229/2229-0.txt) and [this text](faust.png)?

#### A Closer Look

[<img src="images/06-corpus-03.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of a volume of North Carolina laws shown in PDF format on the Internet Archive" title="Screenshot of a volume of North Carolina laws shown in PDF format on the Internet Archive" />](https://archive.org/details/lawsresolutionso1887nort/page/776/mode/2up)

The text as it's shown to us in the screenshot above and in the Internet Archive is stored as *image files*.

While we humans see this file, know that it contains text, and possibly can read the text shown, the computer doesn't understand it that way. Here's a small amount of what the computer "reads":

<img src="images/07-ocr-01.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of text stored in an image format from a page of North Carolina laws" title="Screenshot of text stored in an image format from a page of North Carolina laws" />

While we might see this as the word `Blackwell's`, the computer understands the above as a series of squares, **pixels**, each containing information about which color it should be--*not* which character to display. If we want the computer to be able to work with this text *as* text, we need to convert the image above into this:

`01000010 01101100 01100001 01100011 01101011 01110111 01100101 01101100 01101100 00100111 01110011`

...which the computer will then display for human readers as `Blackwell's`. We can then use our computers to search for instances of this word and analyze its frequency, patterns of occurrence, collocations, and so on. We can also ask the computer to read this and any other words in the page aloud if we need to hear them instead of viewing them on a screen.
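
To make this conversion concrete, here is a minimal sketch using only Python's standard library. It assumes the text is encoded as UTF-8 (which matches ASCII for these characters), and the helper names `to_bits` and `from_bits` are invented for this illustration; they are not part of Tesseract or PyTesseract.

```python
def to_bits(text):
    """Return the 8-bit binary form of each byte in `text`, separated by spaces."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


def from_bits(bits):
    """Turn a space-separated string of 8-bit values back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("utf-8")


word = "Blackwell's"
encoded = to_bits(word)

# Show the binary form the computer stores, then convert it back to text.
print(encoded)
print(from_bits(encoded))
```

Each group of eight digits stands for one character, which is why the computer can search, count, and read aloud text stored this way but not text stored as pixels.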

The image files themselves were created by [archivists who used a scanner or camera to create a digital copy of the print (paper) version of this text](https://www.digitalnc.org/policies/digitization-guidelines/). The printed volume is represented by 1 image per single-sided page. The archivists then created several different digital versions (file formats), added information (metadata) about the volume, and uploaded all of these files to the Internet Archive to share with the world.

## What is OCR? <a class="anchor" id="what-is-ocr"></a>

OCR, or "Optical Character Recognition," is **a computational process that converts digital images of text into computer-readable text**.

OCR is both a noun and a verb.

<div class="alert alert-block alert-success">
    <p>More specifically:</p>
    
<blockquote><strong>OCR software attempts to replicate the combined functions of the human eye and brain, which is why it is referred to as artificial intelligence software.</strong> A human can quickly and easily recognise text of varying fonts and of various print qualities on a newspaper page, and will apply their language and cognitive abilities to correctly translate this text into meaningful words. Humans can recognise, translate and interpret the text on a newspaper page very rapidly, even text on an old poor quality newspaper page from the 1800s. We can quickly scan layout, sections and headings, and read the text of articles in the right order (which is much more difficult than reading the page of a book). <strong>OCR software can now do all these things too, but not to the same level of perfection as a human can.</strong> - (<a href="http://www.dlib.org/dlib/march09/holley/03holley.html" alt="Holley, How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs">Holley, "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs"</a>)</blockquote>
    
<blockquote>"Optical character recognition (OCR) software is <strong>a type of artificial intelligence software designed to mimic the functions of the human eye and brain and discern which marks within an image represent letterforms or other markers of written language.</strong> OCR scans an image for semantically-meaningful material and transcribes what language it finds into text data." (<a href="https://ryancordell.org/research/why-ocr/" alt="Why You (A Humanist) Should Care About Optical Character Recognition">Cordell, "Why OCR?"</a>)</blockquote>
    
</div>

### Inputs

In order to perform OCR on a text corpus, we need the following:

- A **single file folder** containing all of the corpus files. If the corpus is small enough (e.g. 1 book), this could be simply a single file (e.g. a .pdf).
- All corpus files should be of the **same file format**.
- The chosen file format should be **interoperable** (usable by many software and operating systems) and stable (changes rarely if ever).


- For our work with Python and Tesseract, the files should be **images**, which means that each file will correspond to 1 single-sided page (if in a book format) in the corpus. 

<img src="sessionlawsresol1955nort_0057.jpg" width="40%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="First page of the 1955 North Carolina Session Laws." title="First page of the 1955 North Carolina Session Laws." />



**To keep all of these image files organized,** create a file structure that looks like the below: 1 file folder for the entire corpus, and 1 subfolder for each volume in the corpus containing an image file for every page in the volume.

<img src="images/08-ocr-01.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of a file structure for image files to be OCR'ed." title="Screenshot of a file structure for image files to be OCR'ed." />

Note that the file naming structure identifies *both* which volume the images are part of *and* which scanned page they correspond to, which helps us maintain the order of the volume. These numbers *may not* correspond to page numbers because the scanning included outer and inner covers as well as title pages, etc.

Note that we are working with .jpg files here. The process we'll be using, though, can also be run with .png, .tiff, .jp2, and other common interoperable image formats.
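
As a rough illustration of why this folder structure is convenient, the sketch below walks a corpus folder and lists each volume's page images in order. The function name `list_corpus_pages` and the set of extensions are assumptions made for this example, and the ordering relies on page numbers being zero-padded in the file names (as in `_0057`).

```python
from pathlib import Path

# Image formats mentioned above; extend this set as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".jp2"}


def list_corpus_pages(corpus_dir):
    """Map each volume subfolder name to a sorted list of its page images."""
    pages_by_volume = {}
    for volume_dir in sorted(Path(corpus_dir).iterdir()):
        # Skip any loose files that sit next to the volume subfolders.
        if not volume_dir.is_dir():
            continue
        # Sorting by file name keeps zero-padded pages in scanning order.
        pages = sorted(
            page for page in volume_dir.iterdir()
            if page.suffix.lower() in IMAGE_EXTENSIONS
        )
        pages_by_volume[volume_dir.name] = pages
    return pages_by_volume
```

Calling `list_corpus_pages("corpus")` on a folder laid out as described (the folder name `corpus` is hypothetical) would return one entry per volume, each holding that volume's pages in order.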

### Outputs

From the images, the OCR process will create plain text:


    SESSION LAWS

    OF THE

    STATE OF NORTH CAROLINA

    SESSION 1955

    S. B. 4 CHAPTER 1

    AN ACT T0 AUTHORIZE THE BOARD OF TRUSTEES OF THE
    SOUTHERN PINES SCHOOL DISTRICT T0 TRANSFER CERTAIN
    FUNDS FROM ITS DEBT SERVICE ACCOUNT TO ITS CAPITAL
    OUTLAY OR CURRENT EXPENSE ACCOUNTS, OR TO BOTH
    SUCH ACCOUNTS.

    The General Assembly 0/ Alarm Carolina do amt:

    Secﬁon 1. The Board of Trustees of the Southern Pines School Dis-
    trim, is hereby authorized and empowered to transfer all surplus funds held
    by it in its debt service account on the date at the ratiﬁcation of this Act
    or on July 1, 1955, to its capital outlay account or current expense account,
    or to hnth such accounts, and to use said funds for capital outlay or current
    expense purposes, m- bath, including the construction of school buildings,

    See. 2. A11 laws and clauses ni laws in conﬂict with this Act are hereby
    repealed.

    Sec. 3‘ This Act shah become effective on and after its ratiﬁcation.

    In the General Assembly read three times and ratiﬁed, this the 14th
    day of January, 1955.

    H. B. 13 CHAPTER 2

    AN ACT TO PERMIT THE BOARD OF COMMISSIONERS OF
    CATAWBA COUNTY TO MAKE APPROPRIATIONS FOR BUILDING
    WATER LINES, SEWER LINES OR EITHER OF THEM, FROM THE
    CORPORATE LIMITS OF MUNICIPALITIES TO COMMUNITIES IN
    THE COUNTY.

    The General Assembly a/ North Carolina do enact:

    Section 1. The Board of County Commissioners of Catawba County is
    hereby authorized and empowered in its discretion to expend out of non»
    fax funds available to said board such amount or amaunts as it may deem
    Wise, not exceeding in the aggregate the sum of one hundred and twenty-

    1


We'll organize a corpus of such text in **1 file folder containing 1 file per volume in the .txt (plain text) format.** The plain text file format is interoperable, stable, and fully computer readable, meaning it will be ready for performing computational analysis and for storing in repositories and databases.

<img src="images/08-ocr-02.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of the file structure for files after OCR." title="Screenshot of the file structure for files after OCR." />
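
One way the "1 text file per volume" output might be produced is sketched below. This is an illustration under assumptions, not the course's own workflow: the function names are invented, the pages are assumed to be `.jpg` files whose names sort in page order, and the OCR step is passed in as a function so the file handling can be tried separately from Tesseract.

```python
from pathlib import Path


def ocr_page_with_tesseract(image_path):
    """OCR one page image with PyTesseract (imported here so the rest of
    this sketch can be tried without Tesseract installed)."""
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang="eng")


def ocr_volume(volume_dir, output_dir, ocr_page=ocr_page_with_tesseract):
    """OCR every .jpg page in `volume_dir`, in file-name order, and save
    the combined text as <output_dir>/<volume name>.txt."""
    volume_dir = Path(volume_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # OCR the pages one at a time, keeping them in scanning order.
    page_texts = [ocr_page(page) for page in sorted(volume_dir.glob("*.jpg"))]

    # Write one UTF-8 plain text file named after the volume folder.
    out_path = output_dir / (volume_dir.name + ".txt")
    out_path.write_text("\n".join(page_texts), encoding="utf-8")
    return out_path
```

Keeping the OCR step as a replaceable argument is a design choice for this sketch: it lets you swap in different Tesseract settings later without touching the file handling.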

## Why is OCR important? <a class="anchor" id="why-ocr"></a>

What can we *do* with text that has been OCR'ed?

<div class="alert alert-block alert-success">
    <blockquote>"Typically <strong>OCR is used in situations where manual transcription would be too costly or time consuming</strong>—a subjective designation to be sure—such as when a large corpus has been scanned. Relative to manual transcription, <strong>OCR is a quick and affordable means for creating computable text data from a large collection of images</strong>." (<a href="https://ryancordell.org/research/why-ocr/" alt="Why You (A Humanist) Should Care About Optical Character Recognition">Cordell, "Why OCR?"</a>)</blockquote>
    
</div>

## How is OCR used in research & teaching workflows? <a class="anchor" id="teaching-research"></a>

- Search
- Analysis
- Creating accessible course materials

As Ryan Cordell [has written](https://ryancordell.org/research/why-ocr/), **"even if you’ve never heard of OCR it may nonetheless be important or even essential to your research and teaching."**

With computer-readable text, we can search for words, analyze word frequencies and patterns, and ask computers to do things like read text aloud to us. The first and last of these capabilities can help us **access information**. Have you ever run a keyword search within a text on Google Books or [HathiTrust](https://babel.hathitrust.org/cgi/pt?id=uiug.30112001872933&view=1up&seq=79&q1=freedom)? Your ability to search a scanned text on these platforms has been made possible by OCR. Do you ever ask Adobe Acrobat, Siri, Cortana, Alexa, or Google Assistant to read a PDF aloud to you? That's also made possible by OCR.

We can also use OCR outputs (often with some human intervention) to create datasets for [**text analysis**](https://digitalpedagogy.hcommons.org/keyword/Text-Analysis). These analyses are conducted with the goal of exploring vast collections of information, and *perhaps* creating new understandings of those materials and their significance, using methods that humans can't conceivably complete in a reasonable amount of time without computational support. Here are a few examples of work that OCR has made possible:

- [A computational analysis of the National Security Archive’s Kissinger Collections](https://blog.quantifyingkissinger.com/), containing tens of thousands of communications between Henry Kissinger and government officials between 1969 and 1977 (think Civil Rights, Vietnam, Watergate, Cold War...).
- [Topic modeling of fugitive slave advertisements in Richmond, Virginia's, Civil War-era newspaper *The Dispatch*](https://dsl.richmond.edu/dispatch/topic/9).
- The analysis of [hundreds of years' worth of women's writing](https://wwp.northeastern.edu/).
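
On a much smaller scale, searching and counting words in OCR'ed text can look like the sketch below. It uses only Python's standard library, and the passage is copied from the sample OCR output earlier in this lesson; real projects typically involve far more text and more careful cleaning.

```python
import re
from collections import Counter

# A short passage copied from the sample OCR output earlier in this lesson.
ocr_text = """AN ACT TO PERMIT THE BOARD OF COMMISSIONERS OF
CATAWBA COUNTY TO MAKE APPROPRIATIONS FOR BUILDING
WATER LINES, SEWER LINES OR EITHER OF THEM, FROM THE
CORPORATE LIMITS OF MUNICIPALITIES TO COMMUNITIES IN
THE COUNTY."""

# Lowercase the text and split it into words (runs of letters).
words = re.findall(r"[a-z]+", ocr_text.lower())

# Analysis: count how often each word occurs.
frequencies = Counter(words)
print(frequencies.most_common(5))

# Search: find every line that mentions a keyword.
keyword = "lines"
matches = [line for line in ocr_text.splitlines() if keyword in line.lower()]
print(matches)
```

Neither step is possible while the page is stored only as an image, which is why OCR comes first.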


### Exercise

The above is the *promise* of OCR, but *OCR has its limitations:* 

1. Navigate to [*Chronicling America*](https://chroniclingamerica.loc.gov/)
2. Search for a term of interest.
3. Click on a search result to view the newspaper page. Click "Text" in the menu above the newspaper to view the OCR output. What do you notice?
4. Return to your search results and select another result. View its OCR output. What do you notice?
5. Consider:
    - How is the digitized page different from the generated text?
    - How does the quality of the generated text compare to that of the digitized page? 
    - What are the implications for research and teaching?
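
If you'd like a rough number to go with your observations, the sketch below compares one line of OCR output with a hand-corrected version using Python's built-in `difflib`. The OCR line comes from the sample output earlier in this lesson; the reference line is my own reading of the page, and the score is only an informal similarity measure, not a standard OCR accuracy metric.

```python
from difflib import SequenceMatcher


def similarity(ocr_line, reference_line):
    """Return a rough 0-1 similarity score between OCR output and a
    human-checked transcription (1.0 means the strings are identical)."""
    return SequenceMatcher(None, ocr_line, reference_line).ratio()


# OCR output copied from the sample page earlier in this lesson.
ocr_line = "The General Assembly 0/ Alarm Carolina do amt:"
# Assumed correct reading, transcribed by hand for this illustration.
reference_line = "The General Assembly of North Carolina do enact:"

print(round(similarity(ocr_line, reference_line), 2))
```

A score below 1.0 signals that a keyword search for a word such as "enact" would miss this line.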


## BREAK

<img src="images/noun_Cafe_3166430.png" width="20%" alt="A coffee cup on a saucer with steam rising from the cup."/>

## How to OCR: Basics <a class="anchor" id="basics"></a>

### Why Python & Tesseract

- free vs. paid
- open vs. proprietary
- levels of customization/control
- quality

[A short list of tools.](OCR_Tools.ipynb)

#### A Very Brief Overview of Tesseract & PyTesseract

[Tesseract](https://github.com/tesseract-ocr/tesseract) was initially developed by Hewlett-Packard between 1985 and 1994. HP made it open source in 2005. [Google developed it](https://opensource.google/projects/tesseract) between 2006 and 2018. It is still open source and is maintained by Zdenko Podobny. There is an [active user forum](https://groups.google.com/g/tesseract-ocr).

Tesseract supports over [100 languages](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) and can be [run in the command line](https://github.com/tesseract-ocr/tesseract#running-tesseract) on Windows, macOS, and Linux. Its outputs can be stored in several interoperable file formats. There are a number of [third party GUIs available](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html).

Recent versions of Tesseract (4.x and later) incorporate [LSTM (Long Short-Term Memory)](https://en.wikipedia.org/wiki/Long_short-term_memory), a type of artificial recurrent neural network. An LSTM is a model that can process long sequences of data, "remember" relevant parts of that data, and apply what it "learns" from that data to other data as it's processing.

Because Tesseract is free and open source, it's [possible to retrain Tesseract in order to OCR a specific corpus](https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html). This requires a large and specific dataset, some expertise, and some time. But it's a key feature that you won't get from proprietary or closed-source software.

[PyTesseract](https://pypi.org/project/pytesseract/) is a "wrapper" -- basically it makes Tesseract legible to Python so that it can be incorporated into various Python environments and functionalities. This means that if you're already working in Python, you don't need to leave your environment to build a dataset. You could also build PyTesseract into a Python application and/or into a code base that you plan to reuse. It has been [developed and maintained](https://github.com/madmaze/pytesseract) since 2014 by a group of programmers led by Matthias Lee.

### The Code

In order to run Tesseract in Binder, we need to run the following line of code: select the code block, and press `Shift+Enter` or `Shift+Return` on your keyboard. 

**Wait until the following code finishes (it will take 1-2 minutes) before you continue.** If possible, avoid interrupting (closing your browser, pressing the stop button in the menu, etc.) this process as running it a second time will produce errors.

In [None]:
# Install tesseract on Binder.
# The exclamation point runs the command as a terminal command.
# This may take 1-2 minutes.
# Source: Nathan Kelber & JSTOR Labs Constellate team.
!conda install -c conda-forge -y tesseract

When the above code finishes running, run the following code to produce OCRed text from [a single printed page](sessionlawsresol1955nort_0057.jpg).

In [None]:
# Import the Image module from the Pillow Library, which will help us access the image.
from PIL import Image

# Import the pytesseract library, which will run the OCR process.
import pytesseract

# Open a specific image file, convert the text in the image to computer-readable text (OCR),
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("sessionlawsresol1955nort_0057.jpg"), lang="eng"))

Let's break down the above code, from the inside out:

1. `Image.open("sessionlawsresol1955nort_0057.jpg")` - Open the image file `sessionlawsresol1955nort_0057.jpg`.


2. `pytesseract.image_to_string(..., lang="eng")` - Using PyTesseract's `image_to_string` function, detect characters in the image and convert them into computer-readable text. The `lang="eng"` argument belongs to this function (not to `Image.open`) and tells Tesseract to read the text as English.


3. `print()` - Display the computer-readable text output.

### Variations on a Theme: Tesseract's Options <a class="anchor" id="variations"></a>

<mark style="background-color:pink"><strong>UPDATE</strong></mark>: Feel free to work through this section and the homework below before Wednesday. We'll also take time to cover this at the beginning of Wednesday's lesson.

Tesseract offers a number of different modes, or settings, that we can use to customize output. There are two types of modes: OEMs (OCR Engine Modes), which specify which OCR engines Tesseract uses, and PSMs (Page Segmentation Modes), which specify how the OCR tools should read the image files--how to separate and order sections of text in the image file.

#### OCR Engine Modes (OEMs)

Run the following command to view the list of OEMs.

In [None]:
!tesseract --help-oem

Run the following script, trying each of the different OEMs in turn. To change the OEM, replace the number (shown here as X) in the first line of code:

`custom_oem_config = r'--oem X'`

In [None]:
# Change the OEM number below to try
# running another OCR mode.
# 3 is the default setting.
custom_oem_config = r'--oem 3'

# Open a specific image file, convert the text in the image to computer-readable text (OCR)
# following the language and mode configuration we specify,
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("sessionlawsresol1955nort_0057.jpg"), lang="eng", config=custom_oem_config))

*What did you notice about the different modes? How did they differ from one another?*

Here's more of an explanation of OCR Engine Modes (OEMs):

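The labels you see depend on which version of Tesseract is installed. Tesseract 3.x lists "Original Tesseract" and "Cube"; Tesseract 4 and later list "Legacy" and "LSTM".

- *0 - Legacy engine only* (listed as "Original Tesseract only" in Tesseract 3.x). - This mode runs only Tesseract's original recognition engine. It may produce an error if the installed language data does not include the legacy model.
  
- *1 - Neural nets LSTM engine only* (listed as "Cube only" in Tesseract 3.x). - In Tesseract 4 and later, this mode runs only the LSTM engine, as [a Nanonets tutorial explains](https://nanonets.com/blog/ocr-with-tesseract/). In Tesseract 3.x, this number selected Cube, which was, [according to Google](https://code.google.com/archive/p/tesseract-ocr-extradocs/wikis/Cube.wiki), "an alternative recognition mode for Tesseract. It is slower than the original recognition engine, but often produces better results." Cube is a different engine from LSTM and was removed in Tesseract 4. There is not much documentation about it.
  
- *2 - Legacy + LSTM engines* (listed as "Tesseract + Cube" in Tesseract 3.x). - Both the original ("Legacy") and LSTM engines are used.

- *3 - Default, based on what is available.* - Tesseract will choose an engine based on what the installed language data supports. Even if we don't include the configuration information, Tesseract will run in OEM 3.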

#### Page Segmentation Modes (PSM)

Run the following command to view all of the PSMs:

In [None]:
!tesseract --help-psm

This time, our configuration looks like

`custom_psm_config = r'--psm X'`

In [None]:
# Change the PSM number below to try
# running another page segmentation mode.
# 3 is the default setting.
custom_psm_config = r'--psm 3'

# Open a specific image file, convert the text in the image to computer-readable text (OCR)
# following the language and mode configuration we specify,
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("sessionlawsresol1955nort_0057.jpg"), lang="eng", config=custom_psm_config))

Many of the PSMs are meant for images that have little text in them -- such as images that include road or store signs. [See Tesseract's documentation on improving OCR quality.](https://tesseract-ocr.github.io/tessdoc/ImproveQuality)

**Most of the time, the default OEM and PSM are best.** There may be times, though, when you are working with materials for which experimenting with these options is useful.

Note that it's possible to customize the `oem` and `psm` together. Here's how:

In [None]:
# Change the numbers below to try
# running other modes together.
custom_oem_psm_config = r'--oem 3 --psm 4'

# Open a specific image file, convert the text in the image to computer-readable text (OCR)
# following the language and mode configuration we specify,
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("sessionlawsresol1955nort_0057.jpg"), lang="eng", config=custom_oem_psm_config))
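
If you plan to experiment with many combinations, a small helper can build the configuration text for you. The sketch below is one possible approach, not part of PyTesseract: the function names `build_config` and `compare_psms` are invented here, and the allowed ranges assume Tesseract 4 or later, where `--help-oem` lists modes 0-3 and `--help-psm` lists modes 0-13.

```python
def build_config(oem=3, psm=3):
    """Build a Tesseract config string such as '--oem 3 --psm 4'."""
    if oem not in range(0, 4):
        raise ValueError("OEM must be between 0 and 3")
    if psm not in range(0, 14):
        raise ValueError("PSM must be between 0 and 13")
    return f"--oem {oem} --psm {psm}"


def compare_psms(image, psms, run_ocr):
    """Run OCR once per PSM and return a dictionary of {psm: text}.

    `run_ocr` is any function that accepts (image, config) and returns text.
    """
    return {psm: run_ocr(image, build_config(psm=psm)) for psm in psms}


# Show the config strings that would be passed to Tesseract.
print(build_config())
print(build_config(oem=1, psm=6))
```

With PyTesseract, `run_ocr` could be `lambda img, cfg: pytesseract.image_to_string(img, lang="eng", config=cfg)`, which makes it easy to view the output of several PSMs side by side.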

#### File Formats

In addition to .txt, Tesseract can convert OCR'ed images into [hOCR (HTML)](https://en.wikipedia.org/wiki/HOCR), searchable PDF, and TSV.

*Exercise*

The scripts below output various file formats. Try each and then click the file link below each script to view the output. You'll also find the files by clicking on the Jupyter icon at the top of this window.

1. Text: Note that this script is more detailed than our initial `print(pytesseract.image_to_string(Image.open("sessionlawsresol1955nort_0057.jpg"), lang="eng"))`. Read through the comments (#) below to learn what each step does.

In [None]:
# Name the image file. (Assign it to a variable.)
# You can change the filename in quotes below to OCR a different file.
file = "sessionlawsresol1955nort_0057.jpg"

# Open the file named above. 
# While it's open, do several things:
with open(file, 'rb') as inputFile:
        
    # Read the file using PIL's Image module.
    img = Image.open(inputFile)
    
    # Run OCR on the open file.
    ocrText = pytesseract.image_to_string(img)
        
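    # Get a file name--without the extension--
    # to use when we name the output file.
    # rsplit(".", 1) splits at the last period only; strip('.jpg') would
    # instead remove any of the characters ".", "j", "p", "g" from both ends.
    fileName = file.rsplit(".", 1)[0]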

# The image file above will be closed before moving on to this line.
# The OCR'ed text has been pulled from the image and stored in
# a Python variable for us to continue to use.

# Create and open a new text file, name it to match its input file,
# declare its encoding to be UTF-8 so that it correctly outputs
# non-ASCII characters.
with open(fileName + ".txt", "w", encoding="utf-8") as outFile:
        
    # and write the OCR'ed text to the file.
    outFile.write(ocrText)

# Display a message to let us know the file has been created
# and the script successfully completed.
print(fileName, "text file successfully created.")

Open [sessionlawsresol1955nort_0057.txt](sessionlawsresol1955nort_0057.txt) to see the results.

2. PDF:

In [None]:
# Name the image file. (Assign it to a variable.)
# You can change the filename in quotes below to OCR a different file.
file = "sessionlawsresol1955nort_0057.jpg"

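# Get a file name--without the extension--
# to use when we name the output file.
# rsplit(".", 1) splits at the last period only; strip('.jpg') would
# instead remove any of the characters ".", "j", "p", "g" from both ends.
fileName = file.rsplit(".", 1)[0]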

# Run OCR on an image file and save it as a PDF object (not file)
# within Python.
pdf = pytesseract.image_to_pdf_or_hocr(file, extension='pdf')

# Create a new empty pdf.
with open(fileName + ".pdf", 'w+b') as f:
    
    # Save the PDF object to the new empty PDF file.
    f.write(pdf)

# Display a message to let us know the file has been created
# and the script successfully completed.
print(fileName, "PDF successfully created.")

Open [sessionlawsresol1955nort_0057.pdf](sessionlawsresol1955nort_0057.pdf) to see the results. *What do you notice about this PDF?* Try running a search within the file (Command+F or Control+F to open the Find search box).

3. hOCR (HTML):

In [None]:
# Name the image file. (Assign it to a variable.)
# You can change the filename in quotes below to OCR a different file.
file = "sessionlawsresol1955nort_0057.jpg"

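# Get a file name--without the extension--
# to use when we name the output file.
# rsplit(".", 1) splits at the last period only; strip('.jpg') would
# instead remove any of the characters ".", "j", "p", "g" from both ends.
fileName = file.rsplit(".", 1)[0]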

# Run OCR on an image file and save it as an HTML object (not file)
# within Python.
hocr = pytesseract.image_to_pdf_or_hocr(file, extension='hocr')

# Create a new empty HTML file. Open it in "w+b" mode.
# "w+b" is a mode that tells Python to write whatever
# data we give to a file in binary mode--meaning that 
# it will not apply any encoding or try to translate
# a non-ASCII character to an ASCII character.
with open(fileName + ".html", 'w+b') as f:
    
    # Save the hOCR object to the new empty HTML file.
    f.write(hocr)

# Display a message to let us know the file has been created
# and the script successfully completed.
print(fileName, "HTML successfully created.")

Open [sessionlawsresol1955nort_0057.html](sessionlawsresol1955nort_0057.html) to see the results. *What do you notice about this HTML file?* If you wish, save it to your Desktop and open in a text editor to view the HTML syntax.
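
TSV (tab-separated values) output is not covered by the scripts above. As a hedged sketch, the code below reads TSV text in the column layout Tesseract uses and keeps each recognized word with its confidence score. The helper name `words_from_tsv` and the sample rows are made up for illustration; with PyTesseract, the TSV text itself can be obtained from `pytesseract.image_to_data(...)`.

```python
import csv
import io


def words_from_tsv(tsv_text):
    """Return (word, confidence) pairs from Tesseract-style TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Rows describing pages, blocks, and lines have no word text.
        if text:
            words.append((text, float(row["conf"])))
    return words


# A tiny hand-made sample in the column layout Tesseract uses for TSV.
sample_tsv = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t10\t10\t200\t30\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t90\t30\t96.5\tSESSION\n"
    "5\t1\t1\t1\t1\t2\t110\t10\t60\t30\t91.0\tLAWS\n"
)

print(words_from_tsv(sample_tsv))
```

To try this on a real page, you could replace `sample_tsv` with `pytesseract.image_to_data(Image.open("sessionlawsresol1955nort_0057.jpg"), lang="eng")`. Low confidence scores can point you to words worth checking by hand.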

#### Languages

If we do not include `lang="eng"` when we run the above code, Tesseract will *assume* English. Run the following to get a list of all the language codes. [A table of these is available here.](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)

In [None]:
# Display the 3-letter codes of the languages available to this Tesseract installation.
print(pytesseract.get_languages(config=''))

*Exercises*

Try OCR'ing [this file](faust.png) ([Source](https://archive.org/details/fausteinetragodi00goet/page/n7/mode/2up)). Change the [3-letter language code](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) to match the language in that document. *To do this, you will need to modify the code below to include the correct file name and language code.*

In [None]:
# Open a specific image file, convert the text in the image to computer-readable text (OCR),
# and then print the results for us to see here.

# REPLACE THE FILE NAME with one of the sample files above. (faust.png)
# REPLACE THE LANGUAGE attribute with the correct language code(s). (deu)
print(pytesseract.image_to_string(Image.open("REPLACE-THIS-FILE-NAME.jpg"), lang="lan"))

Try OCR'ing [this file](bible.png) ([Source](https://archive.org/details/holybibleinhindi00alla)), which includes multiple languages. The syntax will be `lang="lan+gua"` -- replace `lan` and `gua` with the correct language codes. *The first language will be the "primary" language. Try changing the order of the languages to see how the output changes.*

In [None]:
# Open a specific image file, convert the text in the image to computer-readable text (OCR),
# and then print the results for us to see here.

# REPLACE THE FILE NAME with one of the sample files above. (bible.png)
# REPLACE THE LANGUAGE attribute with the correct language code(s).
print(pytesseract.image_to_string(Image.open("REPLACE-THIS-FILE-NAME.jpg"), lang="lan+gua"))
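If you find yourself combining languages often, it can help to build the `lang` value from a list and to check each code against the languages that are actually available. The following is a small sketch of that idea. `build_lang` is a hypothetical helper, and `available_languages` is a stand-in for the list returned by `pytesseract.get_languages(config='')`.

```python
def build_lang(codes, available):
    """Join language codes with "+" after checking that each one is available."""
    missing = [code for code in codes if code not in available]
    if missing:
        raise ValueError("Language data not found for: " + ", ".join(missing))
    # Order matters: the first code is treated as the primary language.
    return "+".join(codes)


# Stand-in for the list returned by pytesseract.get_languages(config='').
available_languages = ["deu", "eng", "fra", "osd"]

print(build_lang(["deu", "eng"], available_languages))
print(build_lang(["eng", "deu"], available_languages))
```

The returned string can then be passed as the `lang` argument in the cells above.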

## Before We Go
---

Post 1-2 questions that have come up for you as a result of today's lesson to the course Slack channel.

## Homework <a class="anchor" id="homework"></a>
---

Use the following code blocks to try OCR'ing various texts. You could use your own files containing digitized texts or locate files to try via [JSTOR](https://www.jstor.org/), the [Internet Archive](https://archive.org/), [Chronicling America](https://chroniclingamerica.loc.gov/) or other resources. Try texts in different languages, fonts or types, formats, layouts, etc. 

### 1. Upload your selected text(s) to a new folder in your space on Binder:

- Make sure that the texts you select are stored in an image (.jpg, .png, .tiff) format. If you have selected a text with multiple pages, make sure each page is stored in a separate file. *If you have PDF files and are not sure how to generate images from them, bring them to Lesson 02. We'll be looking at how to generate image files together during the lesson.*

- Click the "Jupyter" icon at the top of the browser. (Recommended: right click and select "Open in New Tab.")

- Above the list of files in Jupyter at the top right, click "New" and create a new folder.

- A new "Untitled Folder" will be created. Select the check box to the left of your new folder.

- Click "Rename" at the top left and give your folder a name.

- Click your new folder's name in the list of files and folders.

- In your new folder, select "Upload" in the top right and upload your chosen image file(s).

### 2. Perform OCR on your image file. 

Use the code blocks above or start fresh below. Change the language attribute to match the text's language. Try out the various settings we looked at above.

Below, make sure to replace "FOLDER_NAME/FILE_NAME.jpg" with your folder name and specific file name. For example, `sample/sessionlawsresol1955nort_0057.jpg`.

In [None]:
# Name the image file. (Assign it to a variable.)
# REPLACE THE FOLDER AND FILE NAME BELOW.
file = "FOLDER_NAME/FILE_NAME.jpg"

custom_oem_psm_config = r'--oem 3 --psm 3'

# Open the file named above. 
# While it's open, do several things:
with open(file, 'rb') as inputFile:
        
    # Read the file using PIL's Image module.
    img = Image.open(inputFile)
    
    # Run OCR on the open file.
    ocrText = pytesseract.image_to_string(img, lang="eng", config=custom_oem_psm_config)
        
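    # Get a file name--without the extension--
    # to use when we name the output file.
    # rsplit('.', 1) splits once at the last period, so the folder name
    # stays intact and .png or .tiff files work too.
    fileName = file.rsplit('.', 1)[0]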

# The image file above will be closed before moving on to this line.
# The OCR'ed text has been pulled from the image and stored in
# a Python variable for us to continue to use.

# Create and open a new text file, name it to match its input file,
# declare its encoding to be UTF-8 so that it correctly outputs
# non-ASCII characters,
with open(fileName + ".txt", "w", encoding="utf-8") as outFile:
        
    # and write the OCR'ed text to the file.
    outFile.write(ocrText)

# Display a message to let us know the file has been created
# and the script successfully completed.
print(fileName, "text file successfully created.")
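When your image sits inside a folder, or has an extension other than `.jpg`, it is worth checking how the output file name gets built. The sketch below uses `os.path.splitext` from the standard library, which removes only the final extension and leaves the folder name alone. `output_name` is a hypothetical helper, and the second path is an invented example. The last line shows why `str.strip` is risky for this job: it removes a *set of characters* from both ends of the string, not a suffix.

```python
import os


def output_name(image_path, new_extension):
    """Swap the extension of image_path for new_extension (for example ".txt")."""
    base, _old_extension = os.path.splitext(image_path)
    return base + new_extension


print(output_name("sample/sessionlawsresol1955nort_0057.jpg", ".txt"))
print(output_name("pages/page_01.png", ".txt"))

# str.strip treats ".jpg" as the characters ".", "j", "p", and "g",
# so the leading "p" of the folder name is removed as well.
stripped = "pages/page_01.jpg".strip(".jpg")
print(stripped)
```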

### 3. Repeat steps 1-2 with other files.
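If you have many files, repeating the steps by hand gets tedious. One possible approach is to loop over every image in a folder and write one text file per image. The sketch below assumes that approach. `ocr_folder` and `IMAGE_EXTENSIONS` are hypothetical names, and the OCR step is passed in as a function so you can decide which language and settings to use.

```python
from pathlib import Path

# Image formats to look for; adjust this set to match your own files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def ocr_folder(folder, ocr_function):
    """OCR every image in folder and write one .txt file next to each image.

    ocr_function takes the path of an image and returns its text as a string.
    Returns the list of text files that were written.
    """
    written = []
    # sorted() builds the full list first, so files created in the loop are not revisited.
    for image_path in sorted(Path(folder).iterdir()):
        # Skip anything that is not one of the image formats listed above.
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue

        text = ocr_function(image_path)

        # Name the output after the image, replacing only the extension.
        text_path = image_path.with_suffix(".txt")
        text_path.write_text(text, encoding="utf-8")
        written.append(text_path)
    return written
```

In this notebook, you could call it as `ocr_folder("FOLDER_NAME", lambda path: pytesseract.image_to_string(Image.open(path), lang="eng"))`, replacing the folder name and language code with your own.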

## Resources <a class="anchor" id="resources"></a>
---

### Jupyter Notebooks Tutorials & Reference

- [Tutorial: Getting Started with Jupyter Notebooks](https://ithaka.github.io/tdm-notebooks/getting-started-with-jupyter.html)
- [Jupyter Notebooks documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html)
- Jupyter Notebooks keyboard shortcuts: press Esc+H to show a full list.


- ["Markdown for Jupyter notebooks cheatsheet."](https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet) *IBM Watson Studio Local.*
- Olivia Smith. ["Markdown in Jupyter Notebook."](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook) *DataCamp.*


### Readings on OCR

- Algun, Selcuk. 2018. "Review for Tesseract and Kraken OCR for text recognition." *Data Driven Investor.*
- Bakker, Rebecca. ["OCR for Digital Collections."](https://digitalcommons.fiu.edu/cgi/viewcontent.cgi?article=1047&context=glworks) *FIU Digital Commons.*
- Baumann, Ryan. ["Automatic evaluation of OCR quality."](https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html) */etc.*
- Cordell, R. 2017. ["'Q i-jtb the Raven': Taking Dirty OCR Seriously."](https://ryancordell.org/research/qijtb-the-raven/) *Book History*, 20, 188-225.
- Cordell, Ryan. 2019. ["Why You (A Humanist) Should Care About Optical Character Recognition."](https://ryancordell.org/research/why-ocr/) *Ryan Cordell.* 
- Coyle, Karen. ["Digital Urtext."](https://kcoyle.blogspot.com/2012/04/digital-urtext.html) *Coyle's InFormation.*
- Hawk, Brandon W. ["OCR and Medieval Manuscripts: Establishing a Baseline."](https://brandonwhawk.net/2015/04/20/ocr-and-medieval-manuscripts-establishing-a-baseline/) *Brandon W. Hawk.* (This post is a comparison of ABBYY FineReader & Adobe Acrobat OCR technologies as applied to medieval texts.)
- Holley, Rose. 2009. ["How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs."](http://www.dlib.org/dlib/march09/holley/03holley.html) *D-Lib Magazine* 15, no. 3/4.
- Milligan, I. 2013. ["Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010."](https://www.muse.jhu.edu/article/527016) *The Canadian Historical Review* 94(4), 540-569.
- Smith, David, and Ryan Cordell. 2018. ["A Research Agenda for Historical and Multilingual Optical Character Recognition."](http://hdl.handle.net/2047/D20297452)
- Smith, Ray. 2007. ["An Overview of the Tesseract OCR Engine."](https://tesseract-ocr.github.io/docs/tesseracticdar2007.pdf)
- Smith, Ray, Daria Antonova, and Dar-Shyang Lee. 2009. ["Adapting the Tesseract open source OCR engine for multilingual OCR."](https://dl.acm.org/doi/10.1145/1577802.1577804) MOCR '09: Proceedings of the International Workshop on Multilingual OCR.
- Tanner, Simon. ["Deciding whether Optical Character Recognition is feasible."](https://www.kb.nl/sites/default/files/docs/OCRFeasibility_final.pdf) *King's Digital Consultancy Services.*


### OCR Tutorials & Reference

*The following is a list of tutorials that include different scholars' approaches to OCR. Some also use Tesseract, but most use different scripting or programming languages. There is no single best way to do OCR, so if you have the time, they are worth trying to see which works best for your project.*

- Aidan. ["OCR with Python."](https://medhieval.com/classes/hh2019/blog/ocr-with-python/) *Hacking the Humanities 2019.*
- Akhlaghi, Andrew. ["OCR and Machine Translation."](http://programminghistorian.org/en/lessons/OCR-and-Machine-Translation) *The Programming Historian.* (Note that this tutorial uses Tesseract but works with the bash scripting language instead of Python.)
- Baumann, Ryan. ["Command-Line OCR with Tesseract on Mac OS X."](https://ryanfb.github.io/etc/2014/11/13/command_line_ocr_on_mac_os_x.html) */etc.*
- Dull, Joshua. ["Text Recognition with Adobe Acrobat and ABBYY FineReader."](https://github.com/JoshuaDull/Text-Recognition-Introduction/)
- Graham, Shawn. ["Extracting Text from PDFs; Doing OCR; all within R."](https://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/) *Electric Archaeology.* (This blog post describes a method for OCR using the R programming language.)
- Mähr, Moritz. ["Working with batches of PDF files."](https://programminghistorian.org/en/lessons/working-with-batches-of-pdf-files) *The Programming Historian.* (Note that this tutorial uses Tesseract and works in the command line without Python.)
- Shperber, Gidi. ["A gentle introduction to OCR."](https://towardsdatascience.com/a-gentle-introduction-to-ocr-ee1469a201aa) *Towards Data Science.* October 22, 2018.
- Tarnopol, Rebecca. ["How to OCR Documents for Free in Google Drive."](https://business.tutsplus.com/tutorials/how-to-ocr-documents-for-free-in-google-drive--cms-20460) *TutsPlus.*


- [**PyTesseract documentation**](https://github.com/madmaze/pytesseract)
- [**Tesseract documentation**](https://tesseract-ocr.github.io/)


### Additional Reading

- Rockwell, Geoffrey, and Stéfan Sinclair. 2016. [*Hermeneutica: Computer Assisted Interpretation in the Humanities.*](http://hermeneuti.ca/)
- Underwood, Ted. ["The challenges of digital work on early-19c collections."](https://tedunderwood.com/2011/10/07/the-challenges-of-digital-work-on-early-19c-collections/) *The Stone and the Shell.*
- [TranScriptorium's handwritten text recognition project results.](https://cordis.europa.eu/project/id/600707/results)
- ["How to Transcribe Documents with Transkribus - Introduction."](https://readcoop.eu/transkribus/howto/how-to-transcribe-documents-with-transkribus-introduction/) *Read Coop.*

- **[Basics of Fair Use](https://copyright.columbia.edu/basics/fair-use.html)** from Columbia University.


### Additional Tutorials -- Thanks to everyone who contributed!

- ["Social Network Analysis from Theory to Applications with Python."](https://towardsdatascience.com/social-network-analysis-from-theory-to-applications-with-python-d12e9a34c2c7) *Towards Data Science.*
- [TAP Institute Pandas course materials](https://nkelber.github.io/tapi2021/book/courses/pandas.html) -- for "cleaning" data using Python.
- ["Cleaning Data with Open Refine."](https://programminghistorian.org/en/lessons/cleaning-data-with-openrefine) *The Programming Historian.*
- [Working with Files in Python](https://automatetheboringstuff.com/chapter8/) from *Automate the Boring Stuff*.