<img align="left" src="../All-sample-files/CC_BY.png"><br />

Created by [Hannah Jacobs](http://hannahlangstonjacobs.com/) for the [2021 Text Analysis Pedagogy Institute](https://nkelber.github.io/tapi2021/book/intro.html).

Adapted by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
____

# Creating an OCR Workflow (Pre-Processing)

These notebooks describe how to turn images and/or pdf documents into plain text using Tesseract optical character recognition. The goal of this notebook is to help users design a workflow for a research project.

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion time:** 90 minutes

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](../Python-basics/python-basics-1.ipynb))
* [Optical Character Recognition Basics](./ocr-basics.ipynb)

**Knowledge Recommended:**

**Data Format:** 
* image files (.jpg, .png)
* document files (.pdf)
* plain text (.txt)

**Libraries Used:**
* [Tesseract](https://tesseract-ocr.github.io/) for performing optical character recognition.
* [poppler](https://github.com/cbrunet/python-poppler) for working with pdf files
* [pdf2image](https://pdf2image.readthedocs.io/en/latest/) for converting pdf files into image files

**Learning Objectives:**

1. Describe and implement an OCR workflow for pre-processing
2. Explain the importance of performing adjustments (pre-processing) to inputs before running OCR
3. Identify possible technical challenges presented by specific texts and propose potential solutions

**Research Pipeline:**

1. Digitize documents
2. **Optical Character Recognition**
3. Tokenize your texts
4. Perform analysis
___

## A Full OCR Workflow

In addition to examining your documents and tools, you also need to carefully consider issues of time, labor, and funding. Is this project small enough for a single person to complete? How many labor-hours will it take? How many computing hours? As you complete each step, keep in mind how long certain processes take. You may need to make some hard decisions about how much to do, how accurate your text will be, or whether the project is even feasible without more funding and support. It is common for OCR project planners to greatly underestimate the necessary time, so leave a generous cushion for budget and labor-hour overruns.

The full OCR workflow will look something like this:

1. Digitize
    * Acquire materials
    * Photograph (at high-resolution using archival format, such as tiff/jpeg2000)
    * Quality check (for missing pages, blurry scans, etc.)
    * Organize 
    * Archive (into a long-term digital repository)
2. Pre-processing (prepare image files)
    * Convert files (to a compatible image format)
    * Organize files (into folders by volume)
    * Image correction (adjust skew, warp, noise, rotation, scale, layout order, etc.)
    * Quality check
3. OCR batch processing
4. Post-processing (quality assessment)
    * Dictionary assessment
    * Random sample assessment
5. Archive
    * Choosing a repository
    * Data and metadata format
    * Backup and hashing
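
For the "Backup and hashing" step, one possible approach is to record a checksum for every file so that later copies can be verified against the originals. The sketch below uses Python's standard `hashlib`; the helper names `sha256_of_file` and `build_manifest` are illustrative and not part of any tool named in this notebook.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(folder):
    """Map each file's path (relative to folder) to its SHA-256 digest."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): sha256_of_file(p)
        for p in sorted(folder.rglob('*'))
        if p.is_file()
    }
```

Building the manifest once before archiving and again after restoring a backup lets you compare the two dictionaries; any file whose digest differs has changed or been corrupted.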

This notebook focuses on the OCR preprocessing and the next notebook will focus on post-processing. Beyond these two processes, the digitization and archiving steps take significant consideration, time, expertise, and effort. Ideally, these processes should be completed by experts in each domain.

The path outlined here is linear. In practice, however, many of these steps loop back on one another. As problems are discovered, the workflow will need to be adapted and improved. Again, leave a cushion for budget and labor-hour overruns; you will find problems that were not obvious at the beginning of the process. For large projects with limited budgets, you will need to set goals for your accuracy and speed. Be ready to make compromises. If you're pursuing grant funding, consider separating steps 1 and 2 into two different applications.

**A note on digitizing your own corpus:**
If you're doing the scanning yourself or will be working with someone to digitize materials, it's a good idea to carefully plan your scanning process. Every step matters in terms of generating the best possible OCR results. [Digital NC](https://www.digitalnc.org/) have posted their digitization guidelines along with descriptions of their scanning equipment. These can provide a helpful starting point if you will be beginning your project with undigitized materials.
___

## Opening questions for your OCR workflow

1. [How much text?](#how-much)
2. [Born-digital or digitized?](#born-digital)
3. [Hand-written manuscript or printed using a press?](#hand-written)
4. [Text formatting](#formatting)
5. [Text condition](#text-condition)
6. [Image quality](#image-quality)
7. [Historical script](#historical-script)
8. [Language support](#language-support)


### How much text? <a id="how-much"></a>
We begin with this question because if you have only a few pages, there may be merit in typing them out by hand in a text editor, and perhaps working with a team to do so. If you have hundreds of thousands of pages, though, it will take far longer than you have time for, even working with a team, to manually transcribe every page you need to complete a project. For large projects, you'll want to start with an automated transcription (OCR) process and then work to correct what the computer outputs.

### Born-digital or digitized?<a id="born-digital"></a>
In most cases, born-digital texts in PDF and image formats are easier for a computer to "recognize" than scanned documents, even if the scanners use the highest resolution equipment. This is particularly true of older printed texts with unique scripts or layouts.

An exception to this is a born-digital text stored in an image or other non-text-editable format that is uncommon, proprietary, or outdated. Computers may then have a hard time opening the file in order to parse the text it contains. (So always save documents in an interoperable file format, one that can be opened by different software programs, either as [editable text](https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html#textualdata) or as a [non-editable image or archival document (PDF) format](https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html#scannedtext).)

### Hand-written manuscript or printed using a press?<a id="hand-written"></a>
OCR technologies were initially developed to deal only with digitized texts created using a [printing press](https://en.wikipedia.org/wiki/Printing_press). This was because printing presses offer a certain amount of consistency in typeface, font, and layout that programmers could use to create rules for computers to follow (algorithms!). 

Meanwhile, handwriting is, by and large, more individualistic and inconsistent. Most programs for OCR still focus only on printed texts, but there are a growing number of projects and toolkits now available for what's called variously ["digital paleography"](https://academic.oup.com/dsh/article/32/suppl_2/ii89/4259068), ["handwriting recognition" (HWR)](https://en.wikipedia.org/wiki/Handwriting_recognition), and ["handwritten text recognition" (HTR)](https://en.wikipedia.org/wiki/Handwriting_recognition). [Transkribus](https://readcoop.eu/transkribus/) is a popular example.

As an example, let's compare excerpts from Toni Morrison's *Beloved*. The first image below is a page from an early draft, written in Morrison's own hand on a legal pad. The second image is a segment from a digitized print version. These are not the same passage, but reading them is a noticeably different experience. Try reading each. What's different? Think about order of reading, ease of reading, and any other differences that come to mind:

![A page from Toni Morrison's early draft of Beloved. Courtesy of Princeton University Library](../All-sample-files/07-ocr-03.jpeg)


**An early draft of Toni Morrison's *Beloved*. Image credit: [Princeton University Library](https://blogs.princeton.edu/manuscripts/2016/06/07/toni-morrison-papers-open-for-research/)**


![Screenshot of a page in Toni Morrison's Beloved. Preview hosted on Google Books.](../All-sample-files/07-ocr-02.jpeg)

**Screenshot from a digitized version of the published *Beloved*, available in [Google Books](https://www.google.com/books/edition/Beloved/sfmp6gjZGP8C?hl=en&gbpv=1&dq=toni+morrison+beloved&printsec=frontcover).**

### Text formatting?<a id="formatting"></a>
Look at the texts above again. How are they formatted similarly or differently? While both use a left-to-right writing system, the printed version appears in a single column that is evenly spaced both horizontally and vertically. The manuscript text appears on lined paper in a single column, but it includes a number of corrections written between lines or even in different directions (vertically) on the page. You might have tilted your head to read some of that text—if you had been holding the paper in your hands, you might have turned the paper 90 degrees. But computers don't necessarily know to do that (yet). They need a predictable pattern to follow, which the printed text provides.

That said, not all historical printings are as regular as this *Beloved* excerpt. Let's take a look at one more example from *On The Books*:

![Screenshot from the 1887 North Carolina session laws digitized by UNC Libraries and shared via the Internet Archive.](../All-sample-files/07-ocr-04.jpeg)

**Screenshot from the 1887 North Carolina session laws digitized by UNC Libraries and shared via the Internet Archive.**

Like the printed *Beloved* example, this selection from the [1887 North Carolina session laws](https://archive.org/details/lawsresolutionso1887nort/page/776/mode/2up) was created using a printing press and with mostly even vertical spacing between lines that run left to right. However, in addition to the changing typeface, there is a main column of text next to a much smaller column of ["marginalia"](https://en.wikipedia.org/wiki/Marginalia) annotations. These annotations were created to aid readers who would have been looking for quick topical references rather than reading a volume from start to finish. These created a problem for the *On The Books* team because the computer read them as being part of the main text. What resulted (with other OCR errors removed) would have looked like:

`SECTION 1. The Julian S. Carr, of Durham, North Carolina, Mar- Body politic. cellus E. McDowell, Samuel H. Austin, Jr., and John A. McDowell,`

What's the problem here? The marginalia, `Body politic`, has been interspersed with the text as the computer "reads" all the way across the page. The line should read:

`SECTION 1. The Julian S. Carr, of Durham, North Carolina, Mar-cellus E. McDowell, Samuel H. Austin, Jr., and John A. McDowell,`

The computer doesn't realize that it's creating errors, and if the annotations are not in any way misspelled, the *On The Books* team might have a hard time finding and removing all of these insertions. The insertions might then have also caused major difficulties in future computational analyses.

Because marginalia would have caused such havoc in their dataset, the *On The Books* team decided to remove the marginalia as part of preparing for OCR. You can [find the documentation about this in the team's Github](https://github.com/UNC-Libraries-data/OnTheBooks/tree/master/examples/marginalia_determination).

### Text condition?<a id="text-condition"></a>
Even with the use of [state of the art scanning equipment](https://www.digitalnc.org/about/what-we-use-to-digitize-materials/), extraneous marks (from annotations to document stains) can interfere with OCR. Here are some examples.

*Someone writing on a printed text.* These check marks might be read as "l" or "V" by the computer:

![Screenshot of check marks written in 1887 North Carolina sessions law digitized by UNC Libraries and shared via the Internet Archive.](../All-sample-files/07-ocr-05.jpeg)

`not be worked on said railroad in the counties of New l Hanover or Pender.`

*Faded printed text.* The ink has faded so that individual characters are broken up and harder to read. (Historic newspapers are notorious for this. [Here's an example](https://chroniclingamerica.loc.gov/lccn/sn85042104/1897-01-14/ed-1/seq-6/#date1=1890&index=2&rows=20&words=asylum+ASYLUM+Asylum&searchType=basic&sequence=0&state=North+Carolina&date2=1910&proxtext=asylum&y=0&x=0&dateFilterType=yearRange&page=1).)

![Screenshot of faded text printed in 1887 North Carolina sessions law digitized by UNC Libraries and shared via the Internet Archive.](../All-sample-files/07-ocr-06.jpeg)

`three hundred dollars' t\"Orth of property and the same arnouut on each poll, which shall constitute and be held for a S€1'.)arate fund,`

A *smudge, spot, or spill has appeared on the page*, causing the computer to misinterpret a character or erroneously add characters.

One more possibility is **text that is rotated slightly** on the digitized page, so that it appears at a slight angle. This can result from tight binding, or from the person doing the scanning taking care not to break a tight or damaged binding.

![Screenshot of tilted text in 1887 North Carolina sessions law digitized by UNC Libraries and shared via the Internet Archive.](../All-sample-files/07-ocr-08.jpeg)

### Image Quality<a id="image-quality"></a>

The general rule is that the higher the quality of the digitization, the better the OCR. One place to begin is image resolution: the number of pixels per *inch*. **In an ideal world, you will start with images that were scanned at 300 ppi or better.** Remember that computers present images as a grid of pixels, usually squares but sometimes rectangles, each of which carries specific color information. Put hundreds, thousands, or millions of pixels together, and we have an image.

![Screenshot of text stored in an image format from a page of North Carolina laws](../All-sample-files/07-ocr-01.jpeg)

A common way for computer programmers to measure image quality is by assessing the number of pixels per inch (ppi). This matters for several reasons: a photographer will want to keep the number of pixels high (perhaps 300 ppi) in preparation for printing, but a web designer will want a much lower number (72 ppi) to keep an image looking crisp while also keeping file sizes small to avoid slowing down webpage loading time. If you've ever opened a webpage and seen text but had to wait a few seconds for images to load, you've seen the difference between how long it takes for text vs. an image to load. The more pixels, the larger the file (in kilobytes, megabytes, or even gigabytes), and large files take longer to move from a server to your computer. Add in low-bandwidth internet, and the load time grows even longer.

So, what's the difference? Let's look:

![An image of the letter S at 72 ppi and 300 ppi.](../All-sample-files/ppi-comparison.png)
The left image shows a scanned letter S at 72 ppi. The visible squares represent individual pixels. Note that each pixel represents one color from the page, and there is a transition between pixels representing ink and those representing paper. 

The right image is the same letter S rescaled to 300 ppi. The squares here appear smaller because there are far more of them. Note that instead of there being only a line 1-2 pixels wide making up the S shape, there are far more for Tesseract to "read" and interpret.

[Per its documentation](https://tesseract-ocr.github.io/tessdoc/ImproveQuality), Tesseract works best with an image resolution of 300 ppi. The documentation actually uses "dpi", or [dots per inch](https://en.wikipedia.org/wiki/Dots_per_inch). If you're beginning your project by scanning materials, this unit will be important when you set up your scanner, but once you move into image processing, we're dealing with [pixels per inch](https://en.wikipedia.org/wiki/Pixel_density). These are not the same, but many people use dpi and ppi interchangeably.
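
To make the arithmetic concrete: pixel dimensions equal physical dimensions multiplied by ppi. The helpers below are a minimal sketch (the function names are made up for illustration) for estimating how many pixels a page needs at a target resolution, and what effective ppi an existing image has if you know the physical width of the original page.

```python
def pixels_needed(width_in, height_in, ppi=300):
    """Pixel dimensions needed to capture a page of the given size (in inches)."""
    return round(width_in * ppi), round(height_in * ppi)

def effective_ppi(pixel_width, page_width_in):
    """Approximate ppi of an existing image, given the page's physical width."""
    return pixel_width / page_width_in

# A US Letter page (8.5 x 11 inches) at 300 ppi
print(pixels_needed(8.5, 11))    # (2550, 3300)

# A 612-pixel-wide image of the same page is only 72 ppi
print(effective_ppi(612, 8.5))   # 72.0
```

With Pillow, `Image.open(path).size` returns the pixel dimensions. Some files also record a resolution in `img.info.get('dpi')`, but that metadata is often missing or unreliable, so measuring against the known page size tends to be safer.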

### Historical script?<a id="historical-script"></a>
This applies mainly to students and scholars working with *historical texts printed or written in scripts that are not commonly legible to humans (or computers) today*. These could be anything from medieval scripts like [Carolingian minuscule](https://en.wikipedia.org/wiki/Carolingian_minuscule) to neogothic scripts used in [twentieth-century German-American newspapers](https://chroniclingamerica.loc.gov/lccn/sn84027107/1915-07-01/ed-1/seq-1/) to the many, many historic non-Western scripts. These are areas where research is in progress, but you might find this [Manuscript OCR](https://manuscriptocr.org/) tool of interest as well as this [essay on the challenges medievalists continue to face when using OCR technologies](http://digitalhumanities.org/dhq/vol/13/1/000412/000412.html). When choosing an OCR tool, this is one of the capabilities you'll want to check for.

### Language support?<a id="language-support"></a>
Similar to the historic script issue, for scholars and students working with or studying *less common, perhaps endangered, and especially non-Western languages*, you'll want to see if an OCR tool supports your particular language. Tesseract offers [a list of the languages and scripts it supports](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html). Tesseract supports 125+ languages and dialects--likely those most commonly spoken, based on shared [writing systems](https://en.wikipedia.org/wiki/Writing_system), and/or those that researchers may have invested time in training Tesseract to "read" for some specific reason. This is just a fraction of the languages and scripts in the world, though. 

Unfortunately, if you're working with Indigenous writing systems such as [Canadian Aboriginal Syllabics](https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics), you still may need to seek out additional support from computer scientists for developing OCR technologies to "read" these languages.

# Convert files

Tesseract prefers image files. If you are starting from a PDF or a bunch of PDFs, here are a few ways you can convert each page into a separate image file:

- [Use Adobe online](https://www.adobe.com/acrobat/online/pdf-to-jpg.html) (1 pdf at a time...)
- [Use Adobe Acrobat](https://helpx.adobe.com/acrobat/using/exporting-pdfs-file-formats.html?mv=product) (1 pdf at a time...)
- [Use pdf2image](https://pypi.org/project/pdf2image/) (1 pdf or many)


**Note:** Technically, it's possible to feed Tesseract a PDF, but breaking up a PDF into images breaks down the OCR process from one massive task into a bunch of smaller tasks that are better for your computer -- if something happens, and the process is interrupted, you'll be able to pick up from where you left off if you are working from images. If you are processing an entire PDF and your computer freezes, you'll need to start over from the beginning.
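
Picking up where you left off can be as simple as skipping any image that already has a text file. The following is a minimal sketch, assuming your OCR step writes one `.txt` file per image with the same file stem; the folder layout and the `pages_remaining` helper are hypothetical.

```python
from pathlib import Path

def pages_remaining(image_folder, text_folder, pattern='*.jpg'):
    """Return image paths (sorted) that do not yet have a matching .txt file."""
    done = {p.stem for p in Path(text_folder).glob('*.txt')}
    return [p for p in sorted(Path(image_folder).glob(pattern))
            if p.stem not in done]
```

If the loop that runs OCR writes each text file as soon as its page is finished, re-running the same loop over `pages_remaining(...)` after an interruption only processes the pages that are still missing.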

## Convert a pdf to image files using pdf2image
First, let's create a folder to hold our sample pdfs and then download them into the folder.

We can use pdf2image to convert a pdf file into a set of image files. [(See pdf2image documentation)](https://github.com/Belval/pdf2image). Let's try converting a single pdf file first: `sample_01.pdf`.

In [None]:
### Convert a single PDF into a series of image files ###

# Import pdf2image's convert_from_path function.
from pdf2image import convert_from_path
# Import pathlib's Path class.
from pathlib import Path

# Define where the images will be saved.
# If the folder does not exist yet, create it.
data_folder = Path('../All-sample-files/sample_pdfs/pdf_images/')
data_folder.mkdir(exist_ok=True, parents=True)

# Get the PDF and convert to a group of PIL (Pillow) objects
# This does NOT save the images as files.
document_path = Path('../All-sample-files/sample_01.pdf')
PIL_objects = convert_from_path(document_path)

# For each PIL image object:
for page, image in enumerate(PIL_objects):

    # Create a file name from the page number (starting at 0)
    # and the file extension.
    fileName = f'{data_folder.as_posix()}/image_{str(page)}.jpg'

    # Save each PIL image object using the file name created above
    # and declare the image's file format. (Try also PNG or TIFF.)
    image.save(fileName, 'JPEG')

# Success message
print('PDF converted successfully')

Now, let's try multiple pdfs. This will require a more complicated file structure. Here, we create a new folder of images for each pdf.

In [None]:
### Convert multiple pdfs into a set of image files ordered by folder ###

# Import pdf2image's convert_from_path function.
from pdf2image import convert_from_path
# Import pathlib's Path class.
from pathlib import Path

# Define where the input pdfs are stored
pdfs_folder = Path('../All-sample-files/')

# For each pdf file in the pdf folder, do the following:
for pdf in pdfs_folder.rglob('*.pdf'):
    # Announce the current working file
    print(f'Converting {pdf.name}')
    
    # Create a folder for the images
    pdf_images = Path(f'{pdfs_folder.as_posix()}/{pdf.stem}_images')
    pdf_images.mkdir(exist_ok=True)
    
    # Get the PDF and convert to a group of PIL (Pillow) objects
    # This does NOT save the images as files.
    document_path = Path(pdf)
    PIL_objects = convert_from_path(document_path)

    # For each PIL image object:
    for page, image in enumerate(PIL_objects):

        # Create a file name from the page number, zero-padded to four
        # digits (e.g. image_0007.jpg) so that files sort in page order.
        fileName = f'{pdf_images.as_posix()}/image_{page:04d}.jpg'

        # Save each PIL image object using the file name created above
        # and declare the image's file format. (Try also PNG or TIFF.)
        image.save(fileName, 'JPEG')

print('All PDFs converted to images.')
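
After converting, a quick quality check is to confirm that no page numbers are missing from a folder. This sketch assumes the `image_NNNN.jpg` naming used above, with page numbers starting at 0; `missing_pages` is an illustrative helper, not part of pdf2image.

```python
import re
from pathlib import Path

def missing_pages(image_folder, expected_count):
    """Return the page numbers in range(expected_count) that have no image file."""
    found = set()
    for p in Path(image_folder).glob('image_*.jpg'):
        match = re.fullmatch(r'image_(\d+)', p.stem)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(range(expected_count)) - found)
```

The expected count could come from the length of the list that `convert_from_path` returned for that PDF. An empty list back from `missing_pages` means every page has an image.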


# Image correction<a id="image-correction"></a>

This section introduces the most common types of image correction. The only way to discover the exact type and number of image corrections needed for your text is to try a sample of your documents. You want to create a sample that is diverse. Ideally, you would choose a large random sample of images, but it may also be worthwhile to hand-pick some examples (perhaps you know a particular volume has issues with spotting, rotation, or skewing?). The goal is a single set of image corrections that can be applied to any image and give a satisfactory result that is ready for OCR processing. In practice, you may have a custom set of operations for particular volumes that have unique problems. Depending on your images, more or fewer corrections may be necessary.
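
A reproducible random sample can be drawn with Python's standard `random` module. The following is a minimal sketch; the `sample_images` helper, the `*.jpg` pattern, and the fixed seed are illustrative choices rather than requirements.

```python
import random
from pathlib import Path

def sample_images(folder, k, seed=42, pattern='*.jpg'):
    """Return up to k randomly chosen image paths from folder and its subfolders."""
    paths = sorted(Path(folder).rglob(pattern))
    rng = random.Random(seed)  # a fixed seed lets you redraw the same sample later
    return sorted(rng.sample(paths, min(k, len(paths))))
```

Hand-picked problem pages can then be appended to the list this returns, so the sample covers both typical and known-difficult images.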

This work requires trial-and-error to figure out the best adjustments and OCR settings for your corpus. The general discovery of the best method will resemble:

1. **Create a folder of sample text from your corpus.** The size of the sample may depend on the corpus' size and homogeneity or heterogeneity, but it should be an amount that you and/or your team could review manually in a reasonably short period of time.
2. **Look for potential issues & needed adjustments.** Issues may include skewed or rotated text, faded text, smudges or damage to the page, etc.
3. **Run OCR on your sample.**
4. **Review the output** to identify errors, looking especially for error patterns that could be addressed at a corpus level. 
5. **Create a list of errors and possible adjustments** that you might use to address the errors. Order the list based on which errors should be solved first, that is, which fixes might address the largest number of errors. For example, it would be more important to fix rotated or skewed pages across the sample/corpus before trying to use erosion or dilation to make specific pages more legible to Tesseract. 
6. **Make the first adjustment** on your list to the sample.
7. **Re-run OCR on your sample.**
8. **Review the output.** Has the output improved noticeably? Are there still errors and error patterns? 
9. **Repeat some or all of the above steps:** Depending on your findings, you might continue applying adjustments from your list, re-running OCR, and reviewing outputs, or you might be ready to move on to the next step.
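
When reviewing output after an adjustment, a number can help alongside reading the text by eye. One rough option, if you have hand-corrected a few sample pages, is to compare each OCR output against its corrected transcription. The sketch below uses `difflib` from the standard library; its similarity ratio is only a rough stand-in for a proper character-accuracy measure, and the `after` string is a hypothetical improved output, not a real result.

```python
from difflib import SequenceMatcher

def similarity(ocr_text, reference_text):
    """Return a 0-1 similarity ratio between OCR output and a hand-corrected reference."""
    matcher = SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    return matcher.ratio()

reference = "and the same amount on each poll"
before = "and the same arnouut on each poll"   # OCR errors from the faded-text example
after = "and the same amount on each poll"     # hypothetical output after an adjustment

print(round(similarity(before, reference), 3))
print(similarity(after, reference))  # identical strings give 1.0
```

Averaging this ratio over the whole sample before and after each adjustment gives a simple way to tell whether a change helped, hurt, or made no difference.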

These are common pre-processing tasks with example code offered here:

* rotation
* inversion
* grayscale and binarization

Depending on your texts you may need to use additional steps. We discuss some of these here and supply links to additional resources, but they are beyond a beginner's introduction to OCR:

* Cropping
* Noise Removal
* Dilation and Erosion
* Identifying Layout and Text Order

## Rotation

It is common for image scans to have a slight rotation or skew. The result is that the lines of text slope from left to right, either ascending or descending.

![An image of a skewed page from On the Books](../All-sample-files/skewed_page.png)

*Why do errors occur when reading a rotated or skewed text?* Tesseract has been programmed to expect to "read" a language in the same way a human would. We read English left to right and from the top of a page down. Although we are able to parse text even when viewing it at an angle (maybe you can even read text upside down), Tesseract doesn't do this well. It will still attempt to read a rotated line from left to right--it won't know to follow the text as it slants down or up. So it returns its interpretation of the letters that fall within its line of "sight." This is why, particularly with rotated texts, we may receive symbols and other unexpected characters.

We can assess the angle of rotation using some example code from the [On the Books](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/examples/marginalia_determination/marginalia_determination.ipynb) project. The `rotation_angle()` and `find_score()` functions will help determine the level of rotation found in a given page:

>The rotation_angle function accepts an original image variable as an argument. The image is then converted to a binarized numpy pixel array (bin_img). The function then defines a list of angles from -1 to 1 in intervals of .25 degrees. The loop iterates through this list, sending each angle along with bin_img to another crop function, find_score.

>find_score evaluates rotations of bin_img according to each of the angles in the above list. It does so first by rotating the pixel array by the given angle. The rotated array (data) is then converted to a vertical histogram of pixel 'counts'. These values represent the number of non-zero pixels found along the horizontal axis for a given vertical axis coordinate. The function then calculates score, which is maximized for histograms with the most definitive 'peaks' and 'valleys'. A histogram with the most notable 'peaks' and 'valleys' is indicative of a text image with lines that run parallel to the horizontal axis. Thus, rotation angles that achieve higher scores have achieved more success in 'straightening' the original image.

>Then after finding scores for each angle in the list, rotation_angle returns the angle with the highest score - the angle that will best straighten an image given its particular degree of skew.

In [None]:
### Find the optimum rotation angle ###

## Based on code by Lorin Bruckner for On the Books.
## Lorin Bruckner derived find_score and rotation_angle from:
## https://avilpage.com/2016/11/detect-correct-skew-images-python.html
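## Adjust delta and limit in order to increase/decrease accuracy and speed.
## (The description above mentions angles from -1 to 1 in steps of .25;
## this version searches from -10 to 10 degrees in steps of .5.)
from PIL import Image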
import numpy as np
import scipy.ndimage

def find_score(arr, angle):
    """Determine score for a given rotation angle.
    """
    data = scipy.ndimage.rotate(arr, angle, reshape=False, order=0)
    hist = np.sum(data, axis=1)
    score = np.sum((hist[1:] - hist[:-1]) ** 2)
    return hist, score

def rotation_angle(img):
    """Determine the best angle to rotate the image to remove skew.
    
    Parameters:
    img (PIL.Image.Image): Image
    
    Returns:
    (float): Angle in degrees
    """
    wd, ht = img.size
    pix = np.array(img.convert('1').getdata(), np.uint8)
    bin_img = 1 - (pix.reshape((ht, wd)) / 255.0)
   
    delta = .5 # The number of degrees between each rotation
    limit = 10 # The limit of the rotations in each direction
    angles = np.arange(-limit, limit+delta, delta)
    scores = []
    for angle in angles:
        hist, score = find_score(bin_img, angle)
        scores.append(score)
    
    best_score = max(scores)
    best_angle = angles[scores.index(best_score)]
    
    return float(best_angle)
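
To see why the score rewards straight text, consider a toy example in plain Python with no images involved. Each list below stands for the histogram of ink per pixel row, and the values are made up for illustration. Crisp alternation between text rows and blank rows gives large row-to-row differences, while skewed text smears ink across neighboring rows.

```python
def histogram_score(hist):
    """Sum of squared differences between neighboring rows (the formula used in find_score)."""
    return sum((b - a) ** 2 for a, b in zip(hist, hist[1:]))

straight = [0, 10, 0, 10, 0, 10, 0]   # crisp lines of text separated by blank rows
skewed = [4, 5, 4, 5, 4, 5, 3]        # the same total ink (30) smeared over every row

print(histogram_score(straight))  # 600
print(histogram_score(skewed))    # 9
```

The straight page scores far higher, which is why `rotation_angle` keeps the candidate angle with the largest score.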

In [None]:
### Find the best rotation angle for a given image ###
from PIL import Image
from pathlib import Path

# Open the rotated image file
f = Path('../All-sample-files/rotated_sample.jpeg')
orig1 = Image.open(f)

## Use rotation_angle and find_score to determine angle
angle_1 = rotation_angle(orig1)
print(angle_1)

In [None]:
### Rotating the image based on the best angle ###
from PIL import Image

# Open the selected image
im = Image.open('../All-sample-files/rotated_sample.jpeg')

# Rotate the image
# expand=True enlarges the canvas so that no part of the page is cut off.
# fillcolor='white' fills the new corners with white; Pillow's default is black.
# Image.Resampling.BICUBIC (Pillow 9.1+) chooses bicubic resampling, a more
# accurate form of resampling than the default, nearest neighbor.
im1 = im.rotate(angle_1, Image.Resampling.BICUBIC, expand=True, fillcolor='white')

# Save the rotated image
im1.save('./fixed_rotation.jpeg')

## Inverting & Binarizing
Early OCR programs required light text on dark backgrounds to operate correctly. In recent years, many OCR programs have moved to preferring dark text on light backgrounds. This means that **inversion** is typically not an issue historians need to worry about since most printed documents are dark text on light background. There might be some exceptions to this if you are working with, for example, images of microfiche.

Python's PIL can help us handle [image inversion](https://pillow.readthedocs.io/en/latest/reference/ImageOps.html#PIL.ImageOps.invert).

In [None]:
### Invert an image file ###
from PIL import Image
from PIL import ImageOps

file_path = '../All-sample-files/ocr_sample.jpg'

# Open the original image file.
file = Image.open(file_path)

# Use the ImageOps.invert function to invert the colors in the original file.
inverted_file = ImageOps.invert(file)

# Save the newly inverted image file.
inverted_file.save("./inverted_sample.jpg")

# Success message
print(file_path, 'converted successfully.')

While **inversion** switches the colors in an image file, **binarization** converts an image so that it shows image data in only two pixel "colors": black and white. Tesseract does this as part of its OCR process, but it might be worthwhile to do this ahead of time if you're trying to reduce noise or see where, for example, a shadow on a page may introduce problems for Tesseract. The first step is converting the full-color image into a grayscale image:

In [None]:
### Converting an image into grayscale or black and white ###
from PIL import Image

file_path = '../All-sample-files/ocr_sample.jpg'

# Open the original image file.
file = Image.open(file_path)

# Use the Image.convert function to change the original file
# to grayscale or black and white.
# Use mode "L" to return an 8-bit grayscale image.
# Use mode "1" to return a true black and white image.
grayscale_file = file.convert("L")

# Save the new grayscale image file.
grayscale_file.save("./sample_grayscale.jpeg")

# Success message
print(file_path, 'converted successfully.')

A true black and white image can improve OCR results, but only if it does not introduce too much noise. One way to reduce this noise is to set a different threshold for when a pixel becomes either white or black. This can be done using another library like [OpenCV](https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html) or by using a little more code with PIL. 

In [None]:
### Setting a threshold with PIL ###
# This code is adapted from https://www.educative.io/answers/how-to-binarize-an-image-in-pillow

# Import libraries
import PIL
from PIL import Image

# Binarization function
def binarize(img, thresh):

  #convert image to greyscale
  img=img.convert('L') 

  width,height=img.size

  #traverse through pixels 
  for x in range(width):
    for y in range(height):

      #if intensity is less than threshold, assign black (0)
      if img.getpixel((x,y)) < thresh:
        img.putpixel((x,y),0)

      #otherwise, assign white (255)
      else:
        img.putpixel((x,y),255)

  return img

In [None]:
### Binarize an image based on a threshold number ###
# A higher threshold turns more pixels black (including lighter ones);
# a lower threshold turns only the darkest pixels black

file_path = '../All-sample-files/ocr_sample.jpg'
threshold = 220 # A value between 0-255

# Open the original image file.
img = Image.open(file_path)

# Binarize the image and save it
bin_image = binarize(img, threshold)
bin_image.save("./bin_image.jpeg")
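
The pixel-by-pixel loop above is easy to follow, but it can be slow on large scans. Here is a minimal sketch of the same fixed-threshold idea using Pillow's `Image.point`, which applies one function to every pixel value. The function name `binarize_fast` and the tiny synthetic image are assumptions made for illustration, not part of the original notebook.

```python
from PIL import Image

def binarize_fast(img, thresh):
    """Return a black-and-white copy of img using a fixed threshold."""
    # Convert to 8-bit grayscale first
    gray = img.convert("L")
    # Pixels darker than the threshold become black (0); all others white (255)
    return gray.point(lambda value: 0 if value < thresh else 255)

# Synthetic stand-in for a scan (assumption): a light gray page
# with one dark pixel standing in for ink
demo = Image.new("L", (4, 4), 200)
demo.putpixel((1, 1), 50)

result = binarize_fast(demo, 128)
print(result.getpixel((1, 1)), result.getpixel((0, 0)))
```

To try it on a real scan, you could pass in the image opened from `file_path` above instead of `demo`.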

## Cropping

![Two images, the one on the left is a full book that has not been cropped to the text. The right image only contains the text.](../All-sample-files/08-ocr-05.jpeg)

When documents are scanned, often there is more included in the image than just the document itself: the stand or supports for the document, color calibration targets, rulers, and anything else in close proximity to the document.  Archivists preparing scanned materials for the Internet Archive and other digital repositories may crop out all parts of a scanned image that are *not* part of the document, aiming to create image files of a relatively uniform size.

If your images have not been cropped already, **here are a few resources for learning how to batch crop images:**
- In Python: [this Jupyter Notebook explains how to prepare to crop](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/examples/marginalia_determination/marginalia_determination.ipynb), and [this Notebook implements the crop along with other adjustments we'll explore further here](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/examples/adjustment_recommendation/adjRec.ipynb)
- [In Photoshop](https://helpx.adobe.com/photoshop/using/crop-straighten-photos.html)
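
For a single image, cropping with PIL might look like the sketch below. It uses a synthetic image in place of a real scan, and the crop box values are hypothetical; for your own files you would measure or calculate where the page edges fall.

```python
from PIL import Image

# Synthetic stand-in for a scan (assumption): a dark scanner bed
# with a lighter "page" in the middle
scan = Image.new("L", (200, 300), 30)
scan.paste(240, (40, 50, 160, 250))

# The crop box is (left, upper, right, lower) in pixels.
# These values are hypothetical and would differ for your images.
crop_box = (40, 50, 160, 250)
page = scan.crop(crop_box)

# The cropped image contains only the page area
print(page.size)
```

If all of your scans share the same framing, the same crop box could be applied inside a loop over your image files.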

## Noise Removal
Images can't produce sound, but they can still have *noise*. In an image, noise is a **random variation in brightness or color**. Let's look again at our S from earlier:

![An image of the letter S at 72 ppi.](../All-sample-files/pixelized-s.png)

The pixels surrounding the S represent the color of the paper the S was printed on. The pixels are not all one color, though. That variation is noise. In these images, the noise has already been minimized in the scanning process: if you open one of the images and zoom in, you may notice that blank page space surrounding text appears to have many pixels that are close to the same color. 

**Tesseract removes noise on its own, but this process can also introduce errors in images that have a high amount of noise.** If you want to learn more about noise and removing it using Python, [here's a good place to start](https://docs.opencv.org/3.4/d5/d69/tutorial_py_non_local_means.html). 
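
As a small illustration of the idea, here is a sketch that removes an isolated speck with a median filter from PIL. This is a simpler technique than the non-local means approach in the OpenCV tutorial linked above, and the synthetic image is an assumption for demonstration. A median filter can also soften fine details in small type, so check results on your own scans.

```python
from PIL import Image, ImageFilter

# Synthetic stand-in for a scan (assumption): a white page
# with one stray dark pixel of noise
noisy = Image.new("L", (9, 9), 255)
noisy.putpixel((4, 4), 0)

# Replace each pixel with the median value of its 3x3 neighborhood
denoised = noisy.filter(ImageFilter.MedianFilter(size=3))

# Compare the stray pixel before and after filtering
print(noisy.getpixel((4, 4)), denoised.getpixel((4, 4)))
```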

## Dilation and erosion
Finally, there is dilation and erosion. As we can see in our sample page, printers often varied font thickness when they set the text type. Bold might be used for headings while thinner fonts might be used for smaller text. Depending on the print quality, bolded text might have additional ink around it, while thinner text might not have enough ink. Variation in ink thickness can throw Tesseract off, so **eroding bolded text** (making it thinner) and **dilating very thin text** (adding thickness) can help address this issue.

Performing erosion and dilation in Python requires some additional understanding of image processing. We won't cover it here (and our samples don't need it!), but [this GeeksForGeeks tutorial](https://www.geeksforgeeks.org/erosion-dilation-images-using-opencv-python/) explains the basics and provides sample code.
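
If you want a quick feel for the effect without OpenCV, PIL's `MinFilter` and `MaxFilter` give a rough approximation. The sketch below assumes dark text on a light page and uses a synthetic one-pixel-wide stroke for demonstration. Note that image-processing libraries usually define dilation and erosion in terms of *bright* pixels, so the names can appear reversed when your text is dark.

```python
from PIL import Image, ImageFilter

# Synthetic stand-in (assumption): a white page with a thin,
# 1-pixel-wide dark vertical stroke
page = Image.new("L", (9, 9), 255)
for y in range(2, 7):
    page.putpixel((4, y), 0)

# MinFilter keeps the darkest value in each 3x3 neighborhood,
# so dark strokes on a light page get thicker
thicker = page.filter(ImageFilter.MinFilter(3))

# MaxFilter keeps the lightest value in each 3x3 neighborhood,
# so dark strokes get thinner
thinner = thicker.filter(ImageFilter.MaxFilter(3))

# Pixel values across the middle row, before and after thickening
print([page.getpixel((x, 4)) for x in range(9)])
print([thicker.getpixel((x, 4)) for x in range(9)])
```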

## Identifying Layout & Text Order

There are many instances when we might be working with printed documents that have text arranged in a variety of ways--not just in a single column or orientation on the page.

While Tesseract does have tools for estimating a document's orientation, on its own it is not well-equipped to identify text order, recognize images, or understand the arrangement of text and images on a page--tasks that many refer to as ["document layout analysis"](https://en.wikipedia.org/wiki/Document_layout_analysis) or "page layout analysis." These analyses need to be performed before running Tesseract in order to provide it with the correct ordering and layout information, or, rather, in order to focus its attention on specific parts of a document in a specific sequence. Here is an overview of that workflow:

1. Identify the areas on a page that you want Tesseract to focus on. It may be that you want to include only *some* parts of a page and not others. Consider whether this area might be similar or different on different documents.

![A newspaper article with two columns](../All-sample-files/news_column.jpeg)

2. For each page, calculate the area that you want to *include* in the OCR. To do this, use pixels as cartesian/XY coordinates to mark out an area's corners. The outlines created by these coordinates are referred to as "bounding boxes" and may include as much or as little text as needed. This may be automated in a variety of ways using Python but may need some human intervention.

![A newspaper article with two columns with bounding boxes](../All-sample-files/news_column_withbb.png)

3. Create a dataset of all of the bounding boxes on each page. To do this, you may need to specify particular features about the document, such as whether columns are separated by a vertical line or blank space.

4. Feed these bounding boxes and the content within them in their "reading" order to Tesseract for OCRing.

Here is [an example of how *On The Books* did this to exclude marginalia from its OCR](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/examples/marginalia_determination/marginalia_determination.ipynb).

There are a variety of tools you might use to do this. *On The Books* used the [Pillow](https://pypi.org/project/Pillow/) and [NumPy](https://numpy.org/) Python libraries. [OpenCV](https://opencv.org/)'s computer vision tools can also be used for this, as can tools such as [Kraken](http://kraken.re/) and [OCRopus](https://ocropus.github.io/). This [Coursera course](https://www.coursera.org/learn/python-project) demonstrates how to do this with OpenCV and Kraken.
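
To make the bounding-box workflow above more concrete, here is a minimal sketch of its last steps: putting boxes into reading order and cropping each region. It assumes a simple multi-column layout read left to right, then top to bottom. The `reading_order` function, the `column_tolerance` parameter, and the box coordinates are all hypothetical, and the blank page stands in for a real scan.

```python
from PIL import Image

def reading_order(boxes, column_tolerance=20):
    """Sort (left, upper, right, lower) boxes by column, then top to bottom."""
    columns = []
    for box in sorted(boxes, key=lambda b: b[0]):
        # A box joins the current column if its left edge is close to
        # the left edge of that column's first box
        if columns and box[0] - columns[-1][0][0] <= column_tolerance:
            columns[-1].append(box)
        else:
            columns.append([box])
    ordered = []
    for column in columns:
        # Within a column, read from top to bottom
        ordered.extend(sorted(column, key=lambda b: b[1]))
    return ordered

# Hypothetical bounding boxes for a two-column page, in no particular order
boxes = [(205, 150, 390, 280), (10, 150, 190, 280),
         (200, 20, 390, 140), (12, 20, 190, 140)]

# Synthetic blank page standing in for a scan (assumption)
page = Image.new("L", (400, 300), 255)

ordered = reading_order(boxes)
regions = [page.crop(box) for box in ordered]

# Each region could then be passed to an OCR call, for example
# pytesseract.image_to_string(region) if pytesseract is installed
for box in ordered:
    print(box)
```

Real pages are rarely this tidy, so expect to adjust the tolerance or check the resulting order by hand.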

# Additional Resources

[PythonHumanities.com OCR Videos](https://www.youtube.com/watch?v=tQGgGY8mTP0&list=PL2VXyKi-KpYuTAZz__9KVl1jQz74bDG7i) - A set of videos on Python OCR by William Mattingly, a TAP Institute instructor
