<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Hannah Jacobs](http://hannahlangstonjacobs.com/) for the [2021 Text Analysis Pedagogy Institute](https://nkelber.github.io/tapi2021/book/intro.html).

Adapted by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
____

# Creating an OCR Workflow (Post-Processing)

These notebooks describe how to turn images and/or PDF documents into plain text using Tesseract optical character recognition. The goal of this notebook is to help users design a workflow for a research project.

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion time:** 90 minutes

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](../Python-basics/python-basics-1.ipynb))
* [Optical Character Recognition Basics](./ocr-basics.ipynb)
* [Creating an OCR Workflow (Pre-Processing)](./ocr-workflow-1.ipynb)

**Knowledge Recommended:**

**Data Format:** 
* image files (.jpg, .png)
* document files (.pdf)
* plain text (.txt)

**Libraries Used:**
* [Tesseract](https://tesseract-ocr.github.io/) for performing optical character recognition.

**Learning Objectives:**

1. Run OCR on a large batch of prepared images
2. Assess the degree of accuracy achieved in performing OCR
3. Identify post-processing strategies for improving OCR accuracy

**Research Pipeline:**

1. Digitize documents
2. **Optical Character Recognition**
3. Tokenize your texts
4. Perform analysis
___

## "Cleaning" OCR (Post-Processing) Overview

**This part of the process is often best performed with a combination of manual (human) and automated (computer) steps.** This is where you may be addressing not only errors in the OCR itself but also issues with the original printing, as we describe below with regard to hyphenated words at the end of lines. As with pre-processing, how complex you make iterations in this phase depends on your corpus and your resources:

1. **Review the OCR output.** Take an initial look at the OCR text files. Sometimes even just a glance will give you a sense of how well the process has gone. If you see a lot of errors, return to the pre-processing questions and consider which steps you might take to improve the OCR output.


2. **Run a spellchecker & calculate the quality of the OCR output.** Use a spellchecker to get a sense of just how accurate the OCR process may have been. Note that spellchecking here, as with spellchecking in software such as Word, is really looking for known and unknown words.


3. **Use Python to check for and correct possible recurring & unique spelling errors.** These are errors that appear frequently and may be caused by the typescript, hyphenation at the end of lines, or other patterns that Tesseract repeatedly misinterprets. This step should focus on common words and avoid proper nouns (unless you have a full list of proper nouns to draw from). As with any automated step, it's possible that new errors will be introduced here. If there is a known and small quantity of proper nouns used in individual texts or across the corpus, and these are consistently "read" incorrectly by Tesseract, it may be possible to use Python to correct these.


4. If your corpus is small enough and/or you have a team that can help you, **read through the corpus to manually check for and correct unique errors**. This may be a moment to correct proper nouns. If you have a team, it may be advisable to have texts read and corrected by multiple team members. It will be important that these team members have access to both inputs and outputs, and perhaps even lists of proper nouns, to be able to compare the original scans with the computer-readable versions. You may even want to set up a process whereby reviewers can flag words they are not sure about, so that another reviewer can provide an opinion and you and/or another project manager can make a final decision on uncertain words.


The above process could be broken down further to address smaller issues incrementally and iteratively. It may also be useful to break your corpus into units of analysis before or during this process to assist with cleaning. Let's download a sample, OCR it, and investigate the output.

Now let's break the PDF down into individual images using the same method from our last lesson.

In [None]:
### Convert a single PDF into a series of image files ###

# Import pdf2image's convert_from_path function.
from pdf2image import convert_from_path
# Import pathlib's Path class.
from pathlib import Path

# Define where the images will be saved.
# Check if a folder exists to hold the page images. If not, create it
# (along with any missing parent folders, such as ./data).
input_folder = Path('./data/pdf_images/')
input_folder.mkdir(parents=True, exist_ok=True)

# Get the PDF and convert to a group of PIL (Pillow) objects
# This does NOT save the images as files.
document_path = Path('../All-sample-files/sample_01.pdf')
PIL_objects = convert_from_path(document_path)

# For each PIL image object:
for page, image in enumerate(PIL_objects):

    # Create a file name that includes the output folder, a page
    # number, and the file extension.
    fileName = f'{input_folder.as_posix()}/image_{str(page)}.jpg'

    # Save each PIL image object using the file name created above
    # and declare the image's file format. (Try also PNG or TIFF.)
    image.save(fileName, 'JPEG')

# Success message
print('PDF converted successfully')

And finally, let's batch OCR all the pages, creating a single text file for each image file.

In [None]:
### Convert all the image files into text files ###
import pytesseract

#Import PIL's Image module.
from PIL import Image

# For each .jpg file in the input folder, do the following:
for img in input_folder.rglob('*.jpg'):
    # Open the input file and complete OCR
    with open(f'{img}', 'rb') as f_image:
        file = Image.open(f_image)
        ocrText = pytesseract.image_to_string(file)
    
    # Create (or overwrite!) the output file and write the text to it
    with open(f'{input_folder}/{img.stem}.txt', 'w') as f_text:
        f_text.write(ocrText)
        

# Post-Processing Step-by-Step

## Review the OCR output.

Open your output text files and begin your review. Make sure to compare them with the original page images. What do you notice?

## Check for misspellings & quality.

Although it appears that this page has been entirely correctly OCR'ed, there are two issues that show up in this text file that we want to address in all of our OCR'ed files:

1. The original printers **broke words at the end of some lines**. For example, `Dis-trict` might be broken up across two lines. How do we deal with this without removing words that *should* be hyphenated?
2. **How would we know how accurate our OCR script is when it is applied to the entire volume, or to the entire corpus?** 

In addition to being hyphenated, `Dis-trict` may be misspelled as `Dis-triet` or `Dis-trism` in our output. Is this just one instance, or does this error recur? If it's recurring, we can use Python to fix it across the corpus. This could be more efficient than having to read the entire OCR'ed corpus. A good starting point is to get a sense of just how accurate the OCR process has been, that is, to **check its readability**, before we start trying to identify and fix spelling errors.

**In the following scripts, we'll look at how to correct misspelling and check for OCR accuracy by generating a readability score.** During this process, we'll remove the hyphens at the end of lines to help us with spellchecking, but we may find that we introduce new issues for the spellcheck.
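
One possible answer to the hyphenation question above is to rejoin a line-broken word only when the joined form is a known word, and to keep the hyphen otherwise. The following is a minimal sketch of that idea, not the method used later in this notebook. The `known_words` set and the `join_line_break_hyphens` function are hypothetical names made up for illustration; in practice the set would be replaced by a real dictionary.

```python
import re

# Hypothetical mini-dictionary, for illustration only.
known_words = {"district", "county", "the", "of"}

def join_line_break_hyphens(text, known_words):
    """Rejoin words split by a hyphen at a line end, but only when the
    joined form is a known word; otherwise keep the hyphen."""
    def rejoin(match):
        first, second = match.group(1), match.group(2)
        if (first + second).lower() in known_words:
            return first + second        # "Dis-\ntrict" becomes "District"
        return first + "-" + second      # "well-\nknown" becomes "well-known"
    # Look for letters, a hyphen, a newline, and more letters.
    return re.sub(r"([A-Za-z]+)-\n([A-Za-z]+)", rejoin, text)

sample = "the Dis-\ntrict of a well-\nknown county"
joined = join_line_break_hyphens(sample, known_words)
print(joined)
```

Note that the newline is removed in both cases; only the decision about the hyphen depends on the dictionary lookup.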

To begin, there are a number of modules and libraries we need to import (or reimport) to extend Python's functionality:

In [None]:
### Install PySpellChecker ###
!pip install pyspellchecker

In [None]:
# Import the word_tokenize function from the nltk ("Natural Language Toolkit") library.
# NLTK is a powerful toolset we can use to manipulate and analyze text data.
import nltk
from nltk import word_tokenize
# Download the tokenizer data into a local folder and tell NLTK to look there.
nltk.download('punkt_tab', download_dir='./data/nltk_data')
nltk.data.path.append('./data/nltk_data')

In [None]:
# Import PyTesseract and PIL, an image processing library used by PyTesseract, to complete the OCR.
from PIL import Image
import pytesseract

# Import re, a module that we can use to search text.
import re

# Import glob, a module that helps with file management.
import glob

# Import the SpellChecker module, which we'll use to look for likely misspelled words.
from spellchecker import SpellChecker

# We'll also need the pandas library, which is a powerful toolset for managing data.
# We'll learn more about pandas in the exploratory analysis modules.
import pandas as pd

# This statement confirms that the above code was run without issue.
print("Modules & libraries imported. Ready for the next step.")

Now we'll set up the [spellcheck dictionary](https://pypi.org/project/pyspellchecker/) that we'll use to test the OCR'ed text, extending it to include North Carolina placenames. In a later step, we'll also create a dataframe (essentially, an empty table) to structure readability information along with the OCR'ed text.

*Note: The [spellchecker library](https://pypi.org/project/pyspellchecker/) we are using supports a limited number of Western languages. English is the default.*

In [None]:
# Before we loop through each page, we'll augment our spellchecker 
# dictionary to include place names specific to North Carolina. 
# Our script for gathering these place names is available here: 
# https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/examples/adjustment_recommendation/geonames.py

# Load the spellchecker dictionary.
# Replace the language attribute with another two-letter code
# to select another language. Options include: English - 'en', Spanish - 'es',
# French - 'fr', Portuguese - 'pt', German - 'de', Russian - 'ru'.

spell = SpellChecker(language='en')

# Add the place name words from the "geonames.txt" file to the 
# spellchecker dictionary.
# Sample file to download
geonames_path = Path('../All-sample-files/geonames.txt')
spell.word_frequency.load_text_file(geonames_path.as_posix())

# This statement confirms that the above code was run without issue.
print("Variables created. Ready for the next step.")

Here is what each column will hold:

- **file_name**: The name for the corresponding image file. For now, this is the only information in the table that identifies where the rest of the information in each row comes from (which page).
- **token_count**: The total number of tokens (words) found in each page.
- **unknown_count**: The number of distinct unknown ("misspelled") words found in each page.
- **readability**: Think of this as the percentage of the page that was readable.
- **unknown_words**: A list of tokens (words or in some cases characters) that were not listed in the spellchecker.
- **text**: The OCR'ed text output from each page. The output here includes all [escape characters](https://en.wikipedia.org/wiki/Escape_character#JavaScript), so it may look as if a lot of erroneous characters have been added.
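
To make the **readability** column concrete, here is a small worked example of the arithmetic. It is only a sketch: the `tokens` list and `known_words` set are hypothetical stand-ins for a tokenized page and the spellchecker dictionary, so no spellchecker is needed to run it.

```python
# Hypothetical stand-ins for a tokenized page and a dictionary.
tokens = ["the", "general", "assembly", "of", "north", "carolina", "do", "enaet"]
known_words = {"the", "general", "assembly", "of", "north", "carolina", "do", "enact"}

# Distinct tokens that are not in the dictionary.
unknown = {token for token in tokens if token not in known_words}

# Share of the page that is known, as a percentage rounded to 2 places.
readability = round(100 - (len(unknown) / len(tokens) * 100), 2)

# Prints: 8 1 87.5
print(len(tokens), len(unknown), readability)
```

Because `unknown` is a set, a misspelling that appears several times on a page is counted once, while `len(tokens)` counts every occurrence of every word.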

Now we'll remove hyphens from the text, run the spellcheck script, and produce a dataframe (table) of information that will give us a sense of the accuracy of our OCR.

In [None]:
### Dictionary Test a Folder of .txt Files ###

# We'll use Pandas to create a dataframe (a table) that can hold 
# information about an OCR'ed page and display it in a tabular format.
# This dataframe will start out empty with only its column headers 
# defined. We'll add information to it one page at a time. So each
# row will represent 1 page.

df = pd.DataFrame(columns=["file_name","token_count","unknown_count","readability","unknown_words","text"])

# Set the folder that holds the OCR'ed text files
texts_folder = Path('./data/pdf_images/')

for txt_file in texts_folder.iterdir():
    if txt_file.suffix == '.txt':
    
        # Open each text file and read text into `ocrText`
        with open(txt_file, 'r') as inputFile:
            ocrText = inputFile.read()
            
        # Join hyphenated words that are split between lines by 
        # looking for a hyphen followed by a newline character: "-\n"
        # "\n" is an "escape character" and represents the 
        # "newline," a character that is usually invisible 
        # to human readers but that computers use to mark the 
        # end/beginning of a line. Each time you press the 
        # Enter/Return key on your keyboard, an invisible "\n" 
        # is created to mark the beginning of a new line.
        ocrText = ocrText.replace("-\n","")
        
        # First, we'll use NLTK to "tokenize" text. 
        # "Tokenize" here means to take a page of our OCR'ed text,
        # which Python is currently reading as one big string,
        # and separate each word and punctuation mark out so that it
        # can be read as an individual piece of data within a larger
        # data structure (a list).
        tokens = word_tokenize(ocrText)
        
        # Keep only alphabetic tokens (this drops punctuation and
        # numbers) and lowercase them.
        tokens = [token.lower() for token in tokens if token.isalpha()]
        
        # Now we can get all of the words that don't match the 
        # spellchecker dictionary or our list of place names--
        # these are the potential spelling errors. The result is a
        # set, so each unknown word appears in it only once.
        unknown = spell.unknown(tokens)
        
        # Let's use a little math to find out how many potential 
        # spelling errors were identified. As part of this process, 
        # we'll create a "readability" score that will give us a 
        # percentage of how readable each file is--how much of the 
        # OCR'ed text is "correct."
            
        # If the set of unknown tokens (words) is not empty:
        if len(unknown) != 0:
                
            # Following order of operations, here's what's happening 
            # in the readability variable below:
            # 1. Divide the number of unknown tokens (len(unknown)) 
            #    by the total number of tokens on the page
            #    (len(tokens)). Use "float" to specify that Python
            #    returns a decimal number:
            #      float(len(unknown))/float(len(tokens))
            # 2. Multiply the number from step 1 by 100.
            #      float(len(unknown))/float(len(tokens)) * 100
            # 3. Subtract the number from step 2 from 100.
            #      100 - (float(len(unknown))/float(len(tokens)) * 100)
            # 4. Round the number from step 3 to 2 decimal places.
            #      round(100 - (float(len(unknown))/float(len(tokens)) * 100), 2)
            readability = round(100 - (float(len(unknown))/float(len(tokens)) * 100), 2)
            
        # If the set of unknown tokens is empty, then readability is 100!
        else:
            readability = 100
        
        # Let's create a record of the readability information 
        # for this page that we'll add to the dataframe. 
        # The following is a Python dictionary, another way of 
        # storing data. Each word or phrase to the left of the : is a
        # "key" -- think of it as a column header. Each piece of 
        # information to the right is a "value" -- information 
        # written in a single cell below each header. 
    
        df2 = pd.DataFrame({
                "file_name" : txt_file.as_posix(),
                "token_count" : len(tokens),
                "unknown_count" : len(unknown),
                "readability" : readability,
                "unknown_words" : [unknown],
                "text" : ocrText
                })
    
        df = pd.concat([df, df2])
    
        # This statement lets us know if a page has been successfully 
        # checked for readability.
        print(txt_file, "checked for readability.")
    
# This time, instead of creating individual .txt files for each page,
# we're going to save all of the OCR'ed text and readability 
# information to a single .csv ("comma separated value") file. 
# We can view this file format as a table. Having everything stored 
# like this will help us with clean up and future analysis.
df.to_csv(f'{texts_folder}/spellcheck_data.csv', header=True, index=False, sep=',')

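# We have the data stored in a file now, but we can also 
# preview it here. A bare `df` is only displayed when it is the
# last line of a cell, so we call display() explicitly.
display(df)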

# Delete the df variable in case we wish to run this script again
del df

# Correcting Errors

Broadly speaking, we can break down errors into two categories: **unique** or **recurring**. We can use Python to address both types to an extent, but it's likely that some manual review will still need to be done to ensure the highest quality OCR. Whether and how much manual review can be done will depend on the project's resources.

## Unique Errors

There are at least **two ways to address unique computer-identified errors:**

1. Since we produced a list of unknown words in our readability test, we could simply open each file in a text editor and use find-and-replace functionalities (Command + F or Control + F) to locate and replace instances of unique errors.

2. We could use a little Python to find and replace these errors across the corpus. 

*Caveat: There may be instances where variant spellings are identified as "unknown" (misspelled) but are true representations of the word as it was originally printed. It may be necessary to check these misspellings against the scanned pages and decide whether or not to correct the text in the OCR output.*
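
Before deciding on a correction, it can help to see which known words an unknown word resembles. The sketch below uses `difflib.get_close_matches` from Python's standard library to propose candidates for a human reviewer. This is a swapped-in technique rather than part of the spellchecker workflow above, and the `known_words` and `unknown_words` lists are hypothetical examples.

```python
from difflib import get_close_matches

# Hypothetical word lists, for illustration only.
known_words = ["discretion", "district", "direction", "assembly"]
unknown_words = ["diseretion", "assernbly"]

# For each unknown word, list up to 3 similar known words,
# best match first. A cutoff closer to 1 demands a closer match.
suggestions = {
    word: get_close_matches(word, known_words, n=3, cutoff=0.75)
    for word in unknown_words
}

for word, candidates in suggestions.items():
    print(word, "->", candidates)
```

The candidates are suggestions only. As the caveat above notes, a reviewer should still check the scanned page before accepting any of them.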

The following script runs through the entire sample output (and could be applied to an entire corpus) and checks for and replaces instances of a unique error:

In [None]:
# Replacing unknown words with known words
unknown_word = "diseretion"
known_word = "discretion"

# Import glob, a module that helps with file management.
import glob

# Identify the folder that holds the OCR'ed text files.
# Remember that our readability output is also stored 
# in this folder as a .csv. We don't want to change it, 
# so we'll use glob to look for only .txt files.
file_list = glob.glob("./data/pdf_images/*.txt")

# Apply the following loop to one file at a time in file_list.
# Read in a file's text
for file in file_list:
    file_path = Path(file)
    with open(file_path) as f:
        text = f.read()
    
    # Correct the unknown word with a known word
    corrected_text = text.replace(unknown_word, known_word)
    with open(file_path, 'w') as f:
        f.write(corrected_text)

print("All instances of " + unknown_word + " replaced with " + known_word + ".")

Check the output files for the unknown word to see if the word is still present. 

We've done this for one word at a time, but we could use the list of unknown words generated to create a script that runs through the list and corrects each instance all at once--rather than running the above script for each correction individually.
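
A minimal sketch of that idea follows. It assumes you have already reviewed the unknown words and written down the fixes you trust; the `corrections` dictionary and the `apply_corrections` function are hypothetical names for illustration. Unlike the script above, which uses `str.replace`, this version uses a regular expression with word boundaries so that only whole words are changed.

```python
import re

# Hypothetical corrections, e.g. gathered while reviewing
# the unknown_words column of the spellcheck table.
corrections = {"diseretion": "discretion", "Distriet": "District"}

def apply_corrections(text, corrections):
    """Replace each whole-word error with its correction."""
    for error, fix in corrections.items():
        # \b limits the match to whole words, and re.escape keeps
        # any special characters in the error from acting as regex.
        text = re.sub(rf"\b{re.escape(error)}\b", fix, text)
    return text

sample = "The Distriet court may use its diseretion."
corrected_sample = apply_corrections(sample, corrections)
print(corrected_sample)
```

To apply this to the corpus, the call to `apply_corrections` could take the place of the single `text.replace(...)` line in the loop above.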

## Recurring Errors & Changes

There are several kinds of recurring errors:

- Specific Words & Phrases (if a unique misspelling above is present consistently across the corpus, for example).
- Word, Phrase, or Character Patterns (for example, a hyphen used to break up a word at the end of a line).

We looked earlier at how to remove hyphens at the end of lines. To do this we replaced `-\n` with nothing (""). We saved that change only in the spellcheck CSV, but we could also have written it back to the text files. We could adapt the above script to make that change directly in the output files, though it may be advisable to *keep the original text output files separate from the corrected versions in case you need to refer back.*

We could also use the below script in combination with [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) to correct issues that we know are recurring.

**Be careful when attempting changes with regular expressions**--these always come with the risk of introducing new errors. To avoid as many as possible, make your regular expression as specific as possible.

In [None]:
# Import the regular expressions module (re), 
# which helps us use regex in Python.
import re

# Import glob, a module that helps with file management.
import glob

# Identify the folder that holds the OCR'ed text files.
# Remember that our readability output is also stored 
# in this folder as a .csv. We don't want to change it, 
# so we'll use glob to look for only .txt files. We also skip
# any "_corrected.txt" files left over from an earlier run.
file_list = [f for f in glob.glob("./data/pdf_images/*.txt")
             if not f.endswith("_corrected.txt")]

# Save the pattern for the recurring phrase that we want to search
# each page for. The pattern is anchored by the blank lines ("\n\n")
# before and after the phrase, and ".*?" matches as few characters
# as possible between "The General Assembly" and "t:".
regex_search = re.compile("\n\nThe General Assembly.*?t:\n\n")

# Save the text that we want to use to correct the OCR output.
replacement = "\n\nThe General Assembly of North Carolina do enact:\n\n"

# Apply the following loop to one file at a time in file_list.
for file in file_list:
    
    file_path = Path(file) 
    with file_path.open() as f:
        text = f.read()
        # Create a corrected version
        corrected_text = re.subn(regex_search, replacement, text)
        print(corrected_text[1], 'match(es) found in ', file_path.name)

    # Write the corrected text to the file
    corrected_file_name = file.replace(".txt", "_corrected.txt")
    with open(corrected_file_name, 'w') as f:
        f.write(corrected_text[0])
    
# The loop will finish when Python has gone through all files in 
# file_list.
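
Because regular expressions can introduce new errors, it can be worth previewing what a pattern matches before writing any changes. The following sketch lists matches without altering the text. The `preview_matches` function and the `sample` string are hypothetical and only stand in for one of the OCR'ed pages.

```python
import re

# The same pattern used in the script above.
pattern = re.compile("\n\nThe General Assembly.*?t:\n\n")

# Hypothetical page text containing an OCR error ("enaet").
sample = ("CHAPTER 1\n\nThe General Assembly of North Carolina "
          "do enaet:\n\nSection 1.")

def preview_matches(pattern, text):
    """Return the text of every match without changing anything."""
    return [match.group(0).strip() for match in pattern.finditer(text)]

matches = preview_matches(pattern, sample)
for found in matches:
    print(repr(found))
```

If the preview shows text that should not be replaced, the pattern can be made more specific before it is used with `re.subn`.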

# Concatenate all the text files into one

If we are happy with our outputs, then we can stitch all the text files into a single text file.

In [None]:
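# Import re to pull the page number out of each file name.
import re

# Set the folder for the input texts
texts_folder = Path('./data/pdf_images/')

# Set output filename
full_text = Path('./data/full.txt')

# Sort the corrected files by page number so that, for example,
# image_2 comes before image_10 (an alphabetical sort would not).
def page_number(path):
    return int(re.search(r'\d+', path.stem).group())

corrected_files = sorted(texts_folder.rglob('*_corrected.txt'), key=page_number)

# Create (or overwrite) the output file and write each page in order
with open(full_text, 'w') as f_out:
    for txt in corrected_files:
        with open(txt, 'r') as f_in:
            f_out.write(f_in.read())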

# Try it out!

Here's [the first 50 pages of an edition of Moby Dick from 1922](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/moby_dick.pdf). Can you OCR all the pages and then generate a list of errors based on dictionary analysis? How about replacing some of the text with errors?

`../All-sample-files/moby_dick.pdf`

You'll need to start by either using `urllib.request` to download the materials to the Constellate environment or by downloading the document to your local machine and uploading it to Constellate.
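
If you take the `urllib.request` route, a minimal sketch might look like the following. The `download_file` function is a hypothetical helper written for this example, and the `./data/moby_dick.pdf` destination is an assumption; adjust it to match wherever you want the file to live.

```python
from pathlib import Path
from urllib.request import urlretrieve

def download_file(url, destination):
    """Download url to destination, creating parent folders if needed."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, destination)
    return destination

# Example call (commented out so nothing is downloaded by accident):
# moby_dick_url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/moby_dick.pdf'
# download_file(moby_dick_url, './data/moby_dick.pdf')
```

Once the PDF is saved, its path can be used as the `document_path` in the PDF-to-image step at the start of this notebook.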