<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Hannah Jacobs](http://hannahlangstonjacobs.com/) for the [2021 Text Analysis Pedagogy Institute](https://nkelber.github.io/tapi2021/book/intro.html).

Adapted by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
____

# Creating an OCR Workflow (OCR & Post-Processing)

These [notebooks](https://docs.constellate.org/key-terms/#jupyter-notebook) describe how to turn images and/or PDF documents into plain text using Tesseract [optical character recognition](https://docs.constellate.org/key-terms/#ocr). The goal of this notebook is to help users design a workflow for a research project.

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion time:** 90 minutes

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))
* [Optical Character Recognition Basics](./ocr-basics.ipynb)
* [Creating an OCR Workflow (Pre-Processing)](./ocr-workflow-1.ipynb)

**Knowledge Recommended:**

**Data Format:** 
* image files (.jpg, .png)
* document files (.pdf)
* plain text (.txt)

**Libraries Used:**
* [Tesseract](https://tesseract-ocr.github.io/) for performing [optical character recognition](https://docs.constellate.org/key-terms/#ocr).

**Learning Objectives:**

1. Run OCR on a large batch of prepared images
2. Assess the degree of accuracy achieved in performing OCR
3. Identify post-processing strategies for improving OCR accuracy

**Research Pipeline:**

1. Digitize documents
2. **Optical Character Recognition**
3. Tokenize your texts
4. Perform analysis
___

## Install Tesseract
We will install Tesseract on your machine using the command line. The following code cells install:
* tesseract-ocr
* pytesseract
* tesseract training data
* additional languages

In [None]:
%%bash
# The -y flag answers "yes" to the installation prompt automatically
apt install -y tesseract-ocr

In [None]:
# Install PyTesseract, the Python wrapper for Tesseract
# An exclamation point runs the command on the command line
!pip install pytesseract

In [None]:
# Install Tesseract training data in the Constellate Analytics Lab.
# The exclamation point runs the command as a terminal command.

!wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
!mv eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata

In [None]:
# Install Spanish language support
# Change `spa` to match the language code of your choice
# Full list of languages: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
!apt-get install -y tesseract-ocr-spa
print('Language installed.')
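
As a quick check after the install cells above, the following sketch asks the Tesseract command-line program which languages it can see. It assumes the `tesseract` executable is on the system PATH; the helper name `installed_tesseract_languages` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def installed_tesseract_languages():
    """Return the language codes Tesseract reports, or None if it is not found."""
    # Look for the tesseract executable on the PATH.
    executable = shutil.which("tesseract")
    if executable is None:
        return None

    # Ask Tesseract to list its installed languages.
    result = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )

    # Depending on the Tesseract version, the list may be written to
    # stdout or stderr, so read both (an assumption made for safety).
    lines = (result.stdout + result.stderr).splitlines()

    # Skip blank lines and the "List of available languages" header line.
    return [
        line.strip()
        for line in lines
        if line.strip() and not line.startswith("List of available languages")
    ]


languages = installed_tesseract_languages()
if languages is None:
    print("Tesseract was not found on the PATH.")
else:
    print("Installed languages:", languages)
```

If the install cells succeeded, the printed list should include `eng` and any additional language you installed, such as `spa`. If Tesseract reports an error instead (for example, missing training data), the error text will appear in the list.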

In [None]:
### Converting a folder of images into text files (one per image) ###

# Import packages #

# Import os, a module for file management.
import os

# Import re, a module that we can use to search text.
import re

# Import glob, a module that helps with file management.
import glob

# Import the Image module from the Pillow Library, 
# which will help us access the image.
from PIL import Image

# Import the pytesseract library, which will run the OCR process.
import pytesseract

# Configure Tesseract and input/output folders #

# Specify OEM (OCR engine mode) & PSM (page segmentation mode) configurations. 
# 3 is the default setting for both.
custom_oem_psm_config = r'--oem 3 --psm 3'

# Open the file folder where our sample pages are stored.
# Look only for the files ending with the ".jpg" file extension.
sampleFilePath = glob.glob("sample/*.jpg")

# Name the output folder (sample_output) where the text files will be saved.
outDir = "sample_output"
newDir = os.path.normpath(outDir)

# Run OCR Process #

# If you're running this script a second or third time, the sample_output folder will 
# already exist. 
# The following statement checks whether the folder exists and creates it
# only if it doesn't (i.e. if os.path.exists returns False).
if not os.path.exists(newDir):
    os.mkdir(newDir)

# Adding a "/" after newDir ("sample_output") makes it into a file path that
# we'll use to move our output file to the correct folder later in this script.
newDir = newDir + "/"
    
# For each file in the sample folder:
for file in sampleFilePath:
    
    # Open a file.
    with open(file, 'rb') as inputFile:
        
        # Read the file using PIL's Image module.
        img = Image.open(inputFile)
    
        # Run OCR on the open file, using the OEM & PSM settings defined above.
        ocrText = pytesseract.image_to_string(img, lang="eng", config=custom_oem_psm_config)
        
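        # Get a file name -- without the extension -- to use when we name the output file.
        # os.path.splitext removes only the extension. (str.strip('.jpg') would instead
        # remove any of the characters ".", "j", "p", "g" from both ends of the name.)
        fileName = os.path.splitext(file)[0]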
        
        # The current file name also includes its folder name (sampleFilePath, "sample/").
        # We want to store our text output files in a different folder so that we can use 
        # them in future without altering the original image files. The following two 
        # lines use the re module to rename the path from "sample/" to "sample_output/",
        # which also changes the final destination for our next text file.
        currentFolder = "sample/"
        fileName = re.sub(currentFolder, newDir, fileName)

        # Create and open a text file, name it to match its input file,
        # and write the OCR'ed text to the file.
        with open(fileName + ".txt", "w") as outFile:
            outFile.write(ocrText)
        
        print(fileName + ".txt", "successfully created.")
    
    # Loop back to check for another image file, run OCR on that file, 
    # and write its OCR to a new output file. When no more files remain,
    # this loop will end, and the script will be finished.
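
The file-naming step in the loop above can also be sketched on its own, without running OCR. The helper below is a minimal illustration, assuming image paths like `sample/page1.jpg`; the function name `text_output_path` is hypothetical and not part of the original script. It also shows why `str.strip` is risky for removing a file extension.

```python
import os


def text_output_path(image_path, out_dir):
    """Map an image path such as 'sample/page1.jpg' to '<out_dir>/page1.txt'."""
    # Keep only the file name, dropping the folder it came from.
    base_name = os.path.basename(image_path)

    # Split off the extension (".jpg", ".png", ...) and keep the stem.
    stem, _extension = os.path.splitext(base_name)

    # Build the path of the text file inside the output folder.
    return os.path.join(out_dir, stem + ".txt")


# str.strip treats its argument as a set of characters, not a suffix,
# so it can remove part of the file name as well as the extension.
print("sample/pg.jpg".strip(".jpg"))  # → sample/

# The helper keeps the full stem of the file name.
print(text_output_path("sample/pg.jpg", "sample_output"))
```

Building the path with `os.path` functions rather than string replacement also avoids depending on a particular path separator.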