<a href="https://colab.research.google.com/github/peiyulan/From-Images-to-Text-Working-with-OCR/blob/main/ImageToTextOCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Welcome: From Images to Text: Working with OCR**


## Overview

This course provides an overview of Optical Character Recognition (OCR), an image processing technique for extracting text from images.

The resource covers the following sections to help you learn and start to apply OCR in research practice:

- What is text extraction and OCR?
- What are the real-world applications of OCR?
- What challenges might you face when using OCR and how can you address them?

We will learn the fundamental concepts and implement OCR techniques through the following activities:

- Activity 1: [Write down your OCR Workflow](https://colab.research.google.com/github/peiyulan/From-Images-to-Text-Working-with-OCR/blob/main/ImageToTextOCR.ipynb#scrollTo=7Wcz0_MXQ4kV&line=1&uniqifier=1)
- Activity 2: [Inspect the Files](https://colab.research.google.com/github/peiyulan/From-Images-to-Text-Working-with-OCR/blob/main/ImageToTextOCR.ipynb#scrollTo=XJnrZo-EYr6M&line=1&uniqifier=1)
- Activity 3: [Online OCR Engine](https://colab.research.google.com/github/peiyulan/From-Images-to-Text-Working-with-OCR/blob/main/ImageToTextOCR.ipynb#scrollTo=a089BvORcV43&line=1&uniqifier=1)
- Activity 4: [Clean Your Text with Regex](https://colab.research.google.com/github/peiyulan/From-Images-to-Text-Working-with-OCR/blob/main/ImageToTextOCR.ipynb#scrollTo=55P-TGaAmkt4&line=1&uniqifier=1)
- Activity 6: Handwritten Text Recognition with Transkribus

> 👋 New to Google Colab and Python? No worries! Let's get you started by running the code snippet below to make sure everything is working properly for you. It's a great way to take your first steps into coding!

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, welcome to the CDCS Image to Text with OCR workshop!".format(name))

Enter your name and press enter:
Joy

Hello Joy, welcome to the CDCS Image to Text with OCR workshop!




---



# **What is OCR?**

## **O**ptical **C**haracter **R**ecognition
- OCR is a technique to process images of text, such as written or printed documents, and produce **machine-readable** documents.
- Machine-readable documents are encoded in formats that computers can process, which allows the text to be searched, edited and analysed computationally.

## Real-world examples of OCR

- Scanning your passport at the airport.
- Generating machine-readable text for text-to-speech technology.
- Making digitised physical archives searchable.
- Creating a dataset for text mining or text analysis.



## OCR Workflow

OCR is a multi-step process and we'll examine this as we move through the lesson.




1.   **Document selection**: An image or text document needs to be selected for scanning.
2.   **Scanning**: The images are then scanned by OCR software. This generates a machine-readable text output.
3.   **Cleaning**: Once you have an OCR output, extra steps to clean the files might be needed to improve accuracy.
4.   **Saving files**: The machine-readable documents can be saved and are ready to use.
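
The four steps above can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: `run_workflow`, `clean` and `fake_engine` are hypothetical names invented for this sketch, and `fake_engine` is a stand-in that returns fixed text, so the code runs without any OCR software installed.

```python
import re

def clean(text):
    """Step 3: a deliberately small clean-up (join hyphenated line breaks, trim spaces)."""
    text = re.sub(r"-\n", "", text)
    return text.strip()

def run_workflow(image_paths, ocr_engine):
    """Run steps 1 to 3 and return the cleaned text for each selected file."""
    results = {}
    for path in image_paths:             # step 1: the documents we selected
        raw_text = ocr_engine(path)      # step 2: scanning produces raw text
        results[path] = clean(raw_text)  # step 3: cleaning
    return results                       # step 4: ready to be saved

def fake_engine(path):
    """Hypothetical stand-in for a real OCR engine."""
    return "  This is my ex-\ntracted text  "

texts = run_workflow(["page_001.png"], fake_engine)
print(texts)
```

In a real project, `fake_engine` would be replaced by a call to whichever OCR engine you choose, and the returned dictionary would be written out to text files.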



## ✏️ *Activity 1: Write Down Your OCR Workflow*  


- Identify a dataset (images of text) that you might use in your research.

- Write down the steps to obtain encoded text from your dataset.

- Identify potential sources of errors or issues in each step, and discuss how you might address them.

- Share your dataset, workflow, and plan with your small group.





# **Challenges and Errors of OCR**

Unfortunately, text recognition is not a perfect process, and you are likely to encounter problems or errors in the text outputs.

The accuracy of OCR can be limited by the **OCR engine capability**:
- file size
- format of the input files
- text orientation
- languages it can process

It also depends on the **quality and formatting of the original documents**:
- Human errors and typos
- Age and damage (stained or blurry)
- Mixed text and images, or multiple languages
- Cursive handwriting

# **Ways to improve OCR accuracy**

While the accuracy of OCR is unlikely ever to reach 100%, there are ways to reduce errors and improve OCR accuracy:

- Select a good-quality dataset to begin with

- Pre-process your dataset to improve its quality

- Correct errors in OCR-produced files

- Improve OCR engine capability



---



# Dataset Selection

Selecting good-quality data to start with means that you will not need to edit and process your images as much later on. Key considerations include:







**Image resolution**

Set a minimum DPI for your images. DPI stands for ‘Dots Per Inch’: the number of dots (of ink in print, or pixels in a scan) in every inch of an image. 300 DPI is often used as a benchmark for good-quality reproduction, but this may vary.
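
As a rough illustration of the arithmetic, DPI can be estimated by dividing an image's pixel dimensions by the physical size of the page in inches. The function names below are hypothetical, and the 300 DPI default simply mirrors the benchmark mentioned above rather than a fixed rule.

```python
def effective_dpi(pixel_width, pixel_height, width_inches, height_inches):
    """Return the horizontal and vertical DPI of a scanned page."""
    return pixel_width / width_inches, pixel_height / height_inches

def meets_minimum(pixel_width, pixel_height, width_inches, height_inches, minimum_dpi=300):
    """True if the lower of the two DPI values reaches the chosen minimum."""
    dpi_x, dpi_y = effective_dpi(pixel_width, pixel_height, width_inches, height_inches)
    return min(dpi_x, dpi_y) >= minimum_dpi

# A US Letter page is 8.5 x 11 inches
print(effective_dpi(2550, 3300, 8.5, 11))   # (300.0, 300.0)
print(meets_minimum(2550, 3300, 8.5, 11))   # True
print(meets_minimum(1275, 1650, 8.5, 11))   # False: only 150 DPI
```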

**Types of error**

By checking the documents, or a sample of them, you can identify patterns that might cause errors during scanning and reduce the work required to clean the files afterwards.

Some "patterned" or "predictable" errors can be fixed computationally using software or programming patches:

*   **Characters that appear similar** can be misrecognised, for example ‘cl’ and ‘d’, or ‘rn’ and ‘m’, which results in the incorrect substitution of a letter or letters. E.g. ‘clean’ becomes recognised as ‘dean’.
*   **Different letter forms**, such as in some older historical materials a different form of 's' is used. This is called a long s and looks like this 'ſ', which is often mis-recognised as an 'f' character.
E.g. ſleeve [sleeve] -> fleeve
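
The two kinds of predictable error above can be sketched in code as a table of known confusions that is checked against a word list. This is a minimal sketch, not a production method: `CONFUSIONS`, `LEXICON` and both function names are made up for illustration, the word list is tiny, and only one substitution per word is attempted.

```python
# (what the OCR produced, what was probably printed)
CONFUSIONS = [("f", "s"), ("rn", "m"), ("d", "cl")]

# A tiny illustrative word list; a real one would be much larger
LEXICON = {"sleeve", "modern", "house", "clean", "dean"}

def single_substitutions(token):
    """Yield every variant of token with exactly one confusion undone."""
    for wrong, right in CONFUSIONS:
        start = token.find(wrong)
        while start != -1:
            yield token[:start] + right + token[start + len(wrong):]
            start = token.find(wrong, start + 1)

def suggest(token):
    """Return a lexicon word reachable by one substitution, else the token unchanged."""
    if token in LEXICON:
        return token
    for variant in single_substitutions(token):
        if variant in LEXICON:
            return variant
    return token

for word in ["fleeve", "rnodern", "houfe", "dean"]:
    print(word, "->", suggest(word))
```

Note that 'dean' is left alone because it is itself a valid word. An error that produces a real word cannot be detected by a word list alone and usually needs context or manual checking.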


Some errors are unpredictable and therefore more difficult to tackle, such as:
- Age and damage of the files (stained or blurry)
- Human errors such as typos and spelling variations
- Mixed formatting, text and images
- Cursive handwriting


## ✏️ *Activity 2: Inspect the Files*  

*   Identify issues you might encounter when processing the following documents.
*   Are there any steps you could take to pre-process the documents that might improve the output accuracy?



![](https://github.com/DCS-training/Image-to-Tech-Text-Extraction/blob/main/Github%20Images/Image1.jpg?raw=true)


Document 1: image above from the Scottish Session Papers collection held by the University of Edinburgh, shelfmark [EUL0011](https://librarylabs.ed.ac.uk/iiif/uv/?manifest=https://librarylabs.ed.ac.uk/iiif/manifest/sessionpapers/volumes/EUL0011.json#?c=0&m=0&s=0&cv=21&xywh=-510%2C0%2C9022%2C5263) under a [CC BY 3.0](https://creativecommons.org/licenses/by/3.0/) licence.

![](https://github.com/DCS-training/Image-to-Tech-Text-Extraction/blob/main/Github%20Images/Image2.jpg?raw=true)

Document 2: image above from the Scottish Session Papers collection held by the University of Edinburgh, shelfmark [EUL0281](https://librarylabs.ed.ac.uk/iiif/uv/?manifest=https://librarylabs.ed.ac.uk/iiif/manifest/sessionpapers/volumes/EUL0281.json#?c=0&m=0&s=0&cv=0&xywh=-2584%2C-252%2C8636%2C5038) under a [CC BY 3.0](https://creativecommons.org/licenses/by/3.0/) licence.

# **Pre-processing**

Once you have identified the dataset, preparing your files before scanning can help to produce better outputs. There are a few ways to do this:

**Image colour**

The colour of your images can affect text quality too. Colour images can be used, but make sure there is a sharp contrast between the background page and the text itself (for example, black text on white or cream paper).

**Text orientation:**

Orientation also matters for text recognition, because the letter shapes your OCR engine matches against are the ‘right’ way up and straight on the page. Making sure the text in the documents you upload is ‘straight on’ will produce the best results.
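
To make the point about contrast more concrete, here is a minimal sketch of one common pre-processing idea: converting a colour image to pure black and white with a fixed threshold. It works on a plain nested list of RGB values so that it runs without any imaging library. The function names and the threshold of 128 are assumptions for illustration; in practice you would use an image library and tune the threshold to your material.

```python
def to_grayscale(pixel):
    """Convert one (red, green, blue) pixel to a single brightness value (0-255)."""
    r, g, b = pixel
    # Standard luma weights (ITU-R BT.601)
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def binarise(pixels, threshold=128):
    """Turn every pixel into black (0) or white (255) depending on its brightness."""
    return [
        [0 if to_grayscale(pixel) < threshold else 255 for pixel in row]
        for row in pixels
    ]

cream_paper = (245, 240, 220)
dark_ink = (40, 35, 30)

# A tiny 2 x 3 "image": paper with a stroke of ink down the middle
image = [
    [cream_paper, dark_ink, cream_paper],
    [cream_paper, dark_ink, cream_paper],
]
print(binarise(image))
```

After this step the page background is uniformly white and the ink uniformly black, which is the kind of sharp contrast described above.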

**File format:**

Some OCR engines will only accept certain file formats, so it is best to check this in advance; some will only use image files, whereas others will work on PDF files. Common image file types are TIFF, JPEG and PNG.

TIFF files are typically lossless, meaning that no image quality or information is lost; TIFF images are therefore usually high quality but also much larger in file size. PNGs are also lossless, although TIFF files are often preferred over PNGs for OCR.

JPEGs are a lossy format type, meaning that the image is compressed to create a smaller file size. OCR engines can work with JPEGs, but there may be a loss of image quality that can impact the text generation.
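
A quick way to check your files before uploading them is to look at their extensions. The sketch below is illustrative only: `describe_format` is a hypothetical helper, and the default `accepted` set is an assumption, so replace it with the formats your chosen OCR engine actually documents.

```python
from pathlib import Path

LOSSLESS = {".tif", ".tiff", ".png"}
LOSSY = {".jpg", ".jpeg"}

def describe_format(filename, accepted=frozenset({".tif", ".tiff", ".png", ".jpg", ".jpeg", ".pdf"})):
    """Classify a file by its extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    if suffix not in accepted:
        return "not accepted by this engine"
    if suffix in LOSSLESS:
        return "lossless image"
    if suffix in LOSSY:
        return "lossy image: check quality before OCR"
    return "accepted, but not an image format"

for name in ["page_001.TIF", "page_002.jpg", "volume.pdf", "notes.docx"]:
    print(name, "->", describe_format(name))
```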

# Scanning

Now we're going to try some OCR ourselves with out-of-the-box options.

## ✏️ *Activity 3: Online OCR Engine*  
Choose a PDF that contains text and try uploading it to some of these online tools:
*   https://tools.pdf24.org/en/ocr-pdf   
*   https://www.onlineocr.net/   
*   https://www.sodapdf.com/ocr-pdf/   
*   https://www.sejda.com/ocr-pdf   
*   https://ocr.space/   
*   https://avepdf.com/pdf-ocr   











Compare your results - which performed best? What are the limitations of these options?
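
One way to make this comparison less subjective is to type out a short passage by hand and measure how similar each engine's output is to it. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The names `engine_a` and `engine_b` and their outputs are invented for illustration and do not refer to any of the tools listed above.

```python
from difflib import SequenceMatcher

def similarity(reference, ocr_output):
    """Return a score between 0 and 1; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, reference, ocr_output).ratio()

# A line transcribed by hand, used as the "correct" version
reference = "The Lords of Council and Session"

# Hypothetical outputs from two different OCR engines
outputs = {
    "engine_a": "The Lords of Council and Seffion",
    "engine_b": "Tbe Lords of Coundl and Seffion",
}

scores = {name: similarity(reference, text) for name, text in outputs.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("Closest to the reference:", best)
```

A similarity score on one short passage is only a rough guide, so it is worth repeating the check on a few pages that represent the range of your material.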




# **Cleaning**

**Manually remove errors**

Read through the text and manually change what needs to be changed. This is one way to create a high-quality text with your corrections; however, it can take a lot of time and effort.


**Patches and Machine Learning techniques**:
Some software has an embedded lexicon or dictionary that can be used to identify words that are likely to be incorrect and replace them with the most probable correct word.
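
The idea of lexicon-based correction can be sketched with `difflib.get_close_matches` from the Python standard library, which finds the entry in a word list that most resembles a given word. This is a simplified stand-in for what dedicated software does: the lexicon here is a handful of words chosen for illustration, `correct_token` is a hypothetical helper, and the cutoff of 0.7 is an arbitrary assumption.

```python
from difflib import get_close_matches

# A tiny illustrative lexicon; real tools use full dictionaries
LEXICON = ["petition", "council", "session", "honourable", "sterling"]

def correct_token(token, cutoff=0.7):
    """Return the closest lexicon word if it is similar enough, else the token unchanged."""
    if token in LEXICON:
        return token
    matches = get_close_matches(token, LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else token

for word in ["seffion", "coundl", "petition", "xyzzy"]:
    print(word, "->", correct_token(word))
```

Raising the cutoff makes the correction more cautious, while lowering it corrects more words but increases the risk of replacing a correct but unusual word.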

**Regex**

Regex (or regular expressions) is a type of shorthand code you can use to specify which parts of text you would like to target and how you would like to change these.

OCR has been developing at a significant rate over the past twenty years and newer OCR is faster and more accurate than older software, but it is still liable to errors.

## Regex (Regular Expressions)

Regular expressions, or regex, are a way to identify and match patterns via code, and they are used in a range of different programming environments.

Regex is a powerful way to find, manage and transform your data and files. It uses sequences of characters to define a search to match strings. You can use regex to:
*   Match types of characters (e.g. ‘upper case letters’, ‘digits’, ‘spaces’, etc.)  
*   Match patterns that repeat any number of times
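
As a small illustration of both uses, the sketch below applies Python's built-in `re` module to a line adapted from the sample text in this lesson (extra spaces were added to show the clean-up). The variable names are only for this example.

```python
import re

line = "a bond of 120l. 11s. 7d. Sterling,   granted by him"

# \d matches a digit; + means "one or more times in a row"
digits = re.findall(r"\d+", line)

# [A-Z] matches one upper case letter; \w* matches any letters that follow it
capitalised = re.findall(r"[A-Z]\w*", line)

# " {2,}" matches two or more spaces in a row, which we replace with one space
collapsed = re.sub(r" {2,}", " ", line)

print(digits)        # ['120', '11', '7']
print(capitalised)   # ['Sterling']
print(collapsed)
```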



There are far too many regex patterns to remember off the top of your head, so online cheat sheets are your best option.

Check out the following webpages for different patterns:

*   [Cleaning OCR’d text with Regular Expressions](https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions)
*   [Regex cheat sheet](https://www.rexegg.com/regex-quickstart.php)

Here are some examples of what you can do with regex.

> Don't worry, you don't need to understand how this all works, but if you click the 'run' button at the left side of the code module it will run the code and show you the results.



## ✏️ *Activity 4: Clean Your Text with Regex*  

Example 1:
- Search for long 's' characters and replace them with modern 's' characters.

"Thiſ iſ my extracted text, but it doeſn't look right"

In [19]:
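#requests lets us download the sample OCR output from GitHub
import requests
filepath = "https://raw.githubusercontent.com/peiyulan/From-Images-to-Text-Working-with-OCR/refs/heads/main/data/ocrtext_1.txt"

#download the file and read its contents as text
req = requests.get(filepath)
text = req.text
print(text)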

UNTO THE RIGHT HONOURABLE,
The Lords of Council and Seffion,
THE
PETITION
O F
JAMES FEA of Cleftrain;
HUMBLY SHEWETH,
HAT in the proces, at the intance of the petitioner; againft Chriftian Webfter, for fetting afide a bond of 120l. 11s. 7d. Sterling, granted by him to the defender; after various proceedings, the Lord Au-
chinleck Ordinary pronounced the following interlocutor: " In Feb. 17-
" refpect that Mr. Fea does not plead that he is imbecile or 1773.
" weak, and that it is agreed on all hands, that the chief caufe
" of the bond was for a remuneration to Chriftian Webfter, for
" the fatigue the underwent, and the remarkable care the took
" of Cleftrain, who died in her father's houfe, and to whom
" Mr. Fea, though not his neareft relation, fucceeded by difpo-
" fition, and that Cleftrain was at liberty to give her what re-
" muneration he thought fit, and gave her the bond now in
"queltion; he, the purfuer, cannot be heard to impugn that
" bond,,



In [None]:
#import re module
import re

#identify the text we wish to correct
text = "Thiſ iſ my extracted text, but it doeſn't look right"

#this is the pattern we want to find with the long 's' character
pattern = r"ſ"

#here we substitute the pattern (wrong character) for the right character - 'print' displays the corrections we have made
print(re.sub(pattern, "s", text))

This is my extracted text, but it doesn't look right


Example 2: join up words that are split across lines with hyphens:

"This is my ex-

tracted text but now it is split
a-

cross lines"

In [53]:
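#import the re module for pattern matching and requests for downloading the file
import re
import requests

#identify the text we wish to correct
filepath = "https://raw.githubusercontent.com/peiyulan/From-Images-to-Text-Working-with-OCR/refs/heads/main/data/ocrtext_1.txt"

req = requests.get(filepath)
text = req.text
print(text)

#join words that are hyphenated across lines: remove the hyphen, the line break,
#and the quotation mark and space that begin some lines in this document
#(every line ending in a hyphen is joined, so check the result for false matches)
result = re.sub(r'-\r?\n"? ?', '', text)

#display the result with the hyphenated words joined up
print(result)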

UNTO THE RIGHT HONOURABLE,
The Lords of Council and Seffion,
THE
PETITION
O F
JAMES FEA of Cleftrain;
HUMBLY SHEWETH,
HAT in the proces, at the intance of the petitioner; againft Chriftian Webfter, for fetting afide a bond of 120l. 11s. 7d. Sterling, granted by him to the defender; after various proceedings, the Lord Au-
chinleck Ordinary pronounced the following interlocutor: " In Feb. 17-
" refpect that Mr. Fea does not plead that he is imbecile or 1773.
" weak, and that it is agreed on all hands, that the chief caufe
" of the bond was for a remuneration to Chriftian Webfter, for
" the fatigue the underwent, and the remarkable care the took
" of Cleftrain, who died in her father's houfe, and to whom
" Mr. Fea, though not his neareft relation, fucceeded by difpo-
" fition, and that Cleftrain was at liberty to give her what re-
" muneration he thought fit, and gave her the bond now in
"queltion; he, the purfuer, cannot be heard to impugn that
" bond,,


# 🟢 Printed Text Recognition with Python pytesseract

# 🟢 **Handwritten Text Recognition with Python trOCR**

**H**andwritten **T**ext **R**ecognition (HTR) is another method of text extraction, although it is at an earlier stage of development than OCR.
Unlike OCR, which is best used on printed text, HTR engines are designed to be run on handwritten text and often use machine learning models to intelligently recognise text.


## ✏️ *Activity 6*  

[Transkribus](https://readcoop.eu/transkribus/) is one of the leading options in HTR; although designed for handwritten text, it can also be used on printed text.

*   Try uploading an image (PNG or JPG) of handwritten and/or printed text to their website https://readcoop.eu/transkribus/

*   How does changing the language used impact your results?


**Disclaimer**: Transkribus operates on a paid credit model, but the test option outlined above is free. Sign up is free and includes a small number of credits when [joining](https://readcoop.eu/transkribus/credits/).



# **Resources List**

**Tutorials**

Centre for Data, Culture and Society. "Text Extraction & Preparation," Managing Digitised Documents (2022), https://www.cdcs.ed.ac.uk/training/training-pathways/managing-digitised-documents-pathway [accessed 23 July 2023].

Knox, Doug. "Understanding Regular Expressions," Programming Historian 2 (2013), https://doi.org/10.46430/phen0033 [accessed 23 July 2023].

Turner O'Hara, Laura. "Cleaning OCR’d text with Regular Expressions," Programming Historian 2 (2013), https://doi.org/10.46430/phen0024 [accessed 23 July 2023].

Library Carpentry. "Introduction to Working with Data (Regular Expressions)" (2023), https://librarycarpentry.org/lc-data-intro/01-regular-expressions.html [accessed 23 July 2023].

---
**Readings**

Cordell, Ryan. ‘“Q i-Jtb the Raven”: Taking Dirty OCR Seriously’. Book History 20, no. 1 (2017): 188–225. https://doi.org/10.1353/bh.2017.0006 [accessed 23 July 2023].

Schantz, Herbert F. The History of OCR, Optical Character Recognition. [Manchester Center, Vt.]: Recognition Technologies Users Association (1982), http://archive.org/details/historyofocropti0000scha [accessed 23 July 2023].


# **Activity Notes**


## ✏️ Activity 1
Even if you have not heard of OCR or are not sure if you have used it, you probably have! Here are some real-world examples of text recognition uses:

*   Using the airport’s e-Passport Gates
*   Translating text with language recognition
*   Searching a digital database of historical public records
*   Searching a PDF file

## ✏️ Activity 2
If you are not sure what to choose as your dataset, imagine you are digitising a book; think of your favourite book or try browsing the following website to find examples you could use:

*   https://openbooks.is.ed.ac.uk/

Don't forget that the type of materials you choose will impact your OCR, and different material types might need slightly different considerations for selecting OCR software or anticipating problems.



## ✏️ Activity 3
This list is not exhaustive but will cover some of the potential issues you might have picked out:

Document 1:

*   There are marks on the top left of the page that might be picked up by accident
*   The text uses the long 's' form that may be mistaken for an 'f'
*   You can see text coming through from other pages; depending on the OCR engine, this could be picked up
*   Sometimes spacing between letters can cause issues, and the individual letters will be picked up rather than the entire word, for example with 'PETITION'

Overall, the image is clear, there is a good light balance and the text is quite straight on the page.

Document 2:

*   There are lots of marks and dark spots on the page that may be picked up by the text recognition
*   There are lots of creases in the page, meaning that some of the lettering is a little warped and may not be picked up correctly
*   There is some text coming through from the other side of the page, although this is not as dark as in the previous example
*   Some of the writing is very small and appears to be smudged in printing, which may mean that letters are misrecognised

Overall, the text looks clear in some places; however, the creases and marks on the page mean that the physical condition of the item is likely to affect the OCR output.

## ✏️ Activity 4
Any text extraction software or programme you decide to use will need to be evaluated - was it effective for the materials, or if not, why not? The limitations of OCR outputs are important to consider too - if you are only looking at a small quantity of text then you may be able to manually correct as much as you need to. If you are working with larger quantities of text this may not be the case and you will need to choose the best-performing option and do additional clean up with programming, or establish a suitable level of acceptable errors.


## ✏️ Activity 5
Transkribus has models designed for handwriting and print because they are trained on either handwritten text or printed text, to produce the best results possible for the different types of materials. The same goes for different languages, and some of the models distinguish between different fonts or handwriting types. If you play around with the model types and images you use, you should see a difference in output quality.






