# OCR with Pytesseract

Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License. Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005, and development has been sponsored by Google since 2006.

In 2006, Tesseract was considered one of the most accurate open-source OCR engines then available. 

Functions<br /> 
<b>get_tesseract_version</b> Returns the Tesseract version installed in the system.<br /> 
<b>image_to_string</b> Returns the result of a Tesseract OCR run on the image as a string.<br /> 
<b>image_to_boxes</b> Returns the recognized characters and their box boundaries.<br /> 
<b>image_to_data</b> Returns box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, see the Tesseract TSV documentation.<br /> 
<b>image_to_osd</b> Returns information about orientation and script detection.<br /> 
<b>image_to_alto_xml</b> Returns the result in Tesseract’s ALTO XML format.<br /> 
<b>run_and_get_output</b> Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to Tesseract.<br /> 

Install: https://digi.bib.uni-mannheim.de/tesseract/ <br /> 
<b>pip install pytesseract</b>

In [19]:
import cv2
import pytesseract
from pytesseract import Output

In [24]:
pytesseract.get_tesseract_version()

LooseVersion ('4.1.1')

In [4]:
img = cv2.imread('invoice-3.jpg')

# Adding custom options
custom_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(img, config=custom_config)

'n% EXTRA\n¢! —___ Kallerud\n| Apent A = 28 OR « 319\ns Telefon 41 84 92 00\nCOOP INNLANDET sA\ny Org.nr 979 419 287 MVA\nButikk 2512-8, Skann & Betal 2\nra Salgskvitterin3 20777 25.04.2020 18:31\nmm = BANAN X-TRA KG 17 47\n: 0.976 kg 17.90 kr/kg\nhe =6COOP MOZARELLA 41.80\nP Antall: 2 stk 20.90 kr/stk\nMm HVITLOK 1006 12.90\nFy = JORD. TANNT .EASYSLIDE 43.90\nm NOTTER LOSVEKT 11.37\n3 0.430 kg 259.00 kr/ks\nE PAPRIKA X-TRA 2STK 19-98\n@ POTET ASSORTERT 13.93\na 0.876 kg 15.90 kr/ks 3\nM SHOR N.SALTET 5006 39.\nB SUKKER 1KG 20.9\nE TOMATER | 22.8\nfs 0.618 kg 36.90 kr/k9g\n= X-TRA HVITL. BAGUETT 19.3\nAntall: 2 stk 9.90 kr/stk\ng KO FLOTE 38% 1a\nI Totalt (14 Artikler) 2,\n| Bank: 382\n| Herav\nDaglisgvarer 382\nSyrige varer (\nBax: 17249983-457144\n26/04/2020 18:36 Overf .:779\nKortet ikke presentert\nRef .¢ >\nAVBRUTT AV OPERATOR\na\n\x0c'

In [5]:
d = pytesseract.image_to_data(img, output_type=Output.DICT)
print(d.keys())

dict_keys(['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num', 'left', 'top', 'width', 'height', 'conf', 'text'])


This dictionary gives, for each detected word, its bounding box, its text, and a confidence score.

You can plot the boxes by using the code below:

In [None]:
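n_boxes = len(d['text'])
n_words = len(d['word_num'])  # every column has the same length, so this equals n_boxes
print(n_boxes)
print(n_words)
for i in range(n_boxes):
    # conf is -1 for rows that are not words; float() also copes with decimal strings
    if float(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)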

cv2.namedWindow('img', cv2.WINDOW_KEEPRATIO)
cv2.imshow('img', img)
cv2.waitKey(0)

204
204


In [11]:
cv2.destroyAllWindows()
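
The same dictionary can also be filtered into a plain list of words before anything is drawn. The following is a minimal sketch, assuming the column layout printed by d.keys() above; the function name `confident_words`, the `min_conf` default, and the hand-made `sample` dictionary are hypothetical and only serve as illustration.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, (x, y, w, h), conf) for each non-empty word above min_conf.

    `data` is assumed to have the layout of the dictionary returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for i, text in enumerate(data['text']):
        conf = float(data['conf'][i])  # may arrive as int, float or str
        if conf > min_conf and text.strip():
            box = (data['left'][i], data['top'][i],
                   data['width'][i], data['height'][i])
            words.append((text, box, conf))
    return words


# Hand-made dictionary in the same layout, for illustration only.
sample = {
    'text': ['', 'Totalt', 'Bank:'],
    'conf': ['-1', '91', '42'],
    'left': [0, 10, 10],
    'top': [0, 5, 30],
    'width': [100, 40, 35],
    'height': [50, 12, 12],
}
print(confident_words(sample))
```

With a real image, `confident_words(d)` would replace the confidence check inside the drawing loop, which keeps the filtering logic separate from the OpenCV calls.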

Take the example of finding where a date appears in an image. Here the template is a regular expression pattern that we match against the OCR results to find the appropriate bounding boxes. We will use Python's re module and the image_to_data function for this.

In [18]:
import re

# Raw string so that \d reaches the regex engine unchanged
date_pattern = r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$'

img1 = img.copy()  # draw on a copy so img1 exists even when nothing matches
n_boxes = len(d['text'])
for i in range(n_boxes):
    if float(d['conf'][i]) > 60:
        if re.match(date_pattern, d['text'][i]):
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            cv2.rectangle(img1, (x, y), (x + w, y + h), (0, 0, 255), 2)
cv2.namedWindow('date-pattern', cv2.WINDOW_KEEPRATIO)
cv2.imshow('date-pattern', img1)
cv2.waitKey(0)

-1
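
The receipt text shown earlier contains one date written with dots (25.04.2020) and one with slashes (26/04/2020), so a slash-only pattern would skip the first. Below is a minimal sketch of a reusable matcher that accepts either separator; `DATE_RE`, `find_matching_boxes`, and the `sample` dictionary are hypothetical names and data introduced for illustration, and the set of accepted separators is an assumption.

```python
import re

# Accepts dd/mm/yyyy, dd.mm.yyyy or dd-mm-yyyy (the separator set is an assumption).
DATE_RE = re.compile(
    r'^(0[1-9]|[12][0-9]|3[01])[./-](0[1-9]|1[012])[./-](19|20)\d\d$'
)


def find_matching_boxes(data, pattern, min_conf=60.0):
    """Return [(text, (x, y, w, h)), ...] for confident words matching `pattern`."""
    hits = []
    for i, text in enumerate(data['text']):
        if float(data['conf'][i]) > min_conf and pattern.match(text):
            box = (data['left'][i], data['top'][i],
                   data['width'][i], data['height'][i])
            hits.append((text, box))
    return hits


# Hand-made dictionary in the image_to_data layout, for illustration only.
sample = {
    'text': ['Salgskvittering', '25.04.2020', '18:31', '26/04/2020'],
    'conf': ['90', '88', '80', '75'],
    'left': [5, 120, 200, 5],
    'top': [10, 10, 10, 300],
    'width': [100, 70, 40, 70],
    'height': [12, 12, 12, 12],
}
for text, box in find_matching_boxes(sample, DATE_RE):
    print(text, box)
```

Any compiled pattern can be passed in, so the same helper could look for amounts or organisation numbers instead of dates.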

# Page segmentation modes
There are several ways a page of text can be analysed. The Tesseract API provides several page segmentation modes for cases such as running OCR on only a small region or on text in a different orientation.

Tesseract supports the following page segmentation modes:

0    Orientation and script detection (OSD) only.
1    Automatic page segmentation with OSD.
2    Automatic page segmentation, but no OSD, or OCR.
3    Fully automatic page segmentation, but no OSD. (Default)
4    Assume a single column of text of variable sizes.
5    Assume a single uniform block of vertically aligned text.
6    Assume a single uniform block of text.
7    Treat the image as a single text line.
8    Treat the image as a single word.
9    Treat the image as a single word in a circle.
10    Treat the image as a single character.
11    Sparse text. Find as much text as possible in no particular order.
12    Sparse text with OSD.
13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

To change the page segmentation mode, set the --psm argument in your custom config string to one of the mode codes listed above.
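
When several modes are tried in turn, it can help to build the config string in one place. This is a minimal sketch; `build_config` is a hypothetical helper, not part of pytesseract, and it only assembles the same `-l`, `--oem` and `--psm` arguments used in the cells of this notebook.

```python
def build_config(psm=3, oem=3, lang=None):
    """Build a pytesseract config string from a PSM, an OEM and an optional language."""
    if psm not in range(14):
        raise ValueError(f'psm must be between 0 and 13, got {psm}')
    parts = []
    if lang:
        parts += ['-l', lang]
    parts += ['--oem', str(oem), '--psm', str(psm)]
    return ' '.join(parts)


print(build_config(psm=6))               # --oem 3 --psm 6
print(build_config(psm=11, lang='nor'))  # -l nor --oem 3 --psm 11
```

The result would be passed as the config argument, for example `pytesseract.image_to_string(img, config=build_config(psm=6))`.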

# Detect orientation and script
You can detect the orientation of text in your image and also the script in which it is written.

In [21]:
osd = pytesseract.image_to_osd(img)
print(osd)
# Raw strings keep \d and \w intact; the script name is a word, not a number
angle = re.search(r'(?<=Rotate: )\d+', osd).group(0)
script = re.search(r'(?<=Script: )\w+', osd).group(0)
print("angle: ", angle)
print("script: ", script)

Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.05
Script: Latin
Script confidence: 1.23

angle:  0
script:  Latin
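
The text returned by image_to_osd is a series of `key: value` lines, so it can also be parsed in one pass instead of one regular expression per field. The sketch below assumes the report format printed above; `parse_osd` is a hypothetical helper name, and `sample_osd` is a copy of that printed report.

```python
def parse_osd(osd_text):
    """Turn the `key: value` lines of an OSD report into a dictionary."""
    result = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(':')
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        try:
            result[key.strip()] = float(value) if '.' in value else int(value)
        except ValueError:
            result[key.strip()] = value  # non-numeric, e.g. the script name
    return result


sample_osd = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.05
Script: Latin
Script confidence: 1.23
"""
info = parse_osd(sample_osd)
print(info['Rotate'], info['Script'])
```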


In [23]:
print(pytesseract.image_to_string(img))

 

a lerud
Apent ¢ = 02 (8 ~ 21)
84 98 00
SA
219 287 MVA

   
 
 

2
25 04.2020

X-TRA
kr/k9
MOZARELLA 41.80
stk i
HVITL@K 1006
JORD. TANNT . EASYSL IDE 43.90
NZTTER LOSVEKT WW.3t
kg ee
28s
POTET

OO
UN
o
:
x
Ss
a
qe
w
WwW
-
\O

 
 

2 stk
GKO FLBTE 2)

4A 6 \
Jotalt (14 F ne

:

varer
Bax: 17249983-457144
aes
ikke

| ikke presentert

 



You can download the .traineddata file for the language you need from https://tesseract-ocr.github.io/tessdoc/Data-Files and place it in the $TESSDATA_PREFIX directory (this should be the same as where the tessdata directory is installed), after which it should be ready to use. On Windows, go to C:\ProgramData\Anaconda3\Library\bin\tessdata <br/>To specify the language of your OCR output, use the -l LANG argument in the config, where LANG is the three-letter code of the language you want to use.

In [34]:
custom_config = r'-l nor --psm 6'
print(pytesseract.image_to_string(img, config=custom_config))

29
) ' —Eä leru.
” Apent 7 — 23 (8 — 21)
84 98 00
P SA
e 979 287 MVA
2
Salgskvitterin9 201TT 25 .04.2020
kr/kg
MOZARELLA 41 .80
1006
oQ — JORD.TANNT . ASYSLIDE 43 .90
M NØTTER LØSvVEKT 11131
ka å
Ø — POTET ASsoRTERT 02
N.SALTET ” 3
8 SUKKER
$ — x—Tro HvITL.BAGUET 19.
stk 9.90 w

å ØKO FLØTE 1i
|

Totalt (14 Artikler) L
|
!

Øvrige varer (

17249983—457144

Overf . : 779

sy

OPERATØR



You can restrict recognition to digits by changing the config to the following:

In [8]:
custom_config = r'-l nor --oem 3 --psm 6 outputbase digits'
print(pytesseract.image_to_string(img, config=custom_config))

'B EXTRÅ
pi = Kallerud
| QPQYH .( — 23 t$ — 21)
< Telefon 41 84 98 00
% COOP INNLANDET sA
5 Org.nr 979 419 287 MVA
Butikk 2612—8, Skann & Betal 8
a Salgskvittering 201tf 25.04.2020 18. 31
$ BANAN X—TRA KG 17 a1 —
p 0.976 ka 17.90 kr/ks
k — CO0P MOZARELLRA 41 .80
B Antall: 2 stk 20.90 kr/stk
90 HVITLØK 1006 12.90
& — JORD.TANNT.EASYSLIDE 43.90
W NØTTER LØSvEkt Wi St
$ 0.430 kg 259.00 kr/ks
E. PAPRIKA X—TRA 2STK 12 2
W — POTET ASSORTERT 13.93
. 0.876 kg 15.90 kr/ka 3
M SMØR N.SALTET 5006 39 .
8 — SUKKER 1KG 20.3
Å TOMATER ' 22 .8
kv 0.618 kg 36 .90 kr/ks
R. x%—TRA HVITL . RAGUETT 19 .
Antall: 2 stk | 9.90 kr/stk
$ ØKO FLØTE 38x 19.
4 — Totalt (14 Artikler) 32
i Bank: 382
I — Herav
Dag l igvarer 382
Øvrige varer (
Bax: 17249983—457144
25/04/2020 18: 36 Overf . : 779
Kortet ikke presentert
Ref — — sem
AvBRUTT AV OPERATØR
| jj
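
In the output above, letters are still present, so the digits config may not take effect in every setup. One fallback, sketched below under that assumption, is to pull the numeric tokens out of the OCR text afterwards. `NUMBER_RE` and `extract_numbers` are hypothetical names, and treating both '.' and ',' as decimal marks is an assumption.

```python
import re

# Integers or decimals with '.' or ',' as the decimal mark (an assumption).
NUMBER_RE = re.compile(r'\d+(?:[.,]\d+)?')


def extract_numbers(ocr_text):
    """Return every numeric token found in the OCR text, in reading order."""
    return NUMBER_RE.findall(ocr_text)


line = 'B Antall: 2 stk 20.90 kr/stk'
print(extract_numbers(line))
```

This does not correct misread digits; it only discards everything that is not a number.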



# Whitelisting and blacklisting characters
If you want to detect only certain characters in the given image and ignore the rest, you can specify a whitelist of characters (here, only the lowercase letters a to z) using the following config.

In [6]:
custom_config = r'-c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz --psm 6'
print(pytesseract.image_to_string(img, config=custom_config))

n exr
y fallerud
apent is t
s felefon
cooas
y orgnrv
utikkkannetal
ra algskvitterins
mxk
kgkrkg
h
p antallstkkrstk
y
me
kgkrks
xk
a kgkrks
k
e
fs kg krkg
xt a
antallstk krstk
gke a
dotaltfrtikler
ank
erav
aglisgvarer
yrigevarer
ax
i verf
ortetikkepresentert
efe
a



If you are sure that some characters or expressions will never appear in your text, you can blacklist them using the following config. If a blacklisted character does appear in the image, the OCR will return wrong text in its place.

In [7]:
custom_config = r'-c tessedit_char_blacklist=0123456789 --psm 6'
pytesseract.image_to_string(img, config=custom_config)

"n% EXTRA\n¢! —___ Kallerud\n| Apent A = OR «\ns Telefon  &¢\nCOOP INNLANDET sA\ny Org.nr ? I  MVA\nButikk Z/@-, Skann & Betal\nra Salgskvitterins ZO ..Z &:'\nmm = BANAN X-TRA KG '\n: O. kg .O kr/kg\nhe §=©COOP MOZARELLA A\\ .O\nP Antall: Z stk .O kr/stk\nMm HVITLOK OG {.\nFy = JORD. TANNT .EASYSLIDE .\nm NOTTER LOSVEKT WW.?\nO.O kg .O kr/ks\nE PAPRIKA X-TRA ZSTK .-\n@ POTET ASSORTERT \\.\na O,./ kg . kr/ks\nM SHOR N.SALTET SOOG .\nB SUKKER KG O.\nE TOMATER | ZZ.%\nfs O./% kg . kr/kg\n= X-TRA HVITL. BAGUETT \\.%\nAntall: stk %. kr/stk\ng KO FLOTE % a\nI Totalt ( Artikler) R?,\n| Bank: $\n| Herav\nDaglisgvarer $Z\nSyrige varer (\nBax: Z-\n//O i%: Overf .:\nKortet ikke presentert\nRef .¢ >\nAVBRUTT AV OPERATOR\na\n\x0c"

More examples: https://pypi.org/project/pytesseract/
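
The whitelist and blacklist configs in this section differ only in the variable name and the character set, so they can be generated by one small helper. This is a sketch only: `char_filter_config` is a hypothetical name, and it rejects whitespace and quote characters on the assumption that the config string is split into separate command-line arguments before it reaches Tesseract.

```python
import string


def char_filter_config(whitelist=None, blacklist=None, psm=6):
    """Build a config string using tessedit_char_whitelist / tessedit_char_blacklist."""
    parts = []
    for name, chars in (('tessedit_char_whitelist', whitelist),
                        ('tessedit_char_blacklist', blacklist)):
        if not chars:
            continue
        # Assumption: whitespace or quotes would break argument splitting.
        if any(c.isspace() or c in '\'"' for c in chars):
            raise ValueError('whitespace and quotes are not handled by this sketch')
        parts += ['-c', f'{name}={chars}']
    parts += ['--psm', str(psm)]
    return ' '.join(parts)


print(char_filter_config(whitelist=string.ascii_lowercase))
print(char_filter_config(blacklist=string.digits))
```

The two printed strings are the same configs used in the cells above.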

# Tesseract limitations:

- The OCR is not as accurate as some commercial solutions available to us.
- Doesn't do well with images affected by artifacts including partial occlusion, distorted perspective, and complex background.
- It is not capable of recognizing handwriting.
- It may find gibberish and report this as OCR output.
- If a document contains languages outside of those given in the -l LANG arguments, results may be poor.
- It is not always good at analyzing the natural reading order of documents. For example, it may fail to recognize that a document contains two columns, and may try to join text across columns.
- Poor quality scans may produce poor quality OCR.
- It does not expose information about what font family text belongs to.

# Preprocessing for Tesseract
To keep Tesseract's accuracy from dropping, make sure the image is appropriately pre-processed.

This includes rescaling, binarization, noise removal, deskewing, etc.

# Training Tesseract on custom data
Tesseract 4.00 includes a new neural network-based recognition engine that delivers significantly higher accuracy on document images. Neural networks require significantly more training data and train a lot slower than base Tesseract. For Latin-based languages, the existing model data provided has been trained on about 400000 text lines spanning about 4500 fonts.

To run the Tesseract 4.0 LSTM training tutorial successfully, you need a working installation of Tesseract 4 and the Tesseract 4 Training Tools, and the training scripts and required trained data files must be in certain directories. Visit the GitHub repo for files and tools.

Training Tesseract 4.00 from scratch takes a few days to a couple of weeks. Even with all this training data, the provided models may not suit your particular problem, so here are a few options for training:
- Fine-tune - Starting with an existing trained language, train on your specific additional data, for example a handwritten dataset and some additional fonts.
- Cut off the top layer - Remove the top layer from the network and retrain a new top layer using the new data. If fine-tuning doesn't work, this is most likely the next best option. An analogy shows why this is useful: take models trained on the ImageNet dataset, where the goal is to build a cat or dog classifier. Lower layers in the model are good at low-level abstractions such as corners and horizontal and vertical lines, while higher layers combine those features to detect cat or dog ears, eyes, noses and so on. By retraining only the top layers, you reuse the knowledge in the lower layers and combine it with your new, different dataset.
- Retrain from scratch - This is a very slow approach unless you have a very representative and sufficiently large training set for your problem. The best resource for training from scratch is this GitHub repo.

Guides on how to train on your custom data and create .traineddata files can be found at: https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch, https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/, https://medium.com/@vovaprivalov/tesseract-ocr-tips-custom-dictionary-to-improve-ocr-d2b9cd17850b

# Tasks
1. Try Tesseract OCR on all proposed images. First try without any pre-processing. Then implement various pre-processing functions: binarization, morphological operations, denoising, deskewing, etc. After implementing the pre-processing functions, run OCR again and compare the performance of the two paths.
2. Collect your own mini-dataset: photos of digital texts under various orientations, illuminations, etc. Take photos of handwritten text. Try out the library, using various settings and configurations. Assess the potentials and limitations.
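
To compare the two paths in task 1 with a number rather than by eye, one common option is the character error rate: the edit distance between the OCR output and a manually typed reference, divided by the length of the reference. The sketch below is one possible implementation; the function names and both example strings are made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError('reference text must not be empty')
    return levenshtein(reference, hypothesis) / len(reference)


# Made-up strings: the hypothesis confuses l/1 and 0/O.
print(character_error_rate('Total 382.00', 'Tota1 382.0O'))
```

A lower value means the OCR output is closer to the reference, so the rate can be computed once for the raw image and once for the pre-processed one.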

<b>Others:</b>
- http://sujitpal.blogspot.com/2016/03/detecting-corruption-in-ocr-text.html
- https://medium.com/better-programming/beginners-guide-to-tesseract-ocr-using-python-10ecbb426c3d