# Teach Python how to read

## Using PIL and pytesseract 

This is the first post of a three- or four-part project (I haven't decided yet). The main idea is to build a model that tells me how much I would like a book, given an image of one of its pages as input.

In this part, I will show you how to turn an image of text into actual text using pytesseract. Let's first import our packages.

## Import packages

In [1]:
import numpy as np
from pathlib import Path
import pytesseract
import re
from PIL import Image, ImageFilter 
import pandas as pd

## Define PosixPath to data

I took about 100 photos of pages from books I own (all in German) and put them in an image folder. Let's have a look at the directory.

In [2]:
p = Path('../storage/data/book_text_images/')

In [3]:
# Keep files only, so that sub-folders inside the image folder are skipped
img_paths = [x for x in p.iterdir() if x.is_file()]

In [5]:
# .name is the last path component, the same value as .parts[4] for this folder
img_paths[0].name

'IMG_5123_gegendenStrich.jpg'

## Use regex to extract title of book

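Besides the text from the image, I also want the title of the book, so that I can later join my ratings to the texts. The file names look like IMG_5123_gegendenStrich.jpg, so I use a regex that captures everything between the last underscore and the file extension.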

In [6]:
# Raw string avoids the invalid "\_" escape; path.name is the file name
title_list = [re.match(r"^.*_(.*)\..*$", path.name).group(1) for path in img_paths]
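
To see what the pattern captures, here is a small sketch run on a few file names. Only the first name appears in this post; the other two, as well as the helper `extract_title`, are made up for illustration. Because `.*` is greedy, the group holds everything between the last underscore and the last dot.

```python
import re

# Same pattern as above, written as a raw string.
TITLE_PATTERN = re.compile(r"^.*_(.*)\..*$")

def extract_title(filename):
    """Return the part between the last underscore and the last dot, or None."""
    match = TITLE_PATTERN.match(filename)
    return match.group(1) if match else None

# The first name is from this post; the other two are hypothetical.
print(extract_title("IMG_5123_gegendenStrich.jpg"))  # gegendenStrich
print(extract_title("IMG_0001_diePest.jpeg"))        # diePest
print(extract_title("notes.txt"))                    # None (no underscore)
```

Returning None instead of calling `.group(1)` directly makes it easier to spot files that do not follow the naming scheme.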

## Automatically read in images, transform them and get text

We use pytesseract to extract the text from the images. To improve the OCR quality I tried a lot of image transformations: cropping, binarizing and more. The only thing that worked for me was to first rotate the image and then apply a median filter.

In [7]:
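def proc_img(img_path):
    # Open the photo of the page
    im1 = Image.open(img_path)

    # Rotate by 270 degrees; expand=True enlarges the canvas so nothing is cropped
    im1 = im1.rotate(angle=270, expand=True)

    # Smooth out pixel noise with a median filter (default size 3)
    im1 = im1.filter(ImageFilter.MedianFilter(size=3))

    return im1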
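
A quick way to check what these two steps do is to run them on a synthetic image instead of a photo. The sketch below assumes a hypothetical 40x20 white canvas with a single black speckle; the names `page`, `rotated` and `smoothed` are only for illustration. With `expand=True`, the 270 degree rotation swaps width and height, and the median filter should remove the isolated pixel without changing the size.

```python
from PIL import Image, ImageFilter

# Hypothetical stand-in for a photographed page: 40 px wide, 20 px high, white.
page = Image.new("L", (40, 20), color=255)

# Add one isolated black pixel to imitate noise.
page.putpixel((5, 5), 0)

# expand=True grows the canvas so the rotated image is not cropped.
rotated = page.rotate(270, expand=True)

# A 3x3 median filter replaces each pixel by the median of its neighbourhood.
smoothed = rotated.filter(ImageFilter.MedianFilter(size=3))

print(page.size, rotated.size, smoothed.size)
print(rotated.getextrema(), smoothed.getextrema())
```

The second line of output compares the darkest and brightest pixel values before and after filtering.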

All of my input images are from German books, so I need to use lang="deu" (this requires the German language data for Tesseract to be installed).

In [8]:
def get_text(img):
    return pytesseract.image_to_string(img, lang="deu")

I again use a regular expression, this time to get rid of common mistakes pytesseract makes: putting a \n somewhere or confusing an s with a 5.

In [9]:
def use_pattern(text):
    return pattern.sub(lambda m: rep[re.escape(m.group(0))], text)

In [10]:
# Define the desired replacements here
rep = {"\n": "", "`": "", '%': "", '°': '', '&': '', '‘': '', '€': 'e', '®': '',
       '\\': '', '5': 's', '1': 'i', '_': '', '-': ''}

# Escape the keys so that each one is matched literally by the regex
rep = dict((re.escape(k), v) for k, v in rep.items())
# Build a single pattern that matches any of the keys
pattern = re.compile("|".join(rep.keys()))
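
To make the mechanics concrete, here is a self-contained sketch of the same single-pass replacement on made-up strings. The reduced mapping `replacements` and the names `demo_pattern` and `clean` are hypothetical and only serve the illustration. It also shows a side effect worth knowing: every 5 and 1 is replaced, including genuine digits such as page numbers.

```python
import re

# Reduced, hypothetical mapping for illustration only.
replacements = {"\n": "", "5": "s", "1": "i", "_": ""}

# Escape each key so it is matched literally, then join into one alternation.
demo_pattern = re.compile("|".join(re.escape(key) for key in replacements))

def clean(text):
    """Replace every matched character in a single pass over the text."""
    return demo_pattern.sub(lambda m: replacements[m.group(0)], text)

print(clean("Da5 i5t e1n Te_st\n"))  # Das ist ein Test
print(clean("Seite 15"))             # Seite is
```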

## Use tesseract to turn images into text

Finally, we use a list comprehension (they're super useful) to put all of the text into a list of texts.

In [11]:
text_list = [use_pattern(get_text(proc_img(path))) for path in img_paths]

## Combine into Dataframe

And now let's put that into a pandas dataframe.

In [12]:
d = {'text':text_list,'title':title_list}
df = pd.DataFrame(d) 
df.head()

|   | text | title |
|---|------|-------|
| 0 | war ein schrecklicher Rückfall eingetreten.In ... | gegendenStrich |
| 1 | höchst moralischer Akt, die Welt von einem sol... | derSeewolf |
| 2 | deutsches Luder nehmen. Und sollten Sie es dan... | ButchersCrossing |
| 3 | müssen.»Sie kamen jetzt in die Vorstadt. Die S... | diePest |
| 4 | ins Gesicht, wandte sich von ihrem traurigen A... | diePest |


In [20]:
df.text[3]

'müssen.»Sie kamen jetzt in die Vorstadt. Die Scheinwerfer beleuchteten die menschenleeren Straßen. Sie hielten an.Vor dem Auto fragte Rieux Tarrou, ob er mitkommenwolle, und der sagte ja. Ein Schimmer vom Himmel er—hellte ihre Gesichter; Rieux lachte plötzlich freundschaftlich.« Sagen Sie, Teirrou, was treibt Sie dazu, sich damit zubefassen? »«Ich weiß nicht. Meine Moral vielleicht.»«Und die wäre? »«Verständnis.»Tarrou wandte sich dem Haus zu, und Rieux sah erstwieder sein Gesicht, als sie bei dem alten Asthmatikerwaren.'

Looking good! For this project I also need my rating for each of the books. I use a dictionary and the pandas map method to create a column with my ratings.

## Use Dictionary to map my ratings

In [14]:
rating_lasse = {'derPate': 5,
                 'ButchersCrossing': 4,
                 'derSeewolf': 5,
                 'JekyllandHyde': 4,
                 'gegendenStrich': 1,
                 'FruestueckmitKaengurus': 5,
                 'HuckleberryFinn': 4,
                 'diePest': 2,
                 'HerzderFinsternis': 3,
                 'derSpieler': 4}

In [15]:
df['rating'] = df['title'].map(rating_lasse)

In [16]:
df.head()

|   | text | title | rating |
|---|------|-------|--------|
| 0 | war ein schrecklicher Rückfall eingetreten.In ... | gegendenStrich | 1 |
| 1 | höchst moralischer Akt, die Welt von einem sol... | derSeewolf | 5 |
| 2 | deutsches Luder nehmen. Und sollten Sie es dan... | ButchersCrossing | 4 |
| 3 | müssen.»Sie kamen jetzt in die Vorstadt. Die S... | diePest | 2 |
| 4 | ins Gesicht, wandte sich von ihrem traurigen A... | diePest | 2 |
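
One thing to keep in mind with `map`: a title that is missing from the dictionary does not raise an error but becomes `NaN`. The following sketch uses a tiny, made-up DataFrame (the title `unknownBook` and the names `demo`, `ratings` and `missing` are hypothetical) to show how such rows can be spotted.

```python
import pandas as pd

# Tiny stand-in for the real DataFrame; "unknownBook" is a hypothetical title.
demo = pd.DataFrame({"title": ["diePest", "derSeewolf", "unknownBook"]})
ratings = {"diePest": 2, "derSeewolf": 5}

# Titles without an entry in the dictionary are mapped to NaN.
demo["rating"] = demo["title"].map(ratings)

# List the titles that still need a rating.
missing = demo.loc[demo["rating"].isna(), "title"].tolist()
print(missing)  # ['unknownBook']
```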


And that's it! Let's save this DataFrame, and we're ready to move on to model training.

In [24]:
df.to_csv(p/'datasets/text_df.csv', encoding='utf8', index=False)
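
As far as I know, `to_csv` does not create missing folders, so the target directory has to exist before saving. Here is a minimal sketch of a save-and-reload round trip; it assumes a temporary directory and a one-row, made-up DataFrame instead of the real data, and the names `demo`, `out_dir` and `restored` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One-row stand-in for the real DataFrame.
demo = pd.DataFrame({"text": ["müssen.»Sie kamen"], "title": ["diePest"], "rating": [2]})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder first; exist_ok avoids an error on re-runs.
    out_dir = Path(tmp) / "datasets"
    out_dir.mkdir(parents=True, exist_ok=True)

    out_file = out_dir / "text_df.csv"
    demo.to_csv(out_file, encoding="utf8", index=False)

    # Read the file back to confirm that umlauts survive the round trip.
    restored = pd.read_csv(out_file, encoding="utf8")

print(restored.shape)
print(restored["text"].iloc[0])
```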

I hope you enjoyed this blog post. Stay tuned for the next one!

Lasse