### Example Receipt:

<img src="http://www.trbimg.com/img-561b165a/turbine/la-sp-sarkisian-receipt-20151011" width="200">

### OCR using Tesseract
`$ brew install tesseract`

`$ pip install pytesseract`

In [1]:
from PIL import Image
import requests
from io import BytesIO
import pytesseract

In [2]:
img_url = "http://www.trbimg.com/img-561b165a/turbine/la-sp-sarkisian-receipt-20151011"
response = requests.get(img_url)
img = Image.open(BytesIO(response.content))

In [3]:
ocr_text = pytesseract.image_to_string(img)
print (ocr_text)

THE SUNCADIA RESORT

TC GOLF HOUSE
1103 CLAIRE i"
V5 1275 6st 1
SARK

JUNOS'11 10: 16AM

35 HEINEKEN 157.50
35 COORS LT 175.00
12 GREY GOSSE 120.00
7 BUD LIGHT 35.00
7 BAJA CHICKEN 84.00
2 RANCHER 24.00
1 CLASSIC 8.00
1 SALMON BLT 13.00
1 DRIVER 12.00
6 CORONA 36.00
2 7-UP 4,50
Subtotal 669.00
Tax 53.52

3:36 Arent Que $722.52
FOR HOTEL GUEST ROOM CHARGE ONLY
Gratuity.


### Entity recognition using spaCy
`$ pip install spacy`

`$ python -m spacy download en_core_web_sm`

In [4]:
import spacy
import pandas as pd

In [5]:
nlp = spacy.load('en_core_web_sm')

In [6]:
doc = nlp(ocr_text)
# One row per detected entity: its text and the label spaCy predicted for it
df = pd.DataFrame({
    'text': [entity.text for entity in doc.ents],
    'label': [entity.label_ for entity in doc.ents],
})
df.head(50)

| | text | label |
|---|---|---|
| 0 | GOLF HOUSE | ORG |
| 1 | 1103 | DATE |
| 2 | CLAIRE | ORG |
| 3 | `\n` | GPE |
| 4 | `1275 6st 1\n` | DATE |
| 5 | 10 | CARDINAL |
| 6 | 35 | CARDINAL |
| 7 | HEINEKEN | ORG |
| 8 | 157.50 | PRODUCT |
| 9 | 35 | CARDINAL |


The total price can be read from spaCy's default 'MONEY' entity type:

In [7]:
total = 0
for entity in doc.ents:
    # Each MONEY entity overwrites the previous one, so the last one wins
    if entity.label_ == 'MONEY':
        total = float(entity.text)
total

722.52

Named entity recognition (NER) with a tool such as spaCy is a natural fit here, although the default labels above are mostly wrong for receipt text.
spaCy could be trained on custom entity types instead, e.g. "item", "price", "total price", "account number".
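
Until such a model is trained, the "item" and "price" labels can be approximated with a plain regular expression. The following is only a sketch of that stand-in, not spaCy training: the pattern, the `tag_line` helper and the field names are assumptions made for illustration, and the pattern only handles rows shaped like `<quantity> <item> <price>` with a period as the decimal separator.

```python
import re

# Hypothetical pattern: leading quantity, item name, trailing price with two decimals
LINE_ITEM = re.compile(r"^(?P<qty>\d+)\s+(?P<item>.+?)\s+(?P<price>\d+\.\d{2})$")

def tag_line(line):
    """Return labelled fields for a line-item row, or None if the row does not match."""
    match = LINE_ITEM.match(line.strip())
    if match is None:
        return None
    return {
        "quantity": int(match.group("qty")),
        "item": match.group("item"),
        "price": float(match.group("price")),
    }

sample = ["35 HEINEKEN 157.50", "1 SALMON BLT 13.00", "Subtotal 669.00", "SARK"]
tagged = [tag_line(row) for row in sample]
```

Rows that do not start with a quantity, such as the subtotal line, come back as `None` and can be handled separately.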

### Some data cleaning & investigation

In [8]:
ocr_arr = ocr_text.split('\n')
ocr_arr

['THE SUNCADIA RESORT',
 '',
 'TC GOLF HOUSE',
 '1103 CLAIRE i"',
 'V5 1275 6st 1',
 'SARK',
 '',
 "JUNOS'11 10: 16AM",
 '',
 '35 HEINEKEN 157.50',
 '35 COORS LT 175.00',
 '12 GREY GOSSE 120.00',
 '7 BUD LIGHT 35.00',
 '7 BAJA CHICKEN 84.00',
 '2 RANCHER 24.00',
 '1 CLASSIC 8.00',
 '1 SALMON BLT 13.00',
 '1 DRIVER 12.00',
 '6 CORONA 36.00',
 '2 7-UP 4,50',
 'Subtotal 669.00',
 'Tax 53.52',
 '',
 '3:36 Arent Que $722.52',
 'FOR HOTEL GUEST ROOM CHARGE ONLY',
 'Gratuity.']

Getting every line ending with a number:

In [9]:
def valid_line(s):
    # True if the last space-separated token is a number such as '12' or '12.00'
    last_token = s.split(' ')[-1]
    return last_token.replace(".", "", 1).isdigit()

def number_ending_lines(items):
    return [item for item in items if valid_line(item)]

In [10]:
filtered_list = number_ending_lines(ocr_arr)
filtered_list

['V5 1275 6st 1',
 '35 HEINEKEN 157.50',
 '35 COORS LT 175.00',
 '12 GREY GOSSE 120.00',
 '7 BUD LIGHT 35.00',
 '7 BAJA CHICKEN 84.00',
 '2 RANCHER 24.00',
 '1 CLASSIC 8.00',
 '1 SALMON BLT 13.00',
 '1 DRIVER 12.00',
 '6 CORONA 36.00',
 'Subtotal 669.00',
 'Tax 53.52']

Next, try subset sum on the numbers at the end of each line to find a combination that adds up to the total, i.e. the items purchased:

In [11]:
def subsetsum(array, num):
    # A target below 1 cannot be reached with the remaining positive amounts
    if num < 1:
        return None
    elif len(array) == 0:
        return None
    else:
        if array[0] == num:
            return [array[0]]
        else:
            with_v = subsetsum(array[1:],(num - array[0])) 
            if with_v:
                return [array[0]] + with_v
            else:
                return subsetsum(array[1:],num)

def get_prices(items):
    prices = []
    for item in items:
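        # The price is the last space-separated token on the line
        prices.append(float(item.split(' ')[-1]))
    # Drop the total itself (the `total` computed in In [7]) so it cannot match trivially
    prices = list(filter(lambda x: x != total, prices))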
    return prices
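
The recursive `subsetsum` above may explore every subset in the worst case. For longer receipts, one possible alternative is a dynamic-programming pass over reachable sums. This is a sketch only, assuming non-negative integer amounts (e.g. pennies); `subset_sum_dp` is a hypothetical name and is not used elsewhere in this notebook. Unlike `subsetsum`, it returns an empty list for a target of 0.

```python
def subset_sum_dp(values, target):
    """Return one subset of `values` that sums to `target`, or None if none exists."""
    # Maps each reachable sum to one list of values that produces it
    reachable = {0: []}
    for value in values:
        # Iterate over a snapshot so each value is used at most once
        for current_sum, subset in list(reachable.items()):
            new_sum = current_sum + value
            if new_sum <= target and new_sum not in reachable:
                reachable[new_sum] = subset + [value]
    return reachable.get(target)

example = subset_sum_dp([100, 66900, 5352], 72252)
```

The number of stored sums is bounded by the target, so the work grows with `len(values) * target` rather than with the number of subsets.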

In [12]:
prices = get_prices(filtered_list)
prices

[1.0,
 157.5,
 175.0,
 120.0,
 35.0,
 84.0,
 24.0,
 8.0,
 13.0,
 12.0,
 36.0,
 669.0,
 53.52]

In [13]:
total

722.52



Converting the prices to integer pennies so that sums can be compared exactly:

In [14]:
# round() before int() so that float error cannot truncate an amount down by a penny
prices_penny = list(map(lambda x: int(round(x * 100)), prices))
total_penny = int(round(total * 100))
ans = subsetsum(prices_penny, total_penny)
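
The reason for working in pennies is that floats cannot represent most decimal fractions exactly, so equality checks on float sums are unreliable. A small sketch of the issue, with one possible alternative conversion that parses the price text with `decimal.Decimal` instead of multiplying a float; the `to_pennies` helper is hypothetical and not part of the cells above.

```python
from decimal import Decimal

def to_pennies(amount_text):
    """Convert a price string such as '53.52' to an integer number of pennies."""
    # Decimal parses the text exactly, so no binary rounding error is introduced
    return int(Decimal(amount_text) * 100)

# Float addition does not compare equal to the expected decimal result
float_sum_matches = (0.1 + 0.2 == 0.3)

# The same comparison in whole pennies is exact
penny_sum_matches = (to_pennies("0.10") + to_pennies("0.20") == to_pennies("0.30"))
```

Here `float_sum_matches` is `False` while `penny_sum_matches` is `True`.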

In [15]:
print (ans)

[66900, 5352]


The subset found is just subtotal + tax, which trivially adds up to the total.

The '2 7-UP 4,50' line was dropped by the number filter because OCR read its decimal point as a comma. Adding it back:

In [16]:
prices_penny.append(int(4.50 * 100))
ans_with_7up = subsetsum(prices_penny, total_penny)
print (ans_with_7up)

[15750, 17500, 12000, 3500, 8400, 2400, 800, 1300, 1200, 3600, 5352, 450]


With the 4.50 added back, the search returns the individual line items plus tax instead of the subtotal, showing which numbers on this receipt are the individual purchases.
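
Rather than appending the missing price by hand, the comma could be repaired before filtering. A minimal sketch, assuming that a comma followed by exactly two digits at the end of a token is a misread decimal point; `normalise_price_token` and `ends_with_price` are hypothetical helpers, with the latter mirroring the `valid_line` check from In [9].

```python
import re

def normalise_price_token(token):
    """Rewrite a token like '4,50' as '4.50'; leave every other token unchanged."""
    # Assumption: '<digits>,<two digits>' is an OCR misread of a decimal point
    return re.sub(r"^(\d+),(\d{2})$", r"\1.\2", token)

def ends_with_price(line):
    """True if the last space-separated token is numeric after normalisation."""
    last_token = normalise_price_token(line.split(" ")[-1])
    return last_token.replace(".", "", 1).isdigit()

lines = ["2 7-UP 4,50", "6 CORONA 36.00", "Gratuity."]
kept = [line for line in lines if ends_with_price(line)]
```

With this check the 7-UP line survives the filter, while tokens such as `1,234` are left alone because they do not match the two-digit pattern.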