### Example Receipt:

<img src="http://www.trbimg.com/img-561b165a/turbine/la-sp-sarkisian-receipt-20151011" width="200">

### OCR using Tesseract
`$ brew install tesseract`

`$ pip install pytesseract`

In [1]:
from PIL import Image
import requests
from io import BytesIO
import pytesseract

In [2]:
img_url = "http://www.trbimg.com/img-561b165a/turbine/la-sp-sarkisian-receipt-20151011"
response = requests.get(img_url)
img = Image.open(BytesIO(response.content))

In [4]:
ocr_text = pytesseract.image_to_string(img)
print (ocr_text)

THE SUNCADIA RESORT

TC GOLF HOUSE
1103 CLAIRE i"
V5 1275 6st 1
SARK

JUNOS'11 10: 16AM

35 HEINEKEN 157.50
35 COORS LT 175.00
12 GREY GOSSE 120.00
7 BUD LIGHT 35.00
7 BAJA CHICKEN 84.00
2 RANCHER 24.00
1 CLASSIC 8.00
1 SALMON BLT 13.00
1 DRIVER 12.00
6 CORONA 36.00
2 7-UP 4,50
Subtotal 669.00
Tax 53.52

3:36 Arent Que $722.52
FOR HOTEL GUEST ROOM CHARGE ONLY
Gratuity.


### Entity recognition using spacy
`$ pip install spacy`

`$ python -m spacy download en_core_web_sm`

In [5]:
import spacy

In [6]:
nlp = spacy.load('en_core_web_sm')

In [7]:
doc = nlp(ocr_text)
for entity in doc.ents:
    print(entity.text, entity.label_)

GOLF HOUSE ORG
1103 DATE
CLAIRE ORG

 GPE
1275 6st 1
 DATE
10 CARDINAL
35 CARDINAL
HEINEKEN ORG
157.50 PRODUCT
35 CARDINAL
COORS LT 175.00
12 GREY ORG
120.00 CARDINAL
7 BUD LIGHT PRODUCT
35.00 CARDINAL
84.00 CARDINAL

 GPE
2 CARDINAL
24.00 CARDINAL
8.00 CARDINAL

1 SALMON BLT 13.00
1 DRIVER EVENT
12.00 CARDINAL
6 CARDINAL
4,50
Subtotal 669.00
 ORG
722.52 MONEY

FOR HOTEL GUEST ROOM CHARGE ORG

 GPE


Can get the total price using the default 'MONEY' spacy entity type:

In [10]:
total = 0
for entity in doc.ents:
    if (entity.label_ == 'MONEY'):
        total = float(entity.text)
total

722.52

Techniques like NER (named entity recognition) could be used here, using tools like spacy.
Spacy could be trained with custom entities, e.g. "item" "price" "total price" "account number"

### Some data cleaning & investigation

In [11]:
ocr_arr = ocr_text.split('\n')
ocr_arr

['THE SUNCADIA RESORT',
 '',
 'TC GOLF HOUSE',
 '1103 CLAIRE i"',
 'V5 1275 6st 1',
 'SARK',
 '',
 "JUNOS'11 10: 16AM",
 '',
 '35 HEINEKEN 157.50',
 '35 COORS LT 175.00',
 '12 GREY GOSSE 120.00',
 '7 BUD LIGHT 35.00',
 '7 BAJA CHICKEN 84.00',
 '2 RANCHER 24.00',
 '1 CLASSIC 8.00',
 '1 SALMON BLT 13.00',
 '1 DRIVER 12.00',
 '6 CORONA 36.00',
 '2 7-UP 4,50',
 'Subtotal 669.00',
 'Tax 53.52',
 '',
 '3:36 Arent Que $722.52',
 'FOR HOTEL GUEST ROOM CHARGE ONLY',
 'Gratuity.']

Getting every line ending with a number:

In [12]:
def valid_line(s):
    str_arr = s.split(' ')
    return (str_arr[len(str_arr) - 1].replace(".", "", 1).isdigit())

def number_ending_lines(items):
    return list(filter(lambda x: valid_line(x), items))

In [13]:
filtered_list = number_ending_lines(ocr_arr)
filtered_list

['V5 1275 6st 1',
 '35 HEINEKEN 157.50',
 '35 COORS LT 175.00',
 '12 GREY GOSSE 120.00',
 '7 BUD LIGHT 35.00',
 '7 BAJA CHICKEN 84.00',
 '2 RANCHER 24.00',
 '1 CLASSIC 8.00',
 '1 SALMON BLT 13.00',
 '1 DRIVER 12.00',
 '6 CORONA 36.00',
 'Subtotal 669.00',
 'Tax 53.52']

Trying to use subset sum on all the numbers at the end of lines to sum to the total. e.g. the items purchased

In [44]:
def subsetsum(array,num):
    if num == 0 or num < 1:
        return None
    elif len(array) == 0:
        return None
    else:
        if array[0] == num:
            return [array[0]]
        else:
            with_v = subsetsum(array[1:],(num - array[0])) 
            if with_v:
                return [array[0]] + with_v
            else:
                return subsetsum(array[1:],num)

def get_prices(items):
    prices = []
    for item in items:
        prices.append(float(item.split(' ')[len(item.split(' ')) - 1]))
    prices = list(filter(lambda x: x != total, prices))
    return prices

In [58]:
prices = get_prices(filtered_list)
prices

[1.0,
 157.5,
 175.0,
 120.0,
 35.0,
 84.0,
 24.0,
 8.0,
 13.0,
 12.0,
 36.0,
 669.0,
 53.52]

In [59]:
total

722.52

Converting to integers:

In [61]:
prices_penny = list(map(lambda x: int(x * 100), prices))
total_penny = int(total * 100)
ans = subsetsum(prices_penny, total_penny)

In [62]:
print (ans)

[66900, 5352]


The '2 7-UP 4,50' line was ignored as the decimal point was read as a comma

In [63]:
prices_penny.append(int(4.50 * 100))
ans_with_7up = subsetsum(prices_penny, total_pe)
print (ans_with_7up)

None


Adding the 4.50 back gives us a valid subset of our original set - showing which numbers represent subpurchases on this receipt