# Applying OCR to receipts

As part of some analyses, we might want to know the exact time a receipt was issued and/or which items make it up, so that we can analyse its contents.

For example, we can:

* Match the timestamp of a receipt issued in a city far away from Brasilia against a time when we believe the congressperson was supposed to be in a session.
* Match two timestamps of receipts issued on the same day in cities really far from each other.
* Look for things like alcoholic beverages.
* See if the congressperson ordered too many dishes for "himself".
* Extract check-in and check-out dates from hotel receipts.

Even though we have lots of libraries for doing OCR in Python, I believe [Google's Cloud Vision API](https://cloud.google.com/vision/) should be close to the "state of the art" for this type of thing since it is backed by Google, and it is also really easy to use. This notebook outlines the results of OCR'ing the receipts of the following 10 reimbursements, picked from another analysis I did:

* https://jarbas.datasciencebr.com/#/document_id/5631309 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/1789/2015/5631309.pdf
* https://jarbas.datasciencebr.com/#/document_id/5631380 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/1789/2015/5631380.pdf
* https://jarbas.datasciencebr.com/#/document_id/5928875 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/1564/2016/5928875.pdf
* https://jarbas.datasciencebr.com/#/document_id/5768932 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/80/2015/5768932.pdf
* https://jarbas.datasciencebr.com/#/document_id/5962849 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/3052/2016/5962849.pdf
* https://jarbas.datasciencebr.com/#/document_id/5962903 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/3052/2016/5962903.pdf
* https://jarbas.datasciencebr.com/#/document_id/5855221 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/2238/2015/5855221.pdf
* https://jarbas.datasciencebr.com/#/document_id/5856784 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/2238/2015/5856784.pdf
* https://jarbas.datasciencebr.com/#/document_id/5921187 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/2871/2016/5921187.pdf
* https://jarbas.datasciencebr.com/#/document_id/6069360 -> http://www.camara.gov.br/cota-parlamentar/documentos/publ/2935/2016/6069360.pdf

**NOTE**: While this could have been done in Python, it would have taken me a while to get it going, so I kept things as simple as possible with bash since this is just a spike.

# Setup

_Make sure your `config.ini` has the Google APIKey set._

In [1]:
import configparser

settings = configparser.RawConfigParser()
settings.read('../config.ini')

# Write the API key to a temp file so the bash cells below can read it.
with open('/tmp/cloud-vision.key', 'w') as target:
    target.write(settings.get('Google', 'APIKey'))

You'll also need the `pdftoppm` command to convert PDFs to PNGs and `jq` to pretty print the JSON output returned by Google's API.

In the Docker environment provided, you'll need to run `docker exec -u root -ti CONTAINER bash` to have permission to install the packages with:

```
apt-get update && apt-get install -y poppler-utils jq
```

## Download receipts, convert to PNG and OCR them

In [2]:
%%bash

ocr() {
  id="${1}"
  url="${2}"
  mkdir -p "/tmp/reimbursements/${id}"
  cd "/tmp/reimbursements/${id}"
  echo "---> $id"
  echo "     Downloading PDF from '$url'..."
  curl -s "${url}" > "document.pdf"
  echo "     Generating PNGs..."
  pdftoppm -rx 300 -ry 300 -png "document.pdf" page
  
  for img in page*.png; do
    echo "     OCRing ${img}..."
    payload="payload-${img%.*}.json"
    response="response-${img%.*}.json"
    echo -n '{"requests": [ { "features": [ { "type": "TEXT_DETECTION" } ], "image": { "content": "' > $payload
    base64 -w 0 $img >> $payload
    echo -n '" } } ] }' >> $payload
    
    curl -s "https://vision.clients6.google.com/v1/images:annotate?key=$(cat /tmp/cloud-vision.key)&alt=json" \
         --data-binary @$payload \
         -H 'Content-Type: application/json' \
      > $response
  done
}

date
ocr 5631309 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/1789/2015/5631309.pdf'
ocr 5631380 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/1789/2015/5631380.pdf'
ocr 5928875 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/1564/2016/5928875.pdf'
ocr 5768932 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/80/2015/5768932.pdf'
ocr 5962849 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/3052/2016/5962849.pdf'
ocr 5962903 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/3052/2016/5962903.pdf'
ocr 5855221 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/2238/2015/5855221.pdf'
ocr 5856784 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/2238/2015/5856784.pdf'
ocr 5921187 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/2871/2016/5921187.pdf'
ocr 6069360 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/2935/2016/6069360.pdf'
date

Fri Jan  6 13:33:26 UTC 2017
---> 5631309
     Downloading PDF from 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/1789/2015/5631309.pdf'...
     Generating PNGs...
     OCRing page-1.png...
     OCRing page-2.png...
---> 5631380
     Downloading PDF from 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/1789/2015/5631380.pdf'...
     Generating PNGs...
     OCRing page-1.png...
---> 5928875
     Downloading PDF from 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/1564/2016/5928875.pdf'...
     Generating PNGs...
     OCRing page-1.png...
---> 5768932
     Downloading PDF from 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/80/2015/5768932.pdf'...
     Generating PNGs...
     OCRing page-1.png...
     OCRing page-2.png...
     OCRing page-3.png...
     OCRing page-4.png...
     OCRing page-5.png...
     OCRing page-6.png...
---> 5962849
     Downloading PDF from 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/3052/2016/5962849.pdf'.

As we can see, it takes a while (~2 minutes) to process just 10 PDFs on a 60Mb connection. If we ever move forward with this, we should really look into parallelizing it from day 0 and/or sending receipts in batches, which the API supports.
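
A minimal sketch of the batching idea, in Python rather than bash. It builds one JSON payload with several entries in the `requests` array, using the same request shape as the bash function above. The names `build_batch_payload` and `chunked`, and the batch size used in the usage note, are assumptions made for illustration; the API's actual per-request limits should be checked in its documentation.

```python
import base64
import json


def build_batch_payload(images, feature_type="TEXT_DETECTION"):
    """Build one images:annotate payload covering several images.

    `images` is a list of raw image bytes (e.g. the contents of page-1.png).
    """
    requests = [
        {
            "features": [{"type": feature_type}],
            "image": {"content": base64.b64encode(data).decode("ascii")},
        }
        for data in images
    ]
    return json.dumps({"requests": requests})


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each payload could then be sent with a single HTTP POST, for example one per `chunked(pages, 5)` slice, where 5 is a hypothetical batch size rather than a documented limit.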

## Document [5631309](https://jarbas.datasciencebr.com/#/document_id/5631309)

A hotel receipt with lots of text in it. The JSON responses contain a lot of data, so we extract only the first text annotation, which represents the whole text.
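
For reference, the same extraction can be sketched in Python. It assumes the response has already been loaded with `json.load` and has the shape the `jq` filter relies on (`responses[].textAnnotations[0].description`). The function name `full_text` is made up for this example, and returning an empty string for pages with no annotations is a choice of this sketch (`jq` would print `null` instead).

```python
def full_text(response):
    """Return the whole-page text for each entry in a parsed API response."""
    texts = []
    for page in response.get("responses", []):
        annotations = page.get("textAnnotations") or []
        # The first annotation holds the full text; later ones are single words.
        texts.append(annotations[0]["description"] if annotations else "")
    return texts
```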

In [3]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5631309/response-page-1.json
echo '--------------'; echo
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5631309/response-page-2.json

Castro Marques Hoteis LTDA.
Extrato de Conta
None
OLAVO BILAC PINTO
Empresa PARTICULAR
Uh
1001
Nun, Doc
Endereco Class, Fiscal
RUA EAUSTo NUNES vrEIRA, 40 to 801
BELVEDERE
chegada
BELO HORI2ONTENG
30320-590 BRASIL
12/23/201
00 22
Reserva
1820028
Funcionario:ADSILVA.
Emissao: 18/03/2015 og t 35
Partida: 14/03/2015 08
CONTAENCERRADA Hospede: PINTO, OLAVO BLAC
Num. Doc: 4556169968 Designacao: OLAVO BILAC PINTO
Data
origem Documento
Descripao
13/0309 07
OR
Hospede(s)
Empresa
Saldo Usuario
13/03 11:59
cons ego 24200M ssRvzCE RESTAURANT
1310314222
115, ALEX
Coon 69019 RBGTAURANTE WRAND
384.00 ALJ13
DE MELANCIA.
3/0316:00
13/23 15:34
e00 59025 FRIGOBAR.
AGUA COM GAS BICLEVE
310323159
Coos 62034 RESTADRANTE LUGARO
A22, AFSRK.
182,00
upo ins MELANCIA
27, 59
16
CAFE AXPRESSO
cno.. 69 39:xpow SERVICE RESTAURANT
08:02
90.00
xxxxxxxxxxxx. DEPO8rso aNTECIPADO
Resumo do Extrato.
Hospede: PINTO, OLAVoBn
Designacao: OLAVO BILAC PINTO
DEPOSITO ANTECIPA
Valor Tota
FRIGOBAR
RESTAURANTE LUGAN
783,40
on
SER

As we can see, it'd be pretty hard to parse the contents with a trivial "string matching algorithm". My guess is that the OCR gets confused because the receipt is not fully vertical.

## Document [5631380](https://jarbas.datasciencebr.com/#/document_id/5631380)

There are 3 timestamps on the receipt and the OCR extracted 2. The prices of the receipt items were also extracted more or less properly (4.50 and 4.75).

In [4]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5631380/response-page-1.json

PINENIA VERDE Al INEKIOS LTDA
tern. Passageiros atroporto inter tacional,S/H
Setor Sala Eabarque 8. Aeroporto Conflns
Confins Hinas Gerais
BliPJ: 08,060,954/0039-12
8170
312015
CUPOM FISCAL
VL ITEM (R$)
EN COOL
4,505
OTD
4042 Cafe Expresso i 01r
4,758
2 2036 Pao de
Queijo l 12
TOTAL R$
10,00
0,75
DINHEIRO
AFM 4.1,8,4
UHPOSTOS A PROX, LEI 12,24
R$ 2,05
17/D3/2015
Garcom Viviane
9:51 AM
0010
VR 100/2
ontes: 3
SMEDA If ST120
-IF VERSAI: 01,00.05 ECF: 006
Z(CC((XK 17/03/2015 09:53:50
FAE: SN)4100000000001651



## Document [5928875](https://jarbas.datasciencebr.com/#/document_id/5928875) 

The API is obviously not that magical and it can't parse handwritten stuff.

In [5]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5928875/response-page-1.json

EUSEPIOPIILARIAERESTAURANTE NerezinhadeORreira
do Chapéu BA CEP 44850-00t
Rua Antonio Balbino 387- Casa. Centro Telefax (7413853-22
CNPI 07.802.205000 1-30 Inge Estadual 06310772 PP O23857
Nota Fiscal de Venda a Consumidor Série D1 vALIDAATE, 27/092017
ata da Emi
Nome
Ende
Estado
Cida
Unitario Total
Discriminacao das Mercadorias
Quant.
Grarca e Eduva Vitoria Rua Rui Barbos5, na 167 -Munro d Chai
u BA
Total R$
inscricac Estadt
085.420.506 ME
ENPI
It 2.384 0001-23
30 TJ. 50 x 03 Di 023001 a 024500 AID
213007720
Infaz Lue & 2RVED9/2015



## Document [5768932](https://jarbas.datasciencebr.com/#/document_id/5768932) 

A six-page reimbursement document; the most interesting info is on pages 4 and 5. On page 4, we can see the items that make up the meal expenses:

In [6]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5768932/response-page-4.json

3 de 4
http://www.nfe.fazenda.gov.br/portal/consultaImpressao.aspx?tipo...
Dados dos Produtos e Servicos
Unidade
Valor (R$
Qtd.
Num. Descricao
Comercial
3,00
0000
UND
1 AGUA MINERAL S GAS 500 ML
3,00
0000
UND
2 AGUA MINERAL, C GAS 500 ML
0000
UND
32,00
BUFFET DO DIA
5,00
UND
0000
i 4 i SOPA DO DIA
Totais
CMS
Valor do ICMS
Base de Calculo ICMS
Base de Calculo ICMS
Valor do ICMS
ST
Desonerado
0,00
0,00
0,00
0,00
Valor ICMS Substituicao Valor Total dos Produtos Valor do Frete
Valor do Seguro
0,00
0,00
53,00
0,00
Valor Total dos
Outras Despesas
Valor Total da NFe
Valor Total do IPI
Descontos
Acessorias
53,00
0,00
0,00
0,00
Valor Aproximado dos
Valor da COFINS
Valor do PIS
Valor Total do II
Tributos
0,00
0,00
0,00
6.98
Dados do Transporte
Modalidade do Frete
9 Sem Frete
Transportador
Razao Social Nome
CNPJ
RIO PARNAIBA EMPREEND TUR LTDA
04.024. 831/0001-54
Municipio
Inscricao Estadual
Ender eco Completo
UF
20/08/2015 19:09
Page 4/6



And on page 5 we can easily identify the period the person stayed at the hotel (search for `IN` and `OUT`, both followed by a date):

In [7]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5768932/response-page-5.json

4 de 4
http://www.nfe fazenda.gov.br/portal/consultalmpressao.aspx?tipo...
informacoes Adicionais
XSLT v3.1.0
Formato de Impressao DANFE
DANIFE normal, retrato
Informacoes Complementares de Interesse do Contribuinte
Descricao
NOME: JOSE FRANCISCO PAES LANDIM IN
08/2015 OUT 2/08/2015 APTO: 14 Valor aproximado de
Dados de Nota Fiscal Avulsa
CNPJ
Matricula do Funcionario
Reparticao Fiscal do Emitente
Fone Fax
Nome do Funcionario
Numero do Documento Arrecadacao
UF
Data de Emissao do Documento Arrecadacao
Valor Total do Documento Arrecadacao
Data do Pagamento do Documento Arrecadacao
20/08/2015 19:09
Page 5/6



## Document [5962849](https://jarbas.datasciencebr.com/#/document_id/5962849) 

We can parse both timestamps on the receipt

In [8]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5962849/response-page-1.json

DAFERALIHENTOS LTDA
EPP
PIU PIU LANCHES
AV. IROZIMBG MAIA, 2400
B. VILA ITAPURA
CEP: 13.023-0001
TEL
19) 3255-6546
CAMPINAS/SP
IE: 244.496.769.119
OPJ: 01.095. 461/0001-58
3170372016 15:40:49
CUPOM FISCAL
IT
001 00000000000120 DESPESAS /REFEICAO
un K
124.52 T12,00% A
124,52
TOTAL
R$
124 52
CARTAO
124,52
Val Aprox Tributos:R$ 39,96(32,09%)
Fonte:IBPT
ICMS Recolhido Conforme
LC 123/2006
Simples Nacional
31/03/16 23:15 LJ0001 OP000001 CX001 SR094789
Mensagem Nao Programada
Dar umaFramework
Daruna Framework Mensagem Nao Programada
DARUMA AUTOMACAO ACH 2
ECF-IF
VERSA 01.00.00
ECF: 004
Lj:0001
HHHHHHHHHAFDCHEABD
31/03/2016 15:41:04
FAB DRO913BR000000379665



## Document [5962903](https://jarbas.datasciencebr.com/#/document_id/5962903) 

The receipt items, prices and timestamp can all be parsed nicely.

In [9]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5962903/response-page-1.json

Churrascaria Sorriso
Sorriso
CHURRASCARIA SORRISO LTDA EPP
R: Dr Miguel Penteado Nu 953
Campinas SP
(19)32425676
CNPJ: 58.543.539!0001-77 HE: 244313752113
EXTRATO N
002419
DATA: 31/03/2016 13:39:32
CUPOM ISCAL ELETRO NICO SAT
VI, IT R$
001 Coca ks
1 x 4,40 1,87
40
002 Picanha Tro 1 x 89,90
19,96
89,90
2 x 3,50 1,55
003 Cafe
7,00
3,60
004 Agua Gas Prat 1 x 3,60 1,53
005 Salada Croca 1 x 9,50
6,55
29.50
Total Bruto de Itens: R$ 134,40
Acrecimos sobre Subtotal R$ 13,44
TOTAL: R$ 147,84
R$ 147,84
Visa dit
Obrigad
volte sempre
valor
aproximado dos tributos dest
cupom
(conforme lei Red 12.7 41/2012
R$ 31,46
R$ 6,71 Federal R$ 24,75 Estadual Fonte: IBP
SAT 000157053
3516 0358 5435 3900 0177 5900 0157 0530 0241 9947 2085



## Document [5855221](https://jarbas.datasciencebr.com/#/document_id/5855221) 

Here we have both the card receipt and the invoice. The quality of the PDF / images sucks and the API can't do magic.

In [10]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5855221/response-page-1.json

CUM ICA LHUS SP
NP :32.905. 11 (110-77
C00: 323016
85
TE FISCA
COHER.IVPNTE CRE
OU DEBIT)
3230 5
GR SA, A
/NASA-n
2.30
116-77
50, 92
CUO 323015
ion
40
CIJE UM FI SCAL.
0101975
600
ITAL.
Cartao Credit
03.20
it 3prix 2723a6
RI: 3,3 Federal E! 0,00 Estadual
GEN-A 00
THAYNA AUGUST
S12 4
St'FG
4:39:17V
0912101(
0877



## Document [5856784](https://jarbas.datasciencebr.com/#/document_id/5856784) 

Here we have both the card receipt and the invoice, but this time the API can get some timestamps and a bit of the receipt items.

In [11]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5856784/response-page-1.json

RESTAURANTE RECANTO DO DJALMA LTDA.
RECANTO DO DJALMA
ROI UNORTE A
INDIANDPOLIS S/N ZONA RURAL ANORTEI
p:872
Tel:
IE: 904.3 836-42
TNPJ: 08.510 550/0001-25
MANFE NFC-e Documento Auxiliar
No
Nati
UErmite aproveitamento de crédito de ICMS
UN X 18 00 18 UU
UN X S.50 3.50
411 REFR SHVEPPES TONICA
21 50
VALOR TOTAL R$
Valor Pago
ORHA DE PAGAMENTO
infor dos Tributos Totais Incidentes
Lei Federal 12,741/2012)
Nuiero 000177
Serie 001 Enissao 12/11/2015 14:01:56
Via Consumidor
Consulte pela Chave de Acesso e
http://www.fazenda.pr.gov.br/
HAVE DE ACESSO
15 5105 5000 0125 6500 1000 0001 110 0000. 1775..
CONSUMIDOR
CPF: 030.988. 719-46
JOSE CARLOS SILVA
AN. GOVERNADOR PARIGOT DE A 2965 20
VII UNUARAHA PR
Consulta via leitor de QR C
Protocolo de Autorizacao
006602
12/11/2015 14:07:49
Gerence Sistenas
uuu, ence sistem
Con. br
Cielo
IA tLIENTE POS
1/15 14:11
21,5g



## Document [5921187](https://jarbas.datasciencebr.com/#/document_id/5921187) 

The OCR output doesn't make any sense, probably because this receipt is not fully vertical either.

In [12]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/5921187/response-page-1.json

JK RESEN
ROD DE DE COM IE: BR050 POSTOJK
DE PETROLE
KM 013. CAT
sNTREvo IM: 10.090. LOTEAMENTo 972-8
80
JK
OLIDA
CNPJ/CPF consumidor:
2 400 CUPOM 3 183 POLPA FISCAL
4 195 CAPUCCINO NESTLE 1UN
50ONL. 13
11
5 111 PALITO 1UN 006 7895 007:7895 144603216 MENTOS STICK DUO BLACK ICE TOTAL 144293844 MENTOS STICK R$
T1 01107,00x 03T MACA -32
T3 VERDE -3843
oc
03T17,002
ro.
Cat
0418021618 vo
1815
Ap
27,00
onte
BEMATECH FAB MP-4000 TH FI ECF-IF
BE031110 12/02/2016 18: 18,56V



## Document [6069360](https://jarbas.datasciencebr.com/#/document_id/6069360) 

All messed up; the OCR even produced some weird foreign characters, but it did find the timestamp.

In [13]:
%%bash
jq -r '.responses[].textAnnotations[0].description' /tmp/reimbursements/6069360/response-page-1.json

ENTERNAI IONAL MI Al tiMPANY AL INESfaCAU S,h,
AEROPORED INTERNACIONAL
AERSPORFB - SAO PAULO -SP
CEP: 84626-811
CHPJ: 17,314, 329/8005-53
醷7Dé iii alt彰 箱 2f.-CCV溺64g5.........
.... cioi asa555
304,f3
CMPJ/CPF consuainsr: 095,023,023-81
CUP0M FISCAL
ITEM CdoISS DESCRICAG
QiB。W.YLUN3TORS) ST
VE ITEM(8$)
ritata i fl
2 1183 Sucs de Isaater Fi
16.50
3 8885 Csuver tras,Palean22 tos/ 011
13.90
4 3058 Buffet Csapleto 1 011
84,88
5424SOf.Exs Saprese . Lapsaid I
SBBIOFAL RS
...... ......190
ACRESCINB
13,3%
TOTAL R$
143,77
VISA
143,77
01103.20;
TOS APROX
Garcon: Sebastiao
08/08/2016
Caixa,SANDRA DE SORRES PIRES
MESA 28/1
1:56 PM
CJientes: 多
6(26Xs4ye?s??W1p61
SMEDA IF ST288
-IFXERSA:201, 88.85 ECF: 118 LJ: 36
RSPSLP 08/08/2016 13:45:27
FAB: SK031200000000034217
L su SLCDJさ85? L57
,no
ylL
CS $



# Conclusion

Even though not everything can be parsed, more than half of the receipts can have their timestamps extracted, which is a nice data point to have around.
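
As a rough sketch of what that timestamp extraction could look like, the Python snippet below searches OCR text for `dd/mm/yyyy hh:mm[:ss]` patterns such as `17/03/2015 09:53:50` or `20/08/2015 19:09`, both of which appear in the outputs above. The function name and the regular expression are assumptions made for this example. It deliberately ignores other formats seen in the receipts (two-digit years, `9:51 AM` on its own line), so it is a starting point rather than a complete parser.

```python
import re
from datetime import datetime

# Date and time must sit on the same line, separated by spaces or tabs.
TIMESTAMP_RE = re.compile(
    r"\b(\d{2}/\d{2}/\d{4})[ \t]+(\d{1,2}:\d{2}(?::\d{2})?)\b"
)


def extract_timestamps(text):
    """Return every dd/mm/yyyy hh:mm[:ss] timestamp found in `text`."""
    found = []
    for date_part, time_part in TIMESTAMP_RE.findall(text):
        has_seconds = time_part.count(":") == 2
        fmt = "%d/%m/%Y %H:%M:%S" if has_seconds else "%d/%m/%Y %H:%M"
        try:
            found.append(datetime.strptime(f"{date_part} {time_part}", fmt))
        except ValueError:
            # OCR noise can produce impossible dates; skip them.
            continue
    return found
```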

Some ideas for future work:

- Figure out if we can detect that a receipt has been rotated and try to use some image processing to fix it.
- Come up with some pre-analysis of the image to detect "blurriness" so we can potentially skip OCR processing when a new receipt comes in that is not good enough.
- Another nice pre-analysis would be to determine whether a receipt is handwritten so we can flag those and filter them out of other analyses.
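
For the blurriness idea, one common heuristic (not something tested in this notebook) is the variance of the Laplacian: sharp images have strong edges and therefore a high variance, while blurry ones score low. Below is a dependency-free Python sketch that works on a grayscale image given as a list of rows of pixel values. The function names and the default `threshold` are hypothetical and would need tuning against real receipts.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def is_blurry(gray, threshold=100.0):
    """Flag an image as blurry; the threshold is an untested placeholder."""
    return laplacian_variance(gray) < threshold
```

In practice the pixel rows would come from an image library after converting each page to grayscale, and a library implementation of the Laplacian would be much faster on 300 DPI pages.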

**NOTE**: Everyone who registers for Google Cloud gets US$300 to spend in the first 60 days, so we can probably do a lot of tweaking on our code for free before we "get it right".