# Showcasing `serenata-ocr`

As outlined in [a previous notebook](https://github.com/datasciencebr/serenata-de-amor/blob/master/research/develop/2016-12-30-fgrehm-ocr-receipts-with-google-cloud-vision.ipynb), we have an interest in analysing the contents of a receipt. Given the receipts are mostly scanned documents saved as PDFs with no textual information that can be parsed by a computer, we need to rely on OCR tools to get the job done.

[serenata-ocr](https://github.com/fgrehm/serenata-ocr) is an API that takes in Chamber of Deputies reimbursement information and returns the text contained in its receipt. While the project is still in its early days, it improves on the process used to OCR the initial set of 200K receipts, documented [here](https://github.com/datasciencebr/serenata-de-amor/blob/master/docs/receipts-ocr.md). Namely, it deskews receipt images (straightening them), gives Google a hint that the text is in Portuguese, and uses the new [document text detection](https://cloud.google.com/vision/docs/detecting-fulltext) feature provided by Google Cloud Vision.

Picking the same set of 10 receipts used in the previous notebook ([available in the `2017-02-15-receipts-texts.xz` dataset](https://github.com/datasciencebr/serenata-de-amor/blob/master/docs/receipts-ocr.md#available-datasets)) as examples, the idea here is to compare the text returned by `serenata-ocr` with the information obtained by the initial approach used for OCRing receipts:

* http://www.camara.gov.br/cota-parlamentar/documentos/publ/1789/2015/5631309.pdf
* http://www.camara.gov.br/cota-parlamentar/documentos/publ/1789/2015/5631380.pdf
* http://www.camara.gov.br/cota-parlamentar/documentos/publ/1564/2016/5928875.pdf
* http://www.camara.gov.br/cota-parlamentar/documentos/publ/80/2015/5768932.pdf
* http://www.camara.gov.br/cota-parlamentar/documentos/publ/3052/2016/5962849.pdf
* http://www.camara.gov.br/cota-parlamentar/documentos/publ/3052/2016/5962903.pdf
* http://www.camara.gov.br/cota-parlamentar/documentos/publ/2238/2015/5855221.pdf
* http://www.camara.gov.br/cota-parlamentar/documentos/publ/2238/2015/5856784.pdf
* http://www.camara.gov.br/cota-parlamentar/documentos/publ/2871/2016/5921187.pdf
* http://www.camara.gov.br/cota-parlamentar/documentos/publ/2935/2016/6069360.pd

## Processing PDFs

In [1]:
SERENATA_OCR_URL = "https://YOUR_API_ID.execute-api.us-east-1.amazonaws.com/latest/chamber-of-deputies/receipt"

In [2]:
reimbursements = [
    { "applicant_id": 1789, "year": 2015, "document_id": 5631380, "args": ""},
    { "applicant_id": 1564, "year": 2016, "document_id": 5928875, "args": ""},
    { "applicant_id": 3052, "year": 2016, "document_id": 5962849, "args": ""},
    { "applicant_id": 3052, "year": 2016, "document_id": 5962903, "args": ""},
    { "applicant_id": 2238, "year": 2015, "document_id": 5855221, "args": ""},
    { "applicant_id": 2238, "year": 2015, "document_id": 5856784, "args": ""},
    { "applicant_id": 2871, "year": 2016, "document_id": 5921187, "args": ""},
    { "applicant_id": 2935, "year": 2016, "document_id": 6069360, "args": ""},
    # These 2 reimbursements require a smaller density, otherwise the API times out
    { "applicant_id": 80,   "year": 2015, "document_id": 5768932, "args": "density=100" },
    { "applicant_id": 1789, "year": 2015, "document_id": 5631309, "args": "density=175" },
]

In [3]:
import json
import urllib.request

new_texts = {}
for r in reimbursements:
    document_id = r["document_id"]

    print("OCRing", r)
    response = urllib.request.urlopen("{0}/{1}/{2}/{3}?{4}".format(
        SERENATA_OCR_URL, 
        r["applicant_id"], 
        r["year"], 
        r["document_id"],
        r["args"]
    ))
    raw_data = response.read()
    encoding = response.info().get_content_charset('utf8')
    data = json.loads(raw_data.decode(encoding))
    text = data['ocrResponse']['textAnnotations'][0]['description']
    new_texts[document_id] = text

print("DONE")

OCRing {'applicant_id': 1789, 'year': 2015, 'document_id': 5631380, 'args': ''}
OCRing {'applicant_id': 1564, 'year': 2016, 'document_id': 5928875, 'args': ''}
OCRing {'applicant_id': 3052, 'year': 2016, 'document_id': 5962849, 'args': ''}
OCRing {'applicant_id': 3052, 'year': 2016, 'document_id': 5962903, 'args': ''}
OCRing {'applicant_id': 2238, 'year': 2015, 'document_id': 5855221, 'args': ''}
OCRing {'applicant_id': 2238, 'year': 2015, 'document_id': 5856784, 'args': ''}
OCRing {'applicant_id': 2871, 'year': 2016, 'document_id': 5921187, 'args': ''}
OCRing {'applicant_id': 2935, 'year': 2016, 'document_id': 6069360, 'args': ''}
OCRing {'applicant_id': 80, 'year': 2015, 'document_id': 5768932, 'args': 'density=100'}
OCRing {'applicant_id': 1789, 'year': 2015, 'document_id': 5631309, 'args': 'density=175'}
DONE
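
The loop above always appends `?` to the URL, even when `args` is empty, and the smaller densities were picked by hand for the two documents that timed out. A minimal sketch of how that could be made more systematic is shown below. The helper names `build_receipt_url` and `candidate_urls` are hypothetical, the `density` parameter and the values 175 and 100 are taken from the cell above, and the idea of trying them in that order is an assumption rather than something `serenata-ocr` prescribes.

```python
from urllib.parse import urlencode


def build_receipt_url(base_url, applicant_id, year, document_id, **params):
    """Build a receipt URL, appending a query string only when params are given."""
    url = "{0}/{1}/{2}/{3}".format(
        base_url.rstrip("/"), applicant_id, year, document_id
    )
    if params:
        url += "?" + urlencode(params)
    return url


def candidate_urls(base_url, reimbursement, densities=(None, 175, 100)):
    """Yield URLs to try in order: the API default first, then smaller densities.

    The fallback order is an assumption; None means "send no density at all".
    """
    for density in densities:
        params = {} if density is None else {"density": density}
        yield build_receipt_url(
            base_url,
            reimbursement["applicant_id"],
            reimbursement["year"],
            reimbursement["document_id"],
            **params
        )
```

A caller could iterate over `candidate_urls(SERENATA_OCR_URL, r)` and stop at the first request that does not time out, instead of hard-coding `args` per document.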


## Comparing results

The previous batch of OCRed documents is available in the `2017-02-15-receipts-texts` dataset, so we'll load it and compare its texts with the new ones.

In [4]:
import pandas as pd
import numpy as np

from serenata_toolbox.datasets import fetch

# Uncomment on the first run to download the dataset into the data/ directory
# fetch("2017-02-15-receipts-texts.xz", "data")
df = pd.read_csv('data/2017-02-15-receipts-texts.xz', low_memory=False)
df = df[df.document_id.isin(new_texts.keys())]

## Compare each document individually

In [5]:
df['new_text'] = ""

In [6]:
txt_series = pd.Series(new_texts)
df = df.set_index('document_id')
df['new_text'] = txt_series
df = df.reset_index()

In [7]:
df['text'] = df.text.str.replace('\n', ' ')
df['new_text'] = df.new_text.str.replace('\n', ' ')

In [8]:
# From http://code.activestate.com/recipes/302380-formatting-plain-text-into-columns/
import re

LEFT = '<'
RIGHT = '>'
CENTER = '^'

class FormatColumns:
    '''Format some columns of text with constraints on the widths of the
    columns and the alignment of the text inside the columns.
    '''
    def __init__(self, columns, contents, spacer=' | ', retain_newlines=True):
        assert len(columns) == len(contents), \
            'columns and contents must be same length'
        self.columns = columns
        self.num_columns = len(columns)
        self.contents = contents
        self.spacer = spacer
        self.retain_newlines = retain_newlines
        self.positions = [0]*self.num_columns

    def format_line(self, wsre=re.compile(r'\s+')):
        l = []
        data = False
        for i, (width, alignment) in enumerate(self.columns):
            content = self.contents[i]
            col = ''
            while self.positions[i] < len(content):
                word = content[self.positions[i]]
                # if we hit a newline, honor it
                if '\n' in word:
                    # chomp
                    self.positions[i] += 1
                    if self.retain_newlines:
                        break
                    word = word.strip()

                # make sure this word fits
                if col and len(word) + len(col) > width:
                    break

                # no whitespace at start-of-line
                if wsre.match(word) and not col:
                    # chomp
                    self.positions[i] += 1
                    continue

                col += word
                # chomp
                self.positions[i] += 1
            if col:
                data = True
            col = '{:<{}}'.format(col.lstrip(), width)
            l.append(col)

        if data:
            return self.spacer.join(l).rstrip()
        # don't return a blank line
        return ''

    def format(self, splitre=re.compile(r'(\n|\r\n|\r|[ \t]|\S+)')):
        # split the text into words, spaces/tabs and newlines
        for i, content in enumerate(self.contents):
            self.contents[i] = splitre.findall(content)

        # now process line by line
        l = []
        line = self.format_line()
        while line:
            l.append(line)
            line = self.format_line()
        return '\n'.join(l)

    def __str__(self):
        return self.format()
    
def display_comparisson(document_id):
    reimbursement = df[df.document_id == document_id].iloc[0]
    print(FormatColumns(((50, LEFT), (50, LEFT)), [reimbursement.text, reimbursement.new_text]))
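
The side-by-side view is meant for reading by eye. As a rough complement, the two versions of a text could also be compared numerically. The sketch below uses the standard library's `difflib.SequenceMatcher` on word tokens. The function name `token_similarity` is hypothetical, and the score only says how much the two texts differ, not which one is closer to the actual receipt.

```python
from difflib import SequenceMatcher


def token_similarity(old_text, new_text):
    """Return a ratio between 0 and 1 for how similar two texts are, word by word."""
    # Lowercase and split on whitespace so case and line breaks don't count as differences
    old_tokens = old_text.lower().split()
    new_tokens = new_text.lower().split()
    return SequenceMatcher(None, old_tokens, new_tokens).ratio()
```

For a row of `df`, this could be called as `token_similarity(reimbursement.text, reimbursement.new_text)`. A low score flags receipts where the two OCR runs disagree the most and are worth inspecting manually.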

## Document [5631309](http://www.camara.gov.br/cota-parlamentar/documentos/publ/1789/2015/5631309.pdf)

Not only were we able to OCR the _whole_ document (previously it failed on a couple of pages), but the extracted text also makes a bit more sense.

In [9]:
display_comparisson(5631309)

PREFEITURAMUNICIPAL DE POUSO ALEGRE Numero da      | Castro Marques Hoteis LTDA. Extrato de Conta Uh:
NFS-e SECRETARIA DE FAZENDA NOTA FISCAL ELETRONICA | Noner OLAVO DILAC PINTO Empresat PARTICULAR Nuo.
DE SERVICO-NFS e 1708 291293991 3/201 1703/2015    | DOG: Class. Fiscali Endereço:RUA FAUSTO NUNES
17.02.04 POU3O ALEGRE MG soc Nome CASTRO           | VIEIRA, 10 Apto 801 - BELVEDERE BELO HORIZONTE NG
MARQUESHOTERELTDA. POUSO ALEGRE NG                 | 30320-590 BRASIL Chegada 3. 12/03/2014 00:22
ta,623,063/B001-10 E CEP Av. PREF, TUANY TOLEDO,80 | Partida: 14/03/2015 08:04 1001 Renerva: 1820028
FATIMA il cap: 37550010.                           | Ad/Cr1CE2.: 1/970 Funcionário:ADSILVA. Emissão:
ssitvertoeritarquwsplaza.com.br 353122-2020 OLAwo  | 18/03/201S 09:35 CONTA ENCERRADA Hóspede: PINTO,
BILAC PINTO BELOHORUZONTE MG 455 616.996-87 R      | OLAVO BILAC Num. Dor: 4556169968 Designação:OLAVO
FAUSTO NUNESVEIRA AFT 801 40- BELVEDERE CEP:       | BILAC PINTO Origem OR Empresa Dafn 1

## Document [5928875](https://jarbas.datasciencebr.com/#/document_id/5928875) 

As before, the API is not magical and can't parse handwriting, but it came really close to reading the value of the reimbursement (`R$175,00` OCRed as `R$ 27S-OO`).

In [10]:
display_comparisson(5928875)

EUSEPIOPIILARIAERESTAURANTE NerezinhadeORreira do  | essere ATTA EUSÉPIO PIZZARIA E RESTAURANTE
Chapéu BA CEP 44850-00t Rua Antonio Balbino 387-   | Terezinha de Oliveira Garofani - EPP I 4. Arnioni
Casa. Centro Telefax (7413853-22 CNPI              | UMB T Mia! Å ANTE III VOI VALOIALI Rua Antària
07.802.205000 1-30 Inge Estadual 06310772 PP       | Baibiro, 387- CasaCentro - Telef ax (74) 3653-2205
O23857 Nota Fiscal de Venda a Consumidor Série D1  | - Morro do Chapéu - BA • CEP 44850-000 | (CNPJ
vALIDAATE, 27/092017 ata da Emi Nome Ende Estado   | 07.802.205/0001-30] Inse. Estadual 068.107.722 PP
Cida Unitario Total Discriminacao das Mercadorias  | || N°23857 Nota Fiscal de Venda a Consumidor -
Quant. Grarca e Eduva Vitoria Rua Rui Barbos5, na  | Série D1 VÁLIDA ATÉ 27/09/2017 ONSILFF1IdOP Nome
167 -Munro d Chai u BA Total R$ inscricac Estadt   | CalDAugust Data da Emissão Endereço S.SAN. fko.
085.420.506 ME ENPI It 2.384 0001-23 30 TJ. 50 x   | Cidade Estado Quant. Discriminação das

## Document [5768932](https://jarbas.datasciencebr.com/#/document_id/5768932) 

This is a six-page reimbursement document. Once again more text was extracted and, at first sight, the quality is almost the same:

In [11]:
display_comparisson(5768932)

2 de 4 http://www.nfe                              | RECEIOS DE RIO PARVAID IDA OS FRoniscusSTANTES E
fazenda.gov.br/portal/consultaImpressao.aspx?tipo... | NXT:SESLAL INIC A12A AMPLADO TRATA DE RECEBISIEN10
Dados do Emitente Nome Fantasia Nome/Razao Social  | DE. VTIPICACAO E ASSINATURA DO RECERKDOR NF-e N°
LUXOR PIAUI HOTEL RIO PARNAIBA EMPREEND TUR LTDA   | 900.023.391 Série 3 RIO PARNAIBA EMPREEND TUR LTDA
Endereco CNPJ PCA MARECHAL DEODORO, 310 04.024.831 | DANFE Facumciro ALIxıktar da PA MARŁCLJAL DEODORO
/0001-54 Bairro Distrito CEP 64000-160 CENTRO      | N° 310 Nettu Fascal Lletrònica CENTRO -
Telefone Municipio (86)3131-3000 i 221 1001        | TERESINA-E’A 0- LNTRADATI CIJAI. 16 ACESSO C'EP
TERESINA Pais UF 1058 BRASIL inscricao Estadual do | 6400-1600 I SAÉDA 2715 0904 024N 3100 0154 5500
Substituto Tributario Inscricao Estadual 94461394  | 3400 0233 9110 023 3914 FONE (86)3131-3(NICH N'
Municipio da Ocorrencia do Fato Gerador do ICMS    | 000.023.391 SÉRIE 3 Cunsulta de xu

## Document [5962849](https://jarbas.datasciencebr.com/#/document_id/5962849) 

The text quality is similar for the parts we are interested in (timestamps, values and receipt items).

In [12]:
display_comparisson(5962849)

DAFERALIHENTOS LTDA EPP PIU PIU LANCHES AV.        | I E A DAFER ALIMENTOS LTDA - EPP ·PIU PIU LANCHES
IROZIMBG MAIA, 2400 B. VILA ITAPURA CEP:           | AV. DRUZIMBO MAIA, 2400 - B. VILA ITAPURA CEP:
13.023-0001 TEL 19) 3255-6546 CAMPINAS/SP IE:      | 13.023-0001 TEL: (19) 3255-6546 CAMPINAS/SP CNP]:
244.496.769.119 OPJ: 01.095. 461/0001-58           | 01.095.461/0001-58 IE: 244.496.769.119 31/03/2016
3170372016 15:40:49 CUPOM FISCAL IT 001            | 15:40:49 ***CCF: 024299 COO: 028000 CÜPOM FISCAL
00000000000120 DESPESAS /REFEICAO un K 124.52      | ITEM CÓDIGO DESCRIÇÃO QID.UN. VL UNIT R$ STA/T VL
T12,00% A 124,52 TOTAL R$ 124 52 CARTAO 124,52 Val | ITEM R$ 001 00000000000120 DESPESAS /REFEIÇAO un X
Aprox Tributos:R$ 39,96(32,09%) Fonte:IBPT ICMS    | 124,52 T12,00% Á 124,52) TOTAL R$ 124,52 CARTÃO
Recolhido Conforme LC 123/2006 Simples Nacional    | 124,52 Val. Aprox Tributos : R$ 39,96(32,09%)
31/03/16 23:15 LJ0001 OP000001 CX001 SR094789      | Fonte: IBPT ICMS Recolhido Co

## Document [5962903](https://jarbas.datasciencebr.com/#/document_id/5962903) 

More text again, with similar quality.

In [13]:
display_comparisson(5962903)

Churrascaria Sorriso Sorriso CHURRASCARIA SORRISO  | .**: *T* Drivit. | Franë. 2 uit: Prix, {
LTDA EPP R: Dr Miguel Penteado Nu 953 Campinas SP  | Churrascaria Sorriso CHURRASCARIA SORRISO LTDA -
(19)32425676 CNPJ: 58.543.539!0001-77 HE:          | ERP, R: Dr Miguel Penteado - N° 953 Carnpinas - SP
244313752113 EXTRATO N 002419 DATA: 31/03/2016     | - (19)32425676 2443137 - i - | . . . . . . . ' *.
13:39:32 CUPOM ISCAL ELETRO NICO SAT VI, IT R$ 001 | N F * A SÅ * A * * * * V0.06% 0.00/U T 1 w -- k,.
Coca ks 1 x 4,40 1,87 40 002 Picanha Tro 1 x 89,90 | -. -	- und H- - -, u. ka kom aan -- -- -- -- L. --
19,96 89,90 2 x 3,50 1,55 003 Cafe 7,00 3,60 004   | -- - -- -- -- -- -- -- -- -- -- EXTRATO N° 002419
Agua Gas Prat 1 x 3,60 1,53 005 Salada Croca 1 x   | DATA: 31/03/2016 13:39:32 CUPOM FISCAL ELETRÔNICO
9,50 6,55 29.50 Total Bruto de Itens: R$ 134,40    | - SAT -. . ! : * * ... 2. ... -- - - . -. - an u
Acrecimos sobre Subtotal R$ 13,44 TOTAL: R$ 147,84 | i-u. a.. .... .... ...–..- ..

## Document [5856784](https://jarbas.datasciencebr.com/#/document_id/5856784) 

Here we have both the card receipt and the invoice, but this time the API picks up some timestamps and a few more of the receipt items.

In [14]:
display_comparisson(5856784)

RESTAURANTE RECANTO DO DJALMA LTDA. RECANTO DO     | | | LU RESTAURANTE RECANTO DO DJALMA LTDA. RECANTO
DJALMA ROI UNORTE A INDIANDPOLIS S/N ZONA RURAL    | DO DJALMA. ROD CIANORTE A INDIANOPOLIS S/N ZONA
ANORTEI p:872 Tel: IE: 904.3 836-42 TNPJ: 08.510   | RURAL CIANORTE/PR Cep:87200-370 Tel: (CNPJ: 09 510
550/0001-25 MANFE NFC-e Documento Auxiliar No Nati | 550/0001-25IE: 904, 37836-42 TIANFE NFC-e -
UErmite aproveitamento de crédito de ICMS UN X 18  | Documento Auxiliar cu Nola Fiscal Eletrônica para
00 18 UU UN X S.50 3.50 411 REFR SHVEPPES TONICA   | consumidor Final Nato permite aproveitamento de
21 50 VALOR TOTAL R$ Valor Pago ORHA DE PAGAMENTO  | crédito de ICMS #;CODIGDESCRIÇÃO:OTO: UN! M | IN
infor dos Tributos Totais Incidentes Lei Federal   | RM TOTALT UOL TREFEILAD 1 UN I 16 00 18 ž 411
12,741/2012) Nuiero 000177 Serie 001 Enissao       | REFR. LATA - SHWEPPES TONICA 1 UN X 350 350 EXIJA
12/11/2015 14:01:56 Via Consumidor Consulte pela   | O DOCUMENTO FIS COMPROVANTE. NI: 

## Document [5855221](https://jarbas.datasciencebr.com/#/document_id/5855221) 

Here we have both the card receipt and the invoice. The quality of the PDF / images is really poor and the API can't do much magic, but with the changes introduced by `serenata-ocr` we can at least parse some timestamps.

In [15]:
display_comparisson(5855221)

CUM ICA LHUS SP NP :32.905. 11 (110-77 C00: 323016 | GRS/A RODOVIA PELO SHTAÍof S/N ASA-D CUMICA -
85 TE FISCA COHER.IVPNTE CRE OU DEBIT) 3230 5 GR   | GUARULHİS - SP R VIVEN5 : NONE; TREST--COO: 322016
SA, A /NASA-n 2.30 116-77 50, 92 CUO 323015 ion 40 | NO E DOCUMENTO FISCAL COMPROVANTE CRÉDITO OU
CIJE UM FI SCAL. 0101975 600 ITAL. Cartao Credit   | DEBITO CartaÜredito J0 do docento vinculado:
03.20 it 3prix 2723a6 RI: 3,3 Federal E! 0,00      | :Valcr da CORDe FM |alor do pagamento R$$ IDO E
Estadual GEN-A 00 THAYNA AUGUST S12 4 St'FG        | 1aWLA 328015 51,00 51,10 CHR S/A RODOVIA AO
4:39:17V 0912101( 0877                             | SOHIMIDT, S/N ASA-D CUMICA - GUARULHOS - SP P.:
                                                   | 92,905. 110/0! 1677 /117ED15 1: SAN FEINO8)
                                                   | *CO0:328015 NP/CPF consuirica: 03098871946 OCUPOM
                                                   | FISCAL TEH DIGO DESCRIÇÃI ITD. U L UNIT (US$) SI


## Document [5921187](https://jarbas.datasciencebr.com/#/document_id/5921187) 

The OCRed text makes a lot more sense here; deskewing the image seems to help quite a lot.

In [16]:
display_comparisson(5921187)

JK RESEN ROD DE DE COM IE: BR050 POSTOJK DE        | * :*' : * ITEM cao 6o Destancia ao u Vi UNICE) SI
PETROLE KM 013. CAT sNTREvo IM: 10.090. LOTEAMENTo | VL TEN, SO - I. JK RESENDE COM. DE DERVADOS DE
972-8 80 JK OLIDA CNPJ/CPF consumidor: 2 400 CUPOM | PETROLEOLIDA ROD BR 050 KM 289 SIN TREVO
3 183 POLPA FISCAL 4 195 CAPUCCINO NESTLE 1UN      | LOTEAMENTO JK .**** POSTOJK ***
50ONL. 13 11 5 111 PALITO 1UN 006 7895 007:7895    | CNPJ:20.013.876/0001-80 CATALÃO-GOIAS *. IE:10,
144603216 MENTOS STICK DUO BLACK ICE TOTAL         | 090.972-8 IM: 18005001 18/02/2016 18:18: TOVCCF:30
144293844 MENTOS STICK R$ T1 01107,00x 03T MACA    | CNPJ/CPF consumidor: NOME': AO CONSUMIDOR CUPOM
-32 T3 VERDE -3843 oc 03T17,002 ro. Cat 0418021618 | FISCAL CD0:435483 110 EWPADA DE FRANCO TORTT* 2
vo 1815 Ap 27,00 onte BEMATECH FAB MP-4000 TH FI   | 400 SUCO POLPA CON AGUA 500ML 1UN 13 3 183 PAD DE
ECF-IF BE031110 12/02/2016 18: 18,56V              | QUEIJO PALITO UN 11. 4 195 CAPUCCINO NESTLE UN T1
     

## Document [6069360](https://jarbas.datasciencebr.com/#/document_id/6069360) 

The text is all messed up, even with `serenata-ocr`, but it is still _very_ useful.

In [17]:
display_comparisson(6069360)

ENTERNAI IONAL MI Al tiMPANY AL INESfaCAU S,h,     | EM COOTEN INTERNATIONAL NAM. XIMPANY ALIMENTAÇAN
AEROPORED INTERNACIONAL AERSPORFB - SAO PAULO -SP  | 3,4. AEROPORTO INTERNACIONAL DE CONSONANSIJA 3G/37
CEP: 84626-811 CHPJ: 17,314, 329/8005-53 醷7Dé iii  | CNPJ: 17,314./-53 AEROPSININ - SAN PAULO -SP I 18,
alt彰 箱 2f.-CCV溺64g5......... .... cioi asa555      | ANN, 171.18 CEP: 04626-SIL NË/WWW70IG 451 CCF:
304,f3 CMPJ/CPF consuainsr: 095,023,023-81 CUP0M   | ISLAS CNPJ/CPF consvirlar NZ3.078-O CUPOM FISCAL
FISCAL ITEM CdoISS DESCRICAG QiB。W.YLUN3TORS) ST   | WO, W, VNIT(RS) ST | 1031 Nefri Coca Zero - Lata i
VE ITEM(8$) ritata i fl 2 1183 Sucs de Isaater Fi  | F1 DESCRICA CO: ASR555 2 III Sco de Tonate TFI I
16.50 3 8885 Csuver tras,Palean22 tos/ 011 13.90 4 | BONS COVvert Pau Fate VM tona TNT 4 358 Suffet
3058 Buffet Csapleto 1 011 84,88 5424SOf.Exs       | Cugpleta 1 WIT PL VENIRS) 5 4249 CLERP SPEED -
Saprese . Lapsaid I SBBIOFAL RS ...... ......190   | Lapsilia iDIT SUBTOTAL - ACRÉ

## Document [5631380](https://jarbas.datasciencebr.com/#/document_id/5631380)

There are 3 timestamps on the receipt and `serenata-ocr` almost got them all right (compared to only 2 in the past). The receipt item info (description and price), however, was extracted better in the first OCR run.

In [18]:
display_comparisson(5631380)

PINENIA VERDE Al INEKIOS LTDA tern. Passageiros    | ........... Nyen pcp A GİTGTY STIGTIG PIHENTA
atroporto inter tacional,S/H Setor Sala Eabarque   | SERIE ALIMENTOS LTDA ſern. Passageiros 10 iler
8. Aeroporto Conflns Confins Hinas Gerais BliPJ:   | Oporto interacional , SIN Setor: A Sala de
08,060,954/0039-12 8170 312015 CUPOM FISCAL VL     | Enbarque - 8, Aeroporto Confins Confins - Minas
ITEM (R$) EN COOL 4,505 OTD 4042 Cafe Expresso i   | Gerais CHP.J: 08, 169,964/Q139-12 I! OIL. 098503,
01r 4,758 2 2036 Pao de Queijo l 12 TOTAL R$ 10,00 | OQ49 T/03/2015 09:5348 CCF:413170 COS:651399 CUPOM
0,75 DINHEIRO AFM 4.1,8,4 UHPOSTOS A PROX, LEI     | FISCAL ITEN COOIBO DESCRICKO QTD. . VL UNIT(US) VL
12,24 R$ 2,05 17/D3/2015 Garcom Viviane 9:51 AM    | (REK(RS) | 4042 Cafe Expresso I OIL 2 12038 Pao de
0010 VR 100/2 ontes: 3 SMEDA If ST120 -IF VERSAI:  | Quai įo 1 021 TOTAL R$ DINHEIRO 10,00 TROCO R$
01,00.05 ECF: 006 Z(CC((XK 17/03/2015 09:53:50     | OITIA, ON 02108,401
FAE: SN)41000000

## Conclusion and next steps

- In general, the changes introduced by `serenata-ocr` seem to have paid off.
- We can't really tell whether the improvements come mostly from deskewing the images or from the more expensive `DOCUMENT_TEXT_DETECTION` functionality of Google Cloud Vision.
- Given this is never going to be a perfect process, we should consider keeping multiple versions of the text of a receipt, for example by tweaking the preprocessing and using different OCR providers.
- I'll run a few more experiments so we can get the best results before moving on to the next batch of OCR for recent data.
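
To illustrate what keeping multiple versions of a text could look like, here is a minimal sketch that pulls dates, times and monetary values out of each version and keeps the union. The regular expressions and the names `extract_fields` and `merge_versions` are assumptions made for this example and are not part of `serenata-ocr`. They only cover the `dd/mm/yy(yy)`, `hh:mm(:ss)` and `1.234,56` shapes seen in the receipts above.

```python
import re

# Simple patterns for the Brazilian formats seen in the receipts above (assumed, not exhaustive)
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{2,4}\b")
TIME_RE = re.compile(r"\b\d{2}:\d{2}(?::\d{2})?\b")
# Matches values such as 124,52 or 1.234,56; note that it also matches the number in "12,00%"
VALUE_RE = re.compile(r"\b\d{1,3}(?:\.\d{3})*,\d{2}\b")


def extract_fields(text):
    """Return the sets of dates, times and monetary values found in one OCRed text."""
    return {
        "dates": set(DATE_RE.findall(text)),
        "times": set(TIME_RE.findall(text)),
        "values": set(VALUE_RE.findall(text)),
    }


def merge_versions(*texts):
    """Union the fields extracted from several OCR versions of the same receipt."""
    merged = {"dates": set(), "times": set(), "values": set()}
    for text in texts:
        for key, found in extract_fields(text).items():
            merged[key] |= found
    return merged
```

Calling `merge_versions(reimbursement.text, reimbursement.new_text)` would keep a timestamp or value that only one of the OCR runs managed to read, which is the main appeal of storing more than one version.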