In [1]:
import re

from IPython.display import HTML
import pandas as pd
import numpy as np

from serenata_toolbox.datasets import fetch

fetch("2017-02-15-receipts-texts.xz", "../data")
fetch("2016-12-06-reimbursements.xz", "../data")

def report(df):
    df = df.copy()
    df['receipt'] = df.apply(link_to_receipt, axis=1)
    df['document_id'] = df.apply(link_to_jarbas, axis=1)
    cols = ['document_id', 'receipt', 'issue_date', 'congressperson_name', 'total_net_value', 'supplier']
    return HTML(df[cols].to_html(escape=False))

def link_to_jarbas(r):
    return '<a target="_blank" href="http://jarbas.datasciencebr.com/#/document_id/{0}">{0}</a>'.format(r.document_id)

DOCUMENT_URL = (
    'http://www.camara.gov.br/'
    'cota-parlamentar/documentos/publ/{}/{}/{}.pdf'
)
def link_to_receipt(r):
    url = DOCUMENT_URL.format(r.applicant_id, r.year, r.document_id)
    return '<a target="_blank" href="{0}">RECEIPT</a>'.format(url)

pd.set_option('display.max_colwidth', 1500)

In [2]:
texts = pd.read_csv('../data/2017-02-15-receipts-texts.xz', dtype={'text': str}, low_memory=False)
texts['text'] = texts.text.str.upper()

reimbursements = pd.read_csv('../data/2016-12-06-reimbursements.xz', low_memory=False)
reimbursements = reimbursements.query('(subquota_description == "Congressperson meal") & (year >= 2015)')

data = texts.merge(reimbursements, on='document_id')

In [3]:
print("Total meal reimbursements from 2015:", len(reimbursements))
print("Total meal reimbursements from 2015 that were OCRed:", len(data))
print("Total meal reimbursements from 2015 that were OCRed and have text:", len(data[~data.text.isnull()]))
data = data[~data.text.isnull()]

Total meal reimbursements from 2015: 57460
Total meal reimbursements from 2015 that were OCRed: 56715
Total meal reimbursements from 2015 that were OCRed and have text: 56710


Some of those reimbursements have the remark value set, which "discards" the amount related to alcoholic beverages. To make this initial analysis easier, we focus on those that do not have that value set.

In [4]:
data = data[data.remark_value == 0]
len(data)

54274

## Reimbursements that have Brazilian beer names in them

In [5]:
report(data[data.text.str.contains('SKOL')])

Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
1762,5606117,RECEIPT,2015-02-13T00:00:00,DANILO FORTE,122.0,G M FREITAS MICROEMPRESA
2012,5608809,RECEIPT,2015-02-14T00:00:00,CAPITÃO AUGUSTO,67.19,Eizo Iwano - ME
14542,5709510,RECEIPT,2015-06-13T00:00:00,JOÃO MARCELO SOUZA,138.5,CAFÉ SAVANA LANCHES LTDA
23706,5796726,RECEIPT,2015-09-17T00:00:00,ADEMIR CAMILO,56.6,CASA BOA PIZZARIA LTDA-ME
26951,5825885,RECEIPT,2015-08-24T00:00:00,JOSÉ MENTOR,23.0,vrg linhas aéreas s/a
28635,5841347,RECEIPT,2015-11-08T00:00:00,JOÃO MARCELO SOUZA,139.7,RESTAURANTE CABANA VIP
31215,5866804,RECEIPT,2015-12-06T00:00:00,ADEMIR CAMILO,65.58,LO PAMPAS RESTAURANTE E CHURRASCARIA LTDA
31401,5868418,RECEIPT,2015-12-06T00:00:00,ADEMIR CAMILO,14.0,F&L EMPREENDIMENTOS COMERCIAS LTDA
44364,6016538,RECEIPT,2016-06-04T00:00:00,MAIA FILHO,32.74,J E ALIMENTAÇÕES LTDA EPP


Only one of those reimbursements had an issue. It looks like the others already disregarded the beer amounts even though the remark value is zero.

In [6]:
report(data[data.text.str.contains('BOHEMIA')])

Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
2012,5608809,RECEIPT,2015-02-14T00:00:00,CAPITÃO AUGUSTO,67.19,Eizo Iwano - ME
28133,5835958,RECEIPT,2015-10-31T00:00:00,CAPITÃO AUGUSTO,90.4,LA PARRILA


Both had the beers "deducted" even though there's no remark value

In [7]:
report(data[data.text.str.contains('BRAHMA')])

Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
20696,5768462,RECEIPT,2015-08-20T00:00:00,FRANKLIN LIMA,64.0,BAR E RESTAURANTE MONUMENTAL LTDA-EPP
22804,5787990,RECEIPT,2015-08-02T00:00:00,JOSÉ MENTOR,83.35,ESFIHA IMIGRANTES
38429,5958726,RECEIPT,2016-04-01T00:00:00,WILLIAM WOO,100.0,MADERO ITAIM
46793,6041827,RECEIPT,2016-06-05T00:00:00,ARLINDO CHINAGLIA,84.59,RESTAURANTE AEROPORTO
54865,6131562,RECEIPT,2016-10-24T00:00:00,JOÃO RODRIGUES,40.7,QCB LANCHONETE LTDA EPP


Two of those had beers in them, one had the beer name in the restaurant name, and the others disregarded the beers.

## Reimbursements that have foreign beer names in them

In [8]:
report(data[data.text.str.contains('HEINEKEN')])

Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
2493,5612864,RECEIPT,2015-02-26T00:00:00,MAJOR OLIMPIO,43.5,BARU RESTAURANTE LTDA.
6381,5643971,RECEIPT,2015-02-28T00:00:00,LAUDIVIO CARVALHO,22.02,COZINHA DA ROÇA RESTAURANTE LTDA.
12541,5692453,RECEIPT,2015-05-21T00:00:00,NILTO TATTO,39.0,Koni Store
19734,5760438,RECEIPT,2015-08-09T00:00:00,JORGE SOLLA,136.9,ALMANARA RESTAURANTES E LANCHONETES LTDA
20769,5768908,RECEIPT,2015-08-17T00:00:00,MARIA HELENA,319.67,MET BACK BAY
21355,5774235,RECEIPT,2015-08-24T00:00:00,ADEMIR CAMILO,13.15,MOMO CONFEITARIA LTDA
23771,5797432,RECEIPT,2015-09-19T00:00:00,MAJOR OLIMPIO,96.3,VILLA CAETANO´S BAR LTDA
25227,5810343,RECEIPT,2015-10-04T00:00:00,MAJOR OLIMPIO,119.9,GRILL 688 RESTAURANTE LTDA - EPP
25326,5811427,RECEIPT,2015-10-01T00:00:00,JOSE STÉDILE,29.9,GIRAFFAS
25625,5814394,RECEIPT,2015-10-01T00:00:00,GIOVANI CHERINI,14.0,CPQ BRASIL


Out of nearly 40 reimbursements, 3 had beers in them, and one of those was already reported.

In [9]:
report(data[data.text.str.contains('BUDWEISER')])

Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
20771,5768916,RECEIPT,2015-08-16T00:00:00,MARIA HELENA,204.15,LEGAL SEA FOODS
33216,5900913,RECEIPT,2016-01-27T00:00:00,ADEMIR CAMILO,16.9,DUARTE E AMARINS LTDE -ME
35140,5924288,RECEIPT,2016-02-27T00:00:00,ROBERTO FREIRE,69.6,FRITZ COMÉRCIO DE ALIMENTOS E RESTAURANTE LTDA
43445,6007201,RECEIPT,2016-05-29T00:00:00,MARIA HELENA,115.7,CHICAGO PRIME ALIMENTOS EIRELI EPP


Only one included the beer without a remark value, and it was already reported.

In [10]:
report(data[data.text.str.contains('STELLA ARTOIS')])

Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
4259,5628985,RECEIPT,2015-02-06T00:00:00,HIRAN GONÇALVES,70.0,Porto Alegre Restaurante - EIRELI EPP
6375,5643927,RECEIPT,2015-03-21T00:00:00,VANDERLEI MACRIS,114.18,DUSHI RESTAURANTE LTDA. EPP
20765,5768885,RECEIPT,2015-08-08T00:00:00,MARIA HELENA,333.18,SARDI'S RESTAURANT
29911,5853749,RECEIPT,2015-11-06T00:00:00,ADEMIR CAMILO,10.07,MOMO CONFEITARIA LTDA
29913,5853770,RECEIPT,2015-11-09T00:00:00,ADEMIR CAMILO,15.7,MOMO CONFEITARIA LTDA
31347,5867982,RECEIPT,2015-12-06T00:00:00,JORGE SOLLA,56.31,PONTO CERTO COM. DE CEREAIS LTDA
35716,5930546,RECEIPT,2016-03-02T00:00:00,AFONSO MOTTA,34.0,VICTROLA COMERCIO DE ALIMENTOS E BEBIDAS LTDA
40937,5982801,RECEIPT,2016-04-19T00:00:00,ADEMIR CAMILO,18.0,Cafeteria Ana Banana Ltda - ME
42458,5997877,RECEIPT,2016-05-11T00:00:00,MAJOR OLIMPIO,64.9,LA Hotels Empreendimentos Ltda.
49309,6069918,RECEIPT,2016-08-05T00:00:00,SEVERINO NINHO,53.99,MARIA DE SALGADO EVENTOS GASTRONÔMICOS LTDA


None of them had issues

## Improving the algorithm

As we can see, lots of the reimbursements above did not have issues. It seems that, most of the time, whoever submitted the receipt already did the math and removed the beer amount from the reimbursement request. One idea for improving accuracy is to look for the document's total net value within the text of the receipt itself: if it is there and the receipt lists an alcoholic beverage, everything bought was considered when the congressperson got reimbursed. Please note that this is a bad strategy for expenses made abroad because of currency conversion.

In order to make things simpler, let's focus on reimbursements that are less than R$ 1.000,00. This is going to make the regular expression I'll use for matching easier on our eyes.
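The idea can be illustrated on a made-up receipt text before touching the real data. This is only a sketch: `amount_pattern` and `receipt_text` are hypothetical names, and the pattern is deliberately a bit stricter than strictly needed (it zero-pads the cents and refuses to match inside a longer number).

```python
import re

def amount_pattern(value):
    """Build a regex matching `value` written with ',' or '.' as decimal separator."""
    # round() guards against float artefacts when scaling the value to cents
    cents = int(round(value * 100))
    whole, frac = divmod(cents, 100)
    # (?<!\d) and (?!\d) stop "17,00" from matching inside "117,00" or "17,005"
    return re.compile(r'(?<!\d){}[.,]\s*{:02d}(?!\d)'.format(whole, frac))

# Made-up OCR output, for illustration only
receipt_text = "1 FILE 98,00\n1 CERVEJA 19,00\nTOTAL R$ 117, 00"

print(bool(amount_pattern(117.0).search(receipt_text)))  # the total appears in the text
print(bool(amount_pattern(98.0).search(receipt_text)))   # the dish price appears too
print(bool(amount_pattern(17.0).search(receipt_text)))   # no standalone "17,00"
```

If the reimbursed value equals the receipt total and the receipt lists a beer, the beer was probably reimbursed; if it equals the dish price only, it probably was not.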

In [11]:
data = data.query('total_net_value < 1000')

In [12]:
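def format_regex(val):
    # Integer part and cents of the value, e.g. 138.5 -> 138 and 50
    integer_part = int(val)
    decimal = int((val * 100) % 100)
    if decimal == 0:
        decimal = '00'
    # Accept "," or "." as decimal separator, with optional whitespace left by the OCR
    return '|'.join([
        r'{},\s*{}'.format(integer_part, decimal),
        r'{}\.\s*{}'.format(integer_part, decimal)
    ])

def receipt_matches_net_value(r):
    return re.search(format_regex(r.total_net_value), r.text) is not None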

data = data[data.apply(receipt_matches_net_value, axis=1)]
len(data)

23085

Almost half matched. Now we take the list of alcoholic beverages [put together by Irio Musskopf](https://github.com/datasciencebr/serenata-de-amor/blob/fb93f96e334c46f98eea3d4a9db565b8bc6bb45b/develop/2016-12-16-irio-alcohol-expenses.ipynb) and search for the suspicious ones.

In [13]:
keywords = [
    'beer',
    'brandy',
    'cachaca',
    'cachaça',
    'cerveja',
    'champagne',
    'chope',
    'chopp',
    'conhaque',
    'gim',
    'gin',
    'liqueur',
    'pint',
    'rum',
    'tequila',
    'vinho',
    'vodka',
    'whiskey',
    'whisky',
    'wine',
    'Albarino',
    'Barbera',
    'Bonarda',
    'Cabernet Franc',
    'Cabernet Sauvignon',
    'Chardonnay',
    'Chenin Blanc',
    'Garnacha',
    'Gewurztraminer',
    'Grenache',
    'Malbec',
    'Merlot',
    'Moscato',
    'Nebbiolo',
    'Palomino',
    'Pinot Grigio',
    'Pinot Noir',
    'Pinotage',
    'Riesling',
    'Sangiovese',
    'Sauvignon Blanc',
    'Shiraz',
    'Sylvaner',
    'Syrah',
    'Tempranillo',
    'Viognier',
    'Zinfandel',
    'Aquitania Sol',
    'Beringer',
    'Blossom hill',
    'Casa Marín',
    'Casa Postal',
    'Casas Del Bosque',
    'Concha Y Toro',
    'Coronas',
    'Gallo',
    'Hardys',
    'House Malmau',
    'Jacobs Creek',
    'Lindemans',
    'Mil Piedras',
    'Ochotierras',
    'Paula Laureano',
    'Sutter Home',
    'Trumpeter',
    'Vila Regia',
    'Vinzelo',
    'Weinert',
    'Yellow tail',
    'Antarctica',
    'Antartica',
    'Becks',
    'Bohemia',
    'Brahma',
    'Bucanero',
    'Bud Light',
    'Budweiser',
    'Caracu',
    'Coors Light',
    'Corona',
    'Devassa',
    'Franziskaner',
    'Guiness',
    'Harbin',
    'Heineken',
    'Hertog Jan',
    'Hoegaarden',
    'Itaipava',
    'Kaiser',
    'Leffe',
    'Lowenbrau',
    'Miller Light',
    'Nortena',
    'Nortenã',
    'Nova Schin',
    'Polar',
    'Quilmes',
    'Serramalte',
    'Skol',
    'Stella Artois',
    'Yanjing',
    'Absolut',
    'Balalaika',
    'Blue Spirit Unique',
    'Hangar One',
    'Imperia',
    'Jean Mark XO',
    'Kadov',
    'Komaroff',
    'Kovak',
    'Leonoff',
    'Moscowita',
    'Natasha',
    'Orloff',
    'Roth California',
    'Skyy90',
    'Smmirnoff',
    'Ultimat',
    'Xellent',
    'Zvonka Dubar',
    'Ardbeg',
    'Bagpiper',
    'Ballantine’s',
    'Ballantines',
    'Bushmills',
    'Campari',
    'Chivas',
    'Forty Creek',
    'Glenlivet',
    'Glenmorangie',
    'Imperial Blue',
    'Jack Daniel’s',
    'Jack Daniels',
    'Jameson',
    'Johnnie Walker',
    'McDowell',
    'McDowell’s',
    'Old Tavern',
    'Royal Salute',
    'Royal Stag',
    'Wild Turkey',
    'Casa Noble Reposado',
    'Casamigos',
    'Don Julio Blanco',
    'Patron Silver',
]
keywords_up = map(str.upper, keywords)
keywords_regex = '|'.join(keywords_up)

suspicious = data[data.text.str.contains(keywords_regex)]

In [14]:
print(len(suspicious))
report(suspicious.head(10))

2682


Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
18,5579973,RECEIPT,2015-01-04T00:00:00,GERALDO THADEU,46.82,UNIQUE ASA SUL
31,5580375,RECEIPT,2015-01-04T00:00:00,MARCON,26.5,CHURRASCARIA DO GRINGO
42,5580660,RECEIPT,2015-01-03T00:00:00,FRANCISCO TENÓRIO,111.0,ORLANDO DE ZORZI ME
79,5582480,RECEIPT,2015-01-08T00:00:00,JORGE BITTAR,48.89,PADOVANO BUFFET E EVENTOS LTDA
86,5582754,RECEIPT,2015-01-05T00:00:00,EMILIANO JOSÉ,115.5,PASTA EM CASA COMERCIO DE ALIMENTOS LTDA-ME
98,5583506,RECEIPT,2014-10-16T00:00:00,EDMAR ARRUDA,20.85,PANELINHAS DO BRASIL
124,5584139,RECEIPT,2015-01-06T00:00:00,JOSÉ AIRTON CIRILO,87.5,MM RESTAUTANTE
139,5584658,RECEIPT,2015-01-13T00:00:00,DENILSON TEIXEIRA,21.71,Krabi Express Restaurante Ltda
174,5585407,RECEIPT,2015-01-06T00:00:00,LUIS CARLOS HEINZE,20.5,ARMAZÉM DO SABOR
180,5585624,RECEIPT,2015-01-01T00:00:00,DR. UBIALI,12.7,RODOSNACK TURMALINA LANCH. E REST. LTDA.


None of the reimbursements above have alcoholic beverages in them. Let's look at what matched.

In [15]:
suspicious.text.apply(lambda x: re.findall(keywords_regex, x)).head(10)

18          [RUM]
31          [RUM]
42          [GIN]
79          [RUM]
86          [GIN]
98          [RUM]
124    [RUM, RUM]
139         [RUM]
174    [GIN, GIN]
180         [RUM]
Name: text, dtype: object

Looks like `RUM` and `GIN` are not good words for this regex because they match inside longer words. Let's wrap the keywords with whitespace separators.

In [16]:
def wrap(s):
    return r'\s{}\s'.format(s.upper())
k = map(wrap, keywords)

keywords_regex = '|'.join(k)
keywords_regex

suspicious = data[data.text.str.contains(keywords_regex)]
print(len(suspicious))

142
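The drop from 2682 to 142 comes from no longer matching keywords inside longer words. The sketch below shows the effect on a made-up string. `bare`, `wrapped` and `bounded` are hypothetical helper names, and the `\b` variant is an alternative that is not used in this notebook; it is shown only because it also matches at the very start or end of the text, where there is no surrounding whitespace.

```python
import re

# Made-up OCR text, for illustration only
text = "FORUM DOS SABORES\nVIRGINIA LANCHES\n1 GIN TONICA 25,00\nRUM"

def bare(keyword):
    return re.escape(keyword.upper())

def wrapped(keyword):
    # same idea as the whitespace wrapping used in the cell above
    return r'\s{}\s'.format(re.escape(keyword.upper()))

def bounded(keyword):
    # assumption: word boundaries as an alternative to whitespace
    return r'\b{}\b'.format(re.escape(keyword.upper()))

for build in (bare, wrapped, bounded):
    pattern = '|'.join(build(k) for k in ['rum', 'gin'])
    print(build.__name__, re.findall(pattern, text))
```

The bare pattern hits `FORUM` and `VIRGINIA`, the whitespace version keeps only ` GIN `, and the word-boundary version also picks up the trailing `RUM`.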


In [17]:
report(suspicious.head(10))

Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
174,5585407,RECEIPT,2015-01-06T00:00:00,LUIS CARLOS HEINZE,20.5,ARMAZÉM DO SABOR
331,5588441,RECEIPT,2015-01-16T00:00:00,NILDA GONDIM,55.72,DAIKON RESTAURANTE LTDA
590,5592790,RECEIPT,2015-01-28T00:00:00,GILBERTO COUTINHO,129.3,SENAC
618,5593102,RECEIPT,2015-01-28T00:00:00,WELLINGTON SALGADO,86.2,SENAC
2736,5615980,RECEIPT,2015-02-27T00:00:00,MARCON,16.5,M GRAF & CIA LTDA
2902,5617414,RECEIPT,2015-02-19T00:00:00,POMPEO DE MATTOS,24.0,NEURI ANGELO ZAMBIASI
3827,5625288,RECEIPT,2015-02-25T00:00:00,JAIR BOLSONARO,31.69,ASSOCIAÇÃO DOS SERVIDORES DA CÂMARA DOS DEPUTADOS
4077,5627118,RECEIPT,2015-03-12T00:00:00,EDINHO BEZ,117.0,Armazem do Ferreira Bar e Restaurante Ltda
4094,5627412,RECEIPT,2015-03-12T00:00:00,RUBENS PEREIRA JÚNIOR,109.0,RPS BAR E RESTAURANTE LTDA
4930,5633336,RECEIPT,2015-03-14T00:00:00,ALUISIO MENDES,68.2,LUZEIROS HOTÉIS S/A


Again, none of the reimbursements above have alcoholic beverages in them. Let's look at what matched.

In [18]:
suspicious.text.apply(lambda x: re.findall(keywords_regex, x)).head(10)

174                         [ GIN\n]
331     [ ANTARCTICA ,  ANTARCTICA ]
590                        [ VINHO ]
618                        [ VINHO ]
2736                       [\nGIN\n]
2902                        [ GIN\n]
3827                   [ ANTARTICA ]
4077                       [ CHOPP ]
4094                   [ ANTARTICA ]
4930                   [ ANTARTICA ]
Name: text, dtype: object

- `GIN` is a really bad keyword because it can match badly OCRed strings
- `ANTARCTICA` can match with sodas / guaranás
- `VINHO` is bad because it can match with dishes like "Filé ao molho de vinho"
- The one that has `CHOPP` does not have a problem because the draft beer was not refunded, [the R\$117,00 net value of the reimbursement matched with the price of the dish that the congressperson ordered](http://www.camara.gov.br/cota-parlamentar/documentos/publ/1005/2015/5627118.pdf)

Let's look at the bottom of the dataset

In [19]:
report(suspicious.tail(10))

Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
52523,6106229,RECEIPT,2016-09-21T00:00:00,FLAVINHO,14.0,ANSERVE COMÉRCIO DE BEBIDAS E ALIMENTOS LTDA
53001,6112296,RECEIPT,2016-09-27T00:00:00,ZECA DIRCEU,26.4,Soppen Beer CWB
54177,6124561,RECEIPT,2016-10-17T00:00:00,CELSO RUSSOMANNO,24.2,SERVIÇO NACIONAL DE APRENDIZAGEM COMERCIAL - SENAC
54182,6124674,RECEIPT,2016-10-12T00:00:00,RUBENS BUENO,157.3,LA BODEGUITA RESTAURANTE
55021,6134007,RECEIPT,2016-10-21T00:00:00,ALUISIO MENDES,52.8,LUZEIROS HOTÉIS S/A
55067,6134798,RECEIPT,2016-10-28T00:00:00,VICENTE CANDIDO,141.35,CANTINA BELLA DONNA
55413,6139471,RECEIPT,2016-10-27T00:00:00,AFONSO HAMM,19.75,JP SANTA LUCIA COMERCIO DE COMBUSTIVEIS LTDA
55880,6145079,RECEIPT,2016-10-26T00:00:00,BOHN GASS,41.15,AG SUCOS E PETISCOS LTDA
56014,6146588,RECEIPT,2016-11-16T00:00:00,OSMAR BERTOLDI,46.1,MARIETTA COMERCIO DE ALIMENTOS LTDA
56707,6157349,RECEIPT,2016-11-24T00:00:00,RENZO BRAZ,60.28,RESTAURANTE PARAISO DA SERRA SOBERBO EIRELI


No bad reimbursements. Let's look at the matches one more time.

In [20]:
suspicious.text.apply(lambda x: re.findall(keywords_regex, x)).tail(10)

52523          [ GIN\n]
53001          [ BEER ]
54177    [ ANTARCTICA ]
54182        [\nCHOPP ]
55021     [ ANTARTICA ]
55067     [ ANTARTICA ]
55413     [ ANTARTICA ]
55880          [ GIN\n]
56014     [ ANTARTICA ]
56707     [ ANTARTICA ]
Name: text, dtype: object
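Reading the matches row by row works for ten rows, but a quick tally makes the noisy keywords stand out. The following is a minimal sketch on made-up texts using only the standard library; `texts` is a hypothetical stand-in for the contents of `suspicious.text`, and the three keywords are just a sample of the full list.

```python
import re
from collections import Counter

# Made-up stand-in for the OCRed receipt texts
texts = [
    "1 GUARANA ANTARTICA 5,00",
    "AGUA ANTARTICA \n TOTAL 14,00",
    "FILE AO MOLHO DE VINHO 55,00",
    "2 CHOPP 18,00",
]

# One capturing group, so findall returns the keyword without the whitespace
pattern = re.compile(r'\s(ANTARTICA|VINHO|CHOPP)\s')

counts = Counter(match for text in texts for match in pattern.findall(text))
for keyword, n in counts.most_common():
    print(keyword, n)
```

Keywords that dominate the tally while rarely pointing to an actual drink are good candidates for removal.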

## Can we get to 100% accuracy?

Probably not, but let's try using some less ambiguous drinks

In [21]:
keywords = [
    'cachaca',
    'cachaça',
    'cerveja',
    'champagne',
    'chope',
    'chopp',
    'conhaque',
    'liqueur',
    'tequila',
    'vodka',
    'whiskey',
    'whisky',
    'wine',
    'Albarino',
    'Barbera',
    'Bonarda',
    'Cabernet Franc',
    'Cabernet Sauvignon',
    'Chardonnay',
    'Chenin Blanc',
    'Garnacha',
    'Gewurztraminer',
    'Grenache',
    'Malbec',
    'Merlot',
    'Moscato',
    'Nebbiolo',
    'Palomino',
    'Pinot Grigio',
    'Pinot Noir',
    'Pinotage',
    'Riesling',
    'Sangiovese',
    'Sauvignon Blanc',
    'Shiraz',
    'Sylvaner',
    'Syrah',
    'Tempranillo',
    'Viognier',
    'Zinfandel',
    'Aquitania Sol',
    'Beringer',
    'Blossom hill',
    'Casa Marín',
    'Casa Postal',
    'Casas Del Bosque',
    'Concha Y Toro',
    'Coronas',
    'Gallo',
    'Hardys',
    'House Malmau',
    'Jacobs Creek',
    'Lindemans',
    'Mil Piedras',
    'Ochotierras',
    'Paula Laureano',
    'Sutter Home',
    'Trumpeter',
    'Vila Regia',
    'Vinzelo',
    'Weinert',
    'Yellow tail',
    'Becks',
    'Bohemia',
    'Brahma',
    'Bucanero',
    'Bud Light',
    'Budweiser',
    'Caracu',
    'Coors Light',
    'Corona',
    'Devassa',
    'Franziskaner',
    'Guiness',
    'Harbin',
    'Heineken',
    'Hertog Jan',
    'Hoegaarden',
    'Itaipava',
    'Kaiser',
    'Leffe',
    'Lowenbrau',
    'Miller Light',
    'Nortena',
    'Nortenã',
    'Nova Schin',
    'Polar',
    'Quilmes',
    'Serramalte',
    'Skol',
    'Stella Artois',
    'Yanjing',
    'Absolut',
    'Balalaika',
    'Blue Spirit Unique',
    'Hangar One',
    'Imperia',
    'Jean Mark XO',
    'Kadov',
    'Komaroff',
    'Kovak',
    'Leonoff',
    'Moscowita',
    'Natasha',
    'Orloff',
    'Roth California',
    'Skyy90',
    'Smmirnoff',
    'Ultimat',
    'Xellent',
    'Zvonka Dubar',
    'Ardbeg',
    'Bagpiper',
    'Ballantine’s',
    'Ballantines',
    'Bushmills',
    'Campari',
    'Chivas',
    'Forty Creek',
    'Glenlivet',
    'Glenmorangie',
    'Imperial Blue',
    'Jack Daniel’s',
    'Jack Daniels',
    'Jameson',
    'Johnnie Walker',
    'McDowell',
    'McDowell’s',
    'Old Tavern',
    'Royal Salute',
    'Royal Stag',
    'Wild Turkey',
    'Casa Noble Reposado',
    'Casamigos',
    'Don Julio Blanco',
    'Patron Silver',
]
def wrap(s):
    return r'\s{}\s'.format(s.upper())
k = map(wrap, keywords)
keywords_regex = '|'.join(k)
keywords_regex

suspicious = data[data.text.str.contains(keywords_regex)]
print(len(suspicious))

34


In [22]:
report(suspicious)

Unnamed: 0,document_id,receipt,issue_date,congressperson_name,total_net_value,supplier
4077,5627118,RECEIPT,2015-03-12T00:00:00,EDINHO BEZ,117.0,Armazem do Ferreira Bar e Restaurante Ltda
5901,5640980,RECEIPT,2015-03-13T00:00:00,JÉSSICA SALES,50.8,PATRONI PIZZA
6375,5643927,RECEIPT,2015-03-21T00:00:00,VANDERLEI MACRIS,114.18,DUSHI RESTAURANTE LTDA. EPP
10666,5678536,RECEIPT,2015-04-09T00:00:00,JAIR BOLSONARO,27.43,ASSOCIAÇÃO DOS SERVIDORES DA CÂMARA DOS DEPUTADOS
12541,5692453,RECEIPT,2015-05-21T00:00:00,NILTO TATTO,39.0,Koni Store
12924,5694874,RECEIPT,2015-05-10T00:00:00,DELEGADO EDSON MOREIRA,40.5,PIZZARELLA LTDA
14127,5705051,RECEIPT,2015-06-03T00:00:00,DANIEL COELHO,36.4,JLM RESTAURANTE
20780,5768938,RECEIPT,2015-08-11T00:00:00,PAES LANDIM,83.0,LUXOR PIAUÍ HOTEL
22804,5787990,RECEIPT,2015-08-02T00:00:00,JOSÉ MENTOR,83.35,ESFIHA IMIGRANTES
24519,5803734,RECEIPT,2015-09-22T00:00:00,EVANDRO ROMAN,404.19,Restaurant Las Canarias


In [23]:
suspicious.text.apply(lambda x: re.findall(keywords_regex, x))

4077                       [ CHOPP ]
5901                       [ CHOPP ]
6375               [ STELLA ARTOIS ]
10666                      [ POLAR ]
12541                    [ CERVEJA ]
12924                     [\nCHOPP ]
14127                     [ MERLOT ]
20780    [ CHAMPAGNE ,  CHAMPAGNE\n]
22804            [ BRAHMA ,  CHOPP ]
24519                  [ BUD LIGHT ]
25179                      [ CHOPP ]
25527                    [ CERVEJA ]
25577                    [\nCHOPP\n]
25625                   [ HEINEKEN ]
25948                      [ CHOPP ]
26951                    [ CERVEJA ]
27242       [ HEINEKEN ,  HEINEKEN ]
27428                   [\nDEVASSA ]
31215                    [ CERVEJA ]
35716             [\nSTELLA ARTOIS ]
36354            [ CHOPP , \nCHOPP ]
39996                       [ WINE ]
40915                   [\nCAMPARI ]
42844                       [ WINE ]
43077                       [ WINE ]
44288                      [ GALLO ]
46205                       [ WINE ]

Besides the same observations made on the previous data, I also found:
- A few cases where the beer / wine was canceled
  - http://www.camara.gov.br/cota-parlamentar/documentos/publ/1947/2015/5643927.pdf
  - http://www.camara.gov.br/cota-parlamentar/documentos/publ/2920/2015/5694874.pdf
  - http://www.camara.gov.br/cota-parlamentar/documentos/publ/3071/2015/5705051.pdf
- An [expense made abroad](http://www.camara.gov.br/cota-parlamentar/documentos/publ/2977/2015/5803734.pdf) that had many different items in it; not sure whether the person got reimbursed for a Bud Light or not
- `POLAR` was a match with badly OCRed text as well
- `CHAMPAGNE` can be used in sauces as well
- `DEVASSA` can be the name of the restaurant like [this one](http://www.camara.gov.br/cota-parlamentar/documentos/publ/2267/2015/5830014.pdf)
- `WINE` can be in the name of the restaurant like [this one](http://www.camara.gov.br/cota-parlamentar/documentos/publ/3037/2016/5973107.pdf)
- Handwritten receipts might contain keywords in random "marketing" text, like `CERVEJA` in "A melhor casa da cerveja" in [this receipt](http://www.camara.gov.br/cota-parlamentar/documentos/publ/2990/2015/5813355.pdf)
- "Comandas" might list all types of beverages they serve and could return lots of false positives like [this one](http://www.camara.gov.br/cota-parlamentar/documentos/publ/2295/2016/6067950.pdf)
- 6 reimbursements had beers included, but 4 of them had already been identified and reported based on the previous results. The 2 new ones were also reported
  
## Conclusions and thoughts

- Lots of reimbursements already exclude alcoholic beverages from the amount reimbursed even though there is no remark value
- There are lots of false positives around. Can we do something to get them out of the way?
- Is this enough to incorporate this into Rosie? How can we calculate scores for those reimbursements? Is there anything we can do to spot false positives and reduce the noise?
- Are there better strategies for identifying this kind of thing besides OCR? Maybe some "advanced computer vision" tool / algorithm could yield better results
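
One possible direction for reducing the noise, untested on the real data, is to require that a keyword sits on a line that also carries a price, which would skip slogans and restaurant names. The sketch below is only an assumption about what such a filter could look like; `PRICE`, `priced_keyword_lines` and the receipt text are made up for illustration.

```python
import re

# A price such as "19,00" or "19. 00" (hypothetical pattern, tuned for the example only)
PRICE = re.compile(r'\d+[.,]\s*\d{2}(?!\d)')

def priced_keyword_lines(text, keywords):
    """Return the lines that contain both a whole-word keyword and a price."""
    words = re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(k.upper()) for k in keywords))
    )
    return [
        line for line in text.upper().splitlines()
        if words.search(line) and PRICE.search(line)
    ]

# Made-up receipt: the first line is a slogan, the third is an actual item
receipt = "A MELHOR CASA DA CERVEJA\n1 FILE 98,00\n1 CERVEJA 19,00\nTOTAL 117,00"
print(priced_keyword_lines(receipt, ['cerveja', 'chopp']))
```

This would still flag dishes such as "Filé ao molho de vinho" when they are listed with a price, so it would only address part of the false positives.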