# Meals for more than one person

Rosie already has a way to detect meal price outliers based on other reimbursements made at the same restaurant. Now that we have a new dataset with the OCRed text of receipts, we can use the number of people at the table, when the receipt mentions it, to raise the probability that a reimbursement covers meals paid for other people.

As an example, [this receipt](http://www.camara.gov.br/cota-parlamentar/documentos/publ/3143/2016/6050635.pdf) mentions that there were 2 people at the table.

## Data preparation

In [1]:
import re

from IPython.display import HTML
import pandas as pd
import numpy as np

from serenata_toolbox.datasets import fetch

def report(df):
    df = df.copy()
    df['receipt'] = df.apply(link_to_receipt, axis=1)
    df['document_id'] = df.apply(link_to_jarbas, axis=1)
    cols = ['document_id', 'receipt', 'issue_date', 'total_net_value', 'supplier']
    return HTML(df[cols].to_html(escape=False))

def link_to_jarbas(r):
    return '<a target="_blank" href="http://jarbas.datasciencebr.com/#/document_id/{0}">{0}</a>'.format(r.document_id)

DOCUMENT_URL = (
    'http://www.camara.gov.br/'
    'cota-parlamentar/documentos/publ/{}/{}/{}.pdf'
)
def link_to_receipt(r):
    url = DOCUMENT_URL.format(r.applicant_id, r.year, r.document_id)
    return '<a target="_blank" href="{0}">RECEIPT</a>'.format(url)

pd.set_option('display.max_colwidth', 1500)

fetch("2017-02-15-receipts-texts.xz", "../data")
texts = pd.read_csv('../data/2017-02-15-receipts-texts.xz', dtype={'text': str}, low_memory=False)
texts['text'] = texts.text.str.upper()
texts = texts[~texts.text.isnull()]

fetch("2016-12-06-reimbursements.xz", "../data")
reimbursements = pd.read_csv('../data/2016-12-06-reimbursements.xz', low_memory=False)
reimbursements = reimbursements.query('(subquota_description == "Congressperson meal")')
data = texts.merge(reimbursements, on='document_id')
len(data)

56710

There are 56710 meal reimbursements that have OCRed text.
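
As a quick sanity check of the `DOCUMENT_URL` template used by `link_to_receipt`, the sketch below rebuilds the example link from the introduction. Reading `3143` as the `applicant_id` and `2016` as the `year` is an assumption based only on the order of the placeholders in the template.

```python
DOCUMENT_URL = (
    'http://www.camara.gov.br/'
    'cota-parlamentar/documentos/publ/{}/{}/{}.pdf'
)

# Values read off the example link in the introduction. Treating them as
# (applicant_id, year, document_id) is an assumption based on the template.
url = DOCUMENT_URL.format(3143, 2016, 6050635)
print(url)
```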

## Meals for more than one person

Usually the receipt text says `PESSOAS: X`, where `X` is the number of people at the table. Some places use `PESSOA(S): X` instead, but that seems to be less common. A pretty simple regex for fetching the receipts that list more than one person is shown below.

In [2]:
len(data[data.text.str.contains(r'PESSOA\(?S\)?\s*:?\s*[2-9]')])

188

And the number of reimbursements for a single person:

In [3]:
len(data[data.text.str.contains(r'PESSOA\(?S\)?\s*:?\s*1')])

154
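
To see what the two patterns above accept, here is a minimal sketch that runs them against a few receipt snippets. The snippets are made up for illustration, and `people_count` is a hypothetical helper, not part of Rosie. It also shows one caveat: because both patterns only look at the first digit, a line such as `PESSOAS: 10` is counted in the single-person group. Capturing all digits is one possible way around that.

```python
import re

# Same patterns as in the cells above, written as raw strings.
MANY = re.compile(r'PESSOA\(?S\)?\s*:?\s*[2-9]')
ONE = re.compile(r'PESSOA\(?S\)?\s*:?\s*1')

# Hypothetical variant: capture the whole number instead of one digit.
COUNT = re.compile(r'PESSOA\(?S\)?\s*:?\s*(\d+)')

def people_count(text):
    """Return the number of people found in the text, or None."""
    match = COUNT.search(text)
    return int(match.group(1)) if match else None

# Made-up snippets, for illustration only.
samples = [
    'MESA: 12 PESSOAS: 2',
    'PESSOA(S): 3',
    'PESSOAS 1',
    'PESSOAS: 10',
    'TOTAL: 91,60',
]

for text in samples:
    print(text, bool(MANY.search(text)), bool(ONE.search(text)), people_count(text))
```

With a helper like this, the "more than one person" filter could be expressed as `people_count(text) > 1` for texts where a count is found.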

Based on a previous analysis and a quick look at the data, I found that some of these reimbursements had already subtracted other people's food from the receipt total. One way to reduce these false positives is to search for the `total_net_value` amount within the text of the receipt: if the reimbursed amount appears there, it is likely that the whole bill was claimed. To make things easier, we focus on those that are under R$ 1.000,00.

In [4]:
r = data.query('total_net_value < 1000')
r = r[r.text.str.contains(r'PESSOA\(?S\)?\s*:?\s*[2-9]')]

def format_regex(val):
    # Split the value into its integer part and its cents.
    integer_part = int(val)
    decimal = int((val * 100) % 100)
    if decimal == 0:
        decimal = '00'
    # Accept both "91,60" and "91.60" style separators.
    return '|'.join([
        r'{},\s*{}'.format(integer_part, decimal),
        r'{}\.\s*{}'.format(integer_part, decimal)
    ])

def receipt_matches_net_value(r):
    # True when the reimbursed amount appears somewhere in the OCRed text.
    return re.search(format_regex(r.total_net_value), r.text) is not None

r = r[r.apply(receipt_matches_net_value, axis=1)]
print(len(r))
report(r)

51


| | document_id | receipt | issue_date | total_net_value | supplier |
|---|---|---|---|---|---|
| 213 | 5586256 | RECEIPT | 2015-01-08T00:00:00 | 91.6 | OUTBACK STEAKHOUSE RESTAURANTES BRASIL S.A |
| 1524 | 5603654 | RECEIPT | 2015-02-10T00:00:00 | 81.7 | COCO BAMBU |
| 1577 | 5604052 | RECEIPT | 2015-02-06T00:00:00 | 52.8 | MELO & MELO CAFETERIA LTDA ME |
| 1733 | 5605403 | RECEIPT | 2015-02-11T00:00:00 | 105.42 | COCO BAMBU LAGO SUL COMERCIO DE ALIMENTOS LTDA |
| 2811 | 5616730 | RECEIPT | 2015-03-02T00:00:00 | 100.5 | OUTBACK |
| 3866 | 5625538 | RECEIPT | 2015-03-05T00:00:00 | 137.1 | COCO BAMBU LAGO SUL COMERCIO DE ALIMENTOS LTDA |
| 4465 | 5630492 | RECEIPT | 2015-03-16T00:00:00 | 129.15 | OUTBACK |
| 6176 | 5642807 | RECEIPT | 2015-03-25T00:00:00 | 123.5 | OUTBACK STEAKHOUSE RESTAURANTES BRASIL LTDA |
| 6465 | 5644536 | RECEIPT | 2015-03-26T00:00:00 | 130.3 | Outback Steakhouse Restaurantes Brasil LTDA |
| 6782 | 5647733 | RECEIPT | 2015-04-05T00:00:00 | 53.5 | KING FOOD CO COMERCIO DE ALIMENTOS - S/A |
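
The value-matching step can also be tried in isolation. The sketch below is a possible variant, not the code that produced the results above: `value_pattern` is a hypothetical name, and it differs from `format_regex` in three ways. It rounds to cents instead of truncating, it zero-pads single-digit cents (so `91.05` is searched as `05` rather than `5`), and it refuses matches that sit inside a longer number. Like the original, it assumes values under R$ 1.000,00, so thousands separators are not handled.

```python
import re

def value_pattern(value):
    """Build a regex matching `value` printed as e.g. 91,60 or 91.60."""
    # Round to whole cents to avoid float truncation issues.
    cents = round(value * 100)
    whole, frac = divmod(cents, 100)
    # (?<!\d) and (?!\d) stop the match from being part of a longer number;
    # {:02d} zero-pads the cents.
    return r'(?<!\d){}[,.]\s*{:02d}(?!\d)'.format(whole, frac)

# Made-up receipt lines, for illustration only.
print(bool(re.search(value_pattern(91.6), 'TOTAL R$ 91,60')))
print(bool(re.search(value_pattern(91.05), 'TOTAL R$ 91,50')))
print(bool(re.search(value_pattern(52.8), 'SUBTOTAL 152,80')))
```

Swapping this in for `format_regex` would likely change which receipts match, so the count reported above should not be expected to stay the same.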


Out of those 51 reimbursements, I found at least 6 that are really suspicious, and I'll try to get them reported.

I'm not sure how this affects meals before 2015 or after 2016, since the dataset I put together only covers 2015-2016, but it can contribute a lot to moving some reimbursements up in the ranking of suspicious ones.