# Detecting abnormal meal prices

There's a list of meal reimbursements made using the CEAP. We want to flag anomalies in this dataset based on what is known about food expenses. As a starting point, @filipelinhares and I propose grouping the congresspeople who were reimbursed for meals at the same place on the same day, to see how consumption behaves.

In [2]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
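# Read identifier columns as strings so leading zeros are preserved.
# The `np.str` alias was removed from recent NumPy releases; the builtin `str` is equivalent.
reimbursements = pd.read_csv('../data/2016-11-19-reimbursements.xz',
                             dtype={'document_id': str,
                                    'congressperson_id': str,
                                    'congressperson_document': str,
                                    'term_id': str,
                                    'cnpj_cpf': str,
                                    'reimbursement_number': str},
                             low_memory=False)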

In [4]:
reimbursements['issue_date'] = pd.to_datetime(reimbursements['issue_date'], errors='coerce')
reimbursements.sort_values('issue_date', inplace=True)

## Data preparation

In [5]:
meals = reimbursements[reimbursements.subquota_description == 'Congressperson meal']
meals.head()

Unnamed: 0,year,applicant_id,document_id,reimbursement_value_total,total_net_value,reimbursement_numbers,congressperson_name,congressperson_id,congressperson_document,term,...,issue_date,document_value,remark_value,net_values,month,installment,passenger,leg_of_the_trip,batch_number,reimbursement_values
107417,2009,1880,1701507,,22.36,3105,LUIZ PAULO VELLOZO LUCAS,141489,278,2007.0,...,2000-10-20,22.36,0.0,22.36,10,0,,,431507,
518515,2011,2303,2003049,,80.0,3554,RONALDO ZULKE,160594,515,2011.0,...,2001-02-01,80.0,0.0,80.0,2,0,,,519202,
293600,2010,1862,1895063,,29.0,3386,JOSÉ PAULO TÓFFANO,141471,378,2007.0,...,2007-07-14,29.0,0.0,29.0,7,0,,,486871,
292266,2010,1858,1811373,,76.0,3281,JOÃO OLIVEIRA,141460,61,2007.0,...,2008-03-28,76.0,0.0,76.0,3,0,,,463938,
375589,2010,995,1767225,,39.68,3218,CLAUDIO CAJADO,74537,186,2015.0,...,2009-01-18,39.68,0.0,39.68,1,0,,,450865,


In [6]:
meals.total_net_value.describe()

count    191724.000000
mean         65.758414
std          98.156313
min           0.010000
25%          24.800000
50%          46.060000
75%          85.250000
max        6205.000000
Name: total_net_value, dtype: float64

In [7]:
meals = meals[meals['congressperson_id'].notnull()]
meals.shape

(190763, 31)

In [8]:
# grouped = meals.groupby('cnpj_cpf', as_index=False)
# print('{} total cnpj/cpfs, {} are unique'.format(len(meals), len(grouped)))

In [9]:
# cnpj_cpfs = []
# names = []
# for group in grouped:
#     cnpj_cpfs.append(group[0])
#     names.append(group[1].iloc[0].supplier)

# names = pd.DataFrame({'cnpj_cpf': cnpj_cpfs, 'supplier_name': names})
# names.head()

## CNPJs/CPFs that received the most reimbursements in a single day

In [12]:
keys = ['cnpj_cpf', 'supplier', 'issue_date']
# Named aggregation; renaming columns through a dict is no longer supported by pandas.
aggregation = meals.groupby(keys)['total_net_value'] \
    .agg(expenses='size', sum='sum', mean='mean')

In [14]:
aggregation.sort_values(['expenses', 'sum'], ascending=[False, False]).head(10)

cnpj_cpf,supplier,issue_date,expenses,sum,mean
33469172000672,SERVIÇO NAC. DE APRENDIZAGEM COMERCIAL - SENAC,2016-05-24,73.0,1755.48,24.047671
33469172001644,SENAC - COMP. ADM. CAM. DEP. ANEXO IV 10º ANDAR,2015-03-18,60.0,2745.06,45.751
33469172000672,SERVIÇO NACIONAL DE APRENDIZAGEM COMERCIAL - SENAC,2016-10-05,60.0,1590.7,26.511667
33469172001644,SENAC - COMP. ADM. CAM. DEP. ANEXO IV 10º ANDAR,2015-05-06,57.0,2647.78,46.452281
33469172000672,SERV. NAC. DE APRENDIZAGEM COMERCIAL - SENAC,2015-03-11,56.0,1024.0,18.285714
33469172001644,SENAC - COMP. ADM. CAM. DEP. ANEXO IV 10º ANDAR,2015-03-17,55.0,2462.66,44.775636
33469172001644,SENAC - COMP. ADM. CAM. DEP. ANEXO IV 10º ANDAR,2015-05-07,55.0,2414.76,43.904727
33469172000672,SERVIÇO NACIONAL DE APRENDIZAGEM COMERCIAL - SENAC,2015-07-01,55.0,1391.17,25.294
33469172000672,SERV. NAC. DE APRENDIZAGEM COMERCIAL - SENAC,2015-03-25,55.0,1257.93,22.871455
33469172001644,SENAC - COMP. ADM. CAM. DEP. ANEXO IV 10º ANDAR,2015-04-28,54.0,2358.54,43.676667


As we can see, one place received 73 reimbursements in a single day. For now, we are looking for a way to find out which congresspeople had lunch at those places and how much each of them paid, in order to check whether any of them had an abnormal expense.
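
One possible way to approach this, sketched here on made-up data rather than on the real dataset, is to compare each expense with the median paid at the same place on the same day. The column names match the ones in `meals`; the helper `flag_abnormal_meals` and its thresholds (`min_expenses`, `max_ratio`) are assumptions chosen for illustration, not tuned values.

```python
import pandas as pd

# Toy data using the notebook's column names; all values are made up.
toy_meals = pd.DataFrame({
    'cnpj_cpf': ['111'] * 5 + ['222'] * 2,
    'issue_date': pd.to_datetime(['2016-05-24'] * 5 + ['2016-05-25'] * 2),
    'congressperson_name': ['A', 'B', 'C', 'D', 'E', 'A', 'B'],
    'total_net_value': [20.0, 25.0, 22.0, 24.0, 180.0, 40.0, 45.0],
})


def flag_abnormal_meals(df, min_expenses=3, max_ratio=3.0):
    """Compare each expense with the median paid at the same place and day.

    Hypothetical helper: min_expenses and max_ratio are illustrative
    thresholds, not values derived from the real data.
    """
    keys = ['cnpj_cpf', 'issue_date']
    grouped = df.groupby(keys)['total_net_value']

    result = df.copy()
    # Number of expenses and median value within each (place, day) group.
    result['group_expenses'] = grouped.transform('count')
    result['group_median'] = grouped.transform('median')
    result['ratio_to_median'] = result['total_net_value'] / result['group_median']
    # Only flag when the group is large enough for the median to mean something.
    result['abnormal'] = ((result['group_expenses'] >= min_expenses) &
                          (result['ratio_to_median'] > max_ratio))
    return result


flagged = flag_abnormal_meals(toy_meals)
print(flagged.loc[flagged['abnormal'],
                  ['congressperson_name', 'total_net_value', 'group_median']])
```

On the real data the call would presumably be `flag_abnormal_meals(meals)`. The median is used instead of the mean so that a single very large expense does not shift the reference value it is being compared against.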