# Lodging Expense Analysis (an attempt to partially address issue #26)

This analysis tries to find anomalies in lodging expenses by comparing each receipt against the other receipts issued by the same supplier.

Note that this code leaves some very important things out of consideration:

* There seems to be no way to know the number of days spent at the hotel
* No special treatment is applied to holidays and weekends

Such things can cause false positives, so the results presented here must be taken with a grain of salt. To put it another way, these results should be used as input for further data analysis; they are not yet ready for manual investigation of any sort.
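As a rough illustration of the second limitation, a weekend flag could be derived from the `issue_date` column of the dataset. This is only a sketch: the two dates are issue dates that appear in this dataset, the `is_weekend` column name is an assumption, and holidays would need an external calendar that is not covered here.

```python
import pandas as pd

# Two issue dates that appear in this dataset, used as a tiny example.
toy = pd.DataFrame({
    'issue_date': ['2009-07-12T00:00:00', '2009-07-30T00:00:00'],
    'total_net_value': [430.0, 50.0],
})

# Parse the ISO timestamps, then flag Saturdays (5) and Sundays (6).
toy['issue_date'] = pd.to_datetime(toy['issue_date'])
toy['is_weekend'] = toy['issue_date'].dt.dayofweek >= 5  # hypothetical column name
```

A flag like this would not remove false positives by itself, but it could be used to compare weekend and weekday receipts separately.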

In [1]:
import pandas as pd
import numpy as np

In [2]:
# `np.str` was removed in recent NumPy releases; the builtin `str` is equivalent here.
data = pd.read_csv('../data/2016-11-19-reimbursements.xz',
                   dtype={'cnpj_cpf': str, 'reimbursement_numbers': str})

The first thing we should do is filter our dataset according to @Irio's suggestion.

In [3]:
filtered_data = data[data['subquota_description'] == 'Lodging, except for congressperson from Distrito Federal']
filtered_data.head(2)

Unnamed: 0,year,applicant_id,document_id,reimbursement_value_total,total_net_value,reimbursement_numbers,congressperson_name,congressperson_id,congressperson_document,term,...,issue_date,document_value,remark_value,net_values,month,installment,passenger,leg_of_the_trip,batch_number,reimbursement_values
181,2009,1001,1628770,,430.0,2986,DILCEU SPERAFICO,73768.0,444.0,2015.0,...,2009-07-12T00:00:00,437.0,7.0,430.0,7,0,,,410398,
220,2009,1001,1640122,,50.0,3006,DILCEU SPERAFICO,73768.0,444.0,2015.0,...,2009-07-30T00:00:00,50.0,0.0,50.0,7,0,,,413482,


Next, it is handy to further simplify our model. Let's focus only on the hotel's social ID (CNPJ) and the receipt's declared value.

In [4]:
lodging_data = filtered_data[['cnpj_cpf', 'total_net_value']]
lodging_data.head()

Unnamed: 0,cnpj_cpf,total_net_value
181,91046284000960,430.0
220,6376252000104,50.0
228,9259358000450,141.75
246,77124980000169,557.0
277,7686368000102,320.0


Now let's find out the average value and standard deviation for each supplier that has at least 10 receipts (so our results have a higher chance of being meaningful).

In [5]:
per_supplier_data = lodging_data.groupby('cnpj_cpf').agg({ 'total_net_value': ['count', 'mean', 'std'] })
meaningful_supplier_data = per_supplier_data[per_supplier_data['total_net_value']['count'] >= 10]

# http://stackoverflow.com/questions/14507794/python-pandas-how-to-flatten-a-hierarchical-index-in-columns
meaningful_supplier_data.columns = [' '.join(col).strip() for col in meaningful_supplier_data.columns.values]

meaningful_supplier_data.head()

cnpj_cpf,total_net_value count,total_net_value mean,total_net_value std
31708000100,10,28.35,17.365435
82535000159,12,383.916667,238.898784
87893000154,16,624.39125,432.780485
96115000121,58,414.112931,158.154079
109623000105,20,153.7865,43.120415
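The same per-supplier statistics could probably be computed without the column-flattening step by using pandas named aggregation (available since pandas 0.25). The sketch below runs on made-up receipts; the supplier IDs `'A'` and `'B'`, the short column names, and the `min_receipts` variable are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy receipts for illustration only; IDs and values are made up.
toy = pd.DataFrame({
    'cnpj_cpf': ['A', 'A', 'A', 'B'],
    'total_net_value': [100.0, 200.0, 300.0, 50.0],
})

# Named aggregation yields flat column names directly.
stats = toy.groupby('cnpj_cpf')['total_net_value'].agg(
    count='count', mean='mean', std='std')

# Keep suppliers with enough receipts (10 in the notebook, 3 in this toy case).
min_receipts = 3
meaningful = stats[stats['count'] >= min_receipts]
```

Note that the resulting columns are named `count`, `mean` and `std` rather than `total_net_value count` and so on, so later cells would need to be adapted accordingly.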


We can now join these statistics back to the filtered dataset and flag potentially suspicious receipts: those whose value is more than two standard deviations above the supplier's mean.

In [6]:
joined_data = pd.merge(filtered_data, meaningful_supplier_data, left_on='cnpj_cpf', right_index=True)
suspicious_predicate = joined_data.total_net_value > (joined_data['total_net_value mean'] + (joined_data['total_net_value std'] * 2))
suspicious_data = joined_data[suspicious_predicate][['congressperson_name', 'cnpj_cpf', 'supplier', 'total_net_value', 'total_net_value mean', 'total_net_value std', 'total_net_value count']]
suspicious_data.head()

Unnamed: 0,congressperson_name,cnpj_cpf,supplier,total_net_value,total_net_value mean,total_net_value std,total_net_value count
246,DILCEU SPERAFICO,77124980000169,GELINSKI HOTEIS E TURISMO LTDA,557.0,194.07303,104.745912,33
845546,BRUNO ARAÚJO,7686368000102,WTC ADMINISTRAÇÃO E HOTELARIA LTDA,1489.7,581.781538,339.808723,13
818585,ANDRÉ ZACHAROW,66542002002083,BLUE TREE PREMIUM,933.82,324.261471,211.155244,34
818586,ANDRÉ ZACHAROW,66542002002083,BLUE TREE PREMIUM,1069.14,324.261471,211.155244,34
185809,NELSON MEURER,76755404000157,PARANOA HOTEIS LTDA,4370.0,1682.624441,1273.0186,304


Let's check the first result (DILCEU SPERAFICO's receipt for GELINSKI HOTEIS E TURISMO LTDA) to be sure we got our math right.

In [27]:
filtered_data[filtered_data.cnpj_cpf == '77124980000169'][['document_id', 'supplier', 'total_net_value']]

Unnamed: 0,document_id,supplier,total_net_value
246,1655946,GELINSKI HOTEIS E TURISMO LTDA,557.0
123415,1713061,GELINSKI HOTÉIS E TURISMO,190.0
123416,1713064,GELINSKI HOTÉIS E TURISMO LTDA,193.0
154724,1718617,GELINSKI HOTEIS,212.0
167874,1661735,GELINSKI HOTEIS E TURISMO LTDA,98.0
370902,1781242,ATALAIA PALACE HOTEL - GUARAPUAVA,105.0
370957,1799959,GELINSKI E TURISMO LTDA,105.0
370994,1811652,GELINSKI HOTEIS E TURISMO LTDA,105.0
371127,1908865,GELINKI HOTÉIS E TURISMO LTDA,192.5
406971,2068645,ATALAIA PALACE HOTEL LTDA,135.5
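The cutoff for this receipt can also be checked by hand. The snippet below is a small sketch that plugs in the value, mean and standard deviation reported for index 246 in the suspicious receipts table; the variable names are only illustrative.

```python
# Values copied from the suspicious_data row with index 246.
value, mean, std = 557.0, 194.07303, 104.745912

threshold = mean + 2 * std       # cutoff: mean plus two standard deviations
z_score = (value - mean) / std   # standard deviations above the supplier mean

print(round(threshold, 2), round(z_score, 2))  # 403.56 3.46
```

So the receipt sits roughly 3.5 standard deviations above the supplier's mean, comfortably past the cutoff of about 403.56.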


We can conclude that 557.0 is indeed well above average for this supplier (though maybe not enough to really be suspicious; the number of standard deviations used in `suspicious_predicate` is a knob that could be tuned). Also note that we have different supplier names related to the same company social ID (CNPJ).
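One way to expose that knob is to wrap the predicate in a small function. This is a minimal sketch, assuming the flattened column names used earlier; `flag_suspicious` and its `n_std` parameter are hypothetical names, and the toy frame only mimics the relevant columns of `joined_data`.

```python
import pandas as pd

def flag_suspicious(df, n_std=2.0):
    """Return rows whose value exceeds mean + n_std * std for their supplier."""
    cutoff = df['total_net_value mean'] + n_std * df['total_net_value std']
    return df[df['total_net_value'] > cutoff]

# Toy rows mimicking joined_data; statistics are those reported for this supplier.
toy = pd.DataFrame({
    'total_net_value': [557.0, 190.0],
    'total_net_value mean': [194.07303, 194.07303],
    'total_net_value std': [104.745912, 104.745912],
})

flagged_default = flag_suspicious(toy)             # two standard deviations
flagged_strict = flag_suspicious(toy, n_std=3.5)   # stricter cutoff
```

With the default of two standard deviations the 557.0 receipt is flagged, while a stricter setting of 3.5 no longer flags it, which shows how sensitive the results are to this choice.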