## Expenses on car rental

This analysis investigates car rental expenses during the current term. A previous analysis I did in Excel shows that 1) some politicians systematically spend above the monthly limit of R$ 10K, and 2) some congresspersons rent more than one vehicle every month, which raises suspicion: considering they work in DF, are the cars rented outside DF being used by someone else?

~~**First step:** get a list of congresspersons, the amounts reimbursed to them since Jan. 2015 and the dates of those reimbursements. Then we cross these data with the list of companies that rented those vehicles so we can get information on where those rentals occurred.~~ *Done!*

**Second step:** get datasets (sessions, speeches) that may show whether a congressperson was or was not in DF during specific periods: the periods when the vehicles were rented. As a result, we can identify months in which the congressperson spent most of his/her time in DF but paid a full month's rent somewhere else. *I need some help here.*

-- Rodolfo Viana
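
One possible way to approach the second step is sketched below. It is only an illustration under stated assumptions: the `presence` table, its columns, and the `min_days` and `min_value` thresholds are hypothetical and do not correspond to any dataset loaded in this notebook. The idea is to count the distinct days per month on which a congressperson appears in a DF session log, then flag months that combine many days in DF with a high rental bill elsewhere.

```python
import pandas as pd

# Hypothetical rentals: one row per congressperson, month and rental state.
rentals = pd.DataFrame({
    'congressperson_name': ['A', 'A', 'B'],
    'year': [2016, 2016, 2016],
    'month': [3, 4, 3],
    'state': ['MA', 'MA', 'PR'],
    'net_values': [9000.0, 9000.0, 4000.0],
})

# Hypothetical attendance log: one row per day on which the
# congressperson was recorded in a session in DF.
presence = pd.DataFrame({
    'congressperson_name': ['A'] * 5 + ['B'],
    'date': pd.to_datetime(['2016-03-01', '2016-03-02', '2016-03-08',
                            '2016-03-09', '2016-04-05', '2016-03-15']),
})


def days_in_df_per_month(presence):
    """Count distinct days in DF per congressperson, year and month."""
    out = presence.drop_duplicates().copy()
    out['year'] = out['date'].dt.year
    out['month'] = out['date'].dt.month
    return (out.groupby(['congressperson_name', 'year', 'month'])
               .size()
               .rename('days_in_df')
               .reset_index())


def flag_suspicious(rentals, presence, min_days=4, min_value=7000.0):
    """Flag months with many days in DF and a high rental bill elsewhere.

    min_days and min_value are arbitrary placeholders, not validated cutoffs.
    """
    days = days_in_df_per_month(presence)
    merged = rentals.merge(days, how='left',
                           on=['congressperson_name', 'year', 'month'])
    # Months with no recorded session count as zero days in DF.
    merged['days_in_df'] = merged['days_in_df'].fillna(0).astype(int)
    mask = ((merged['days_in_df'] >= min_days)
            & (merged['net_values'] >= min_value))
    return merged[mask]


suspicious = flag_suspicious(rentals, presence)
print(suspicious)
```

A left merge keeps rental months with no attendance record at all, so they show up with zero days in DF instead of silently disappearing.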

In [4]:
import pandas as pd
import numpy as np

# np.str was removed in NumPy 1.24; the built-in str is equivalent here
data = pd.read_csv('../data/2017-06-04-reimbursements.xz',
                   dtype={'applicant_id': str,
                          'cnpj_cpf': str,
                          'congressperson_id': str,
                          'congressperson_name': str,
                          'subquota_number': str,
                          'issue_date': str,
                          'document_id': str},
                   low_memory=False)

In [5]:
# Selecting term and subquota description

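data = data[data['year'] >= 2015]
# .copy() avoids a SettingWithCopyWarning on the assignment below
data = data[data['subquota_description'] == 'Automotive vehicle renting or charter'].copy()
# regex=True is needed since pandas 2.0, where str.replace matches literally by default
data['cnpj_cpf'] = data['cnpj_cpf'].str.replace(r'[\.\/\-]', '', regex=True)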
data.subquota_description.value_counts()

Automotive vehicle renting or charter    14263
Name: subquota_description, dtype: int64

In [6]:
# Checking that everything is fine

data.iloc[0]

year                                                           2015
applicant_id                                                   1003
document_id                                                 5588715
reimbursement_value_total                                       NaN
total_net_value                                                4000
reimbursement_numbers                                          4888
congressperson_name                                  DOMINGOS DUTRA
congressperson_id                                             74197
congressperson_document                                          72
term                                                           2011
state                                                            MA
party                                                            SD
term_id                                                          54
subquota_number                                                 120
subquota_description          Automotive vehicle

In [7]:
# Cleaning the list

congressperson_list = data[['congressperson_name', 
                            'congressperson_id', 
                            'net_values', 
                            'month', 
                            'year', 
                            'issue_date', 
                            'document_id',
                            'cnpj_cpf']]

In [13]:
# Grouping data by congressperson and the sum of his/her expenses

congressperson_expenses = congressperson_list.groupby(['congressperson_name', 
                                                       'year', 
                                                       'month', 
                                                       'issue_date', 
                                                       'document_id']).agg({'net_values': 'sum'})

In [16]:
# Getting companies dataset and excluding those from DF

companies = pd.read_csv('../data/2017-05-21-companies-no-geolocation.xz', low_memory=False)
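companies = companies[companies['state'] != 'DF'].copy()
# regex=True is needed since pandas 2.0, where str.replace matches literally by default
companies['cnpj'] = companies['cnpj'].str.replace(r'[\.\/\-]', '', regex=True)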

In [17]:
# Merging both datasets

dataset = pd.merge(data, companies, how='inner',
                   left_on='cnpj_cpf', right_on='cnpj')
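
An inner merge silently drops every reimbursement whose supplier is not found in the companies table, so it can be worth checking how many rows match. The sketch below uses toy data and a hypothetical helper, `normalize_cnpj`, and assumes every key is a CNPJ (14 digits); CPFs, which have 11 digits, would need separate handling. Zero-padding is included because the 13-digit CNPJ in the outliers table further down suggests a leading zero can get lost when the value is read as a number.

```python
import pandas as pd


def normalize_cnpj(series):
    """Keep digits only and left-pad to the 14 digits of a CNPJ.

    Assumes every value is a CNPJ; an 11-digit CPF would be padded too.
    """
    digits = series.astype(str).str.replace(r'\D', '', regex=True)
    return digits.str.zfill(14)


# Toy data standing in for the two datasets (values are illustrative).
reimbursements = pd.DataFrame({
    'document_id': ['1', '2', '3'],
    'cnpj_cpf': ['04.221.587/0001-10', '12231343000146', '99999999000199'],
})
companies_toy = pd.DataFrame({
    'cnpj': ['4221587000110', '12.231.343/0001-46'],
    'state': ['AL', 'PI'],
})

reimbursements['cnpj_cpf'] = normalize_cnpj(reimbursements['cnpj_cpf'])
companies_toy['cnpj'] = normalize_cnpj(companies_toy['cnpj'])

# how='left' with indicator=True keeps unmatched reimbursements and
# labels each row as 'both' or 'left_only' in the '_merge' column.
checked = pd.merge(reimbursements, companies_toy, how='left',
                   left_on='cnpj_cpf', right_on='cnpj', indicator=True)
match_counts = checked['_merge'].value_counts()
print(match_counts)
```

Rows labelled `left_only` are the ones an inner merge would have discarded; in the real data these would include companies in DF (filtered out earlier) as well as any key that failed to match.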

In [18]:
# Grouping all the data

congressperson_expenses_dataset = dataset.groupby(['congressperson_name', 
                                                    'year', 
                                                    'month', 
                                                    'issue_date',
                                                    'cnpj',
                                                    'name',
                                                    'city',
                                                    'state_y',
                                                    'document_id']).agg({'net_values': 'sum'})
full_report = congressperson_expenses_dataset.reset_index()

In [19]:
# Getting 'sum' and 'describe'

full_report.net_values.sum()

49156001.55999989

In [20]:
full_report.net_values.describe()

count    10945.000000
mean      4491.183331
std       2694.178468
min          3.230000
25%       2492.100000
50%       3900.000000
75%       6200.000000
max      10900.000000
Name: net_values, dtype: float64
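
The statistics above are per receipt, because `document_id` is one of the grouping keys. The two observations in the introduction (spending above the monthly limit, and more than one vehicle per month) are about monthly totals, so a per-month view may be a useful complement. The sketch below runs on toy data shaped like `full_report`; the R$ 10K limit is taken from the introduction, and counting distinct suppliers is only a rough proxy for counting vehicles.

```python
import pandas as pd

MONTHLY_LIMIT = 10000.0  # R$ 10K, as stated in the introduction (assumption)

# Toy rows with the same columns used from full_report.
report = pd.DataFrame({
    'congressperson_name': ['A', 'A', 'A', 'B'],
    'year': [2016, 2016, 2016, 2016],
    'month': [5, 5, 6, 5],
    'cnpj': ['111', '222', '111', '333'],
    'net_values': [6000.0, 5500.0, 6000.0, 4000.0],
})

# One row per congressperson and month: total spent, number of
# distinct suppliers, and number of receipts.
monthly = (report
           .groupby(['congressperson_name', 'year', 'month'])
           .agg(total=('net_values', 'sum'),
                suppliers=('cnpj', 'nunique'),
                receipts=('net_values', 'size'))
           .reset_index())

over_limit = monthly[monthly['total'] > MONTHLY_LIMIT]
multi_supplier = monthly[monthly['suppliers'] > 1]
print(over_limit)
print(multi_supplier)
```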

In [22]:
# Picking only the values above mean + standard deviation

threshold = full_report.net_values.mean() + full_report.net_values.std()
outliers = (full_report[full_report['net_values'] > threshold]
            .sort_values('net_values', ascending=False))

In [25]:
# Getting the full list of outliers

from IPython.display import HTML

pd.set_option('display.max_colwidth', 1000)
HTML(outliers.to_html(escape=False))

| | congressperson_name | year | month | issue_date | cnpj | name | city | state_y | document_id | net_values |
|---|---|---|---|---|---|---|---|---|---|---|
| 4168 | GIVALDO CARIMBÃO | 2016 | 12 | 2016-12-30T00:00:00 | 4221587000110 | J B LOCAÇÃO DE VEICULOS EIRELI | MACEIO | AL | 6203902 | 10900.0 |
| 7873 | NILTON CAPIXABA | 2015 | 12 | 2015-12-17 00:00:00.0 | 10268644000119 | RO AMBIENTAL E SERVICOS LTDA. - ME | CACOAL | RO | 5883329 | 10900.0 |
| 1003 | ASSIS CARVALHO | 2017 | 4 | 2017-05-10T00:00:00 | 12231343000146 | DIAGONAL LOCACAO DE VEICULOS LTDA - ME | TERESINA | PI | 6292658 | 10900.0 |
| 1002 | ASSIS CARVALHO | 2017 | 2 | 2017-03-10T00:00:00 | 12231343000146 | DIAGONAL LOCACAO DE VEICULOS LTDA - ME | TERESINA | PI | 6228172 | 10900.0 |
| 1001 | ASSIS CARVALHO | 2017 | 1 | 2017-02-13T00:00:00 | 12231343000146 | DIAGONAL LOCACAO DE VEICULOS LTDA - ME | TERESINA | PI | 6214967 | 10900.0 |
| 1000 | ASSIS CARVALHO | 2016 | 12 | 2017-01-10T00:00:00 | 12231343000146 | DIAGONAL LOCACAO DE VEICULOS LTDA - ME | TERESINA | PI | 6213087 | 10900.0 |
| 7871 | NILTON CAPIXABA | 2015 | 10 | 2015-10-20 00:00:00.0 | 10268644000119 | RO AMBIENTAL E SERVICOS LTDA. - ME | CACOAL | RO | 5832162 | 10900.0 |
| 7872 | NILTON CAPIXABA | 2015 | 11 | 2015-11-24 00:00:00.0 | 10268644000119 | RO AMBIENTAL E SERVICOS LTDA. - ME | CACOAL | RO | 5855310 | 10900.0 |
| 7874 | NILTON CAPIXABA | 2016 | 1 | 2016-01-21T00:00:00 | 10268644000119 | RO AMBIENTAL E SERVICOS LTDA. - ME | CACOAL | RO | 5894980 | 10900.0 |
| 1024 | ASSIS DO COUTO | 2016 | 9 | 2016-10-08T00:00:00 | 11849722000131 | BRIZZA COMERCIO DE VEICULOS LTDA | CASCAVEL | PR | 6125142 | 10900.0 |


In [29]:
# Getting the 20 congresspersons with the most rentals that
# 1) were made outside DF and
# 2) are considered outliers here

outliers.congressperson_name.value_counts().head(20)

JHONATAN DE JESUS       29
GIVALDO CARIMBÃO        28
PEDRO FERNANDES         28
LELO COIMBRA            28
ZECA DIRCEU             28
ÁTILA LIRA              28
REMÍDIO MONAI           27
JOSI NUNES              27
FÁBIO MITIDIERI         26
TEREZA CRISTINA         26
ROBERTO ALVES           26
LUIZ LAURO FILHO        26
JUSCELINO FILHO         26
JOSÉ AIRTON CIRILO      26
ADALBERTO CAVALCANTI    25
JONY MARCOS             25
EXPEDITO NETTO          25
OSMAR SERRAGLIO         25
ASSIS DO COUTO          25
DAGOBERTO NOGUEIRA      25
Name: congressperson_name, dtype: int64

In [30]:
outliers.net_values.sum()

16647307.05

### Conclusion (so far)

In the current term (since Jan. 2015), congresspersons have been reimbursed R$ 49,156,001.56 for car rental expenses outside DF. These include vehicles rented for a few days, which is normal, and possibly cars rented for the whole month, which is unusual considering congresspersons work in DF.

As I am a newbie at statistics, I used the sum of the mean and the standard deviation as the threshold for outliers. Or should I consider some other criterion? The outliers sum up to R$ 16,647,307.05.
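
For comparison, one commonly used alternative is Tukey's rule, which flags values above Q3 + 1.5 × IQR. Applied to the `describe()` figures above, the mean plus one standard deviation gives 4491.18 + 2694.18 ≈ 7185.36, while the Tukey fence is 6200 + 1.5 × (6200 − 2492.10) = 11761.85. That is above the maximum of 10900, so this rule would flag no receipt at all. This may suggest the values are bounded by a reimbursement cap rather than containing statistical outliers, and that a fixed business threshold could be more informative. The sketch below computes both thresholds on toy data; `outlier_thresholds` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def outlier_thresholds(values):
    """Return two upper thresholds for a numeric Series.

    mean_plus_std: the rule used in this notebook.
    iqr_fence: Tukey's upper fence, Q3 + 1.5 * (Q3 - Q1).
    """
    mean_plus_std = values.mean() + values.std()
    q1, q3 = values.quantile([0.25, 0.75])
    iqr_fence = q3 + 1.5 * (q3 - q1)
    return {'mean_plus_std': mean_plus_std, 'iqr_fence': iqr_fence}


# Toy values with one obvious extreme.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 50.0])
thresholds = outlier_thresholds(values)

flagged_by_std = values[values > thresholds['mean_plus_std']]
flagged_by_iqr = values[values > thresholds['iqr_fence']]
print(thresholds)
```

On the real data, `outlier_thresholds(full_report.net_values)` would return both cutoffs so they can be compared side by side.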

Now I need help to go on with the second step, plus some review of the first step, so I can figure out how to improve this analysis.

This analysis will be updated soon.