# Analysis to support article on cabs vs. apps

Simple notebook listing expenses with cabs and fuel to support [@vilapedro](https://github.com/vilapedro)'s article on cabs vs. apps.

First let's load the main dataset, using only the columns we need:

In [1]:
import pandas as pd
import numpy as np

usecols = (
    'total_net_value',
    'congressperson_name',
    'subquota_number',
    'subquota_description',
    'cnpj_cpf',
    'year',
    'month'
)

# np.str was removed in NumPy 1.24; the built-in str is equivalent
dtype = {
    'cnpj_cpf': str,
    'subquota_number': str
}

df = pd.read_csv(
    '../data/2017-03-15-reimbursements.xz',
    usecols=usecols, dtype=dtype
)

As he's writing about expenses starting from June 2014, let's crop our data:

In [2]:
year2014 = df[df.year == 2014]
year2014june = year2014[year2014['month'] >= 6]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
reimbursements = pd.concat([year2014june, df[df['year'] >= 2015]])
reimbursements.year.unique()

array([2014, 2015, 2016, 2017])
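The same crop can also be written as a single boolean mask, which avoids the intermediate frames. This is only a sketch on made-up data: the `toy` frame and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the reimbursements dataset
toy = pd.DataFrame({
    'year':  [2014, 2014, 2014, 2015, 2017],
    'month': [5, 6, 12, 1, 3],
    'total_net_value': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Keep rows from June 2014 onwards: either a later year,
# or 2014 itself with month >= 6
from_june_2014 = (toy['year'] > 2014) | ((toy['year'] == 2014) & (toy['month'] >= 6))
cropped = toy[from_june_2014]

print(cropped[['year', 'month']].values.tolist())
# → [[2014, 6], [2014, 12], [2015, 1], [2017, 3]]
```

Unlike the two-step version, the mask keeps rows in their original order instead of placing all 2014 rows first.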

And according to his analysis of cab usage, we're only interested in nine specific representatives:

In [3]:
names = (
    'Marcelo Squassoni',
    'Vanderlei Macris',
    'Francisco Floriano',
    'Marcus Vicente',
    'Marcelo Delaroli',
    'Renata Abreu',
    'Alessandro Molon',
    'Chico D’Angelo',
    'Zeca Dirceu'
)
names = tuple(name.upper() for name in names)
deputies = reimbursements[reimbursements.congressperson_name.isin(names)]
deputies.congressperson_name.unique()

array(['VANDERLEI MACRIS', 'ZECA DIRCEU', 'ALESSANDRO MOLON',
       'FRANCISCO FLORIANO', 'MARCUS VICENTE', 'MARCELO SQUASSONI',
       'RENATA ABREU', 'MARCELO DELAROLI'], dtype=object)
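Note that the output lists eight names although nine were requested: CHICO D’ANGELO is absent. A quick way to spot requested names that did not match is a set difference, sketched below on made-up data. The `toy` frame is hypothetical, and the plain-apostrophe spelling in it is only an assumption for illustration; why the name is missing in the real dataset is not established here.

```python
import pandas as pd

# Hypothetical stand-in for the reimbursements data; the spelling of the
# last name (plain apostrophe) is an assumption made for this example only
toy = pd.DataFrame({
    'congressperson_name': ['RENATA ABREU', 'ZECA DIRCEU', "CHICO D'ANGELO"],
})

# Names as requested in the notebook (typographic apostrophe)
requested = ('RENATA ABREU', 'ZECA DIRCEU', 'CHICO D’ANGELO')

# Requested names that never appear in the data
found = set(toy['congressperson_name'].unique())
missing = sorted(set(requested) - found)

print(missing)
# → ['CHICO D’ANGELO']
```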

Now let's filter only the expenses done with fuel…

In [4]:
fuel = deputies[deputies.subquota_number == '3']
fuel.subquota_description.unique()

array(['Fuels and lubricants'], dtype=object)

…and with cabs:

In [5]:
taxi = deputies[deputies.subquota_number == '122']
taxi.subquota_description.unique()

array(['Taxi, toll and parking'], dtype=object)

Finally, let's group expenses month by month to compare. The hypothesis is that fuel expenses should decrease when cab expenses increase:

In [6]:
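def group_by_month(df):
    # groupby needs a list of keys; recent pandas treats a tuple as a single key
    keys = ['congressperson_name', 'year', 'month']
    # Total value and number of expenses per deputy and month
    return df.groupby(keys)['total_net_value'] \
        .agg(['sum', len]) \
        .rename(columns={'len': 'expenses'})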

In [7]:
grouped_fuel = group_by_month(fuel)
grouped_fuel

| congressperson_name | year | month | sum | expenses |
|---|---:|---:|---:|---:|
| ALESSANDRO MOLON | 2014 | 6 | 1414.47 | 13.0 |
| ALESSANDRO MOLON | 2014 | 7 | 92.00 | 1.0 |
| ALESSANDRO MOLON | 2014 | 10 | 539.88 | 5.0 |
| ALESSANDRO MOLON | 2014 | 11 | 520.93 | 5.0 |
| ALESSANDRO MOLON | 2014 | 12 | 558.49 | 5.0 |
| ALESSANDRO MOLON | 2015 | 1 | 289.58 | 2.0 |
| ALESSANDRO MOLON | 2015 | 2 | 455.15 | 3.0 |
| ALESSANDRO MOLON | 2015 | 3 | 628.15 | 4.0 |
| ALESSANDRO MOLON | 2015 | 4 | 700.16 | 5.0 |
| ALESSANDRO MOLON | 2015 | 5 | 996.15 | 8.0 |


In [8]:
grouped_taxi = group_by_month(taxi)
grouped_taxi

| congressperson_name | year | month | sum | expenses |
|---|---:|---:|---:|---:|
| ALESSANDRO MOLON | 2014 | 6 | 593.00 | 20.0 |
| ALESSANDRO MOLON | 2014 | 7 | 612.00 | 15.0 |
| ALESSANDRO MOLON | 2014 | 8 | 348.00 | 7.0 |
| ALESSANDRO MOLON | 2014 | 9 | 107.00 | 2.0 |
| ALESSANDRO MOLON | 2014 | 10 | 1086.00 | 22.0 |
| ALESSANDRO MOLON | 2014 | 11 | 1629.00 | 41.0 |
| ALESSANDRO MOLON | 2014 | 12 | 1365.07 | 33.0 |
| ALESSANDRO MOLON | 2015 | 1 | 2407.00 | 62.0 |
| ALESSANDRO MOLON | 2015 | 2 | 1955.00 | 53.0 |
| ALESSANDRO MOLON | 2015 | 3 | 1619.00 | 38.0 |

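One possible way to check the hypothesis is to put the monthly fuel and cab totals side by side and look at their correlation. The sketch below uses only the ALESSANDRO MOLON rows visible in the truncated outputs, restricted to June 2014 through March 2015 (the range both tables cover). Treating months without a fuel row as zero spending is an assumption, and Pearson correlation is just one simple choice of measure, not necessarily what the article used.

```python
import pandas as pd

# Monthly totals for ALESSANDRO MOLON, copied from the truncated tables,
# keyed by (year, month)
fuel_sum = pd.Series({
    (2014, 6): 1414.47, (2014, 7): 92.00, (2014, 10): 539.88,
    (2014, 11): 520.93, (2014, 12): 558.49, (2015, 1): 289.58,
    (2015, 2): 455.15, (2015, 3): 628.15,
}, name='fuel')
taxi_sum = pd.Series({
    (2014, 6): 593.00, (2014, 7): 612.00, (2014, 8): 348.00,
    (2014, 9): 107.00, (2014, 10): 1086.00, (2014, 11): 1629.00,
    (2014, 12): 1365.07, (2015, 1): 2407.00, (2015, 2): 1955.00,
    (2015, 3): 1619.00,
}, name='taxi')

# Align both series on (year, month); assumption: a missing month means 0 spent
side_by_side = pd.concat([fuel_sum, taxi_sum], axis=1).fillna(0).sort_index()
side_by_side.index.names = ['year', 'month']

# A clearly negative value would be consistent with the hypothesis;
# a value near zero would not support it
correlation = side_by_side['fuel'].corr(side_by_side['taxi'])

print(side_by_side)
print(round(correlation, 2))
```

With the full tables, the same idea should carry over by concatenating `grouped_fuel['sum']` and `grouped_taxi['sum']` along `axis=1` and computing the correlation per congressperson.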

And let's export the results to an Excel file (the cell is left commented out):

In [9]:
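# ExcelWriter.save() was removed in pandas 2.0; the context manager saves on exit
# with pd.ExcelWriter('fuel_vs_cabs.xlsx') as output:
#     grouped_fuel.to_excel(output, sheet_name='Grouped Fuel')
#     grouped_taxi.to_excel(output, sheet_name='Grouped Taxi')
#     fuel.to_excel(output, sheet_name='Fuel')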

According to Pedro, there's nothing interesting in this comparison, but let's leave this notebook here in case anyone else wants to explore it.