# Traveled speeds

The Quota for Exercise of Parliamentary Activity says that meal expenses can be reimbursed only for the politician, excluding guests and assistants. A feature describing the speed a congressperson would have had to travel since their last meal can help us detect anomalous expenses.

Since the structured data does not include the time of each expense, we want to analyze the group of expenses made on the same day.

* Learn how to calculate distance between two coordinates.
* Filter "Congressperson meal" expenses.
* Order by date.
* Merge `reimbursements.xz` dataset with `companies.xz`, so we have latitude/longitude for each expense.
* Remove expenses made less than 12 hours apart from each other.

...


* Filter a specific congressperson.
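
As a sketch of the first item in the list above, the distance between two coordinates can be approximated with the haversine formula. This treats the Earth as a sphere with an assumed mean radius, so it is only an approximation and not the ellipsoidal calculation that `geopy` performs later in this notebook. The function name `haversine_km` and the sample coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for the spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates for Brasília and São Paulo, used only for illustration.
brasilia = (-15.79, -47.88)
sao_paulo = (-23.55, -46.63)
print(round(haversine_km(*brasilia, *sao_paulo)), 'km')
```

A spherical model is usually close enough to flag implausible trips, but the ellipsoidal distance is the better choice when exact figures matter.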

In [1]:
import pandas as pd
import numpy as np

reimbursements = pd.read_csv('../data/2016-11-19-reimbursements.xz',
                             dtype={'cnpj_cpf': str})



In [2]:
reimbursements.iloc[0]

year                                                       2009
applicant_id                                               1001
document_id                                             1564212
reimbursement_value_total                                   NaN
total_net_value                                             130
reimbursement_numbers                                      2888
congressperson_name                            DILCEU SPERAFICO
congressperson_id                                         73768
congressperson_document                                     444
term                                                       2015
state                                                        PR
party                                                        PP
term_id                                                      55
subquota_number                                               3
subquota_description                       Fuels and lubricants
subquota_group_id                       

In [3]:
reimbursements = reimbursements[reimbursements['subquota_description'] == 'Congressperson meal']
reimbursements.shape

(191724, 31)

In [4]:
reimbursements['issue_date'] = pd.to_datetime(reimbursements['issue_date'], errors='coerce')
reimbursements.sort_values('issue_date', inplace=True)

In [5]:
companies = pd.read_csv('../data/2016-09-03-companies.xz', low_memory=False)
companies.shape

(60047, 228)

In [6]:
companies.iloc[0]

situation_date                                                     03/11/2005
type                                                                   MATRIZ
name                             COMPANHIA DE AGUAS E ESGOTOS DE RORAIMA CAER
phone                                                          (95) 3626-5165
situation                                                               ATIVA
neighborhood                                                        SAO PEDRO
address                                                        R MELVIN JONES
number                                                                    219
zip_code                                                           69.306-610
city                                                                BOA VISTA
state                                                                      RR
opening                                                            21/11/1969
legal_entity                              203-8 - SOCIEDADE DE E

In [7]:
companies['cnpj'] = companies['cnpj'].str.replace(r'[\.\/\-]', '', regex=True)
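
To illustrate what this normalization does, here is a minimal sketch on a made-up two-row series. It assumes a pandas version where `Series.str.replace` accepts the `regex` argument; the sample values are hypothetical and only mimic the CNPJ format.

```python
import pandas as pd

# Hypothetical sample: one formatted CNPJ and one that is already digits only.
sample = pd.Series(['72.614.977/0002-90', '72614977000290'])

# Strip dots, slashes and hyphens so both forms compare equal.
normalized = sample.str.replace(r'[./-]', '', regex=True)
print(normalized.tolist())
```

After this step both values are the same 14-digit string, which is what lets them match `cnpj_cpf` in the merge.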

In [8]:
dataset = pd.merge(reimbursements, companies, left_on='cnpj_cpf', right_on='cnpj')
dataset.shape

(176005, 259)
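
The row count falls from 191,724 to 176,005 across this merge. Since `pd.merge` defaults to an inner join, that suggests some reimbursements had no matching company. One way to inspect those rows is a left join with `indicator=True`, sketched here on toy data; the frames `expenses` and `suppliers` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two datasets; the values are invented.
expenses = pd.DataFrame({'document_id': [1, 2, 3],
                         'cnpj_cpf': ['111', '222', '333']})
suppliers = pd.DataFrame({'cnpj': ['111', '333'],
                          'city': ['BRASILIA', 'SAO PAULO']})

# A left join keeps every expense; `_merge` records where each row came from.
checked = pd.merge(expenses, suppliers, how='left',
                   left_on='cnpj_cpf', right_on='cnpj', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['document_id'].tolist())
```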

In [9]:
dataset.iloc[0]

year                                                               2011
applicant_id                                                       2303
document_id                                                     2003049
reimbursement_value_total                                           NaN
total_net_value                                                      80
reimbursement_numbers                                              3554
congressperson_name                                       RONALDO ZULKE
congressperson_id                                                160594
congressperson_document                                             515
term                                                               2011
state_x                                                              RS
party                                                                PT
term_id                                                              54
subquota_number                                                 

Remove party leaderships from the dataset before calculating the ranking.

In [15]:
dataset = dataset[dataset['congressperson_id'].notnull()]
dataset.shape

(175071, 259)

And also remove companies mistakenly geolocated outside of Brazil.

In [45]:
is_in_brazil = (dataset['longitude'] < -34.7916667) & \
    (dataset['latitude'] < 5.2722222) & \
    (dataset['latitude'] > -33.742222) & \
    (dataset['longitude'] > -73.992222)
dataset = dataset[is_in_brazil]
dataset.shape

(168568, 259)

In [38]:
# keys = ['applicant_id', 'issue_date']
keys = ['congressperson_name', 'issue_date']
aggregation = dataset.groupby(keys)['total_net_value']. \
    agg(sum='sum', expenses=len, mean='mean')

In [39]:
aggregation['expenses'] = aggregation['expenses'].astype(int)

In [43]:
aggregation.sort_values(['expenses', 'sum'], ascending=[False, False]).head(10)

congressperson_name,issue_date,sum,expenses,mean
CELSO MALDANER,2011-09-05,750.28,13,57.713846
JOSÉ PAULO TÓFFANO,2010-04-27,500.47,12,41.705833
SANDRA ROSADO,2012-01-12,333.4,12,27.783333
SANDRA ROSADO,2012-01-17,287.43,12,23.9525
SANDRA ROSADO,2012-01-06,281.75,12,23.479167
LÉO VIVAS,2010-08-31,630.0,11,57.272727
SANDRA ROSADO,2012-01-11,541.56,11,49.232727
PAULO WAGNER,2011-07-21,537.66,11,48.878182
SANDRA ROSADO,2015-01-07,396.6,11,36.054545
SANDRA ROSADO,2012-01-15,295.58,11,26.870909


In [50]:
len(aggregation[aggregation['expenses'] > 7])

35

In [74]:
keys = ['congressperson_name', 'issue_date']
cities = dataset.groupby(keys)['city']. \
    agg(city_list=lambda x: ','.join(set(x)), city=lambda x: len(set(x))) \
    .sort_values('city', ascending=False)

In [70]:
cities.head()

congressperson_name,issue_date,city_list,city
TAKAYAMA,2014-06-25,"GUARAPUAVA,FERNANDES PINHEIRO,PEABIRU,CEU AZUL...",6
ZECA DIRCEU,2012-02-14,"BRASILIA,MARINGA,PARANAVAI,GUARULHOS,PAICANDU",5
RICARDO IZAR,2014-04-26,"LINS,SAO PAULO,PRAIA GRANDE,BAURU,BOITUVA",5
PAULO FERREIRA,2013-02-08,"IPAMERI,EMBU DAS ARTES,BRASILIA,IGARAPAVA,LIMEIRA",5
MARGARIDA SALOMÃO,2014-12-02,"BARBACENA,JUIZ DE FORA,BRASILIA,BELO HORIZONTE...",5


In [71]:
cities[cities['city'] >= 4].shape

(127, 2)

It would be helpful for our analysis to have a new column containing the distance traveled on each given day.

In [49]:
from geopy.distance import geodesic as distance  # geopy 2.0 removed vincenty
from IPython.display import display

x = dataset.iloc[0]
display(x[['cnpj', 'city', 'state_y']])
y = dataset.iloc[20]
display(y[['cnpj', 'city', 'state_y']])
distance(x[['latitude', 'longitude']],
         y[['latitude', 'longitude']])

cnpj       72614977000290
city             BRASILIA
state_y                DF
Name: 0, dtype: object

cnpj       72614977000290
city             BRASILIA
state_y                DF
Name: 20, dtype: object

Distance(0.0)

In [89]:
dataset.shape

(168568, 259)

In [90]:
dataset[['latitude', 'longitude']].dropna().shape

(168568, 2)

In [None]:
from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def calculate_distances(x):
    coordinate_list = x[['latitude', 'longitude']].values
    distance_list = [distance(*coordinates_pair).km
                     for coordinates_pair in pairwise(coordinate_list)]
    return np.nansum(distance_list)

distances = dataset.groupby(keys).apply(calculate_distances)
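
To make the behaviour of `pairwise` concrete, the sketch below runs it on placeholder stops instead of real coordinates. It shows that the daily total is a sum over consecutive legs in row order, and that a day with a single expense contributes a distance of zero. The stop labels and the `1.0` leg length are made up.

```python
from itertools import tee

import numpy as np


def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


# Three stops yield two consecutive legs, in the order the rows appear.
legs = list(pairwise(['A', 'B', 'C']))
print(legs)

# A single stop yields no legs, and np.nansum of an empty list is 0.0.
single_stop_total = np.nansum([1.0 for _ in pairwise(['A'])])
print(single_stop_total)
```

Because only consecutive rows are compared, the total depends on how expenses are ordered within a day, and the data has no time of day to order them by.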

In [108]:
distances = distances.reset_index() \
    .rename(columns={0: 'distance_traveled'}) \
    .sort_values('distance_traveled', ascending=False)
distances.head()

Unnamed: 0,congressperson_name,issue_date,distance_traveled
112369,SANDRA ROSADO,2012-09-04,6969.64386
112201,SANDRA ROSADO,2012-01-12,6965.526389
112210,SANDRA ROSADO,2012-01-23,6833.928798
112221,SANDRA ROSADO,2012-02-08,5333.782132
112295,SANDRA ROSADO,2012-06-12,5309.248049
