-
Notifications
You must be signed in to change notification settings - Fork 12
/
2017-03-15-fabiocorreacordeiro-taxi-descriptive-analysis.py
220 lines (119 loc) · 5.34 KB
/
2017-03-15-fabiocorreacordeiro-taxi-descriptive-analysis.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
# coding: utf-8
# # Descriptive analysis of local transportation
#
# This is Exploratory Descriptive Analisys of expendure with local transportation. So we will create three notebook to explore the data of the follows subquotas: 'Taxi, toll and parking', 'Automotive vehicle renting or charter' and 'Fuels and lubricants'. We basically used the same analysis used by Irio in his descriptive analysis of all dataset in https://github.com/datasciencebr/serenata-de-amor/blob/master/develop/2016-08-13-irio-descriptive-analysis.ipynb
#
# The anlisys of others subquota ara in:
#
# 'Automotive vehicle renting or charter': work in progress
#
# 'Fuels and lubricants': work in progress
# ### This first notebook is about subquota 'Taxi, toll and parking'
# Importing the dataset
# In[111]:
get_ipython().magic('matplotlib inline')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import math
# In[30]:
data = pd.read_csv('../data/2017-03-14-reimbursements.xz',
parse_dates=[16],
low_memory=False,
dtype={'document_id': np.str,
'congressperson_id': np.str,
'congressperson_document': np.str,
'term_id': np.str,
'cnpj_cpf': np.str,
'reimbursement_number': np.str,})
# All reimbursement of subquota 'Taxi, toll and parking' will be set in **data_taxi** variable, this is the main data who we will will use in this notebook.
# In[31]:
data_taxi = data[data['subquota_description'] == 'Taxi, toll and parking']
# Since 2013 was 70.088 documents of reimbursement.
# In[32]:
print(data_taxi.shape)
# In[33]:
data_taxi.head()
# In[34]:
data_taxi.iloc[0]
# All the expedures sums more than **R$2.2 Million**
# In[36]:
data_taxi['net_values'].sum()
# In[37]:
data_taxi['net_values'].describe()
# In[38]:
sns.distplot(data_taxi['net_values'])
# The most expensive reimbursement was **R$2.500**
# In[39]:
most_expensive_reimbursement = data_taxi[data_taxi['net_values'] == data_taxi['net_values'].max()].iloc[0]
most_expensive_reimbursement
# Here we considered outliers all 0,5% biggest reimbursement values. We found 350 outliers in a total of R$338,060.46.
# In[71]:
data_taxi = data_taxi.sort_values('net_values', ascending=False)
outliers = data_taxi.head(math.floor(len(data_taxi)*0.005))
sns.distplot(outliers['net_values'])
# In[73]:
len(outliers)
# In[74]:
outliers['net_values'].sum()
# In[75]:
outliers
# ## Who are these applicants?
# In total 509 congressperson ask by reimbursement and most of them (75%) received R$5,364.90.
#
# But some congresperson received more them R$50.000,00 in the same perriod.
# In[77]:
len(data_taxi['applicant_id'].unique())
# In[78]:
applicants_by_net_value = pd.DataFrame(data_taxi.groupby(['applicant_id'], as_index=False).sum()[['applicant_id', 'net_values']])
# In[79]:
congressperson_list = data_taxi[
['applicant_id', 'congressperson_name', 'party', 'state']]
congressperson_list = congressperson_list. drop_duplicates('applicant_id', keep='first')
ranking = pd.merge(applicants_by_net_value,
congressperson_list,
how='left',
on='applicant_id').sort_values('net_values', ascending=False)
ranking.head(10)
# In[80]:
ranking['net_values'].describe()
# In[81]:
graph = sns.barplot(x='congressperson_name',
y='net_values',
data=ranking)
graph.axes.get_xaxis().set_ticks([]); None
# # Suppliers
# There are 1.797 suppliers, two of them are more than R$480.000. They are SINPETAXI, syndicate of taxi from Brasilia and CENTRO DE GESTAO DE MEIOS DE PAGAMENTO S.A., owner of SEM PARAR, company of payment of toll.
# In[83]:
suppliers_list = pd.DataFrame(data_taxi.groupby(['cnpj_cpf'], as_index=False)['net_values'].sum())
suppliers_list = suppliers_list.sort_values('net_values', ascending=False)
suppliers_list
# In[84]:
suppliers_list['net_values'].describe()
# In[52]:
sns.distplot(suppliers_list['net_values'])
# Here we considered outliers all 1% biggest suppliers. We found 17 outliers in a total of R$1.439.806.92
#
# R$1.008.936,47 from SINPETAXI and SEM PARAR
#
# R$430.870,44 from other companies
# In[102]:
outliers_suppliers = suppliers_list.head(math.floor(len(suppliers_list)*0.01))
sns.distplot(outliers_suppliers['net_values'])
# In[103]:
outliers_suppliers['net_values'].describe()
# In[104]:
outliers_suppliers['net_values'].sum()
# In[106]:
outliers_suppliers['net_values'].head(2).sum()
# In[108]:
outliers_suppliers['net_values'].tail(len(outliers_suppliers)-2).sum()
# In[54]:
outliers_suppliers = suppliers_list[~suppliers_list.isin(data_wo_outliers_suppliers)['cnpj_cpf']]
print(len(outliers_suppliers), len(outliers_suppliers) / len(suppliers_list))
# # Conclusion and next steps
# We could note there a long tail in the reimbursements, there a group of congresspeople who expend much more then others. We could observe too, two big suppliers, a taxi syndicate from Brasilia and a company of payment of toll. The sugestion of next stepes are:
# - split data of taxi, parking and toll.
# - check the expenses in the time (by year, month and day)
# - cross the data of "Taxi, toll and parking" with 'Automotive vehicle renting or charter' and 'Fuels and lubricants'