# Exploratory analysis on CEAP consultancies

The idea of this notebook is to offer an overview of congresspeople's expenses with consultancies. It's simpler than a proper exploratory analysis, but I hope it helps and encourages more analysis related to this subquota.

Let's get started by loading the data we have about the reimbursements:

In [1]:
import numpy as np
import pandas as pd


reimbursements = pd.read_csv(
    '../data/2016-11-19-reimbursements.xz',
    dtype={'cnpj_cpf': str},  # np.str was removed from recent NumPy; the builtin str is equivalent
    low_memory=False
)
reimbursements.shape

(1532491, 31)
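Reading `cnpj_cpf` as text matters because these identifiers can start with a zero. The snippet below is a minimal sketch on a made-up two-row CSV (the `value` column and the sample itself are assumptions for illustration, not part of the real dataset) showing what happens with and without the `dtype` hint:

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real file has 31 columns.
sample_csv = io.StringIO(
    "cnpj_cpf,value\n"
    "07601817000164,100.0\n"
    "13302066000188,250.5\n"
)

# With the dtype hint the identifier stays a string, leading zero included.
as_text = pd.read_csv(sample_csv, dtype={'cnpj_cpf': str})

# Without it pandas infers an integer column and the leading zero is lost.
sample_csv.seek(0)
as_number = pd.read_csv(sample_csv)

print(as_text.cnpj_cpf.tolist())
print(as_number.cnpj_cpf.tolist())
```

Keeping the identifier as text also makes it safe to paste into URLs or join against other datasets later on.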

A quick look at all subquotas, just to make sure we pick the right one when filtering expenses with consultancies:

In [2]:
keys = ['subquota_number', 'subquota_description']
reimbursements[keys].groupby(keys).count().reset_index()

Unnamed: 0,subquota_number,subquota_description
0,1,Maintenance of office supporting parliamentary...
1,2,"Locomotion, meal and lodging"
2,3,Fuels and lubricants
3,4,"Consultancy, research and technical work"
4,5,Publicity of parliamentary activity
5,6,Purchase of office supplies
6,7,Software purchase or renting; Postal services;...
7,8,Security service provided by specialized company
8,9,Flight tickets
9,10,Telecommunication


In [3]:
consultancies = reimbursements[reimbursements.subquota_number == 4]
consultancies.shape

(21477, 31)
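The filter above hard-codes subquota number 4 after reading it off the table. As a sketch of an alternative, the number could be looked up from the description instead. The toy DataFrame below is a hypothetical miniature of the reimbursements table, and the search string `'Consultancy'` is an assumption based on the description shown above:

```python
import pandas as pd

# Hypothetical miniature of the reimbursements table.
toy = pd.DataFrame({
    'subquota_number': [3, 4, 4, 9],
    'subquota_description': [
        'Fuels and lubricants',
        'Consultancy, research and technical work',
        'Consultancy, research and technical work',
        'Flight tickets',
    ],
})

# Find the subquota number(s) whose description mentions consultancy.
is_consultancy = toy.subquota_description.str.contains('Consultancy', regex=False)
numbers = toy.loc[is_consultancy, 'subquota_number'].unique()

# Guard against the description matching more than one subquota.
assert len(numbers) == 1

toy_consultancies = toy[toy.subquota_number == numbers[0]]
print(numbers[0], toy_consultancies.shape)
```

The guard is there so that an ambiguous match fails early rather than silently filtering on the wrong subquota.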

## Counting where congresspeople spend on consultancy

This first grouping looks for cases in which a congressperson has many expenses with consultancies, but all or most of them are with the very same company.

First, let's see how many reimbursements each congressperson had for each consultancy.

In [4]:
cols = ['applicant_id', 'congressperson_name', 'cnpj_cpf']
count_per_consultancy = consultancies[cols] \
            .groupby(cols) \
            .size() \
            .to_frame('count_per_consultancy') \
            .reset_index() \
            .sort_values('count_per_consultancy', ascending=False)
count_per_consultancy.head()

Unnamed: 0,applicant_id,congressperson_name,cnpj_cpf,count_per_consultancy
2079,1935,SÉRGIO MORAES,07601817000164,91
1437,1782,ANTONIO BULHÕES,07689420000176,86
1279,1703,WELLINGTON ROBERTO,05560288000172,86
1867,1889,MARCOS MONTES,04689393000143,85
678,1347,PAULO MAGALHÃES,06253998000112,83


## Counting the total reimbursements congresspeople had in consultancies

Now let's see the total number of consultancy reimbursements per congressperson.

In [5]:
consultancies_count = consultancies.groupby('applicant_id') \
                        .size() \
                        .to_frame('total_consultancies') \
                        .reset_index() \
                        .sort_values('total_consultancies', ascending=False)
consultancies_count.head()

Unnamed: 0,applicant_id,total_consultancies
437,1889,168
282,1627,147
430,1881,132
316,1703,131
468,1922,127


## Finding congresspeople loyal to a specific consultancy

In [6]:
# applicant_id is the only column shared by both frames, so name it explicitly
consultancies_grouped = count_per_consultancy.merge(consultancies_count, on='applicant_id')
consultancies_grouped['percentage'] = \
    consultancies_grouped.count_per_consultancy / consultancies_grouped.total_consultancies
consultancies_grouped.sort_values('percentage', ascending=False)

Unnamed: 0,applicant_id,congressperson_name,cnpj_cpf,count_per_consultancy,total_consultancies,percentage
3896,1202,ARNALDO MADEIRA,02308464000195,1,1,1.000000
2983,1924,ROBERTO BRITTO,09514328000109,5,5,1.000000
3152,2343,RUI COSTA,09393493000141,4,4,1.000000
3151,1895,MIGUEL MARTINI,08909036000102,4,4,1.000000
3107,1772,ABELARDO CAMARINHA,19146821000169,4,4,1.000000
3106,2818,EURICO JÚNIOR,15199489000140,4,4,1.000000
3096,1637,FÁBIO SOUTO,10489737000173,4,4,1.000000
3078,1254,MARCUS VICENTE,09613772000173,4,4,1.000000
3016,3143,PASTOR LUCIANO BRAGA,13683258000181,5,5,1.000000
3015,3084,MARX BELTRÃO,03397255000128,5,5,1.000000


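Sorting by percentage alone isn't very helpful, since the top rows are congresspeople with only a handful of reimbursements. So let's require a minimum of 10 consultancy expenses at the same company, and a ratio of at least 80% of consultancy expenses made at this same company: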

In [7]:
results = consultancies_grouped \
    .query('count_per_consultancy >= 10') \
    .query('percentage >= 0.8')
results

Unnamed: 0,applicant_id,congressperson_name,cnpj_cpf,count_per_consultancy,total_consultancies,percentage
0,1935,SÉRGIO MORAES,07601817000164,91,92,0.989130
2,1782,ANTONIO BULHÕES,07689420000176,86,88,0.977273
37,1096,JOVAIR ARANTES,09613772000173,82,88,0.931818
40,184,SIMÃO SESSIM,03262796000149,79,91,0.868132
44,775,BETO MANSUR,09230540000136,72,72,1.000000
45,1686,RUBENS OTONI,08899265000185,71,76,0.934211
66,2256,JORGE CÔRTE REAL,08760771000199,65,77,0.844156
94,1677,REGINALDO LOPES,09613772000173,61,62,0.983871
96,1488,MAURO BENEVIDES,08790790000168,60,60,1.000000
103,2295,ANTONIO IMBASSAHY,13302066000188,58,68,0.852941


There are 126 congresspeople who consistently use the same consultancy. This is **not** illegal _per se_, but it might be an indicator of something. If anyone wants to dig deeper, here are the Jarbas links for each of these cases:

In [8]:
def jarbas_link(row):
    base_url = (
        'https://jarbas.serenatadeamor.org/#/'
        'applicantId/{applicant_id}/'
        'cnpjCpf/{cnpj_cpf}/'
        'subquotaId/4'
    )
    url = str(base_url.format(**row))
    return '<a href="{}">Jarbas</a>'.format(url)

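# assign returns a new DataFrame, avoiding a SettingWithCopyWarning on the filtered frame
results = results.assign(url=results.apply(jarbas_link, axis=1))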
links = results[[
    'congressperson_name',
    'count_per_consultancy',
    'total_consultancies',
    'percentage',
    'url'
]]

from IPython.display import HTML
# None disables truncation; the old -1 value was deprecated in pandas 1.0
pd.set_option('display.max_colwidth', None)
HTML(links.to_html(escape=False))

Unnamed: 0,congressperson_name,count_per_consultancy,total_consultancies,percentage,url
0,SÉRGIO MORAES,91,92,0.98913,Jarbas
2,ANTONIO BULHÕES,86,88,0.977273,Jarbas
37,JOVAIR ARANTES,82,88,0.931818,Jarbas
40,SIMÃO SESSIM,79,91,0.868132,Jarbas
44,BETO MANSUR,72,72,1.0,Jarbas
45,RUBENS OTONI,71,76,0.934211,Jarbas
66,JORGE CÔRTE REAL,65,77,0.844156,Jarbas
94,REGINALDO LOPES,61,62,0.983871,Jarbas
96,MAURO BENEVIDES,60,60,1.0,Jarbas
103,ANTONIO IMBASSAHY,58,68,0.852941,Jarbas
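The rendered table hides the actual URLs behind the link text. As a minimal sketch of what one of them looks like, the standalone helper below (the name `jarbas_link_sketch` and its `subquota_id` parameter are hypothetical, not part of Jarbas or of the cell above) builds the link for the first row of the results using only the standard library:

```python
from html import escape


def jarbas_link_sketch(applicant_id, cnpj_cpf, subquota_id=4):
    """Build an HTML anchor following the URL pattern used in this notebook."""
    url = (
        'https://jarbas.serenatadeamor.org/#/'
        f'applicantId/{applicant_id}/'
        f'cnpjCpf/{cnpj_cpf}/'
        f'subquotaId/{subquota_id}'
    )
    # Escape the URL before placing it inside the href attribute.
    return '<a href="{}">Jarbas</a>'.format(escape(url, quote=True))


# Applicant and company taken from the first row of the results table.
print(jarbas_link_sketch(1935, '07601817000164'))
```

Passing the company identifier as a string keeps its leading zero in the URL.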
