# Conexão Reporter - Exploratory Analysis



In [1]:
import numpy as np
import pandas as pd

In [2]:
reimbursements = pd.read_csv('../data/2017-07-04-reimbursements.xz',
                      dtype={'applicant_id': str,  # np.str and np.int were removed from recent NumPy releases
                             'cnpj_cpf': str,
                             'congressperson_id': str,
                             'subquota_number': str,
                             'document_id': int},
                      low_memory=False)

In [3]:
reimbursements.columns

Index(['year', 'applicant_id', 'document_id', 'reimbursement_value_total',
       'total_net_value', 'reimbursement_numbers', 'congressperson_name',
       'congressperson_id', 'congressperson_document', 'term', 'state',
       'party', 'term_id', 'subquota_number', 'subquota_description',
       'subquota_group_id', 'subquota_group_description', 'supplier',
       'cnpj_cpf', 'document_number', 'document_type', 'issue_date',
       'document_value', 'remark_value', 'net_values', 'month', 'installment',
       'passenger', 'leg_of_the_trip', 'batch_number', 'reimbursement_values'],
      dtype='object')

## Find current term biggest consumers

The idea is to find the top 3 lower house representatives elected in the last election with the highest total reimbursements, considering all their years in office. The current term started in 2015.

We can find a table of lower house representatives in office using [this link](http://www2.camara.leg.br/deputados/pesquisa/arquivos/arquivo-formato-excel-com-informacoes-dos-deputados-1), but it lists only 512 of the 513 representatives. So it is safer to list the congresspeople using the reimbursements.

In [4]:
current_term_reimbursements = reimbursements[reimbursements['year'] >= 2015].reset_index(drop=True)
current_term_reimbursements.shape

(466273, 31)

In [5]:
len(list(set(current_term_reimbursements['congressperson_name'])))

808

Remember that this number doesn't match 513 because the dataset also has names for some party leaderships and alternates.
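
One way to see how much of the gap comes from non-individual entries is to split the names by whether they carry a `congressperson_id`. The sketch below uses toy data with made-up names, and the rule "leadership rows have no `congressperson_id`" is an assumption that should be checked against the real file. It does not separate alternates from elected representatives.

```python
import pandas as pd

# Toy data with hypothetical names; the real file may encode
# leadership entries differently (assumption for illustration only).
toy = pd.DataFrame({
    'congressperson_name': ['DEPUTY A', 'DEPUTY B', 'PARTY LEADERSHIP X', 'DEPUTY A'],
    'congressperson_id': ['1', '2', None, '1'],
})

# Rows with an id are treated as individual congresspeople
has_id = toy['congressperson_id'].notna()
individuals = toy.loc[has_id, 'congressperson_name'].nunique()
other_entries = toy.loc[~has_id, 'congressperson_name'].nunique()
print(individuals, other_entries)
```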

In [6]:
current_term_congresspeople = list(set(current_term_reimbursements['congressperson_name']))

sorted(current_term_congresspeople)

['ABEL MESQUITA JR.',
 'ABELARDO CAMARINHA',
 'ABELARDO LUPION',
 'ACELINO POPÓ',
 'ADAIL CARNEIRO',
 'ADALBERTO CAVALCANTI',
 'ADELMO CARNEIRO LEÃO',
 'ADELSON BARRETO',
 'ADEMIR CAMILO',
 'ADILTON SACHETTI',
 'ADRIAN',
 'ADÉRMIS MARINI',
 'AELTON FREITAS',
 'AFONSO FLORENCE',
 'AFONSO HAMM',
 'AFONSO MOTTA',
 'AGUINALDO RIBEIRO',
 'AKIRA OTSUBO',
 'ALAN RICK',
 'ALBERTO FILHO',
 'ALBERTO FRAGA',
 'ALCEU MOREIRA',
 'ALESSANDRO MOLON',
 'ALEX CANZIANI',
 'ALEX MANENTE',
 'ALEXANDRE BALDY',
 'ALEXANDRE LEITE',
 'ALEXANDRE ROSO',
 'ALEXANDRE SANTOS',
 'ALEXANDRE SERFIOTIS',
 'ALEXANDRE SILVEIRA',
 'ALEXANDRE TOLEDO',
 'ALEXANDRE VALLE',
 'ALFREDO KAEFER',
 'ALFREDO NASCIMENTO',
 'ALFREDO SIRKIS',
 'ALICE PORTUGAL',
 'ALIEL MACHADO',
 'ALINE CORRÊA',
 'ALMEIDA LIMA',
 'ALTINEU CÔRTES',
 'ALUISIO MENDES',
 'AMAURI TEIXEIRA',
 'AMIR LANDO',
 'ANA PERUGINI',
 'ANDERSON FERREIRA',
 'ANDRE MOURA',
 'ANDREIA ZITO',
 'ANDRES SANCHEZ',
 'ANDRÉ ABDON',
 'ANDRÉ AMARAL',
 'ANDRÉ DE PAULA',
 'ANDRÉ F

People elected for this term may have been elected before. I want to take into account expenses made in previous terms as well.

In [7]:
reimbursements.shape

(1619213, 31)

In [8]:
filtered_reimbursements = reimbursements[reimbursements['congressperson_name'].isin(current_term_congresspeople)]

In [9]:
filtered_reimbursements.shape

(1375556, 31)
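
Because the filter above matches on names, two different people sharing a name would be merged into one. A possible sanity check, sketched here on toy data with hypothetical names and ids, is to count how many distinct `applicant_id` values each name maps to.

```python
import pandas as pd

# Toy data (hypothetical): 'DEPUTY B' appears under two applicant ids
toy = pd.DataFrame({
    'congressperson_name': ['DEPUTY A', 'DEPUTY A', 'DEPUTY B', 'DEPUTY B'],
    'applicant_id': ['1', '1', '2', '3'],
})

# Number of distinct applicant ids behind each name
ids_per_name = toy.groupby('congressperson_name')['applicant_id'].nunique()

# Names that map to more than one id deserve a manual look
ambiguous_names = ids_per_name[ids_per_name > 1].index.tolist()
print(ambiguous_names)
```

An empty list on the real data would suggest that filtering by name and filtering by id select the same rows.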

In [10]:
keys = ['congressperson_name', 'congressperson_id', 'applicant_id']
grouped_by_representative = filtered_reimbursements.groupby(keys)['total_net_value'].sum() \
                                    .reset_index() \
                                    .rename(columns={'total_net_value': 'sum'})

In [11]:
grouped_by_representative = grouped_by_representative.sort_values('sum', ascending=False).head().reset_index(drop=True)
grouped_by_representative

|   | congressperson_name | congressperson_id | applicant_id | sum |
|---|---|---|---|---|
| 0 | WELLINGTON ROBERTO | 74043 | 1703 | 3106049.82 |
| 1 | CLEBER VERDE | 141408 | 1804 | 2975434.05 |
| 2 | RAIMUNDO GOMES DE MATOS | 74216 | 1244 | 2860739.64 |
| 3 | RENATO MOLLING | 141527 | 1922 | 2809970.46 |
| 4 | EFRAIM FILHO | 141422 | 1823 | 2774536.47 |
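
The group, sum, and rank steps can also be written with `nlargest`, which avoids sorting the whole frame. This is a minimal sketch on toy data with hypothetical names and values, not the notebook's own cell.

```python
import pandas as pd

# Toy reimbursements (hypothetical names and values)
toy = pd.DataFrame({
    'congressperson_name': ['DEPUTY A', 'DEPUTY A', 'DEPUTY B', 'DEPUTY C'],
    'total_net_value': [100.0, 50.0, 120.0, 30.0],
})

# Total reimbursed per congressperson
totals = (toy.groupby('congressperson_name', as_index=False)['total_net_value']
             .sum()
             .rename(columns={'total_net_value': 'sum'}))

# Keep only the two largest totals, highest first
top_2 = totals.nlargest(2, 'sum').reset_index(drop=True)
print(top_2)
```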


The Chamber of Deputies page for each congressperson in the table above:

In [12]:
pd.set_option('display.max_colwidth', None) # None stops pandas from truncating the html (older versions used -1)

congressperson_url = '<a href="http://www.camara.leg.br/internet/deputado/dep_Detalhe.asp?id={0}">link</a>'
grouped_by_representative['url'] = grouped_by_representative \
                                    .apply(lambda row: congressperson_url.format(row['congressperson_id']), axis=1)
grouped_by_representative

|   | congressperson_name | congressperson_id | applicant_id | sum | url |
|---|---|---|---|---|---|
| 0 | WELLINGTON ROBERTO | 74043 | 1703 | 3106049.82 | `<a href="http://www.camara.leg.br/internet/deputado/dep_Detalhe.asp?id=74043">link</a>` |
| 1 | CLEBER VERDE | 141408 | 1804 | 2975434.05 | `<a href="http://www.camara.leg.br/internet/deputado/dep_Detalhe.asp?id=141408">link</a>` |
| 2 | RAIMUNDO GOMES DE MATOS | 74216 | 1244 | 2860739.64 | `<a href="http://www.camara.leg.br/internet/deputado/dep_Detalhe.asp?id=74216">link</a>` |
| 3 | RENATO MOLLING | 141527 | 1922 | 2809970.46 | `<a href="http://www.camara.leg.br/internet/deputado/dep_Detalhe.asp?id=141527">link</a>` |
| 4 | EFRAIM FILHO | 141422 | 1823 | 2774536.47 | `<a href="http://www.camara.leg.br/internet/deputado/dep_Detalhe.asp?id=141422">link</a>` |


In [13]:
from IPython.display import HTML
HTML(grouped_by_representative.to_html(escape=False))

|   | congressperson_name | congressperson_id | applicant_id | sum | url |
|---|---|---|---|---|---|
| 0 | WELLINGTON ROBERTO | 74043 | 1703 | 3106049.82 | link |
| 1 | CLEBER VERDE | 141408 | 1804 | 2975434.05 | link |
| 2 | RAIMUNDO GOMES DE MATOS | 74216 | 1244 | 2860739.64 | link |
| 3 | RENATO MOLLING | 141527 | 1922 | 2809970.46 | link |
| 4 | EFRAIM FILHO | 141422 | 1823 | 2774536.47 | link |


Interestingly, none of the top 5 is Bonifacio Andrada, the lower house representative with the highest number of suspicions. So let's load the suspicions file and investigate a little further.

In [14]:
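# read applicant_id as str so it compares equal to reimbursements['applicant_id'] in recent pandas
suspicions = pd.read_csv('../data/2017-07-04-suspicions.xz', dtype={'applicant_id': str})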
suspicions.head()

|   | applicant_id | year | document_id | meal_price_outlier | over_monthly_subquota_limit | suspicious_traveled_speed_day | invalid_cnpj_cpf | election_expenses | irregular_companies_classifier |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 2009 | 1564212 | False | False | False | False | False | False |
| 1 | 1001 | 2009 | 1564223 | False | False | False | False | False | False |
| 2 | 1001 | 2009 | 1568039 | False | False | False | False | False | False |
| 3 | 1001 | 2009 | 1568056 | False | False | False | False | False | False |
| 4 | 1001 | 2009 | 1568098 | False | False | False | False | False | False |


In [15]:
suspicions.shape

(1619213, 9)

Now let's filter the whole suspicions dataset down to the rows that correspond to our current term.

In [32]:
current_term_aplicant_ids = list(set(current_term_reimbursements['applicant_id']))
suspicions_current_term = suspicions[suspicions['applicant_id'].isin(current_term_aplicant_ids)]
suspicions_current_term.shape

(1375556, 9)

In [33]:
# a row is suspicious when any classifier column (everything after
# applicant_id, year and document_id) is True; the column-wise any()
# is much faster than a row-by-row apply, and assign() returns a new
# DataFrame, so pandas raises no SettingWithCopyWarning
is_suspect = suspicions_current_term.iloc[:, 3:].any(axis=1)
suspicions_current_term = suspicions_current_term.assign(suspicious=is_suspect)


In [34]:
only_suspicions_current_term = suspicions_current_term[suspicions_current_term['suspicious']]

In [35]:
suspicions_current_term.shape

(1375556, 10)

In [36]:
only_suspicions_current_term.shape

(7125, 10)
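
A possible next step is to count suspicious reimbursements per congressperson, which would let us check the claim about who has the most suspicions. The sketch below uses toy frames with hypothetical names and only two of the classifier columns. It also assumes `document_id` identifies the same reimbursement in both files, which should be verified before running it on the real data.

```python
import pandas as pd

# Toy frames (hypothetical names and values)
reimbursements_toy = pd.DataFrame({
    'document_id': [10, 11, 12, 13],
    'congressperson_name': ['DEPUTY A', 'DEPUTY A', 'DEPUTY B', 'DEPUTY B'],
})
suspicions_toy = pd.DataFrame({
    'document_id': [10, 11, 12, 13],
    'meal_price_outlier': [True, False, False, False],
    'invalid_cnpj_cpf': [False, True, False, True],
})

# A document is suspicious when any classifier flagged it
flag_columns = ['meal_price_outlier', 'invalid_cnpj_cpf']
suspicions_toy['suspicious'] = suspicions_toy[flag_columns].any(axis=1)

# Attach names via document_id (assumed shared key), then count per person
merged = suspicions_toy.merge(reimbursements_toy, on='document_id')
suspicious_counts = (merged.groupby('congressperson_name')['suspicious']
                           .sum()
                           .sort_values(ascending=False))
print(suspicious_counts)
```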

In [None]:
# companies = pd.read_csv('../data/2017-05-21-companies-no-geolocation.xz', low_memory=False)