# Checking federal senate reimbursements

This analysis is a work in progress that shows how the Federal Senate datasets work. As with the `chamber_of_deputies` reimbursements, we need to concatenate all the datasets and clean whatever needs cleaning.
What we need to do:
- [x] Concatenate all nine datasets
- [x] Convert the `date` field to datetime
- [x] Clean the `cnpj_cpf` field
- [x] Check the dataset peculiarities
- [x] Check if a `group_by` is necessary

In [1]:
import pandas as pd
import numpy as np
from datetime import date

FIRST_YEAR = 2008
# range() excludes its end, so this covers every year up to the current one.
NEXT_YEAR = date.today().year + 1

filenames = ['../data/2017-05-09-federal-senate-{}.xz'.format(year) for year in range(FIRST_YEAR, NEXT_YEAR)]

# Read every yearly file and concatenate them in one call,
# instead of growing the DataFrame inside a loop.
dataset = pd.concat([pd.read_csv(filename, encoding='utf-8') for filename in filenames])
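
Since `NEXT_YEAR` comes from today's date, running this cell in a later year asks for files that may not be on disk. A minimal sketch of a more tolerant loader follows, assuming the same filename pattern; `load_existing` is a hypothetical helper, not part of the toolbox.

```python
import os

import pandas as pd


def load_existing(pattern, years):
    """Concatenate the yearly files that exist on disk and skip the rest.

    `pattern` is a path with one `{}` placeholder for the year
    (hypothetical helper, written only for illustration).
    """
    paths = [pattern.format(year) for year in years]
    frames = [pd.read_csv(path, encoding='utf-8')
              for path in paths if os.path.exists(path)]
    if not frames:
        raise FileNotFoundError('no file found for pattern ' + pattern)
    return pd.concat(frames)

# Possible usage with the names defined in the cell above:
# dataset = load_existing('../data/2017-05-09-federal-senate-{}.xz',
#                         range(FIRST_YEAR, NEXT_YEAR))
```

Like the cell above, this keeps the original row index of each file, so index values repeat across years.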

In [2]:
len(dataset)

203547

In [3]:
dataset.head()

Unnamed: 0,year,month,congressperson_name,expense_type,cnpj_cpf,supplier,document_id,date,expense_details,reimbursement_value
0,2008,9,ADA MELLO,"Recruitment of consultancies, advisory service...",,,,,,1235152
1,2008,9,ADA MELLO,"Locomotion, lodging, food, fuels and lubricants",,,,,,3866
2,2008,10,ADA MELLO,"Recruitment of consultancies, advisory service...",,,,,,1235152
3,2008,10,ADA MELLO,"Locomotion, lodging, food, fuels and lubricants",,,,,,261068
4,2008,11,ADA MELLO,"Recruitment of consultancies, advisory service...",,,,,,1235152


In [4]:
dataset['date'] = pd.to_datetime(dataset['date'], errors='coerce')
dataset['cnpj_cpf'] = dataset['cnpj_cpf'].str.replace(r'\D', '', regex=True)
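
To make the two cleaning steps easier to follow, here is a small sketch on toy rows. The values are invented for illustration and are not taken from the Senate files. It assumes a recent pandas, where `str.replace` only treats the pattern as a regular expression when `regex=True` is passed.

```python
import pandas as pd

# Toy rows, invented for illustration only.
sample = pd.DataFrame({
    'date': ['2009-11-12', 'not a date', None],
    'cnpj_cpf': ['02.012.862/0001-60', '004.948.028-63', None],
})

# Unparseable or missing dates become NaT instead of raising an error.
sample['date'] = pd.to_datetime(sample['date'], errors='coerce')

# r'\D' matches any non-digit character, so only the digits remain.
# Missing values stay missing.
sample['cnpj_cpf'] = sample['cnpj_cpf'].str.replace(r'\D', '', regex=True)

print(sample)
```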

In [5]:
dataset[dataset['date'].notnull()].head()

Unnamed: 0,year,month,congressperson_name,expense_type,cnpj_cpf,supplier,document_id,date,expense_details,reimbursement_value
0,2009,12,ACIR GURGACZ,"Rent of real estate for political office, comp...",494802863,GILBERTO PISELO DO NASCIMENTO,,2009-11-12,,5000
1,2009,12,ACIR GURGACZ,Publicity of parliamentary activity,2831112000209,INTERCOM INTERMEDIAÇÕES E COMUNICAÇÃO INTEGRAD...,330.0,2009-09-12,,12620
7,2009,4,ADELMIR SANTANA,"Locomotion, lodging, food, fuels and lubricants",306597001683,Cascol combustíveis para veículos Ltda,106471.0,2009-12-04,,17901
8,2009,4,ADELMIR SANTANA,"Locomotion, lodging, food, fuels and lubricants",358192000102,Ribeiro e Pereira Ltda,77472.0,2009-04-04,,30
9,2009,4,ADELMIR SANTANA,"Locomotion, lodging, food, fuels and lubricants",6098111000169,Tudo de Bom Comércio de alimentos Ltda,10169.0,2009-04-14,,2158


In [6]:
dataset[dataset['expense_details'].notnull()].head()

Unnamed: 0,year,month,congressperson_name,expense_type,cnpj_cpf,supplier,document_id,date,expense_details,reimbursement_value
100,2011,6,ACIR GURGACZ,"National air, water and land transport",2012862000160,TAM LINHAS AÉREAS,957 2429627366,2011-06-22,BILHETE UTILIZADO PELO SENADOR ACIR GURGACZ. T...,55766
101,2011,6,ACIR GURGACZ,"National air, water and land transport",2012862000160,TAM LINHAS AÉREAS,957 2429908821,2011-06-27,BILHETE UTILIZADO PELO SENADOR ACIR GURGACZ. T...,179423
102,2011,6,ACIR GURGACZ,"National air, water and land transport",2012862000160,TAM LINHAS AÉREAS,957-2429627318,2011-06-20,BILHETE UTILIZADO PELO SENADOR ACIR GURGACZ. T...,17823
103,2011,6,ACIR GURGACZ,"National air, water and land transport",2012862000160,TAM LINHAS AÉREAS S.A.,01,2011-10-06,robison pereira- cgb/bsb 10/06/2011,79701
104,2011,6,ACIR GURGACZ,"National air, water and land transport",2012862000160,TAM LINHAS AÉREAS S.A.,01,2011-12-06,acir gurgaz - 12/06 - bsb/p.velho.,91566


In [7]:
(dataset['document_id'].isnull()).sum()

19543

In [8]:
(dataset['document_id'].notnull()).sum()

184004

In [9]:
print(len(dataset['document_id'].unique()))

143582
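
The three counts above can be read together: `unique()` counts the missing value as one entry, while `nunique()` ignores it by default. A short sketch on toy data (invented for illustration) shows the difference and also counts how many rows share a `document_id` with another row.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'document_id': ['01', '01', '330', None, None],
    'reimbursement_value': [100, 200, 300, 400, 500],
})

ids = toy['document_id']
missing = ids.isnull().sum()        # rows without a document_id
present = ids.notnull().sum()       # rows with a document_id
with_nan = len(ids.unique())        # distinct values, missing counted once
distinct = ids.nunique()            # distinct values, missing ignored

# Rows whose non-null document_id appears more than once.
repeated_rows = ids[ids.notnull()].duplicated(keep=False).sum()

print(missing, present, with_nan, distinct, repeated_rows)
```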


## Dataset peculiarities

The dataset has many peculiarities, some of which I already mentioned in [my last notebook](2017-05-02-anaschwendler-translate-senate-dataset.ipynb):
* Until 2013 there was no expense details field; the older datasets already have the column, but it is empty.
* Until 2010 the `National air, water and land transport` and `Private Security Services` expense types did not exist, so when we start translating all the data we need to check whether the dataset has those categories.
* For what we are doing right now, we can start using the `cnpj_cpf` classifier from the beginning, since the data is good enough to use.
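
The category check mentioned in the list could look roughly like this. It is only a sketch: `missing_categories` is a hypothetical helper and the toy frame is invented for illustration.

```python
import pandas as pd

# The two expense types that, according to the list above, are absent
# from the older files.
LATE_CATEGORIES = {
    'National air, water and land transport',
    'Private Security Services',
}


def missing_categories(df, expected=LATE_CATEGORIES):
    """Return the expected expense types that never occur in df
    (hypothetical helper, for illustration)."""
    found = set(df['expense_type'].dropna().unique())
    return set(expected) - found


# Toy frame, invented for illustration only.
toy = pd.DataFrame({'expense_type': [
    'Publicity of parliamentary activity',
    'National air, water and land transport',
]})

print(missing_categories(toy))
```

Running it once per year would show from which year each category starts to appear.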

But there are a few more things to consider:
* There is a total of 203547 reimbursements so far.
* 19543 of them have no `document_id`.
* That leaves 184004 with a `document_id`, and NOT ALL OF THEM ARE UNIQUE, so we need to check whether the reimbursements are recorded like the `chamber_of_deputies` ones, in which case we need to group them by `document_id`.
* The datasets have no `cnpj_cpf`, `supplier`, `document_id`, `date` or `expense_details` values from 2008 until the beginning of 2009.
* The datasets only have complete information after 2011.
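
One way to check the last two points is to compute, per year, the share of missing values in each of those fields. The sketch below assumes the column names shown in the outputs above; `missing_share_by_year` is a hypothetical helper and the toy frame is invented for illustration.

```python
import pandas as pd

FIELDS = ['cnpj_cpf', 'supplier', 'document_id', 'date', 'expense_details']


def missing_share_by_year(df, fields=FIELDS):
    """Fraction of null values per year for each field
    (0.0 means complete, 1.0 means empty). Hypothetical helper."""
    flags = df[fields].isnull()
    # .values avoids index alignment, since the concatenated dataset
    # has repeated index values.
    flags['year'] = df['year'].values
    return flags.groupby('year').mean()


# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    'year': [2008, 2008, 2011, 2011],
    'cnpj_cpf': [None, None, '02012862000160', '02012862000160'],
    'supplier': [None, None, 'A', 'B'],
    'document_id': [None, None, '01', None],
    'date': [None, None, '2011-06-22', '2011-06-27'],
    'expense_details': [None, None, 'x', 'y'],
})

shares = missing_share_by_year(toy)
print(shares)
```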

## Decisions

After all this analysis, we decided to clean up only the `date` and `cnpj_cpf` fields for now, and then do another study covering everything else we can discover by exploring the fields.
That is what will be done. If you want, you can check the progress in [this PR](https://github.com/datasciencebr/serenata-toolbox/pull/53).

Thanks @jtemporal and @cuducos for all the feedback <3