# Comparison between Chamber of Deputies CEAP datasets 1.0 and 2.0

This notebook compares the old Chamber's CEAP dataset (the huge XML files) with the new one (one CSV per year). The main objective of this comparison is to show that we didn't lose any data in the migration from version 1.0 to the much more efficient version 2.0 of the data. This validates the changes to serenata-toolbox, so we can ditch the 1.0 datasets for good and be prepared for their discontinuation by the Chamber's Open Data team.

Let's begin by loading both the old and the new datasets:


In [1]:
import pandas as pd

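# Use the fully qualified option name; the short 'max_columns' pattern
# can be ambiguous in recent pandas versions.
pd.set_option('display.max_columns', 500)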

In [2]:
from serenata_toolbox.datasets import Datasets

datasets = Datasets('../data')
datasets.downloader.download('2017-05-21-reimbursements.old.xz')
datasets.downloader.download('2017-05-21-reimbursements.new.xz')

Downloading 2017-05-21-reimbursements.old.xz: 100%|██████████| 34.0M/34.0M [02:51<00:00, 198Kb/s]
Downloading 2017-05-21-reimbursements.new.xz: 100%|██████████| 34.1M/34.1M [02:22<00:00, 240Kb/s]


In [3]:
old_dataset = pd.read_csv('../data/2017-05-21-reimbursements.old.xz',
                          compression='xz',
                          low_memory=False)

In [4]:
new_dataset = pd.read_csv('../data/2017-05-21-reimbursements.new.xz',
                          compression='xz',
                          low_memory=False)

First we need to check that both datasets have the same columns, and that they are in the same order:

In [5]:
old_keys = old_dataset.keys()
new_keys = new_dataset.keys()

print(old_keys==new_keys)

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True]
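
The element-wise comparison above only works when both datasets have the same number of columns, and its output has to be read by eye. As a minimal sketch (the helper `compare_columns` and the toy frames are hypothetical, not part of serenata-toolbox), the same check can be reduced to a single boolean plus the list of labels found on only one side:

```python
import pandas as pd


def compare_columns(old, new):
    """Summarise how the column labels of two DataFrames differ."""
    return {
        # True only when both frames have the same labels in the same order
        'identical': bool(old.columns.equals(new.columns)),
        # Labels present in one frame but not in the other
        'only_in_old': sorted(set(old.columns) - set(new.columns)),
        'only_in_new': sorted(set(new.columns) - set(old.columns)),
    }


# Toy frames standing in for the real datasets (illustrative values only)
old_toy = pd.DataFrame({'year': [2016], 'document_id': [1], 'party': ['PPS']})
new_toy = pd.DataFrame({'year': [2016], 'document_id': [1]})

report = compare_columns(old_toy, new_toy)
print(report)
```

Called as `compare_columns(old_dataset, new_dataset)`, this should report `identical` as true if the two versions really share the same columns in the same order.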


We can also make sure all columns have the same types:

In [6]:
new_dataset.dtypes == old_dataset.dtypes

year                          True
applicant_id                  True
document_id                   True
reimbursement_value_total     True
total_net_value               True
reimbursement_numbers         True
congressperson_name           True
congressperson_id             True
congressperson_document       True
term                          True
state                         True
party                         True
term_id                       True
subquota_number               True
subquota_description          True
subquota_group_id             True
subquota_group_description    True
supplier                      True
cnpj_cpf                      True
document_number               True
document_type                 True
issue_date                    True
document_value                True
remark_value                  True
net_values                    True
month                         True
installment                   True
passenger                     True
leg_of_the_trip     
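
With this many columns it is easy to overlook a single `False` in the listing. The sketch below assumes both frames already have identical column labels (as checked in the previous step) and returns only the columns whose types differ; `dtype_mismatches` and the toy frames are hypothetical names used for illustration:

```python
import pandas as pd


def dtype_mismatches(old, new):
    """Map each column whose dtype differs to its (old, new) dtype names.

    Assumes both frames have the same column labels.
    """
    same = old.dtypes == new.dtypes
    differing = same[~same].index
    return {col: (str(old[col].dtype), str(new[col].dtype)) for col in differing}


# Toy frames: 'term_id' is float in one and integer in the other
old_toy = pd.DataFrame({'year': [2016], 'term_id': [55.0]})
new_toy = pd.DataFrame({'year': [2016], 'term_id': [55]})

mismatches = dtype_mismatches(old_toy, new_toy)
print(mismatches)
```

An empty dictionary means every column has the same type in both frames.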

Now we can slice the datasets by year and compare their sizes. We first remove the current year (2017), because this ongoing record seems to be updated at a different pace in each version, so comparing it makes no sense:

In [7]:
old_dataset = old_dataset[old_dataset['year'] != 2017]
new_dataset = new_dataset[new_dataset['year'] != 2017]

for year in pd.unique(old_dataset['year']):
    old_size = len(old_dataset[old_dataset['year']==year])
    new_size = len(new_dataset[new_dataset['year']==year])
    print('year: {} old: {} new: {} diff: {}'.format(year, old_size, new_size, new_size-old_size))

year: 2009 old: 171942 new: 171942 diff: 0
year: 2010 old: 204299 new: 204299 diff: 0
year: 2011 old: 213379 new: 213379 diff: 0
year: 2012 old: 197019 new: 197019 diff: 0
year: 2013 old: 194157 new: 194157 diff: 0
year: 2014 old: 172144 new: 172144 diff: 0
year: 2015 old: 208729 new: 208729 diff: 0
year: 2016 old: 200943 new: 200942 diff: -1
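
The loop above only visits years that exist in the old dataset. As an alternative sketch (the helper `rows_per_year` is a hypothetical name, and the toy data is illustrative), the per-year counts can be built with `value_counts`, which also covers a year that appears in only one of the two frames:

```python
import pandas as pd


def rows_per_year(old, new, column='year'):
    """Count rows per year in both frames and compute new minus old."""
    counts = pd.DataFrame({
        'old': old[column].value_counts(),
        'new': new[column].value_counts(),
    })
    # A year missing from one side shows up as NaN; treat it as zero rows
    counts = counts.fillna(0).astype(int).sort_index()
    counts['diff'] = counts['new'] - counts['old']
    return counts


# Toy frames: one row of 2016 is missing from the "new" side
old_toy = pd.DataFrame({'year': [2015, 2015, 2016, 2016]})
new_toy = pd.DataFrame({'year': [2015, 2015, 2016]})

counts = rows_per_year(old_toy, new_toy)
print(counts)
```

Filtering with `counts[counts['diff'] != 0]` then leaves only the years that need a closer look.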


Oddly enough, a single row is missing from the new dataset. Let's find out which document it is and also make sure the exact same document_ids are present in both datasets:

In [8]:
new_docs = list(new_dataset['document_id'])
old_docs = list(old_dataset['document_id'])

old_extra = list(set(old_docs) - set(new_docs))
print('Extra documents found in old dataset: {}'.format(len(old_extra)))

new_extra = list(set(new_docs) - set(old_docs))
print('Extra documents found in new dataset: {}'.format(len(new_extra)))

Extra documents found in old dataset: 1
Extra documents found in new dataset: 0
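
Set differences ignore repeated values, so an id that occurs twice on one side and once on the other would go unnoticed. The following sketch (with the hypothetical helper `compare_ids` and toy data) uses an outer merge with `indicator=True` to find ids present on only one side, and reports repeated ids separately. Whether repeated `document_id` values are legitimate in the CEAP data is not something this sketch assumes either way:

```python
import pandas as pd


def compare_ids(old, new, key='document_id'):
    """Find ids present on only one side and count repeated ids."""
    old_ids = old[[key]].drop_duplicates()
    new_ids = new[[key]].drop_duplicates()
    # indicator=True adds a '_merge' column: left_only, right_only or both
    merged = old_ids.merge(new_ids, on=key, how='outer', indicator=True)
    return {
        'only_in_old': merged.loc[merged['_merge'] == 'left_only', key].tolist(),
        'only_in_new': merged.loc[merged['_merge'] == 'right_only', key].tolist(),
        'duplicated_in_old': int(old[key].duplicated().sum()),
        'duplicated_in_new': int(new[key].duplicated().sum()),
    }


# Toy frames: id 12 exists only in the "old" side, where it appears twice
old_toy = pd.DataFrame({'document_id': [10, 11, 12, 12]})
new_toy = pd.DataFrame({'document_id': [10, 11]})

result = compare_ids(old_toy, new_toy)
print(result)
```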


So there really is only one inconsistency between the datasets. A quick query shows us the culprit:

In [9]:
old_dataset[old_dataset['document_id'].isin(old_extra)]

Unnamed: 0,year,applicant_id,document_id,reimbursement_value_total,total_net_value,reimbursement_numbers,congressperson_name,congressperson_id,congressperson_document,term,state,party,term_id,subquota_number,subquota_description,subquota_group_id,subquota_group_description,supplier,cnpj_cpf,document_number,document_type,issue_date,document_value,remark_value,net_values,month,installment,passenger,leg_of_the_trip,batch_number,reimbursement_values
1560132,2016,831,6271581,,3051.3,5838,RUBENS BUENO,73466.0,460.0,2015.0,PR,PPS,55.0,137,"Participation in course, talk or similar event",0,,INTERNATIONAL FOUNDATION FOR ELECTORAL SYSTEMS,,NS,2,2016-11-06T00:00:00,3051.3,0.0,3051.3,11,0,,,1380250,


Checking the CSV file for 2016 in the 2.0 version of the data confirms that document_id 6271581 really is missing, so it's not a parsing problem on our side. An email was sent to the Chamber's Open Data team so we can understand what is happening.
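
A check like this can be scripted instead of done by hand. The sketch below scans a CSV file row by row for a given id using only the standard library. The helper `csv_contains_id`, the column name, the delimiter and the temporary toy file are all assumptions made for illustration; the real column name and delimiter in the Chamber's files would have to be looked up:

```python
import csv
import os
import tempfile


def csv_contains_id(path, column, wanted, delimiter=','):
    """Return True if any row of the CSV has `wanted` in `column`."""
    wanted = str(wanted)
    with open(path, newline='', encoding='utf-8') as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            # Short rows yield None for missing fields, hence the `or ''`
            if (row.get(column) or '').strip() == wanted:
                return True
    return False


# Build a small toy CSV so the sketch runs on its own (hypothetical content)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    writer = csv.writer(tmp)
    writer.writerow(['document_id', 'value'])
    writer.writerow(['100', '10.5'])
    writer.writerow(['101', '20.0'])
    toy_path = tmp.name

found = csv_contains_id(toy_path, 'document_id', 101)
missing = csv_contains_id(toy_path, 'document_id', 999)
os.remove(toy_path)

print(found, missing)
```

Reading row by row keeps memory use flat, which matters for files as large as the yearly CEAP exports.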