# Invalid CNPJ or CPF

`cnpj_cpf` is the column that identifies the company or individual who received the payment made by the congressperson. An empty value should mean that the expense was made outside Brazil, with a company (or person) that has no Brazilian ID.

In [1]:
import numpy as np
import pandas as pd

# Read cnpj_cpf as a string so the IDs are not parsed as numbers
dataset = pd.read_csv('../data/2017-05-17-federal-senate-reimbursements.xz',
                      dtype={'cnpj_cpf': str}, encoding='utf-8')

In [2]:
dataset = dataset[dataset['cnpj_cpf'].notnull()]
dataset.head()

Unnamed: 0,year,month,congressperson_name,expense_type,cnpj_cpf,supplier,document_id,date,expense_details,reimbursement_value
2449,2009,12,ACIR GURGACZ,"Rent of real estate for political office, comp...",494802863,GILBERTO PISELO DO NASCIMENTO,,2009-11-12,,5000
2450,2009,12,ACIR GURGACZ,Publicity of parliamentary activity,2831112000209,INTERCOM INTERMEDIAÇÕES E COMUNICAÇÃO INTEGRAD...,330.0,2009-09-12,,12620
2456,2009,4,ADELMIR SANTANA,"Locomotion, lodging, food, fuels and lubricants",306597001683,Cascol combustíveis para veículos Ltda,106471.0,2009-12-04,,17901
2457,2009,4,ADELMIR SANTANA,"Locomotion, lodging, food, fuels and lubricants",358192000102,Ribeiro e Pereira Ltda,77472.0,2009-04-04,,30
2458,2009,4,ADELMIR SANTANA,"Locomotion, lodging, food, fuels and lubricants",6098111000169,Tudo de Bom Comércio de alimentos Ltda,10169.0,2009-04-14,,2158


In [3]:
from pycpfcnpj import cpfcnpj

def validate_cnpj_cpf(cnpj_or_cpf):
    # A missing ID counts as valid; anything else is checked by pycpfcnpj
    return cnpj_or_cpf is None or cpfcnpj.validate(cnpj_or_cpf)


cnpj_cpf_list = dataset['cnpj_cpf'].astype(str).replace('nan', None)
dataset['valid_cnpj_cpf'] = np.vectorize(validate_cnpj_cpf)(cnpj_cpf_list)

In [4]:
dataset.query('valid_cnpj_cpf != True').head()

Unnamed: 0,year,month,congressperson_name,expense_type,cnpj_cpf,supplier,document_id,date,expense_details,reimbursement_value,valid_cnpj_cpf
3997,2009,4,CRISTOVAM BUARQUE,"Rent of real estate for political office, comp...",0,Secretaria de Estado de Fazenda do GDF,20102402,2009-06-04,,9828,False
4029,2009,5,CRISTOVAM BUARQUE,Acquisition of consumables for use in the poli...,0,TEC Jet - Jato de Tinta e Toner,9,2009-05-28,,15,False
27395,2010,3,JOÃO DURVAL,"Locomotion, lodging, food, fuels and lubricants",240900000025,Posto Pituba,9544,2010-03-15,,50,False
28584,2010,5,JOSÉ NERY,"Locomotion, lodging, food, fuels and lubricants",478056500043,E. Carvalho Com Navegação Ltda,19108,2010-05-21,,20,False
41257,2011,4,CASILDO MALDANER,Acquisition of consumables for use in the poli...,7388199700010,ECT- Empresa Brasileira de Correios e Telegráfos,18,2011-04-27,,33,False


So the dataset does contain reimbursements whose `cnpj_cpf` is not valid.
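
To make "valid" concrete, here is a minimal sketch of the standard check-digit rule for CPF (11 digits) and CNPJ (14 digits). It is not the `pycpfcnpj` implementation, and `has_valid_check_digits` is a hypothetical helper name used only for illustration. It assumes the input is already a plain string of digits, with no punctuation and no padding of missing leading zeros.

```python
# Weights for the first and second check digit of each document type.
CPF_WEIGHTS = (list(range(10, 1, -1)), list(range(11, 1, -1)))
CNPJ_WEIGHTS = ([5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2],
                [6, 5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2])


def check_digit(digits, weights):
    """Weighted sum modulo 11, mapped to a single check digit."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def has_valid_check_digits(number):
    """Hypothetical helper: True if `number` is a well-formed CPF or CNPJ."""
    if not (number.isascii() and number.isdigit()):
        return False
    if len(number) == 11:
        weight_sets = CPF_WEIGHTS
    elif len(number) == 14:
        weight_sets = CNPJ_WEIGHTS
    else:
        return False
    # Sequences of one repeated digit pass the arithmetic but are
    # conventionally rejected.
    if len(set(number)) == 1:
        return False
    digits = [int(char) for char in number]
    # Each check digit is computed from all the digits that precede it.
    return all(check_digit(digits[:len(weights)], weights) == digits[len(weights)]
               for weights in weight_sets)


print(has_valid_check_digits('11144477735'))     # True  (CPF-shaped example)
print(has_valid_check_digits('11222333000181'))  # True  (CNPJ-shaped example)
print(has_valid_check_digits('0'))               # False (wrong length)
print(has_valid_check_digits('240900000025'))    # False (12 digits)
```

Under this sketch, values such as `0` or `240900000025` from the output above fail on length alone, before any check digit is computed.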

We also need to add a `document_type` column to the dataset so that it fits the core module.

In [5]:
dataset['document_type'] = 'simple_receipt'
dataset.head()

Unnamed: 0,year,month,congressperson_name,expense_type,cnpj_cpf,supplier,document_id,date,expense_details,reimbursement_value,valid_cnpj_cpf,document_type
2449,2009,12,ACIR GURGACZ,"Rent of real estate for political office, comp...",494802863,GILBERTO PISELO DO NASCIMENTO,,2009-11-12,,5000,True,simple_receipt
2450,2009,12,ACIR GURGACZ,Publicity of parliamentary activity,2831112000209,INTERCOM INTERMEDIAÇÕES E COMUNICAÇÃO INTEGRAD...,330.0,2009-09-12,,12620,True,simple_receipt
2456,2009,4,ADELMIR SANTANA,"Locomotion, lodging, food, fuels and lubricants",306597001683,Cascol combustíveis para veículos Ltda,106471.0,2009-12-04,,17901,True,simple_receipt
2457,2009,4,ADELMIR SANTANA,"Locomotion, lodging, food, fuels and lubricants",358192000102,Ribeiro e Pereira Ltda,77472.0,2009-04-04,,30,True,simple_receipt
2458,2009,4,ADELMIR SANTANA,"Locomotion, lodging, food, fuels and lubricants",6098111000169,Tudo de Bom Comércio de alimentos Ltda,10169.0,2009-04-14,,2158,True,simple_receipt
