# Invalid CNPJ or CPF

`cnpj_cpf` is the column identifying the company or individual that received the payment made by the congressperson. An empty value should mean that the expense was made outside Brazil, with a company (or person) that has no Brazilian ID.

In [1]:
import numpy as np
import pandas as pd

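# Read the ID-like columns as strings so leading zeros are preserved.
# The builtin `str` replaces the `np.str` alias, which recent NumPy releases removed.
dataset = pd.read_csv('../data/2016-11-19-reimbursements.xz',
                      dtype={'applicant_id': str,
                             'cnpj_cpf': str,
                             'congressperson_id': str,
                             'subquota_number': str},
                      low_memory=False)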
dataset.shape

(1532491, 31)

In [2]:
from pycpfcnpj import cpfcnpj

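def validate_cnpj_cpf(cnpj_or_cpf):
    # A missing value is accepted here: it is expected for expenses made abroad.
    # `or` short-circuits, so the validator never receives a missing value.
    return pd.isnull(cnpj_or_cpf) or cpfcnpj.validate(cnpj_or_cpf)

# Flag each row whose CNPJ/CPF is either missing or passes validation.
dataset['valid_cnpj_cpf'] = dataset['cnpj_cpf'].apply(validate_cnpj_cpf)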
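
To make "valid" more concrete, here is a minimal sketch of the mod-11 check-digit scheme that CPF (11 digits) and CNPJ (14 digits) numbers follow. It is an illustration only, not the code of `pycpfcnpj`; the function names (`check_digit`, `cpf_weights`, `cnpj_weights`, `is_valid_document`) are hypothetical, and rejecting numbers made of a single repeated digit is assumed here as the usual convention.

```python
def check_digit(digits, weights):
    """Mod-11 check digit: 0 when the remainder is below 2, else 11 - remainder."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def cpf_weights(n):
    # CPF weights descend from n + 1 down to 2 (10..2, then 11..2).
    return list(range(n + 1, 1, -1))


def cnpj_weights(n):
    # CNPJ weights cycle through 2..9, starting from the rightmost digit.
    return [2 + (i % 8) for i in range(n)][::-1]


def is_valid_document(value):
    """Hypothetical validator: True for a well-formed CPF or CNPJ string."""
    digits = [int(c) for c in str(value) if c.isdigit()]
    if len(digits) == 11:
        weights_for, body = cpf_weights, 9
    elif len(digits) == 14:
        weights_for, body = cnpj_weights, 12
    else:
        return False
    # Assumed convention: a single repeated digit is rejected outright,
    # even though such numbers can satisfy the check-digit arithmetic.
    if len(set(digits)) == 1:
        return False
    first = check_digit(digits[:body], weights_for(body))
    second = check_digit(digits[:body + 1], weights_for(body + 1))
    return digits[body:] == [first, second]
```

Under this sketch, `11111111111` satisfies the arithmetic (both check digits come out as 1) and fails only because of the repeated-digit rule, while a value such as `0` fails on length alone.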

A `document_type` of 2 marks expenses made abroad, so those records are left out when listing invalid values.

In [3]:
keys = ['year',
        'applicant_id',
        'document_id',
        'total_net_value',
        'cnpj_cpf',
        'supplier',
        'document_type']
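# Keep domestic documents (document_type != 2) whose CNPJ/CPF failed validation.
is_domestic = dataset['document_type'] != 2
dataset.loc[is_domestic & ~dataset['valid_cnpj_cpf'], keys]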

| | year | applicant_id | document_id | total_net_value | cnpj_cpf | supplier | document_type |
|---|---|---|---|---|---|---|---|
| 53466 | 2009 | 1607 | 1748889 | 123.57 | 11111111111 | CAP HORN | 0 |
| 53467 | 2009 | 1607 | 1748896 | 100.25 | 11111111111 | CAP HORN | 0 |
| 53468 | 2009 | 1607 | 1748909 | 229.25 | 11111111111 | DENSKALDEDEKOK RESTAURANT | 0 |
| 53469 | 2009 | 1607 | 1748911 | 18.89 | 11111111111 | BELLA CENTER | 0 |
| 53470 | 2009 | 1607 | 1748915 | 581.85 | 11111111111 | FIRST HOTEL SKT. PETRI | 0 |
| 284494 | 2010 | 184 | 1987827 | 2974.63 | 11111111111 | AKA CENTRAL PARK - NEW YORK | 0 |
| 284495 | 2010 | 184 | 1987829 | 2974.63 | 11111111111 | AKA CENTRAL PARK - NEW YORK | 0 |
| 527753 | 2011 | 2329 | 2085477 | 190.74 | 0 | PREFEITURA MUNICIPAL DE FORTALEZA | 0 |
| 552301 | 2011 | 2387 | 2055025 | 372.72 | 0 | TAM LINHAS AREAS S/A | 0 |
| 552775 | 2011 | 2387 | 2209688 | 290.91 | 0 | CONDOMINIO CENTRO EMPRESARIAL IGUATEMI | 1 |


With 1,532,491 records in the dataset and only 10 with an invalid CNPJ/CPF (roughly one in 150,000), we can probably assume that the Chamber of Deputies validates this field in the tool congresspeople use to request reimbursements. These 10 records most likely point to a flaw in the validation implemented there.