# Expenses in closed companies
Recently we found out that many companies in the dataset are already closed or out of service. This analysis checks whether any expenses were made after a company's situation changed to something other than open.

In [1]:
import pandas as pd
import numpy as np
from serenata_toolbox.datasets import fetch

fetch('2016-09-03-companies.xz', '../data')
fetch('2016-11-19-reimbursements.xz', '../data')

In [2]:
companies = pd.read_csv('../data/2016-09-03-companies.xz', low_memory=False)
reimbursements = pd.read_csv('../data/2016-11-19-reimbursements.xz',
                      dtype={'applicant_id': str,
                             'cnpj_cpf': str,
                             'congressperson_id': str,
                             'subquota_number': str},
                      low_memory=False)

## Formatting
Convert the companies situation_date and reimbursements issue_date columns to a proper date type (needed for a query later), and strip the dots, slash and dash from the companies cnpj so that only digits remain.

In [3]:
reimbursements['issue_date'] = pd.to_datetime(reimbursements['issue_date'], errors='coerce')
companies['situation_date'] = pd.to_datetime(companies['situation_date'], errors='coerce')
companies['cnpj'] = companies['cnpj'].str.replace(r'\D', '', regex=True)
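
To make the effect of this formatting step concrete, here is a minimal sketch on a two-row sample. The `sample` frame and its values are made up for illustration and are not taken from the dataset; the sketch assumes a pandas version whose `str.replace` accepts the `regex` argument.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame({
    'cnpj': ['02.989.654/0011-97', '20.768.047/0001-07'],
    'situation_date': ['2013-01-03', 'not a date'],
})

# Keep digits only, so the identifier has the same shape as cnpj_cpf.
sample['cnpj'] = sample['cnpj'].str.replace(r'\D', '', regex=True)

# Values that cannot be parsed become NaT instead of raising an error.
sample['situation_date'] = pd.to_datetime(sample['situation_date'],
                                          errors='coerce')

print(sample)
```

Because of `errors='coerce'`, rows with an unparseable date are kept but carry `NaT`, which matters later when dates are compared.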

In [4]:
statuses = ['BAIXADA', 'NULA', 'SUSPENSA', 'INAPTA']
not_open = companies[companies['situation'].isin(statuses)]
not_open[['cnpj', 'situation_date', 'situation', 'situation_reason']].head(5)

Unnamed: 0,cnpj,situation_date,situation,situation_reason
37,3956142000115,2005-09-20,BAIXADA,EXTINCAO P/ ENC LIQ VOLUNTARIA
248,8594693000108,2016-06-28,BAIXADA,EXTINCAO P/ ENC LIQ VOLUNTARIA
329,20768047000107,2016-12-04,BAIXADA,EXTINCAO P/ ENC LIQ VOLUNTARIA
364,3380051000346,2016-05-01,BAIXADA,EXTINCAO P/ ENC LIQ VOLUNTARIA
395,17479634000171,2016-06-28,BAIXADA,EXTINCAO P/ ENC LIQ VOLUNTARIA


The situation_date column is the one of interest: expenses made after that date should be considered suspicious.

An inner join keeps only the reimbursements whose cnpj_cpf matches the cnpj of a company that is not open.

In [5]:
dataset = pd.merge(reimbursements, not_open, how='inner',
                   left_on='cnpj_cpf', right_on='cnpj')
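
As a small illustration of what the inner join keeps, the sketch below merges two toy frames. All names ending in `_sample` and all values are hypothetical; only the `pd.merge` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical reimbursements: only document 2 was issued by a
# company that appears in the "not open" list below.
reimbursements_sample = pd.DataFrame({
    'document_id': [1, 2, 3],
    'cnpj_cpf': ['11111111000111', '22222222000122', '33333333000133'],
})

# Hypothetical companies whose situation is other than open.
not_open_sample = pd.DataFrame({
    'cnpj': ['22222222000122', '44444444000144'],
    'situation': ['BAIXADA', 'INAPTA'],
})

# The inner join drops rows that have no match on either side.
merged = pd.merge(reimbursements_sample, not_open_sample, how='inner',
                  left_on='cnpj_cpf', right_on='cnpj')

print(merged)
```

Both key columns are strings here, which is why cnpj_cpf is read with a string dtype and cnpj is reduced to digits before merging.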

In [6]:
columns = ['congressperson_name', 'issue_date', 'cnpj', 'situation_date',
           'situation', 'situation_reason']
dataset[columns].head(10)

Unnamed: 0,congressperson_name,issue_date,cnpj,situation_date,situation,situation_reason
0,DILCEU SPERAFICO,2009-04-06,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
1,DILCEU SPERAFICO,2009-09-23,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
2,DOMINGOS DUTRA,2009-10-14,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
3,EDINHO BEZ,2009-10-19,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
4,HERMES PARCIANELLO,2009-05-29,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
5,JAIME MARTINS,2009-04-08,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
6,JOSÉ CARLOS VIEIRA,2009-07-01,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
7,PAULO BORNHAUSEN,2009-03-26,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
8,PAULO BORNHAUSEN,2009-04-07,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
9,PAULO BORNHAUSEN,2009-04-28,2989654001197,2013-01-03,BAIXADA,INCORPORACAO


In [7]:
dataset.shape

(93133, 259)

In [8]:
dataset.iloc[0]

year                                                       2009
applicant_id                                               1001
document_id                                             1564212
reimbursement_value_total                                   NaN
total_net_value                                             130
reimbursement_numbers                                      2888
congressperson_name                            DILCEU SPERAFICO
congressperson_id                                         73768
congressperson_document                                     444
term                                                       2015
state_x                                                      PR
party                                                        PP
term_id                                                      55
subquota_number                                               3
subquota_description                       Fuels and lubricants
subquota_group_id                       

## Filtering suspicious reimbursements
We now have all reimbursements requested for expenses made in companies whose situation is other than "open".
We still need to check that the reimbursement issue_date is later than the situation_date.

In [9]:
expenses_in_closed_companies = dataset.query('issue_date > situation_date')
expenses_in_closed_companies[columns].head()

Unnamed: 0,congressperson_name,issue_date,cnpj,situation_date,situation,situation_reason
2429,EDINHO ARAÚJO,2013-01-30,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
2430,EDINHO ARAÚJO,2013-02-02,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
2431,EDINHO ARAÚJO,2013-02-26,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
2432,EDINHO ARAÚJO,2013-03-01,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
2433,HERMES PARCIANELLO,2013-01-28,2989654001197,2013-01-03,BAIXADA,INCORPORACAO
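
One detail worth keeping in mind is how missing dates behave in this comparison. The sketch below uses a made-up three-row frame and a boolean mask that expresses the same comparison as the query above; it is an illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical rows: one issued after the situation date, one before,
# and one whose issue date is missing (NaT).
sample = pd.DataFrame({
    'issue_date': pd.to_datetime(['2013-01-30', '2009-04-06', None]),
    'situation_date': pd.to_datetime(['2013-01-03'] * 3),
})

# Same comparison as the query, written as a boolean mask.
after_change = sample[sample['issue_date'] > sample['situation_date']]

# A comparison against NaT is False, so such rows are silently excluded.
missing_issue_dates = sample['issue_date'].isna().sum()

print(after_change)
print(missing_issue_dates)
```

Since the dates were parsed with `errors='coerce'`, reimbursements or companies with an unparseable date cannot be flagged by this filter, so the resulting count only covers rows where both dates are known.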


In [10]:
expenses_in_closed_companies.shape

(5222, 259)

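We can safely say that there are 5222 suspicious reimbursements, that is, reimbursements whose issue_date is later than the situation_date of a company that is not open.
For this analysis, I would like to thank @jtemporal for pairing with me on all the coding and helping me understand the hypothesis.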