# Exploring negative net_values

This [issue](https://github.com/datasciencebr/serenata-de-amor/issues/29) explains the goal of this analysis.

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('../data/2016-08-08-last-year.xz',
                   parse_dates=[16],
                   dtype={'document_id': str,
                          'congressperson_id': str,
                          'congressperson_document': str,
                          'term_id': str,
                          'cnpj_cpf': str,
                          'reimbursement_number': str})

374.484 expenses in total

In [3]:
print(data.shape)

(374484, 29)


In [4]:
data.head()

Unnamed: 0,document_id,congressperson_name,congressperson_id,congressperson_document,term,state,party,term_id,subquota_number,subquota_description,...,net_value,month,year,installment,passenger,leg_of_the_trip,batch_number,reimbursement_number,reimbursement_value,applicant_id
0,5886345,ABEL MESQUITA JR.,178957,1,2015,RR,DEM,55,1,Maintenance of office supporting parliamentary...,...,165.65,11,2015,0,,,1255355,5294,,3074
1,5886361,ABEL MESQUITA JR.,178957,1,2015,RR,DEM,55,1,Maintenance of office supporting parliamentary...,...,59.48,12,2015,0,,,1255361,5294,,3074
2,5886341,ABEL MESQUITA JR.,178957,1,2015,RR,DEM,55,1,Maintenance of office supporting parliamentary...,...,130.95,11,2015,0,,,1255355,5294,,3074
3,5928783,ABEL MESQUITA JR.,178957,1,2015,RR,DEM,55,1,Maintenance of office supporting parliamentary...,...,193.06,12,2015,0,,,1268867,5370,,3074
4,5608486,ABEL MESQUITA JR.,178957,1,2015,RR,DEM,55,1,Maintenance of office supporting parliamentary...,...,310.25,2,2015,0,,,1168538,4966,,3074


In [5]:
data.iloc[0]

document_id                                                             5886345
congressperson_name                                           ABEL MESQUITA JR.
congressperson_id                                                        178957
congressperson_document                                                       1
term                                                                       2015
state                                                                        RR
party                                                                       DEM
term_id                                                                      55
subquota_number                                                               1
subquota_description          Maintenance of office supporting parliamentary...
subquota_group_id                                                             0
subquota_group_description                                                  NaN
supplier                                

There's an expense with a `net_value` of R$`-9240,77`.

In [6]:
data['net_value'].describe()

count    374484.000000
mean        570.566565
std        1993.167639
min       -9240.770000
25%          45.000000
50%         134.310000
75%         481.000000
max      189600.000000
Name: net_value, dtype: float64

Taking a look at the expense with the most negative `net_value`:

In [7]:
highest_negative_expense = \
    data[data['net_value'] == data['net_value'].min()].iloc[0]
highest_negative_expense

document_id                                        NaN
congressperson_name                       SIBÁ MACHADO
congressperson_id                               160613
congressperson_document                             58
term                                              2015
state                                               AC
party                                               PT
term_id                                             55
subquota_number                                    999
subquota_description               Flight ticket issue
subquota_group_id                                    0
subquota_group_description                         NaN
supplier                               Cia Aérea - TAM
cnpj_cpf                                02012862000160
document_number               Bilhete: 957-2117.270689
document_type                                        0
issue_date                         2015-09-15 00:00:00
document_value                                -9240.77
remark_val

How many expenses with the same `document_number`?

In [8]:
expenses = data[data['document_number'] == highest_negative_expense['document_number']]
len(expenses)

2

In [9]:
expenses['net_value'].describe()

count        2.000000
mean       -56.830000
std      12988.052504
min      -9240.770000
25%      -4648.800000
50%        -56.830000
75%       4535.140000
max       9127.110000
Name: net_value, dtype: float64

In this specific case, it seems that Sibá Machado purchased a flight ticket for R\$ 9127,11 on 15/09/2015. He canceled it on the same day and the returned amount was R\$ 9240,77, generating an actual **profit** of R\$ 113,66 (about 1,2% of the ticket price). Not bad, Mr. Sibá.
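
The arithmetic behind that difference can be checked directly. This is a minimal sketch using the two `net_value`s shown for this ticket; the variable names are illustrative only, and `Decimal` is used merely to avoid floating-point noise.

```python
from decimal import Decimal

ticket_price = Decimal('9127.11')  # the positive expense
refund = Decimal('9240.77')        # absolute value of the negative expense

# Amount returned beyond what was paid, and its share of the ticket price.
difference = refund - ticket_price
percentage = difference / ticket_price * 100

print(difference)            # 113.66
print(round(percentage, 2))
```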

In [10]:
expenses.iloc[0]

document_id                                        NaN
congressperson_name                       SIBÁ MACHADO
congressperson_id                               160613
congressperson_document                             58
term                                              2015
state                                               AC
party                                               PT
term_id                                             55
subquota_number                                    999
subquota_description               Flight ticket issue
subquota_group_id                                    0
subquota_group_description                         NaN
supplier                               Cia Aérea - TAM
cnpj_cpf                                02012862000160
document_number               Bilhete: 957-2117.270689
document_type                                        0
issue_date                         2015-09-15 00:00:00
document_value                                -9240.77
remark_val

In [11]:
expenses.iloc[1]

document_id                                        NaN
congressperson_name                       SIBÁ MACHADO
congressperson_id                               160613
congressperson_document                             58
term                                              2015
state                                               AC
party                                               PT
term_id                                             55
subquota_number                                    999
subquota_description               Flight ticket issue
subquota_group_id                                    0
subquota_group_description                         NaN
supplier                               Cia Aérea - TAM
cnpj_cpf                                02012862000160
document_number               Bilhete: 957-2117.270689
document_type                                        0
issue_date                         2015-09-15 00:00:00
document_value                                 9127.11
remark_val

There are **17.646** expenses with a negative `net_value`, and all of them have "Flight ticket issue" as `subquota_description`.

In [12]:
negative_documents = data[data['net_value'] < 0]
len(negative_documents)

17646

In [13]:
negative_documents['subquota_description'].unique()

array(['Flight ticket issue'], dtype=object)

Summing the negative expenses and the positive expenses that share a `document_number` with one of the negatives gives us **31.104** expenses. There is a big discrepancy: **17.646** negatives and **13.458** positives.

In [14]:
negatives_and_counterparts = data[data['document_number'].isin(negative_documents['document_number'])]

len(negatives_and_counterparts)


31104

In [15]:
counterparts = negatives_and_counterparts[negatives_and_counterparts['net_value'] > 0]

len(counterparts)

13458

In [16]:
# If every negative document is to have a pair, we're short on positive ones by:
len(negative_documents) - len(counterparts)

4188

What comes to mind:
- Are those remaining **4.188** negative expenses without a corresponding positive one?
- Are the document numbers messed up? Maybe something like "Bilhete: 957-2117.270689" for the negative and "Vôo: 957-2117.270689" for the positive.
- Are the `net_value`s messed up? Maybe some positive ones were registered as negatives.
- Is the positive expense missing from this dataset? Maybe it is in an older one.
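
The document-number hypothesis could be tested by comparing only the part after the prefix. The sketch below uses made-up rows, not the real dataset; the `ticket_code` column and the "Vôo:" prefix are assumptions taken from the example above, and the real data may use other prefixes or none at all.

```python
import pandas as pd

# Toy data: the same ticket recorded under two different prefixes,
# plus one negative expense that has no positive match.
toy = pd.DataFrame({
    'document_number': ['Bilhete: 957-2117.270689',
                        'Vôo: 957-2117.270689',
                        'Bilhete: 957-0000.000001'],
    'net_value': [-100.0, 100.0, -50.0],
})

# Drop everything up to and including the first colon, then trim whitespace.
toy['ticket_code'] = (toy['document_number']
                      .str.replace(r'^[^:]*:', '', regex=True)
                      .str.strip())

positive_codes = toy.loc[toy['net_value'] > 0, 'ticket_code']

# Negative rows whose normalized code also appears on a positive row.
matched = toy[(toy['net_value'] < 0) & toy['ticket_code'].isin(positive_codes)]
print(matched['document_number'].tolist())
```

If normalizing the prefixes in the real data noticeably raised the number of matched negatives, that would support this hypothesis.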

Taking the negative expenses (**17.646**) and removing every one whose document number matches one of the counterparts (**13.458**) gives us a total of **4.458** expenses. That is more than the **4.188** I anticipated.

In [17]:
negatives_without_counterparts = negative_documents[~negative_documents.document_number.isin(counterparts['document_number'])]

len(negatives_without_counterparts)

4458
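
The two figures only have to agree when every negative pairs with exactly one positive. The sketch below uses a toy frame (not the real data) in which a single negative shares its document number with two positives, so subtracting lengths and filtering by document number give different answers. Whether this is what happens in the real dataset is an assumption to verify.

```python
import pandas as pd

# Toy data: document 'A' has one negative and two positives;
# 'B' and 'C' are negatives with no positive at all.
toy = pd.DataFrame({
    'document_number': ['A', 'A', 'A', 'B', 'C'],
    'net_value': [-10.0, 6.0, 4.0, -5.0, -7.0],
})

toy_negatives = toy[toy['net_value'] < 0]
toy_counterparts = toy[(toy['net_value'] > 0)
                       & toy['document_number'].isin(toy_negatives['document_number'])]

# Approach 1: subtract the number of positives from the number of negatives.
by_subtraction = len(toy_negatives) - len(toy_counterparts)

# Approach 2: keep negatives whose document number has no positive counterpart.
by_filtering = len(toy_negatives[
    ~toy_negatives['document_number'].isin(toy_counterparts['document_number'])])

print(by_subtraction, by_filtering)  # 1 2
```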

It seems we're dealing with expenses that share the same document number where both are negative...

In [18]:
len(negatives_without_counterparts['document_number'].unique())

4403
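
To see which document numbers are repeated, rather than only how many distinct ones there are, `DataFrame.duplicated` with `keep=False` can be used. This is a sketch on toy data; `toy_negatives` is a stand-in for `negatives_without_counterparts`.

```python
import pandas as pd

# Toy data: document number 'A' is shared by two negative expenses.
toy_negatives = pd.DataFrame({
    'document_number': ['A', 'A', 'B', 'C'],
    'net_value': [-10.0, -20.0, -5.0, -7.5],
})

n_rows = len(toy_negatives)
n_unique = toy_negatives['document_number'].nunique()

# keep=False flags every row of a repeated document number,
# not just the second and later occurrences.
repeated = toy_negatives[toy_negatives.duplicated('document_number', keep=False)]

print(n_rows - n_unique)  # 1
print(repeated)
```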

Here I'm importing the dataset from previous years:

In [19]:
old_data = pd.read_csv('../data/2016-08-08-previous-years.xz',
                       parse_dates=[16],
                       dtype={'document_id': str,
                              'congressperson_id': str,
                              'congressperson_document': str,
                              'term_id': str,
                              'cnpj_cpf': str,
                              'reimbursement_number': str})

In [20]:
print(old_data.shape)

(1512786, 29)


It seems the old data contains **6.655** expenses whose document number matches one of our **4.458** negatives without counterparts. That is more than the number of negatives.

In [21]:
old_data_counterparts = old_data[old_data['document_number'].isin(negatives_without_counterparts['document_number'])]

len(old_data_counterparts)

6655

**Most**, but not all, are positive.

In [22]:
old_data_only_positives = old_data_counterparts[old_data_counterparts['net_value'] > 0]

len(old_data_only_positives)

6638
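
The matches that are not positive can be isolated by inverting the same boolean mask. This is a sketch on toy data; `toy_old_matches` is a stand-in for `old_data_counterparts`, and its values are invented.

```python
import pandas as pd

# Toy data: two positive matches, one negative and one zero-valued.
toy_old_matches = pd.DataFrame({
    'document_number': ['A', 'B', 'C', 'D'],
    'net_value': [120.0, 80.0, -80.0, 0.0],
})

is_positive = toy_old_matches['net_value'] > 0

# Rows left over once the positives are removed (negative or zero).
not_positive = toy_old_matches[~is_positive]

print(len(not_positive))                         # 2
print(not_positive['document_number'].tolist())  # ['C', 'D']
```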