### Importing data

In [1]:
import pandas as pd
from utils.dates import Dates

data_folder = 'data/'
itens_file_name = 'valor_unitario_aprovado_sample.csv'

itens_file_path = data_folder + itens_file_name

date_column = 'Data'
price_column = 'ValorUnitarioAprovado'

# Load the sample and parse the date column using the project's input format.
dt = pd.read_csv(itens_file_path)
dt[date_column] = pd.to_datetime(dt[date_column], format=Dates.DATE_INPUT_FORMAT)
dt.head()

Unnamed: 0.1,Unnamed: 0,Item,ValorUnitarioAprovado,Data,idPRONAC
0,1894129,Material de consumo,7000.0,2014-07-31 07:27:29,179966
1,2003098,Locação de teatro,465.39,2015-05-05 16:38:24,185736
2,925488,Banheiro químico,250.0,2012-03-16 00:00:00,143223
3,1326505,Técnico de som,3500.0,2013-01-07 00:00:00,155368
4,136947,Cartaz,300.0,2010-01-01 00:00:00,114412


### Cleaning data

**Eliminating items with approved value equal to zero (0)**

Some items have an approved value equal to zero (0), for example the item with index **154**. They are of no use for now, so let's get rid of them.

In [2]:
dt = dt[dt.ValorUnitarioAprovado > 0.0]
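
To see how much data a filter like this removes, the row counts before and after can be compared. The sketch below runs on a small made-up frame, not on the real sample; only the column name `ValorUnitarioAprovado` comes from the notebook.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
sample = pd.DataFrame({
    'Item': ['Cartaz', 'Refeição', 'Contador', 'Cartaz'],
    'ValorUnitarioAprovado': [300.0, 0.0, 1200.0, 0.0],
})

price_column = 'ValorUnitarioAprovado'

# Keep only rows whose approved unit value is strictly positive.
filtered = sample[sample[price_column] > 0.0]

removed = len(sample) - len(filtered)
print('removed {} of {} rows'.format(removed, len(sample)))
```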

**Eliminating items dated before year 1992**

The Rouanet Law dates from late 1991, so it's reasonable to keep only data from 1992 onward.

In [3]:
# Show the earliest rows before filtering.
dt = dt.sort_values(by=['Data'])
display(dt.head())

# Keep only rows dated 1992-01-01 or later; dt is already sorted by date.
dt = dt[dt.Data >= '1992-01-01']
display(dt.head())

Unnamed: 0.1,Unnamed: 0,Item,ValorUnitarioAprovado,Data,idPRONAC
7633,1603805,Secretária,770.0,1969-12-31,162835
14946,9355,Secretária,625.0,2009-04-13,111157
15581,9832,Contra-regra,260.0,2009-04-14,111278
15821,27415,Mídia radiofônica,18000.0,2009-04-17,111191
10681,38839,Cenotécnico,700.0,2009-04-20,111123


Unnamed: 0.1,Unnamed: 0,Item,ValorUnitarioAprovado,Data,idPRONAC
14946,9355,Secretária,625.0,2009-04-13,111157
15581,9832,Contra-regra,260.0,2009-04-14,111278
15821,27415,Mídia radiofônica,18000.0,2009-04-17,111191
10681,38839,Cenotécnico,700.0,2009-04-20,111123
16241,38842,Assistente de iluminação,850.0,2009-04-20,111123
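
The single 1969-12-31 row in the first table looks like a placeholder rather than a real date: it falls one day before the Unix epoch (1970-01-01), which is a common symptom of a zero timestamp shown in a timezone behind UTC, although that is only a guess here. A minimal sketch of the same cutoff written with an explicit `pd.Timestamp`, on made-up rows:

```python
import pandas as pd

# Toy rows for illustration; only the column names mirror the notebook.
sample = pd.DataFrame({
    'Item': ['Secretária', 'Secretária', 'Contra-regra'],
    'Data': pd.to_datetime(['1969-12-31', '2009-04-13', '2009-04-14']),
})

# The Rouanet Law dates from late 1991, so keep 1992-01-01 onward.
cutoff = pd.Timestamp('1992-01-01')
recent = sample[sample['Data'] >= cutoff].sort_values(by='Data')

print(recent['Data'].min())
```

Comparing against a `Timestamp` makes the cutoff date explicit instead of relying on pandas to parse the string `'1992'`.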


In [4]:
# Inspect the earliest dates recorded for one specific item.
rows = dt[dt.Item == 'Transporte Local / Locação de Automóvel / Combustível']
dates = rows['Data'].sort_values()
dates.head()

5392    2009-07-01
15495   2009-07-09
3720    2009-07-10
2195    2009-07-20
13290   2009-08-01
Name: Data, dtype: datetime64[ns]

### Number of distinct items

In [5]:
print(len(dt['Item'].unique()))

1083


**Lots of distinct items**

There are more than **1000** distinct items in this sample. Plotting them one by one would be impractical, so it's a good idea to plot a subset of them (the most frequent ones, for example).

**Getting the most frequent items**

In [6]:
top_frequent = dt['Item'].value_counts().head(10)
print(top_frequent.index)
display(top_frequent)

Index(['Transporte Local / Locação de Automóvel / Combustível',
       'Passagens Aéreas (Descrever os trechos na tela de deslocamentos)',
       'Assessor de imprensa', 'Refeição', 'Assistente de produção',
       'Produtor Executivo', 'Contador', 'Hospedagem sem Alimentação',
       'Cartaz',
       'Banner/faixa adesiva/faixa de lona/saia de palco/testeira/pórtico\r\n'],
      dtype='object')


Transporte Local / Locação de Automóvel / Combustível                    331
Passagens Aéreas (Descrever os trechos na tela de deslocamentos)         318
Assessor de imprensa                                                     308
Refeição                                                                 308
Assistente de produção                                                   283
Produtor Executivo                                                       267
Contador                                                                 249
Hospedagem sem Alimentação                                               240
Cartaz                                                                   225
Banner/faixa adesiva/faixa de lona/saia de palco/testeira/pórtico\r\n    220
Name: Item, dtype: int64
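
The last entry in the output above ends with a literal `\r\n`, which suggests that some item names carry stray whitespace. If the same item also appears without it, its count would be split across two labels. Whether that actually happens in this sample is not checked here; the sketch below only illustrates the idea of normalizing names before counting, using made-up rows.

```python
import pandas as pd

# Made-up item names, some with stray whitespace.
items = pd.Series(
    ['Cartaz', 'Cartaz\r\n', ' Refeição', 'Refeição', 'Contador'],
    name='Item',
)

# Strip leading/trailing whitespace, including carriage returns and newlines.
clean_items = items.str.strip()

raw_counts = items.value_counts()
clean_counts = clean_items.value_counts()

print(len(raw_counts), len(clean_counts))
```

On the real data the equivalent step would be `dt['Item'] = dt['Item'].str.strip()` before calling `value_counts()`.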

In [7]:
Dates.get_xy(dt, date_column, price_column)

(array(['2009-04-13T00:00:00.000000000', '2009-04-14T00:00:00.000000000',
        '2009-04-17T00:00:00.000000000', ...,
        '2018-09-01T00:00:00.000000000', '2019-01-15T21:24:42.000000000',
        '2019-01-21T09:07:02.000000000'], dtype='datetime64[ns]'),
 array([6.250e+02, 2.600e+02, 1.800e+04, ..., 1.039e+01, 9.000e+01,
        5.900e+04]))
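
`Dates.get_xy` is a project helper whose source is not shown here. Judging only from the output above (dates in ascending order paired with an array of prices), it might behave roughly like the hypothetical `get_xy_sketch` below. This is an assumption for illustration, not the actual implementation.

```python
import pandas as pd

def get_xy_sketch(frame, date_column, price_column):
    """Hypothetical stand-in: return dates and prices as arrays, ordered by date."""
    ordered = frame.sort_values(by=date_column)
    x = ordered[date_column].to_numpy()
    y = ordered[price_column].to_numpy(dtype=float)
    return x, y

# Toy rows for illustration; only the column names mirror the notebook.
sample = pd.DataFrame({
    'Data': pd.to_datetime(['2009-04-17', '2009-04-13', '2009-04-14']),
    'ValorUnitarioAprovado': [18000.0, 625.0, 260.0],
})

x, y = get_xy_sketch(sample, 'Data', 'ValorUnitarioAprovado')
print(x)
print(y)
```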

In [8]:
from utils.plotter import Plotter
import matplotlib.pyplot as plt
%matplotlib inline


frequent_item = top_frequent.index[0]

#x, y = Dates.get_xy_dates(dt, date_column, price_column)

#Plotter.plot_scatter_along_time(x, y, subplot = 121, x_label = date_column, y_label = price_column, title = frequent_item)
#Plotter.plot_log_along_time(x, y, subplot = 121, x_label = date_column, y_label = price_column, title = frequent_item)
#plt.show()

def debug_dates():
    # Print the selected item, then show its dates in ascending order and the earliest one.
    print('frequent_item = [{}]'.format(frequent_item))
    rows = dt[dt.Item == frequent_item]
    x_dates = rows['Data'].sort_values()
    display(x_dates)
    display(x_dates.min())

debug_dates()


frequent_item = [Transporte Local / Locação de Automóvel / Combustível]


5392    2009-07-01 00:00:00
15495   2009-07-09 00:00:00
3720    2009-07-10 00:00:00
2195    2009-07-20 00:00:00
13290   2009-08-01 00:00:00
10302   2009-08-20 00:00:00
893     2009-09-01 00:00:00
2990    2009-09-30 00:00:00
1587    2009-10-01 00:00:00
6199    2009-10-08 00:00:00
14642   2009-10-10 00:00:00
4732    2009-10-17 00:00:00
1006    2009-10-26 00:00:00
11009   2009-11-01 00:00:00
6271    2009-12-10 00:00:00
14992   2009-12-13 00:00:00
194     2009-12-16 00:00:00
15734   2009-12-21 00:00:00
3815    2010-01-01 00:00:00
17029   2010-01-01 00:00:00
8555    2010-01-04 00:00:00
15755   2010-01-05 00:00:00
7276    2010-01-10 00:00:00
7214    2010-01-15 00:00:00
7278    2010-01-18 00:00:00
964     2010-01-29 00:00:00
4295    2010-02-01 00:00:00
2460    2010-02-02 00:00:00
1797    2010-02-10 00:00:00
13126   2010-02-14 00:00:00
                ...        
1484    2017-01-06 16:36:31
7359    2017-02-06 08:11:17
15664   2017-02-17 12:05:16
14451   2017-02-23 14:33:31
13128   2017-02-28 0

Timestamp('2009-07-01 00:00:00')