# Visually interacting with data

Visually analyzing a data set is an important task in the data science process. So far, we have seen how to do this using static graphics produced by the matplotlib and seaborn libraries. In this notebook, we will explore the interactive data visualization library [Plotly](https://plot.ly/python/), which is available for several programming languages. In the Python ecosystem, we can use the `plotly.express` module, created to make it easier to produce interactive visualizations.

In [1]:
import pandas as pd
import plotly.express as px

For this analysis, we will use an [open database about fuel prices](http://dados.gov.br/dataset/infopreco), with prices made available by gas stations through the ANP website. In the code below, we tell Pandas to parse the `DATA CADASTRO` (registration date) feature as a date:

In [2]:
prices = pd.read_csv('http://www.anp.gov.br/images/infopreco/infopreco.csv', encoding='latin-1', sep=';', decimal=',', parse_dates=['DATA CADASTRO'])
prices.head()

Unnamed: 0,CNPJ,NOME,ENDEREÇO,COMPLEMENTO,BAIRRO,MUNICÍPIO,UF,PRODUTO,VALOR VENDA,DATA CADASTRO
0,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Gasolina C Comum,4.436,2018-06-28 17:49:00
1,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Etanol,3.482,2018-06-28 17:49:00
2,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Diesel S500,3.644,2018-06-28 17:49:00
3,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Diesel S10,3.734,2018-06-28 17:49:00
4,300357000195,POSTO E TRANSPORTADORA PEGORARO,"RODOVIA BR 163,S/N",KM 786,ZONA RURAL,COXIM,MS,Gasolina C Comum,4.879,2020-03-14 10:09:00
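
The `sep`, `decimal`, and `parse_dates` arguments can be tried out on a small in-memory sample before downloading the full file. The sketch below is only an illustration: the reduced set of columns, the `sample_csv` text, and the `sample_prices` name are assumptions, and the date format in the real file may differ from the one used here.

```python
import io

import pandas as pd

# Hypothetical three-column excerpt written in the style of the ANP file:
# ';' separates the fields and ',' is the decimal mark.
sample_csv = io.StringIO(
    "PRODUTO;VALOR VENDA;DATA CADASTRO\n"
    "Gasolina C Comum;4,436;2018-06-28 17:49:00\n"
    "Etanol;3,482;2018-06-28 17:49:00\n"
)

sample_prices = pd.read_csv(
    sample_csv,
    sep=";",                        # field separator
    decimal=",",                    # read "4,436" as the float 4.436
    parse_dates=["DATA CADASTRO"],  # convert this column to datetime
)
print(sample_prices.dtypes)
```

Checking `dtypes` right after loading is a quick way to confirm that prices were read as numbers and dates as datetimes, rather than as text.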


Since this dataframe is provided in Brazilian Portuguese, let's first translate the feature names:

In [3]:
prices.columns = ['CNPJ',
                  'NAME',
                  'ADDRESS',
                  'COMPLEMENT',
                  'NEIGHBORHOOD',
                  'CITY',
                  'UF',
                  'PRODUCT',
                  'SALE PRICE',
                  'REGISTRATION DATE']
prices.head()

Unnamed: 0,CNPJ,NAME,ADDRESS,COMPLEMENT,NEIGHBORHOOD,CITY,UF,PRODUCT,SALE PRICE,REGISTRATION DATE
0,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Gasolina C Comum,4.436,2018-06-28 17:49:00
1,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Etanol,3.482,2018-06-28 17:49:00
2,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Diesel S500,3.644,2018-06-28 17:49:00
3,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Diesel S10,3.734,2018-06-28 17:49:00
4,300357000195,POSTO E TRANSPORTADORA PEGORARO,"RODOVIA BR 163,S/N",KM 786,ZONA RURAL,COXIM,MS,Gasolina C Comum,4.879,2020-03-14 10:09:00
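
Assigning a list to `prices.columns` relies on the new names being in exactly the same order as the columns in the file. As an alternative sketch (not what this notebook uses), `DataFrame.rename` with an explicit mapping translates each column by name. The `raw` frame and the `translation` dictionary below are illustrative assumptions covering only three of the columns.

```python
import pandas as pd

# Hypothetical one-row frame with three of the original Portuguese column names.
raw = pd.DataFrame({
    "NOME": ["AUTO POSTO PARATI"],
    "PRODUTO": ["Etanol"],
    "VALOR VENDA": [3.482],
})

# Each original name is mapped explicitly to its translation.
translation = {
    "NOME": "NAME",
    "PRODUTO": "PRODUCT",
    "VALOR VENDA": "SALE PRICE",
}
translated = raw.rename(columns=translation)
print(translated.columns.tolist())
```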


## Validating the data

We will begin our analysis by validating the data present in the database. This is particularly important for open government databases, which often lack good data curation.

In [None]:
prices.isnull().sum()

CNPJ                   0
NAME                   0
ADDRESS                0
COMPLEMENT           677
NEIGHBORHOOD          11
CITY                   0
UF                     0
PRODUCT                0
SALE PRICE             0
REGISTRATION DATE      0
dtype: int64

In [None]:
product_price = prices.pivot_table(index="PRODUCT", values="SALE PRICE")
product_price.head()

Unnamed: 0_level_0,SALE PRICE
PRODUCT,Unnamed: 1_level_1
Diesel S10,9.640719
Diesel S500,9.639641
Etanol,8.913325
GNV,18.371409
Gasolina C Comum,12.828701


The missing data are limited to neighborhood and complement, which will not be the focus of our analysis. On the other hand, the average prices of the listed fuels are considerably high. Averages this high suggest the presence of incorrect values, which we can verify by analyzing the data distribution. To interact with the chart, hover your mouse over it and explore the options in the upper right corner:

In [None]:
px.box(prices, x="PRODUCT", y="SALE PRICE")

Notice that the graph produced by Plotly brings a number of tools that we can use to further our investigation. Let's highlight some of them:

* When placing the mouse over a boxplot, we see its information.
* By positioning the mouse over an outlier, we can see its value.
* We can zoom in on the parts of the graph that interest us most.
* We can save the chart directly.
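
The outliers drawn in a boxplot can also be listed numerically. The following is a minimal sketch using the common 1.5 × IQR rule on made-up values. The `toy_prices` series is hypothetical, and the quartiles computed by pandas may differ slightly from the ones a plotting library displays, since several quartile conventions exist.

```python
import pandas as pd

# Made-up prices with one suspiciously large value.
toy_prices = pd.Series([3.4, 3.5, 3.6, 3.7, 3.8, 4.4, 499.0], name="SALE PRICE")

q1 = toy_prices.quantile(0.25)
q3 = toy_prices.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy_prices[(toy_prices < lower_fence) | (toy_prices > upper_fence)]
print(outliers)
```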

Looking specifically at this dataset, the invalid values appear to have been reported without the decimal separator. Let's check which values are above R$10.00:

In [None]:
prices.query("`SALE PRICE` > 10")

Unnamed: 0,CNPJ,NAME,ADDRESS,COMPLEMENT,NEIGHBORHOOD,CITY,UF,PRODUCT,SALE PRICE,REGISTRATION DATE
200,4528732000371,POSTO NOVA PIRAI,"RUA XV DE NOVEMBRO,36",GALPAO,CENTRO,PIRAI,RJ,Gasolina C Comum,499.0,2019-08-16 11:39:00
201,4528732000371,POSTO NOVA PIRAI,"RUA XV DE NOVEMBRO,36",GALPAO,CENTRO,PIRAI,RJ,Etanol,389.0,2019-08-16 11:39:00
202,4528732000371,POSTO NOVA PIRAI,"RUA XV DE NOVEMBRO,36",GALPAO,CENTRO,PIRAI,RJ,GNV,349.0,2019-08-16 11:39:00
203,4528732000371,POSTO NOVA PIRAI,"RUA XV DE NOVEMBRO,36",GALPAO,CENTRO,PIRAI,RJ,Diesel S500,349.0,2019-08-16 11:39:00
204,4528732000371,POSTO NOVA PIRAI,"RUA XV DE NOVEMBRO,36",GALPAO,CENTRO,PIRAI,RJ,Diesel S10,359.0,2019-08-16 11:39:00
208,4624593000118,C JUSTINIANO ROCHA & CIA LTDA.,"AVENIDA LUCIO PEREIRA LUZ,790",,CENTRO,LUCIARA,MT,Gasolina C Comum,550.0,2019-06-14 10:28:00
209,4624593000118,C JUSTINIANO ROCHA & CIA LTDA.,"AVENIDA LUCIO PEREIRA LUZ,790",,CENTRO,LUCIARA,MT,Etanol,360.0,2019-06-14 10:28:00
210,4624593000118,C JUSTINIANO ROCHA & CIA LTDA.,"AVENIDA LUCIO PEREIRA LUZ,790",,CENTRO,LUCIARA,MT,Diesel S500,445.0,2019-06-14 10:28:00
211,4624593000118,C JUSTINIANO ROCHA & CIA LTDA.,"AVENIDA LUCIO PEREIRA LUZ,790",,CENTRO,LUCIARA,MT,Diesel S10,455.0,2019-06-14 10:28:00
350,7780837000140,COMERCIO DE COMBUSTIVEIS TREVO CACHOEIRAS LTDA,"RUA ESCRITORA MARIA COTTAS,S/N",LOTES 1 A 6 E 8 A 11,PARQUE SANTA LUZIA,CACHOEIRAS DE MACACU,RJ,Gasolina C Comum,499.0,2020-02-21 11:26:00


Because there are few such observations, we can remove them without harming the dataset:

In [None]:
prices = prices.query("`SALE PRICE` <= 10")

In [None]:
px.box(prices, x="PRODUCT", y="SALE PRICE")

Interestingly, now we see that there are also outliers below the boxplots. Let's investigate these cases:

In [None]:
prices.query("`SALE PRICE` <= 1")

Unnamed: 0,CNPJ,NAME,ADDRESS,COMPLEMENT,NEIGHBORHOOD,CITY,UF,PRODUCT,SALE PRICE,REGISTRATION DATE
249,5037623000152,AUTO POSTO TIO SAM LTDA,"RODOVIA BR 163,S/N","KM 20,5",ZONA RURAL,MUNDO NOVO,MS,GNV,0.001,2019-09-20 13:24:00
365,8355825000130,POSTO DE COMBUSTÍVEIS 214 SUL,SETOR SHC/SUL SQ 214 BLOCO A PAG - LOJA DE CON...,,ASA SUL,BRASILIA,DF,GNV,1.0,2018-06-26 10:59:00
366,8355825000130,POSTO DE COMBUSTÍVEIS 214 SUL,SETOR SHC/SUL SQ 214 BLOCO A PAG - LOJA DE CON...,,ASA SUL,BRASILIA,DF,Diesel S500,1.0,2018-06-26 10:59:00
503,11664743000182,POSTO UNIVERSITÁRIO,"RUA FREI GABRIEL,897",TERREO,UNIVERSITARIO,LAGES,SC,GNV,0.001,2019-03-29 17:00:00
671,19292157000166,VALTER GAVASSA COMBUSTIVEIS LTDA,"AVENIDA AFIF JOSE ABDO,37",,RESIDENCIAL PORTAL DA PEROLA II,BIRIGUI,SP,GNV,1.0,2018-11-10 17:15:00
731,22794128000298,POSTO MIMIM II,"AVENIDA SETE DE SETEMBRO,623",LOJA 01,CENTRO,ITAJAI,SC,Gasolina C Comum,1.0,2019-10-16 17:50:00
732,22794128000298,POSTO MIMIM II,"AVENIDA SETE DE SETEMBRO,623",LOJA 01,CENTRO,ITAJAI,SC,Etanol,1.0,2019-10-16 17:50:00
733,22794128000298,POSTO MIMIM II,"AVENIDA SETE DE SETEMBRO,623",LOJA 01,CENTRO,ITAJAI,SC,GNV,1.0,2019-10-16 17:50:00
734,22794128000298,POSTO MIMIM II,"AVENIDA SETE DE SETEMBRO,623",LOJA 01,CENTRO,ITAJAI,SC,Diesel S10,1.0,2019-10-16 17:50:00


Again, it seems safe to remove these cases:

In [None]:
prices = prices.query("`SALE PRICE` > 1").copy()

In [None]:
px.box(prices, x="PRODUCT", y="SALE PRICE")
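
The two filters applied in this section keep only the prices in the range (1, 10]. As a minimal sketch on a made-up frame, the same range can be expressed in a single `query` call or with a boolean mask. The backticks inside `query` are needed because the column name contains a space. The `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up prices, including values at and beyond both limits.
toy = pd.DataFrame({"SALE PRICE": [0.001, 1.0, 3.482, 4.436, 10.0, 499.0]})

# Single query: strictly above 1 and at most 10.
in_range_query = toy.query("`SALE PRICE` > 1 and `SALE PRICE` <= 10")

# Equivalent boolean mask.
mask = (toy["SALE PRICE"] > 1) & (toy["SALE PRICE"] <= 10)
in_range_mask = toy[mask]
print(in_range_query)
```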

## Average fuel price

Now that the data is validated, let's visualize the average price per fuel type:

In [None]:
product_price = prices.pivot_table(index="PRODUCT", values="SALE PRICE")
product_price.head()

Unnamed: 0_level_0,SALE PRICE
PRODUCT,Unnamed: 1_level_1
Diesel S10,3.714015
Diesel S500,3.65283
Etanol,3.426082
GNV,3.260563
Gasolina C Comum,4.575132


In [None]:
px.bar(product_price, x=product_price.index, y="SALE PRICE", title='Average price per fuel type')
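
When no `aggfunc` is given, `pivot_table` aggregates with the mean, which is why the table above holds average prices. The sketch below compares it with an explicit `groupby` on made-up values; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Made-up observations: two for one product, one for another.
toy = pd.DataFrame({
    "PRODUCT": ["Etanol", "Etanol", "GNV"],
    "SALE PRICE": [3.0, 4.0, 5.0],
})

# pivot_table uses the mean by default.
by_pivot = toy.pivot_table(index="PRODUCT", values="SALE PRICE")

# The same aggregation written explicitly.
by_group = toy.groupby("PRODUCT")[["SALE PRICE"]].mean()
print(by_pivot)
```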

We can refine our analysis by looking at the price of "Gasolina Comum" per state.

In [None]:
gasoline_price = prices.query("PRODUCT == 'Gasolina C Comum'")
gasoline_by_state = gasoline_price.pivot_table(index="UF", values='SALE PRICE')
gasoline_by_state.head()

Unnamed: 0_level_0,SALE PRICE
UF,Unnamed: 1_level_1
AL,3.959
AM,3.98
BA,4.638111
CE,4.7057
DF,4.455333


To plot the pivot table above, we pass its index as the values to be used on the x-axis:

In [None]:
px.bar(gasoline_by_state, x=gasoline_by_state.index, y="SALE PRICE", title='Average price of "Gasolina Comum" per UF')
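
Passing the index as `x` works, but an alternative is to turn the index into a regular column first, so that it can be referred to by name. This is a minimal sketch with two values copied from the table above; the `toy_by_state` and `tidy_by_state` names are hypothetical.

```python
import pandas as pd

# Small pivot-like frame indexed by state (UF).
toy_by_state = pd.DataFrame(
    {"SALE PRICE": [3.959, 3.98]},
    index=pd.Index(["AL", "AM"], name="UF"),
)

# reset_index moves "UF" from the index into an ordinary column.
tidy_by_state = toy_by_state.reset_index()
print(tidy_by_state.columns.tolist())
# With "UF" as a regular column, a plotting call can refer to it by name (x="UF").
```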

We can expand this analysis by including all considered products. To do so, let's use a histogram with the parameter `histfunc="avg"`, because we are interested in the average values. Note that, although the plot contains a lot of information, we can select which products to display by clicking on their entries in the legend on the right side.

In [None]:
px.histogram(prices, x="UF", y="SALE PRICE", color="PRODUCT", histfunc="avg",
             barmode="group", title='Fuel price distribution')
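
With `histfunc="avg"`, each bar shows the mean of `SALE PRICE` for one combination of state and product. The sketch below shows the corresponding aggregation in pandas on made-up values; the `toy` frame and the `bar_heights` name are assumptions.

```python
import pandas as pd

# Made-up observations for two states and two products.
toy = pd.DataFrame({
    "UF": ["MS", "MS", "MS", "RJ"],
    "PRODUCT": ["Etanol", "Etanol", "Diesel S10", "Etanol"],
    "SALE PRICE": [3.0, 4.0, 3.5, 5.0],
})

# One mean per (state, product) pair, i.e. one value per bar.
bar_heights = toy.groupby(["UF", "PRODUCT"])["SALE PRICE"].mean().reset_index()
print(bar_heights)
```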

## Analyzing the evolution of fuel prices

The data in the database that we downloaded are samples from several different months. This type of data is called a **time series**, or historical series. We can visualize the evolution of the series using a line graph. To do so, we first need to generate a series with the average value of each product per month.

The first step is to produce a feature that contains only the year and month, so that each group has enough data to aggregate. To do that, we use the `.dt.to_period("M")` method.


In [None]:
prices['MONTH'] = prices['REGISTRATION DATE'].dt.to_period('M').astype(str)
prices.head()

Unnamed: 0,CNPJ,NAME,ADDRESS,COMPLEMENT,NEIGHBORHOOD,CITY,UF,PRODUCT,SALE PRICE,REGISTRATION DATE,MONTH
0,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Gasolina C Comum,4.436,2018-06-28 17:49:00,2018-06
1,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Etanol,3.482,2018-06-28 17:49:00,2018-06
2,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Diesel S500,3.644,2018-06-28 17:49:00,2018-06
3,62780000102,AUTO POSTO PARATI,"AVENIDA FILINTO MULLER,645",,CENTRO,TRES LAGOAS,MS,Diesel S10,3.734,2018-06-28 17:49:00,2018-06
4,300357000195,POSTO E TRANSPORTADORA PEGORARO,"RODOVIA BR 163,S/N",KM 786,ZONA RURAL,COXIM,MS,Gasolina C Comum,4.349,2018-09-12 19:06:00,2018-09


A technical detail about the code above is that the `.dt.to_period("M")` method returns values of type `Period`, and this type is incompatible with Plotly's line graph. Because of that, we ask Pandas to handle the column as text using the `astype(str)` method.
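
The two steps can be inspected separately on a pair of dates taken from the table above. This is a minimal sketch; the `dates`, `periods`, and `months` names are hypothetical.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-06-28 17:49:00", "2018-09-12 19:06:00"]))

# Step 1: keep only year and month; the result has a period dtype.
periods = dates.dt.to_period("M")
print(periods.dtype)

# Step 2: convert the periods to plain text labels.
months = periods.astype(str)
print(months.tolist())
```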

Now that every observation is labeled with its month, we can build a pivot table to visualize the average price per month and product.



In [None]:
prices_per_month = prices.pivot_table(index="MONTH", columns="PRODUCT", values="SALE PRICE")
prices_per_month

PRODUCT,Diesel S10,Diesel S500,Etanol,GNV,Gasolina C Comum
MONTH,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2018-01,3.97,3.89,3.108,,5.122
2018-03,3.619,3.6715,3.599,5.099,4.91175
2018-04,3.694333,3.635333,2.973,,4.514333
2018-05,3.737,3.647,3.399,2.899,4.6674
2018-06,3.694588,3.566875,3.506625,2.197,4.522105
2018-07,3.487478,3.41025,3.1811,2.7435,4.465407
2018-08,3.642308,3.57525,3.04,,4.5472
2018-09,3.591118,3.583,3.445143,,4.592706
2018-10,3.776667,3.876667,3.2958,,5.0675
2018-11,3.8109,3.713,3.37125,,4.623273


Notice that the `DataFrame` above is missing many GNV values. Let's discard the data about this product.


In [None]:
prices_per_month = prices_per_month.drop("GNV", axis=1)
prices_per_month.head()

PRODUCT,Diesel S10,Diesel S500,Etanol,Gasolina C Comum
MONTH,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-01,3.97,3.89,3.108,5.122
2018-03,3.619,3.6715,3.599,4.91175
2018-04,3.694333,3.635333,2.973,4.514333
2018-05,3.737,3.647,3.399,4.6674
2018-06,3.694588,3.566875,3.506625,4.522105
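
Before dropping a column, it can help to count how many months are missing for each product. The sketch below does this on a made-up wide table whose values are copied from the first rows of the pivot table; the `toy_wide` name is hypothetical.

```python
import numpy as np
import pandas as pd

# Small wide table: one row per month, one column per product.
toy_wide = pd.DataFrame(
    {"Etanol": [3.108, 3.599, 2.973], "GNV": [np.nan, 5.099, np.nan]},
    index=pd.Index(["2018-01", "2018-03", "2018-04"], name="MONTH"),
)

# Number of months without a value, per product.
missing_per_product = toy_wide.isna().sum()
print(missing_per_product)

# Drop the product with too many gaps.
cleaned = toy_wide.drop(columns="GNV")
```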


Although the pivot table is useful for data exploration, its format is not suitable for building a line graph with Plotly. The code below converts it from the *wide* format to the *long* format, using the `stack` and `reset_index` methods.



In [None]:
prices_per_month = prices_per_month.stack().reset_index(name="SALE PRICE")
prices_per_month.head()

Unnamed: 0,MONTH,PRODUCT,SALE PRICE
0,2018-01,Diesel S10,3.97
1,2018-01,Diesel S500,3.89
2,2018-01,Etanol,3.108
3,2018-01,Gasolina C Comum,5.122
4,2018-03,Diesel S10,3.619
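
The wide-to-long conversion can be checked on a two-by-two table. The sketch below repeats the `stack` and `reset_index` route and, as an alternative that this notebook does not use, builds the same rows with `melt`. The `toy_wide` frame reuses values from the table above, and all variable names are hypothetical.

```python
import pandas as pd

# Small wide table: months as index, products as columns.
toy_wide = pd.DataFrame(
    {"Diesel S10": [3.97, 3.619], "Etanol": [3.108, 3.599]},
    index=pd.Index(["2018-01", "2018-03"], name="MONTH"),
)
toy_wide.columns.name = "PRODUCT"

# Route used in this notebook: stack the columns, then flatten the index.
long_stack = toy_wide.stack().reset_index(name="SALE PRICE")

# Alternative route: melt the columns into (PRODUCT, SALE PRICE) pairs.
long_melt = toy_wide.reset_index().melt(
    id_vars="MONTH", var_name="PRODUCT", value_name="SALE PRICE"
)
print(long_stack)
```

Both routes produce one row per month and product; only the row order differs.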


Now we can investigate the evolution of the average monthly price of each product over the whole period covered by the database. As with the histogram, we can select the series we want to analyze by interacting with the graph legend.


In [None]:
px.line(prices_per_month, x="MONTH", y="SALE PRICE", color="PRODUCT",
        title='Price evolution per observation month (month average)')