## Notebook #2: Cleaning and preparing the data

#### This notebook measures and defines the data used for the solution, cleaning the raw data as needed and preparing a base for the analysis.

As a first pass, only orders marked as completed (`delivered`) are considered. This keeps cancelled orders from influencing the analysis, since the metrics would not apply to them in the same way.

### Cleaning

In [None]:
import pandas as pd

#### Loading the datasets and parsing the date columns:

In [None]:
orders = pd.read_csv("../data/raw/olist_orders_dataset.csv", parse_dates=[
    'order_purchase_timestamp', 
    'order_approved_at', 
    'order_delivered_carrier_date', 
    'order_delivered_customer_date', 
    'order_estimated_delivery_date'
])

In [None]:
reviews = pd.read_csv("../data/raw/olist_order_reviews_dataset.csv", parse_dates=[
    'review_creation_date',
    'review_answer_timestamp'
])

In [None]:
customers = pd.read_csv("../data/raw/olist_customers_dataset.csv")
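
As a minimal sketch of what `parse_dates` changes, the cell below reads the same one-row CSV twice from an in-memory string (the `csv_text` content is made up for illustration, it is not read from the Olist files) and compares the resulting dtypes.

```python
import io

import pandas as pd

# Hypothetical one-row CSV, only to illustrate the effect of parse_dates.
csv_text = "order_id,order_purchase_timestamp\na,2017-10-02 10:56:33\n"

with_dates = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["order_purchase_timestamp"]
)
without_dates = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is a datetime column, so it supports
# subtraction and the .dt accessor; without it, it stays as text.
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(
    with_dates["order_purchase_timestamp"]
)
raw_is_datetime = pd.api.types.is_datetime64_any_dtype(
    without_dates["order_purchase_timestamp"]
)
print(parsed_is_datetime, raw_is_datetime)
```

Parsing at load time is what lets the later cells subtract the timestamp columns directly.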

#### Keeping only delivered orders, creating metrics for the total number of days until delivery and for the delay relative to the estimated date in the orders dataset, and dropping invalid values:

Negative values of `delay_time_days` indicate deliveries made before the estimated date.

In [None]:
orders_clean = orders[orders['order_status'] == 'delivered'].copy()

In [None]:
orders_clean['delivery_time_days'] = (orders_clean['order_delivered_customer_date'] - orders_clean['order_purchase_timestamp']).dt.days

In [None]:
orders_clean["delay_time_days"] = (orders_clean["order_delivered_customer_date"] - orders_clean["order_estimated_delivery_date"]).dt.days

In [None]:
orders_clean = orders_clean.dropna(subset=[
    'order_delivered_customer_date',
    'order_estimated_delivery_date'
])

orders_clean = orders_clean[orders_clean["delivery_time_days"] >= 0]
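
To make the two metrics concrete, here is a small sketch that recomputes them for a single order, using the timestamps of the first row shown in the results output of this notebook. It also illustrates one detail of `.dt.days`: it returns the day component of the timedelta, which rounds down, so a negative difference of about 7 days and 2.5 hours becomes -8 rather than -7.

```python
import pandas as pd

# Timestamps copied from the first row of the notebook's results output.
sample = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(["2017-10-02 10:56:33"]),
    "order_delivered_customer_date": pd.to_datetime(["2017-10-10 21:25:13"]),
    "order_estimated_delivery_date": pd.to_datetime(["2017-10-18"]),
})

delivery = (
    sample["order_delivered_customer_date"] - sample["order_purchase_timestamp"]
)
delay = (
    sample["order_delivered_customer_date"]
    - sample["order_estimated_delivery_date"]
)

# 8 days 10:28:40 has a day component of 8.
delivery_days = delivery.dt.days
# The difference is stored as "-8 days +21:25:13", so the day component is -8.
delay_days = delay.dt.days

print(delivery_days.iloc[0], delay_days.iloc[0])
```

Because of this rounding, a `delay_time_days` of 0 covers deliveries made on the estimated date itself, and any earlier delivery is at least -1.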

#### Cleaning the reviews:

In [None]:
reviews_clean = reviews.dropna(subset=['review_score'])
reviews_clean = reviews_clean[reviews_clean["review_score"].between(1, 5)]
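
A minimal sketch of the same two filters on made-up scores (the `scores_toy` frame is hypothetical), showing that `between(1, 5)` includes both endpoints and that missing values do not pass it.

```python
import pandas as pd

# Hypothetical scores: valid values, a missing value, and two out-of-range ones.
scores_toy = pd.DataFrame({"review_score": [1, 5, None, 0, 6, 3]})

cleaned = scores_toy.dropna(subset=["review_score"])
cleaned = cleaned[cleaned["review_score"].between(1, 5)]

# The endpoints 1 and 5 are kept; 0, 6 and the missing value are dropped.
kept = cleaned["review_score"].tolist()

# between() evaluates to False for NaN, so the range filter alone
# would also remove the missing score.
nan_passes = bool(scores_toy["review_score"].between(1, 5)[2])
print(kept, nan_passes)
```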

#### Joining the orders with their corresponding reviews:

In [None]:
df = orders_clean.merge(
    reviews_clean[['order_id', 'review_score', 'review_comment_message']],
    on='order_id',
    how='left'
)
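
A left merge keeps every order, leaves the review columns empty for orders without a review, and repeats an order once per matching review row. The sketch below shows both effects on made-up frames (`orders_toy` and `review_rows_toy` are hypothetical), and uses the `validate` argument of `merge` as one possible way to detect repeated keys. It is an illustration, not a statement about what the Olist files contain.

```python
import pandas as pd

# Hypothetical data: order "a" has two reviews, order "c" has none.
orders_toy = pd.DataFrame({"order_id": ["a", "b", "c"]})
review_rows_toy = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "review_score": [5, 3, 4],
})

merged = orders_toy.merge(review_rows_toy, on="order_id", how="left")

n_rows = len(merged)  # "a" appears twice, so 3 orders become 4 rows
n_missing = int(merged["review_score"].isna().sum())  # order "c" has no score

# validate="one_to_one" raises MergeError when a key repeats on either side.
try:
    orders_toy.merge(
        review_rows_toy, on="order_id", how="left", validate="one_to_one"
    )
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False

print(n_rows, n_missing, is_one_to_one)
```

Comparing `len(df)` with `len(orders_clean)` after the merge is a quick way to see whether any order was repeated.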

#### Joining with the corresponding customers:

In [None]:
df = df.merge(
    customers[["customer_id", "customer_city", "customer_state"]],
    on="customer_id",
    how="left"
)

#### Results:

In [None]:
df.info()
df.describe()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96999 entries, 0 to 96998
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       96999 non-null  object        
 1   customer_id                    96999 non-null  object        
 2   order_status                   96999 non-null  object        
 3   order_purchase_timestamp       96999 non-null  datetime64[ns]
 4   order_approved_at              96985 non-null  datetime64[ns]
 5   order_delivered_carrier_date   96998 non-null  datetime64[ns]
 6   order_delivered_customer_date  96999 non-null  datetime64[ns]
 7   order_estimated_delivery_date  96999 non-null  datetime64[ns]
 8   delivery_time_days             96999 non-null  float64       
 9   delay_time_days                96999 non-null  float64       
 10  review_score                   96353 non-null  float64       
 11  review_comment_

| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | delivery_time_days | delay_time_days | review_score | review_comment_message | customer_city | customer_state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 | 8.0 | -8.0 | 4.0 | Não testei o produto ainda, mas ele veio corre... | sao paulo | SP |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 | 13.0 | -6.0 | 4.0 | Muito bom o produto. | barreiras | BA |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 | 9.0 | -18.0 | 5.0 | | vianopolis | GO |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 | 13.0 | -13.0 | 5.0 | O produto foi exatamente o que eu esperava e e... | sao goncalo do amarante | RN |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 | 2.0 | -10.0 | 5.0 | | santo andre | SP |


The results indicate a good cleaning for the initial purpose. Some values, such as the negative "delay times", may look odd, but they can be useful for analysing how review scores vary with the number of days a delivery takes, even when it arrives within the estimated date.
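
As a rough sketch of that kind of analysis, the cell below groups review scores by delay bucket. It uses only the `delay_time_days` and `review_score` values of the five rows shown in the output above, and the bucket edges and labels are assumptions chosen for illustration, so the numbers demonstrate the mechanics and are not a finding.

```python
import pandas as pd

# delay_time_days and review_score from the five rows shown in the output.
toy = pd.DataFrame({
    "delay_time_days": [-8, -6, -18, -13, -10],
    "review_score": [4, 4, 5, 5, 5],
})

# Hypothetical buckets: (-inf, -10], (-10, 0], (0, inf).
toy["delay_bucket"] = pd.cut(
    toy["delay_time_days"],
    bins=[-float("inf"), -10, 0, float("inf")],
    labels=["10 or more days early", "on the estimated date to 9 days early", "late"],
)

# observed=True drops buckets with no rows (here, "late").
mean_by_bucket = toy.groupby("delay_bucket", observed=True)["review_score"].mean()
print(mean_by_bucket)
```

On the full `df`, the same grouping could be applied after dropping rows without a `review_score`.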

In [None]:
df.to_csv("../data/processed/olist_clean.csv", index=False)