# Lab - End-to-end Machine Learning

### Dataset

We will work with the [Olist](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce) e-commerce dataset.

This dataset contains information on orders, deliveries, locations, reviews, prices, and more.

### Hypothesis

**Can we predict the rating a customer will give to the service?**

What could cause an order to receive a poor review?

1.   Late delivery
2.   The order arrived wrong, was defective, or did not meet the customer's needs

### The Modeling Flow

This flow should be cyclical: we repeat the steps until the model reaches adequate performance.

3. Feature Engineering
  * Building new informative variables that can help the model find patterns in the data
  * We can also perform feature selection, that is, remove uninformative variables that degrade model performance
4. Modeling
  * Train / test split of the data, so that overfitting can be detected on data the model has not seen
  * Building a baseline to beat
  * Building models suited to the problem being solved
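
The baseline mentioned in the modeling step can be as simple as always predicting the most frequent class. A minimal sketch, assuming scikit-learn's `DummyClassifier` and synthetic labels (the 60/40 class mix and the names `X_demo`, `y_demo` are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1000 rows, 60% of them labelled 1.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 600 + [0] * 400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The baseline ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any real model should beat this number by a clear margin; if it does not, the features are adding little.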


In [21]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 500)
import warnings

from pathlib import Path
import pickle
warnings.filterwarnings('ignore')

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
# sns.set_context('talk')
sns.set_palette('rainbow')
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

In [22]:
df = pd.read_csv("olist_final_dataset_clean.csv")

df.head()

Unnamed: 0,shipping_limit_date,price,freight_value,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,seller_zip_code_prefix,seller_city,seller_state,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,customer_zip_code_prefix,customer_city,customer_state,payment_value,review_score,customer_seller_distance
0,2017-09-19 09:45:35,58.9,13.29,cool_stuff,58.0,598.0,4.0,650.0,28.0,9.0,14.0,27277,volta redonda,SP,delivered,2017-09-13 08:59:02,2017-09-13 09:45:35,2017-09-19 18:34:16,2017-09-20 23:43:48,2017-09-29,28013,campos dos goytacazes,RJ,72.19,5.0,301.858959
1,2017-05-03 11:05:13,239.9,19.93,pet_shop,56.0,239.0,2.0,30000.0,50.0,30.0,40.0,3471,sao paulo,SP,delivered,2017-04-26 10:53:06,2017-04-26 11:05:13,2017-05-04 14:35:00,2017-05-12 16:04:24,2017-05-15,15775,santa fe do sul,SP,259.83,4.0,585.131104
2,2018-01-18 14:48:30,199.0,17.87,moveis_decoracao,59.0,695.0,2.0,3050.0,33.0,13.0,33.0,37564,borda da mata,MG,delivered,2018-01-14 14:33:31,2018-01-14 14:48:30,2018-01-16 12:36:48,2018-01-22 13:19:16,2018-02-05,35661,para de minas,MG,216.87,5.0,311.506212
3,2018-08-15 10:10:18,12.99,12.79,perfumaria,42.0,480.0,1.0,200.0,16.0,10.0,15.0,14403,franca,SP,delivered,2018-08-08 10:00:35,2018-08-08 10:10:18,2018-08-10 13:28:00,2018-08-14 13:32:39,2018-08-20,12952,atibaia,SP,25.78,4.0,292.064162
4,2017-02-13 13:57:51,199.9,18.14,ferramentas_jardim,59.0,409.0,1.0,3750.0,35.0,40.0,30.0,87900,loanda,PR,delivered,2017-02-04 13:57:51,2017-02-04 14:10:13,2017-02-16 09:46:09,2017-03-01 16:42:31,2017-03-17,13226,varzea paulista,SP,218.04,5.0,647.209811
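
The timestamp columns are loaded as plain strings, which is why later cells call `pd.to_datetime` on them. As an alternative sketch, `read_csv` can parse them at load time through `parse_dates`. The two-row inline CSV below is a stand-in built from the rows shown above, so the example runs without the real file:

```python
import io
import pandas as pd

# Stand-in for olist_final_dataset_clean.csv: two rows, date columns only.
csv_text = """order_delivered_customer_date,order_estimated_delivery_date
2017-09-20 23:43:48,2017-09-29
2017-05-12 16:04:24,2017-05-15
"""

date_cols = ["order_delivered_customer_date", "order_estimated_delivery_date"]
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=date_cols)

# Both columns are now datetime64, so they can be subtracted directly.
print(df_dates.dtypes)
```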


## Creating new columns

In [23]:
df.columns

Index(['shipping_limit_date', 'price', 'freight_value',
       'product_category_name', 'product_name_lenght',
       'product_description_lenght', 'product_photos_qty', 'product_weight_g',
       'product_length_cm', 'product_height_cm', 'product_width_cm',
       'seller_zip_code_prefix', 'seller_city', 'seller_state', 'order_status',
       'order_purchase_timestamp', 'order_approved_at',
       'order_delivered_carrier_date', 'order_delivered_customer_date',
       'order_estimated_delivery_date', 'customer_zip_code_prefix',
       'customer_city', 'customer_state', 'payment_value', 'review_score',
       'customer_seller_distance'],
      dtype='object')

In [24]:
df["order_delay_time"] = pd.to_datetime(df["order_delivered_customer_date"]) - pd.to_datetime(df["order_estimated_delivery_date"])
df["order_delay_time"] = df["order_delay_time"].dt.days
df["order_delay_time"] 

0         -9
1         -3
2        -14
3         -6
4        -16
          ..
107820    -8
107821    -9
107822   -13
107823    -9
107824   -14
Name: order_delay_time, Length: 107825, dtype: int64
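
One detail worth knowing when reading these values: `.dt.days` takes the floor of the timedelta, so an order delivered 8 days and about 16 minutes ahead of the estimate is reported as `-9`, not `-8`. A small sketch using the first row of the dataset, plus two hypothetical deliveries that show where a `> 0` threshold on this column falls:

```python
import pandas as pd

delivered = pd.to_datetime(pd.Series([
    "2017-09-20 23:43:48",  # first row of the dataset
    "2017-09-29 10:00:00",  # hypothetical: delivered on the estimated date
    "2017-09-30 00:00:01",  # hypothetical: just over one day late
]))
estimated = pd.to_datetime(pd.Series(["2017-09-29"] * 3))

delay = delivered - estimated
delay_days = delay.dt.days  # floor of the difference, in whole days
print(delay_days.tolist())
```

Because the estimated date carries no time of day, a delivery at any hour of the estimated date itself yields `0` and is not counted as late.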

In [25]:
df['is_delayed'] = df['order_delay_time'] > 0
df['is_delayed']

0         False
1         False
2         False
3         False
4         False
          ...  
107820    False
107821    False
107822    False
107823    False
107824    False
Name: is_delayed, Length: 107825, dtype: bool

In [26]:
df["order_time_to_process"] = pd.to_datetime(df["shipping_limit_date"]) - pd.to_datetime(df["order_purchase_timestamp"])
df["order_time_to_process"] = df["order_time_to_process"].dt.days
df["order_time_to_process"]

0         6
1         7
2         4
3         7
4         9
         ..
107820    8
107821    5
107822    7
107823    6
107824    3
Name: order_time_to_process, Length: 107825, dtype: int64

In [27]:
df["product_volume_cm3"] = df["product_length_cm"] * df["product_height_cm"] * df["product_width_cm"]

In [28]:
df["total_cost"] = df["price"] + df["freight_value"]
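
In the five rows displayed by `df.head()`, `price + freight_value` coincides with `payment_value`, so it may be worth checking how often the new column merely duplicates an existing one. A sketch of that check on those five rows only; on the full DataFrame the share may well be lower, since the two columns do not behave identically across the whole dataset:

```python
import numpy as np
import pandas as pd

# price, freight_value and payment_value from the five rows shown by df.head().
head = pd.DataFrame({
    "price": [58.9, 239.9, 199.0, 12.99, 199.9],
    "freight_value": [13.29, 19.93, 17.87, 12.79, 18.14],
    "payment_value": [72.19, 259.83, 216.87, 25.78, 218.04],
})
head["total_cost"] = head["price"] + head["freight_value"]

# Compare with a tolerance: floating-point sums rarely match exactly.
same = np.isclose(head["total_cost"], head["payment_value"], atol=0.005)
print(f"Rows where total_cost equals payment_value: {same.mean():.0%}")
```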

In [29]:
df['seller_state'].unique()

array(['SP', 'MG', 'PR', 'SC', 'DF', 'RS', 'RJ', 'GO', 'MA', 'ES', 'BA',
       'PI', 'RO', 'MT', 'CE', 'RN', 'PE', 'SE', 'MS', 'PB', 'PA', 'AM'],
      dtype=object)

In [30]:
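def define_region(df, col):
    """Return a list with the Brazilian macro-region of each state code in df[col]."""
    regions = []
    for d in df[col]:
        if d in ["AC", "AP", "AM", "PA", "RO", "RR", "TO"]:
            regions.append("Norte")
        elif d in ["AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE"]:
            regions.append("Nordeste")
        elif d in ["DF", "GO", "MT", "MS"]:
            regions.append("Centro Oeste")
        elif d in ["ES", "MG", "RJ", "SP"]:
            regions.append("Sudeste")
        elif d in ["PR", "RS", "SC"]:
            regions.append("Sul")
        else:
            # Fail loudly: silently skipping a value would leave the list
            # shorter than the DataFrame and break the column assignment.
            raise ValueError(f"Unknown state code: {d!r}")

    return regions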

df["seller_region"] = define_region(df, "seller_state")

df["customer_region"] = define_region(df, "customer_state")
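
`define_region` relies on every state code appearing in one of its lists. A quick sketch to confirm this for the seller states printed above; the dictionary name `REGION_BY_STATE` is hypothetical and simply restates the same lists as a lookup table:

```python
# State-to-region lookup, mirroring the lists used in define_region.
REGION_BY_STATE = {
    **dict.fromkeys(["AC", "AP", "AM", "PA", "RO", "RR", "TO"], "Norte"),
    **dict.fromkeys(["AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE"], "Nordeste"),
    **dict.fromkeys(["DF", "GO", "MT", "MS"], "Centro Oeste"),
    **dict.fromkeys(["ES", "MG", "RJ", "SP"], "Sudeste"),
    **dict.fromkeys(["PR", "RS", "SC"], "Sul"),
}

# Seller states returned by df['seller_state'].unique() above.
seller_states = ['SP', 'MG', 'PR', 'SC', 'DF', 'RS', 'RJ', 'GO', 'MA', 'ES', 'BA',
                 'PI', 'RO', 'MT', 'CE', 'RN', 'PE', 'SE', 'MS', 'PB', 'PA', 'AM']

unmapped = sorted(set(seller_states) - set(REGION_BY_STATE))
print("States without a region:", unmapped)
```

With such a table, `df[col].map(REGION_BY_STATE)` would be a vectorized alternative to the loop, returning `NaN` for any unknown code.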

In [31]:
df.columns

Index(['shipping_limit_date', 'price', 'freight_value',
       'product_category_name', 'product_name_lenght',
       'product_description_lenght', 'product_photos_qty', 'product_weight_g',
       'product_length_cm', 'product_height_cm', 'product_width_cm',
       'seller_zip_code_prefix', 'seller_city', 'seller_state', 'order_status',
       'order_purchase_timestamp', 'order_approved_at',
       'order_delivered_carrier_date', 'order_delivered_customer_date',
       'order_estimated_delivery_date', 'customer_zip_code_prefix',
       'customer_city', 'customer_state', 'payment_value', 'review_score',
       'customer_seller_distance', 'order_delay_time', 'is_delayed',
       'order_time_to_process', 'product_volume_cm3', 'total_cost',
       'seller_region', 'customer_region'],
      dtype='object')

## Dropping columns

In [32]:
df.drop(columns = ['shipping_limit_date', 'seller_zip_code_prefix', 'seller_city',
                   'seller_state', 'order_purchase_timestamp', 'order_approved_at',
                    'order_delivered_carrier_date', 'order_delivered_customer_date',
                    'order_estimated_delivery_date', 'customer_zip_code_prefix',
                    'customer_city', 'customer_state'], inplace = True)

In [33]:
df.columns

Index(['price', 'freight_value', 'product_category_name',
       'product_name_lenght', 'product_description_lenght',
       'product_photos_qty', 'product_weight_g', 'product_length_cm',
       'product_height_cm', 'product_width_cm', 'order_status',
       'payment_value', 'review_score', 'customer_seller_distance',
       'order_delay_time', 'is_delayed', 'order_time_to_process',
       'product_volume_cm3', 'total_cost', 'seller_region', 'customer_region'],
      dtype='object')

## Preparing the data for training

In [34]:
from sklearn.preprocessing import LabelEncoder

df_modelagem = df.copy()

# Encode each categorical column as integer codes. One encoder is fitted per
# column and kept in a dict so the original labels can be recovered later.
categorical_cols = ["seller_region", "customer_region", "product_category_name", "order_status"]
encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df_modelagem[col] = le.fit_transform(df_modelagem[col])
    encoders[col] = le
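
`LabelEncoder` numbers the distinct values in sorted order, so the resulting integers carry an ordering that has no real meaning for regions or categories. Tree-based models usually cope with that, while a linear model such as logistic regression reads the codes as magnitudes. The sketch below shows the codes assigned to the five region names and, as an alternative this notebook does not use, a one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

regions = pd.Series(["Sudeste", "Sul", "Norte", "Nordeste", "Centro Oeste", "Sudeste"])

# LabelEncoder sorts the distinct values and numbers them from 0.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(regions)
code_by_region = {name: code for code, name in enumerate(le_demo.classes_)}
print(code_by_region)
print(codes.tolist())

# Alternative: one 0/1 column per region, with no implied order.
one_hot = pd.get_dummies(regions, prefix="region")
print(one_hot.columns.tolist())
```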

## Defining the target

In [35]:
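# Binary target: 1 for a 5-star review, 0 for anything lower.
df_modelagem['target'] = np.where(df_modelagem['review_score'] < 5, 0, 1)
# print() is needed here because only the last expression of a cell is displayed.
print(df_modelagem['target'].value_counts())

# Drop the original score so it cannot leak into the features.
df_modelagem.drop(columns = ['review_score'], inplace = True)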
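
To make the rule concrete, the sketch below applies the same `np.where` expression to one example of each score. It also includes a missing value, as a hypothetical case, because `NaN < 5` evaluates to `False` and such a row would therefore be labelled `1`:

```python
import numpy as np
import pandas as pd

# One example of each possible review score, plus a hypothetical missing value.
scores = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan], name="review_score")

# Same rule as above: anything that is not strictly below 5 becomes class 1.
target_demo = np.where(scores < 5, 0, 1)
print(target_demo.tolist())

# Number of missing scores; worth confirming it is zero on the real data.
n_missing = int(scores.isna().sum())
print("Missing scores:", n_missing)
```

The file name suggests the data has already been cleaned, so this is only a check to run once, for example with `df['review_score'].isna().sum()` before the target is created.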

In [36]:
df_modelagem.corr()['target'].sort_values()

is_delayed                   -0.217949
order_delay_time             -0.156001
customer_seller_distance     -0.055029
payment_value                -0.047895
freight_value                -0.027849
product_weight_g             -0.026718
product_length_cm            -0.025050
product_volume_cm3           -0.023352
order_time_to_process        -0.021808
product_name_lenght          -0.021234
product_height_cm            -0.020203
product_width_cm             -0.014549
product_category_name        -0.008557
order_status                  0.004721
seller_region                 0.006440
total_cost                    0.007142
product_photos_qty            0.009464
price                         0.009841
product_description_lenght    0.010971
customer_region               0.031289
target                        1.000000
Name: target, dtype: float64
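
When reading this table, the sign gives the direction of the relationship and the absolute value its strength. A short sketch that ranks a subset of the values above by magnitude (the numbers are copied from the output; the variable names are illustrative):

```python
import pandas as pd

# A few of the correlations with the target reported above.
corr_target = pd.Series({
    "is_delayed": -0.217949,
    "order_delay_time": -0.156001,
    "customer_seller_distance": -0.055029,
    "payment_value": -0.047895,
    "total_cost": 0.007142,
    "price": 0.009841,
    "customer_region": 0.031289,
})

# Rank by absolute value so strong negative correlations are not buried.
ranked = corr_target.abs().sort_values(ascending=False)
print(ranked.head(3))
```

Pearson correlation only captures linear association, and the values for label-encoded columns such as `customer_region` depend on the arbitrary integer codes, so they should be read with caution.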

## Initial feature selection

In [37]:
# Required imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [38]:
features = [col for col in df_modelagem.columns if col != 'target']
target = 'target'

X = df_modelagem[features]
y = df_modelagem[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
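
The split above is purely random. When the classes are imbalanced, passing `stratify=y` keeps the class proportions the same in both parts. A sketch on synthetic labels (the 58/42 mix is an assumption chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 58% class 1, 42% class 0.
y_demo = np.array([1] * 580 + [0] * 420)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep (almost exactly) the same share of class 1.
print(f"train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```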

In [39]:
# Decision Tree model
model_tree = DecisionTreeClassifier()
model_tree.fit(X_train, y_train)
y_pred_tree = model_tree.predict(X_test)
print("Decision Tree Classifier")
print(classification_report(y_test, y_pred_tree, digits=4))

# Random Forest model
model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)
print("Random Forest Classifier")
print(classification_report(y_test, y_pred_rf, digits=4))

# Logistic Regression model
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)
print("Logistic Regression Classifier")
print(classification_report(y_test, y_pred_lr, digits=4))

# # SVM model
# model_svm = SVC()
# model_svm.fit(X_train, y_train)
# y_pred_svm = model_svm.predict(X_test)
# print("Support Vector Machine Classifier")
# print(classification_report(y_test, y_pred_svm, digits=4))

# XGBoost model
model_xgb = XGBClassifier()
model_xgb.fit(X_train, y_train)
y_pred_xgb = model_xgb.predict(X_test)
print("XGBoost Classifier")
print(classification_report(y_test, y_pred_xgb, digits=4))

Decision Tree Classifier
              precision    recall  f1-score   support

           0     0.5459    0.5649    0.5552      9033
           1     0.6783    0.6613    0.6697     12532

    accuracy                         0.6209     21565
   macro avg     0.6121    0.6131    0.6125     21565
weighted avg     0.6228    0.6209    0.6217     21565
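
As a reading aid for these reports, the sketch below re-derives two of the Decision Tree numbers from the others and adds the share of the majority class, which is the accuracy obtained by always predicting `1`. All inputs are copied from the report above:

```python
# Values copied from the Decision Tree report above.
precision_0, recall_0, support_0 = 0.5459, 0.5649, 9033
recall_1, support_1 = 0.6613, 12532
total = support_0 + support_1

# F1 is the harmonic mean of precision and recall.
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Accuracy is the support-weighted average of the per-class recalls.
accuracy = (recall_0 * support_0 + recall_1 * support_1) / total

# Share of the majority class: the accuracy of always predicting 1.
majority_share = support_1 / total

print(f"f1(class 0)={f1_0:.4f}  accuracy={accuracy:.4f}  majority={majority_share:.4f}")
```

Comparing each model's accuracy against the majority share shows how much it actually gains over a trivial predictor.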

Random Forest Classifier
              precision    recall  f1-score   support

           0     0.6554    0.4665    0.5450      9033
           1     0.6816    0.8232    0.7457     12532

    accuracy                         0.6738     21565
   macro avg     0.6685    0.6448    0.6454     21565
weighted avg     0.6706    0.6738    0.6617     21565

Logistic Regression Classifier
              precision    recall  f1-score   support

           0     0.6340    0.2569    0.3657      9033
           1     0.6251    0.8931    0.7354     12532

    accuracy                         0.6266     21565
   macro avg     0.6295    0.5750    0.5506    