# Case Ifood
*Developed by Mário de Deus*

# Installs

In [1]:
# !pip uninstall numpy -q
# !pip install numpy==1.19 -q
# !pip install numba==0.54.1 -q
# !pip install pycaret -U -q

# Imports

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns

from pycaret.classification import *

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 40)
pd.set_option("display.max_colwidth", 1000)

import warnings

warnings.filterwarnings("ignore")

# Problem Description / Objective

* The objective

The team's objective is to build a predictive model that yields the highest profit for the next direct marketing campaign, scheduled for next month. The new campaign, the sixth, aims to sell a new gadget to customers registered in the company's database. To build the model, a pilot campaign involving 2,240 customers was carried out. The customers were selected at random and contacted by phone about buying the gadget. Over the following months, customers who bought the offer were labeled accordingly. The total cost of the sample campaign was 6,720 MU and the revenue generated by the customers who accepted the offer was 3,674 MU. Overall, the campaign had a profit of -3,046 MU. The campaign's success rate was 15%.

The team's objective is to develop a model that predicts customer behavior and to apply it to the rest of the customer base. Ideally, the model will let the company cherry-pick the customers most likely to buy the offer and leave out the non-respondents, making the next campaign highly profitable. Besides maximizing the campaign's profit, the CMO is also interested in studying the characteristics of the customers who are willing to buy the gadget.

* The data

The dataset contains sociodemographic and firmographic features for the 2,240 contacted customers. It also contains a flag for the customers who responded to the campaign by buying the product.
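The campaign figures quoted in the objective can be reproduced with a few lines of arithmetic. This is only a sketch under stated assumptions: the flat cost of 3 MU per contact and revenue of 11 MU per accepted offer are not given in the brief. They are inferred from the constant `Z_CostContact` and `Z_Revenue` columns of the dataset, and the count of 334 buyers is inferred as 3,674 / 11.

```python
# Assumed unit economics (inferred from the data, not stated in the brief)
COST_PER_CONTACT = 3      # MU, value of the constant Z_CostContact column
REVENUE_PER_SALE = 11     # MU, value of the constant Z_Revenue column

contacted = 2240          # customers in the pilot campaign
buyers = 334              # inferred: 3,674 MU of revenue / 11 MU per sale

total_cost = contacted * COST_PER_CONTACT
total_revenue = buyers * REVENUE_PER_SALE
profit = total_revenue - total_cost
success_rate = buyers / contacted

print(f"cost={total_cost} MU, revenue={total_revenue} MU, profit={profit} MU")
print(f"success rate={success_rate:.1%}")
```

If these assumptions hold, the totals match the brief, which suggests the loss comes from contacting every customer regardless of how likely they are to buy.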


# Data Loading

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# Google Colab
df = pd.read_csv("data.csv", encoding="utf-8")

# Jupyter
# df = pd.read_csv('data.csv',encoding='utf-8')

df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,3,11,0


# Data Cleaning

Drop the ID feature, since it is only an identifier.

In [5]:
df.drop("ID", axis=1, inplace=True, errors="ignore")
df.shape

(2240, 28)

## Single-valued features
Check for features that hold a single unique value (they should be dropped because they do not help explain the variation of the target feature).

In [6]:
df.nunique().sort_values()

Z_Revenue                 1
Z_CostContact             1
Response                  2
AcceptedCmp3              2
AcceptedCmp4              2
AcceptedCmp5              2
AcceptedCmp2              2
AcceptedCmp1              2
Complain                  2
Teenhome                  3
Kidhome                   3
Education                 5
Marital_Status            8
NumCatalogPurchases      14
NumStorePurchases        14
NumDealsPurchases        15
NumWebPurchases          15
NumWebVisitsMonth        16
Year_Birth               59
Recency                 100
MntFruits               158
MntSweetProducts        177
MntFishProducts         182
MntGoldProds            213
MntMeatProducts         558
Dt_Customer             663
MntWines                776
Income                 1974
dtype: int64

In [7]:
df.drop(["Z_CostContact", "Z_Revenue"], axis=1, inplace=True, errors="ignore")
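Instead of reading the constant columns off the `nunique()` output by hand, they can be selected programmatically. A minimal sketch on a toy frame; the frame `toy` and its values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative)
toy = pd.DataFrame(
    {
        "Income": [58138.0, 46344.0, 71613.0],
        "Z_CostContact": [3, 3, 3],
        "Z_Revenue": [11, 11, 11],
        "Response": [1, 0, 0],
    }
)

# Columns with at most one distinct value carry no information about the target
constant_cols = toy.columns[toy.nunique() <= 1].tolist()
toy_clean = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy_clean.columns.tolist())
```

Using `<= 1` also catches columns that are entirely NaN, because `nunique()` ignores missing values by default.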

## NaN analysis

In [8]:
df.isna().sum()

Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Response                0
dtype: int64

Only the Income feature has null values.
Next, the rows with null values are analyzed against the values of the target feature.

In [9]:
# Distribution of Response among the rows where Income is NaN
df[df.Income.isna()].Response.value_counts()

0    23
1     1
Name: Response, dtype: int64

In [10]:
# Proportion of 0s and 1s in Response over the full df
df.Response.value_counts(normalize=True)

0    0.850893
1    0.149107
Name: Response, dtype: float64

In [11]:
print("% rows with NaN: ", np.round((df.Income.isna().sum() / len(df)) * 100, 2))
print(
    "% rows with NaN and Response = 1: ",
    np.round(((len(df[(df.Income.isna()) & (df.Response == 1)]) / len(df)) * 100), 2),
)

% rows with NaN:  1.07
% rows with NaN and Response = 1:  0.04


Since the 24 rows with NaN values represent about 1% of the dataset, and only one of those 24 rows has Response == 1 (0.04% of the dataset), the 24 rows will be dropped.
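One extra check that can support or challenge this decision is to compare the response rate among the rows with missing `Income` against the overall rate, and to see what share of all responders would be lost. The sketch below uses only the counts reported above; the total of 334 responders is an inferred figure (0.149107 × 2,240), not one printed by the notebook.

```python
n_total, n_missing = 2240, 24
responders_missing = 1
responders_total = round(0.149107 * n_total)   # inferred total number of responders

share_missing = n_missing / n_total                       # rows affected by NaN
rate_missing = responders_missing / n_missing             # response rate within NaN rows
rate_overall = responders_total / n_total                 # response rate in the full data
responders_lost = responders_missing / responders_total   # responders removed by the drop

print(f"rows with NaN:            {share_missing:.2%}")
print(f"response rate (NaN rows): {rate_missing:.2%}")
print(f"response rate (overall):  {rate_overall:.2%}")
print(f"responders lost:          {responders_lost:.2%}")
```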

In [12]:
print("Rows before dropna: ", df.shape[0])
df.dropna(axis=0, inplace=True)
print("Rows after dropna: ", df.shape[0])

Rows before dropna:  2240
Rows after dropna:  2216


## Dtype adjustment

In [13]:
df = df.convert_dtypes()
df.Dt_Customer = pd.to_datetime(df.Dt_Customer)
df.Response = df.Response.astype("bool")
df.dtypes

Year_Birth                      Int64
Education                      string
Marital_Status                 string
Income                          Int64
Kidhome                         Int64
Teenhome                        Int64
Dt_Customer            datetime64[ns]
Recency                         Int64
MntWines                        Int64
MntFruits                       Int64
MntMeatProducts                 Int64
MntFishProducts                 Int64
MntSweetProducts                Int64
MntGoldProds                    Int64
NumDealsPurchases               Int64
NumWebPurchases                 Int64
NumCatalogPurchases             Int64
NumStorePurchases               Int64
NumWebVisitsMonth               Int64
AcceptedCmp3                    Int64
AcceptedCmp4                    Int64
AcceptedCmp5                    Int64
AcceptedCmp1                    Int64
AcceptedCmp2                    Int64
Complain                        Int64
Response                         bool
dtype: object

# Feature Engineering

## Customer age

In [14]:
from datetime import datetime

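# pd.datetime is deprecated and absent from recent pandas releases;
# use the datetime class imported above instead
ano_atual = datetime.now().year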
df["Age"] = ano_atual - df.Year_Birth
df.drop("Year_Birth", axis=1, errors="ignore", inplace=True)
df.head()

Unnamed: 0,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response,Age
0,Graduation,Single,58138,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,True,66
1,Graduation,Single,46344,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,False,69
2,Graduation,Together,71613,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,False,58
3,Graduation,Together,26646,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,False,39
4,PhD,Married,58293,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,False,42


## Time as a customer

In [15]:
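# Dt_Customer is already datetime64, so it can be subtracted directly from today's date
today = pd.Timestamp.now().normalize()
# 365.2425 days is the mean Gregorian year, the length np.timedelta64(1, "Y") stood for
df["Time_Customer"] = (today - df["Dt_Customer"]).dt.days / 365.2425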
print(df[["Dt_Customer", "Time_Customer"]].head())
df.drop("Dt_Customer", axis=1, inplace=True)

  Dt_Customer  Time_Customer
0  2012-09-04      10.491660
1  2014-03-08       8.985811
2  2013-08-21       9.530654
3  2014-02-10       9.056996
4  2014-01-19       9.117230
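Because both `Age` and `Time_Customer` are computed from the current date, their values change every time the notebook is re-run. One possible alternative, shown here only as a sketch, is to pin a reference date. The `REFERENCE_DATE` constant and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical fixed reference date (an assumption; choose what suits the analysis)
REFERENCE_DATE = pd.Timestamp("2023-03-01")

toy = pd.DataFrame(
    {
        "Year_Birth": [1957, 1954],
        "Dt_Customer": pd.to_datetime(["2012-09-04", "2014-03-08"]),
    }
)

toy["Age"] = REFERENCE_DATE.year - toy["Year_Birth"]
# Tenure in years, using the mean Gregorian year length of 365.2425 days
toy["Time_Customer"] = (REFERENCE_DATE - toy["Dt_Customer"]).dt.days / 365.2425

print(toy[["Age", "Time_Customer"]].round(2))
```

With a pinned date, the engineered features and every downstream model metric stay reproducible across runs.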


## Removing inconsistent values of the Marital_Status variable

In [16]:
# Drop rows whose Marital_Status is not a meaningful category
invalid_status = ["YOLO", "Absurd", "absurd", "Alone"]
df = df[~df["Marital_Status"].isin(invalid_status)].reset_index(drop=True)
print(df.shape)

(2209, 26)


In [17]:
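# Sort the columns alphabetically while keeping the target last: the temporary
# "z_" prefix makes Response sort after every other column name
df.rename(columns={"Response": "z_Response"}, inplace=True)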
cols = df.columns.sort_values()
df = df[cols]
df.rename(columns={"z_Response": "Response"}, inplace=True)

df.columns

Index(['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4',
       'AcceptedCmp5', 'Age', 'Complain', 'Education', 'Income', 'Kidhome',
       'Marital_Status', 'MntFishProducts', 'MntFruits', 'MntGoldProds',
       'MntMeatProducts', 'MntSweetProducts', 'MntWines',
       'NumCatalogPurchases', 'NumDealsPurchases', 'NumStorePurchases',
       'NumWebPurchases', 'NumWebVisitsMonth', 'Recency', 'Teenhome',
       'Time_Customer', 'Response'],
      dtype='object')

# Preparing the dataset for modeling


## Train Test Validation Split

In [18]:
# hold out 5% of the data as unseen data; the other 95% is used for modeling
df_train_test = df.sample(frac=0.95, random_state=123)
df_valid = df.drop(df_train_test.index)
df_train_test.reset_index(inplace=True, drop=True)
df_valid.reset_index(inplace=True, drop=True)
# print the revised shape
print("Data for Modeling: " + str(df_train_test.shape))
print("Unseen Data For Predictions: " + str(df_valid.shape))

Data for Modeling: (2099, 26)
Unseen Data For Predictions: (110, 26)
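A plain random sample does not guarantee that the 5% hold-out keeps the same share of responders as the modeling set, which matters when only about 15% of the rows are positive. The sketch below checks the class balance of both parts on a synthetic frame; the data and the name `toy` are illustrative assumptions.

```python
import pandas as pd

# Synthetic stand-in: 200 rows with 15% responders (illustrative only)
toy = pd.DataFrame({"Response": [True] * 30 + [False] * 170})

# Same splitting approach as above: sample 95%, hold out the rest
train_test = toy.sample(frac=0.95, random_state=123)
valid = toy.drop(train_test.index)

# Share of each class in both parts; a class missing from one part shows as 0
balance = pd.DataFrame(
    {
        "modeling": train_test["Response"].value_counts(normalize=True),
        "unseen": valid["Response"].value_counts(normalize=True),
    }
).fillna(0.0)
print(balance)
```

If the two columns differ noticeably on the real data, a stratified split is a reasonable alternative to consider.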


# AutoML - PyCaret

**For this business problem, Precision is the most relevant metric: every contacted customer who does not buy adds cost without revenue.**
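The role of precision can be made concrete with the unit economics of the pilot. Assuming each contact costs 3 MU and each accepted offer brings in 11 MU (values inferred from the constant `Z_CostContact` and `Z_Revenue` columns, not stated in the brief), contacting only the customers flagged by the model pays off when precision exceeds 3/11. A minimal sketch, with `expected_profit` as a hypothetical helper:

```python
COST_PER_CONTACT = 3     # MU, assumed from Z_CostContact
REVENUE_PER_SALE = 11    # MU, assumed from Z_Revenue


def expected_profit(n_contacted: int, precision: float) -> float:
    """Profit when only the predicted positives are contacted."""
    buyers = n_contacted * precision
    return buyers * REVENUE_PER_SALE - n_contacted * COST_PER_CONTACT


break_even_precision = COST_PER_CONTACT / REVENUE_PER_SALE
print(f"break-even precision: {break_even_precision:.3f}")

# Contacting everyone corresponds to a precision equal to the base rate (~15%)
print(expected_profit(2240, 334 / 2240))   # pilot campaign
print(expected_profit(500, 0.65))          # hypothetical targeted campaign
```

Under these assumptions the pilot's base rate of about 15% sits well below the break-even point, which is consistent with the negative profit reported for the pilot.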

## Setup

In [22]:
s = setup(
    data=df_train_test,
    target="Response",
    fix_imbalance=False,
    remove_outliers=True,
    session_id=123,
    categorical_features=["Education", "Marital_Status"],
)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Response
2,Target type,Binary
3,Original data shape,"(2099, 26)"
4,Transformed data shape,"(2047, 34)"
5,Transformed train set shape,"(1408, 34)"
6,Transformed test set shape,"(630, 34)"
7,Numeric features,23
8,Categorical features,2
9,Preprocess,True


In [23]:
# check available models
# can only be called after setup() has been run
models()

ID,Name,Reference,Turbo
lr,Logistic Regression,sklearn.linear_model._logistic.LogisticRegression,True
knn,K Neighbors Classifier,sklearn.neighbors._classification.KNeighborsClassifier,True
nb,Naive Bayes,sklearn.naive_bayes.GaussianNB,True
dt,Decision Tree Classifier,sklearn.tree._classes.DecisionTreeClassifier,True
svm,SVM - Linear Kernel,sklearn.linear_model._stochastic_gradient.SGDClassifier,True
rbfsvm,SVM - Radial Kernel,sklearn.svm._classes.SVC,False
gpc,Gaussian Process Classifier,sklearn.gaussian_process._gpc.GaussianProcessClassifier,False
mlp,MLP Classifier,sklearn.neural_network._multilayer_perceptron.MLPClassifier,False
ridge,Ridge Classifier,sklearn.linear_model._ridge.RidgeClassifier,True
rf,Random Forest Classifier,sklearn.ensemble._forest.RandomForestClassifier,True


## Model comparison

In [28]:
best_model = compare_models(sort="auc", errors="raise")

ID,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.8883,0.8949,0.5334,0.655,0.5856,0.5223,0.5272,0.026
lightgbm,Light Gradient Boosting Machine,0.8727,0.8825,0.3443,0.6605,0.4421,0.3797,0.4103,0.037
rf,Random Forest Classifier,0.8543,0.879,0.163,0.5746,0.2452,0.1931,0.2435,0.044
gbc,Gradient Boosting Classifier,0.872,0.8761,0.3346,0.6618,0.4389,0.3756,0.4062,0.034
et,Extra Trees Classifier,0.8747,0.8593,0.2666,0.7382,0.3787,0.3289,0.3859,0.054
ada,Ada Boost Classifier,0.8727,0.8518,0.3844,0.6392,0.4726,0.4062,0.427,0.04
qda,Quadratic Discriminant Analysis,0.7627,0.8426,0.6654,0.4172,0.4963,0.3787,0.3888,0.025
lr,Logistic Regression,0.8543,0.7926,0.1852,0.5487,0.2739,0.2153,0.2558,0.039
nb,Naive Bayes,0.7937,0.7791,0.5379,0.3725,0.439,0.3181,0.3269,0.026
dt,Decision Tree Classifier,0.8244,0.6547,0.4121,0.416,0.4115,0.3089,0.3102,0.027


In [29]:
print(best_model)

LinearDiscriminantAnalysis(covariance_estimator=None, n_components=None,
                           priors=None, shrinkage=None, solver='svd',
                           store_covariance=False, tol=0.0001)


## Model analysis

In [None]:
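# evaluate model / threshold
evaluate_model(best_model)

In [None]:
# plot model - auc
plot_model(best_model, plot="auc")

In [None]:
# plot model - confusion matrix
plot_model(best_model, plot="confusion_matrix")

In [None]:
# plot model - feature
plot_model(best_model, plot="feature")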

In [None]:
# plot model - raw score

* Other plot types:
https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.plot_model
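Since the model outputs a score per customer, the decision threshold can be chosen to maximize campaign profit rather than left at the default. The sketch below does this in plain Python on made-up scores and labels; the data, the `profit_at` helper, and the 3 MU / 11 MU unit values are all assumptions for illustration.

```python
COST_PER_CONTACT, REVENUE_PER_SALE = 3, 11   # MU, assumed unit economics

# Made-up predicted probabilities and true labels (1 = bought)
scores = [0.92, 0.81, 0.67, 0.55, 0.43, 0.38, 0.21, 0.15, 0.09, 0.04]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]


def profit_at(threshold: float) -> int:
    """Profit if every customer scoring at or above the threshold is contacted."""
    contacted = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(contacted) * REVENUE_PER_SALE - len(contacted) * COST_PER_CONTACT


# Candidate thresholds: each observed score
best_threshold = max(sorted(set(scores)), key=profit_at)
print(best_threshold, profit_at(best_threshold))
```

On real data the same search would be run on a validation split, not on the rows used for training, to avoid an optimistic threshold.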

## Creating a model

In [None]:
# create model lgbm
lgbm = create_model("lightgbm")

In [None]:
# create model rf
rf = create_model("rf")

## Hyperparameter tuning

### LGBM

In [None]:
tuned_lgbm = tune_model(lgbm, optimize="AUC")

In [None]:
# predict
predict_model(tuned_lgbm)

* AUC increased/decreased from XXX to XXX after the hyperparameter tuning step

### RFC

In [None]:
tuned_rfc = tune_model(rf, optimize="AUC")

* The RFC model showed a slight worsening/improvement in AUC after hyperparameter tuning, from XXX to XXX

In [None]:
# predict
predict_model(tuned_rfc)

# Best model: tuned LGBM
* The LGBM and RFC models were compared before and after hyperparameter tuning, and in both cases LGBM had the higher AUC.

## AUC Plot

In [None]:
# auc

## Feature Importance

In [None]:
# feature

## Confusion matrix

In [None]:
# confusion matrix

# References:
* https://towardsdatascience.com/introduction-to-binary-classification-with-pycaret-a37b3e89ad8d
* https://pycaret.gitbook.io/docs/get-started/quickstart#classification
* https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.plot_model