

**Author: Joyce Maria do Carmo de Sá, August 18, 2022**

All experiments, along with their evaluation metrics, can be viewed on DagsHub:   
https://dagshub.com/joycesafg/DataMaster_Case/experiments

The goal is to build a model that identifies dissatisfied customers, so that we can target them with a campaign and keep them from churning in the way that yields the highest profit.

Initial commands to retrieve the data. A Kaggle API key (kaggle.json) is required for the download.

In [None]:

# ! pip install kaggle
# ! mkdir ~/.kaggle
# ! cp kaggle.json ~/.kaggle/
# ! chmod 600 ~/.kaggle/kaggle.json
# ! kaggle competitions download -c santander-customer-satisfaction -f train.csv

This step extracts the zipped data so it can be loaded into a pandas DataFrame. The data is also available as a CSV in the "data" folder.

In [None]:

# import pandas as pd
# import zipfile
# import os

# zip_file_path = 'train.csv.zip'
# extract_to = 'train_dataset'

# os.makedirs(extract_to, exist_ok=True)

# with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
#     zip_ref.extractall(extract_to)

# extracted_files = zip_ref.namelist()

# csv_file_path = os.path.join(extract_to, extracted_files[0])
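
As a shorter alternative to the manual extraction above, pandas can usually read a zip archive that holds a single CSV directly. The sketch below builds a tiny stand-in archive in a temporary folder so that it runs without the Kaggle download; the columns and values are illustrative only and are not the real dataset.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in for 'train.csv.zip' (illustrative data, not the Kaggle file)
tmp_dir = tempfile.mkdtemp()
zip_file_path = os.path.join(tmp_dir, "train.csv.zip")
with zipfile.ZipFile(zip_file_path, "w") as zip_ref:
    zip_ref.writestr("train.csv", "ID,var15,TARGET\n1,23,0\n2,34,1\n")

# pandas infers zip compression from the '.zip' extension;
# the archive must contain exactly one file for this to work
df_from_zip = pd.read_csv(zip_file_path)
print(df_from_zip.shape)
```

If the archive held several files, the explicit `zipfile` route shown in the commented cell would still be needed to pick one.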



In [None]:
# Imports required to run the code
import pandas as pd
import optuna
import numpy as np 
from sklearn.preprocessing import  MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, make_scorer, roc_auc_score
from sklearn.model_selection import KFold
import warnings
import mlflow
import dagshub

from functions import *

## How to install the requirements

To install the requirements needed to run this notebook, follow the steps below:

1. Make sure Python is installed on your machine. We recommend Python 3.6 or later.
2. Create a virtual environment (optional, but recommended):
    ```bash
    python -m venv myenv
    source myenv/bin/activate  # On Windows, use `myenv\Scripts\activate`
    ```
3. Install the packages listed in the `requirements_dev.txt` file:
    ```bash
    pip install -r requirements.txt
    ```


In [None]:
# Initialize MLflow and DagsHub to track the experiments

dagshub.init(repo_owner='joycesafg', repo_name='DataMaster_Case', mlflow=True)
mlflow.set_tracking_uri(uri="https://dagshub.com/joycesafg/DataMaster_Case.mlflow")
mlflow.set_experiment("Data_Master_ML_Model")

In [None]:
#Read the dataset
df_case = pd.read_csv('data/train.csv')

#Check the types of the columns
print(df_case.dtypes.value_counts())

#Check if there are any missing values
print(df_case.isnull().values.any())

# 3.96% of the clients are unsatisfied
print(round(df_case['TARGET'].mean()*100, 2)) 

1. Initial experiment

In [None]:
# Split the data into training and test sets
X_train1, X_test1, Y_train1, Y_test1 = train_test_split(df_case.drop(columns = ['ID', 'TARGET']), df_case['TARGET'], test_size = 0.2, random_state = 5456)

mlflow.autolog()
with mlflow.start_run(run_name='BaseLine'):
    xgb = XGBClassifier()

    attempt1 = xgb.fit(X_train1, Y_train1)

    params = attempt1.get_params()
    mlflow.log_params(params)


    DT_pred= attempt1.predict(X_test1)

    values = evaluate_pto_corte(attempt1, X_train1, X_test1, Y_train1, Y_test1, 0.1)
    print(values)

    # Log the evaluation metrics (cutoff, AUC, and profit percentages)
    metrics = {
    "pto_corte": values[8],
    "auc_score_train": values[0],
    "auc_score_test": values[1],
    "profit_percent_train":round((values[2]/values[3])*100, 2),
    "profit_percent_test": round((values[4]/values[5])*100, 2),
    }

    mlflow.log_artifact(local_path="model_train.ipynb")
    mlflow.log_metrics(metrics)

    model_info = mlflow.sklearn.log_model(
        sk_model=attempt1,
        artifact_path="Model",
        input_example=X_train1.head(2),
        registered_model_name="Baseline_Model",
    )
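
`evaluate_pto_corte` comes from the project's `functions` module and is not shown in this notebook. As a rough illustration of what a cutoff-based profit evaluation can look like, the sketch below flags every client whose predicted probability is at or above the cutoff and compares the resulting profit with the best attainable one. The name `profit_at_cutoff`, the gain of 90 per correctly flagged client, and the cost of 10 per wrongly flagged client are hypothetical choices for this example, not values taken from the source. It also assumes that the `profit_percent_*` metrics divide an achieved profit by a maximum profit.

```python
import numpy as np

def profit_at_cutoff(y_true, proba, cutoff, gain_tp=90.0, cost_fp=10.0):
    """Hypothetical helper: profit when acting on clients with proba >= cutoff."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(proba) >= cutoff
    tp = int(np.sum(flagged & (y_true == 1)))  # dissatisfied clients correctly targeted
    fp = int(np.sum(flagged & (y_true == 0)))  # satisfied clients targeted by mistake
    profit = gain_tp * tp - cost_fp * fp
    # Best case: every dissatisfied client is targeted and nobody else
    max_profit = gain_tp * int(np.sum(y_true == 1))
    return profit, max_profit

# Toy labels and predicted probabilities (made up for the example)
y_true = [0, 0, 1, 0, 1, 0]
proba = [0.02, 0.15, 0.40, 0.05, 0.08, 0.30]

profit, max_profit = profit_at_cutoff(y_true, proba, cutoff=0.1)
profit_percent = round(profit / max_profit * 100, 2)
print(profit, max_profit, profit_percent)
```

A low cutoff such as 0.1 makes sense when the positive class is rare, because few clients receive a predicted probability anywhere near 0.5.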

2. Treatment and removal of low-variability features

In [None]:
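# Columns that removeUnvariable (imported from functions) flags at a 0.0 threshold
variancia = removeUnvariable(df_case, df_case.columns, 0.0)
train_vars = df_case.drop(columns=variancia)

# Collect pairs of columns with identical contents, recording each pair only once
equals = []
for x in train_vars:
    for y in train_vars:
        if x != y:
            if df_case[x].equals(df_case[y]):
                if (y, x) not in equals:
                    equals.append((x, y))

# Keep the first column of each identical pair and drop the second
drop_cols = [x[1] for x in equals]

df_final = train_vars.drop(columns=drop_cols)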
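
On a toy frame, the same two clean-up steps (dropping constant columns, then dropping columns that repeat an earlier one) can be sketched with built-in pandas operations. This is an alternative to the nested loop above, not the project's `removeUnvariable` helper, and the column names are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0, 0, 0, 0],   # constant column
    "c": [1, 2, 3, 4],   # exact copy of "a"
    "d": [4, 3, 2, 1],
})

# Constant columns have a single distinct value
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
no_constants = toy.drop(columns=constant_cols)

# Transposing turns columns into rows, so duplicated() marks every column
# that repeats an earlier one (the first occurrence is kept)
duplicate_mask = no_constants.T.duplicated()
toy_clean = no_constants.loc[:, ~duplicate_mask.values]

print(list(toy_clean.columns))
```

Transposing a wide frame can be memory-hungry, so the explicit loop may still be preferable for very large tables.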

3. New experiment without the low-variability features


In [None]:
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(df_final.drop(columns = ['ID', 'TARGET']), df_final['TARGET'], test_size = 0.2, random_state = 5456)


mlflow.autolog()
with mlflow.start_run(run_name='Features_Treatment', description="constant and duplicated variables removed from the dataframe"):
    xgb = XGBClassifier()

    attempt2 = xgb.fit(X_train2, Y_train2)



    params = attempt2.get_params()
    mlflow.log_params(params)


    DT_pred= attempt2.predict(X_test2)

    values = evaluate_pto_corte(attempt2, X_train2, X_test2, Y_train2, Y_test2, 0.1)
    print(values)

    # Log the evaluation metrics (cutoff, AUC, and profit percentages)
    metrics = {
    "pto_corte": values[8],
    "auc_score_train": values[0],
    "auc_score_test": values[1],
    "profit_percent_train":round((values[2]/values[3])*100, 2),
    "profit_percent_test": round((values[4]/values[5])*100, 2),
    }

    mlflow.log_metrics(metrics)

    model_info = mlflow.sklearn.log_model(
        sk_model=attempt2,
        artifact_path="Model",
        input_example=X_train2.head(2),
        registered_model_name="Case_Data_Master",
    )

4. Analysis, transformation, and removal of additional features, plus feature engineering

In [None]:
# Compute the Spearman correlation matrix and drop one variable from each pair whose absolute correlation exceeds 0.95
spm_corr = df_final.corr(method='spearman').abs()  # Spearman is less sensitive to outliers

upper_tri = spm_corr.where(np.triu(np.ones(spm_corr.shape), k=1).astype(bool))
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]

df_corr = df_final.drop(columns=to_drop)

# Integer variables with exactly two distinct values (flags)
vars_int = df_corr.select_dtypes('int64').drop(columns=['ID']).columns
flags_ = [x for x in df_corr[vars_int].columns if len(df_corr[x].unique()) == 2]

# Integer variables with more than two distinct values, plus the target
cat_vars = [x for x in df_corr[vars_int].columns if len(df_corr[x].unique()) > 2] + ['TARGET']

# Feature engineering

# Order the values of 'num_var4' by their mean 'TARGET' (position 0 = highest mean) and store the position as 'var4_ordered'
new_df = pd.DataFrame(df_corr.groupby('num_var4')['TARGET'].mean().sort_values(ascending=False))
new_df = new_df.reset_index().reset_index()
df_final_eng = new_df[['index', 'num_var4']].join(df_corr.set_index('num_var4'), on='num_var4')
df_final_eng.rename(columns={'index': 'var4_ordered'}, inplace=True)

# Same ordering for 'num_var5_0', stored as 'num_var5_0_ordered'
new_df = pd.DataFrame(df_corr.groupby('num_var5_0')['TARGET'].mean().sort_values(ascending=False))
new_df = new_df.reset_index().reset_index()
df_final_eng = new_df[['index', 'num_var5_0']].join(df_final_eng.set_index('num_var5_0'), on='num_var5_0')
df_final_eng.rename(columns={'index': 'num_var5_0_ordered'}, inplace=True)

# New column 'zero_info': the number of zeros in each row of the DataFrame
df_final_eng['zero_info'] = (df_final_eng == 0).astype(int).sum(axis=1)

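
The ordered encoding used for `num_var4` and `num_var5_0` can be illustrated on toy data: each distinct value receives the position it occupies when the values are sorted by mean `TARGET`, highest first. The sketch below uses `Series.map` instead of a join, which is a choice made for this example only. The column name `num_var` and the data are made up, and the zero count is restricted to the feature column, which is also an assumption of the sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    "num_var": [0, 0, 3, 3, 3, 6, 6, 0],
    "TARGET":  [0, 0, 1, 1, 0, 0, 1, 0],
})

# Mean TARGET per distinct value, sorted from highest to lowest
target_rate = toy.groupby("num_var")["TARGET"].mean().sort_values(ascending=False)

# Position in that ordering: 0 for the value with the highest mean TARGET
order = {int(value): position for position, value in enumerate(target_rate.index)}
toy["num_var_ordered"] = toy["num_var"].map(order)

# Count of zeros per row over the feature column only (TARGET is left out here)
toy["zero_info"] = (toy[["num_var"]] == 0).sum(axis=1)

print(order)
print(toy)
```

Because the ordering is derived from the target, computing it on training rows only and then mapping it onto the test rows would keep test labels out of the feature.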


5. New test after the transformations described in section 4.

In [None]:
y = df_final_eng['TARGET']
X = df_final_eng.drop(columns = ['TARGET'])
X_train, X_test, Y_train, Y_test = train_test_split(X.drop(columns = ['ID']), y, test_size = 0.2, random_state = 5456)

mlflow.autolog()
with mlflow.start_run(run_name='Feature Engineering', description="Creation of new variables and removal of irrelevant variables"):
    xgb = XGBClassifier(seed = 2938)

    model_xgb = xgb.fit(X_train, Y_train)

    params = model_xgb.get_params()
    mlflow.log_params(params)


    DT_pred= model_xgb.predict(X_test)

    values = evaluate_pto_corte(model_xgb, X_train, X_test, Y_train, Y_test, 0.1)
    print(values)

    # Log the evaluation metrics (cutoff, AUC, and profit percentages)
    metrics = {
    "pto_corte": values[8],
    "auc_score_train": values[0],
    "auc_score_test": values[1],
    "profit_percent_train":round((values[2]/values[3])*100, 2),
    "profit_percent_test": round((values[4]/values[5])*100, 2),
    }

    mlflow.log_metrics(metrics)

    model_info = mlflow.sklearn.log_model(
        sk_model=model_xgb,
        artifact_path="Model",
        input_example=X_train.head(2),
        registered_model_name="Case_Data_Master",
    )


6. New training run testing class balancing with undersampling


In [None]:
# Training with undersampling
# Work on a copy so that X_train itself does not gain a TARGET column
df_under = X_train.copy()
df_under["TARGET"] = Y_train

# Number of positive (dissatisfied) examples in the training set
qtd_c1 = sum(Y_train)

# Keep every positive example and sample the same number of negatives
classe_1 = df_under[df_under["TARGET"] == 1]
classe_0 = df_under[df_under['TARGET'] == 0].sample(qtd_c1)

df_under = pd.concat([classe_0, classe_1])
X_trainUnder = df_under.drop(columns = ['TARGET'])
Y_trainUnder = df_under['TARGET']


mlflow.autolog()
with mlflow.start_run(run_name='Undersampling', description="Undersampling test because the classes are imbalanced"):
    
    model_xgb_Under = xgb.fit(X_trainUnder, Y_trainUnder)

    params = model_xgb_Under.get_params()
    mlflow.log_params(params)


    DT_pred= model_xgb_Under.predict(X_test)

    values = evaluate_pto_corte(model_xgb_Under, X_trainUnder, X_test, Y_trainUnder, Y_test, 0.1)
    print(values)

    # Log the evaluation metrics (cutoff, AUC, and profit percentages)
    metrics = {
    "pto_corte": values[8],
    "auc_score_train": values[0],
    "auc_score_test": values[1],
    "profit_percent_train":round((values[2]/values[3])*100, 2),
    "profit_percent_test": round((values[4]/values[5])*100, 2),
    }

    mlflow.log_metrics(metrics)

    model_info = mlflow.sklearn.log_model(
        sk_model=model_xgb_Under,
        artifact_path="Model",
        input_example=X_train.head(2),
        registered_model_name="Case_Data_Master",
    )
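
The undersampling step can be checked in isolation on toy data. The sketch below copies the feature frame before attaching the labels, so the original features are left untouched, and fixes `random_state` so the sampled negatives are repeatable. The frame, the labels, and the seed value are all made up for the example.

```python
import pandas as pd

X_toy = pd.DataFrame({"f1": range(10), "f2": range(10, 20)})
y_toy = pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 0, 1], name="TARGET")

# Copy first so the original feature frame is left untouched
df_under = X_toy.copy()
df_under["TARGET"] = y_toy

n_pos = int(y_toy.sum())
class_1 = df_under[df_under["TARGET"] == 1]
# random_state makes the draw repeatable (the value 0 is arbitrary)
class_0 = df_under[df_under["TARGET"] == 0].sample(n_pos, random_state=0)

df_balanced = pd.concat([class_0, class_1])
X_under = df_balanced.drop(columns=["TARGET"])
y_under = df_balanced["TARGET"]

print(y_under.value_counts().to_dict())
print(list(X_toy.columns))
```

Only the training split is rebalanced; the test split keeps its natural class ratio so that the reported metrics reflect real conditions.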

7. New experiment with hyperparameter optimization

In [None]:

# Define the hyperparameters to optimize: lambda, alpha, colsample_bytree, max_depth, and min_child_weight.
# Create an XGBClassifier with the suggested hyperparameters.
# Run 5-fold cross-validation using the maximum-profit metric.
# Log the parameters and the maximum-profit metric to MLflow.

max_profit = make_scorer(lucro_maximo, greater_is_better=True)

def objective(trial):
  
  mlflow.autolog()
  
  with mlflow.start_run(run_name= str(trial), nested=True):

    param = {
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'alpha': trial.suggest_float('alpha', 1e-3, 10.0, log=True),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
        'max_depth': trial.suggest_categorical('max_depth', [5, 9, 11]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
    }
    model = XGBClassifier(**param)  

    cval = cross_val_score(model, X_train, Y_train, 
                             cv=KFold(n_splits=5,
                                      shuffle=True,
                                      random_state=42),
                                 verbose= 3, scoring = max_profit,
                            n_jobs= 1
                            )
    
    print("cval", cval)

    mlflow.log_params(param)
    mlflow.log_metric('max_profit_kfold', cval.mean())
    # Log the estimator as an artifact (cross_val_score fits clones, so this instance itself is unfitted)
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="Model",
        input_example=X_train.head(2),
        registered_model_name="Case_Data_Master",
    )
    
    return cval.mean()
  
  

with mlflow.start_run(run_name="Optuna") as mlflow_parent:
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=5)
    print('Number of finished trials:', len(study.trials))
    print('Best trial:', study.best_trial.params)

Best_trial = study.best_trial.params
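
`lucro_maximo` is defined in the project's `functions` module and is not shown here. One way such a maximum-profit metric could work is to sweep a grid of cutoffs and keep the best profit. The sketch below is a hypothetical stand-in with made-up unit values (gain 90, cost 10) and a made-up name, `max_profit_over_cutoffs`; it is not the author's implementation. With its default settings, `make_scorer` hands the metric the output of `predict`, so a probability-based function like this one would need the scorer configured to use `predict_proba`.

```python
import numpy as np

def max_profit_over_cutoffs(y_true, proba, gain_tp=90.0, cost_fp=10.0):
    """Hypothetical metric: best profit over a grid of cutoffs, and the cutoff that reaches it."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    best_profit, best_cutoff = float("-inf"), None
    for cutoff in np.linspace(0.0, 1.0, 101):
        flagged = proba >= cutoff
        tp = np.sum(flagged & (y_true == 1))
        fp = np.sum(flagged & (y_true == 0))
        profit = gain_tp * tp - cost_fp * fp
        # Strict comparison keeps the lowest cutoff among ties
        if profit > best_profit:
            best_profit, best_cutoff = float(profit), float(cutoff)
    return best_profit, best_cutoff

# Toy labels and predicted probabilities (made up for the example)
y_true = [0, 0, 1, 0, 1, 0]
proba = [0.023, 0.152, 0.401, 0.054, 0.087, 0.305]

best_profit, best_cutoff = max_profit_over_cutoffs(y_true, proba)
print(best_profit, round(best_cutoff, 2))
```

Optuna then only has to maximize the mean of this score across folds, which matches `direction='maximize'` in the study above.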

8. Training a new model with the best parameters found by Optuna

In [None]:
mlflow.autolog()
with mlflow.start_run(run_name='Optuna Best Model', description="Model using the best Optuna parameters"):
    
    #Best_trial = {'lambda': 0.01444365452315395, 'learning_rate': 0.09, 'n_estimators': 138, 'max_depth': 5, 'random_state': 2020, 'min_child_weight': 25}

    model = XGBClassifier(**Best_trial)
    model_optuna = model.fit(X_train, Y_train)

    params = model_optuna.get_params()
    mlflow.log_params(params)

    values = evaluate_pto_corte(model_optuna, X_train, X_test, Y_train, Y_test, 0.1)
    print(values)

    # Log the evaluation metrics (cutoff, AUC, and profit percentages)
    metrics = {
    "pto_corte": values[8],
    "auc_score_train": values[0],
    "auc_score_test": values[1],
    "profit_percent_train":round((values[2]/values[3])*100, 2),
    "profit_percent_test": round((values[4]/values[5])*100, 2),
    }

    mlflow.log_metrics(metrics)

    model_info = mlflow.sklearn.log_model(
        sk_model=model_optuna,
        artifact_path="Model",
        input_example=X_train.head(2),
        registered_model_name="Case_Data_Master",
    )


9. Filtering the most important features according to the model's feature importance, then retraining

In [None]:
# Extract the feature importances from the trained model.
# Keep the features whose importance is greater than 0.02

fi = list(zip(model_optuna.get_booster().feature_names, model_optuna.feature_importances_))
try_features = [x[0] for x in fi if x[1] > 0.02]

mlflow.autolog()
with mlflow.start_run(run_name='Final Model Feature Reduction', description="Keeping only the top 40 vars"):
    
    Best_trial = {'lambda': 0.01444365452315395, 'learning_rate': 0.09, 'n_estimators': 138, 'max_depth': 5, 'random_state': 2020, 'min_child_weight': 25}

    model = XGBClassifier(**Best_trial)
    model_optuna = model.fit(X_train[try_features], Y_train)
    
    params = model_optuna.get_params()
    mlflow.log_params(params)

    values = evaluate_pto_corte(model_optuna, X_train[try_features], X_test[try_features], Y_train, Y_test, 0.1)
    print(values)

    # Log the evaluation metrics (cutoff, AUC, and profit percentages)
    metrics = {
    "pto_corte": values[8],
    "auc_score_train": values[0],
    "auc_score_test": values[1],
    "profit_percent_train":round((values[2]/values[3])*100, 2),
    "profit_percent_test": round((values[4]/values[5])*100, 2),
    }

    mlflow.log_metrics(metrics)

    model_info = mlflow.sklearn.log_model(
        sk_model=model_optuna,
        artifact_path="Model",
        input_example=X_train[try_features].head(2),
        registered_model_name="Case_Data_Master",
    )
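
The importance filter can be tried on plain Python data. A threshold keeps however many features clear it, whereas a top-k selection keeps exactly k. The sketch below shows both on made-up (name, importance) pairs, which stand in for the zipped booster feature names and `feature_importances_`; the values and the choice of `k` are illustrative only.

```python
# Hypothetical (feature name, importance) pairs
fi = [
    ("var15", 0.31),
    ("saldo_var30", 0.12),
    ("num_var4", 0.018),
    ("var38", 0.05),
    ("ind_var5", 0.002),
]

# Threshold filter, as in the cell above
by_threshold = [name for name, importance in fi if importance > 0.02]

# Top-k alternative: sort by importance and keep the k largest
k = 2
ranked = sorted(fi, key=lambda pair: pair[1], reverse=True)
top_k = [name for name, _ in ranked[:k]]

print(by_threshold)
print(top_k)
```

The threshold version preserves the original column order, which keeps the training and test frames aligned when both are indexed with the same list.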


All experiments, along with the metrics for each one, can be viewed on DagsHub:     
https://dagshub.com/joycesafg/DataMaster_Case/experiments