# Projet 7 : Implémentez un modèle de scoring
*Philippe LONJON (janvier 2020)*

---
Ce projet consiste à développer un modèle de scoring, qui donnera une prédiction sur la probabilité de défaut de paiement d'un client qui demande un prêt.
Il s'agit d'un problème :
* **Supervisé** : Les étiquettes (Défauts de paiement) sont connus
* **Classification** : Les valeurs à prédire sont des variables qualitatives
---
## Notebook 4.2 : Optimisation des hyper-paramètres avec métrique recall
Le notebook comprend :
- L'optimisation des hyper-paramètres d'un modèle de gradient boosting selon la métrique choisie

Les données à utiliser sont issues du notebook précédent, et se trouvent dans le répertoire : `data/features`<br>
Le résultat de l'optimisation sont sauvegardées dans le répertoire : ``data/model``

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Graphics
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
import lightgbm as lgb
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import f1_score, recall_score, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix

# Divers
from time import time, strftime, gmtime
import gc
import pickle

# Module des fonctions du notebook
import fonctions08 as f

# Autoreload pour prise en compte des changments dans le module fonctions
%load_ext autoreload
%autoreload 1
%aimport fonctions08

In [2]:
# Heure démarrage
t0=time()

# Constantes
RANDOM_STATE = 1
NROWS = None
NCOLS = None
N_FOLDS = 5

# Paramètres pour les validations croisées
random_seed = 1 # Seed pour générateur de nombre aléatoire
folds = 5 # Nombre de folds pour validation croisée

In [3]:
# Chargement des données d'entrainement
features = f.import_csv("data/features/train_features_selected.csv", nrows=NROWS)
target = f.import_csv("data/features/train_target.csv", nrows=NROWS)

features = features.set_index("SK_ID_CURR")
target = target.set_index("SK_ID_CURR")

print(features.shape)
print(target.shape)

Memory usage of dataframe is 505.82 MB
Memory usage after optimization is: 164.84 MB
Decreased by 67.4%
Memory usage of dataframe is 3.28 MB
Memory usage after optimization is: 1.03 MB
Decreased by 68.7%
(215257, 307)
(215257, 1)


# 1. Création d'un jeu de données avec classes équilibrées
Les clients ayant fait défaut ne représentent que 8% des données. Ce déséquilibre dans les classes des étiquettes conduit à un taux de faux positifs élevé. Pour faire face à ce problème, on va sous-échantillonner la classe majoritaire, de sorte que les deux classes aient un échantillon de même taille.

In [4]:
# Taille de l'échantillon avec une faible représentation
target[target['TARGET'] == 1].shape

(17377, 1)

In [5]:
# Création d'un jeu de données ave classes équilibrées
target_0 = target[target['TARGET'] == 0]
target_1 = target[target['TARGET'] == 1]

target_0_sample = target_0.sample(len(target_1), random_state=RANDOM_STATE)

target_sample_balanced = pd.concat([target_0_sample, target_1], axis=0)
features_sample_balanced = features.loc[target_sample_balanced.index, :].iloc[:, :NCOLS]

In [6]:
# Dimensions du nouveau jeu de données
features_sample_balanced.shape

(34754, 307)

In [7]:
# Creation des données d'entrainement et de test
# On limite les tailles pour garder des temps de calculs raisonnables
X_train, X_test, y_train, y_test =\
    train_test_split(features_sample_balanced, target_sample_balanced,
                     train_size=15000, test_size=6000,
                     stratify=target_sample_balanced, random_state=RANDOM_STATE)

# 2. Recherche des hyper-paramètres
Nous effectuons la recherche d'hyper-paramètres avec l'algorithme de la librairie Optuna.<br>
Cet algorithme effectue la recherche selon une approche probabilistique, en prenant en compte les évaluations déjà effectuées, ce qui permet d'optimiser le temps de recherche.<br>
Il faut définir la fonction objectif, qui retourne le score à optimiser, à maximiser ou minimiser selon la métrique choisie.

In [8]:
import optuna

def objective(trial):
    params = {
        'verbosity':-1,
        'objective': 'binary',
        'metric': 'auc',        
        'boosting_type': 'gbdt',
        'n_estimators': trial.suggest_int('n_estimators', 20, 200),
        'num_leaves ': trial.suggest_int('num_leaves', 8, 128),
        'max_depth': trial.suggest_int('max_depth', 4, 10),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.01, 0.1),
        'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
        'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_uniform('reg_alpha', 0.0, 1.0),
    }
    
    k_fold = KFold(n_splits=N_FOLDS, shuffle=False, random_state=RANDOM_STATE)
    
    scores=[]

    for train_index, valid_index in k_fold.split(X_train):
    
        # Training data for the fold
        fold_X_train, fold_y_train = X_train.iloc[train_index, :], y_train.iloc[train_index, :]
        
        # Validation data for the fold
        fold_X_valid, fold_y_valid = X_train.iloc[valid_index, :], y_train.iloc[valid_index, :]
        
        model = lgb.LGBMModel(random_state=1)
        model.set_params(**params)
        model.fit(fold_X_train, np.ravel(fold_y_train))
        preds = model.predict(fold_X_valid)
        scores.append(recall_score(fold_y_valid, np.rint(preds)))
   
    return np.mean(scores)

On se fixe 1000 itérations pour trouver les hyper-paramètres optimaux qui maximise le recall.

In [9]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000, timeout=None) # timeout (seconds)

[I 2020-01-04 01:34:56,950] Finished trial#0 resulted in value: 0.6853253513425506. Current best value is 0.6853253513425506 with parameters: {'n_estimators': 49, 'num_leaves': 70, 'max_depth': 4, 'learning_rate': 0.054763041022005354, 'min_child_samples': 53, 'colsample_bytree': 0.5913699903459337, 'reg_alpha': 0.6533399897308126}.
[I 2020-01-04 01:35:03,181] Finished trial#1 resulted in value: 0.6959763133546993. Current best value is 0.6959763133546993 with parameters: {'n_estimators': 57, 'num_leaves': 29, 'max_depth': 9, 'learning_rate': 0.04478207196014068, 'min_child_samples': 73, 'colsample_bytree': 0.5375896301797386, 'reg_alpha': 0.27050621091108396}.
[I 2020-01-04 01:35:07,030] Finished trial#2 resulted in value: 0.6724118965787566. Current best value is 0.6959763133546993 with parameters: {'n_estimators': 57, 'num_leaves': 29, 'max_depth': 9, 'learning_rate': 0.04478207196014068, 'min_child_samples': 73, 'colsample_bytree': 0.5375896301797386, 'reg_alpha': 0.270506210911083

[I 2020-01-04 01:47:24,908] Finished trial#48 resulted in value: 0.7019745143369442. Current best value is 0.7089971206816188 with parameters: {'n_estimators': 191, 'num_leaves': 76, 'max_depth': 9, 'learning_rate': 0.033478965786686445, 'min_child_samples': 34, 'colsample_bytree': 0.7515663548357849, 'reg_alpha': 0.29981309616204654}.
[I 2020-01-04 01:47:31,223] Finished trial#49 resulted in value: 0.6892915945057422. Current best value is 0.7089971206816188 with parameters: {'n_estimators': 191, 'num_leaves': 76, 'max_depth': 9, 'learning_rate': 0.033478965786686445, 'min_child_samples': 34, 'colsample_bytree': 0.7515663548357849, 'reg_alpha': 0.29981309616204654}.
[I 2020-01-04 01:47:48,986] Finished trial#50 resulted in value: 0.7053823201334681. Current best value is 0.7089971206816188 with parameters: {'n_estimators': 191, 'num_leaves': 76, 'max_depth': 9, 'learning_rate': 0.033478965786686445, 'min_child_samples': 34, 'colsample_bytree': 0.7515663548357849, 'reg_alpha': 0.299813

[I 2020-01-04 02:02:45,926] Finished trial#96 resulted in value: 0.7038051142780597. Current best value is 0.7094149953860109 with parameters: {'n_estimators': 195, 'num_leaves': 110, 'max_depth': 9, 'learning_rate': 0.0413619942438833, 'min_child_samples': 62, 'colsample_bytree': 0.7550950467412609, 'reg_alpha': 0.3027459692981171}.
[I 2020-01-04 02:03:07,124] Finished trial#97 resulted in value: 0.7083304535246251. Current best value is 0.7094149953860109 with parameters: {'n_estimators': 195, 'num_leaves': 110, 'max_depth': 9, 'learning_rate': 0.0413619942438833, 'min_child_samples': 62, 'colsample_bytree': 0.7550950467412609, 'reg_alpha': 0.3027459692981171}.
[I 2020-01-04 02:03:25,128] Finished trial#98 resulted in value: 0.70248790419928. Current best value is 0.7094149953860109 with parameters: {'n_estimators': 195, 'num_leaves': 110, 'max_depth': 9, 'learning_rate': 0.0413619942438833, 'min_child_samples': 62, 'colsample_bytree': 0.7550950467412609, 'reg_alpha': 0.3027459692981

[I 2020-01-04 02:18:34,331] Finished trial#144 resulted in value: 0.7022590854772552. Current best value is 0.7094149953860109 with parameters: {'n_estimators': 195, 'num_leaves': 110, 'max_depth': 9, 'learning_rate': 0.0413619942438833, 'min_child_samples': 62, 'colsample_bytree': 0.7550950467412609, 'reg_alpha': 0.3027459692981171}.
[I 2020-01-04 02:18:54,030] Finished trial#145 resulted in value: 0.7060100994831122. Current best value is 0.7094149953860109 with parameters: {'n_estimators': 195, 'num_leaves': 110, 'max_depth': 9, 'learning_rate': 0.0413619942438833, 'min_child_samples': 62, 'colsample_bytree': 0.7550950467412609, 'reg_alpha': 0.3027459692981171}.
[I 2020-01-04 02:19:15,802] Finished trial#146 resulted in value: 0.7066511019711685. Current best value is 0.7094149953860109 with parameters: {'n_estimators': 195, 'num_leaves': 110, 'max_depth': 9, 'learning_rate': 0.0413619942438833, 'min_child_samples': 62, 'colsample_bytree': 0.7550950467412609, 'reg_alpha': 0.30274596

[I 2020-01-04 02:34:32,628] Finished trial#192 resulted in value: 0.7062981759014975. Current best value is 0.7094149953860109 with parameters: {'n_estimators': 195, 'num_leaves': 110, 'max_depth': 9, 'learning_rate': 0.0413619942438833, 'min_child_samples': 62, 'colsample_bytree': 0.7550950467412609, 'reg_alpha': 0.3027459692981171}.
[I 2020-01-04 02:34:53,170] Finished trial#193 resulted in value: 0.7069645535870153. Current best value is 0.7094149953860109 with parameters: {'n_estimators': 195, 'num_leaves': 110, 'max_depth': 9, 'learning_rate': 0.0413619942438833, 'min_child_samples': 62, 'colsample_bytree': 0.7550950467412609, 'reg_alpha': 0.3027459692981171}.
[I 2020-01-04 02:35:13,366] Finished trial#194 resulted in value: 0.7065062933371582. Current best value is 0.7094149953860109 with parameters: {'n_estimators': 195, 'num_leaves': 110, 'max_depth': 9, 'learning_rate': 0.0413619942438833, 'min_child_samples': 62, 'colsample_bytree': 0.7550950467412609, 'reg_alpha': 0.30274596

[I 2020-01-04 02:51:34,680] Finished trial#240 resulted in value: 0.7073952228714783. Current best value is 0.7106494255370988 with parameters: {'n_estimators': 197, 'num_leaves': 94, 'max_depth': 10, 'learning_rate': 0.03772051798574484, 'min_child_samples': 33, 'colsample_bytree': 0.7206854350185867, 'reg_alpha': 0.48293604293319414}.
[I 2020-01-04 02:51:56,512] Finished trial#241 resulted in value: 0.7069294453257141. Current best value is 0.7106494255370988 with parameters: {'n_estimators': 197, 'num_leaves': 94, 'max_depth': 10, 'learning_rate': 0.03772051798574484, 'min_child_samples': 33, 'colsample_bytree': 0.7206854350185867, 'reg_alpha': 0.48293604293319414}.
[I 2020-01-04 02:52:19,268] Finished trial#242 resulted in value: 0.7044481847820284. Current best value is 0.7106494255370988 with parameters: {'n_estimators': 197, 'num_leaves': 94, 'max_depth': 10, 'learning_rate': 0.03772051798574484, 'min_child_samples': 33, 'colsample_bytree': 0.7206854350185867, 'reg_alpha': 0.482

[I 2020-01-04 03:05:55,858] Finished trial#288 resulted in value: 0.7100932078301081. Current best value is 0.7106494255370988 with parameters: {'n_estimators': 197, 'num_leaves': 94, 'max_depth': 10, 'learning_rate': 0.03772051798574484, 'min_child_samples': 33, 'colsample_bytree': 0.7206854350185867, 'reg_alpha': 0.48293604293319414}.
[I 2020-01-04 03:06:11,995] Finished trial#289 resulted in value: 0.7087844103518162. Current best value is 0.7106494255370988 with parameters: {'n_estimators': 197, 'num_leaves': 94, 'max_depth': 10, 'learning_rate': 0.03772051798574484, 'min_child_samples': 33, 'colsample_bytree': 0.7206854350185867, 'reg_alpha': 0.48293604293319414}.
[I 2020-01-04 03:06:27,300] Finished trial#290 resulted in value: 0.7080644898907945. Current best value is 0.7106494255370988 with parameters: {'n_estimators': 197, 'num_leaves': 94, 'max_depth': 10, 'learning_rate': 0.03772051798574484, 'min_child_samples': 33, 'colsample_bytree': 0.7206854350185867, 'reg_alpha': 0.482

[I 2020-01-04 03:19:03,644] Finished trial#336 resulted in value: 0.7090230243054527. Current best value is 0.7106494255370988 with parameters: {'n_estimators': 197, 'num_leaves': 94, 'max_depth': 10, 'learning_rate': 0.03772051798574484, 'min_child_samples': 33, 'colsample_bytree': 0.7206854350185867, 'reg_alpha': 0.48293604293319414}.
[I 2020-01-04 03:19:21,801] Finished trial#337 resulted in value: 0.7090181653006825. Current best value is 0.7106494255370988 with parameters: {'n_estimators': 197, 'num_leaves': 94, 'max_depth': 10, 'learning_rate': 0.03772051798574484, 'min_child_samples': 33, 'colsample_bytree': 0.7206854350185867, 'reg_alpha': 0.48293604293319414}.
[I 2020-01-04 03:19:36,739] Finished trial#338 resulted in value: 0.7051818115152136. Current best value is 0.7106494255370988 with parameters: {'n_estimators': 197, 'num_leaves': 94, 'max_depth': 10, 'learning_rate': 0.03772051798574484, 'min_child_samples': 33, 'colsample_bytree': 0.7206854350185867, 'reg_alpha': 0.482

[I 2020-01-04 03:30:57,317] Finished trial#384 resulted in value: 0.7083509425811002. Current best value is 0.7114444947254412 with parameters: {'n_estimators': 173, 'num_leaves': 95, 'max_depth': 10, 'learning_rate': 0.044005010414934986, 'min_child_samples': 34, 'colsample_bytree': 0.5392310080577489, 'reg_alpha': 0.054364580473195485}.
[I 2020-01-04 03:31:12,366] Finished trial#385 resulted in value: 0.7053783842294458. Current best value is 0.7114444947254412 with parameters: {'n_estimators': 173, 'num_leaves': 95, 'max_depth': 10, 'learning_rate': 0.044005010414934986, 'min_child_samples': 34, 'colsample_bytree': 0.5392310080577489, 'reg_alpha': 0.054364580473195485}.
[I 2020-01-04 03:31:19,443] Finished trial#386 resulted in value: 0.6983355007961795. Current best value is 0.7114444947254412 with parameters: {'n_estimators': 173, 'num_leaves': 95, 'max_depth': 10, 'learning_rate': 0.044005010414934986, 'min_child_samples': 34, 'colsample_bytree': 0.5392310080577489, 'reg_alpha': 

[I 2020-01-04 03:43:36,425] Finished trial#432 resulted in value: 0.7114120209067798. Current best value is 0.7121006330985835 with parameters: {'n_estimators': 179, 'num_leaves': 100, 'max_depth': 10, 'learning_rate': 0.03654514708034321, 'min_child_samples': 30, 'colsample_bytree': 0.5392171033859147, 'reg_alpha': 0.2024098524103825}.
[I 2020-01-04 03:43:53,031] Finished trial#433 resulted in value: 0.7054220711660546. Current best value is 0.7121006330985835 with parameters: {'n_estimators': 179, 'num_leaves': 100, 'max_depth': 10, 'learning_rate': 0.03654514708034321, 'min_child_samples': 30, 'colsample_bytree': 0.5392171033859147, 'reg_alpha': 0.2024098524103825}.
[I 2020-01-04 03:44:09,663] Finished trial#434 resulted in value: 0.709338070050541. Current best value is 0.7121006330985835 with parameters: {'n_estimators': 179, 'num_leaves': 100, 'max_depth': 10, 'learning_rate': 0.03654514708034321, 'min_child_samples': 30, 'colsample_bytree': 0.5392171033859147, 'reg_alpha': 0.202

[I 2020-01-04 03:56:41,582] Finished trial#480 resulted in value: 0.7068219639546711. Current best value is 0.7121006330985835 with parameters: {'n_estimators': 179, 'num_leaves': 100, 'max_depth': 10, 'learning_rate': 0.03654514708034321, 'min_child_samples': 30, 'colsample_bytree': 0.5392171033859147, 'reg_alpha': 0.2024098524103825}.
[I 2020-01-04 03:56:58,000] Finished trial#481 resulted in value: 0.7075469151516565. Current best value is 0.7121006330985835 with parameters: {'n_estimators': 179, 'num_leaves': 100, 'max_depth': 10, 'learning_rate': 0.03654514708034321, 'min_child_samples': 30, 'colsample_bytree': 0.5392171033859147, 'reg_alpha': 0.2024098524103825}.
[I 2020-01-04 03:57:13,443] Finished trial#482 resulted in value: 0.7098705304994579. Current best value is 0.7121006330985835 with parameters: {'n_estimators': 179, 'num_leaves': 100, 'max_depth': 10, 'learning_rate': 0.03654514708034321, 'min_child_samples': 30, 'colsample_bytree': 0.5392171033859147, 'reg_alpha': 0.20

[I 2020-01-04 04:09:30,205] Finished trial#528 resulted in value: 0.7068089089762711. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 04:09:46,425] Finished trial#529 resulted in value: 0.7084746220088998. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 04:10:01,972] Finished trial#530 resulted in value: 0.7073989052191874. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0

[I 2020-01-04 04:22:49,939] Finished trial#576 resulted in value: 0.7065906829909662. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 04:23:07,367] Finished trial#577 resulted in value: 0.7095713679121375. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 04:23:24,972] Finished trial#578 resulted in value: 0.7083936818545278. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0

[I 2020-01-04 04:37:00,424] Finished trial#624 resulted in value: 0.7056574152626837. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 04:37:17,083] Finished trial#625 resulted in value: 0.7067174695455495. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 04:37:25,300] Finished trial#626 resulted in value: 0.7026568305823788. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0

[I 2020-01-04 04:50:19,640] Finished trial#672 resulted in value: 0.7090051864360924. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 04:50:36,246] Finished trial#673 resulted in value: 0.707775042061706. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 04:50:52,597] Finished trial#674 resulted in value: 0.7065818655490063. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.

[I 2020-01-04 05:03:48,650] Finished trial#720 resulted in value: 0.6983904218608126. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:04:05,818] Finished trial#721 resulted in value: 0.7060835451562657. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:04:21,971] Finished trial#722 resulted in value: 0.7093856239587656. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0

[I 2020-01-04 05:17:18,739] Finished trial#768 resulted in value: 0.7080800270097614. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:17:36,688] Finished trial#769 resulted in value: 0.7059492165382135. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:18:03,064] Finished trial#770 resulted in value: 0.7034799337850112. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0

[I 2020-01-04 05:30:41,813] Finished trial#816 resulted in value: 0.709933377422545. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:30:57,465] Finished trial#817 resulted in value: 0.7049297201340906. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:31:13,974] Finished trial#818 resulted in value: 0.7083471400321184. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.

[I 2020-01-04 05:44:17,272] Finished trial#864 resulted in value: 0.7056827447189165. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:44:35,445] Finished trial#865 resulted in value: 0.7108925076364818. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:44:45,478] Finished trial#866 resulted in value: 0.6983622711416808. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0

[I 2020-01-04 05:58:12,289] Finished trial#912 resulted in value: 0.710889668537741. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:58:29,269] Finished trial#913 resulted in value: 0.7064341431510828. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 05:58:46,015] Finished trial#914 resulted in value: 0.7067649584290554. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.

[I 2020-01-04 06:12:03,488] Finished trial#960 resulted in value: 0.7075629346911295. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 06:12:20,450] Finished trial#961 resulted in value: 0.7056933193763154. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0.1975803071715644}.
[I 2020-01-04 06:12:34,379] Finished trial#962 resulted in value: 0.7013616432133517. Current best value is 0.7133186643905154 with parameters: {'n_estimators': 177, 'num_leaves': 105, 'max_depth': 10, 'learning_rate': 0.041062702635337185, 'min_child_samples': 38, 'colsample_bytree': 0.5260058472317749, 'reg_alpha': 0

On peut alors afficher le résultat de la recherche.

In [10]:
print('Number of finished trials: {}'.format(len(study.trials)))

print('Best trial:')
trial = study.best_trial

print('  Value: {}'.format(trial.value))

print('  Params: ')
for key, value in trial.params.items():
    print('    {}: {}'.format(key, value))

Number of finished trials: 1000
Best trial:
  Value: 0.7133186643905154
  Params: 
    n_estimators: 177
    num_leaves: 105
    max_depth: 10
    learning_rate: 0.041062702635337185
    min_child_samples: 38
    colsample_bytree: 0.5260058472317749
    reg_alpha: 0.1975803071715644


# 3. Sauvegarde du modèle
On crée ensuite un modèle de gradient bossting avec les meilleurs hyper-paramètres.

In [11]:
# Modele final
model = lgb.LGBMModel(objective='binary', metric='auc',
                      boosting_type= 'gbdt')
model.set_params(**study.best_params)
model

LGBMModel(boosting_type='gbdt', class_weight=None,
     colsample_bytree=0.5260058472317749, importance_type='split',
     learning_rate=0.041062702635337185, max_depth=10, metric='auc',
     min_child_samples=38, min_child_weight=0.001, min_split_gain=0.0,
     n_estimators=177, n_jobs=-1, num_leaves=105, objective='binary',
     random_state=None, reg_alpha=0.1975803071715644, reg_lambda=0.0,
     silent=True, subsample=1.0, subsample_for_bin=200000,
     subsample_freq=0)

Et on sauvegarde ce modèle dans le répertoire ``data/model``.

In [12]:
# On sauvegarde le modèle et la liste des features
model_params = {'model': model.get_params(), 'study': study}
filename = 'data/model/params_recall.sav'
pickle.dump(model_params, open(filename, 'wb'))

In [13]:
# Aperçu du dataframe des combinaisons de valeurs testées
df_res = pickle.load(open(filename, 'rb'))['study'].trials_dataframe()
df_res.head()

Unnamed: 0_level_0,number,state,value,datetime_start,datetime_complete,params,params,params,params,params,params,params,system_attrs
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,colsample_bytree,learning_rate,max_depth,min_child_samples,n_estimators,num_leaves,reg_alpha,_number
0,0,TrialState.COMPLETE,0.685325,2020-01-04 01:34:53.160693,2020-01-04 01:34:56.950181,0.59137,0.054763,4,53,49,70,0.65334,0
1,1,TrialState.COMPLETE,0.695976,2020-01-04 01:34:56.958199,2020-01-04 01:35:03.181929,0.53759,0.044782,9,73,57,29,0.270506,1
2,2,TrialState.COMPLETE,0.672412,2020-01-04 01:35:03.181929,2020-01-04 01:35:07.030881,0.860833,0.027084,5,30,23,67,0.222834,2
3,3,TrialState.COMPLETE,0.704542,2020-01-04 01:35:07.030881,2020-01-04 01:35:20.652091,0.82806,0.063779,10,23,94,113,0.978763,3
4,4,TrialState.COMPLETE,0.680679,2020-01-04 01:35:20.652091,2020-01-04 01:35:25.212316,0.997542,0.034954,7,86,23,104,0.34843,4


In [14]:
t1 = time()
print("computing time : {:8.6f} sec".format(t1-t0))
print("computing time : " + strftime('%H:%M:%S', gmtime(t1-t0)))

computing time : 17337.996145 sec
computing time : 04:48:57
