# Projet 7 : Implémentez un modèle de scoring
*Philippe LONJON (janvier 2020)*

---
Ce projet consiste à développer un modèle de scoring, qui donnera une prédiction sur la probabilité de défaut de paiement d'un client qui demande un prêt.
Il s'agit d'un problème :
* **Supervisé** : Les étiquettes (Défauts de paiement) sont connus
* **Classification** : Les valeurs à prédire sont des variables qualitatives
---
## Notebook 4.5 : Optimisation des hyper-paramètres avec métrique auc-score
Le notebook comprend :
- L'optimisation des hyper-paramètres d'un modèle de gradient boosting selon la métrique choisie

Les données à utiliser sont issues du notebook précédent, et se trouvent dans le répertoire : `data/features`<br>
Le résultat de l'optimisation sont sauvegardées dans le répertoire : ``data/model``

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Graphics
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
import lightgbm as lgb
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import f1_score, recall_score, accuracy_score, precision_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix

# Divers
from time import time, strftime, gmtime
import gc
import pickle

# Module des fonctions du notebook
import fonctions08 as f

# Autoreload pour prise en compte des changments dans le module fonctions
%load_ext autoreload
%autoreload 1
%aimport fonctions08

In [2]:
# Heure démarrage
t0=time()

# Constantes
RANDOM_STATE = 1
NROWS = None
NCOLS = None
N_FOLDS = 5

# Paramètres pour les validations croisées
random_seed = 1 # Seed pour générateur de nombre aléatoire
folds = 5 # Nombre de folds pour validation croisée

In [3]:
# Chargement des données d'entrainement
features = f.import_csv("data/features/train_features_selected.csv", nrows=NROWS)
target = f.import_csv("data/features/train_target.csv", nrows=NROWS)

features = features.set_index("SK_ID_CURR")
target = target.set_index("SK_ID_CURR")

print(features.shape)
print(target.shape)

Memory usage of dataframe is 505.82 MB
Memory usage after optimization is: 164.84 MB
Decreased by 67.4%
Memory usage of dataframe is 3.28 MB
Memory usage after optimization is: 1.03 MB
Decreased by 68.7%
(215257, 307)
(215257, 1)


# 1. Création d'un jeu de données avec classes équilibrées
Les clients ayant fait défaut ne représentent que 8% des données. Ce déséquilibre dans les classes des étiquettes conduit à un taux de faux positifs élevé. Pour faire face à ce problème, on va sous-échantillonner la classe majoritaire, de sorte que les deux classes aient un échantillon de même taille.

In [4]:
# Taille de l'échantillon avec une faible représentation
target[target['TARGET'] == 1].shape

(17377, 1)

In [5]:
# Création d'un jeu de données ave classes équilibrées
target_0 = target[target['TARGET'] == 0]
target_1 = target[target['TARGET'] == 1]

target_0_sample = target_0.sample(len(target_1), random_state=RANDOM_STATE)

target_sample_balanced = pd.concat([target_0_sample, target_1], axis=0)
features_sample_balanced = features.loc[target_sample_balanced.index, :].iloc[:, :NCOLS]

In [6]:
# Dimensions du nouveau jeu de données
features_sample_balanced.shape

(34754, 307)

In [7]:
# Creation des données d'entrainement et de test
# On limite les tailles pour garder des temps de calculs raisonnables
X_train, X_test, y_train, y_test =\
    train_test_split(features_sample_balanced, target_sample_balanced,
                     train_size=15000, test_size=6000,
                     stratify=target_sample_balanced, random_state=RANDOM_STATE)

# 2. Recherche des hyper-paramètres
Nous effectuons la recherche d'hyper-paramètres avec l'algorithme de la librairie Optuna.<br>
Cet algorithme effectue la recherche selon une approche probabilistique, en prenant en compte les évaluations déjà effectuées, ce qui permet d'optimiser le temps de recherche.<br>
Il faut définir la fonction objectif, qui retourne le score à optimiser, à maximiser ou minimiser selon la métrique choisie.

In [8]:
import optuna

def objective(trial):
    params = {
        'verbosity':-1,
        'objective': 'binary',
        'metric': 'auc',        
        'boosting_type': 'gbdt',
        'n_estimators': trial.suggest_int('n_estimators', 20, 200),
        'num_leaves ': trial.suggest_int('num_leaves', 8, 128),
        'max_depth': trial.suggest_int('max_depth', 4, 10),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.01, 0.1),
        'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
        'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_uniform('reg_alpha', 0.0, 1.0),
    }
    
    k_fold = KFold(n_splits=N_FOLDS, shuffle=False, random_state=RANDOM_STATE)
    
    scores=[]

    for train_index, valid_index in k_fold.split(X_train):
    
        # Training data for the fold
        fold_X_train, fold_y_train = X_train.iloc[train_index, :], y_train.iloc[train_index, :]
        
        # Validation data for the fold
        fold_X_valid, fold_y_valid = X_train.iloc[valid_index, :], y_train.iloc[valid_index, :]
        
        model = lgb.LGBMModel(random_state=1)
        model.set_params(**params)
        model.fit(fold_X_train, np.ravel(fold_y_train))
        preds = model.predict(fold_X_valid)
        scores.append(roc_auc_score(fold_y_valid, preds))
   
    return np.mean(scores)

On se fixe 1000 itérations pour trouver les hyper-paramètres optimaux qui maximise l'aire sous la courbe ROC.

In [9]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000, timeout=None) # timeout (seconds)

[I 2020-01-04 16:21:52,941] Finished trial#0 resulted in value: 0.7637830202977364. Current best value is 0.7637830202977364 with parameters: {'n_estimators': 48, 'num_leaves': 68, 'max_depth': 7, 'learning_rate': 0.09427210953337621, 'min_child_samples': 35, 'colsample_bytree': 0.9062174927316424, 'reg_alpha': 0.9852975191722028}.
[I 2020-01-04 16:22:05,329] Finished trial#1 resulted in value: 0.768816355910494. Current best value is 0.768816355910494 with parameters: {'n_estimators': 167, 'num_leaves': 42, 'max_depth': 9, 'learning_rate': 0.07882156726248452, 'min_child_samples': 83, 'colsample_bytree': 0.5650707190729094, 'reg_alpha': 0.4872301449397253}.
[I 2020-01-04 16:22:20,039] Finished trial#2 resulted in value: 0.7692060761731658. Current best value is 0.7692060761731658 with parameters: {'n_estimators': 188, 'num_leaves': 100, 'max_depth': 7, 'learning_rate': 0.07067560809683394, 'min_child_samples': 68, 'colsample_bytree': 0.7061200004593196, 'reg_alpha': 0.6095074657722065

[I 2020-01-04 16:31:29,729] Finished trial#48 resulted in value: 0.7563554656989843. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 16:31:42,498] Finished trial#49 resulted in value: 0.7706079924254645. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 16:31:55,934] Finished trial#50 resulted in value: 0.770287557011269. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.4345673

[I 2020-01-04 16:41:39,667] Finished trial#96 resulted in value: 0.7463028707655537. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 16:41:50,524] Finished trial#97 resulted in value: 0.7705649969337236. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 16:42:04,007] Finished trial#98 resulted in value: 0.7717648246462112. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.434567

[I 2020-01-04 16:51:58,237] Finished trial#144 resulted in value: 0.7715536841681281. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 16:52:06,407] Finished trial#145 resulted in value: 0.7669589249829649. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 16:52:14,762] Finished trial#146 resulted in value: 0.7689513988962491. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.434

[I 2020-01-04 17:01:22,896] Finished trial#192 resulted in value: 0.7713874584085894. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 17:01:34,143] Finished trial#193 resulted in value: 0.7714320407610853. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 17:01:47,676] Finished trial#194 resulted in value: 0.7703879495090075. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.434

[I 2020-01-04 17:11:20,385] Finished trial#240 resulted in value: 0.771232774184854. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 17:11:31,679] Finished trial#241 resulted in value: 0.7715553892764777. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.43456736725232115}.
[I 2020-01-04 17:11:43,724] Finished trial#242 resulted in value: 0.7707220596697244. Current best value is 0.7726570781055055 with parameters: {'n_estimators': 162, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.05051756660216091, 'min_child_samples': 93, 'colsample_bytree': 0.6315516044993039, 'reg_alpha': 0.4345

[I 2020-01-04 17:21:00,093] Finished trial#288 resulted in value: 0.7713926326306207. Current best value is 0.7728789984972189 with parameters: {'n_estimators': 157, 'num_leaves': 64, 'max_depth': 9, 'learning_rate': 0.05193211203815715, 'min_child_samples': 95, 'colsample_bytree': 0.5135455567644769, 'reg_alpha': 0.6605510148168316}.
[I 2020-01-04 17:21:13,091] Finished trial#289 resulted in value: 0.7713676109191969. Current best value is 0.7728789984972189 with parameters: {'n_estimators': 157, 'num_leaves': 64, 'max_depth': 9, 'learning_rate': 0.05193211203815715, 'min_child_samples': 95, 'colsample_bytree': 0.5135455567644769, 'reg_alpha': 0.6605510148168316}.
[I 2020-01-04 17:21:26,037] Finished trial#290 resulted in value: 0.7700988149389215. Current best value is 0.7728789984972189 with parameters: {'n_estimators': 157, 'num_leaves': 64, 'max_depth': 9, 'learning_rate': 0.05193211203815715, 'min_child_samples': 95, 'colsample_bytree': 0.5135455567644769, 'reg_alpha': 0.66055101

[I 2020-01-04 17:30:58,211] Finished trial#336 resulted in value: 0.7723303205411931. Current best value is 0.7728789984972189 with parameters: {'n_estimators': 157, 'num_leaves': 64, 'max_depth': 9, 'learning_rate': 0.05193211203815715, 'min_child_samples': 95, 'colsample_bytree': 0.5135455567644769, 'reg_alpha': 0.6605510148168316}.
[I 2020-01-04 17:31:12,058] Finished trial#337 resulted in value: 0.7702244498115499. Current best value is 0.7728789984972189 with parameters: {'n_estimators': 157, 'num_leaves': 64, 'max_depth': 9, 'learning_rate': 0.05193211203815715, 'min_child_samples': 95, 'colsample_bytree': 0.5135455567644769, 'reg_alpha': 0.6605510148168316}.
[I 2020-01-04 17:31:26,521] Finished trial#338 resulted in value: 0.7706786594575188. Current best value is 0.7728789984972189 with parameters: {'n_estimators': 157, 'num_leaves': 64, 'max_depth': 9, 'learning_rate': 0.05193211203815715, 'min_child_samples': 95, 'colsample_bytree': 0.5135455567644769, 'reg_alpha': 0.66055101

[I 2020-01-04 17:40:57,557] Finished trial#384 resulted in value: 0.7712734000667806. Current best value is 0.7728789984972189 with parameters: {'n_estimators': 157, 'num_leaves': 64, 'max_depth': 9, 'learning_rate': 0.05193211203815715, 'min_child_samples': 95, 'colsample_bytree': 0.5135455567644769, 'reg_alpha': 0.6605510148168316}.
[I 2020-01-04 17:41:09,851] Finished trial#385 resulted in value: 0.7706379442098472. Current best value is 0.7728789984972189 with parameters: {'n_estimators': 157, 'num_leaves': 64, 'max_depth': 9, 'learning_rate': 0.05193211203815715, 'min_child_samples': 95, 'colsample_bytree': 0.5135455567644769, 'reg_alpha': 0.6605510148168316}.
[I 2020-01-04 17:41:23,224] Finished trial#386 resulted in value: 0.771894798707028. Current best value is 0.7728789984972189 with parameters: {'n_estimators': 157, 'num_leaves': 64, 'max_depth': 9, 'learning_rate': 0.05193211203815715, 'min_child_samples': 95, 'colsample_bytree': 0.5135455567644769, 'reg_alpha': 0.660551014

[I 2020-01-04 17:51:25,974] Finished trial#432 resulted in value: 0.7714047948769835. Current best value is 0.7730313476788926 with parameters: {'n_estimators': 175, 'num_leaves': 39, 'max_depth': 10, 'learning_rate': 0.055464850568572095, 'min_child_samples': 91, 'colsample_bytree': 0.5530373120225581, 'reg_alpha': 0.583826582640208}.
[I 2020-01-04 17:51:39,393] Finished trial#433 resulted in value: 0.7718932566727363. Current best value is 0.7730313476788926 with parameters: {'n_estimators': 175, 'num_leaves': 39, 'max_depth': 10, 'learning_rate': 0.055464850568572095, 'min_child_samples': 91, 'colsample_bytree': 0.5530373120225581, 'reg_alpha': 0.583826582640208}.
[I 2020-01-04 17:51:52,973] Finished trial#434 resulted in value: 0.7708251390016114. Current best value is 0.7730313476788926 with parameters: {'n_estimators': 175, 'num_leaves': 39, 'max_depth': 10, 'learning_rate': 0.055464850568572095, 'min_child_samples': 91, 'colsample_bytree': 0.5530373120225581, 'reg_alpha': 0.5838

[I 2020-01-04 18:02:05,591] Finished trial#480 resulted in value: 0.7727192919771849. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:02:20,088] Finished trial#481 resulted in value: 0.7723984977637045. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:02:34,850] Finished trial#482 resulted in value: 0.7712347623712443. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.58581

[I 2020-01-04 18:12:51,780] Finished trial#528 resulted in value: 0.7703979632349391. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:13:05,823] Finished trial#529 resulted in value: 0.7700025436702665. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:13:18,600] Finished trial#530 resulted in value: 0.77178936147301. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161

[I 2020-01-04 18:23:09,941] Finished trial#576 resulted in value: 0.7705207078843518. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:23:23,677] Finished trial#577 resulted in value: 0.7716632917000091. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:23:36,176] Finished trial#578 resulted in value: 0.770651356053359. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.585816

[I 2020-01-04 18:33:59,356] Finished trial#624 resulted in value: 0.7690476296446207. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:34:12,072] Finished trial#625 resulted in value: 0.7714951277314767. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:34:24,455] Finished trial#626 resulted in value: 0.7715996962127926. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.58581

[I 2020-01-04 18:44:00,529] Finished trial#672 resulted in value: 0.7712542567636251. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:44:13,542] Finished trial#673 resulted in value: 0.7718302190048469. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:44:27,648] Finished trial#674 resulted in value: 0.7726387560690996. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.58581

[I 2020-01-04 18:54:32,591] Finished trial#720 resulted in value: 0.7712625158116954. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:54:45,353] Finished trial#721 resulted in value: 0.7717118495865904. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 18:54:58,925] Finished trial#722 resulted in value: 0.7716767923712755. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.58581

[I 2020-01-04 19:04:47,175] Finished trial#768 resulted in value: 0.7724042983017085. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:04:59,813] Finished trial#769 resulted in value: 0.7714910881052213. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:05:12,653] Finished trial#770 resulted in value: 0.7708751488147175. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.58581

[I 2020-01-04 19:15:18,163] Finished trial#816 resulted in value: 0.7718773384074094. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:15:30,832] Finished trial#817 resulted in value: 0.7708597145505005. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:15:42,908] Finished trial#818 resulted in value: 0.7728039312026508. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.58581

[I 2020-01-04 19:25:10,114] Finished trial#864 resulted in value: 0.770988119862215. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:25:23,817] Finished trial#865 resulted in value: 0.7717273636173145. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:25:35,223] Finished trial#866 resulted in value: 0.7708295787228148. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.585816

[I 2020-01-04 19:34:46,792] Finished trial#912 resulted in value: 0.772134762016903. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:34:59,899] Finished trial#913 resulted in value: 0.7719407500415215. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:35:12,286] Finished trial#914 resulted in value: 0.7716373446199157. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.585816

[I 2020-01-04 19:46:19,508] Finished trial#960 resulted in value: 0.7720436169938011. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:46:35,410] Finished trial#961 resulted in value: 0.7721381755230157. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.5858161302486099}.
[I 2020-01-04 19:46:51,878] Finished trial#962 resulted in value: 0.7716806732602914. Current best value is 0.7733331767775958 with parameters: {'n_estimators': 166, 'num_leaves': 41, 'max_depth': 10, 'learning_rate': 0.05108522686706924, 'min_child_samples': 91, 'colsample_bytree': 0.5264262885044033, 'reg_alpha': 0.58581

On peut alors afficher le résultat de la recherche.

In [10]:
print('Number of finished trials: {}'.format(len(study.trials)))

print('Best trial:')
trial = study.best_trial

print('  Value: {}'.format(trial.value))

print('  Params: ')
for key, value in trial.params.items():
    print('    {}: {}'.format(key, value))

Number of finished trials: 1000
Best trial:
  Value: 0.7733331767775958
  Params: 
    n_estimators: 166
    num_leaves: 41
    max_depth: 10
    learning_rate: 0.05108522686706924
    min_child_samples: 91
    colsample_bytree: 0.5264262885044033
    reg_alpha: 0.5858161302486099


# 3. Sauvegarde du modèle
On crée ensuite un modèle de gradient bossting avec les meilleurs hyper-paramètres.

In [11]:
# Modele final
model = lgb.LGBMModel(objective='binary', metric='auc',
                      boosting_type= 'gbdt')
model.set_params(**study.best_params)
model

LGBMModel(boosting_type='gbdt', class_weight=None,
     colsample_bytree=0.5264262885044033, importance_type='split',
     learning_rate=0.05108522686706924, max_depth=10, metric='auc',
     min_child_samples=91, min_child_weight=0.001, min_split_gain=0.0,
     n_estimators=166, n_jobs=-1, num_leaves=41, objective='binary',
     random_state=None, reg_alpha=0.5858161302486099, reg_lambda=0.0,
     silent=True, subsample=1.0, subsample_for_bin=200000,
     subsample_freq=0)

Et on sauvegarde ce modèle dans le répertoire ``data/model``.

In [12]:
# On sauvegarde le modèle et la liste des features
model_params = {'model': model.get_params(), 'study': study}
filename = 'data/model/params_auc.sav'
pickle.dump(model_params, open(filename, 'wb'))

In [13]:
# Aperçu du dataframe des combinaisons de valeurs testées
df_res = pickle.load(open(filename, 'rb'))['study'].trials_dataframe()
df_res.head()

Unnamed: 0_level_0,number,state,value,datetime_start,datetime_complete,params,params,params,params,params,params,params,system_attrs
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,colsample_bytree,learning_rate,max_depth,min_child_samples,n_estimators,num_leaves,reg_alpha,_number
0,0,TrialState.COMPLETE,0.763783,2020-01-04 16:21:45.676411,2020-01-04 16:21:52.941398,0.906217,0.094272,7,35,48,68,0.985298,0
1,1,TrialState.COMPLETE,0.768816,2020-01-04 16:21:52.957019,2020-01-04 16:22:05.329816,0.565071,0.078822,9,83,167,42,0.48723,1
2,2,TrialState.COMPLETE,0.769206,2020-01-04 16:22:05.329816,2020-01-04 16:22:20.024201,0.70612,0.070676,7,68,188,100,0.609507,2
3,3,TrialState.COMPLETE,0.767707,2020-01-04 16:22:20.039821,2020-01-04 16:22:27.862950,0.874462,0.075611,6,62,67,95,0.797562,3
4,4,TrialState.COMPLETE,0.761671,2020-01-04 16:22:27.864959,2020-01-04 16:22:46.008238,0.7272,0.014404,6,57,199,26,0.147548,4


In [14]:
t1 = time()
print("computing time : {:8.6f} sec".format(t1-t0))
print("computing time : " + strftime('%H:%M:%S', gmtime(t1-t0)))

computing time : 12956.779487 sec
computing time : 03:35:56
