# Projet 7 : Implémentez un modèle de scoring
*Philippe LONJON (janvier 2020)*

---
Ce projet consiste à développer un modèle de scoring, qui donnera une prédiction sur la probabilité de défaut de paiement d'un client qui demande un prêt.
Il s'agit d'un problème :
* **Supervisé** : Les étiquettes (Défauts de paiement) sont connus
* **Classification** : Les valeurs à prédire sont des variables qualitatives
---
## Notebook 4.3 : Optimisation des hyper-paramètres avec métrique precision
Le notebook comprend :
- L'optimisation des hyper-paramètres d'un modèle de gradient boosting selon la métrique choisie

Les données à utiliser sont issues du notebook précédent, et se trouvent dans le répertoire : `data/features`<br>
Le résultat de l'optimisation sont sauvegardées dans le répertoire : ``data/model``

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Graphics
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
import lightgbm as lgb
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import f1_score, recall_score, accuracy_score, precision_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix

# Divers
from time import time, strftime, gmtime
import gc
import pickle

# Module des fonctions du notebook
import fonctions08 as f

# Autoreload pour prise en compte des changments dans le module fonctions
%load_ext autoreload
%autoreload 1
%aimport fonctions08

In [2]:
# Heure démarrage
t0=time()

# Constantes
RANDOM_STATE = 1
NROWS = None
NCOLS = None
N_FOLDS = 5

# Paramètres pour les validations croisées
random_seed = 1 # Seed pour générateur de nombre aléatoire
folds = 5 # Nombre de folds pour validation croisée

In [3]:
# Chargement des données d'entrainement
features = f.import_csv("data/features/train_features_selected.csv", nrows=NROWS)
target = f.import_csv("data/features/train_target.csv", nrows=NROWS)

features = features.set_index("SK_ID_CURR")
target = target.set_index("SK_ID_CURR")

print(features.shape)
print(target.shape)

Memory usage of dataframe is 505.82 MB
Memory usage after optimization is: 164.84 MB
Decreased by 67.4%
Memory usage of dataframe is 3.28 MB
Memory usage after optimization is: 1.03 MB
Decreased by 68.7%
(215257, 307)
(215257, 1)


# 1. Création d'un jeu de données avec classes équilibrées
Les clients ayant fait défaut ne représentent que 8% des données. Ce déséquilibre dans les classes des étiquettes conduit à un taux de faux positifs élevé. Pour faire face à ce problème, on va sous-échantillonner la classe majoritaire, de sorte que les deux classes aient un échantillon de même taille.

In [4]:
# Taille de l'échantillon avec une faible représentation
target[target['TARGET'] == 1].shape

(17377, 1)

In [5]:
# Création d'un jeu de données ave classes équilibrées
target_0 = target[target['TARGET'] == 0]
target_1 = target[target['TARGET'] == 1]

target_0_sample = target_0.sample(len(target_1), random_state=RANDOM_STATE)

target_sample_balanced = pd.concat([target_0_sample, target_1], axis=0)
features_sample_balanced = features.loc[target_sample_balanced.index, :].iloc[:, :NCOLS]

In [6]:
# Dimensions du nouveau jeu de données
features_sample_balanced.shape

(34754, 307)

In [7]:
# Creation des données d'entrainement et de test
# On limite les tailles pour garder des temps de calculs raisonnables
X_train, X_test, y_train, y_test =\
    train_test_split(features_sample_balanced, target_sample_balanced,
                     train_size=15000, test_size=6000,
                     stratify=target_sample_balanced, random_state=RANDOM_STATE)

# 2. Recherche des hyper-paramètres
Nous effectuons la recherche d'hyper-paramètres avec l'algorithme de la librairie Optuna.<br>
Cet algorithme effectue la recherche selon une approche probabilistique, en prenant en compte les évaluations déjà effectuées, ce qui permet d'optimiser le temps de recherche.<br>
Il faut définir la fonction objectif, qui retourne le score à optimiser, à maximiser ou minimiser selon la métrique choisie.

In [8]:
import optuna

def objective(trial):
    params = {
        'verbosity':-1,
        'objective': 'binary',
        'metric': 'auc',        
        'boosting_type': 'gbdt',
        'n_estimators': trial.suggest_int('n_estimators', 20, 200),
        'num_leaves ': trial.suggest_int('num_leaves', 8, 128),
        'max_depth': trial.suggest_int('max_depth', 4, 10),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.01, 0.1),
        'min_child_samples': trial.suggest_int('min_child_samples', 10, 100),
        'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_uniform('reg_alpha', 0.0, 1.0),
    }
    
    k_fold = KFold(n_splits=N_FOLDS, shuffle=False, random_state=RANDOM_STATE)
    
    scores=[]

    for train_index, valid_index in k_fold.split(X_train):
    
        # Training data for the fold
        fold_X_train, fold_y_train = X_train.iloc[train_index, :], y_train.iloc[train_index, :]
        
        # Validation data for the fold
        fold_X_valid, fold_y_valid = X_train.iloc[valid_index, :], y_train.iloc[valid_index, :]
        
        model = lgb.LGBMModel(random_state=1)
        model.set_params(**params)
        model.fit(fold_X_train, np.ravel(fold_y_train))
        preds = model.predict(fold_X_valid)
        scores.append(precision_score(fold_y_valid, np.rint(preds)))
   
    return np.mean(scores)

On se fixe 1000 itérations pour trouver les hyper-paramètres optimaux qui maximise la précision.

In [9]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000, timeout=None) # timeout (seconds)

[I 2020-01-04 07:10:33,651] Finished trial#0 resulted in value: 0.6969779237188753. Current best value is 0.6969779237188753 with parameters: {'n_estimators': 134, 'num_leaves': 20, 'max_depth': 5, 'learning_rate': 0.029467337406420886, 'min_child_samples': 79, 'colsample_bytree': 0.7663610999519443, 'reg_alpha': 0.9088739868230972}.
[I 2020-01-04 07:10:43,696] Finished trial#1 resulted in value: 0.7005030040927451. Current best value is 0.7005030040927451 with parameters: {'n_estimators': 85, 'num_leaves': 69, 'max_depth': 7, 'learning_rate': 0.06363288580508421, 'min_child_samples': 43, 'colsample_bytree': 0.7864735238206862, 'reg_alpha': 0.8599518902866621}.
[I 2020-01-04 07:10:52,523] Finished trial#2 resulted in value: 0.6829695317823791. Current best value is 0.7005030040927451 with parameters: {'n_estimators': 85, 'num_leaves': 69, 'max_depth': 7, 'learning_rate': 0.06363288580508421, 'min_child_samples': 43, 'colsample_bytree': 0.7864735238206862, 'reg_alpha': 0.859951890286662

[I 2020-01-04 07:20:32,709] Finished trial#48 resulted in value: 0.7032306233336851. Current best value is 0.7037333226976445 with parameters: {'n_estimators': 173, 'num_leaves': 101, 'max_depth': 6, 'learning_rate': 0.03348001639333178, 'min_child_samples': 65, 'colsample_bytree': 0.7062067124716284, 'reg_alpha': 0.27804470497588735}.
[I 2020-01-04 07:20:51,611] Finished trial#49 resulted in value: 0.7038236908247204. Current best value is 0.7038236908247204 with parameters: {'n_estimators': 188, 'num_leaves': 54, 'max_depth': 10, 'learning_rate': 0.07100930450029344, 'min_child_samples': 75, 'colsample_bytree': 0.82929896042348, 'reg_alpha': 0.3414728518361173}.
[I 2020-01-04 07:21:10,473] Finished trial#50 resulted in value: 0.7008844817280966. Current best value is 0.7038236908247204 with parameters: {'n_estimators': 188, 'num_leaves': 54, 'max_depth': 10, 'learning_rate': 0.07100930450029344, 'min_child_samples': 75, 'colsample_bytree': 0.82929896042348, 'reg_alpha': 0.34147285183

[I 2020-01-04 07:32:48,543] Finished trial#96 resulted in value: 0.7013315879720519. Current best value is 0.704412630130687 with parameters: {'n_estimators': 175, 'num_leaves': 96, 'max_depth': 8, 'learning_rate': 0.0688759519216885, 'min_child_samples': 63, 'colsample_bytree': 0.7574905885788854, 'reg_alpha': 0.2776951644063018}.
[I 2020-01-04 07:33:05,217] Finished trial#97 resulted in value: 0.7034934525727612. Current best value is 0.704412630130687 with parameters: {'n_estimators': 175, 'num_leaves': 96, 'max_depth': 8, 'learning_rate': 0.0688759519216885, 'min_child_samples': 63, 'colsample_bytree': 0.7574905885788854, 'reg_alpha': 0.2776951644063018}.
[I 2020-01-04 07:33:15,415] Finished trial#98 resulted in value: 0.6984576258055868. Current best value is 0.704412630130687 with parameters: {'n_estimators': 175, 'num_leaves': 96, 'max_depth': 8, 'learning_rate': 0.0688759519216885, 'min_child_samples': 63, 'colsample_bytree': 0.7574905885788854, 'reg_alpha': 0.2776951644063018}

[I 2020-01-04 07:44:57,937] Finished trial#144 resulted in value: 0.7025253048469958. Current best value is 0.7058214030558541 with parameters: {'n_estimators': 173, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.039073999129071654, 'min_child_samples': 73, 'colsample_bytree': 0.7242872169757882, 'reg_alpha': 0.2014060217468419}.
[I 2020-01-04 07:45:14,965] Finished trial#145 resulted in value: 0.7037050936803053. Current best value is 0.7058214030558541 with parameters: {'n_estimators': 173, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.039073999129071654, 'min_child_samples': 73, 'colsample_bytree': 0.7242872169757882, 'reg_alpha': 0.2014060217468419}.
[I 2020-01-04 07:45:32,309] Finished trial#146 resulted in value: 0.702843393126323. Current best value is 0.7058214030558541 with parameters: {'n_estimators': 173, 'num_leaves': 115, 'max_depth': 9, 'learning_rate': 0.039073999129071654, 'min_child_samples': 73, 'colsample_bytree': 0.7242872169757882, 'reg_alpha': 0.201

[I 2020-01-04 07:56:40,123] Finished trial#192 resulted in value: 0.7006082468874721. Current best value is 0.7063691129690083 with parameters: {'n_estimators': 171, 'num_leaves': 88, 'max_depth': 8, 'learning_rate': 0.04820691947055349, 'min_child_samples': 67, 'colsample_bytree': 0.5682266546312748, 'reg_alpha': 0.16956673760292854}.
[I 2020-01-04 07:56:54,423] Finished trial#193 resulted in value: 0.7065933131375645. Current best value is 0.7065933131375645 with parameters: {'n_estimators': 175, 'num_leaves': 77, 'max_depth': 8, 'learning_rate': 0.048194550671653576, 'min_child_samples': 62, 'colsample_bytree': 0.6085925136927083, 'reg_alpha': 0.14519701809637617}.
[I 2020-01-04 07:57:08,645] Finished trial#194 resulted in value: 0.7039191989946978. Current best value is 0.7065933131375645 with parameters: {'n_estimators': 175, 'num_leaves': 77, 'max_depth': 8, 'learning_rate': 0.048194550671653576, 'min_child_samples': 62, 'colsample_bytree': 0.6085925136927083, 'reg_alpha': 0.1451

[I 2020-01-04 08:08:17,001] Finished trial#240 resulted in value: 0.7033861141087434. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:08:31,029] Finished trial#241 resulted in value: 0.7019310987515583. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:08:45,645] Finished trial#242 resulted in value: 0.7041231452189498. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

[I 2020-01-04 08:19:59,849] Finished trial#288 resulted in value: 0.7029675436377295. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:20:15,486] Finished trial#289 resulted in value: 0.7007062716262681. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:20:29,995] Finished trial#290 resulted in value: 0.7029753069304021. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

[I 2020-01-04 08:31:43,403] Finished trial#336 resulted in value: 0.703208535588787. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:31:58,605] Finished trial#337 resulted in value: 0.7027128857229953. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:32:11,602] Finished trial#338 resulted in value: 0.7042997683872314. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.0826639

[I 2020-01-04 08:43:02,931] Finished trial#384 resulted in value: 0.7038045079031929. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:43:17,318] Finished trial#385 resulted in value: 0.7027671067266719. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:43:32,785] Finished trial#386 resulted in value: 0.702252507591022. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.0826639

[I 2020-01-04 08:54:19,754] Finished trial#432 resulted in value: 0.700946197547917. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:54:34,281] Finished trial#433 resulted in value: 0.7014175881762521. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 08:54:48,098] Finished trial#434 resulted in value: 0.7043812240752797. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.0826639

[I 2020-01-04 09:06:04,145] Finished trial#480 resulted in value: 0.7057550980000385. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:06:20,063] Finished trial#481 resulted in value: 0.7041438107044286. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:06:36,456] Finished trial#482 resulted in value: 0.7059796698377314. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

[I 2020-01-04 09:19:07,637] Finished trial#528 resulted in value: 0.7048446022820796. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:19:24,164] Finished trial#529 resulted in value: 0.7056803792290152. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:19:40,566] Finished trial#530 resulted in value: 0.7044219360173334. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

[I 2020-01-04 09:32:09,737] Finished trial#576 resulted in value: 0.7011311049564402. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:32:25,046] Finished trial#577 resulted in value: 0.7045078797873833. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:32:40,653] Finished trial#578 resulted in value: 0.7063655994575313. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

[I 2020-01-04 09:44:25,530] Finished trial#624 resulted in value: 0.7049680742795637. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:44:40,964] Finished trial#625 resulted in value: 0.7003951393300799. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:44:55,335] Finished trial#626 resulted in value: 0.7043793952717927. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

[I 2020-01-04 09:56:13,844] Finished trial#672 resulted in value: 0.7024576165168913. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:56:29,122] Finished trial#673 resulted in value: 0.7027199977722747. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 09:56:42,864] Finished trial#674 resulted in value: 0.7035822414347497. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

[I 2020-01-04 10:07:10,601] Finished trial#720 resulted in value: 0.704093027661163. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:07:25,113] Finished trial#721 resulted in value: 0.7051955451084675. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:07:39,676] Finished trial#722 resulted in value: 0.7043125166460447. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.0826639

[I 2020-01-04 10:18:21,865] Finished trial#768 resulted in value: 0.7054199881566736. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:18:35,346] Finished trial#769 resulted in value: 0.703827853740726. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:18:49,113] Finished trial#770 resulted in value: 0.7010748312329262. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.0826639

[I 2020-01-04 10:29:14,987] Finished trial#816 resulted in value: 0.7022564927786531. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:29:27,809] Finished trial#817 resulted in value: 0.7043466346395711. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:29:42,945] Finished trial#818 resulted in value: 0.7059055895229143. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

[I 2020-01-04 10:40:31,210] Finished trial#864 resulted in value: 0.7035642273100364. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:40:44,848] Finished trial#865 resulted in value: 0.702748879988673. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:40:58,844] Finished trial#866 resulted in value: 0.7057126822535453. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.0826639

[I 2020-01-04 10:51:56,662] Finished trial#912 resulted in value: 0.7032405692578663. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:52:11,549] Finished trial#913 resulted in value: 0.7037598446886094. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 10:52:26,733] Finished trial#914 resulted in value: 0.7029367925639387. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

[I 2020-01-04 11:03:40,601] Finished trial#960 resulted in value: 0.7003924245874872. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 11:03:55,646] Finished trial#961 resulted in value: 0.7035571140630655. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.08266397035033914}.
[I 2020-01-04 11:04:05,552] Finished trial#962 resulted in value: 0.7046166292020436. Current best value is 0.7074819047765459 with parameters: {'n_estimators': 185, 'num_leaves': 79, 'max_depth': 8, 'learning_rate': 0.04590899063540378, 'min_child_samples': 61, 'colsample_bytree': 0.6303490466483364, 'reg_alpha': 0.082663

On peut alors afficher le résultat de la recherche.

In [10]:
print('Number of finished trials: {}'.format(len(study.trials)))

print('Best trial:')
trial = study.best_trial

print('  Value: {}'.format(trial.value))

print('  Params: ')
for key, value in trial.params.items():
    print('    {}: {}'.format(key, value))

Number of finished trials: 1000
Best trial:
  Value: 0.7074819047765459
  Params: 
    n_estimators: 185
    num_leaves: 79
    max_depth: 8
    learning_rate: 0.04590899063540378
    min_child_samples: 61
    colsample_bytree: 0.6303490466483364
    reg_alpha: 0.08266397035033914


# 3. Sauvegarde du modèle
On crée ensuite un modèle de gradient bossting avec les meilleurs hyper-paramètres.

In [11]:
# Modele final
model = lgb.LGBMModel(objective='binary', metric='auc',
                      boosting_type= 'gbdt')
model.set_params(**study.best_params)
model

LGBMModel(boosting_type='gbdt', class_weight=None,
     colsample_bytree=0.6303490466483364, importance_type='split',
     learning_rate=0.04590899063540378, max_depth=8, metric='auc',
     min_child_samples=61, min_child_weight=0.001, min_split_gain=0.0,
     n_estimators=185, n_jobs=-1, num_leaves=79, objective='binary',
     random_state=None, reg_alpha=0.08266397035033914, reg_lambda=0.0,
     silent=True, subsample=1.0, subsample_for_bin=200000,
     subsample_freq=0)

Et on sauvegarde ce modèle dans le répertoire ``data/model``.

In [12]:
# On sauvegarde le modèle et la liste des features
model_params = {'model': model.get_params(), 'study': study}
filename = 'data/model/params_precision.sav'
pickle.dump(model_params, open(filename, 'wb'))

In [13]:
# Aperçu du dataframe des combinaisons de valeurs testées
df_res = pickle.load(open(filename, 'rb'))['study'].trials_dataframe()
df_res.head()

Unnamed: 0_level_0,number,state,value,datetime_start,datetime_complete,params,params,params,params,params,params,params,system_attrs
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,colsample_bytree,learning_rate,max_depth,min_child_samples,n_estimators,num_leaves,reg_alpha,_number
0,0,TrialState.COMPLETE,0.696978,2020-01-04 07:10:23.892168,2020-01-04 07:10:33.651048,0.766361,0.029467,5,79,134,20,0.908874,0
1,1,TrialState.COMPLETE,0.700503,2020-01-04 07:10:33.651048,2020-01-04 07:10:43.696018,0.786474,0.063633,7,43,85,69,0.859952,1
2,2,TrialState.COMPLETE,0.68297,2020-01-04 07:10:43.696018,2020-01-04 07:10:52.523756,0.818411,0.016907,9,23,52,53,0.463047,2
3,3,TrialState.COMPLETE,0.700285,2020-01-04 07:10:52.523756,2020-01-04 07:11:01.994538,0.894669,0.059838,4,83,187,13,0.65539,3
4,4,TrialState.COMPLETE,0.692109,2020-01-04 07:11:01.994538,2020-01-04 07:11:06.282309,0.75909,0.07007,8,82,25,99,0.095026,4


In [14]:
t1 = time()
print("computing time : {:8.6f} sec".format(t1-t0))
print("computing time : " + strftime('%H:%M:%S', gmtime(t1-t0)))

computing time : 14611.552607 sec
computing time : 04:03:31
