# About gradient boosting

* Reference for the gradient boosting algorithm: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

The key parameters to tune are:

1. `n_estimators`: the number of weak learners (trees) used.
   
2. `max_depth`: controls the depth of each tree.
   
3. `learning_rate`: a hyperparameter, typically set between 0 and 1, that controls overfitting through *shrinkage*, a regularization method that scales the contribution of each tree to the ensemble. The recommended approach is to set the learning rate to a low value (< 0.1) and control `n_estimators` through early stopping.

4. `subsample`: the fraction of the training samples used to fit each individual tree. The default is `1.0`; with a value below 1, each tree is built on a random subsample of the data (stochastic gradient boosting). This reduces variance at the cost of higher bias.
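
As a minimal sketch of how these four parameters are passed to the estimator, the snippet below fits a `GradientBoostingClassifier` on synthetic data. The dataset from `make_classification`, the variable names, and the specific values are illustrative assumptions, not this notebook's data or its tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (assumption: stands in for the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

clf = GradientBoostingClassifier(
    n_estimators=200,    # number of weak learners (trees)
    max_depth=3,         # depth of each individual tree
    learning_rate=0.05,  # shrinkage: scales each tree's contribution
    subsample=0.8,       # each tree is fit on a random 80% of the training rows
    random_state=0,
)
clf.fit(X_tr, y_tr)

# Without early stopping, all requested trees are built.
n_trees_built = clf.n_estimators_
test_accuracy = clf.score(X_te, y_te)
print(n_trees_built, round(test_accuracy, 3))
```

A lower `learning_rate` usually needs a larger `n_estimators` to reach the same training loss, which is why the two are normally tuned together.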

Early stopping parameters:

1. `validation_fraction`: the fraction of the training data set aside as a validation set for early stopping. If the loss on this validation set does not improve for `n_iter_no_change` consecutive iterations, training stops.

2. `n_iter_no_change`: the number of boosting iterations without improvement in the validation loss after which early stopping is triggered. The default, `None`, disables early stopping.
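
The effect of these two parameters can be sketched as follows. The example uses synthetic data and illustrative values (both are assumptions, not this notebook's data) and reads the fitted attribute `n_estimators_` to see how many trees were actually built.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary-classification data (assumption: stands in for the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.05,
    validation_fraction=0.2,  # 20% of the training data is held out internally
    n_iter_no_change=5,       # stop after 5 iterations without improvement
    random_state=0,
)
clf.fit(X_demo, y_demo)

# Number of trees actually built; with early stopping enabled this can be
# lower than the requested n_estimators.
n_trees_built = clf.n_estimators_
print(n_trees_built)
```

With early stopping enabled, `n_estimators` acts as a ceiling rather than a fixed tree count, so large values in a search grid do not necessarily translate into longer training.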

In [1]:
import os
import numpy as np
import pandas as pd
import pickle
from ipynb.fs.full.n01preprocessing import load_obj
from ipynb.fs.full.n01preprocessing import save_obj
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from time import time

> Trained imputers.
> Applied imputers.
> Trained encoders.
> Applied encoders.
> Applied imputers.
> Applied encoders.


In [2]:
# Project
workdir = '/home/walter/Documents/personal_projects/new-titan'
exp_prefix = 'notebooks/experiments/exp_03'
data_prefix = 'data'
chk_prefix = 'checkpoint'

# Params
target = 'Survived'
features = ['Sex', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
idx = 'Passengerid'

# ALGO PARAMS FOR TUNING
loss_function = ['log_loss'] # objective function to optimize
learning_rate = [0.01, 0.03, 0.05, 0.1]
n_estimators = [50, 100, 300, 500]
max_depth = [2,3,4,5,6]
subsample = [0.6, 0.8, 1.0]
validation_fraction = [0.2]
n_iter_no_change = [5]

# TUNING CONTROL
cv = 5
n_iter = 100
score = 'accuracy'

# SETUP
params = {
    'loss': loss_function,
    'learning_rate': learning_rate,
    'n_estimators': n_estimators,
    'max_depth': max_depth,
    'subsample': subsample,
    'validation_fraction': validation_fraction,
    'n_iter_no_change': n_iter_no_change,
}
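
To put `n_iter` in context, the sketch below counts how many distinct combinations the `params` dictionary above defines. It uses scikit-learn's `ParameterGrid` only for counting; the names `n_combinations` and `coverage` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the `params` dictionary defined above.
params = {
    "loss": ["log_loss"],
    "learning_rate": [0.01, 0.03, 0.05, 0.1],
    "n_estimators": [50, 100, 300, 500],
    "max_depth": [2, 3, 4, 5, 6],
    "subsample": [0.6, 0.8, 1.0],
    "validation_fraction": [0.2],
    "n_iter_no_change": [5],
}

n_iter = 100
n_combinations = len(ParameterGrid(params))  # 4 * 4 * 5 * 3 = 240
coverage = n_iter / n_combinations           # share of the grid that is sampled
print(n_combinations, round(coverage, 2))    # 240 0.42
```

Because every entry is a list, `RandomizedSearchCV` samples without replacement, so `n_iter = 100` evaluates 100 distinct settings out of the 240 possible ones.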

In [3]:
def load_data(prefix):
    """Load train/test features, targets, and labels from CSV files under `prefix`."""
    X_train = np.genfromtxt(os.path.join(prefix, 'data_train', 'X_train.csv'), delimiter=',')
    y_train = np.genfromtxt(os.path.join(prefix, 'data_train', 'y_train.csv'), delimiter=',').astype('int')
    label_train = np.genfromtxt(os.path.join(prefix, 'data_train', 'label_train.csv'), delimiter=',')
    X_test = np.genfromtxt(os.path.join(prefix, 'data_test', 'X_test.csv'), delimiter=',')
    y_test = np.genfromtxt(os.path.join(prefix, 'data_test', 'y_test.csv'), delimiter=',').astype('int')
    label_test = np.genfromtxt(os.path.join(prefix, 'data_test', 'label_test.csv'), delimiter=',')

    return X_train, y_train, label_train, X_test, y_test, label_test

def report(results, n_top=3):
    """Print the validation score and parameters of the candidates ranked 1 to `n_top`."""
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results["rank_test_score"] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print(
                "Mean validation score: {0:.3f} (std: {1:.3f})".format(
                    results["mean_test_score"][candidate],
                    results["std_test_score"][candidate],
                )
            )
            print("Parameters: {0}".format(results["params"][candidate]))
            print("")

In [4]:
# load
model = load_obj(os.path.join(workdir, exp_prefix, 'artifacts', 'selected_model.pkl'))
X, y, label, _, _, _ = load_data(os.path.join(workdir, data_prefix, 'processed'))
model

In [6]:
search = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    scoring=score,
    cv=cv,
    n_iter=n_iter,
    return_train_score=False,
    refit=True
)

# fit
start = time()
search.fit(X, y)
elapsed = time() - start  # time the search only, excluding the save below

# save
save_obj(search.best_estimator_, os.path.join(workdir, exp_prefix, 'artifacts', 'tunned_model.pkl'))

# report
print(
    "Tunning took %.2f seconds for %d candidate parameter settings."
    % (elapsed, len(search.cv_results_["params"]))
)

print('###')

report(search.cv_results_, n_top=10)

Tunning took 43.76 seconds for 100 candidate parameter settings.
###
Model with rank: 1
Mean validation score: 0.823 (std: 0.022)
Parameters: {'validation_fraction': 0.2, 'subsample': 0.6, 'n_iter_no_change': 5, 'n_estimators': 300, 'max_depth': 3, 'loss': 'log_loss', 'learning_rate': 0.05}

Model with rank: 2
Mean validation score: 0.823 (std: 0.022)
Parameters: {'validation_fraction': 0.2, 'subsample': 0.8, 'n_iter_no_change': 5, 'n_estimators': 50, 'max_depth': 6, 'loss': 'log_loss', 'learning_rate': 0.05}

Model with rank: 3
Mean validation score: 0.822 (std: 0.028)
Parameters: {'validation_fraction': 0.2, 'subsample': 0.6, 'n_iter_no_change': 5, 'n_estimators': 500, 'max_depth': 4, 'loss': 'log_loss', 'learning_rate': 0.05}

Model with rank: 3
Mean validation score: 0.822 (std: 0.032)
Parameters: {'validation_fraction': 0.2, 'subsample': 0.6, 'n_iter_no_change': 5, 'n_estimators': 300, 'max_depth': 4, 'loss': 'log_loss', 'learning_rate': 0.03}

Model with rank: 5
Mean validation s