# GBM Experiments

Setting up GBM experiments that run hyperparameter optimization (HPO) and model training on tabular data using
- catboost
- lightgbm
- xgboost

The experiments can be extended to the score combiner data once we get it, in order to determine which gradient boosting library to use for our data.

In [94]:
# set optuna variables
n_trials = 25
timeout = 30  # time out in minutes

## Importing Data

- create train/val split based on the values of a column
- separate categorical from continuous features
- create `train_df`, `val_df`

The data comes from the Kaggle Titanic dataset, but any tabular dataset can be used.

To download the data, create a Kaggle account and run
`kaggle competitions download -c titanic`

In [54]:
import pandas as pd
import numpy as np
import optuna
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


In [95]:
df = pd.read_csv("train.csv")  # path to train csv should be a part of the config yaml
df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [56]:
# fill nas 
df.fillna(-999, inplace=True)

In [57]:
# all of these should live in the config yaml
split_by_col_name = "PassengerId"  # used because we want to split by question in score combiner
cat_feat_names = ["Pclass", "Sex", "Embarked"]
cont_feat_names = ["Age", "SibSp", "Fare"]
label = "Survived"
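The comment in the cell above says these names should come from a config yaml. A minimal sketch of what that could look like, assuming a hypothetical `ExperimentConfig` dataclass (not part of the original notebook) filled from a plain dict, such as the one a YAML parser would return for the config file:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentConfig:
    """Hypothetical container for the settings hard-coded in this notebook."""
    train_csv: str
    split_by_col_name: str
    label: str
    cat_feat_names: List[str] = field(default_factory=list)
    cont_feat_names: List[str] = field(default_factory=list)
    n_trials: int = 25
    timeout_minutes: int = 30

    @property
    def all_cols(self) -> List[str]:
        # label first, then categorical, then continuous features
        return [self.label, *self.cat_feat_names, *self.cont_feat_names]


# stand-in for the dict parsed from the config yaml
raw_config = {
    "train_csv": "train.csv",
    "split_by_col_name": "PassengerId",
    "label": "Survived",
    "cat_feat_names": ["Pclass", "Sex", "Embarked"],
    "cont_feat_names": ["Age", "SibSp", "Fare"],
}

config = ExperimentConfig(**raw_config)
print(config.all_cols)
```

Keeping the column roles in one object means the same notebook can be pointed at another dataset by swapping the config, without touching the cells.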

In [58]:
for feat in cat_feat_names:
    df[feat] = df[feat].astype("category")

for feat in cont_feat_names:
    df[feat] = df[feat].astype("float64")

In [59]:
# only take features we care about
all_cols = [label, *cat_feat_names, *cont_feat_names]

In [60]:
len(df["PassengerId"]), df.shape

(891, (891, 12))

In [102]:
# only take cols we care about
df = df[all_cols]
print(df.shape)


(891, 7)


In this case I am using a simple train/val split. For k-fold cross validation, do the following:

```
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True)
kf.get_n_splits(df[split_by_col_name].unique())
```
When running an optuna trial, train k models with the same hyperparams (1 model per fold) and average the val accuracies.

Don't save the best model found with k-fold cv. Return the best parameters, then fit the final model on **all** the training data.
```
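def objective(trial):
    fold_accs = []
    unique_ids = df[split_by_col_name].unique()
    # kf.split yields positional indices into unique_ids, not the ids themselves
    for train_idx, val_idx in kf.split(unique_ids):
        train_ids, val_ids = unique_ids[train_idx], unique_ids[val_idx]
        train_df = df[df[split_by_col_name].isin(train_ids)]
        val_df = df[df[split_by_col_name].isin(val_ids)]

        # rest of the code for data prep + model fitting goes here,
        # ending with `accuracy` computed on val_df

        fold_accs.append(accuracy)
    return sum(fold_accs) / len(fold_accs)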

```
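A runnable sketch of the fold loop on synthetic data, to make the id handling concrete. `KFold.split` yields positions into the array of unique ids, so they are mapped back to id values before filtering rows. The toy data, the `kfold_accuracy_by_id` helper, and the `LogisticRegression` stand-in for the GBM are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# synthetic stand-in data: 40 ids with 3 rows each (not the Titanic data)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "group_id": np.repeat(np.arange(100, 140), 3),
    "feat": rng.normal(size=120),
})
toy_df["target"] = (toy_df["feat"] + rng.normal(scale=0.5, size=120) > 0).astype(int)


def kfold_accuracy_by_id(data, id_col, feature_cols, label_col, n_splits=5, seed=0):
    """Average validation accuracy over folds built from unique ids."""
    unique_ids = data[id_col].unique()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_accs = []
    for train_idx, val_idx in kf.split(unique_ids):
        # map positional indices back to id values before filtering rows
        train_ids, val_ids = unique_ids[train_idx], unique_ids[val_idx]
        train_part = data[data[id_col].isin(train_ids)]
        val_part = data[data[id_col].isin(val_ids)]
        # every id must land on exactly one side of the split
        assert set(train_part[id_col]).isdisjoint(val_part[id_col])

        model = LogisticRegression()  # stand-in for the GBM being tuned
        model.fit(train_part[feature_cols], train_part[label_col])
        preds = model.predict(val_part[feature_cols])
        fold_accs.append(accuracy_score(val_part[label_col], preds))
    return float(np.mean(fold_accs)), fold_accs


mean_acc, fold_accs = kfold_accuracy_by_id(toy_df, "group_id", ["feat"], "target")
print(len(fold_accs), round(mean_acc, 3))
```

Inside an optuna objective, the returned mean would be the value handed back to the study.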

Once the HPO has run, use `study.best_params` to get the best hyperparams found, then retrain the model with them on the **entire df (no split)**, e.g.
```
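from catboost import CatBoostClassifier

x, y = df[df.columns.drop(label)], df[label]
# np.float was removed from NumPy; np.float64 is the explicit equivalent
cat_feat_idxs = np.where(x.dtypes != np.float64)[0]

model = CatBoostClassifier(
    border_count=254,
    **study.best_params
)

# fit on all rows; there is no held-out split left to pass as eval_set
model.fit(
    x, y,
    cat_features=cat_feat_idxs,
    verbose=0
)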
```


However, it's important to know when to use k-fold cv.

We want to look at the distribution of classes between the train/val split. If we do k-fold cv, make sure that
- the class distribution is the same in the train and val splits
- there is no data leakage between the train/val split (this is why in score combiner we're splitting by questions, not just randomly)

- the above applies to a random train/val split as well
- if a random split does not satisfy those requirements, consider manually constructing a validation dataset from the data and not running any k-fold cv
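Both requirements can be checked mechanically before any training. A minimal sketch, assuming a hypothetical `check_split` helper and an arbitrary tolerance of 0.05 on the difference in class proportions; the toy frames stand in for a split by question id:

```python
import pandas as pd


def check_split(train_df, val_df, label_col, group_col=None, tol=0.05):
    """Report class-balance drift and group overlap between two splits.

    `tol` is an arbitrary illustrative threshold on the absolute difference
    in class proportions.
    """
    train_dist = train_df[label_col].value_counts(normalize=True)
    val_dist = val_df[label_col].value_counts(normalize=True)
    # align on the union of classes; a class missing from one split counts as 0
    diff = train_dist.sub(val_dist, fill_value=0.0).abs()
    max_diff = float(diff.max())

    shared_groups = set()
    if group_col is not None:
        shared_groups = set(train_df[group_col]) & set(val_df[group_col])

    return {
        "max_class_proportion_diff": max_diff,
        "balanced": max_diff <= tol,
        "shared_groups": shared_groups,
        "leak_free": not shared_groups,
    }


train_toy = pd.DataFrame({"qid": [1, 1, 2, 2, 3, 3, 4, 4], "y": [0, 1, 0, 1, 0, 1, 0, 1]})
val_toy = pd.DataFrame({"qid": [4, 5, 5, 6], "y": [1, 1, 1, 0]})
report = check_split(train_toy, val_toy, label_col="y", group_col="qid")
print(report)
```

In this toy case the split fails both checks: question 4 appears on both sides, and the positive class is 50% of train but 75% of val.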

In [61]:
# splitting data by column value
train_ids, val_ids = train_test_split(df[split_by_col_name].unique(), test_size=0.2)

train_df, val_df = df[df[split_by_col_name].isin(train_ids)], df[df[split_by_col_name].isin(val_ids)]



(712, 12) (179, 12)
(712, 7) (179, 7)


In [62]:
# splitting train_df, val_df into x_train, y_train, x_val, y_val
x_train, y_train = train_df[train_df.columns.drop(label)], train_df[label]

x_val, y_val = val_df[val_df.columns.drop(label)], val_df[label]

In [63]:
print(x_train.dtypes)
cat_feat_idxs = np.where(x_train.dtypes != np.float64)[0]  # np.float was removed from NumPy
cat_feat_idxs

Pclass      category
Sex         category
Embarked    category
Age          float64
SibSp        float64
Fare         float64
dtype: object


array([0, 1, 2])
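An alternative sketch that derives the positions from the feature names or from the `category` dtype, rather than from "anything that is not float". With the dtype test against float, a continuous column stored as an integer would be picked up as categorical. The toy frame and the `toy_` names are illustrative only:

```python
import pandas as pd

toy_x = pd.DataFrame({
    "Pclass": pd.Categorical([3, 1, 3]),
    "Sex": pd.Categorical(["male", "female", "female"]),
    "Embarked": pd.Categorical(["S", "C", "S"]),
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],  # integer column: a "not float" test would flag it
    "Fare": [7.25, 71.2833, 7.925],
})
toy_cat_names = ["Pclass", "Sex", "Embarked"]

# positions of the named columns; -1 would mean a name is missing
by_name = toy_x.columns.get_indexer(toy_cat_names)
assert (by_name >= 0).all(), "some categorical feature names are missing"

# dtype-based variant: selects every column whose dtype is `category`
by_dtype = [i for i, dt in enumerate(toy_x.dtypes) if isinstance(dt, pd.CategoricalDtype)]

print(list(by_name), by_dtype)
```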

## Catboost

https://github.com/catboost/tutorials/blob/master/python_tutorial.ipynb

https://www.kaggle.com/satorushibata/optimize-catboost-hyperparameter-with-optuna-gpu

In [19]:
from catboost import CatBoostClassifier


In [20]:

# model = CatBoostClassifier(
#     custom_loss=[metrics.Accuracy()],
#     random_seed=42,
#     logging_level='Silent'
# )

# model.fit(
#     x_train, y_train,
#     cat_features=cat_feat_idxs,
#     eval_set=(x_val, y_val),
#     verbose=0
# )

<catboost.core.CatBoostClassifier at 0x16133f820>

In [42]:
# hpo 
def objective(trial):
    params = {
        "iterations": trial.suggest_int("iterations", 50, 300),
        "depth": trial.suggest_int("depth", 4, 12),
        "random_strength": trial.suggest_int("random_strength", 0, 100),
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        "bagging_temperature": trial.suggest_float("bagging_temperature", 0.01, 100.00, log=True),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 1, log=True),
        "objective": trial.suggest_categorical("objective", ["Logloss", "CrossEntropy"]),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1e-8, 10.0, log=True),
    }


    model = CatBoostClassifier(
        border_count=254,
        **params
    )

    model.fit(
        x_train, y_train,
        cat_features=cat_feat_idxs,
        eval_set=(x_val, y_val),
        verbose=0
    )
    preds = model.predict(x_val)
    pred_labels = np.rint(preds)
    accuracy = accuracy_score(y_val, pred_labels)
    
    trial.set_user_attr(key="best_model", value=model)

    return accuracy

def callback(study, trial):
    if study.best_trial.number == trial.number:
        study.set_user_attr(key="best_model", value=trial.user_attrs["best_model"])


# run study
# note: the pruner only acts if the objective calls trial.report() and
# trial.should_prune(); it does neither here, so every trial runs to completion
pruner = optuna.pruners.SuccessiveHalvingPruner()
study = optuna.create_study(direction="maximize", pruner=pruner)
study.optimize(objective, n_trials=n_trials, timeout=timeout*60, callbacks=[callback])

# print a summary of the study and the best trial
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

[I 2022-01-24 17:45:56,919] A new study created in memory with name: no-name-40b80f69-2c14-4362-9a2e-f2763a34d818
[I 2022-01-24 17:45:58,192] Trial 0 finished with value: 0.8044692737430168 and parameters: {'iterations': 212, 'depth': 11, 'random_strength': 87, 'bagging_temperature': 0.4444868603650943, 'learning_rate': 0.10200169286367641, 'objective': 'CrossEntropy', 'l2_leaf_reg': 2.0398002633861805}. Best is trial 0 with value: 0.8044692737430168.
[I 2022-01-24 17:45:58,673] Trial 1 finished with value: 0.7988826815642458 and parameters: {'iterations': 200, 'depth': 12, 'random_strength': 48, 'bagging_temperature': 0.04667920409334895, 'learning_rate': 0.0029610102225529015, 'objective': 'Logloss', 'l2_leaf_reg': 2.102125802721445e-06}. Best is trial 0 with value: 0.8044692737430168.
[I 2022-01-24 17:45:59,185] Trial 2 finished with value: 0.7877094972067039 and parameters: {'iterations': 242, 'depth': 11, 'random_strength': 26, 'bagg

Number of finished trials: 25
Best trial:
  Value: 0.8156424581005587
  Params: 
    iterations: 181
    depth: 9
    random_strength: 14
    bagging_temperature: 98.64827444466418
    learning_rate: 0.0559949172297065
    objective: Logloss
    l2_leaf_reg: 3.7640306797012666e-05


In [22]:
# reload to verify matches HPO values
best_model=study.user_attrs["best_model"]
preds = best_model.predict(x_val)
pred_labels = np.rint(preds)
accuracy = accuracy_score(y_val, pred_labels)
print(accuracy)

0.8156424581005587


## LGBM

In [23]:
# TODO: save best model w/ pickle
import lightgbm as lgb

In [24]:
len(y_train.unique())

2

In [38]:
obj = "binary" if len(y_train.unique()) == 2 else "multiclass"
obj

'binary'
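The HPO cells in this notebook turn predictions into labels with `np.rint`, which fits the binary case where the model returns one probability per row. If `obj` were `"multiclass"`, LightGBM would also need `num_class` in its params, and `predict` would return one column of probabilities per class, so the label is the argmax instead. A small helper covering both cases; `probs_to_labels` is a hypothetical name and the 0.5 threshold is an assumption:

```python
import numpy as np


def probs_to_labels(probs, threshold=0.5):
    """Convert predicted probabilities to integer class labels."""
    probs = np.asarray(probs)
    if probs.ndim == 1:
        # binary: one probability of the positive class per row
        return (probs >= threshold).astype(int)
    # multiclass: one column per class, pick the most probable one
    return probs.argmax(axis=1)


binary_probs = [0.1, 0.5, 0.9]
multi_probs = [[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]]
print(probs_to_labels(binary_probs), probs_to_labels(multi_probs))
```

One difference from `np.rint`: a probability of exactly 0.5 maps to class 1 here, while `np.rint(0.5)` rounds to 0.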

In [28]:
list(cat_feat_idxs)

[0, 1, 2, 3]

In [30]:
print(x_train.dtypes)


Pclass      category
Sex         category
Cabin       category
Embarked    category
Age          float64
SibSp        float64
Fare         float64
dtype: object


In [39]:
params = {
        "objective": obj,
        "verbosity": -1,
        "boosting_type": "gbdt",                
        "seed": 42
    }
dtrain = lgb.Dataset(x_train, label=y_train, categorical_feature=cat_feat_names)   
model = lgb.train(params, dtrain)



In [40]:
# hpo
def objective(trial):
    dtrain = lgb.Dataset(x_train, label=y_train, categorical_feature=cat_feat_names)
    params = {
        "objective": obj,
        "verbosity": -1,
        "num_iterations": trial.suggest_int("num_iterations", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 4, 12),
        "boosting": trial.suggest_categorical("boosting", ["gbdt", "dart"]),
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 1, log=True),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 5, 100),
    }

    model = lgb.train(params, dtrain)
    preds = model.predict(x_val)
    pred_labels = np.rint(preds)
    accuracy = accuracy_score(y_val, pred_labels)
    
    trial.set_user_attr(key="best_model", value=model)
    return accuracy

def callback(study, trial):
    if study.best_trial.number == trial.number:
        study.set_user_attr(key="best_model", value=trial.user_attrs["best_model"])


# run study
# note: the pruner only acts if the objective calls trial.report() and
# trial.should_prune(); it does neither here, so every trial runs to completion
pruner = optuna.pruners.SuccessiveHalvingPruner()
study = optuna.create_study(direction="maximize", pruner=pruner)
study.optimize(objective, n_trials=n_trials, timeout=timeout*60, callbacks=[callback])

# print a summary of the study and the best trial
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

[I 2022-01-24 17:39:54,705] A new study created in memory with name: no-name-8cd73383-a600-4769-9dcc-0451a4fb283e
[I 2022-01-24 17:39:54,890] Trial 0 finished with value: 0.7821229050279329 and parameters: {'num_iterations': 287, 'max_depth': 5, 'boosting': 'gbdt', 'learning_rate': 0.004967601640257454, 'lambda_l1': 0.005092660037630496, 'lambda_l2': 2.648607498217249e-08, 'num_leaves': 204, 'feature_fraction': 0.8358845199756579, 'bagging_fraction': 0.6888836168623422, 'bagging_freq': 7, 'min_data_in_leaf': 31}. Best is trial 0 with value: 0.7821229050279329.
[I 2022-01-24 17:39:54,991] Trial 1 finished with value: 0.8044692737430168 and parameters: {'num_iterations': 174, 'max_depth': 9, 'boosting': 'dart', 'learning_rate': 0.8558091514908599, 'lambda_l1': 3.470822881962566e-08, 'lambda_l2': 3.1256988321417203, 'num_leaves': 21, 'feature_fraction': 0.9497913370734663, 'bagging_fraction': 0.5663942977724015, 'bagging_freq': 5, 'min_data_in_leaf': 98}

Number of finished trials: 25
Best trial:
  Value: 0.8491620111731844
  Params: 
    num_iterations: 125
    max_depth: 7
    boosting: gbdt
    learning_rate: 0.37473808487232596
    lambda_l1: 1.1987837919667571e-05
    lambda_l2: 2.269126618000925e-05
    num_leaves: 146
    feature_fraction: 0.5992935410689467
    bagging_fraction: 0.4545056624034441
    bagging_freq: 7
    min_data_in_leaf: 19


In [41]:
# reload to verify matches HPO values
best_model=study.user_attrs["best_model"]
preds = best_model.predict(x_val)
pred_labels = np.rint(preds)
accuracy = accuracy_score(y_val, pred_labels)
print(accuracy)

0.8491620111731844


## XGBoost

In [85]:
import xgboost as xgb

In [64]:
print(cat_feat_names)
print(x_train.dtypes)
x_train.head(5)

['Pclass', 'Sex', 'Embarked']
Pclass      category
Sex         category
Embarked    category
Age          float64
SibSp        float64
Fare         float64
dtype: object


Unnamed: 0,Pclass,Sex,Embarked,Age,SibSp,Fare
0,3,male,S,22.0,1.0,7.25
1,1,female,C,38.0,1.0,71.2833
2,3,female,S,26.0,0.0,7.925
3,1,female,S,35.0,1.0,53.1
4,3,male,S,35.0,0.0,8.05


In [81]:
# create one-hot encoding
def create_one_hot_dfs():
    one_hot_dfs = []
    for df in [x_train, x_val]:
        one_hot_df = df[cat_feat_names]
        one_hot_df = pd.get_dummies(one_hot_df, columns=cat_feat_names)
        df_no_cat_cols = df.drop(cat_feat_names, axis=1)
        new_df = pd.concat([df_no_cat_cols, one_hot_df], axis=1)
        for feat in new_df.columns:
            new_df[feat] = new_df[feat].astype("float64")
        
        one_hot_dfs.append(new_df)

    return one_hot_dfs
        



In [82]:
x_train_new, x_val_new = create_one_hot_dfs()
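`create_one_hot_dfs` encodes train and val independently, so the two frames only end up with the same columns when every category value appears in both splits. A sketch of one way to guard against a mismatch, by reindexing the val frame to the train columns; the toy frames are illustrative only:

```python
import pandas as pd

toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
toy_val = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [13.0, 18.0]})  # no C or Q

train_enc = pd.get_dummies(toy_train, columns=["Embarked"]).astype("float64")
val_enc = pd.get_dummies(toy_val, columns=["Embarked"]).astype("float64")

# give val exactly the train columns, in the same order;
# categories unseen in val become all-zero columns,
# categories seen only in val are dropped
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0.0)

print(list(val_enc.columns))
```

Without the reindex, the val frame in this toy case would have only two columns, and a model trained on the four train columns could not score it.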

In [83]:
print(x_train.shape, x_train_new.shape)
print(x_train_new.dtypes)

x_train_new.head(5)

(712, 6) (712, 12)
Age              float64
SibSp            float64
Fare             float64
Pclass_1         float64
Pclass_2         float64
Pclass_3         float64
Sex_female       float64
Sex_male         float64
Embarked_-999    float64
Embarked_C       float64
Embarked_Q       float64
Embarked_S       float64
dtype: object


Unnamed: 0,Age,SibSp,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_-999,Embarked_C,Embarked_Q,Embarked_S
0,22.0,1.0,7.25,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
1,38.0,1.0,71.2833,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,26.0,0.0,7.925,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
3,35.0,1.0,53.1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,35.0,0.0,8.05,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0


In [84]:
print(x_val.shape, x_val_new.shape)
print(x_val_new.dtypes)
x_val_new.head(5)

(179, 6) (179, 12)
Age              float64
SibSp            float64
Fare             float64
Pclass_1         float64
Pclass_2         float64
Pclass_3         float64
Sex_female       float64
Sex_male         float64
Embarked_-999    float64
Embarked_C       float64
Embarked_Q       float64
Embarked_S       float64
dtype: object


Unnamed: 0,Age,SibSp,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_-999,Embarked_C,Embarked_Q,Embarked_S
15,55.0,0.0,16.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
17,-999.0,0.0,13.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
18,31.0,1.0,18.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
24,8.0,3.0,21.075,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
26,-999.0,0.0,7.225,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0


In [91]:
# hpo

def objective(trial):
    dtrain = xgb.DMatrix(x_train_new, label=y_train)
    dvalid = xgb.DMatrix(x_val_new, label=y_val)

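    param = {
        "verbosity": 0,
        # NOTE: no "objective" is set, so xgboost uses its default (reg:squarederror);
        # "binary:logistic" is the usual choice for a binary label like this one.
        # tree booster to use (gblinear is not part of this search space).
        "booster": trial.suggest_categorical("booster", ["gbtree", "dart"]),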
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # sampling ratio for training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # sampling according to each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
    }

    if param["booster"] in ["gbtree", "dart"]:
        # maximum depth of the tree, signifies complexity of the tree.
        param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
        # minimum child weight, larger the term more conservative the tree.
        param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
        param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
        # defines how selective algorithm is.
        param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
        param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

    if param["booster"] == "dart":
        param["sample_type"] = trial.suggest_categorical("sample_type", ["uniform", "weighted"])
        param["normalize_type"] = trial.suggest_categorical("normalize_type", ["tree", "forest"])
        param["rate_drop"] = trial.suggest_float("rate_drop", 1e-8, 1.0, log=True)
        param["skip_drop"] = trial.suggest_float("skip_drop", 1e-8, 1.0, log=True)

    model = xgb.train(param, dtrain)
    preds = model.predict(dvalid)
    pred_labels = np.rint(preds)
    accuracy = accuracy_score(y_val, pred_labels)
    trial.set_user_attr(key="best_model", value=model)
    return accuracy

def callback(study, trial):
    if study.best_trial.number == trial.number:
        study.set_user_attr(key="best_model", value=trial.user_attrs["best_model"])


# run study
# note: the pruner only acts if the objective calls trial.report() and
# trial.should_prune(); it does neither here, so every trial runs to completion
pruner = optuna.pruners.SuccessiveHalvingPruner()
study = optuna.create_study(direction="maximize", pruner=pruner)
study.optimize(objective, n_trials=n_trials, timeout=timeout*60, callbacks=[callback])

# print a summary of the study and the best trial
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

[I 2022-01-25 13:31:53,863] A new study created in memory with name: no-name-0f215a4f-ddf2-4ea9-b134-9e47de7febfc
[I 2022-01-25 13:31:53,925] Trial 0 finished with value: 0.8100558659217877 and parameters: {'booster': 'gbtree', 'lambda': 7.246869405081062e-08, 'alpha': 2.0864713552543363e-06, 'subsample': 0.33697008050941857, 'colsample_bytree': 0.8887954343897158, 'max_depth': 5, 'min_child_weight': 3, 'eta': 0.0003896523885422125, 'gamma': 5.201055659103267e-06, 'grow_policy': 'lossguide'}. Best is trial 0 with value: 0.8100558659217877.
[I 2022-01-25 13:31:54,017] Trial 1 finished with value: 0.770949720670391 and parameters: {'booster': 'dart', 'lambda': 3.0743125199168593e-07, 'alpha': 3.781291921424965e-05, 'subsample': 0.7820918196780899, 'colsample_bytree': 0.8880446295636428, 'max_depth': 7, 'min_child_weight': 7, 'eta': 8.12216822799987e-08, 'gamma': 0.0004606398424415315, 'grow_policy': 'lossguide', 'sample_type': 'weighted', 'normalize_typ

Number of finished trials: 25
Best trial:
  Value: 0.8268156424581006
  Params: 
    booster: gbtree
    lambda: 1.12539313883328e-05
    alpha: 0.018067193021280282
    subsample: 0.8715797724178268
    colsample_bytree: 0.2500057947924821
    max_depth: 5
    min_child_weight: 6
    eta: 0.01873992573733762
    gamma: 5.47607276650213e-07
    grow_policy: depthwise


In [93]:
# reload to verify matches HPO values
best_model=study.user_attrs["best_model"]
# dvalid = xgb.DMatrix(x_val_new, label=y_val)
dvalid = xgb.DMatrix(x_val_new)
preds = best_model.predict(dvalid)
pred_labels = np.rint(preds)
accuracy = accuracy_score(y_val, pred_labels)
print(accuracy)

0.8268156424581006
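The three best validation accuracies reported in this notebook (0.816 for catboost, 0.849 for lightgbm, 0.827 for xgboost) come from a single 179-row validation split, so it helps to gauge how much of the gap could be sampling noise. A rough sketch using the binomial standard error of an accuracy estimate, sqrt(p(1-p)/n). This is a back-of-the-envelope check, not a formal test, and it ignores that each value is the best of 25 trials selected on that same split:

```python
import math

n_val = 179  # rows in the validation split
best_val_acc = {
    "catboost": 0.8156424581005587,
    "lightgbm": 0.8491620111731844,
    "xgboost": 0.8268156424581006,
}

for name, acc in best_val_acc.items():
    correct = round(acc * n_val)  # number of correctly classified rows
    std_err = math.sqrt(acc * (1 - acc) / n_val)
    print(f"{name}: {correct}/{n_val} correct, accuracy {acc:.3f} +/- {std_err:.3f}")

gap = best_val_acc["lightgbm"] - best_val_acc["catboost"]
print(f"largest gap: {gap:.3f}")
```

Each standard error comes out a little under 0.03, and the largest gap (about 0.034) amounts to six validation rows, so the ranking on this split alone is weak evidence for choosing a library.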
