# Tabular Playground September 2021

This notebook handles hyperparameter tuning for the LightGBM model developed for the Kaggle Tabular Playground September 2021 Competition. For EDA, FE, and initial model development, see the previous notebooks [here](https://github.com/mcnewcp/kaggle-tabular-playground-series-sep21/blob/coy/kaggle-tab-playground-2021-09_MI-PCA.ipynb) and [here](https://github.com/mcnewcp/kaggle-tabular-playground-series-sep21/blob/coy/kaggle-tab-playground-2021-09_FE%2BLogReg-RF-XGB.ipynb).

# Load Data

In [1]:
import numpy as np
import pandas as pd

test = pd.read_csv('datasets/test.csv')
train = pd.read_csv('datasets/train.csv')

# ######TEMP REDUCE DATA SIZE
# train = train.sample(1000)

X_train = train.drop(['id', 'claim'], axis=1)
y_train = train.claim

# Model Pipeline

## Feature Engineering

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from scipy import stats

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    #one hyperparameter per new feature
    def __init__(
        self, 
        add_sum = True, 
        add_num_nan = True,
        add_abs_sum = True, 
        add_sem = True
    ): #no *args or **kwargs
        self.add_sum = add_sum
        self.add_num_nan = add_num_nan
        self.add_abs_sum = add_abs_sum
        self.add_sem = add_sem
    def fit(self, X, y=None):
        return self #nothing to fit
    def transform(self, X):
        #generate additional features
        if self.add_sum:
            std_scaler = StandardScaler()
            sum_col = X.copy()
            sum_col[np.isnan(sum_col)] = 0
            sum_col = std_scaler.fit_transform(sum_col)
            sum_col = sum_col.sum(axis=1)
            X = np.c_[X, sum_col]
        if self.add_num_nan:
            num_nan = np.isnan(X).sum(axis=1)
            X = np.c_[X, num_nan]
        if self.add_abs_sum:
            abs_sum = X.copy()
            abs_sum[np.isnan(abs_sum)] = 0
            abs_sum = np.abs(abs_sum).sum(axis=1)
            X = np.c_[X, abs_sum]
        if self.add_sem:
            sem_col = stats.sem(X, nan_policy = 'omit', axis=1)
            X = np.c_[X, sem_col]
        return X
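
To make the row-wise features concrete, here is a minimal sketch on a made-up 3x3 array (the `toy` values are illustrative, not competition data). It computes the missing-value count and the absolute sum from the raw columns only, whereas the transformer above computes each feature on the array as it stands at that step, including any columns appended earlier.

```python
import numpy as np

# Hypothetical toy data: 3 rows, 3 features, some values missing
toy = np.array([
    [1.0, -2.0, np.nan],
    [4.0, np.nan, np.nan],
    [0.5, 0.5, 3.0],
])

# Number of missing values in each row
num_nan = np.isnan(toy).sum(axis=1)

# Sum of absolute values in each row, treating NaN as 0
abs_sum = np.abs(np.nan_to_num(toy, nan=0.0)).sum(axis=1)

# Append the new features as extra columns, as the transformer does with np.c_
toy_aug = np.c_[toy, num_nan, abs_sum]
```

The result keeps the three original columns and gains one column per engineered feature, so `toy_aug` has shape `(3, 5)`.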


## Preprocessing

In [3]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

## Model

In [4]:
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier

model = Pipeline([
    ('attr_adder', CombinedAttributesAdder()),
    ('imputer', SimpleImputer()),
    ('std_scaler', scaler),
    ('lgbm', LGBMClassifier(silent=False, objective='binary'))
])

# Tuning

For tuning I'm going to use the `Optuna` package. Optuna provides a number of really cool features for tuning hyperparameters and processing steps, including built-in plotly diagnostic plots. A pretty good intro to Optuna can be found [here](https://towardsdatascience.com/how-to-make-your-model-awesome-with-optuna-b56d490368af).

The package works by optimizing an objective function. The objective function must return a single value, which is the score the Optuna study minimizes by default, or maximizes when the study is created with `direction='maximize'`, as it is here. It is suggested to use a cross-validated score inside the objective function. Every hyperparameter to be tuned must be sampled using the built-in `trial.suggest_*` functions. Five distribution options are provided:

* uniform — float values
* log-uniform — float values
* discrete uniform — float values with intervals
* integer — integer values
* categorical — categorical values from a list

Of course you can access any hyperparameters along the pipeline using the `step__hyperparam` nomenclature.  I'm also including a pickling step in the objective below, so that I can load intermediate results in case the process is interrupted or if I want to add onto the study.
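
The `step__hyperparam` naming can be sketched on a small stand-in pipeline. The `demo_pipe` below and its `LogisticRegression` step are only placeholders for illustration, not the pipeline tuned in this notebook.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in pipeline with two named steps
demo_pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('clf', LogisticRegression()),
])

# '<step name>__<parameter name>' reaches into the named step
demo_pipe.set_params(imputer__strategy='median', clf__C=0.5)

demo_strategy = demo_pipe.get_params()['imputer__strategy']
demo_C = demo_pipe.named_steps['clf'].C
```

Because the keys follow this convention, a dict of sampled values can be passed to the pipeline in one call with `set_params(**params)`.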

In [5]:
import optuna
import joblib
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def objective(trial):

    #checkpoint the study at the start of every trial; `study` is the global created in the cell below
    joblib.dump(study, 'study-lgbm.pkl')

    attr_adder__add_sum = trial.suggest_categorical('attr_adder__add_sum', [False, True])
    attr_adder__add_num_nan = trial.suggest_categorical('attr_adder__add_num_nan', [False, True])
    attr_adder__add_abs_sum = trial.suggest_categorical('attr_adder__add_abs_sum', [False, True])
    attr_adder__add_sem = trial.suggest_categorical('attr_adder__add_sem', [False, True])
    imputer__strategy = trial.suggest_categorical('imputer__strategy', ['median', 'mean', 'constant', 'most_frequent'])
    lgbm__boosting_type = trial.suggest_categorical('lgbm__boosting_type', ['gbdt', 'goss'])
    lgbm__num_leaves = trial.suggest_int('lgbm__num_leaves', 2, 200)
    lgbm__max_depth = trial.suggest_int('lgbm__max_depth', -1, 50)
    lgbm__learning_rate = trial.suggest_uniform('lgbm__learning_rate', 0.01, 1)
    lgbm__n_estimators = trial.suggest_int('lgbm__n_estimators', 10, 500)
    lgbm__reg_alpha = trial.suggest_loguniform('lgbm__reg_alpha', 0.1, 100)
    lgbm__reg_lambda = trial.suggest_loguniform('lgbm__reg_lambda', 0.1, 100)


    params = {
        'attr_adder__add_sum': attr_adder__add_sum,
        'attr_adder__add_num_nan': attr_adder__add_num_nan,
        'attr_adder__add_abs_sum': attr_adder__add_abs_sum, 
        'attr_adder__add_sem': attr_adder__add_sem, 
        'imputer__strategy': imputer__strategy,
        'lgbm__boosting_type': lgbm__boosting_type,
        'lgbm__num_leaves': lgbm__num_leaves,
        'lgbm__max_depth': lgbm__max_depth,
        'lgbm__learning_rate': lgbm__learning_rate, 
        'lgbm__n_estimators': lgbm__n_estimators, 
        'lgbm__reg_alpha': lgbm__reg_alpha, 
        'lgbm__reg_lambda': lgbm__reg_lambda
    }

    model.set_params(**params)

    preds = cross_val_predict(
        model, X_train, y_train, cv=3, method="predict_proba", n_jobs = -1
    )
    preds = preds[:, 1] #keep only the predicted probability of the positive class
    
    return roc_auc_score(y_train, preds)

Now all that's left to do is create the study (or load a previously pickled version) and optimize the objective. You can specify how long the study lasts in number of trials (`n_trials`) or in seconds (`timeout`). Note that `timeout` only sets the time before which the last trial must start, so the study can run longer than `timeout`.

In [6]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, timeout=14400)

[I 2021-09-30 09:18:55,358] A new study created in memory with name: no-name-8fd4da59-824f-4d81-80b6-d60e31f2a277
[I 2021-09-30 09:21:12,091] Trial 0 finished with value: 0.7673510365180414 and parameters: {'attr_adder__add_sum': False, 'attr_adder__add_num_nan': False, 'attr_adder__add_abs_sum': True, 'attr_adder__add_sem': False, 'imputer__strategy': 'mean', 'lgbm__boosting_type': 'gbdt', 'lgbm__num_leaves': 32, 'lgbm__max_depth': 33, 'lgbm__learning_rate': 0.5380267755835213, 'lgbm__n_estimators': 412, 'lgbm__reg_alpha': 20.248625850287084, 'lgbm__reg_lambda': 3.604256929806227}. Best is trial 0 with value: 0.7673510365180414.
[I 2021-09-30 09:24:42,169] Trial 1 finished with value: 0.7276432144757129 and parameters: {'attr_adder__add_sum': False, 'attr_adder__add_num_nan': False, 'attr_adder__add_abs_sum': True, 'attr_adder__add_sem': True, 'imputer__strategy': 'median', 'lgbm__boosting_type': 'gbdt', 'lgbm__num_leaves': 130, 'lgbm__max_depth': 15

# Visualize Results

Optuna provides some really interesting visuals through its `optuna.visualization` module. Most are built with plotly, meaning you get a little interactivity to play with.

## Hyperparameter Importance

In [7]:
import plotly
optuna.visualization.plot_param_importances(study)

## Optimization History

In [8]:
optuna.visualization.plot_optimization_history(study)

## Hyperparameter Slices

The slice plots give you an idea of the effect of each hyperparameter individually on the outcome of the objective.

In [9]:
optuna.visualization.plot_slice(study)

Or you can look at hyperparameters individually.

In [10]:
optuna.visualization.plot_slice(study, ['lgbm__learning_rate'])

# Evaluation

Now to predict the test set with the final model and submit predictions.

In [11]:
final_mod = model.set_params(**study.best_params)
final_mod.fit(X_train, y_train)

# Preprocessing of test data, get predictions
X_test_ids = test.id
X_test = test.drop('id', axis=1)
preds = final_mod.predict_proba(X_test)
preds = preds[:, 1] #keep only the predicted probability of the positive class

#export predictions
output = pd.DataFrame({'id': X_test_ids,
                       'claim': preds})
output.to_csv('submission.csv', index=False)

[LightGBM] [Info] Number of positive: 477515, number of negative: 480404
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 30276
[LightGBM] [Info] Number of data points in the train set: 957919, number of used features: 120
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.498492 -> initscore=-0.006032
[LightGBM] [Info] Start training from score -0.006032
