# Extreme Fine Tuning of LGBM using Incremental training

This notebook implements an advanced LightGBM training scheme based on incremental training. The approach consists of the following steps:

* Find the best parameters for your LGBM model, either manually or with an optimization method of your choice.
* Train the model to the best RMSE you can get in a single training round, using a high early-stopping threshold.
* Continue training for 1 or 2 rounds with a reduced learning rate.
* After those first few rounds, reduce the regularization parameters by a constant factor at each incremental training iteration. You will start to see improvements in the 5th decimal place of the validation RMSE, which is enough to gain a 5th-decimal improvement on your model's leaderboard score.
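
The schedule described in the steps above can be sketched in plain Python, without training anything. This is only an illustration: the function name `incremental_schedule` and its argument names are hypothetical, the default values (7 rounds, 2 warm-up rounds, a factor of 0.9, 40 extra leaves, a learning rate of 0.003) mirror the training cell of this notebook, and the base parameters are rounded for readability.

```python
from copy import deepcopy


def incremental_schedule(base_params, n_rounds=7, warmup_rounds=2,
                         decay=0.9, leaves_step=40, fine_tune_lr=0.003):
    """Return one parameter dict per incremental round (hypothetical helper)."""
    params = deepcopy(base_params)  # never mutate the caller's dict
    schedule = []
    for i in range(1, n_rounds + 1):
        if i > warmup_rounds:
            # after the warm-up rounds, relax regularization step by step
            params['reg_lambda'] *= decay
            params['reg_alpha'] *= decay
            params['num_leaves'] += leaves_step
        # every incremental round uses the reduced learning rate
        params['learning_rate'] = fine_tune_lr
        schedule.append(dict(params))
    return schedule


# rounded, illustrative base parameters (assumption, not the exact tuned values)
base = {'learning_rate': 0.0039, 'reg_alpha': 9.56,
        'reg_lambda': 9.36, 'num_leaves': 63}

for i, p in enumerate(incremental_schedule(base), start=1):
    print(i, p['learning_rate'], round(p['reg_alpha'], 3),
          round(p['reg_lambda'], 3), p['num_leaves'])
```

Printing the schedule before a long run is a cheap way to check how far the regularization will have dropped by the last round.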

This approach was inspired by https://www.kaggle.com/awwalmalhi/extreme-fine-tuning-lgbm-using-7-step-training

The initial settings for the manually tuned LightGBM model are taken from my ensembling kernel, https://www.kaggle.com/gvyshnya/ensemble-lgb-xgb-catboost-optimized

In [1]:
import pandas as pd
import numpy as np

import datetime as dt
from typing import Tuple, List, Dict

from sklearn.model_selection import KFold, GridSearchCV, cross_validate, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

from lightgbm import LGBMRegressor

import optuna
from functools import partial

import warnings
warnings.filterwarnings('ignore')



In [2]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

Started at  2021-02-21 11:55:39.405680


In [3]:
# read data
in_kaggle = False


def get_data_file_path(is_in_kaggle: bool) -> Tuple[str, str, str]:
    train_path = ''
    test_path = ''
    sample_submission_path = ''

    if is_in_kaggle:
        # running in Kaggle, inside the competition
        train_path = '../input/tabular-playground-series-feb-2021/train.csv'
        test_path = '../input/tabular-playground-series-feb-2021/test.csv'
        sample_submission_path = '../input/tabular-playground-series-feb-2021/sample_submission.csv'
    else:
        # running locally
        train_path = 'data/train.csv'
        test_path = 'data/test.csv'
        sample_submission_path = 'data/sample_submission.csv'

    return train_path, test_path, sample_submission_path


In [4]:
%%time
# get the training set and labels
train_set_path, test_set_path, sample_subm_path = get_data_file_path(in_kaggle)

train = pd.read_csv(train_set_path)
test = pd.read_csv(test_set_path)
target = train.target

subm = pd.read_csv(sample_subm_path)

Wall time: 1.54 s


In [5]:
X_train = train.drop(['id', 'target'], axis=1)
y_train = train.target
X_test = test.drop(['id'], axis=1)

In [6]:
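# categorical feature columns (their names contain 'cat')
cat_cols = [feature for feature in train.columns if 'cat' in feature]

def label_encoder(df):
    """Label-encode every categorical column of df in place and return df.

    Note: each call fits its own encoders, so train and test get matching
    integer codes only if both contain the same set of categories.
    """
    for feature in cat_cols:
        le = LabelEncoder()
        df[feature] = le.fit_transform(df[feature])
    return df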

In [7]:
X_train = label_encoder(X_train)
X_test = label_encoder(X_test)
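
Encoding train and test with separately fitted encoders works only when both sets contain exactly the same categories in every column. A minimal sketch of a safer variant, assuming plain Python lists instead of DataFrame columns: build one mapping from the union of the values and apply it to both sets. The helper names `fit_shared_mapping` and `encode` and the toy data are hypothetical. With pandas, the resulting dict could be applied through `Series.map`.

```python
def fit_shared_mapping(*columns):
    """Map every category seen in any of the columns to a stable integer code."""
    categories = sorted(set().union(*columns))
    return {category: code for code, category in enumerate(categories)}


def encode(column, mapping):
    """Replace each value by its integer code."""
    return [mapping[value] for value in column]


# toy data (assumption): category 'B' never occurs in the test column
train_cat0 = ['A', 'B', 'C', 'B']
test_cat0 = ['C', 'A', 'A']

mapping = fit_shared_mapping(train_cat0, test_cat0)
train_codes = encode(train_cat0, mapping)
test_codes = encode(test_cat0, mapping)

# 'C' gets the same code in both sets; an encoder fitted on the test column
# alone would have assigned it code 1, which means 'B' in the train column
print(mapping, train_codes, test_codes)
```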

In [8]:
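# shuffle is off, so random_state would have no effect
# (scikit-learn >= 0.24 raises a ValueError if it is set without shuffle=True)
split = KFold(n_splits=5)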

In [9]:
# ------------------------------------------------------------------------------
# Parameters
# ------------------------------------------------------------------------------
N_FOLDS = 10
N_ESTIMATORS = 30000
SEED = 2021
BAGGING_SEED = 48

# ------------------------------------------------------------------------------
# LightGBM: training and inference
# ------------------------------------------------------------------------------
lgbm_params = {'random_state': SEED,
          'metric': 'rmse',
          'n_estimators': N_ESTIMATORS,
          'n_jobs': -1,
          'cat_feature': [x for x in range(len(cat_cols))],
          'bagging_seed': SEED,
          'feature_fraction_seed': SEED,
          'learning_rate': 0.003899156646724397,
          'max_depth': 99,
          'num_leaves': 63,
          'reg_alpha': 9.562925363678952,
          'reg_lambda': 9.355810045480153,
          'colsample_bytree': 0.2256038826485174,
          'min_child_samples': 290,
          'subsample_freq': 1,
          'subsample': 0.8805303688019942,
          'max_bin': 882,
          'min_data_per_group': 127,
          'cat_smooth': 96,
          'cat_l2': 19
          }

In [10]:
preds_list_base = []
preds_list_final_iteration = []
preds_list_all = []

# Note: `early_stopping_rounds` and `verbose` are fit() arguments in LightGBM < 4.0;
# LightGBM 4.x expects the lightgbm.early_stopping / lightgbm.log_evaluation callbacks instead.
for train_idx, val_idx in split.split(X_train):
    X_tr = X_train.iloc[train_idx]
    X_val = X_train.iloc[val_idx]
    y_tr = y_train.iloc[train_idx]
    y_val = y_train.iloc[val_idx]

    # base model: one long training round with generous early stopping
    Model = LGBMRegressor(**lgbm_params).fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                  eval_metric=['rmse'],
                  early_stopping_rounds=250,
                  categorical_feature=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                  #callbacks=[optuna.integration.LightGBMPruningCallback(trial, metric='rmse')],
                  verbose=0)

    base_preds = Model.predict(X_test)
    preds_list_base.append(base_preds)
    preds_list_all.append(base_preds)
    first_rmse = np.sqrt(mean_squared_error(y_val, Model.predict(X_val)))
    print(f'RMSE for Base model is {first_rmse}')
    params = lgbm_params.copy()

    for i in range(1, 8):
        if i > 2:
            # from the 3rd incremental round on, relax regularization
            # and allow more leaves per tree
            params['reg_lambda'] *= 0.9
            params['reg_alpha'] *= 0.9
            params['num_leaves'] += 40

        params['learning_rate'] = 0.003
        # continue training from the model of the previous round
        Model = LGBMRegressor(**params).fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
                  eval_metric=['rmse'],
                  early_stopping_rounds=200,
                  categorical_feature=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                  #callbacks=[optuna.integration.LightGBMPruningCallback(trial, metric='rmse')],
                  verbose=0,
                  init_model=Model)

        preds_list_all.append(Model.predict(X_test))
        last_rmse = np.sqrt(mean_squared_error(y_val, Model.predict(X_val)))
        print(f'RMSE for Incremental trial {i} model is {last_rmse}')
    print('', end='\n\n')
    print(f'Improvement of : {first_rmse - last_rmse}')
    print('-' * 100)
    preds_list_final_iteration.append(Model.predict(X_test))

RMSE for Base model is 0.8417355453854034
RMSE for Incremental trial 1 model is 0.8417186407974255
RMSE for Incremental trial 2 model is 0.8417135133545762
RMSE for Incremental trial 3 model is 0.841706269869924
RMSE for Incremental trial 4 model is 0.8417002476913066
RMSE for Incremental trial 5 model is 0.8416970624752245
RMSE for Incremental trial 6 model is 0.8416931611517247
RMSE for Incremental trial 7 model is 0.8416884166464895


Improvement of : 4.712873891388192e-05
----------------------------------------------------------------------------------------------------
RMSE for Base model is 0.8413132573162314
RMSE for Incremental trial 1 model is 0.8413130886575063
RMSE for Incremental trial 2 model is 0.8413128719772601
RMSE for Incremental trial 3 model is 0.8413117466785476
RMSE for Incremental trial 4 model is 0.8413102499838527
RMSE for Incremental trial 5 model is 0.841307735709768
RMSE for Incremental trial 6 model is 0.8413019772685878
RMSE for Incremental trial 7 model 

As we can see, there is some further improvement in every fold. Some specific findings:

* The first two iterations only apply a very low learning_rate. From the 3rd iteration on, where regularization is reduced, some iterations show a very good improvement.
* In some later iterations the loss increases slightly compared to the previous iteration, which shows that the limit was reached a few iterations before the last one.
* If you set verbose=1, you will see that these improvements come only from the first few trees created in a round. After that the loss starts to increase, and LGBM keeps the best model. Still, reducing regularization does improve the loss for those first few trees.
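
To make the size of these improvements concrete, the per-round gains can be computed from the logged values. The sketch below uses the RMSE values printed above for the first fold; the variable names are illustrative.

```python
# validation RMSE of the first fold, copied from the training log above
fold1_rmse = [
    0.8417355453854034,  # base model
    0.8417186407974255,  # incremental trial 1
    0.8417135133545762,  # incremental trial 2
    0.841706269869924,   # incremental trial 3
    0.8417002476913066,  # incremental trial 4
    0.8416970624752245,  # incremental trial 5
    0.8416931611517247,  # incremental trial 6
    0.8416884166464895,  # incremental trial 7
]

# gain of each round = RMSE before the round minus RMSE after it
gains = [prev - cur for prev, cur in zip(fold1_rmse, fold1_rmse[1:])]
for i, gain in enumerate(gains, start=1):
    print(f'trial {i}: gain {gain:.2e}')

total_gain = fold1_rmse[0] - fold1_rmse[-1]
print(f'total: {total_gain:.2e}')
```

In this fold every round lowers the RMSE, and the total matches the "Improvement of" line in the log.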

In [11]:
y_preds_final_iteration = np.array(preds_list_final_iteration).mean(axis=0)
y_preds_final_iteration

array([7.6122141 , 7.77927529, 7.60381899, ..., 7.53936653, 7.47794379,
       7.27151483])

In [12]:
# public LB score 0.84192
submission = pd.DataFrame({'id':test.id,
              'target':y_preds_final_iteration})

In [13]:
submission.to_csv('gv_extreme_lgb_7_steps.csv', index=False)

In [14]:
print('We are done. That is all, folks!')
finish_time = dt.datetime.now()
print("Finished at ", finish_time)
elapsed = finish_time - start_time
print("Elapsed time: ", elapsed)

We are done. That is all, folks!
Finished at  2021-02-21 13:43:29.518634
Elapsed time:  1:47:50.112954
