https://stats.stackexchange.com/questions/453540/how-does-lightgbm-deals-with-incremental-learning-and-concept-drift

1. LightGBM adds more trees when we update it through continued training (e.g. through BoosterUpdateOneIter). With refit, by contrast, we keep the existing tree structures and only update the leaf outputs based on the new data. This is faster than re-training from scratch, since we do not have to re-discover the optimal tree structures. Note, however, that it will almost certainly perform worse (on the combined old and new data) than a full retrain from scratch on that data.
2. Any online learning algorithm is designed to adapt to changes. That said, LightGBM's performance will depend on the training parameters we use and on how we validate our predictions (e.g. how readily we disregard previous data points). Even assuming we train our booster properly, without a relevant baseline (e.g. a ridge regression trained in an incremental manner) it does not make sense to say "LightGBM is good (or bad)" at dealing with concept drift.

In [1]:
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statistics

import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


In [2]:
INPUT_PATH = '../data/input'

In [3]:
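def read_data(year):
    """Load one year of demand, temperature and precipitation data into one frame."""
    # The input files are Shift-JIS encoded and start with non-tabular header rows.
    df = pd.read_csv(os.path.join(INPUT_PATH, f'juyo-{year}.txt'), skiprows=1, encoding='shift-jis')
    df_temp = pd.read_csv(os.path.join(INPUT_PATH, f'temp-{year}.csv'), skiprows=3, encoding='shift-jis')
    df_precipitation = pd.read_csv(os.path.join(INPUT_PATH, f'precipitation-{year}.csv'), skiprows=3, encoding='shift-jis')
    df['DATE_TIME'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])
    # '実績(万kW)' is the actual demand column (unit: 10,000 kW).
    df = df.rename(columns={'実績(万kW)': 'usage'})
    # The weather columns are attached by index, so the three files are
    # assumed to be row-aligned.
    df['temperature'] = df_temp['気温(℃)']  # air temperature (deg C)
    df['precipitation'] = df_precipitation['降水量(mm)']  # precipitation (mm)
    return df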


def preprocess(df):
    df['month'] = df['DATE_TIME'].dt.month
    df['hour'] = df['DATE_TIME'].dt.hour
    df['minute'] = df['DATE_TIME'].dt.minute
    df['dayofweek'] = df['DATE_TIME'].dt.dayofweek
    df = df.drop(['DATE', 'TIME', 'DATE_TIME'], axis=1)
    return df


def make_dataset(X, y):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=0)
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_valid = lgb.Dataset(X_valid, y_valid)
    
    return lgb_train, lgb_valid


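def train_model(lgb_train, lgb_valid, init_model=None):
    # 'importance_type' belongs to the scikit-learn wrapper, not to the
    # lgb.train parameters, so it is not passed here.
    params = {
        'objective': 'rmse',
        'learning_rate': 0.01,
        'num_leaves': 31,
    }

    # LightGBM 4.0 removed the verbose_eval and early_stopping_rounds arguments
    # of lgb.train; the equivalent callbacks are used instead.
    model = lgb.train(params, lgb_train, num_boost_round=10000,
                      valid_sets=[lgb_valid], init_model=init_model,
                      callbacks=[lgb.early_stopping(stopping_rounds=100),
                                 lgb.log_evaluation(period=100)])
    return model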


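def train_cv(X, y):
    # Note: X_test and y_test are module-level globals (the 2019 data),
    # defined in a later cell before this function is first called.
    n_splits = 5
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)

    scores_tmp = []
    for train_index, valid_index in kf.split(X, y):
        X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
        y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

        lgb_train = lgb.Dataset(X_train, y_train)
        lgb_valid = lgb.Dataset(X_valid, y_valid)

        model = train_model(lgb_train, lgb_valid)
        y_pred = model.predict(X_test, num_iteration=model.best_iteration)
        # RMSE on the 2019 data. np.sqrt is used instead of squared=False,
        # which is deprecated in recent scikit-learn releases.
        scores_tmp.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    # Mean and standard deviation of the test RMSE over the fold models.
    return np.mean(scores_tmp), np.std(scores_tmp)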


# def plot_pred(y_test, y_pred):
#     plt.figure(figsize=(5,5))
#     plt.plot(y_test, y_pred, '.')
#     plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '-')
    
#     print(mean_squared_error(y_test, y_pred, squared=False))

## Loading the data

In [4]:
df_2017 = preprocess(read_data(2017))
df_2018 = preprocess(read_data(2018))
df_2019 = preprocess(read_data(2019))

In [5]:
X_test = df_2019.drop('usage', axis=1)
y_test = df_2019['usage'].astype('float32')

In [6]:
scores = {}

## Pattern 1

Train on the 2017 data and predict the 2019 data.

In [7]:
X = df_2017.drop('usage', axis=1)
y = df_2017['usage'].astype('float32')

In [8]:
scores['pattern1'] = train_cv(X, y)

Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 308.139
[200]	valid_0's rmse: 207.618
[300]	valid_0's rmse: 175.461
[400]	valid_0's rmse: 163.172
[500]	valid_0's rmse: 157.354
[600]	valid_0's rmse: 154.202
[700]	valid_0's rmse: 152.705
[800]	valid_0's rmse: 152.109
[900]	valid_0's rmse: 151.428
[1000]	valid_0's rmse: 150.978
[1100]	valid_0's rmse: 150.495
[1200]	valid_0's rmse: 150.228
[1300]	valid_0's rmse: 149.623
[1400]	valid_0's rmse: 149.291
[1500]	valid_0's rmse: 149.011
[1600]	valid_0's rmse: 148.84
[1700]	valid_0's rmse: 148.703
[1800]	valid_0's rmse: 148.511
[1900]	valid_0's rmse: 148.38
[2000]	valid_0's rmse: 148.214
[2100]	valid_0's rmse: 148.147
[2200]	valid_0's rmse: 148.107
[2300]	valid_0's rmse: 148.026
[2400]	valid_0's rmse: 148.028
Early stopping, best iteration is:
[2343]	valid_0's rmse: 147.994
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 308.044
[200]	valid_0's rmse: 208.728
[300]	valid_0's r

## Pattern 2

Train on the 2018 data and predict the 2019 data.

In [9]:
X = df_2018.drop('usage', axis=1)
y = df_2018['usage'].astype('float32')

In [10]:
scores['pattern2'] = train_cv(X, y)

Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 338.023
[200]	valid_0's rmse: 232.876
[300]	valid_0's rmse: 200.769
[400]	valid_0's rmse: 188.154
[500]	valid_0's rmse: 180.745
[600]	valid_0's rmse: 177.393
[700]	valid_0's rmse: 175.052
[800]	valid_0's rmse: 173.527
[900]	valid_0's rmse: 172.495
[1000]	valid_0's rmse: 171.974
[1100]	valid_0's rmse: 171.086
[1200]	valid_0's rmse: 170.417
[1300]	valid_0's rmse: 170.088
[1400]	valid_0's rmse: 169.832
[1500]	valid_0's rmse: 169.724
Early stopping, best iteration is:
[1470]	valid_0's rmse: 169.677
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 341.741
[200]	valid_0's rmse: 230.265
[300]	valid_0's rmse: 193.803
[400]	valid_0's rmse: 179.691
[500]	valid_0's rmse: 172.698
[600]	valid_0's rmse: 168.865
[700]	valid_0's rmse: 166.552
[800]	valid_0's rmse: 165.291
[900]	valid_0's rmse: 164.51
[1000]	valid_0's rmse: 163.717
[1100]	valid_0's rmse: 163.373
[1200]	valid_0's rmse: 

## Pattern 3

Train on the 2017 and 2018 data together and predict the 2019 data.

In [11]:
X = pd.concat([df_2017, df_2018], sort=False).drop('usage', axis=1)
y = pd.concat([df_2017, df_2018], sort=False)['usage'].astype('float32')

In [12]:
scores['pattern3'] = train_cv(X, y)

Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 329.215
[200]	valid_0's rmse: 225.715
[300]	valid_0's rmse: 192.11
[400]	valid_0's rmse: 179.773
[500]	valid_0's rmse: 173.602
[600]	valid_0's rmse: 170.316
[700]	valid_0's rmse: 168.552
[800]	valid_0's rmse: 167.664
[900]	valid_0's rmse: 167.044
[1000]	valid_0's rmse: 166.525
[1100]	valid_0's rmse: 166.097
[1200]	valid_0's rmse: 165.854
[1300]	valid_0's rmse: 165.521
[1400]	valid_0's rmse: 165.371
[1500]	valid_0's rmse: 165.077
[1600]	valid_0's rmse: 164.73
[1700]	valid_0's rmse: 164.386
[1800]	valid_0's rmse: 164.064
[1900]	valid_0's rmse: 163.813
[2000]	valid_0's rmse: 163.589
[2100]	valid_0's rmse: 163.437
[2200]	valid_0's rmse: 163.184
[2300]	valid_0's rmse: 163.062
Early stopping, best iteration is:
[2289]	valid_0's rmse: 163.054
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 328.708
[200]	valid_0's rmse: 224.475
[300]	valid_0's rmse: 190.165
[400]	valid_0's rm

## Pattern 4

Train on the 2017 data, continue training on the 2018 data, and predict the 2019 data.

In [13]:
X = df_2017.drop('usage', axis=1)
y = df_2017['usage'].astype('float32')

X2= df_2018.drop('usage', axis=1)
y2 = df_2018['usage'].astype('float32')

In [14]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)

scores_tmp = []
for train_index, valid_index in kf.split(X, y):
    # Stage 1: train on a 2017 fold.
    X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_valid = lgb.Dataset(X_valid, y_valid)

    model = train_model(lgb_train, lgb_valid)

    # Stage 2: continue training on 2018 data, starting from the 2017 model.
    # The 2017 fold indices are reused positionally, which assumes that
    # X2 has the same number of rows as X.
    X2_train, X2_valid = X2.iloc[train_index], X2.iloc[valid_index]
    y2_train, y2_valid = y2.iloc[train_index], y2.iloc[valid_index]

    lgb_train2 = lgb.Dataset(X2_train, y2_train)
    lgb_valid2 = lgb.Dataset(X2_valid, y2_valid)

    model2 = train_model(lgb_train2, lgb_valid2, init_model=model)

    y_pred = model2.predict(X_test, num_iteration=model2.best_iteration)
    # RMSE on the 2019 data (np.sqrt instead of the deprecated squared=False).
    scores_tmp.append(np.sqrt(mean_squared_error(y_test, y_pred)))

Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 308.139
[200]	valid_0's rmse: 207.618
[300]	valid_0's rmse: 175.461
[400]	valid_0's rmse: 163.172
[500]	valid_0's rmse: 157.354
[600]	valid_0's rmse: 154.202
[700]	valid_0's rmse: 152.705
[800]	valid_0's rmse: 152.109
[900]	valid_0's rmse: 151.428
[1000]	valid_0's rmse: 150.978
[1100]	valid_0's rmse: 150.495
[1200]	valid_0's rmse: 150.228
[1300]	valid_0's rmse: 149.623
[1400]	valid_0's rmse: 149.291
[1500]	valid_0's rmse: 149.011
[1600]	valid_0's rmse: 148.84
[1700]	valid_0's rmse: 148.703
[1800]	valid_0's rmse: 148.511
[1900]	valid_0's rmse: 148.38
[2000]	valid_0's rmse: 148.214
[2100]	valid_0's rmse: 148.147
[2200]	valid_0's rmse: 148.107
[2300]	valid_0's rmse: 148.026
[2400]	valid_0's rmse: 148.028
Early stopping, best iteration is:
[2343]	valid_0's rmse: 147.994
Training until validation scores don't improve for 100 rounds
[2400]	valid_0's rmse: 187.549
[2500]	valid_0's rmse: 178.802
[2600]	valid_0'

In [15]:
scores['pattern4'] = np.mean(scores_tmp), np.std(scores_tmp)

## Pattern 5

Train on the 2017 data, then, keeping the tree structures unchanged, adjust the leaf weights with the 2018 data and predict the 2019 data.
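
As a rough illustration of what adjusting the leaf weights means here: the LightGBM documentation for `Booster.refit` describes the update as a blend of the old and the newly estimated leaf output, controlled by `decay_rate`. The sketch below only reproduces that arithmetic for a single leaf; the function name and the leaf values are hypothetical and not taken from the trained model.

```python
def blend_leaf_output(old_leaf_output, new_leaf_output, decay_rate=0.9):
    # Blend rule documented for Booster.refit:
    # leaf_output = decay_rate * old_leaf_output + (1 - decay_rate) * new_leaf_output
    return decay_rate * old_leaf_output + (1.0 - decay_rate) * new_leaf_output


old_value = 10.0  # hypothetical leaf output learned from the 2017 data
new_value = 20.0  # hypothetical leaf output re-estimated from the 2018 data

for rate in (0.9, 0.5, 0.0):
    blended = blend_leaf_output(old_value, new_value, rate)
    print(f"decay_rate={rate}: blended leaf output = {blended:.2f}")
```

With `decay_rate=0.9` the refitted leaf stays close to the old value, so the 2018 data shifts the model only slightly; lower values give the new data more weight.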

In [16]:
X = df_2017.drop('usage', axis=1)
y = df_2017['usage'].astype('float32')

X2= df_2018.drop('usage', axis=1)
y2 = df_2018['usage'].astype('float32')

In [17]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)

scores_tmp = []
for train_index, valid_index in kf.split(X, y):
    X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_valid = lgb.Dataset(X_valid, y_valid)

    model = train_model(lgb_train, lgb_valid)
    # Keep the tree structures and re-estimate the leaf outputs on all of the
    # 2018 data. decay_rate=0.9 keeps 90% of the old leaf output and takes
    # 10% from the newly estimated one.
    model2 = model.refit(X2, y2, decay_rate=0.9)

    y_pred = model2.predict(X_test, num_iteration=model2.best_iteration)
    # RMSE on the 2019 data (np.sqrt instead of the deprecated squared=False).
    scores_tmp.append(np.sqrt(mean_squared_error(y_test, y_pred)))

Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 308.139
[200]	valid_0's rmse: 207.618
[300]	valid_0's rmse: 175.461
[400]	valid_0's rmse: 163.172
[500]	valid_0's rmse: 157.354
[600]	valid_0's rmse: 154.202
[700]	valid_0's rmse: 152.705
[800]	valid_0's rmse: 152.109
[900]	valid_0's rmse: 151.428
[1000]	valid_0's rmse: 150.978
[1100]	valid_0's rmse: 150.495
[1200]	valid_0's rmse: 150.228
[1300]	valid_0's rmse: 149.623
[1400]	valid_0's rmse: 149.291
[1500]	valid_0's rmse: 149.011
[1600]	valid_0's rmse: 148.84
[1700]	valid_0's rmse: 148.703
[1800]	valid_0's rmse: 148.511
[1900]	valid_0's rmse: 148.38
[2000]	valid_0's rmse: 148.214
[2100]	valid_0's rmse: 148.147
[2200]	valid_0's rmse: 148.107
[2300]	valid_0's rmse: 148.026
[2400]	valid_0's rmse: 148.028
Early stopping, best iteration is:
[2343]	valid_0's rmse: 147.994
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 308.044
[200]	valid_0's rmse: 208.728
[300]	valid_0's r

In [18]:
scores['pattern5'] = np.mean(scores_tmp), np.std(scores_tmp)

## Score summary

In [19]:
print('{:.2f}'.format(scores['pattern1'][0]), '{:.2f}'.format(scores['pattern1'][1]), '--', 'Train on the 2017 data and predict the 2019 data.')
print('{:.2f}'.format(scores['pattern2'][0]), '{:.2f}'.format(scores['pattern2'][1]), '--', 'Train on the 2018 data and predict the 2019 data.')
print('{:.2f}'.format(scores['pattern3'][0]), '{:.2f}'.format(scores['pattern3'][1]), '--', 'Train on the 2017 and 2018 data together and predict the 2019 data.')
print('{:.2f}'.format(scores['pattern4'][0]), '{:.2f}'.format(scores['pattern4'][1]), '--', 'Train on the 2017 data, continue training on the 2018 data, and predict the 2019 data.')
print('{:.2f}'.format(scores['pattern5'][0]), '{:.2f}'.format(scores['pattern5'][1]), '--', 'Train on the 2017 data, refit the leaf weights on the 2018 data without changing the tree structures, and predict the 2019 data.')

209.78 0.83 -- Train on the 2017 data and predict the 2019 data.
215.33 2.54 -- Train on the 2018 data and predict the 2019 data.
205.51 0.91 -- Train on the 2017 and 2018 data together and predict the 2019 data.
217.77 1.73 -- Train on the 2017 data, continue training on the 2018 data, and predict the 2019 data.
211.20 0.81 -- Train on the 2017 data, refit the leaf weights on the 2018 data without changing the tree structures, and predict the 2019 data.
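
To make the comparison easier to read, the scores can be ranked and expressed relative to the combined retrain (pattern 3). The sketch below only does arithmetic on the numbers printed above; the `reported` dictionary and the choice of pattern 3 as the reference are assumptions made for illustration.

```python
# (mean RMSE, std) on the 2019 data, copied from the score list above.
reported = {
    "pattern1": (209.78, 0.83),
    "pattern2": (215.33, 2.54),
    "pattern3": (205.51, 0.91),
    "pattern4": (217.77, 1.73),
    "pattern5": (211.20, 0.81),
}

# Use the full retrain on 2017 + 2018 as the reference point.
reference = reported["pattern3"][0]

# Sort by mean RMSE, lowest (best) first.
ranking = sorted(reported, key=lambda name: reported[name][0])

for name in ranking:
    mean, std = reported[name]
    diff_pct = 100.0 * (mean - reference) / reference
    print(f"{name}: {mean:.2f} +/- {std:.2f} ({diff_pct:+.2f}% vs pattern3)")
```

In this run the full retrain on both years has the lowest mean RMSE, refit (pattern 5) lands close to the 2017-only model, and continued training (pattern 4) ranks last.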
