## 5.2. Hyperparameters
The LightGBM algorithm has a huge number of [hyperparameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html), devided into many types. Some of them have aliases, which makes LightGBM compatible with other libraries.

*Boosting parameters*. The configurations for boosting.
- <code style='font-size:13px; color:#BA2121'>boosting</code>: the ensemble method, defaults to *gdbt* (traditional GBM). Other options are *rf* (Random Forest), *goss* (Gradient-based One-Side Sampling) and *dart* (Dropouts meets Multiple Additive Regression Trees).
- <code style='font-size:13px; color:#BA2121'>n_estimators</code>: the number of boosting stages ($T$), defaults to *100*. Larger is usually better, but should go with a lower *learning_rate* and an *early_stopping_round* provided. Lower can speed up training.
- <code style='font-size:13px; color:#BA2121'>learning_rate</code>: the learning rate ($\eta$), defaults to *0.1*. Same usage as in GBM.
- <code style='font-size:13px; color:#BA2121'>early_stopping_round</code>: the maximum number of iterations without improvements, defaults to *0* (disabled). Keeping a low enough value may make boosting stop earlier, thus reduces the overall training time. Usually be set around $10\%$ of *n_estimators*.
- <code style='font-size:13px; color:#BA2121'>tree_learner</code>: the parallelization strategy, defaults to *serial* (single machine, no parallel). Other options are *feature*, *data* and *voting*.

*Bagging parameters*. LightGBM even includes more bagging parameters compared to Scikit-learn.
- <code style='font-size:13px; color:#BA2121'>bagging_fraction</code>: the ratio of data (instances) used in each tree, defaults to *1*. A lower value will increase the randomness between trees that deals with overfitting and may speed up training.
- <code style='font-size:13px; color:#BA2121'>bagging_freq</code>: the iteration frequency to perform bagging, defaults to *0* (disable bagging). A positive value will decrease the randomness between trees.
- <code style='font-size:13px; color:#BA2121'>pos_bagging_fraction</code> and <code style='font-size:13px; color:#BA2121'>neg_bagging_fraction</code>: the ratio of positive/negative samples used in each tree, both default to *1*. This pair of parameters should be used together to handle imbalance binary classification problems.
- <code style='font-size:13px; color:#BA2121'>feature_fraction</code>: the ratio of features used in each tree, defaults to *1*. A lower value will increase the randomness between trees that deals with overfitting and may speed up training.
- <code style='font-size:13px; color:#BA2121'>feature_fraction_bynode</code>: same as *feature_fraction*, but the sampling is done on tree nodes instead of trees; also defaults to *1*. A lower value can reduce overfitting but cannot speed up training.
- <code style='font-size:13px; color:#BA2121'>extra_trees</code>: whether to use Extremely Randomized Trees or not, defaults to *False*. This parameter increases the randomness, thus can deal with overfitting.

*Tree learning parameters*. Most parameters in this group are for prunning trees in order to deal with overfitting. Due to the fact that pruned trees are shallower, the training is also faster.
- <code style='font-size:13px; color:#BA2121'>num_leaves</code>: the maximum number of leaves in each tree, defaults to *31*. Since LightGBM grows trees leaf-wise, this is the main hyperparameter to control the complexity of trees. Optimal values range from $50\%$ to $100\%$ of $2^{\text{max_depth}}$.
- <code style='font-size:13px; color:#BA2121'>min_samples_leaf</code>: the minimum number of data a leaf must have, defaults to *20*. The next important parameter to prevent overfitting, its optimal value depends on training data size and *num_leaves*. Practical values for large datasets range from hundreds to thousands.
- <code style='font-size:13px; color:#BA2121'>max_depth</code>: the maximum depth of each tree, defaults to *-1* (no depth limitation). Another important parameter that controls overfitting; however, it is less effective on leaf-wise compared to level-wise implementations. A value from 3 to 13 works well for most datasets.
- <code style='font-size:13px; color:#BA2121'>min_sum_hessian_in_leaf</code>: the minimum sum of Hessian (the second derivative of the objective function for each observation) of each leaf, defaults to *0.001*. When the loss function is MSE, its second derivative is $1$; the sum of Hessian in this case equals to the number of data. For other loss function, this parameter has different meanings and also has different optimal values. Thus, unless you know what you are doing, this parameter should be left alone.
- <code style='font-size:13px; color:#BA2121'>min_gain_to_split</code>: the minimum information gain required to perform a split, defaults to *0*. In practice, very small improvements in the training loss have no meaningful impact on the generalization error of the model. A small value of this parameter is enough if used.
- <code style='font-size:13px; color:#BA2121'>reg_alpha</code> and <code style='font-size:13px; color:#BA2121'>reg_lambda</code>: the L1 and L2 regularization terms, both default to *0*. Optimal values are $10^k$ where $k$ is around $0$.
- <code style='font-size:13px; color:#BA2121'>linear_tree</code>: whether to use Piece-Wise Linear Regression Trees, defaults to *False*.
- <code style='font-size:13px; color:#BA2121'>linear_lambda</code>: the coefficient for linear tree regularization, defaults to *0*.

*Categorical split finding parameters*.
- <code style='font-size:13px; color:#BA2121'>categorical_feature</code>: specify categorical features, defaults to *auto*.
- <code style='font-size:13px; color:#BA2121'>min_data_per_group</code>: the minimum number of data per categorical group, defaults to *100*.
- <code style='font-size:13px; color:#BA2121'>cat_smooth</code>: the coefficient for categorical smoothing, defaults to *10*. Can reduce the effect of noises in categorical features, especially for ones with few data.
- <code style='font-size:13px; color:#BA2121'>max_cat_threshold</code>: the maximum number of splits considered for categorical features, defaults to *32*. Higher means more split points and larger search space. Lower reduces training time.
- <code style='font-size:13px; color:#BA2121'>cat_l2</code>: L2 regularization in categorical split, defaults to *10*.
- <code style='font-size:13px; color:#BA2121'>max_cat_to_onehot</code>: maximum number of categories of a feature to use one-hot encoding, otherwise the Fisher's split finding will be used, defaults to *4*.

*Histogram building parameters*.
- <code style='font-size:13px; color:#BA2121'>max_bin</code> and
<code style='font-size:13px; color:#BA2121'>max_bin_by_feature</code>: the maximum number of bins when building histograms, defaults to *255*. The later parameter takes a list of intergers to specify the max number of bins for each feature. Smaller reduces training time but may hurt the accuracy.
- <code style='font-size:13px; color:#BA2121'>min_data_in_bin</code>: the minimum bin size, defaults to *3*. This parameter prevents bins from having a small number of data, as using their boundaries as splits isnâ€™t likely to change the final model very much. Higher value reduces training time.
- <code style='font-size:13px; color:#BA2121'>bin_construct_sample_cnt</code>: the number of observations being sampled to determine bins, defaults to *200,000*. LightGBM only uses a part of data to find histogram boundaries, thus this parameter should not be set to a lower value. A higher value obviously improves prediction power but also leads to a longer data loading time.

*DART's parameters*
- <code style='font-size:13px; color:#BA2121'>drop_rate</code>: the fraction of previous trees to drop during the dropout, defaults to *0.1*.
- <code style='font-size:13px; color:#BA2121'>max_drop</code>: the max number of dropped trees during one boosting iteration, defaults to *50*.
- <code style='font-size:13px; color:#BA2121'>skip_drop</code>: the probability of skipping the dropout procedure during a boosting iteration, defaults to *0.5*.
- <code style='font-size:13px; color:#BA2121'>uniform_drop</code>: whether to use uniform drop or not, defaults to *False*.
- <code style='font-size:13px; color:#BA2121'>xgboost_dart_mode</code>: whether to enable XGBoost DART mode, which uses a bit different shrinkage rate, defaults to *False*.

*GOSS's parameters*
- <code style='font-size:13px; color:#BA2121'>top_rate</code>: the sampling ratio of large gradient data, defaults to *0.2*.
- <code style='font-size:13px; color:#BA2121'>other_rate</code>: the sampling ratio of small gradient data, defaults to *0.1*.

*Preprocessing parameteres*
- <code style='font-size:13px; color:#BA2121'>enable_bundle</code>: whether to use the EFB algorithm or not, defaults to *True*.
- <code style='font-size:13px; color:#BA2121'>enable_sparse</code>: whether to use sparse optimization, defaults to *True*.
- <code style='font-size:13px; color:#BA2121'>use_missing</code>: whether to use special handle of missing values, defaults to *True*.
- <code style='font-size:13px; color:#BA2121'>feature_pre_filter</code>: whether to ignore unsplittable features based on *min_samples_leaf*, defaults to *True*. As *min_samples_leaf* was set, some features will perform a split results in a leaf not having enough minimum number of data. Such features will be filtered out once before training. Also remember to tune *min_samples_leaf* before this parameter.

In [3]:
from bokeh.layouts import row
from bokeh.plotting import figure, output_file, save
from bokeh.io import show, output_notebook
output_notebook()

In [5]:
output_file(filename="output/custom_filename.html", title="Static HTML")

# prepare some data
x = list(range(11))
y0 = x
y1 = [10 - i for i in x]
y2 = [abs(i - 5) for i in x]

# create three plots with one renderer each
s1 = figure(width=250, height=250, background_fill_color="#fafafa")
s1.circle(x, y0, size=12, color="#53777a", alpha=0.8)

s2 = figure(width=250, height=250, background_fill_color="#fafafa")
s2.triangle(x, y1, size=12, color="#c02942", alpha=0.8)

s3 = figure(width=250, height=250, background_fill_color="#fafafa")
s3.square(x, y2, size=12, color="#d95b43", alpha=0.8)

# put the results in a row and show

save(row(s1, s2, s3))

from IPython.display import HTML
HTML(filename="output/custom_filename.html")

In [None]:
# udf.py
# utility
from toolz import pipe, thread_first
from contextlib import suppress

# data manipulation
import numpy as np
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource, Band
from bokeh.plotting import figure
from bokeh.io import show

# config
plt.style.use(['seaborn', 'seaborn-whitegrid'])
output_notebook()
import warnings; warnings.filterwarnings('ignore')
    
def plot_forecasting(x, yTrue, yPred, ci90=None, ci95=None, ci99=None, history=True):
    
    cut = yPred.shape[0]
    df = x.append(yTrue).to_frame()
    df[['forecast', 'lower90', 'upper90', 'lower95', 'upper95', 'lower99', 'upper99']] = np.nan
    df.iloc[-cut:, 1] = yPred
    with suppress(AttributeError):
        df.iloc[-cut:, 2:4] = ci90.clip(min=0)
        df.iloc[-cut:, 4:6] = ci95.clip(min=0)
        df.iloc[-cut:, 6:8] = ci99.clip(min=0)
    df = df.reset_index()
    df.columns = ['date', 'ground_truth', 'forecast', 'lower90', 'upper90', 'lower95', 'upper95', 'lower99', 'upper99']
    df = df if history else df.iloc[-cut:, :]

    source = ColumnDataSource(df)

    fig = figure(plot_width=1000, plot_height=300, x_axis_type='datetime')
    fig.line(source=source, x='date', y='ground_truth', color='grey', legend_label='GroundTruth')
    fig.line(source=source, x='date', y='forecast', color='red', legend_label='Forecast')
    fig.circle(source=source, x='date', y='ground_truth', color='grey')
    fig.circle(source=source, x='date', y='forecast', color='red')
    
    fig.varea(source=source, x='date', y1='lower90', y2='upper90', fill_alpha=0.15, fill_color='grey', legend_label='CI 90%')
    fig.varea(source=source, x='date', y1='lower95', y2='upper95', fill_alpha=0.10, fill_color='grey', legend_label='CI 95%')
    fig.varea(source=source, x='date', y1='lower99', y2='upper99', fill_alpha=0.05, fill_color='grey', legend_label='CI 99%')
    
    fig.legend.location = 'top_left'
    fig.yaxis.formatter = bokeh.models.NumeralTickFormatter(format='0a')

    show(fig)

In [55]:
# manipulation
import numpy as np
import pandas as pd
from statsmodels import tsa
from toolz import pipe, thread_first

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from udf import plot_forecasting

# hypothesis testing
from statsmodels.tsa.stattools import adfuller

# algorithms
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.api import VAR, SVAR, VARMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing as HW
from statsmodels.tsa.exponential_smoothing.ets import ETSModel as ETS
from pmdarima.arima import auto_arima as AutoARIMA
# from prophet import Prophet # cannot install

# evaluation
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import mean_squared_error as MSE, mean_absolute_error as MAE, r2_score as R2

# configurations
plt.style.use(['seaborn', 'seaborn-whitegrid'])
%config InlineBackend.figure_format = 'retina'
output_notebook()
import warnings; warnings.filterwarnings('ignore')

In [2]:
dfWithdraw = pd.read_excel('032_withdraw_daily.xlsx')
dfWithdraw = dfWithdraw[['day', 'withdraw']]
dfWithdraw = dfWithdraw.set_index('day').asfreq('d').reset_index()
dfWithdraw['withdraw'] = dfWithdraw['withdraw'].fillna(1e-6) # add a smoothing coefficient
dfWithdraw.head()

Unnamed: 0,day,withdraw
0,2021-09-18,216625500.0
1,2021-09-19,1040161000.0
2,2021-09-20,9851213000.0
3,2021-09-21,9933537000.0
4,2021-09-22,15892200000.0


# 2. Modeling

In [3]:
df = dfWithdraw.copy()

# configurations
nTrain, nTest = 24, 24

# cut points: start --- cutTrain --- cutTest --- end
n = df.shape[0]
cutTest = n - nTest
cutTrain = cutTest - nTrain

# train-test split, no validation
df = dfWithdraw.copy()
s = df.set_index('day').withdraw
xTrain, yTrain = s[0:cutTrain], s[cutTrain:cutTest]
xTest , yTest  = s[0:cutTest ], s[cutTest:]

## ARIMA

In [21]:
p, d, q = 7, 1, 2

modelTrain = ARIMA(xTrain, order=(p,d,q)).fit()
yTrainPred, _, ciTrain99 = modelTrain.forecast(nTrain, alpha=0.01)
yTrainPred, _, ciTrain95 = modelTrain.forecast(nTrain, alpha=0.05)
yTrainPred, _, ciTrain90 = modelTrain.forecast(nTrain, alpha=0.1)

rmseTrain = MSE(yTrain, yTrainPred, squared=False) / 1e9
maeTrain = MAE(yTrain, yTrainPred) / 1e9
r2Train = R2(yTrain, yTrainPred)

print(f'ARIMA | Train | RMSE={rmseTrain:.2f}b | MAE={maeTrain:.2f}b | R2={r2Train:.4f}\n')

plot_forecasting(xTrain, yTrain, yTrainPred, ciTrain90, ciTrain95, ciTrain99, history=False)

ARIMA | Train | RMSE=145.99b | MAE=114.74b | R2=0.5011



In [22]:
modelTest = ARIMA(xTest, order=(p,d,q)).fit()
yTestPred, _, ciTest99 = modelTest.forecast(nTest, alpha=0.01)
yTestPred, _, ciTest95 = modelTest.forecast(nTest, alpha=0.05)
yTestPred, _, ciTest90 = modelTest.forecast(nTest, alpha=0.1)

rmseTest = MSE(yTest, yTestPred, squared=False) / 1e9
maeTest = MAE(yTest, yTestPred) / 1e9
r2Test = R2(yTest, yTestPred)

print(f'ARIMA | Test | RMSE={rmseTest:.2f}b | MAE={maeTest:.2f}b | R2={r2Test:.4f}\n')

plot_forecasting(xTest, yTest, yTestPred, ciTest90, ciTest95, ciTest99, history=False)

ARIMA | Test | RMSE=182.47b | MAE=137.86b | R2=0.3414



## Auto ARIMA

In [24]:
modelAutoArima = AutoARIMA(
    xTrain, test='adf', m=7,
    start_p=0, max_p=7,
    start_q=0, max_q=7,
    d=1, max_d=2,
    start_P=0, max_P=5,
    start_Q=0, max_Q=5,
    D=0, max_D=2
)

In [25]:
modelAutoArima.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,196.0
Model:,"SARIMAX(1, 1, 2)x(1, 0, [1], 7)",Log Likelihood,-5211.604
Date:,"Wed, 25 May 2022",AIC,10435.209
Time:,10:21:09,BIC,10454.847
Sample:,0,HQIC,10443.16
,- 196,,
Covariance Type:,opg,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
ar.L1,0.7415,0.345,2.150,0.032,0.066,1.417
ma.L1,-1.3971,0.461,-3.031,0.002,-2.301,-0.494
ma.L2,0.4055,0.440,0.921,0.357,-0.458,1.269
ar.S.L7,0.9563,0.050,19.115,0.000,0.858,1.054
ma.S.L7,-0.5832,0.175,-3.332,0.001,-0.926,-0.240
sigma2,1.846e+22,4.84e-24,3.81e+45,0.000,1.85e+22,1.85e+22

0,1,2,3
Ljung-Box (L1) (Q):,0.23,Jarque-Bera (JB):,64.21
Prob(Q):,0.63,Prob(JB):,0.0
Heteroskedasticity (H):,2.91,Skew:,0.02
Prob(H) (two-sided):,0.0,Kurtosis:,5.81


In [26]:
modelAutoArima.get_params()

{'maxiter': 50,
 'method': 'lbfgs',
 'order': (1, 1, 2),
 'out_of_sample_size': 0,
 'scoring': 'mse',
 'scoring_args': {},
 'seasonal_order': (1, 0, 1, 7),
 'start_params': None,
 'trend': None,
 'with_intercept': False}

In [28]:
yTrainPred, ciTrain90 = modelAutoArima.predict(24, alpha=0.10, return_conf_int=True)
yTrainPred, ciTrain95 = modelAutoArima.predict(24, alpha=0.05, return_conf_int=True)
yTrainPred, ciTrain99 = modelAutoArima.predict(24, alpha=0.01, return_conf_int=True)

rmseTrain = MSE(yTrain, yTrainPred, squared=False) / 1e9
maeTrain = MAE(yTrain, yTrainPred) / 1e9
r2Train = R2(yTrain, yTrainPred)

print(f'AutoARIMA | Train | RMSE={rmseTrain:.2f}b | MAE={maeTrain:.2f}b | R2={r2Train:.4f}\n')

plot_forecasting(xTrain, yTrain, yTrainPred, ciTrain90, ciTrain95, ciTrain99, False)

AutoARIMA | Train | RMSE=134.80b | MAE=93.20b | R2=0.5746



In [29]:
yTestPred, ciTest90 = modelAutoArima.fit_predict(xTest, n_periods=24, alpha=0.10, return_conf_int=True)
yTestPred, ciTest95 = modelAutoArima.fit_predict(xTest, n_periods=24, alpha=0.05, return_conf_int=True)
yTestPred, ciTest99 = modelAutoArima.fit_predict(xTest, n_periods=24, alpha=0.01, return_conf_int=True)

mseTest = MSE(yTest, yTestPred, squared=False) / 1e9
maeTest = MAE(yTest, yTestPred) / 1e9
r2Test = R2(yTest, yTestPred)

print(f'AutoARIMA | Test | RMSE={rmseTest:.2f}b | MAE={maeTest:.2f}b | R2={r2Test:.4f}\n')

plot_forecasting(xTest, yTest, yTestPred, ciTest90, ciTest95, ciTest99, False)

AutoARIMA | Test | RMSE=182.47b | MAE=117.75b | R2=0.4630



## Exponential Smoothing

In [39]:
paramsGrid = ParameterGrid({
    'error': ['add', 'mul'],
    'trend': ['add', 'mul'],
    'seasonal': ['add', 'mul'],
    'damped_trend': [True, False],
    'seasonal_periods': [7],
})

listRmse = []
for params in paramsGrid:
    model = ETS(xTrain, **params).fit()
    yTrainPred = model.get_prediction(cutTrain, cutTest-1).summary_frame()['mean']
    rmseTrain = MSE(yTrain, yTrainPred, squared=False) / 1e9
    listRmse.append(rmseTrain)

idxBest = pipe(listRmse, np.array, np.argmax)
paramsBest = paramsGrid[idxBest]

In [53]:
print(paramsBest)
print(min(listRmse))

{'trend': 'mul', 'seasonal_periods': 7, 'seasonal': 'mul', 'error': 'add', 'damped_trend': False}
124.16323359532464


In [58]:
modelEts = ETS(xTrain, **paramsBest).fit()
predictor95 = modelEts.get_prediction(cutTrain, cutTest-1).summary_frame()
yTrainPred = predictor95.iloc[:, 0]

ciTrain95 = predictor95.iloc[:, 2:].values

plot_forecasting(xTrain, yTrain, yTrainPred, ciTrain95, history=False)

In [56]:
modelEts = ETS(xTest, **paramsBest).fit()
predictor95 = modelEts.get_prediction(cutTest, n-1).summary_frame()
yTestPred = predictor95.iloc[:, 0]

ciTest95 = predictor95.iloc[:, 2:].values

plot_forecasting(xTest, yTest, yTestPred, ciTest95, history=False)

In [240]:
modelEts = ETS(xTrain, error='add', trend='add', damped_trend=True, seasonal='add', seasonal_periods=7).fit()
predictionEts = modelEts.get_prediction(cutTrain, cutTest-1, alpha=0.05)
predictionEts.summary_frame().head()

Unnamed: 0,mean,pi_lower,pi_upper
2022-04-02,101791700000.0,-46463480000.0,250046900000.0
2022-04-03,67211410000.0,-86319070000.0,220741900000.0
2022-04-04,588185200000.0,429553800000.0,746816600000.0
2022-04-05,406180100000.0,242605800000.0,569754300000.0
2022-04-06,365977700000.0,197604800000.0,534350700000.0


In [244]:
modelEts = ETS(xTrain, trend='add', damped_trend=True, seasonal='add', seasonal_periods=7).fit()
predictionEts = modelEts.get_prediction(cutTrain, cutTest-1, alpha=0.1)
predictionEts.summary_frame().head()

Unnamed: 0,mean,pi_lower,pi_upper
2022-04-02,101791700000.0,-46463480000.0,250046900000.0
2022-04-03,67211410000.0,-86319070000.0,220741900000.0
2022-04-04,588185200000.0,429553800000.0,746816600000.0
2022-04-05,406180100000.0,242605800000.0,569754300000.0
2022-04-06,365977700000.0,197604800000.0,534350700000.0
