### XGBoost for Timeseries

#### Boosting

Ensemble models are a standard tool for predictive modeling, and boosting is one technique for building them.

Boosting fits a sequence of models, training each new model to reduce the errors made by the models before it.

There are several variants of this idea. One of them is gradient boosting, which fits each new model to the negative gradient of a loss function; for squared error, that gradient is simply the residual.
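
To make the idea of fitting successive models to earlier errors concrete, here is a minimal sketch of boosting for squared error in plain Python. It is an illustration only, not XGBoost's implementation: the helper names (`fit_stump`, `boost`, `mse`), the toy data, and the learning rate of 0.3 are all assumptions chosen for the example. Each round fits a one-split tree (a "stump") to the current residuals and adds a damped copy of it to the ensemble.

```python
# Toy boosting for squared error with one-split "stumps" as weak learners.
# All names here are hypothetical; this is not how XGBoost works internally.


def fit_stump(xs, residuals):
    """Return a one-split predictor that minimizes squared error on residuals."""
    best = None
    # Try every threshold that leaves at least one point on each side.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Made-up data with three rough plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

one_round = boost(xs, ys, n_rounds=1)
many_rounds = boost(xs, ys, n_rounds=50)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

On this toy data the training error after 50 rounds is lower than after one round, which is the behavior boosting is designed to produce.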

#### XGBoost

https://xgboost.readthedocs.io/

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.

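XGBoost builds an ensemble of decision trees in which each new tree corrects the errors of the trees already in the model. Trees are added until a fixed number is reached (`n_estimators`, set to 1000 below) or, when early stopping is configured, until the validation error stops improving.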

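Requirements for using XGBoost on time series:
- Reframe the series as a supervised learning problem, so that each row pairs a past observation with the value to predict.
- Evaluate the model with walk-forward validation instead of k-fold cross-validation. K-fold would train on observations that come after the ones being predicted, so its results would be optimistically biased.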
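
The difference between walk-forward and k-fold validation is easiest to see in the indices each scheme trains on. The sketch below uses a hypothetical helper (`walk_forward_splits` is not part of scikit-learn or XGBoost) that yields an expanding training window followed by the single observation to predict, so the model never sees data from the future.

```python
def walk_forward_splits(n_samples, n_test):
    """Yield (train_indices, test_index) pairs with an expanding training window."""
    first_test = n_samples - n_test
    for t in range(first_test, n_samples):
        # Train on everything strictly before t, then predict observation t.
        yield list(range(t)), t


for train_idx, test_idx in walk_forward_splits(n_samples=6, n_test=3):
    print(train_idx, "->", test_idx)
# [0, 1, 2] -> 3
# [0, 1, 2, 3] -> 4
# [0, 1, 2, 3, 4] -> 5
```

A k-fold split of the same six observations would, for example, hold out indices 0 and 1 and train on 2 through 5, which means predicting the past from the future.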



In [None]:
#!pipenv install scikit-learn xgboost --skip-lock

In [1]:
from IPython.core.debugger import set_trace

%load_ext nb_black

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import time

plt.style.use(style="seaborn")
%matplotlib inline

In [2]:
df = pd.read_csv("data/MSFT-1Y-Hourly.csv")

In [3]:
df.head(5)

Unnamed: 0,date,open,high,low,close,volume,average,barCount
0,2019-08-07 14:30:00,133.8,133.83,131.82,132.89,35647,132.701,17523
1,2019-08-07 15:00:00,132.87,135.2,132.64,134.75,48757,134.043,26974
2,2019-08-07 16:00:00,134.74,134.92,133.52,133.88,28977,134.147,17853
3,2019-08-07 17:00:00,133.89,134.06,133.07,133.9,21670,133.618,13497
4,2019-08-07 18:00:00,133.89,135.24,133.83,134.83,22648,134.653,12602


In [4]:
df = df[["close"]].copy()

In [5]:
df.head(5)

Unnamed: 0,close
0,132.89
1,134.75
2,133.88
3,133.9
4,134.83


#### Transform the series into a supervised learning problem

In [6]:
df["target"] = df.close.shift(-1)

In [7]:
df.dropna(inplace=True)

In [8]:
df.head(5)

Unnamed: 0,close,target
0,132.89,134.75
1,134.75,133.88
2,133.88,133.9
3,133.9,134.83
4,134.83,135.48
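
The `shift(-1)` call above pairs each close with the close of the next bar, and `dropna` discards the final row, which has no next value. The same framing can be sketched without pandas and generalized to more than one lagged input. `make_supervised` and its `n_lags` parameter are hypothetical names introduced for this illustration; the notebook itself uses a single lag.

```python
def make_supervised(series, n_lags=1):
    """Turn a list of values into rows of [lag_n, ..., lag_1, target]."""
    rows = []
    for i in range(n_lags, len(series)):
        # The n_lags values before position i are features; series[i] is the target.
        rows.append(series[i - n_lags:i] + [series[i]])
    return rows


# First five closing prices from the data shown above.
closes = [132.89, 134.75, 133.88, 133.9, 134.83]

for row in make_supervised(closes, n_lags=1):
    print(row)
# [132.89, 134.75]
# [134.75, 133.88]
# [133.88, 133.9]
# [133.9, 134.83]
```

With `n_lags=2`, each row would instead hold the two previous closes followed by the target, at the cost of one more unusable row at the start of the series.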


#### Train test split

In [9]:
def train_test_split(data, perc):
    data = data.values
    n = int(len(data) * (1 - perc))
    return data[:n], data[n:]

In [10]:
train, test = train_test_split(df, 0.2)

In [11]:
print(len(df))
print(len(train))
print(len(test))

1752
1401
351


We'll use the XGBRegressor class to make a prediction. XGBRegressor is an implementation of the scikit-learn API for XGBoost regression.

We'll fit a model on the training set, then predict the target for a single test input row.

In [12]:
X = train[:, :-1]
y = train[:, -1]

In [14]:
y

array([134.75, 133.88, 133.9 , ..., 183.53, 183.51, 184.08])

In [15]:
from xgboost import XGBRegressor

model = XGBRegressor(objective="reg:squarederror", n_estimators=1000)
model.fit(X, y)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=1000, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [16]:
test[0]

array([184.08, 183.74])

In [17]:
val = np.array(test[0, 0]).reshape(1, -1)

pred = model.predict(val)
print(pred[0])

184.6561


#### Predict
Train on the training set and predict one sample at a time.

In [18]:
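def xgb_predict(train, val):
    """Fit a fresh model on `train` and predict the target for one input value."""
    train = np.array(train)
    # Every column except the last is a feature; the last column is the target.
    X, y = train[:, :-1], train[:, -1]
    model = XGBRegressor(objective="reg:squarederror", n_estimators=1000)
    model.fit(X, y)

    # predict() expects a 2-D array of shape (n_samples, n_features).
    val = np.array(val).reshape(1, -1)
    pred = model.predict(val)
    return pred[0]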

In [19]:
xgb_predict(train, test[0, 0])

184.6561

#### Walk-forward validation

Since we are making a one-step-ahead prediction (here, one hour ahead), we first predict only the first record in the test set.

We then add the true observation from the test set to the training data, refit the model, and predict the next record. This repeats until the test set is exhausted.

We'll evaluate the model with the root mean squared error (RMSE).
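
RMSE is the square root of the mean squared difference between predictions and actual values, so it is expressed in the same units as the target (here, the units of the closing price). A minimal sketch in plain Python, using a hypothetical `rmse` helper and made-up numbers:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5).
print(rmse([3.0, 5.0], [0.0, 1.0]))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.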

In [21]:
from sklearn.metrics import mean_squared_error


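def validate(data, perc):
    """Walk-forward validation: refit on all data seen so far, predict one step."""
    predictions = []

    train, test = train_test_split(data, perc)

    # history grows by one observed row after every prediction.
    history = list(train)

    for i in range(len(test)):
        test_X, test_y = test[i, :-1], test[i, -1]

        pred = xgb_predict(history, test_X[0])
        predictions.append(pred)

        history.append(test[i])

    # Take the square root explicitly; the squared=False argument is not
    # available in recent scikit-learn releases.
    error = np.sqrt(mean_squared_error(test[:, -1], predictions))

    return error, test[:, -1], predictions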

In [22]:
%%time
rmse, y, pred = validate(df, 0.2)

print(rmse)


1.7967091070446082
CPU times: user 57min 19s, sys: 1min 20s, total: 58min 40s
Wall time: 3min 49s

