# [Random Forest for Time Series Forecasting](https://machinelearningmastery.com/random-forest-for-time-series-forecasting/)

Random Forest can also be used for time series forecasting, although it requires that the time series dataset be transformed into a supervised learning problem first. It also requires the use of a specialized technique for evaluating the model called [walk-forward validation](https://en.wikipedia.org/wiki/Walk_forward_optimization), as evaluating the model using k-fold cross validation would result in optimistically biased results.
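
To see why ordinary k-fold is a problem for ordered data, here is a minimal sketch on ten dummy time steps. It assumes scikit-learn's `KFold` and `TimeSeriesSplit` splitters; the toy index array is illustrative and not part of the sales dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# ten consecutive time steps (toy data, not the sales dataset)
steps = np.arange(10)

# plain k-fold: some folds train on observations that come after the test fold
kfold_leaks = [
    bool(train_idx.max() > test_idx.min())
    for train_idx, test_idx in KFold(n_splits=5).split(steps)
]

# time-ordered splits: every training index precedes every test index
ordered_leaks = [
    bool(train_idx.max() > test_idx.min())
    for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(steps)
]

print(kfold_leaks)
print(ordered_leaks)
```

Any `True` in the first list marks a fold where the model would be trained on the future and tested on the past, which is the source of the optimistic bias.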


In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [2]:
# get data
TEST_CSV = 'test.csv'  # no data snooping!
TRAIN_CSV = 'train.csv'
train_df = pd.read_csv(TRAIN_CSV)
train_df

Unnamed: 0,row_id,date,country,store,product,num_sold
0,0,2017-01-01,Belgium,KaggleMart,Kaggle Advanced Techniques,663
1,1,2017-01-01,Belgium,KaggleMart,Kaggle Getting Started,615
2,2,2017-01-01,Belgium,KaggleMart,Kaggle Recipe Book,480
3,3,2017-01-01,Belgium,KaggleMart,Kaggle for Kids: One Smart Goose,710
4,4,2017-01-01,Belgium,KaggleRama,Kaggle Advanced Techniques,240
...,...,...,...,...,...,...
70123,70123,2020-12-31,Spain,KaggleMart,Kaggle for Kids: One Smart Goose,614
70124,70124,2020-12-31,Spain,KaggleRama,Kaggle Advanced Techniques,215
70125,70125,2020-12-31,Spain,KaggleRama,Kaggle Getting Started,158
70126,70126,2020-12-31,Spain,KaggleRama,Kaggle Recipe Book,135


## [Convert Time Series Data to Supervised Representation](https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/)

We need to reframe the dataset so that each row holds the previous day's features, including the previous day's target, and maps them to the next day's target.

This is a little trickier for this dataset, since we are not tracking a single book and predicting its sales over time. Instead, we are predicting sales for four different book products, sold across two different stores in six countries. To reframe the problem as supervised learning, we need to convert the 4 x 2 x 6 = 48 subsets into a single reframed dataset covering four years of training data. The inputs for future predictions must be able to represent any of these combinations so the model can predict appropriately.
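
As a small illustration of the reframing, the sketch below builds a one-day lag per series on toy data. It uses `groupby(...).shift(1)`, which is a different route from the notebook's `series_to_supervised`; the ids `A` and `B` and the sales figures are made up.

```python
import pandas as pd

# toy data: two series (hypothetical ids "A" and "B"), three days each
toy = pd.DataFrame({
    'uuid': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': ['2017-01-01', '2017-01-02', '2017-01-03'] * 2,
    'num_sold': [10, 12, 11, 50, 55, 53],
})

toy = toy.sort_values(['uuid', 'date'])
# shift within each series so a lag never crosses from one series into the next
toy['prev_num_sold'] = toy.groupby('uuid')['num_sold'].shift(1)
# the first day of each series has no previous day, so drop it
supervised = toy.dropna(subset=['prev_num_sold']).reset_index(drop=True)
print(supervised)
```

Each remaining row pairs yesterday's sales (`prev_num_sold`) with today's target (`num_sold`) for the same series.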

In [3]:
def series_to_supervised(train_df: pd.DataFrame) -> pd.DataFrame:
    df = train_df.copy()
    # one identifier per country-store-product series
    df['uuid'] = df['country'] + '_' + df['store'] + '_' + df['product']
    df.sort_values(by=['uuid', 'date'], inplace=True)
    # previous-day values become the inputs for the current row
    mod_df = df.shift(1).rename(columns={
        'date': 'prev_date', 'num_sold': 'prev_num_sold'
    })
    mod_df = pd.concat([mod_df, df.loc[:, ['num_sold', 'date']]], axis=1)
    # the shift is not grouped, so the first day of each series would
    # inherit the last row of the previous series; drop those rows
    mod_df = mod_df.loc[mod_df.date != '2017-01-01', :]
    return mod_df

In [4]:
mod_train_df = series_to_supervised(train_df)
mod_train_df

Unnamed: 0,row_id,prev_date,country,store,product,prev_num_sold,uuid,num_sold,date
48,0.0,2017-01-01,Belgium,KaggleMart,Kaggle Advanced Techniques,663.0,Belgium_KaggleMart_Kaggle Advanced Techniques,514,2017-01-02
96,48.0,2017-01-02,Belgium,KaggleMart,Kaggle Advanced Techniques,514.0,Belgium_KaggleMart_Kaggle Advanced Techniques,549,2017-01-03
144,96.0,2017-01-03,Belgium,KaggleMart,Kaggle Advanced Techniques,549.0,Belgium_KaggleMart_Kaggle Advanced Techniques,477,2017-01-04
192,144.0,2017-01-04,Belgium,KaggleMart,Kaggle Advanced Techniques,477.0,Belgium_KaggleMart_Kaggle Advanced Techniques,447,2017-01-05
240,192.0,2017-01-05,Belgium,KaggleMart,Kaggle Advanced Techniques,447.0,Belgium_KaggleMart_Kaggle Advanced Techniques,431,2017-01-06
...,...,...,...,...,...,...,...,...,...
69935,69887.0,2020-12-26,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,187.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,204,2020-12-27
69983,69935.0,2020-12-27,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,204.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,212,2020-12-28
70031,69983.0,2020-12-28,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,212.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,242,2020-12-29
70079,70031.0,2020-12-29,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,242.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,239,2020-12-30


In [5]:
mod_train_df.reset_index().head(1461)

Unnamed: 0,index,row_id,prev_date,country,store,product,prev_num_sold,uuid,num_sold,date
0,48,0.0,2017-01-01,Belgium,KaggleMart,Kaggle Advanced Techniques,663.0,Belgium_KaggleMart_Kaggle Advanced Techniques,514,2017-01-02
1,96,48.0,2017-01-02,Belgium,KaggleMart,Kaggle Advanced Techniques,514.0,Belgium_KaggleMart_Kaggle Advanced Techniques,549,2017-01-03
2,144,96.0,2017-01-03,Belgium,KaggleMart,Kaggle Advanced Techniques,549.0,Belgium_KaggleMart_Kaggle Advanced Techniques,477,2017-01-04
3,192,144.0,2017-01-04,Belgium,KaggleMart,Kaggle Advanced Techniques,477.0,Belgium_KaggleMart_Kaggle Advanced Techniques,447,2017-01-05
4,240,192.0,2017-01-05,Belgium,KaggleMart,Kaggle Advanced Techniques,447.0,Belgium_KaggleMart_Kaggle Advanced Techniques,431,2017-01-06
...,...,...,...,...,...,...,...,...,...,...
1456,69936,69888.0,2020-12-27,Belgium,KaggleMart,Kaggle Advanced Techniques,574.0,Belgium_KaggleMart_Kaggle Advanced Techniques,625,2020-12-28
1457,69984,69936.0,2020-12-28,Belgium,KaggleMart,Kaggle Advanced Techniques,625.0,Belgium_KaggleMart_Kaggle Advanced Techniques,597,2020-12-29
1458,70032,69984.0,2020-12-29,Belgium,KaggleMart,Kaggle Advanced Techniques,597.0,Belgium_KaggleMart_Kaggle Advanced Techniques,632,2020-12-30
1459,70080,70032.0,2020-12-30,Belgium,KaggleMart,Kaggle Advanced Techniques,632.0,Belgium_KaggleMart_Kaggle Advanced Techniques,616,2020-12-31


In [6]:
mod_train_df.reset_index().tail(1461)

Unnamed: 0,index,row_id,prev_date,country,store,product,prev_num_sold,uuid,num_sold,date
68619,70126,70078.0,2020-12-30,Spain,KaggleRama,Kaggle Recipe Book,153.0,Spain_KaggleRama_Kaggle Recipe Book,135,2020-12-31
68620,95,47.0,2017-01-01,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,181.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,123,2017-01-02
68621,143,95.0,2017-01-02,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,123.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,125,2017-01-03
68622,191,143.0,2017-01-03,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,125.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,110,2017-01-04
68623,239,191.0,2017-01-04,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,110.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,104,2017-01-05
...,...,...,...,...,...,...,...,...,...,...
70075,69935,69887.0,2020-12-26,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,187.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,204,2020-12-27
70076,69983,69935.0,2020-12-27,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,204.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,212,2020-12-28
70077,70031,69983.0,2020-12-28,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,212.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,242,2020-12-29
70078,70079,70031.0,2020-12-29,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,242.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,239,2020-12-30


## One-hot Encode Categorical Cols

The `country`, `store`, and `product` columns need to be one-hot encoded for use in modeling.
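
A minimal sketch of what the encoding does, on a three-row toy frame with made-up values. The `dtype=int` argument is an addition for the example: pandas 2.x returns boolean dummy columns by default, whereas the output shown in this notebook has 0/1 integers.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Belgium', 'Spain', 'Belgium'],
    'store': ['KaggleMart', 'KaggleRama', 'KaggleRama'],
    'prev_num_sold': [663.0, 181.0, 240.0],
})

# dtype=int asks for 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, columns=['country', 'store'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each category value becomes its own column, prefixed with the name of the column it came from.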

In [7]:
ohe_train_df = pd.get_dummies(
    data=mod_train_df, prefix=['country', 'store', 'product'],
    columns=['country', 'store', 'product']
)
ohe_train_df

Unnamed: 0,row_id,prev_date,prev_num_sold,uuid,num_sold,date,country_Belgium,country_France,country_Germany,country_Italy,country_Poland,country_Spain,store_KaggleMart,store_KaggleRama,product_Kaggle Advanced Techniques,product_Kaggle Getting Started,product_Kaggle Recipe Book,product_Kaggle for Kids: One Smart Goose
48,0.0,2017-01-01,663.0,Belgium_KaggleMart_Kaggle Advanced Techniques,514,2017-01-02,1,0,0,0,0,0,1,0,1,0,0,0
96,48.0,2017-01-02,514.0,Belgium_KaggleMart_Kaggle Advanced Techniques,549,2017-01-03,1,0,0,0,0,0,1,0,1,0,0,0
144,96.0,2017-01-03,549.0,Belgium_KaggleMart_Kaggle Advanced Techniques,477,2017-01-04,1,0,0,0,0,0,1,0,1,0,0,0
192,144.0,2017-01-04,477.0,Belgium_KaggleMart_Kaggle Advanced Techniques,447,2017-01-05,1,0,0,0,0,0,1,0,1,0,0,0
240,192.0,2017-01-05,447.0,Belgium_KaggleMart_Kaggle Advanced Techniques,431,2017-01-06,1,0,0,0,0,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69935,69887.0,2020-12-26,187.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,204,2020-12-27,0,0,0,0,0,1,0,1,0,0,0,1
69983,69935.0,2020-12-27,204.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,212,2020-12-28,0,0,0,0,0,1,0,1,0,0,0,1
70031,69983.0,2020-12-28,212.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,242,2020-12-29,0,0,0,0,0,1,0,1,0,0,0,1
70079,70031.0,2020-12-29,242.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,239,2020-12-30,0,0,0,0,0,1,0,1,0,0,0,1


## Generate Model Ready Dataset

Now that we have the supervised representation and the one-hot encoding of the categorical features, two steps remain before we can fit a random forest model: drop the columns the model does not need (the row index, `row_id`, `prev_date`, `uuid`, and `date`) and convert what is left to a matrix (NumPy array).

Apart from the one-hot-encoded categorical features, the only columns that should remain are:
- `prev_num_sold`, which is part of X and completes the sliding-window inputs needed for walk-forward validation
- `num_sold`, which is y, the target for each sliding-window input

In [8]:
# re-order columns so it will be easy to subset into: uuid - time group, features, & target
COL_MUNGED_DATA = [
    'uuid', 'prev_date', 'date',
    'prev_num_sold', 'country_Belgium', 'country_France', 'country_Germany', 'country_Italy',
    'country_Poland', 'country_Spain', 'store_KaggleMart',
    'store_KaggleRama', 'product_Kaggle Advanced Techniques',
    'product_Kaggle Getting Started', 'product_Kaggle Recipe Book',
    'product_Kaggle for Kids: One Smart Goose', 'num_sold'
]
COL_MUNGED_DATA

['uuid',
 'prev_date',
 'date',
 'prev_num_sold',
 'country_Belgium',
 'country_France',
 'country_Germany',
 'country_Italy',
 'country_Poland',
 'country_Spain',
 'store_KaggleMart',
 'store_KaggleRama',
 'product_Kaggle Advanced Techniques',
 'product_Kaggle Getting Started',
 'product_Kaggle Recipe Book',
 'product_Kaggle for Kids: One Smart Goose',
 'num_sold']

In [9]:
# create index to column name mapping to facilitate reconciling
IDX_COL_NAME_MAP = {i: col for i, col in enumerate(COL_MUNGED_DATA)}
IDX_COL_NAME_MAP

{0: 'uuid',
 1: 'prev_date',
 2: 'date',
 3: 'prev_num_sold',
 4: 'country_Belgium',
 5: 'country_France',
 6: 'country_Germany',
 7: 'country_Italy',
 8: 'country_Poland',
 9: 'country_Spain',
 10: 'store_KaggleMart',
 11: 'store_KaggleRama',
 12: 'product_Kaggle Advanced Techniques',
 13: 'product_Kaggle Getting Started',
 14: 'product_Kaggle Recipe Book',
 15: 'product_Kaggle for Kids: One Smart Goose',
 16: 'num_sold'}

Based on these indices, the columns fall into three groups:
- 0 - 2: identifier and date group
- 3 - 15: feature set
- 16: target
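
The grouping makes positional slicing straightforward. The sketch below uses a toy frame with the same layout (three identifier columns first, target last) but fewer feature columns; the values are copied from the first rows shown above.

```python
import numpy as np
import pandas as pd

# toy frame with the same layout: 3 identifier columns, features, target last
toy = pd.DataFrame({
    'uuid': ['A', 'A'],
    'prev_date': ['2017-01-01', '2017-01-02'],
    'date': ['2017-01-02', '2017-01-03'],
    'prev_num_sold': [663.0, 514.0],
    'country_Belgium': [1, 1],
    'num_sold': [514, 549],
})

values = toy.iloc[:, 3:].to_numpy()   # drop the identifier and date group
X, y = values[:, :-1], values[:, -1]  # features, then target
print(X.shape, y)
```

On the full frame the same two slices would select columns 3 - 15 as X and column 16 as y.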

In [10]:
munged_df = ohe_train_df.loc[:, COL_MUNGED_DATA]
munged_df

Unnamed: 0,uuid,prev_date,date,prev_num_sold,country_Belgium,country_France,country_Germany,country_Italy,country_Poland,country_Spain,store_KaggleMart,store_KaggleRama,product_Kaggle Advanced Techniques,product_Kaggle Getting Started,product_Kaggle Recipe Book,product_Kaggle for Kids: One Smart Goose,num_sold
48,Belgium_KaggleMart_Kaggle Advanced Techniques,2017-01-01,2017-01-02,663.0,1,0,0,0,0,0,1,0,1,0,0,0,514
96,Belgium_KaggleMart_Kaggle Advanced Techniques,2017-01-02,2017-01-03,514.0,1,0,0,0,0,0,1,0,1,0,0,0,549
144,Belgium_KaggleMart_Kaggle Advanced Techniques,2017-01-03,2017-01-04,549.0,1,0,0,0,0,0,1,0,1,0,0,0,477
192,Belgium_KaggleMart_Kaggle Advanced Techniques,2017-01-04,2017-01-05,477.0,1,0,0,0,0,0,1,0,1,0,0,0,447
240,Belgium_KaggleMart_Kaggle Advanced Techniques,2017-01-05,2017-01-06,447.0,1,0,0,0,0,0,1,0,1,0,0,0,431
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69935,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,2020-12-26,2020-12-27,187.0,0,0,0,0,0,1,0,1,0,0,0,1,204
69983,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,2020-12-27,2020-12-28,204.0,0,0,0,0,0,1,0,1,0,0,0,1,212
70031,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,2020-12-28,2020-12-29,212.0,0,0,0,0,0,1,0,1,0,0,0,1,242
70079,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,2020-12-29,2020-12-30,242.0,0,0,0,0,0,1,0,1,0,0,0,1,239


In [11]:
munged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70080 entries, 48 to 70127
Data columns (total 17 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   uuid                                      70080 non-null  object 
 1   prev_date                                 70080 non-null  object 
 2   date                                      70080 non-null  object 
 3   prev_num_sold                             70080 non-null  float64
 4   country_Belgium                           70080 non-null  uint8  
 5   country_France                            70080 non-null  uint8  
 6   country_Germany                           70080 non-null  uint8  
 7   country_Italy                             70080 non-null  uint8  
 8   country_Poland                            70080 non-null  uint8  
 9   country_Spain                             70080 non-null  uint8  
 10  store_KaggleMart                 

## Walk-forward Validation

The next step is to define the evaluation methodology for this dataset, keeping in mind the complexity and non-linearity across `uuid`s.

Recall that the task is to predict item sales for each date-country-store-item combination. For each date in 2021, we want to predict the sales of each country-store-item (`uuid`) using the dataset spanning 2017 - 2020.

Note:
In walk-forward validation, the dataset is first split into train and test sets by selecting a cut point, e.g. all data except the last 12 months is used for training and the last 12 months is used for testing.

If we are interested in making a one-step forecast, e.g. one month, then we can evaluate the model by training on the training dataset and predicting the first step in the test dataset. We can then add the real observation from the test set to the training dataset, refit the model, then have the model predict the second step in the test dataset.

Repeating this process for the entire test dataset will give a one-step prediction for the entire test dataset from which an error measure can be calculated to evaluate the skill of the model.
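
The loop described above can be sketched in a few lines. To keep it fast and dependency-free, this version swaps the random forest for a persistence forecast (predict the last observed value); the function name and the toy series are hypothetical.

```python
def walk_forward_persistence(series, n_test):
    """Walk forward over the last n_test points with a persistence forecast."""
    history = list(series[:-n_test])
    test = list(series[-n_test:])
    predictions = []
    for actual in test:
        # "fit" and predict: persistence simply repeats the last known value
        predictions.append(history[-1])
        # reveal the real observation before forecasting the next step
        history.append(actual)
    mae = sum(abs(a - p) for a, p in zip(test, predictions)) / len(test)
    return mae, predictions

# toy series (hypothetical values, not taken from the dataset)
mae, preds = walk_forward_persistence([10, 12, 11, 13, 15, 14], n_test=3)
print(mae, preds)
```

Replacing the persistence step with a model refit on `history` gives the random forest version used later in this notebook.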

In [12]:
# total number of date-country-store-item combination for given year
len(munged_df.uuid.unique()) * 365

17520

### Naive Approach

The simplest approach is to use the 2017 - 2019 data as the training set and the 2020 data as the test set. Then apply walk-forward validation for each day in 2020 to train and evaluate the model used to predict each date-country-store-item combination in 2021.

For this approach, we can think of a single training epoch as walking through all date-country-store-item combinations in 2020. The count above assumes a 365-day year; 2020 is a leap year, so the walk actually takes 48 x 366 = 17,568 steps.
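
As a quick check on the step count, which depends on how many days the test year has, here is a small sketch. It assumes 48 series (6 countries x 2 stores x 4 products); the helper name is hypothetical.

```python
import calendar

def steps_in_year(year, n_series=48):
    """Walk-forward steps needed to cover every series on every day of a year."""
    n_days = 366 if calendar.isleap(year) else 365
    return n_series * n_days

print(steps_in_year(2020))
print(steps_in_year(2021))
```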

In [13]:
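def train_test_split(data: pd.DataFrame, date_cut: str) -> tuple:
    # dates are ISO-formatted strings, so string comparison follows date order
    train_df = data[data.date < date_cut]
    test_df = data[data.date >= date_cut]
    # drop the identifier and date columns (positions 0 - 2)
    np_train = train_df.iloc[:, 3:].to_numpy()
    np_test = test_df.iloc[:, 3:].to_numpy()
    return np_train, np_test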

In [14]:
def random_forest_forecast(train: list, test_x: np.ndarray) -> float:
    train = np.asarray(train)
    train_x, train_y = train[:, :-1], train[:, -1]
    # fit model
    model = RandomForestRegressor(n_estimators=1000)
    model.fit(train_x, train_y)
    # make one-step prediction
    y_hat = model.predict([test_x])
    return y_hat[0]

In [15]:
def walk_forward_validation(data: pd.DataFrame, date_cut: str) -> tuple:
    predictions = []
    train, test = train_test_split(data, date_cut)
    # seed history with training dataset
    history = [x for x in train]
    # perform walk forward steps
    for i in range(len(test)):
        test_x, test_y = test[i, :-1], test[i, -1]
        y_hat = random_forest_forecast(history, test_x)
        predictions.append(y_hat)
        # add actual observation to history for the next loop
        history.append(test[i])
        print('expected=%.1f, predicted=%.1f' % (test_y, y_hat))
    # estimate prediction error
    error = mean_absolute_error(test[:, -1], predictions)
    return error, test[:, -1], predictions

In [None]:
error, targets, predictions = walk_forward_validation(munged_df, '2020-01-01')
error, targets, predictions

expected=501.0, predicted=530.9
expected=452.0, predicted=529.5
expected=452.0, predicted=475.6
expected=517.0, predicted=472.1
expected=503.0, predicted=516.0
expected=445.0, predicted=511.4
expected=445.0, predicted=450.9
expected=375.0, predicted=449.1
expected=386.0, predicted=375.3
expected=444.0, predicted=376.9
expected=441.0, predicted=451.2
expected=488.0, predicted=446.1
expected=385.0, predicted=501.9
expected=417.0, predicted=398.2
expected=399.0, predicted=416.1
expected=383.0, predicted=379.5
expected=439.0, predicted=388.1
expected=440.0, predicted=439.3
expected=473.0, predicted=455.4
expected=383.0, predicted=458.7
expected=405.0, predicted=398.3
expected=382.0, predicted=405.9
expected=433.0, predicted=324.1
expected=429.0, predicted=394.6
expected=452.0, predicted=435.8
expected=462.0, predicted=476.5
expected=397.0, predicted=483.5
expected=406.0, predicted=397.1
expected=386.0, predicted=437.0
expected=416.0, predicted=391.1
expected=437.0, predicted=436.1
expected

### Complex Approach

Take a more iterative approach: use only the 2017 data as the training set and 2018 - 2020 as the test set, walking forward in the hope of capturing the patterns better.
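
Moving the cut earlier changes how many walk-forward steps (and model refits) are needed. The sketch below compares two cut dates, assuming daily data from 2017-01-01 to 2020-12-31 for 48 series, as in the tables above; `count_steps` is a hypothetical helper.

```python
import pandas as pd

# assumption: daily data from 2017-01-01 to 2020-12-31 for 48 series
dates = pd.date_range('2017-01-01', '2020-12-31', freq='D')
n_series = 48

def count_steps(date_cut):
    """Number of walk-forward steps when testing on all dates >= date_cut."""
    n_test_days = int((dates >= pd.Timestamp(date_cut)).sum())
    return n_series * n_test_days

print(count_steps('2020-01-01'))
print(count_steps('2018-01-01'))
```

With a cut at 2018-01-01 the walk covers three years of test data, so it takes roughly three times as many refits as the 2020-only split.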

What is an epoch in this approach?