# [Random Forest for Time Series Forecasting](https://machinelearningmastery.com/random-forest-for-time-series-forecasting/)

Random Forest can also be used for time series forecasting, but the time series dataset must first be transformed into a supervised learning problem. The model must also be evaluated with a specialized technique called [walk-forward validation](https://en.wikipedia.org/wiki/Walk_forward_optimization), because k-fold cross-validation lets the model train on observations that come after the ones it is tested on, which gives optimistically biased results.
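
To make the idea of walk-forward validation concrete, here is a minimal sketch of an expanding-window splitter. The function name `walk_forward_splits` and its parameters are assumptions made for illustration; they do not come from the notebook or from any library.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, min_train_size: int
) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_indices, test_index) pairs in time order.

    Each step trains on every observation before the test point, so the
    model never sees data from the future of the point it predicts.
    """
    if not 0 < min_train_size < n_samples:
        raise ValueError("min_train_size must be between 1 and n_samples - 1")
    for test_index in range(min_train_size, n_samples):
        yield list(range(test_index)), test_index


# Hypothetical toy series of 6 time-ordered observations.
splits = list(walk_forward_splits(n_samples=6, min_train_size=3))
for train_indices, test_index in splits:
    print(f"train on {train_indices}, predict index {test_index}")
```

In contrast to k-fold, the training window here only ever grows forward in time. scikit-learn's `TimeSeriesSplit` offers a comparable expanding-window splitter if a library implementation is preferred.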


In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor

In [2]:
# get data
TEST_CSV = 'test.csv'  # no data snooping!
TRAIN_CSV = 'train.csv'
train_df = pd.read_csv(TRAIN_CSV)
train_df

Unnamed: 0,row_id,date,country,store,product,num_sold
0,0,2017-01-01,Belgium,KaggleMart,Kaggle Advanced Techniques,663
1,1,2017-01-01,Belgium,KaggleMart,Kaggle Getting Started,615
2,2,2017-01-01,Belgium,KaggleMart,Kaggle Recipe Book,480
3,3,2017-01-01,Belgium,KaggleMart,Kaggle for Kids: One Smart Goose,710
4,4,2017-01-01,Belgium,KaggleRama,Kaggle Advanced Techniques,240
...,...,...,...,...,...,...
70123,70123,2020-12-31,Spain,KaggleMart,Kaggle for Kids: One Smart Goose,614
70124,70124,2020-12-31,Spain,KaggleRama,Kaggle Advanced Techniques,215
70125,70125,2020-12-31,Spain,KaggleRama,Kaggle Getting Started,158
70126,70126,2020-12-31,Spain,KaggleRama,Kaggle Recipe Book,135


## [Convert Time Series Data to Supervised Representation](https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/)

We need to reframe the dataset so that each row holds the previous day's features, including the previous day's target, and maps them to the next day's target.

This is a little trickier for this dataset, since we're not looking at a single book and predicting its sales over time. Instead, we're predicting sales for four different book products, sold across two different stores and six countries. To reframe the data as a supervised problem, we need to convert 4 x 2 x 6 = 48 separate series into a single reframed dataset covering four years of training data. The input for future predictions must be able to represent any combination of country, store, and product so the model can predict appropriately.

In [3]:
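def series_to_supervised(train_df: pd.DataFrame) -> pd.DataFrame:
    df = train_df.copy()
    # One identifier per (country, store, product) series.
    df['uuid'] = df['country'] + '_' + df['store'] + '_' + df['product']
    # Order rows by series, then by date, so consecutive rows are consecutive days.
    df.sort_values(by=['uuid', 'date'], inplace=True)
    # shift(1) moves every column down one row, so each row holds the previous row's values.
    mod_df = df.shift(1).rename(columns={
        'date': 'prev_date', 'num_sold': 'prev_num_sold'
    })
    # Attach the current day's target and date next to the previous day's features.
    mod_df = pd.concat([mod_df, df.loc[:, ['num_sold', 'date']]], axis=1)
    # Every series starts on 2017-01-01; those rows have no previous day
    # (their shifted values come from the preceding series), so drop them.
    mod_df = mod_df.loc[mod_df.date != '2017-01-01', :]
    return mod_df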
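
As a cross-check on the reframing, the same one-day lag can be sketched with a grouped shift, which keeps each lag inside its own series instead of relying on the start date. This is an alternative sketch, not the notebook's method: the toy values and the names `toy_df`, `keys`, and `lagged_df` are hypothetical, and only the target is lagged here rather than every column.

```python
import pandas as pd

# Hypothetical toy data: two series, three days each.
toy_df = pd.DataFrame({
    'date': ['2017-01-01', '2017-01-02', '2017-01-03'] * 2,
    'country': ['Belgium'] * 3 + ['Spain'] * 3,
    'store': ['KaggleMart'] * 6,
    'product': ['Kaggle Recipe Book'] * 6,
    'num_sold': [480, 450, 470, 135, 140, 150],
})

keys = ['country', 'store', 'product']
toy_df = toy_df.sort_values(keys + ['date'])
# Shift within each (country, store, product) group so a lag never crosses
# from one series into another.
toy_df['prev_num_sold'] = toy_df.groupby(keys)['num_sold'].shift(1)
# The first day of each series has no previous day, so its lag is NaN.
lagged_df = toy_df.dropna(subset=['prev_num_sold']).reset_index(drop=True)
print(lagged_df)
```

Dropping the `NaN` rows plays the same role as the `2017-01-01` filter, but it would also work if the series did not all start on the same date.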

In [4]:
mod_train_df = series_to_supervised(train_df)
mod_train_df

Unnamed: 0,row_id,prev_date,country,store,product,prev_num_sold,uuid,num_sold,date
48,0.0,2017-01-01,Belgium,KaggleMart,Kaggle Advanced Techniques,663.0,Belgium_KaggleMart_Kaggle Advanced Techniques,514,2017-01-02
96,48.0,2017-01-02,Belgium,KaggleMart,Kaggle Advanced Techniques,514.0,Belgium_KaggleMart_Kaggle Advanced Techniques,549,2017-01-03
144,96.0,2017-01-03,Belgium,KaggleMart,Kaggle Advanced Techniques,549.0,Belgium_KaggleMart_Kaggle Advanced Techniques,477,2017-01-04
192,144.0,2017-01-04,Belgium,KaggleMart,Kaggle Advanced Techniques,477.0,Belgium_KaggleMart_Kaggle Advanced Techniques,447,2017-01-05
240,192.0,2017-01-05,Belgium,KaggleMart,Kaggle Advanced Techniques,447.0,Belgium_KaggleMart_Kaggle Advanced Techniques,431,2017-01-06
...,...,...,...,...,...,...,...,...,...
69935,69887.0,2020-12-26,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,187.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,204,2020-12-27
69983,69935.0,2020-12-27,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,204.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,212,2020-12-28
70031,69983.0,2020-12-28,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,212.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,242,2020-12-29
70079,70031.0,2020-12-29,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,242.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,239,2020-12-30


In [14]:
mod_train_df.reset_index().head(1461)

Unnamed: 0,index,row_id,prev_date,country,store,product,prev_num_sold,uuid,num_sold,date
0,48,0.0,2017-01-01,Belgium,KaggleMart,Kaggle Advanced Techniques,663.0,Belgium_KaggleMart_Kaggle Advanced Techniques,514,2017-01-02
1,96,48.0,2017-01-02,Belgium,KaggleMart,Kaggle Advanced Techniques,514.0,Belgium_KaggleMart_Kaggle Advanced Techniques,549,2017-01-03
2,144,96.0,2017-01-03,Belgium,KaggleMart,Kaggle Advanced Techniques,549.0,Belgium_KaggleMart_Kaggle Advanced Techniques,477,2017-01-04
3,192,144.0,2017-01-04,Belgium,KaggleMart,Kaggle Advanced Techniques,477.0,Belgium_KaggleMart_Kaggle Advanced Techniques,447,2017-01-05
4,240,192.0,2017-01-05,Belgium,KaggleMart,Kaggle Advanced Techniques,447.0,Belgium_KaggleMart_Kaggle Advanced Techniques,431,2017-01-06
...,...,...,...,...,...,...,...,...,...,...
1456,69936,69888.0,2020-12-27,Belgium,KaggleMart,Kaggle Advanced Techniques,574.0,Belgium_KaggleMart_Kaggle Advanced Techniques,625,2020-12-28
1457,69984,69936.0,2020-12-28,Belgium,KaggleMart,Kaggle Advanced Techniques,625.0,Belgium_KaggleMart_Kaggle Advanced Techniques,597,2020-12-29
1458,70032,69984.0,2020-12-29,Belgium,KaggleMart,Kaggle Advanced Techniques,597.0,Belgium_KaggleMart_Kaggle Advanced Techniques,632,2020-12-30
1459,70080,70032.0,2020-12-30,Belgium,KaggleMart,Kaggle Advanced Techniques,632.0,Belgium_KaggleMart_Kaggle Advanced Techniques,616,2020-12-31


In [15]:
mod_train_df.reset_index().tail(1461)

Unnamed: 0,index,row_id,prev_date,country,store,product,prev_num_sold,uuid,num_sold,date
68619,70126,70078.0,2020-12-30,Spain,KaggleRama,Kaggle Recipe Book,153.0,Spain_KaggleRama_Kaggle Recipe Book,135,2020-12-31
68620,95,47.0,2017-01-01,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,181.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,123,2017-01-02
68621,143,95.0,2017-01-02,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,123.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,125,2017-01-03
68622,191,143.0,2017-01-03,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,125.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,110,2017-01-04
68623,239,191.0,2017-01-04,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,110.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,104,2017-01-05
...,...,...,...,...,...,...,...,...,...,...
70075,69935,69887.0,2020-12-26,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,187.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,204,2020-12-27
70076,69983,69935.0,2020-12-27,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,204.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,212,2020-12-28
70077,70031,69983.0,2020-12-28,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,212.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,242,2020-12-29
70078,70079,70031.0,2020-12-29,Spain,KaggleRama,Kaggle for Kids: One Smart Goose,242.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,239,2020-12-30


## One-hot Encode Categorical Cols

We need to one-hot encode the `country`, `store`, and `product` columns for use in modeling.

In [16]:
ohe_train_df = pd.get_dummies(
    data=mod_train_df, prefix=['country', 'store', 'product'],
    columns=['country', 'store', 'product']
)
ohe_train_df

Unnamed: 0,row_id,prev_date,prev_num_sold,uuid,num_sold,date,country_Belgium,country_France,country_Germany,country_Italy,country_Poland,country_Spain,store_KaggleMart,store_KaggleRama,product_Kaggle Advanced Techniques,product_Kaggle Getting Started,product_Kaggle Recipe Book,product_Kaggle for Kids: One Smart Goose
48,0.0,2017-01-01,663.0,Belgium_KaggleMart_Kaggle Advanced Techniques,514,2017-01-02,1,0,0,0,0,0,1,0,1,0,0,0
96,48.0,2017-01-02,514.0,Belgium_KaggleMart_Kaggle Advanced Techniques,549,2017-01-03,1,0,0,0,0,0,1,0,1,0,0,0
144,96.0,2017-01-03,549.0,Belgium_KaggleMart_Kaggle Advanced Techniques,477,2017-01-04,1,0,0,0,0,0,1,0,1,0,0,0
192,144.0,2017-01-04,477.0,Belgium_KaggleMart_Kaggle Advanced Techniques,447,2017-01-05,1,0,0,0,0,0,1,0,1,0,0,0
240,192.0,2017-01-05,447.0,Belgium_KaggleMart_Kaggle Advanced Techniques,431,2017-01-06,1,0,0,0,0,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69935,69887.0,2020-12-26,187.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,204,2020-12-27,0,0,0,0,0,1,0,1,0,0,0,1
69983,69935.0,2020-12-27,204.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,212,2020-12-28,0,0,0,0,0,1,0,1,0,0,0,1
70031,69983.0,2020-12-28,212.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,242,2020-12-29,0,0,0,0,0,1,0,1,0,0,0,1
70079,70031.0,2020-12-29,242.0,Spain_KaggleRama_Kaggle for Kids: One Smart Goose,239,2020-12-30,0,0,0,0,0,1,0,1,0,0,0,1
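
Since the input for future predictions has to handle any country, store, and product, it can help to align a newly encoded row with the training columns. The following is a minimal sketch under stated assumptions: the toy frames `train_rows` and `new_row` are hypothetical, and `dtype=int` is passed explicitly because newer pandas versions may return boolean indicator columns by default.

```python
import pandas as pd

cat_cols = ['country', 'store', 'product']

# Hypothetical toy rows standing in for the training data.
train_rows = pd.DataFrame({
    'country': ['Belgium', 'Spain'],
    'store': ['KaggleMart', 'KaggleRama'],
    'product': ['Kaggle Recipe Book', 'Kaggle Getting Started'],
    'prev_num_sold': [480.0, 158.0],
})
# A single row to predict on; it contains only one level per category.
new_row = pd.DataFrame({
    'country': ['Spain'],
    'store': ['KaggleMart'],
    'product': ['Kaggle Recipe Book'],
    'prev_num_sold': [135.0],
})

# dtype=int requests 0/1 indicators instead of relying on the default dtype.
train_ohe = pd.get_dummies(train_rows, columns=cat_cols, dtype=int)
new_ohe = pd.get_dummies(new_row, columns=cat_cols, dtype=int)
# Add any indicator column the new row is missing (filled with 0) and put
# the columns in the same order as the training frame.
new_ohe = new_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(new_ohe)
```

Without the `reindex` step, a frame encoded from only a few rows would have fewer indicator columns than the training frame, and a fitted model would reject it.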


## Reframe Dataset

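Shift the previous day's features into the row for the next day's `num_sold` so that walk-forward validation can be used. Note that a plain `shift(1)` on the frame in its original row order, as in the output below, pairs each row with the row before it, which is a different series on the same date rather than the same series on the previous day; this is why `series_to_supervised` sorts by `uuid` and `date` before shifting.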

In [10]:
mod_train_df.shift(1).reset_index()

Unnamed: 0,index,row_id,date,num_sold,country_Belgium,country_France,country_Germany,country_Italy,country_Poland,country_Spain,store_KaggleMart,store_KaggleRama,product_Kaggle Advanced Techniques,product_Kaggle Getting Started,product_Kaggle Recipe Book,product_Kaggle for Kids: One Smart Goose
0,0,,,,,,,,,,,,,,,
1,1,0.0,2017-01-01,663.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,2,1.0,2017-01-01,615.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,3,2.0,2017-01-01,480.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,4,3.0,2017-01-01,710.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70123,70123,70122.0,2020-12-31,384.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
70124,70124,70123.0,2020-12-31,614.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
70125,70125,70124.0,2020-12-31,215.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
70126,70126,70125.0,2020-12-31,158.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
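
Because all series share the same calendar, walk-forward validation on the reframed data can split by date rather than by row. The sketch below is illustrative only: the toy `frame` is hypothetical, and a persistence forecast (the prediction equals `prev_num_sold`) stands in for the model, since the notebook has not fitted a regressor at this point.

```python
import pandas as pd

# Hypothetical reframed rows: two series, three target days each.
frame = pd.DataFrame({
    'uuid': ['A'] * 3 + ['B'] * 3,
    'date': ['2017-01-02', '2017-01-03', '2017-01-04'] * 2,
    'prev_num_sold': [663.0, 514.0, 549.0, 181.0, 123.0, 125.0],
    'num_sold': [514, 549, 477, 123, 125, 110],
})

dates = sorted(frame['date'].unique())
abs_errors = []
# Hold out one day at a time, starting from the second distinct date so
# the training window is never empty.
for test_date in dates[1:]:
    train = frame[frame['date'] < test_date]   # past rows only, all series
    test = frame[frame['date'] == test_date]   # the day being forecast
    # Leak check: nothing in the training window is as late as the test day.
    assert train['date'].max() < test_date
    # Placeholder "model": persistence forecast (tomorrow equals today).
    # A real run would fit a regressor on `train` here instead.
    predictions = test['prev_num_sold']
    abs_errors.extend((test['num_sold'] - predictions).abs().tolist())

mae = sum(abs_errors) / len(abs_errors)
print(f'walk-forward MAE of the persistence baseline: {mae:.2f}')
```

ISO-formatted date strings sort chronologically, which is why the string comparison works here. A baseline like this also gives a reference error that any fitted model should beat.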
