This notebook demonstrates how to incorporate forecast-ready event features into a demand forecasting model.

Before running this notebook, make sure you have a predefined set of locations with corresponding demand data and event features. For each location, we will train two XGBoost models to predict restaurant demand: one using both event features and existing features, and one using existing features alone. You are encouraged to adapt this approach to your own demand forecasting workflow.

# Background

This notebook uses demand data from multiple restaurant locations across the US, spanning two years from 2017 to 2018. For each location, we train the model on the first 80% of the data in chronological order and make predictions for the remaining 20%. We then compare the model that uses both event and existing features against the model that uses only existing features, using mean absolute error (MAE) and root mean squared error (RMSE).

# Steps

* [Setup](#setup)
* [Step 1. Prepare Data](#step-1-prepare-data)
* [Step 2. Merge Features and Demand](#step-2-merge-features-and-demand)
* [Step 3. Train Model](#step-3-train-model)

# Setup

Complete the following steps before proceeding:

1. Install the dependencies listed in `requirements.txt`
2. Update `DATA_DIR` and `OUTPUT_DIR` as necessary
3. Replace `ACCESS_TOKEN` with a valid token (for help creating an access token, see [the API Quickstart](https://docs.predicthq.com/getting-started/api-quickstart))

In [1]:
# install requirements
# %pip install --user -r requirements.txt

In [2]:
import pandas as pd
import numpy as np
import os

import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error
import plotly.graph_objects as go

from predicthq import Client
import beam_api_utils as bau

In [3]:
DATA_DIR = "data"
OUTPUT_DIR = "output"

ACCESS_TOKEN = "REPLACE_WITH_ACCESS_TOKEN"

In [4]:
phq = Client(access_token=ACCESS_TOKEN)

# Step 1. Prepare Data

Prepare the following information:

1. Demand data

    a. One CSV file with columns for `location`, `date` and `demand`

2. Features data

    a. One CSV file with columns for `location`, `date` and event features

    b. One CSV file with columns for `location`, `date` and existing features (optional)
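
Before loading the real files, it can help to check that each one has the expected shape. The following is a minimal sketch, not part of the original workflow: the helper name `validate_frame` and the toy DataFrames are assumptions made for illustration. It checks for the required columns, unparseable dates, and duplicate `(location, date)` rows.

```python
import pandas as pd

REQUIRED_KEYS = ["location", "date"]


def validate_frame(df, name, extra_required=()):
    """Hypothetical helper: return a list of problems found in `df` (empty if none)."""
    problems = []
    missing = [c for c in [*REQUIRED_KEYS, *extra_required] if c not in df.columns]
    if missing:
        problems.append(f"{name}: missing columns {missing}")
        return problems  # the remaining checks need the key columns
    # with errors="coerce", values that cannot be parsed as dates become NaT
    bad_dates = int(pd.to_datetime(df["date"], errors="coerce").isna().sum())
    if bad_dates:
        problems.append(f"{name}: {bad_dates} unparseable date value(s)")
    # each (location, date) pair should appear at most once
    dupes = int(df.duplicated(subset=REQUIRED_KEYS).sum())
    if dupes:
        problems.append(f"{name}: {dupes} duplicate (location, date) row(s)")
    return problems


# toy frames standing in for the real CSV files
good = pd.DataFrame(
    {
        "location": ["store_0", "store_0"],
        "date": ["2017-01-02", "2017-01-03"],
        "demand": [1.0, 2.0],
    }
)
bad = pd.DataFrame(
    {
        "location": ["store_0", "store_0", "store_0"],
        "date": ["2017-01-02", "2017-01-02", "not a date"],
        "demand": [1.0, 2.0, 3.0],
    }
)

print(validate_frame(good, "demand", extra_required=["demand"]))
print(validate_frame(bad, "demand", extra_required=["demand"]))
```

The same helper could be applied to the features files by passing the feature column names you expect as `extra_required`.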

In [5]:
# read and inspect demand file
demand_df = pd.read_csv(os.path.join(DATA_DIR, "demand.csv"))
demand_df.head()

| | location | date | demand |
|---|---|---|---|
| 0 | store_0 | 2017-01-02 | 3350.804294 |
| 1 | store_0 | 2017-01-03 | 7974.534129 |
| 2 | store_0 | 2017-01-04 | 7274.021429 |
| 3 | store_0 | 2017-01-05 | 7504.479021 |
| 4 | store_0 | 2017-01-06 | 7091.141396 |


In [6]:
# read and inspect event features file
event_features_df = pd.read_csv(os.path.join(OUTPUT_DIR, "features.csv"))
event_features_df.head()

| | location | date | phq_attendance_community | phq_attendance_concerts | phq_attendance_conferences | phq_attendance_expos | phq_attendance_festivals | phq_attendance_performing_arts | phq_attendance_school_holidays | phq_attendance_sports | ... | phq_impact_severe_weather_dust_retail | phq_impact_severe_weather_dust_storm_retail | phq_impact_severe_weather_flood_retail | phq_impact_severe_weather_heat_wave_retail | phq_impact_severe_weather_hurricane_retail | phq_impact_severe_weather_thunderstorm_retail | phq_impact_severe_weather_tornado_retail | phq_impact_severe_weather_tropical_storm_retail | phq_rank_academic_exam | phq_rank_academic_holiday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | store_0 | 2017-01-02 | 7104.0 | 32173.0 | 0.0 | 0.0 | 0.0 | 13562.0 | 0.0 | 19812.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 1 | store_0 | 2017-01-03 | 7229.0 | 32403.0 | 0.0 | 0.0 | 0.0 | 27192.0 | 0.0 | 18006.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 2 | store_0 | 2017-01-04 | 7481.0 | 780.0 | 0.0 | 0.0 | 0.0 | 38684.0 | 0.0 | 19812.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 3 | store_0 | 2017-01-05 | 8004.0 | 3538.0 | 0.0 | 0.0 | 0.0 | 26021.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 4 | store_0 | 2017-01-06 | 11340.0 | 6204.0 | 0.0 | 0.0 | 0.0 | 29882.0 | 0.0 | 19500.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |


In [7]:
# read and inspect existing features file
existing_features_df = pd.read_csv(os.path.join(DATA_DIR, "existing_features.csv"))
existing_features_df.head()

| | location | date | day_of_week | week_of_year | month_of_year |
|---|---|---|---|---|---|
| 0 | store_0 | 2017-01-02 | 0 | 1 | 1 |
| 1 | store_0 | 2017-01-03 | 1 | 1 | 1 |
| 2 | store_0 | 2017-01-04 | 2 | 1 | 1 |
| 3 | store_0 | 2017-01-05 | 3 | 1 | 1 |
| 4 | store_0 | 2017-01-06 | 4 | 1 | 1 |


# Step 2. Merge Features and Demand

In [8]:
df = demand_df.merge(event_features_df, on=["location", "date"], how="left")
if existing_features_df is not None:
    df = df.merge(existing_features_df, on=["location", "date"], how="left")

df.head()

| | location | date | demand | phq_attendance_community | phq_attendance_concerts | phq_attendance_conferences | phq_attendance_expos | phq_attendance_festivals | phq_attendance_performing_arts | phq_attendance_school_holidays | ... | phq_impact_severe_weather_heat_wave_retail | phq_impact_severe_weather_hurricane_retail | phq_impact_severe_weather_thunderstorm_retail | phq_impact_severe_weather_tornado_retail | phq_impact_severe_weather_tropical_storm_retail | phq_rank_academic_exam | phq_rank_academic_holiday | day_of_week | week_of_year | month_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | store_0 | 2017-01-02 | 3350.804294 | 7104.0 | 32173.0 | 0.0 | 0.0 | 0.0 | 13562.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 0 | 1 | 1 |
| 1 | store_0 | 2017-01-03 | 7974.534129 | 7229.0 | 32403.0 | 0.0 | 0.0 | 0.0 | 27192.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 1 | 1 | 1 |
| 2 | store_0 | 2017-01-04 | 7274.021429 | 7481.0 | 780.0 | 0.0 | 0.0 | 0.0 | 38684.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 2 | 1 | 1 |
| 3 | store_0 | 2017-01-05 | 7504.479021 | 8004.0 | 3538.0 | 0.0 | 0.0 | 0.0 | 26021.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 3 | 1 | 1 |
| 4 | store_0 | 2017-01-06 | 7091.141396 | 11340.0 | 6204.0 | 0.0 | 0.0 | 0.0 | 29882.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 4 | 1 | 1 |
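
A left merge keeps every demand row, so any demand row without a matching feature row silently ends up with missing feature values. One way to surface this is pandas' `indicator` and `validate` options on `merge`. The sketch below uses small made-up DataFrames (the values and the missing `store_1` feature row are assumptions for illustration) rather than the notebook's real data.

```python
import pandas as pd

# toy demand and feature tables; store_1 deliberately has no feature row
demand = pd.DataFrame(
    {
        "location": ["store_0", "store_0", "store_1"],
        "date": ["2017-01-02", "2017-01-03", "2017-01-02"],
        "demand": [3350.8, 7974.5, 1200.0],
    }
)
features = pd.DataFrame(
    {
        "location": ["store_0", "store_0"],
        "date": ["2017-01-02", "2017-01-03"],
        "phq_attendance_sports": [19812.0, 18006.0],
    }
)

merged = demand.merge(
    features,
    on=["location", "date"],
    how="left",
    indicator=True,  # adds a "_merge" column: "both" or "left_only" for a left merge
    validate="one_to_one",  # raises MergeError if keys are duplicated on either side
)

# rows present in demand but absent from features
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} demand rows have no matching feature row")
print(unmatched[["location", "date"]])

# drop the helper column before using the frame for modelling
merged = merged.drop(columns="_merge")
```

If unmatched rows appear, it is usually worth checking that the `date` columns use the same format in both files before training.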


# Step 3. Train Model

In [9]:
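def split_data(df, location, train_fraction=0.8):
    # work on a copy so the caller's DataFrame is not modified
    loc_df = df[df["location"] == location].copy()
    loc_df["date"] = pd.to_datetime(loc_df["date"])
    loc_df = loc_df.sort_values("date")
    # chronological split: the earliest rows train the model, the latest test it
    split_index = int(len(loc_df) * train_fraction)
    train = loc_df.iloc[:split_index]
    test = loc_df.iloc[split_index:]
    return train, test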


def train_model(X_train, y_train):
    model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)
    model.fit(X_train, y_train)
    return model


def plot_results(train, test, y_train, y_test, y_pred, location, feature_set):
    train_trace = go.Scatter(
        x=train["date"],
        y=y_train,
        mode="lines+markers",
        name="Train Actual",
        line=dict(color="lightseagreen"),
    )
    test_trace = go.Scatter(
        x=test["date"],
        y=y_test,
        mode="lines+markers",
        name="Test Actual",
        line=dict(color="LightSkyBlue"),
    )
    predicted_trace = go.Scatter(
        x=test["date"],
        y=y_pred,
        mode="lines+markers",
        name="Test Predicted",
        line=dict(color="lightcoral"),
    )

    fig = go.Figure()
    fig.add_trace(train_trace)
    fig.add_trace(test_trace)
    fig.add_trace(predicted_trace)

    # add vertical line for train/test split
    fig.add_vline(
        x=test["date"].iloc[0], line_width=2, line_dash="dot", line_color="lightgray"
    )

    fig.update_layout(
        title=f"<b>Actual vs. Predicted Demand</b><br><sub>For {location}, feature set: {feature_set}</sub>",
        xaxis_title="Date",
        yaxis_title="Demand",
        legend_title="Type",
    )

    fig.show()


def calculate_metrics(y_test, y_pred):
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return mae, rmse


def calculate_percentage_change(old_metrics, new_metrics):
    mae_change = ((new_metrics["MAE"] - old_metrics["MAE"]) / old_metrics["MAE"]) * 100
    rmse_change = (
        (new_metrics["RMSE"] - old_metrics["RMSE"]) / old_metrics["RMSE"]
    ) * 100
    return {"MAE Percentage Change": mae_change, "RMSE Percentage Change": rmse_change}

In [10]:
results = {}

for location in df["location"].unique():
    results[location] = {}
    train, test = split_data(df, location)

    for use_event_features in [True, False]:
        if use_event_features:
            # every column except the identifiers and the target
            feature_columns = df.columns.difference(["location", "date", "demand"])
            feature_set = "all"
        else:
            # event features are identified by their "phq_" prefix; exclude them
            feature_columns = [
                col
                for col in df.columns
                if col not in ["location", "date", "demand"]
                and not col.startswith("phq_")
            ]
            feature_set = "existing only"

        X_train = train[feature_columns]
        y_train = train["demand"]
        X_test = test[feature_columns]
        y_test = test["demand"]

        model = train_model(X_train, y_train)
        y_pred = model.predict(X_test)

        plot_results(train, test, y_train, y_test, y_pred, location, feature_set)
        mae, rmse = calculate_metrics(y_test, y_pred)

        results[location][feature_set] = {"MAE": mae, "RMSE": rmse}

    print(f"Location: {location}")
    for features_set in results[location]:
        print(
            f"--- Feature set: {features_set}, MAE: {results[location][features_set]['MAE']:.2f}, RMSE: {results[location][features_set]['RMSE']:.2f}"
        )
    percentage_changes = calculate_percentage_change(
        results[location]["existing only"], results[location]["all"]
    )
    print(
        f"--- Percentage change: MAE change: {percentage_changes['MAE Percentage Change']:.2f}%, RMSE change: {percentage_changes['RMSE Percentage Change']:.2f}%"
    )
    print()

Location: store_0
--- Feature set: all, MAE: 577.52, RMSE: 778.26
--- Feature set: existing only, MAE: 654.43, RMSE: 1101.90
--- Percentage change: MAE change: -11.75%, RMSE change: -29.37%



Location: store_1
--- Feature set: all, MAE: 910.49, RMSE: 1174.40
--- Feature set: existing only, MAE: 1201.36, RMSE: 1698.64
--- Percentage change: MAE change: -24.21%, RMSE change: -30.86%



Location: store_2
--- Feature set: all, MAE: 259.72, RMSE: 334.15
--- Feature set: existing only, MAE: 275.54, RMSE: 358.31
--- Percentage change: MAE change: -5.74%, RMSE change: -6.74%
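
The percentage change is computed relative to the "existing only" model, so a negative value means that adding event features lowered the error. As a worked check, the short sketch below recomputes the `store_0` figures from the rounded MAE and RMSE values printed above. The helper name `pct_change` is introduced here for illustration only.

```python
def pct_change(old, new):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100


# store_0 metrics as printed in the output above (rounded to two decimals)
existing_only = {"MAE": 654.43, "RMSE": 1101.90}
all_features = {"MAE": 577.52, "RMSE": 778.26}

for metric in ("MAE", "RMSE"):
    change = pct_change(existing_only[metric], all_features[metric])
    print(f"{metric}: {change:.2f}%")
# MAE: -11.75%
# RMSE: -29.37%
```

Because the inputs here are already rounded, recomputing other locations this way may differ from the notebook's output in the last digit.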

