# 06 Train Test Split

## Notebook Overview

This notebook performs a **chronological 90/10 split** of the preprocessed dataset to prepare for time series modeling.

**Key Steps:**

* **Input:** Encoded, hourly-resampled dataset with engineered features
* **Sorting:** Sorts the time index in ascending order before splitting
* **Split:** Reserves the final 10% of data as the forecasting holdout set
* **Enrichment:** Applies feature engineering after the split to ensure no target leakage
* **Output:** Saves `train.csv` and `forecast.csv` for baseline and ML model training

> Purpose: Respect temporal order for causal integrity and prevent lookahead bias in forecasting models.

### Thoughts, Tradeoffs & Considerations

* **No shuffle allowed:** Time series models break if past and future are mixed. Chronological order is strictly preserved—this is **not optional** in forecasting tasks.
* **Static 90/10 ratio:** Chose a fixed 90% train / 10% forecast split to simulate realistic deployment scenarios. Could be adjusted later depending on seasonality span or cross-validation design.
* **Forecasting window:** With hourly data, the 10% holdout is 840 rows, or about 35 days (~1.2 months), enough to assess robustness across time patterns (e.g., day/night, weekday/weekend, weather shifts).
* **Index integrity check:** Sorting the datetime index before the split was critical. Found minor time discontinuities earlier; now fully handled upstream.
* **No temporal leakage:** Features like lagged values, weather, and time-based encodings must be computed **only from past data** during model training. This split enforces that discipline.
* **Future tweak:** Could later introduce **rolling window** validation or walk-forward retraining, but for now a single static split is sufficient for baseline modeling and prototyping.

> Main concern was **preserving causality and temporal realism**. Splitting randomly would give better metrics—but lie about deploy-time performance.
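
The walk-forward idea mentioned under "Future tweak" could look roughly like the expanding-window generator below. This is only a sketch: `walk_forward_splits`, `n_folds`, and `test_size` are hypothetical names and are not used anywhere in this notebook.

```python
from typing import Iterator, Tuple


def walk_forward_splits(n_rows: int, n_folds: int, test_size: int) -> Iterator[Tuple[range, range]]:
    """Yield (train_positions, test_positions) for expanding-window validation.

    Hypothetical helper: each fold trains on every row before its test block,
    so no fold sees rows that come after the ones it is evaluated on.
    """
    first_test_start = n_rows - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough rows for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        yield range(0, test_start), range(test_start, test_start + test_size)


# Toy example: 3 folds of one week (168 hourly rows) each over 1,000 rows
folds = list(walk_forward_splits(n_rows=1000, n_folds=3, test_size=168))
for train_pos, test_pos in folds:
    print(len(train_pos), test_pos.start, test_pos.stop)
```

Each pair can be turned into frames with positional slicing, e.g. `df.iloc[train_pos.start:train_pos.stop]`.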

In [41]:
import pandas as pd
import numpy as np
from typing import List

In [42]:
# Show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# widen the column width and overall display width
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 0)

In [43]:
df: pd.DataFrame = pd.read_csv("../data/interim/data_encoded.csv", parse_dates=["time"], index_col="time")

In [44]:
df = df.sort_index()
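
Sorting alone does not reveal duplicated or missing timestamps. A quick report along these lines can confirm the index is clean before splitting. `check_time_index` is a hypothetical helper, and the hourly `step` is an assumption based on the hourly resampling described above; the dates are toy values.

```python
import pandas as pd


def check_time_index(index: pd.DatetimeIndex, step: pd.Timedelta = pd.Timedelta(hours=1)) -> dict:
    """Summarise ordering problems in a datetime index (illustrative helper)."""
    # Every timestamp we would expect between the first and last observation
    expected = pd.date_range(index.min(), index.max(), freq=step)
    return {
        "is_sorted": bool(index.is_monotonic_increasing),
        "n_duplicates": int(index.duplicated().sum()),
        "n_missing": len(expected.difference(index)),
    }


# Toy index with one duplicated hour (01:00) and two missing hours (02:00, 03:00)
idx = pd.DatetimeIndex([
    "2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 01:00", "2016-01-01 04:00",
])
report = check_time_index(idx)
print(report)
```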

In [45]:
# Define split point
split_index = int(len(df) * 0.9)

# Split
train_df = df.iloc[:split_index]
forecast_df = df.iloc[split_index:]
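
A positional split is only chronological if the index was sorted first, so it can be worth asserting that the holdout starts after training ends. The sketch below uses toy data and a hypothetical `describe_split` helper; it also reports the time span the holdout covers.

```python
import pandas as pd


def describe_split(train: pd.DataFrame, holdout: pd.DataFrame) -> dict:
    """Check that the holdout starts strictly after training ends (illustrative helper)."""
    if train.index.max() >= holdout.index.min():
        raise ValueError("holdout overlaps the training period")
    return {
        "train_fraction": len(train) / (len(train) + len(holdout)),
        "holdout_days": (holdout.index.max() - holdout.index.min()) / pd.Timedelta(days=1),
    }


# Toy data: 100 hourly rows, cut by position at row 90
toy = pd.DataFrame(
    {"y": range(100)},
    index=pd.date_range("2016-01-01", periods=100, freq=pd.Timedelta(hours=1)),
)
cut = 90
summary = describe_split(toy.iloc[:cut], toy.iloc[cut:])
print(summary)
```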

In [46]:
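def create_lag_features(df: pd.DataFrame, lags: List[int], roll_windows: List[int]) -> pd.DataFrame:
    """Add lagged values and trailing rolling means of `use_house_overall`."""
    df = df.copy()
    for lag in lags:
        # Target value `lag` rows (hours) earlier
        df[f"lag_{lag}"] = df["use_house_overall"].shift(lag)
    for win in roll_windows:
        # shift(1) keeps the current row out of its own window, so the mean uses past values only
        df[f"roll_mean_{win}"] = df["use_house_overall"].shift(1).rolling(window=win).mean()
    return df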
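
The `shift(1)` before `rolling(...)` is what keeps the rolling mean causal. A toy series (values chosen only for illustration) shows the difference:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Window ends at the current row, so each mean includes the value being predicted
leaky = s.rolling(window=2).mean()

# Window ends one row earlier, so each mean uses strictly past values
causal = s.shift(1).rolling(window=2).mean()

print(pd.DataFrame({"value": s, "leaky": leaky, "causal": causal}))
```

At row 2 (value 30) the leaky mean is 25, which already contains the 30, while the causal mean is 15, built from 10 and 20 only.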

In [47]:
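# Use the same lags and windows for both sets so every lag column is actually populated in the holdout
lags, roll_windows = [1, 2, 3, 6, 12, 24, 48], [3, 6, 12, 24]
train_df = create_lag_features(train_df, lags=lags, roll_windows=roll_windows)
# Prepend enough training history (max lag = 48 rows) so the longest lag is defined at the start of the holdout
forecast_df = create_lag_features(pd.concat([train_df.tail(max(lags)), forecast_df]), lags=lags, roll_windows=roll_windows)
forecast_df = forecast_df.loc[forecast_df.index.difference(train_df.index)]  # only keep new rows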
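
Because `pd.concat` aligns on the union of columns by default, a lag column that exists in the training tail but is not recomputed for the holdout comes through as all-NaN rather than raising an error. A check like the one below can flag that. `all_nan_columns` is a hypothetical helper, and the frames are toy data.

```python
import pandas as pd


def all_nan_columns(df: pd.DataFrame, prefixes: tuple = ("lag_", "roll_mean_")) -> list:
    """Return lag/rolling columns that contain no values at all (illustrative helper)."""
    return [c for c in df.columns if c.startswith(prefixes) and df[c].isna().all()]


# Toy frames: the training tail already carries lag columns, the holdout does not
train_tail = pd.DataFrame({"y": [1.0, 2.0], "lag_1": [0.0, 1.0], "lag_6": [5.0, 6.0]})
holdout = pd.DataFrame({"y": [3.0, 4.0]})

combined = pd.concat([train_tail, holdout], ignore_index=True)
combined["lag_1"] = combined["y"].shift(1)  # recomputed across the boundary
# lag_6 is deliberately NOT recomputed here

holdout_rows = combined.iloc[len(train_tail):]
empty = all_nan_columns(holdout_rows)
print(empty)
```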

In [49]:
def enrich_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Appliance sum
    appliance_cols = ["winecellar", "barn", "fridge", "well", "dishwasher", "microwave"]
    df["appliance_sum"] = df[appliance_cols].sum(axis=1)

    # Furnace binary flag
    df["furnace_on"] = (df["furnace"] > 0).astype(int)

    # Net energy at the previous hour: lagged consumption minus lagged solar generation
    df["net_energy_lag_1"] = df["lag_1"] - df["generated_solar"].shift(1)

    # Hour block (e.g. 0–3 = 0, 4–7 = 1, ..., 20–23 = 5)
    df["hour_block"] = df["hour"] // 4

    # Weekend flag
    df["is_weekend"] = df[["wd_Saturday", "wd_Sunday"]].sum(axis=1).clip(upper=1)

    # Night flag (e.g. 0–6, 22–23)
    df["is_night"] = df["hour"].isin([0, 1, 2, 3, 4, 5, 6, 22, 23]).astype(int)

    # Winter flag (Dec, Jan, Feb)
    df["is_winter"] = df["month"].isin([12, 1, 2]).astype(int)

    # Day of year (for seasonal cycles)
    df["dayofyear"] = df.index.dayofyear
    df["dayofyear_sin"] = np.sin(2 * np.pi * df["dayofyear"] / 365)
    df["dayofyear_cos"] = np.cos(2 * np.pi * df["dayofyear"] / 365)

    return df
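
The sine/cosine pair is there so that the end of one year sits next to the start of the next, which a raw `dayofyear` value cannot express. A small standalone sketch (hypothetical helper names, same 365-day period as above) makes that concrete:

```python
import math
from typing import Tuple


def cyclical(day_of_year: int, period: float = 365.0) -> Tuple[float, float]:
    """Map a day of year onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * day_of_year / period
    return math.sin(angle), math.cos(angle)


def encoded_distance(day_a: int, day_b: int) -> float:
    """Euclidean distance between two days in the (sin, cos) encoding."""
    (sin_a, cos_a), (sin_b, cos_b) = cyclical(day_a), cyclical(day_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# Dec 31 vs Jan 1: 364 apart as raw integers, but neighbours on the circle
print(encoded_distance(365, 1))
# Jan 1 vs early July: roughly half a year apart, close to the maximum distance of 2
print(encoded_distance(1, 183))
```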

In [None]:
train_df = enrich_features(train_df)
forecast_df = enrich_features(forecast_df)

In [50]:
train_df.to_csv("../data/interim/train.csv", index=True)
forecast_df.to_csv("../data/interim/forecast.csv", index=True)

print(f"Train shape: {train_df.shape}, Forecast shape: {forecast_df.shape}")

Train shape: (7559, 61), Forecast shape: (840, 61)
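
Since the index is written to the CSV, downstream notebooks need to parse it back into a `DatetimeIndex`, mirroring the `read_csv` call used at the top of this notebook. The round trip below uses an in-memory buffer and toy values instead of the real files.

```python
import io

import pandas as pd

# Toy frame with a named datetime index, standing in for train_df
original = pd.DataFrame(
    {"use_house_overall": [0.5, 0.7]},
    index=pd.DatetimeIndex(["2016-01-01 00:00", "2016-01-01 01:00"], name="time"),
)

# Write to an in-memory buffer instead of ../data/interim/train.csv
buffer = io.StringIO()
original.to_csv(buffer, index=True)
buffer.seek(0)

# Without parse_dates/index_col the timestamps would come back as plain strings
restored = pd.read_csv(buffer, parse_dates=["time"], index_col="time")
print(type(restored.index).__name__)
```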
