# 06 Train Test Split

## Notebook Overview

This notebook performs a **chronological 80/20 split** of the preprocessed dataset to prepare for time series modeling.

**Key Steps:**

* **Input:** Encoded, hourly-resampled dataset with engineered features
* **Sorting:** Sorts the datetime index into chronological order before splitting
* **Split:** Reserves the final 20% of data as the forecasting holdout set
* **Output:** Saves `train.csv` (for fitting baseline and ML models) and `forecast.csv` (the holdout used to evaluate them)

> Purpose: Respect temporal order for causal integrity and prevent lookahead bias in forecasting models.

### Thoughts, Tradeoffs & Considerations

* **No shuffle allowed:** Time series evaluation breaks if past and future are mixed. Chronological order is strictly preserved; this is **not optional** in forecasting tasks.
* **Static 80/20 ratio:** Chose a fixed 80% train / 20% forecast split to simulate realistic deployment scenarios. Could be adjusted later depending on seasonality span or cross-validation design.
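* **Forecasting window:** With hourly data, the 20% holdout was sized at \~2.4 months, enough to assess robustness across time patterns (e.g., day/night, weekday/weekend, weather shifts). The run recorded at the end of this notebook yields only 28 holdout rows, so the actual span depends on the size of the input file.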
* **Index integrity check:** Sorting the datetime index before the split was critical. Found minor time discontinuities earlier; now fully handled upstream.
* **No temporal leakage:** Features like lagged values, weather, and time-based encodings must be computed **only from past data** during model training. This split reinforces that discipline by keeping the holdout strictly later than the training rows.
* **Future tweak:** Could later introduce **rolling window** validation or walk-forward retraining, but for now a single static split is sufficient for baseline modeling and prototyping.
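
The leakage point above can be illustrated with a minimal sketch. The toy series and the column names (`load`, `load_lag1`, `load_roll3_past`) are assumptions for illustration, not columns from the actual dataset. The idea is that shifting before any rolling aggregate keeps each row's features limited to earlier timestamps.

```python
import pandas as pd

# Hypothetical hourly series; "load" is an illustrative column name.
idx = pd.date_range("2024-01-01", periods=6, freq="h")
demo = pd.DataFrame({"load": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]}, index=idx)

# shift(1) moves each value one row later, so row t only sees the value from t-1.
demo["load_lag1"] = demo["load"].shift(1)

# Shift first, then roll: the 3-hour mean excludes the current hour entirely.
demo["load_roll3_past"] = demo["load"].shift(1).rolling(window=3).mean()

print(demo)
```

The first rows of each derived column are `NaN` because no past data exists yet; how to handle them (drop or impute) is a separate modeling choice.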

> Main concern was **preserving causality and temporal realism**. Splitting randomly would give better metrics but misrepresent deploy-time performance.
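
For the walk-forward idea mentioned under "Future tweak", one possible shape is an expanding-window splitter. This is only a sketch: `walk_forward_splits`, the fold count, and the test size are hypothetical choices, and the 140-row frame is synthetic.

```python
import pandas as pd

def walk_forward_splits(n_rows, n_folds, test_size):
    """Yield (train_end, test_end) positions for expanding-window folds."""
    first_train_end = n_rows - n_folds * test_size
    if first_train_end <= 0:
        raise ValueError("not enough rows for the requested folds")
    for k in range(n_folds):
        train_end = first_train_end + k * test_size
        yield train_end, train_end + test_size

# Synthetic hourly frame standing in for the real dataset.
idx = pd.date_range("2024-01-01", periods=140, freq="h")
demo = pd.DataFrame({"y": range(140)}, index=idx)

folds = []
for train_end, test_end in walk_forward_splits(len(demo), n_folds=4, test_size=7):
    train_fold = demo.iloc[:train_end]           # everything before the test window
    test_fold = demo.iloc[train_end:test_end]    # the next block in time
    folds.append((len(train_fold), len(test_fold)))

print(folds)
```

Each fold trains on all rows before its test window, so no fold ever sees data from its own future. scikit-learn's `TimeSeriesSplit` offers a similar scheme if that dependency is already in use.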

In [1]:
import pandas as pd

In [2]:
# Show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# widen the column width and overall display width
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 0)

In [3]:
# Load the encoded dataset; parse "time" as datetimes and use it as the index
df: pd.DataFrame = pd.read_csv("../data/interim/data_encoded.csv", parse_dates=["time"], index_col="time")

In [4]:
# Sort chronologically so the positional split below follows time order
df = df.sort_index()
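
Sorting alone does not guarantee a clean index, so a quick integrity check after the sort may be worthwhile. The sketch below uses a small synthetic frame and assumes a one-hour expected step; the variable names are illustrative.

```python
import pandas as pd

# Synthetic, deliberately unordered index with one missing hour (03:00).
idx = pd.to_datetime(
    ["2024-01-01 02:00", "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 04:00"]
)
demo = pd.DataFrame({"y": [3, 1, 2, 5]}, index=idx).sort_index()

# After sorting, the index should be non-decreasing and free of duplicates.
is_sorted = demo.index.is_monotonic_increasing
is_unique = demo.index.is_unique

# Differences between consecutive timestamps; anything other than 1 hour is a gap.
steps = demo.index.to_series().diff().dropna()
irregular = steps[steps != pd.Timedelta(hours=1)]

print(is_sorted, is_unique)
print(irregular)
```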

In [5]:
# Define split point: position of the first holdout row (first 80% of rows go to train)
split_index = int(len(df) * 0.8)

# Split
train_df = df.iloc[:split_index]
forecast_df = df.iloc[split_index:]
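
A short sanity check on the split boundary can confirm that the two parts cover every row and do not overlap in time. The sketch below repeats the same 80/20 logic on a synthetic 140-row hourly frame; the names `split_pos`, `train_part`, and `holdout_part` are illustrative only.

```python
import pandas as pd

# Synthetic hourly frame; 140 rows matches the shape printed by this notebook.
idx = pd.date_range("2024-01-01", periods=140, freq="h")
demo = pd.DataFrame({"y": range(140)}, index=idx)

split_pos = int(len(demo) * 0.8)
train_part = demo.iloc[:split_pos]
holdout_part = demo.iloc[split_pos:]

# Every row lands in exactly one part, and train ends before the holdout begins.
assert len(train_part) + len(holdout_part) == len(demo)
assert train_part.index.max() < holdout_part.index.min()

# Time covered by the holdout, from its first to its last timestamp.
holdout_span = holdout_part.index.max() - holdout_part.index.min()
print(len(train_part), len(holdout_part), holdout_span)
```

Computing the span directly from the index is a more reliable way to state the holdout length than estimating it from the ratio.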

In [6]:
# Persist both parts, keeping the datetime index as the first column
train_df.to_csv("../data/interim/train.csv", index=True)
forecast_df.to_csv("../data/interim/forecast.csv", index=True)

print(f"Train shape: {train_df.shape}, Forecast shape: {forecast_df.shape}")

Train shape: (112, 36), Forecast shape: (28, 36)
