# 04. Optimized Forecasting: Multi-Step & Cross-Validation

**Objective**: Address the limitations of the previous "Next Hour" model by implementing a production-grade 24-hour forecasting pipeline using **Recursive Forecasting** and validating it with **Time Series Cross-Validation**.

## 1. Imports & Robust Data Loading

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
import warnings
warnings.filterwarnings('ignore')

pd.options.plotting.backend = "plotly"

# --- 1. Robust Data Loading Strategy ---
def load_and_clean_data(filepath, city='London'):
    df = pd.read_csv(filepath)
    df['last_updated'] = pd.to_datetime(df['last_updated'])
    df_city = df[df['location_name'] == city].sort_values('last_updated').set_index('last_updated')
    
    # Resample to Hourly
    df_hourly = df_city[['temperature_celsius']].resample('H').mean()
    
    # OPTIMIZATION: Forward Fill with Limit=3 (Avoids long flat lines)
    df_clean = df_hourly.ffill(limit=3).dropna()
    
    print(f"Initial Rows: {len(df_hourly)}, Cleaned Rows: {len(df_clean)}")
    return df_clean

df = load_and_clean_data('../data/raw/GlobalWeatherRepository.csv')

Initial Rows: 14087, Cleaned Rows: 2337
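The row counts above show that most hourly slots are still empty after the limited forward fill and are dropped, so the cleaned index is no longer a regular hourly grid. The following is a minimal sketch of that behaviour on a made-up 12-hour series (the values and the `temps` name are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy hourly series with a 2-hour gap and a 5-hour gap (values are made up)
idx = pd.date_range("2024-01-01", periods=12, freq="60min")
temps = pd.Series(
    [10.0, np.nan, np.nan, 11.0, np.nan, np.nan, np.nan, np.nan, np.nan, 12.0, 12.5, 13.0],
    index=idx,
)

# Forward fill at most 3 consecutive missing hours, then drop what is still missing
filled = temps.ffill(limit=3)
cleaned = filled.dropna()

# Time difference between consecutive surviving rows
gaps = cleaned.index.to_series().diff().dropna()

print(f"Still missing after ffill(limit=3): {filled.isna().sum()}")
print(f"Rows kept: {len(cleaned)} of {len(temps)}")
print(f"Largest step between kept rows: {gaps.max()}")
```

The short gap is filled completely, while the long gap leaves a hole in the index. Any later step that assumes one row per hour, such as positional lags, should keep this in mind.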


## 2. Advanced Feature Engineering

We add **Interaction Terms** (e.g., how the daily cycle interacts with seasonal trends) and standard lags.

In [2]:
def create_features(data):
    df = data.copy()
    
    # Basic Time Features
    df['hour'] = df.index.hour
    df['month'] = df.index.month
    
    # Cyclical Encoding
    df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
    df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
    
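    # Lags (autoregressive)
    # NOTE: shift() is positional. Rows with unfilled gaps were dropped during loading,
    # so lag_24 is 24 rows back, which is not always 24 hours back.
    for lag in [1, 2, 3, 24, 48, 168]:
        df[f'lag_{lag}'] = df['temperature_celsius'].shift(lag)

    # Rolling windows (trend)
    # NOTE: each window ends at the current row, so it includes the target value itself.
    # Using .shift(1).rolling(24) instead would restrict it to past observations.
    df['rolling_mean_24'] = df['temperature_celsius'].rolling(24).mean()
    df['rolling_std_24'] = df['temperature_celsius'].rolling(24).std()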
    
    # OPTIMIZATION: Interaction Features
    # Capture "Morning in Winter" vs "Morning in Summer"
    df['hour_x_month'] = df['hour'] * df['month']
    
    return df.dropna()

df_features = create_features(df)
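The rolling features in `create_features` are computed over a window that ends at the current row, so they include the temperature the model is asked to predict. A minimal sketch of the difference between that window and one shifted back by a row, using a toy series (the names `window_incl` and `window_excl` are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1.0, 31.0))  # toy values 1, 2, ..., 30

window_incl = s.rolling(24).mean()           # window ends at the current row
window_excl = s.shift(1).rolling(24).mean()  # window ends at the previous row

# Change only the last value and recompute both features for the last row
s_changed = s.copy()
s_changed.iloc[-1] = 1000.0
incl_changed = s_changed.rolling(24).mean().iloc[-1]
excl_changed = s_changed.shift(1).rolling(24).mean().iloc[-1]

print(f"Including current row: {window_incl.iloc[-1]} -> {incl_changed}")
print(f"Excluding current row: {window_excl.iloc[-1]} -> {excl_changed}")
```

Only the shifted version is unaffected by the current value, which makes it the variant that can also be computed at forecast time, when that value is not yet known.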

## 3. Time Series Cross-Validation

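Instead of a single random split, we use `TimeSeriesSplit`, which always trains on earlier rows and tests on the block that follows. With default settings the training window expands with each fold, simulating periodic retraining as more history accumulates.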

In [3]:
tscv = TimeSeriesSplit(n_splits=5)
features = [c for c in df_features.columns if c != 'temperature_celsius']
target = 'temperature_celsius'

mae_scores = []

for fold, (train_index, test_index) in enumerate(tscv.split(df_features), start=1):
    train = df_features.iloc[train_index]
    test = df_features.iloc[test_index]
    
    model = XGBRegressor(n_estimators=500, learning_rate=0.05, n_jobs=-1)
    model.fit(train[features], train[target])
    
    preds = model.predict(test[features])
    score = mean_absolute_error(test[target], preds)
    mae_scores.append(score)
    
    print(f"Fold {fold}: MAE = {score:.4f} °C")

print(f"\nAverage CV MAE: {np.mean(mae_scores):.4f} °C")

Fold 1: MAE = 3.9749 °C
Fold 2: MAE = 2.0508 °C
Fold 3: MAE = 1.1915 °C
Fold 4: MAE = 1.0392 °C
Fold 5: MAE = 1.9095 °C

Average CV MAE: 2.0332 °C
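To make the fold layout concrete, here is a small sketch on 12 dummy rows. It compares the default `TimeSeriesSplit` with a fixed-length training window obtained through the `max_train_size` parameter. The fold count and window size are arbitrary choices for illustration, not the settings used above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 dummy rows in time order

# Default behaviour: the training window grows with every fold
expanding = [(tr.tolist(), te.tolist())
             for tr, te in TimeSeriesSplit(n_splits=3).split(X)]

# Capping the training size turns it into a fixed-length sliding window
sliding = [(tr.tolist(), te.tolist())
           for tr, te in TimeSeriesSplit(n_splits=3, max_train_size=3).split(X)]

for name, folds in [("expanding", expanding), ("sliding", sliding)]:
    print(name)
    for tr, te in folds:
        print(f"  train={tr} test={te}")
```

In both variants every training index precedes every test index, so no fold is scored on data older than what it was trained on.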


## 4. Multi-Step Forecasting (Recursive Approach)

A "Next Hour" model predicts only one step ahead. To produce a 24-hour forecast, we apply it recursively: each prediction is fed back as a lag feature for the next step, so errors can accumulate as the horizon grows.

In [4]:
# Hold out the last 24 rows for the demo and train on everything before them
train_full = df_features.iloc[:-24]
test_future = df_features.iloc[-24:].copy()

final_model = XGBRegressor(n_estimators=1000, learning_rate=0.01, n_jobs=-1)
final_model.fit(train_full[features], train_full[target])

# Temperatures known at forecast time; predictions are appended as we go
history = df.loc[:train_full.index[-1], 'temperature_celsius'].tolist()
predictions = []

for ts in test_future.index:
    # 1. Build the feature row for timestamp `ts` from values known so far
    row = {'hour': ts.hour, 'month': ts.month}
    row['hour_sin'] = np.sin(2 * np.pi * ts.hour / 24)
    row['hour_cos'] = np.cos(2 * np.pi * ts.hour / 24)
    for lag in [1, 2, 3, 24, 48, 168]:
        row[f'lag_{lag}'] = history[-lag]  # positional lags, as in create_features
    # Rolling statistics over the 24 most recent known or predicted values
    # (the value at `ts` itself is unknown at forecast time)
    row['rolling_mean_24'] = np.mean(history[-24:])
    row['rolling_std_24'] = np.std(history[-24:], ddof=1)
    row['hour_x_month'] = ts.hour * ts.month

    # 2. Predict, then feed the prediction back into the history (recursive feedback)
    X_next = pd.DataFrame([row], index=[ts])[features]
    pred_temp = final_model.predict(X_next)[0]
    predictions.append(pred_temp)
    history.append(pred_temp)

# Visualize 24h Horizon
comparison = pd.DataFrame({
    'Actual': test_future['temperature_celsius'].values,
    'Forecast_24h_Recursive': predictions
}, index=test_future.index)

fig = px.line(comparison, title="24-Hour Recursive Forecast vs Actuals")
fig.show()
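The plot gives a visual comparison only. One way to quantify the forecast is sketched below: a small helper that reports the overall MAE and RMSE plus the absolute error at each step of the horizon. The function name `horizon_report` and the four sample values are assumptions made for illustration.

```python
import numpy as np

def horizon_report(actual, forecast):
    """Summarise a multi-step forecast: overall MAE/RMSE and per-step absolute error."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    err = actual - forecast
    return {
        "mae": np.abs(err).mean(),
        "rmse": np.sqrt((err ** 2).mean()),
        "abs_err_by_step": np.abs(err),
    }

# Made-up 4-step example
actual = [10.0, 11.0, 12.0, 13.0]
forecast = [10.5, 11.0, 13.0, 11.0]
report = horizon_report(actual, forecast)

print(f"MAE:  {report['mae']:.3f}")
print(f"RMSE: {report['rmse']:.3f}")
print(f"Worst step: {int(report['abs_err_by_step'].argmax()) + 1}")
```

In this notebook it could be called as `horizon_report(comparison['Actual'], comparison['Forecast_24h_Recursive'])`. Looking at the per-step errors shows whether accuracy degrades toward the end of the 24-hour horizon, which is the typical failure mode of recursive forecasts.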