### Advanced Models for ETTh1 Forecasting Benchmark

This example applies advanced forecasting models to the ETTh1 (Electricity Transformer Temperature, Hourly) benchmark, a widely adopted standard in the forecasting community. The dataset contains hourly electricity transformer measurements with daily and seasonal cycles alongside irregular fluctuations, which makes it a demanding test for state-of-the-art forecasting models.

To ensure computational efficiency in this notebook, we use a subset of 1,000 rows and reduced `forecasting_horizon` and `observation_length` parameters compared to standard benchmark configurations.

### Prerequisites

In [None]:
import sys
import pathlib

import time

import pandas as pd
from tqdm.auto import tqdm

sys.path.append(pathlib.Path().resolve().parent.as_posix())

from inait import predict_test, score_test, plot, load_credentials

base_url, auth_key = load_credentials("../credentials.txt")

### Load the data and split it into train/test 

The ETTh1 dataset (source: Zhou, Haoyi, et al. ["Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting"](https://arxiv.org/pdf/2012.07436), 2021) contains hourly electricity transformer measurements across 7 variables, including electricity loads and temperatures. Our goal is to predict all 7 variables simultaneously for the next 12 hours using three progressively more sophisticated models: `Inait-basic`, `Inait-advanced`, and `Inait-best`. Note that higher-performing models require longer computation times.

Model evaluation follows standard machine learning practices: we reserve a portion of the dataset as a test set for performance assessment, while using the remaining data for training. The test set remains unseen during training to provide an unbiased evaluation. We measure performance using Mean Absolute Error (MAE); lower MAE indicates better accuracy.
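
As a minimal sketch of the metric, the snippet below computes MAE on a small made-up example with NumPy. The helper name `mean_absolute_error` and the toy values are illustrative assumptions; how `score_test` aggregates across forecast windows and variables is not shown in this notebook, so this only illustrates the basic definition.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference over every time step and variable."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Toy example: 3 time steps, 2 variables (values are made up)
y_true = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]])
y_pred = np.array([[1.5, 9.0], [2.0, 13.0], [2.5, 14.0]])

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 0.5
```

Because every variable contributes in its own units, variables with larger magnitudes can dominate an MAE averaged this way.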

In [None]:
data_path = "../data/etth1_small.csv"
data = pd.read_csv(data_path, index_col=0)
data = data[sorted(data.columns)]
plot(historical_data=data)

In [None]:
# Configure prediction parameters
target_columns = data.columns.tolist()  # Use all columns as targets

forecasting_horizon = 12  # Predict 12 hours ahead
observation_length = 24  # Use last 24 hours as historical context

test_size = 5  # Evaluate model performance on the last 5 steps

models = ["inait-basic", "inait-advanced", "inait-best"]

**Note:** The next cell may take a few minutes to run. 

In [None]:
scores, predictions = {}, {}
for model in tqdm(models, leave=True, postfix=f"Evaluating models {', '.join(models)}"):
    prediction = predict_test(
        base_url=base_url,
        auth_key=auth_key,
        data=data,
        target_columns=target_columns,
        forecasting_horizon=forecasting_horizon,
        observation_length=observation_length,
        model=model,
        test_size=test_size,
    )["predictions"]
    predictions[model] = prediction
    scores[model] = score_test(predictions=prediction, ground_truth=data, metric="mae")
    time.sleep(1)

scores_df = pd.DataFrame.from_dict(scores, orient="index", columns=["MAE"])
scores_df

### Comparison against open-source baseline models

From the Mean Absolute Error of the three models shown above, we can already see that more complex approaches tend to yield better results.

To put our results into perspective, we compare the inait models against traditional forecasting baselines implemented with open-source libraries. We evaluate two common baselines:
- Seasonal Naive model: repeats the last `forecasting_horizon` observed values as the forecast.
- Linear regression model: fits a linear relationship between past observations and future values.
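
The two baselines can be sketched on a toy univariate series as follows. This is an illustration under stated assumptions, not the notebook's implementation: the helper `make_windows`, the toy series, and the window sizes are hypothetical, and ordinary least squares via `np.linalg.lstsq` stands in for scikit-learn's `LinearRegression`.

```python
import numpy as np

# Toy univariate series with a period of 4 (values are made up)
series = np.tile([1.0, 2.0, 3.0, 4.0], 6)  # 24 points
obs_len, horizon = 8, 4


def make_windows(values, obs_len, horizon):
    """Slice a series into (observation window, forecast target) pairs."""
    X, y = [], []
    for i in range(len(values) - obs_len - horizon + 1):
        X.append(values[i : i + obs_len])
        y.append(values[i + obs_len : i + obs_len + horizon])
    return np.array(X), np.array(y)


X, y = make_windows(series, obs_len, horizon)

# Seasonal naive: repeat the last `horizon` observed values
naive_forecast = X[:, -horizon:]

# Linear baseline: least-squares map from each window (plus intercept) to its target
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_forecast = A @ coef

print(X.shape, y.shape)  # (13, 8) (13, 4)
print(np.abs(naive_forecast - y).mean())  # 0.0, the toy series is perfectly periodic
```

On real data neither baseline is exact. The seasonal naive forecast is only as good as the match between the repeated window and the true seasonality, and the linear model is evaluated here on the same windows it was fitted on.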

While these tools are freely available, using them effectively still requires solid forecasting and data science expertise, as the amount of code in the next cell shows.

**Note:** These notebooks are designed to run in seconds. The delay you're seeing with the scikit-learn model is simply due to the shared MyBinder server we're using for this zero-setup demo. For comparison, our tests on a laptop and in GitHub Codespaces show a runtime of less than 5 seconds.

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.base import BaseEstimator, RegressorMixin


class SeasonalNaiveBaseline(BaseEstimator, RegressorMixin):
    """Seasonal Naive baseline that repeats the last forecasting_horizon observed values"""

    def __init__(self, strategy="last"):
        self.strategy = strategy

    def fit(self, X, y):
        # For naive baseline, we don't need to fit anything
        return self

    def predict(self, X):
        if self.strategy == "last":
            # X shape: (n_samples, obs_len * n_features)
            n_samples, n_features_flat = X.shape
            n_features = len(target_columns)  # Assuming target_columns is available
            obs_len = n_features_flat // n_features

            # Reshape X to (n_samples, obs_len, n_features)
            X_reshaped = X.reshape(n_samples, obs_len, n_features)

            # Take the last forecasting_horizon observations for each sample
            last_obs = X_reshaped[
                :, -forecasting_horizon:, :
            ]  # (n_samples, forecasting_horizon, n_features)

            # Flatten to (n_samples, forecasting_horizon * n_features)
            predictions = last_obs.reshape(n_samples, forecasting_horizon * n_features)

            return predictions

        # Fallback: zeros with the same shape as a flattened forecast
        return np.zeros((X.shape[0], forecasting_horizon * len(target_columns)))


def predict_sklearn(
    data,
    target_columns,
    forecasting_horizon,
    observation_length,
    estimator=None,
    train_size=None,
    test_size=None,
):
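    """
    Forecast with a scikit-learn estimator trained on flattened sliding windows.

    Pass estimator="naive" for the seasonal naive baseline; None defaults to LinearRegression.
    """
    if estimator == "naive":
        estimator = SeasonalNaiveBaseline()
    elif estimator is None:
        # Only fall back to the default when no estimator was passed in
        estimator = LinearRegression()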

    if train_size is not None and test_size is not None:
        raise ValueError(
            "Both train_size and test_size cannot be specified at the same time. Please specify only one of them."
        )
    if train_size is not None:
        split_idx = int(len(data) * train_size)
    elif test_size is not None:
        split_idx = len(data) - test_size - forecasting_horizon
    else:
        train_size = 0.8  # Default to 80% training data
        split_idx = int(len(data) * train_size)

    train_data = data.iloc[:split_idx]

    # Build (observation window, forecast target) pairs from the training data
    def create_sequences(data, obs_len, horizon):
        X, y = [], []

        for i in range(len(data) - obs_len - horizon + 1):
            # Get observation window
            window = data.iloc[i : i + obs_len].values

            X.append(window.flatten())
            y.append(data.iloc[i + obs_len : i + obs_len + horizon].values)
        return np.array(X), np.array(y)

    # Train model on training sequences only
    X_train, y_train = create_sequences(
        train_data[target_columns], observation_length, forecasting_horizon
    )
    y_train_flat = y_train.reshape(y_train.shape[0], -1)

    # Fit model
    if isinstance(estimator, SeasonalNaiveBaseline):
        model = estimator
        model.fit(X_train, y_train_flat)
    else:
        model = MultiOutputRegressor(estimator)
        model.fit(X_train, y_train_flat)

    # Generate predictions for test period
    predictions = []
    start_test_idx = split_idx

    for t in tqdm(range(start_test_idx + 1, len(data) - forecasting_horizon + 1)):
        # Get test window, restricted to the same columns used for training
        test_window = data[target_columns].iloc[t - observation_length : t].values

        X_test = test_window.flatten().reshape(1, -1)

        y_pred_flat = model.predict(X_test)
        y_pred = y_pred_flat.reshape(forecasting_horizon, len(target_columns))

        # Create prediction DataFrame
        pred_df = pd.DataFrame(
            y_pred,
            columns=target_columns,
            index=data.index[t : t + forecasting_horizon],
        )
        predictions.append(pred_df)

    estimator_name = (
        "seasonal_naive"
        if isinstance(estimator, SeasonalNaiveBaseline)
        else type(estimator).__name__
    )
    session_ids = [f"{estimator_name}_session_{i}" for i in range(len(predictions))]

    return predictions, session_ids


# Seasonal Naive Baseline
print("Running sklearn for Seasonal Naive model...")
naive_predictions, naive_sessions = predict_sklearn(
    data=data,
    target_columns=target_columns,
    forecasting_horizon=forecasting_horizon,
    observation_length=observation_length,
    estimator="naive",
    test_size=test_size,
)


# Linear Regression
print("Running sklearn for Linear model...")
linear_predictions, linear_sessions = predict_sklearn(
    data=data,
    target_columns=target_columns,
    forecasting_horizon=forecasting_horizon,
    observation_length=observation_length,
    estimator=LinearRegression(),
    test_size=test_size,
)

# Add the "_predicted" suffix so the baseline predictions match the format of the inait predictions
predictions["Seasonal Naive from scratch"] = [
    df.add_suffix("_predicted") for df in naive_predictions
]
predictions["Linear from scratch"] = [
    df.add_suffix("_predicted") for df in linear_predictions
]

scores_df = pd.concat(
    [
        pd.DataFrame(
            score_test(predictions=naive_predictions, ground_truth=data, metric="mae"),
            columns=["MAE"],
            index=["Seasonal Naive from scratch"],
        ),
        pd.DataFrame(
            score_test(predictions=linear_predictions, ground_truth=data, metric="mae"),
            columns=["MAE"],
            index=["Linear from scratch"],
        ),
        scores_df,
    ],
    axis=0,
).round(4)

### Performance comparison visualization

The plot below compares the inait models with open-source baseline implementations. For clarity, we show only the last prediction for each model.

In [None]:
plot(
    historical_data=data.loc[
        : predictions[models[0]][-1].index[-1], :
    ],  # Show all historical data up to the last prediction
    predicted_data={
        key: values[-1] for key, values in predictions.items()
    },  # Get the last test set for each model
)

Next, we compare the models in terms of Mean Absolute Error.

In [None]:
import plotly.express as px

# Create vertical bar plot with green color scale based on values
fig = px.bar(
    scores_df,
    x=scores_df.index,
    y="MAE",
    color="MAE",
    color_continuous_scale="Greens_r",  # Reversed greens (darker for lower values)
    title="Model Performance Comparison",
    labels={"x": "Models", "y": "MAE"},
    text="MAE",  # Add values on bars
)

# Update layout for better readability
fig.update_layout(
    xaxis_title="Models",
    yaxis_title="MAE (lower is better)",
    showlegend=False,
    yaxis=dict(
        range=[
            scores_df["MAE"].min() * 0.9,
            scores_df["MAE"].max() * 1.1,
        ]
    ),
    coloraxis_showscale=False,
)

fig.update_traces(texttemplate="%{text:.4f}", textposition="outside")

fig.show()

### Comments on results

The models capture the ground truth patterns for most variables, demonstrating strong forecasting performance. Some variables show larger prediction errors, but keep in mind that this is a simplified run optimized for notebook execution time.
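
To see which variables account for the larger errors, one option is to break the MAE down per column. The sketch below does this with pandas on made-up values. The column names `HUFL` and `OT` come from the ETTh1 dataset and the `_predicted` suffix mirrors the prediction format used earlier, but the numbers and the names `truth`, `forecast`, and `per_variable_mae` are illustrative assumptions.

```python
import pandas as pd

# Toy ground truth and forecast for two variables (values are made up)
index = pd.to_datetime(
    ["2016-07-01 00:00", "2016-07-01 01:00", "2016-07-01 02:00"]
)
truth = pd.DataFrame({"HUFL": [5.0, 6.0, 7.0], "OT": [30.0, 31.0, 32.0]}, index=index)
forecast = pd.DataFrame(
    {"HUFL_predicted": [5.5, 6.0, 6.5], "OT_predicted": [28.0, 31.0, 35.0]},
    index=index,
)

# Strip the suffix so forecast columns align with the ground-truth columns
aligned = forecast.rename(columns=lambda c: c.removesuffix("_predicted"))

# Mean absolute error per variable, largest first
per_variable_mae = (truth - aligned).abs().mean().sort_values(ascending=False)
print(per_variable_mae)
```

In this toy case `OT` has the larger error (about 1.67 versus about 0.33 for `HUFL`), so it would dominate an MAE averaged over all variables.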

The plot below shows the results obtained with the full benchmark configuration, compared against state-of-the-art pretrained models from leading competitors.


<div align="center">
<img src="../assets/benchmark_etth1_inait.png" alt="Benchmark comparison results" style="width: 60%;">
</div>