# Q2 – Which DGP Processes Are Best Captured by Each Model?

This notebook investigates which data-generating processes (DGPs) are best approximated by each model's forecast distribution, using KL divergence as the evaluation metric.

### Objective
- Identify the DGPs that each model captures most accurately.
- Compare model fit across processes and context lengths.

### Key Outputs
- 📄 Tables:
  - KL divergence per DGP and day, sorted by model and average KL
  - Separate tables for prices and returns

- 📊 Plots:
  - KL divergence across context lengths

### Notes
- Only models listed in `selected_model_names` are included
- KL divergence is computed between model forecast samples and ground-truth DGP samples
- For price forecasts, sampled price paths are first converted to simple returns, and KL is evaluated on those returns
- Forecast horizons: `Day 2`, `Day 12`, `Day 22` (indices 0, 10, and 20 in `selected_days`)
- Results are sorted within each model by average KL (per DGP)
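
The two steps described in the notes (price-to-return conversion, then a sample-based KL estimate) can be sketched as follows. The actual estimator is `compute_kl_divergence` from `utils.evaluation`, whose implementation is not shown in this notebook; `histogram_kl` below is a hypothetical stand-in that assumes a shared-bin histogram estimate, and `bins` and `eps` are illustrative choices rather than the project's settings.

```python
import numpy as np


def simple_returns(price_paths):
    """Convert price paths of shape (n_paths, n_steps) to simple returns."""
    prices = np.asarray(price_paths, dtype=float)
    return prices[:, 1:] / prices[:, :-1] - 1.0


def histogram_kl(p_samples, q_samples, bins=50, eps=1e-10):
    """Histogram estimate of KL(P || Q) from two 1-D sample sets (assumed estimator)."""
    p_samples = np.asarray(p_samples, dtype=float)
    q_samples = np.asarray(q_samples, dtype=float)
    # Shared bin edges so both histograms are defined on the same support.
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    edges = np.linspace(lo, hi, bins + 1)
    p_counts, _ = np.histogram(p_samples, bins=edges)
    q_counts, _ = np.histogram(q_samples, bins=edges)
    # Smooth empty bins, then normalise the counts to probabilities.
    p = (p_counts + eps) / (p_counts + eps).sum()
    q = (q_counts + eps) / (q_counts + eps).sum()
    return float(np.sum(p * np.log(p / q)))


# Synthetic stand-ins, not data from this study.
rng = np.random.default_rng(0)
true_returns = rng.normal(0.0, 0.01, size=5000)   # plays the role of DGP samples
close_returns = rng.normal(0.0, 0.01, size=5000)  # forecast with the right volatility
wide_returns = rng.normal(0.0, 0.03, size=5000)   # forecast with too much volatility

kl_close = histogram_kl(true_returns, close_returns)
kl_wide = histogram_kl(true_returns, wide_returns)
print(f"matched volatility: {kl_close:.4f}, inflated volatility: {kl_wide:.4f}")
```

A forecast whose volatility matches the reference samples should yield a much smaller value than one with inflated volatility, which is the sense in which lower KL is read as a better fit throughout this notebook.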

In [1]:
# Packages
import pickle
import numpy as np
import pandas as pd
from pathlib import Path

from utils.evaluation import (
    compute_kl_divergence,
    format_pivot_table,
    dataframe_to_latex
)

from utils.plotting import plot_kl_vs_context

# Needed to avoid issues with numpy for TimesFM 2.5
import sys, numpy.core.numeric as numeric
sys.modules['numpy._core.numeric'] = numeric

### Models List

Models can be added to or removed from the following list.

In [2]:
# Selected Models for Analysis
selected_model_names = [
    "chronos_model_tiny",
    "chronos_model_mini",
    "chronos_model_base",
    "lag_llama_model",
    "moirai_model_small",
    "moirai_model_base",
    "moirai_model_small_2_0",   # NEW
    "moirai_model_small_1_1",   # NEW
    "moirai_model_base_1_1",    # NEW
    "toto_model",
    "tirex_model",
    "timesfm_model_small",
    "timesfm_model_large",
    "timesfm_model_2_5"        # NEW
]

In [3]:
# Paths and Setup
results_dir = Path("results_q2_processes")
tables_dir = results_dir / "tables"
plots_dir = {
    name: results_dir / name for name in ["plots_context"]
}

for folder in [tables_dir, *plots_dir.values()]:
    folder.mkdir(parents=True, exist_ok=True)

forecast_dir = Path("forecasts")
run_dir = Path("runfiles")
datasets_dir = Path("datasets")

selected_days = [0, 10, 20]
context_lengths = [22, 66, 252]

dgp_types_kl = ["gbm_low_vol", "gbm_high_vol", "garch", "t_garch", "mixture_normal", "seasonal"]

### Loading the Forecasts

We load each forecast pickle and read its matching run file to recover the model name, DGP, target type, and context length.
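
The run-file format itself is not shown in this notebook. As a minimal sketch, assuming each run file holds plain `key = value` lines, the parsing step might look like the following. The helper name `parse_run_config` and the example file contents are hypothetical; only the four keys are taken from the loading cell. The sketch uses `ast.literal_eval`, which accepts Python literals only and never executes code.

```python
import ast


def parse_run_config(text):
    """Parse `key = value` lines into a dict without executing any code."""
    config = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        try:
            # literal_eval handles numbers, quoted strings, lists, and similar literals.
            config[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Anything else is kept as a plain string, minus surrounding quotes.
            config[key] = value.strip("\"'")
    return config


# Hypothetical run-file contents for illustration.
example = """
model_name = "toto_model"
dataset_name = garch
target_type = 'returns'
context_length = 22
"""
config = parse_run_config(example)
print(config)
```

Unquoted values such as `garch` fall through to the string branch, while `22` is parsed as an integer, so `context_length` can be compared directly against the entries of `context_lengths`.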

In [4]:
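# Load Forecasts
import ast

forecast_files = sorted(forecast_dir.glob("forecast_*.pkl"))
results = []

for forecast_file in forecast_files:
    run_name = forecast_file.stem
    run_file = run_dir / f"{run_name}.txt"
    if not run_file.exists():
        continue  # No run file, so the forecast cannot be labelled

    # Parse `key = value` lines from the run file
    run_config = {}
    with open(run_file, "r") as f:
        for line in f:
            if "=" in line:
                key, value = [x.strip() for x in line.split("=", 1)]
                try:
                    # literal_eval parses literals without executing arbitrary code
                    run_config[key] = ast.literal_eval(value)
                except (ValueError, SyntaxError):
                    # Fall back to the raw string with surrounding quotes removed
                    run_config[key] = value.strip("\"'")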

    try:
        with open(forecast_file, "rb") as f:
            forecast_result = pickle.load(f)
            low, median, high, samples, base_price = forecast_result
    except Exception:
        continue

    results.append({
        "run_name": run_name,
        "model_name": run_config["model_name"],
        "dgp_type": run_config["dataset_name"],
        "target_type": run_config["target_type"],
        "context_length": run_config["context_length"],
        "samples": samples,
        "low": low,
        "median": median,
        "high": high,
        "base_price": base_price
    })

# Filter Results by Selected Models
results = [r for r in results if r["model_name"] in selected_model_names]

price_results = [r for r in results if r["target_type"] == "prices"]
return_results = [r for r in results if r["target_type"] == "returns"]

In [5]:
print("Unique model names found in runfiles:")
for name in sorted(set(r["model_name"] for r in results)):
    print(f"'{name}'")

Unique model names found in runfiles:
'chronos_model_base'
'chronos_model_mini'
'chronos_model_tiny'
'lag_llama_model'
'moirai_model_base'
'moirai_model_base_1_1'
'moirai_model_small'
'moirai_model_small_1_1'
'moirai_model_small_2_0'
'timesfm_model_2_5'
'timesfm_model_large'
'timesfm_model_small'
'tirex_model'
'toto_model'


### Defining Functions

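We define two helper functions for this notebook: one computes the KL divergence per DGP, model, and forecast day along with its average across days, and the other saves the resulting tables sorted within each model.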
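
The within-model sorting can be illustrated on a toy table. This is a simplified sketch rather than the notebook's own function: it drops the `context_length` level, uses `reindex` instead of block-wise concatenation, and the model names and KL values are made up for illustration, not results from this study.

```python
import pandas as pd

# Illustrative values only; these are not results from this notebook.
df_kl = pd.DataFrame({
    "model_name": ["model_a"] * 4 + ["model_b"] * 4,
    "dgp_type": ["garch", "garch", "seasonal", "seasonal"] * 2,
    "day": ["Day 2", "Day 12"] * 4,
    "kl_divergence": [0.40, 0.60, 0.10, 0.20, 0.05, 0.15, 0.30, 0.50],
})

# One row per (model, DGP), one column per forecast day.
pivot = df_kl.pivot_table(
    index=["model_name", "dgp_type"], columns="day", values="kl_divergence"
)

# Average KL across days, then order DGPs from best to worst within each model.
avg_kl = pivot.mean(axis=1).rename("avg_kl")
order = (
    avg_kl.reset_index()
    .sort_values(["model_name", "avg_kl"])
    .set_index(["model_name", "dgp_type"])
    .index
)
pivot_sorted = pivot.reindex(order)
print(pivot_sorted)
```

Each model keeps its own block of rows, and within a block the DGP with the lowest average KL appears first, which is how the saved tables are meant to be read.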

In [6]:
# Compute KL Divergence and Store for Sorting
def compute_kl_with_average(results_subset):
    df_rows = []

    for item in results_subset:
        if item["dgp_type"] not in dgp_types_kl:
            continue

        is_price = item["target_type"] == "prices"
        model_returns = item["samples"]
        if is_price:
            model_returns = model_returns[:, 1:] / model_returns[:, :-1] - 1

        dgp_path = datasets_dir / f"{item['dgp_type']}_returns_paths.npy"
        if not dgp_path.exists():
            continue

        dgp_returns = np.load(dgp_path)

        for day_index in selected_days:
            try:
                p = dgp_returns[:, day_index]
                q = model_returns[:, day_index]
                kl = compute_kl_divergence(p, q)
            except Exception:
                # Skip horizons that are out of range or fail to evaluate
                continue

            df_rows.append({
                "context_length": item["context_length"],
                "dgp_type": item["dgp_type"],
                "model_name": item["model_name"],
                "day": f"Day {day_index + 2}",
                "kl_divergence": kl
            })

    df_kl = pd.DataFrame(df_rows).round(4)

    # Compute average KL across selected days
    df_avg = df_kl.groupby(["context_length", "dgp_type", "model_name"])["kl_divergence"].mean().reset_index()
    df_avg.rename(columns={"kl_divergence": "avg_kl"}, inplace=True)

    return df_kl, df_avg

In [7]:
# Generate Model-wise Sorted KL Tables
def save_sorted_kl_tables_by_model(df_kl, df_avg_kl, label):
    for context in context_lengths:
        df_filtered = df_kl[df_kl["context_length"] == context]
        avg_filtered = df_avg_kl[df_avg_kl["context_length"] == context]

        # KL pivot table: (context, dgp, model) × day
        pivot = df_filtered.pivot_table(
            index=["context_length", "dgp_type", "model_name"],
            columns="day",
            values="kl_divergence"
        )

        # Build fully sorted index per model
        sorted_blocks = []

        for model_name in avg_filtered["model_name"].unique():
            df_model_avg = avg_filtered[avg_filtered["model_name"] == model_name]
            df_model_avg = df_model_avg.sort_values("avg_kl")

            sorted_index = [(context, row["dgp_type"], model_name) for _, row in df_model_avg.iterrows()]
            index_in_pivot = [idx for idx in sorted_index if idx in pivot.index]

            if index_in_pivot:  # Only if there's data
                group_sorted = pivot.loc[index_in_pivot]
                sorted_blocks.append(group_sorted)

        if not sorted_blocks:
            print(f"[SKIP] No data to save for context length {context} and label {label}")
            continue  # Skip this context safely

        pivot_sorted = pd.concat(sorted_blocks)

        # Format & Save
        formatted = format_pivot_table(pivot_sorted, selected_days)
        filename = f"q2_kl_sorted_{label}_context{context}.tex"
        dataframe_to_latex(formatted, tables_dir / filename, preserve_index_order=True)


In [8]:
# Compute and Save KL Tables (by model, correctly sorted)
df_kl_prices, df_avg_prices = compute_kl_with_average(price_results)
df_kl_returns, df_avg_returns = compute_kl_with_average(return_results)

save_sorted_kl_tables_by_model(df_kl_prices, df_avg_prices, "prices")
save_sorted_kl_tables_by_model(df_kl_returns, df_avg_returns, "returns")

### Plotting

Only the KL-versus-context-length figure is plotted here.

In [9]:
# KL vs. Context Length Plots
plot_kl_vs_context(df_kl_prices, plots_dir["plots_context"], "prices")
plot_kl_vs_context(df_kl_returns, plots_dir["plots_context"], "returns")

### Interpretation: Which DGPs Are Best Captured by the Models?

We analyze KL divergence at context length 22 to assess which data-generating processes (DGPs) are most effectively captured by the models. Lower KL indicates better alignment between forecasted and true distributions. The analysis includes both price and return forecasts across selected DGPs.
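
For reference, the standard definition of the divergence is given below. It assumes that `compute_kl_divergence(p, q)` treats its first argument (the DGP samples) as $P$ and its second (the model samples) as $Q$; the helper's actual direction and estimator are not shown in this notebook.

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx \;\ge\; 0
$$

The value is zero only when the two distributions coincide, and it is not symmetric in $P$ and $Q$, so the values reported here should not be compared with divergences computed in the opposite direction.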

**Price Forecasts**

- gbm_high_vol is the most reliably captured process. Most models, including the Chronos variants, Toto, Tirex, and Lag-Llama, achieve very low KL divergence, often below 0.1. This indicates that relatively simple architectures can model this DGP well even with a short historical context.

- seasonal and mixture_normal are captured with moderate success. Chronos models handle these processes reasonably well, while models like Moirai small show higher divergence, especially at intermediate horizons. Tirex and Toto offer strong and stable performance across all forecast days.

- t_garch is the most difficult process for price-level modeling. KL values are consistently high, regardless of architecture. Chronos performs particularly poorly, while Toto and Tirex manage to reduce divergence slightly at shorter horizons, though no model excels on this DGP.

- gbm_low_vol displays inconsistent results. Some models, such as Chronos mini and base, show sharp KL spikes at later forecast days, whereas others like Tirex and Toto remain more stable. This suggests that certain architectures may overfit or misrepresent low-volatility dynamics when predicting prices.

**Return Forecasts**

- Lag-Llama is the top performer. It achieves near-zero KL values across all DGPs except for a modest increase on t_garch. Its ability to capture both simple (gbm_low_vol, gbm_high_vol) and complex (mixture_normal, seasonal) return processes from short contexts is unmatched.

- Toto performs strongly across the board. It handles mixture_normal, seasonal, and gbm_low_vol with KL values consistently below 0.1, showing excellent generalization even at a short context length.

- Tirex offers solid results across nearly all DGPs. While it struggles mildly with t_garch, it maintains low and stable KL divergence elsewhere, reflecting strong return modeling capabilities.

- Moirai (base and small) achieves reasonable alignment on simpler DGPs like gbm_high_vol and seasonal. Although it does not reach the same performance level as Lag-Llama or Toto, it improves steadily across forecast days and maintains moderate KL values overall.

- Chronos (base, mini, tiny) falls short in return forecasting. KL divergence is high across all DGPs, particularly for gbm_high_vol, mixture_normal, and t_garch. This suggests that the Chronos architecture may not effectively capture return dynamics with limited historical input.

**Conclusion**

At context length 22, gbm_high_vol emerges as the most learnable DGP, showing low KL across nearly all models. In contrast, t_garch proves most challenging, especially for models like Chronos and Moirai. Lag-Llama, Toto, and Tirex stand out as robust performers, consistently delivering accurate forecasts across both simple and complex DGPs in the return and price spaces.
