# Simulating Data-Generating Processes (DGPs)

This notebook defines and simulates the data-generating processes (DGPs) used throughout the experiments. Every DGP produces a **single historical trajectory**; each stochastic DGP additionally produces **multiple future trajectories** (the deterministic `constant` and `linear` DGPs set `generate_paths` to `False` and skip this step). These outputs are the ground truth against which model forecasts are later evaluated using KL divergence.

For each DGP:
- We simulate a **price series** of 1000 trading days, serving as historical context.
- We compute and save the **daily returns** for that path.
- From the final historical price, we generate **1000 sample paths of 22 forecast days** to represent the ground-truth forecast distribution (stochastic DGPs only).
- We compute and save the corresponding **return paths** from the forecast price paths.

All outputs are saved locally for reproducibility and reusability across notebooks.
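
The four steps listed above can be sketched end to end with plain NumPy. This is a minimal illustration only: `toy_price_series` and `toy_forecast_paths` are hypothetical stand-ins that draw driftless Gaussian daily returns, and the actual simulators in `utils.simulations` may be implemented differently. The sketch also assumes that column 0 of each forecast path holds the starting price.

```python
import numpy as np

# Hypothetical stand-ins for one stochastic DGP (driftless Gaussian daily returns).
# The notebook's real simulators live in utils.simulations and may differ.
def toy_price_series(n_days, start_price, daily_vol, rng):
    returns = rng.normal(0.0, daily_vol, size=n_days - 1)
    return start_price * np.concatenate([[1.0], np.cumprod(1.0 + returns)])

def toy_forecast_paths(last_price, horizon, n_paths, daily_vol, rng):
    returns = rng.normal(0.0, daily_vol, size=(n_paths, horizon))
    growth = np.cumprod(1.0 + returns, axis=1)
    # Assumption: column 0 is the starting price, so each path has horizon + 1 columns.
    return last_price * np.hstack([np.ones((n_paths, 1)), growth])

rng = np.random.default_rng(42)

prices = toy_price_series(1000, 100.0, 0.005, rng)            # historical prices
returns = prices[1:] / prices[:-1] - 1                        # historical simple returns
paths = toy_forecast_paths(prices[-1], 22, 1000, 0.005, rng)  # forecast price paths
return_paths = paths[:, 1:] / paths[:, :-1] - 1               # forecast return paths

print(prices.shape, returns.shape, paths.shape, return_paths.shape)
# (1000,) (999,) (1000, 23) (1000, 22)
```

Under that column-0 assumption, 22 forecast days correspond to 23 price columns and 22 return columns per path.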


In [None]:
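# Packages and local modules
import numpy as np
from pathlib import Path
# Wildcard import is deliberate: the run loop below looks up the
# simulate_*_prices / forecast_*_paths functions by name via globals().
from utils.simulations import *
import inspect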

### Global Simulation Parameters

We fix the following parameters across all DGPs to ensure consistency:

- `seed = 42`: ensures reproducibility of all stochastic outputs.
- `trading_days = 1000`: number of daily steps in the historical time series.
- `forecast_days = 22`: length of each forecast horizon.
- `n_samples = 1000`: number of Monte Carlo paths generated for each DGP at forecast time.
- `initial_price = 100.0`: starting point for every historical path.

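All files are saved under the `datasets/` directory. Each stochastic DGP results in four saved artifacts; the deterministic DGPs (`constant`, `linear`) produce only the first two:
- Historical price series (`<name>_prices.csv`)
- Historical return series (`<name>_returns.csv`)
- Forecast price paths (`<name>_paths.npy`)
- Forecast return paths (`<name>_returns_paths.npy`)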
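
The file names follow a fixed pattern, which the run loop later in this notebook builds inline. As a small sketch, the same pattern can be collected in one place; `artifact_paths` and its `has_forecasts` flag are hypothetical names introduced here for illustration and are not part of the notebook's code.

```python
from pathlib import Path

def artifact_paths(dgp_name, folder=Path("datasets"), has_forecasts=True):
    """Return the output files for one DGP (hypothetical helper)."""
    paths = {
        "prices": folder / f"{dgp_name}_prices.csv",
        "returns": folder / f"{dgp_name}_returns.csv",
    }
    # Forecast artifacts exist only for DGPs that generate sample paths.
    if has_forecasts:
        paths["price_paths"] = folder / f"{dgp_name}_paths.npy"
        paths["return_paths"] = folder / f"{dgp_name}_returns_paths.npy"
    return paths

for key, path in artifact_paths("gbm_low_vol").items():
    print(f"{key:>12}: {path.as_posix()}")
```

A downstream notebook could use such a helper to locate inputs without repeating the string formatting.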


In [None]:
# Global simulation settings
seed = 42
trading_days = 1000
forecast_days = 22
n_samples = 1000
initial_price = 100.0

# Output folder
output_folder = Path("datasets")
output_folder.mkdir(exist_ok=True)

### Volatility Structure of the DGPs

The DGPs are designed to span a wide range of statistical behavior, from deterministic trends to highly volatile stochastic processes. They are grouped and ordered by increasing volatility.

- **constant**: A flat, deterministic process. No randomness, no drift.
- **linear**: A deterministic trend with constant daily return. Still no noise.
- **gbm_low_vol**: A geometric Brownian motion with very low volatility (~8% annualized). Captures mild stochasticity.
- **mixture_normal**: A random walk in which 90% of steps are drawn from a low-volatility normal distribution (daily std 0.7%) and 10% from a higher-volatility regime (daily std 1.5%, mean -0.2%). Designed to simulate rare jumps.
- **seasonal**: A sinusoidal component (period of 50 days) combined with a small linear trend and moderate white noise (std 0.018). Models structured periodicity with moderate uncertainty.
- **t_garch**: A conditional-volatility process with heavy-tailed innovations (Student-t, 3 degrees of freedom) and volatility clustering. Volatility is time-varying and responds to recent shocks.
- **gbm_high_vol**: A high-volatility GBM (~80% annualized). Purely random, unstructured, and highly uncertain.

This volatility hierarchy is central to later analysis, where we examine how forecasting models behave under varying degrees of noise and structure.
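
The annualized figures quoted above can be checked with the usual square-root-of-time scaling. The sketch below assumes 252 trading days per year (the notebook does not state its convention) and i.i.d. daily returns; `annualize` and `mixture_std` are illustrative helper names, not functions from `utils.simulations`.

```python
import math

TRADING_DAYS_PER_YEAR = 252  # assumption; the notebook does not state the convention

def annualize(daily_vol, periods=TRADING_DAYS_PER_YEAR):
    """Scale a daily standard deviation to an annual one (i.i.d. returns assumed)."""
    return daily_vol * math.sqrt(periods)

def mixture_std(means, std_devs, weights):
    """Standard deviation of a finite mixture of normal distributions."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (s**2 + m**2) for w, m, s in zip(weights, means, std_devs))
    return math.sqrt(second_moment - mean**2)

gbm_low = annualize(0.005)
gbm_high = annualize(0.05)
mix_daily = mixture_std([0.0, -0.002], [0.007, 0.015], [0.9, 0.1])

print(f"gbm_low_vol:    {gbm_low:.1%}")
print(f"gbm_high_vol:   {gbm_high:.1%}")
print(f"mixture_normal: {mix_daily:.4f} daily, {annualize(mix_daily):.1%} annualized")
```

Under these assumptions the two GBM settings come out at roughly 7.9% and 79.4%, consistent with the ≈8% and ≈80% quoted in the list, and the mixture has an overall daily standard deviation of about 0.0082 (about 13% annualized), which places it between `gbm_low_vol` and the noisier processes.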


In [3]:
dgp_list = [

    # ------------------------------------------------------------------------------
    # 1. Deterministic DGPs (no volatility)
    # ------------------------------------------------------------------------------

    {
        "name": "constant",
        "type": "constant",
        "params": {},
        "generate_paths": False
    },

    {
        "name": "linear",
        "type": "linear",
        "params": {
            "daily_return": 0.0005
        },
        "generate_paths": False
    },

    # ------------------------------------------------------------------------------
    # 2. Stochastic DGPs ordered by volatility
    # ------------------------------------------------------------------------------

    {
        "name": "gbm_low_vol",
        "type": "gbm",
        "params": {
            "drift": 0.0,
            "volatility": 0.005   # ★ Very Low Volatility (≈8% annualized)
        },
        "forecast_params": {
            "drift": 0.0,
            "volatility": 0.005
        }
    },

    {
        "name": "mixture_normal",
        "type": "mixture_normal",
        "params": {
            "means": [0.0, -0.002],
            "std_devs": [0.007, 0.015],  # ★ Low Volatility
            "weights": [0.9, 0.1]
        },
        "forecast_params": {
            "means": [0.0, -0.002],
            "std_devs": [0.007, 0.015],
            "weights": [0.9, 0.1]
        }
    },

    {
        "name": "seasonal",
        "type": "seasonal",
        "params": {
            "amplitude": 0.01,
            "frequency": 0.02,          # 1/50 days
            "trend": 0.00005,
            "noise_std": 0.018          # ★ Medium Volatility
        },
        "forecast_params": {
            "amplitude": 0.01,
            "frequency": 0.02,
            "trend": 0.00005,
            "noise_std": 0.018
        }
    },

    {
        "name": "t_garch",
        "type": "t_garch",
        "params": {
            "omega": 0.00001,
            "alpha": 0.15,
            "beta": 0.8,
            "volatility_start": 0.03,   # ★ High Volatility
            "degrees_freedom": 3
        },
        "forecast_params": {
            "omega": 0.00001,
            "alpha": 0.15,
            "beta": 0.8,
            "degrees_freedom": 3
        }
    },

    {
        "name": "gbm_high_vol",
        "type": "gbm",
        "params": {
            "drift": 0.0,
            "volatility": 0.05   # ★ Very High Volatility (≈80% annualized)
        },
        "forecast_params": {
            "drift": 0.0,
            "volatility": 0.05
        }
    }

]

In [4]:
# Run simulations and save outputs
for dgp in dgp_list:
    dgp_name = dgp["name"]
    dgp_type = dgp["type"]
    dgp_params = dgp.get("params", {})
    # Copy so that the GARCH branch below does not mutate the entry in dgp_list
    forecast_params = dict(dgp.get("forecast_params", {}))
    generate_paths = dgp.get("generate_paths", True)

    print(f"\n[DGP] Generating: {dgp_name} (type = {dgp_type})")

    # Simulate price series
    simulate_func = globals()[f"simulate_{dgp_type}_prices"]
    simulate_signature = inspect.signature(simulate_func)

    if "seed" in simulate_signature.parameters:
        price_series = simulate_func(trading_days, initial_price, seed=seed, **dgp_params)
    else:
        price_series = simulate_func(trading_days, initial_price, **dgp_params)

    # Save price series
    price_file = output_folder / f"{dgp_name}_prices.csv"
    price_series.to_csv(price_file, index=False, float_format="%.8f")
    print(f"[SAVED] {price_file}")

    # Save returns
    return_series = price_series.pct_change().dropna().reset_index(drop=True)
    return_file = output_folder / f"{dgp_name}_returns.csv"
    return_series.to_csv(return_file, index=False, float_format="%.8f")
    print(f"[SAVED] {return_file}")

    if not generate_paths:
        continue

    # Simulate paths from last price
    last_price = price_series.iloc[-1]
    forecast_func = globals()[f"forecast_{dgp_type}_paths"]

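    # Special GARCH case: the forecast needs the last return and a starting volatility.
    # NOTE: last_volatility is taken from the configured volatility_start (default 0.01),
    # not from the conditional volatility at the end of the historical path.
    if dgp_type == "t_garch":
        last_return = return_series.iloc[-1]
        last_volatility = dgp_params.get("volatility_start", 0.01)
        forecast_params["last_return"] = last_return
        forecast_params["last_volatility"] = last_volatility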

    price_paths = forecast_func(
        last_price,
        forecast_days,
        n_samples,
        seed=seed,
        **forecast_params
    )

    price_paths_file = output_folder / f"{dgp_name}_paths.npy"
    np.save(price_paths_file, price_paths.astype(np.float64))
    print(f"[SAVED] {price_paths_file}")

    # Simple returns between consecutive time steps (columns) of each path
    return_paths = (price_paths[:, 1:] / price_paths[:, :-1]) - 1
    return_paths_file = output_folder / f"{dgp_name}_returns_paths.npy"
    np.save(return_paths_file, return_paths.astype(np.float64))
    print(f"[SAVED] {return_paths_file}")


[DGP] Generating: constant (type = constant)
[SAVED] datasets/constant_prices.csv
[SAVED] datasets/constant_returns.csv

[DGP] Generating: linear (type = linear)
[SAVED] datasets/linear_prices.csv
[SAVED] datasets/linear_returns.csv

[DGP] Generating: gbm_low_vol (type = gbm)
[SAVED] datasets/gbm_low_vol_prices.csv
[SAVED] datasets/gbm_low_vol_returns.csv
[SAVED] datasets/gbm_low_vol_paths.npy
[SAVED] datasets/gbm_low_vol_returns_paths.npy

[DGP] Generating: mixture_normal (type = mixture_normal)
[SAVED] datasets/mixture_normal_prices.csv
[SAVED] datasets/mixture_normal_returns.csv
[SAVED] datasets/mixture_normal_paths.npy
[SAVED] datasets/mixture_normal_returns_paths.npy

[DGP] Generating: seasonal (type = seasonal)
[SAVED] datasets/seasonal_prices.csv
[SAVED] datasets/seasonal_returns.csv
[SAVED] datasets/seasonal_paths.npy
[SAVED] datasets/seasonal_returns_paths.npy

[DGP] Generating: t_garch (type = t_garch)
[SAVED] datasets/t_garch_prices.csv
[SAVED] datasets/t_garch_returns.csv
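
The run loop above resolves each simulator by name through `globals()` and passes `seed` only when the function accepts it. One possible alternative, sketched here under stated assumptions, is an explicit registry that fails with a clear message for unknown types. The two simulators below are hypothetical toy functions, not the ones in `utils.simulations`, and `run_simulator` is an illustrative name.

```python
import inspect
import random

# Hypothetical simulators standing in for the functions in utils.simulations.
def simulate_constant_prices(n_days, start_price):
    return [start_price] * n_days

def simulate_noisy_prices(n_days, start_price, seed=None, step=1.0):
    rng = random.Random(seed)
    prices = [start_price]
    for _ in range(n_days - 1):
        prices.append(prices[-1] + rng.choice([-step, step]))
    return prices

# Explicit registry: unknown DGP types raise a descriptive error.
SIMULATORS = {
    "constant": simulate_constant_prices,
    "noisy": simulate_noisy_prices,
}

def run_simulator(dgp_type, n_days, start_price, seed, **params):
    try:
        func = SIMULATORS[dgp_type]
    except KeyError:
        raise ValueError(f"Unknown DGP type: {dgp_type!r}") from None
    # Pass the seed only to simulators that declare a `seed` parameter.
    if "seed" in inspect.signature(func).parameters:
        params = {**params, "seed": seed}
    return func(n_days, start_price, **params)

print(run_simulator("constant", 3, 100.0, seed=42))
print(len(run_simulator("noisy", 5, 100.0, seed=42, step=0.5)))
```

The signature check mirrors the notebook's approach, so deterministic simulators do not need to accept an unused `seed` argument.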
