## Data Loader Playground

This notebook explores the data loader and its configuration options.

An ideal loader produces a simulated stream that mimics real-world conditions, including controlled nonstationarity. The nonstationarity lets us test how well our learning algorithm adapts to varying degrees of concept drift.

In [3]:
from data_loader import get_synthetic_linear_stream

data_stream = get_synthetic_linear_stream()

# Synthetic Linear Stream: Concept Drift Playground

This notebook simulates and visualizes streams generated by `get_synthetic_linear_stream()` under different concept drift regimes. It also documents the parameters and how they shape the iterator and diagnostics.

## Function: `get_synthetic_linear_stream()`
Produces an infinite iterator that yields event records with the unified schema:
- `x`: feature vector (`np.ndarray`)
- `y`: target (`float`)
- `sample_id`: stable id
- `event_id`: increasing int
- `segment_id`: int (0 by default)
- `metrics`: dictionary with diagnostics (e.g., `x_norm`, `delta_P`, `w_star_norm`, `noise`, `lambda_est`, `P_T_true`)
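
For orientation, a record with this schema might look like the hand-built dictionary below. This is only a sketch: the `make_example_event` helper, the numeric values, and the string format of `sample_id` are assumptions for illustration, not output from the loader.

```python
import numpy as np

def make_example_event(dim: int = 3) -> dict:
    """Hand-built stand-in for one event record (all values hypothetical)."""
    x = np.ones(dim)
    w_star = np.full(dim, 0.5)
    noise = 0.01
    return {
        "x": x,                               # feature vector
        "y": float(x @ w_star + noise),       # linear target plus noise
        "sample_id": "linear_000000",         # assumed id format
        "event_id": 0,
        "segment_id": 0,
        "metrics": {
            "x_norm": float(np.linalg.norm(x)),
            "w_star_norm": float(np.linalg.norm(w_star)),
            "delta_P": 0.0,
            "P_T_true": 0.0,
            "noise": noise,
        },
    }

event = make_example_event()
print(sorted(event.keys()))
```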

### Parameters and Effects
- `dim`: Feature dimensionality. Larger `dim` changes geometry; affects norms and regression target variability.
- `seed`: RNG seed for reproducibility. Impacts parameter path init, covariance, and noise draws.
- `noise_std`: Standard deviation of additive Gaussian noise in `y`. Larger values increase target variance and reduce signal-to-noise ratio.
- `use_event_schema`: When `True` (recommended), yields event dicts with diagnostics; when `False`, returns a legacy `(x, y)` stream without extra metrics.

Covariance and feature scaling:
- `eigs`: Explicit eigenvalues of the Gaussian feature covariance Σ. Controls anisotropy; larger spread concentrates variance along top PCs.
- `cond_number`: If `eigs` is not provided, generates geometrically spaced eigenvalues with condition number `cond_number` (Σ spectrum from 1 to 1/cond_number).
- `rand_orth_seed`: Seed for the random orthogonal matrix defining eigenvectors of Σ.
- `feature_scale`: Post-covariance scalar applied to all features. Scales norms and effective step sizes for learners.
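
The covariance options above can be sketched in a few lines of NumPy. The helpers `make_covariance` and `sample_features` are hypothetical and may differ from what the loader does internally; they assume zero-mean Gaussian features, a geometric spectrum from 1 down to 1/`cond_number`, and eigenvectors taken from the QR factorization of a Gaussian matrix.

```python
import numpy as np

def make_covariance(dim, cond_number, rand_orth_seed=0):
    """Build Sigma = Q diag(eigs) Q^T with a geometric spectrum (sketch)."""
    # Eigenvalues spaced geometrically from 1 down to 1/cond_number.
    eigs = np.geomspace(1.0, 1.0 / cond_number, num=dim)
    # Random orthogonal eigenvectors via QR of a Gaussian matrix.
    rng = np.random.default_rng(rand_orth_seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q @ np.diag(eigs) @ q.T, eigs, q

def sample_features(n, eigs, q, feature_scale=1.0, seed=0):
    """Draw n rows x ~ N(0, Sigma), then apply the scalar feature_scale."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, len(eigs)))
    return feature_scale * (z * np.sqrt(eigs)) @ q.T

sigma, eigs, q = make_covariance(dim=5, cond_number=50.0)
X = sample_features(1000, eigs, q)
print("condition number:", np.linalg.cond(sigma))
print("mean ||x||:", np.linalg.norm(X, axis=1).mean())
```

Doubling `feature_scale` in this sketch doubles every `‖x‖`, which is why it acts like a change in effective step size for gradient-based learners.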

Ground-truth parameter path (concept drift):
- `path_type`: One of `"static"`, `"rotating"`, or `"drift"`.
  - `static`: Fixed parameter; no drift (`delta_P = 0`, `P_T_true` flat).
  - `rotating`: Applies a small rotation per step in a 2D subspace; smooth cyclic drift.
  - `drift`: Adds small random drift per step; stochastic wandering.
- `path_control`: Enable/disable controlled path machinery. If `False`, parameters are resampled i.i.d.; drift metrics become trivial.
- `rotate_angle`: Radians per step for `rotating`. Larger angle increases instantaneous drift (`delta_P`) and cumulative path length (`P_T_true`).
- `drift_rate`: Standard deviation of per-step drift for `drift`. Larger values yield more volatile and faster-growing `P_T_true`.
- `w_scale`: If provided and `fix_w_norm=True`, initializes and maintains ‖w*‖ at this value. Useful to isolate orientation drift from norm changes.
- `fix_w_norm`: When `True`, renormalizes the drifting parameter to `w_scale` (or initial norm) each step; when `False`, the norm may grow/shrink.
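
To make the `rotating` and `drift` regimes concrete, here is a minimal sketch of both update rules together with the resulting per-step increments. The functions `rotate_step`, `drift_step`, and `path_metrics` are hypothetical, and the choice of rotation plane (coordinates 0 and 1) is an assumption; the loader may pick a different subspace.

```python
import numpy as np

def rotate_step(w, angle, i=0, j=1):
    """Rotate w by `angle` radians in the plane of coordinates i and j."""
    out = w.copy()
    c, s = np.cos(angle), np.sin(angle)
    out[i] = c * w[i] - s * w[j]
    out[j] = s * w[i] + c * w[j]
    return out

def drift_step(w, drift_rate, rng, w_scale=None):
    """Add Gaussian drift; optionally renormalize to w_scale (as fix_w_norm would)."""
    out = w + drift_rate * rng.standard_normal(w.shape)
    if w_scale is not None:
        out *= w_scale / np.linalg.norm(out)
    return out

def path_metrics(step_fn, w0, n_steps):
    """Apply step_fn n_steps times; return per-step increments and the final w."""
    w, deltas = w0, []
    for _ in range(n_steps):
        w_next = step_fn(w)
        deltas.append(np.linalg.norm(w_next - w))
        w = w_next
    return np.array(deltas), w

rng = np.random.default_rng(7)
w0 = np.zeros(20)
w0[0] = 2.0  # norm 2, lying inside the assumed rotation plane

rot_deltas, _ = path_metrics(lambda w: rotate_step(w, 0.01), w0, 2000)
drift_deltas, w_end = path_metrics(
    lambda w: drift_step(w, 0.003, rng, w_scale=2.0), w0, 2000
)

print("rotating P_T:", rot_deltas.sum())    # constant increments, linear growth
print("drift P_T:   ", drift_deltas.sum())  # noisy increments
```

When `w*` lies entirely in the rotation plane, each rotating increment equals `2 * ‖w*‖ * sin(rotate_angle / 2)`, so the cumulative path length grows linearly in the number of steps.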

Diagnostics and estimation:
- `strong_convexity_estimation`: When `True`, includes an online estimate `lambda_est` via a secant-style update using a simple learner surrogate. Expect noisy early estimates that stabilize with time.
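
The loader's exact estimator is not shown in this notebook. As a rough illustration of what a secant-style curvature estimate can look like, the sketch below computes the ratio ⟨g(w₁) − g(w₀), w₁ − w₀⟩ / ‖w₁ − w₀‖² on a known quadratic. The function `secant_curvature` and the use of a running minimum are assumptions, not the loader's method.

```python
import numpy as np

def secant_curvature(grad_fn, w_prev, w_curr, eps=1e-12):
    """Secant ratio <g(w_curr) - g(w_prev), d> / ||d||^2 with d = w_curr - w_prev."""
    d = w_curr - w_prev
    denom = float(d @ d)
    if denom < eps:
        return None  # no movement, so the ratio carries no information
    return float((grad_fn(w_curr) - grad_fn(w_prev)) @ d) / denom

# Check on a quadratic f(w) = 0.5 * w^T A w, whose gradient is A w.
A = np.diag([4.0, 1.0, 0.25])

def grad(w):
    return A @ w

rng = np.random.default_rng(0)
estimates = []
for _ in range(200):
    w_prev, w_curr = rng.standard_normal(3), rng.standard_normal(3)
    estimates.append(secant_curvature(grad, w_prev, w_curr))

# Each ratio is a Rayleigh quotient of A, so it lies between the smallest
# and largest eigenvalue. The running minimum approaches the strong
# convexity constant (0.25 here) from above.
lambda_est = min(estimates)
print("lambda_est:", lambda_est)
```

Because any single ratio only probes one direction, early values can sit far above the true constant, which is consistent with noisy early estimates that settle over time.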

### Key Metrics in `metrics`
- `delta_P`: Per-step path increment ‖w*_t − w*_{t−1}‖ (0 if static).
- `P_T_true`: Cumulative path length up to time t.
- `x_norm`: ‖x‖; influenced by Σ spectrum and `feature_scale`.
- `w_star_norm`: ‖w*‖; constant if `fix_w_norm=True`.
- `noise`: The scalar noise draw added to `x·w*` to produce `y`.
- `lambda_est`: Online strong convexity estimate (if enabled).
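
A quick sanity check on these diagnostics is that `P_T_true` should equal the running sum of `delta_P`. The helper below is a hypothetical sketch; it assumes the cumulative value at each event already includes that event's increment, which may not match the loader's convention.

```python
import numpy as np
import pandas as pd

def check_path_metrics(df: pd.DataFrame, atol: float = 1e-8) -> bool:
    """Return True if P_T_true matches the running sum of delta_P."""
    return bool(np.allclose(df["delta_P"].cumsum(), df["P_T_true"], atol=atol))

# Toy frame with a constant per-step increment (values are illustrative).
toy = pd.DataFrame({"delta_P": [0.0, 0.02, 0.02, 0.02]})
toy["P_T_true"] = toy["delta_P"].cumsum()

print(check_path_metrics(toy))
print(check_path_metrics(toy.assign(P_T_true=0.0)))
```

The same call can be applied to any DataFrame that has `delta_P` and `P_T_true` columns.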

Below, we run controlled experiments to visualize how these parameters affect the stream.

In [None]:
# Setup and Imports
import sys, os
from pathlib import Path

# Ensure repo root on sys.path so our 'code' package resolves before stdlib 'code'
def add_repo_root_to_path():
    # Search the working directory first, then each of its ancestors.
    candidates = [Path.cwd(), *Path.cwd().parents]
    for cand in candidates:
        if (cand / "code" / "__init__.py").exists():
            sys.path.insert(0, str(cand))
            return cand
    return Path.cwd()

proj_root = add_repo_root_to_path()
print("Using project root:", proj_root)

# Ensure plotting/data packages are available
def ensure_packages(pkgs):
    import importlib
    missing = []
    for name, pip_name in pkgs:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(pip_name)
    if missing:
        import subprocess
        python = sys.executable
        print("Installing missing packages:", missing)
        subprocess.check_call([python, "-m", "pip", "install", *missing])

ensure_packages([
    ("pandas", "pandas"),
    ("matplotlib", "matplotlib"),
    ("seaborn", "seaborn"),
])

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Inline plots
%matplotlib inline
sns.set_context("talk")
sns.set_style("whitegrid")

# Import the synthetic stream generator
from code.data_loader.linear import get_synthetic_linear_stream
from code.data_loader.event_schema import parse_event_record

print("Imports OK.")

In [None]:
# Utilities: sampling and plotting
from typing import Iterator, Dict, Any


def take_events(stream: Iterator[Dict[str, Any]], n: int):
    """Collect n event records from the stream."""
    out = []
    for _ in range(n):
        out.append(next(stream))
    return out


def events_to_frame(events):
    """Convert event list to a tidy DataFrame of diagnostics."""
    rows = []
    for e in events:
        m = e.get("metrics", {})
        rows.append({
            "event_id": e.get("event_id"),
            "x_norm": m.get("x_norm"),
            "w_star_norm": m.get("w_star_norm"),
            "delta_P": m.get("delta_P"),
            "P_T_true": m.get("P_T_true"),
            "noise": m.get("noise"),
            "lambda_est": m.get("lambda_est"),
            "y": e.get("y"),
        })
    return pd.DataFrame(rows)


def plot_drift_summary(df, title: str):
    """Plot feature norm, parameter norm, per-step drift, and path length in a 2x2 grid."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 8), constrained_layout=True)

    # Pass the frame as `data=` so the calls also work on seaborn versions
    # that do not accept it positionally.
    ax = axes[0, 0]
    sns.lineplot(data=df, x="event_id", y="x_norm", ax=ax)
    ax.set_title("Feature Norm ‖x‖")

    ax = axes[0, 1]
    sns.lineplot(data=df, x="event_id", y="w_star_norm", ax=ax)
    ax.set_title("Parameter Norm ‖w*‖")

    ax = axes[1, 0]
    sns.lineplot(data=df, x="event_id", y="delta_P", ax=ax)
    ax.set_title("Per-step Drift ΔP")

    ax = axes[1, 1]
    sns.lineplot(data=df, x="event_id", y="P_T_true", ax=ax)
    ax.set_title("Cumulative Path Length P_T")

    fig.suptitle(title)
    return fig, axes

In [None]:
# Demo 1: Rotating parameter path
N = 2000
stream = get_synthetic_linear_stream(
    dim=20,
    seed=7,
    noise_std=0.1,
    use_event_schema=True,
    cond_number=50.0,           # moderate anisotropy
    feature_scale=1.0,
    path_type="rotating",
    path_control=True,
    rotate_angle=0.01,          # radians per step
    w_scale=2.0,
    fix_w_norm=True,
    strong_convexity_estimation=True,
)

rot_events = take_events(stream, N)
rot_df = events_to_frame(rot_events)
fig, axes = plot_drift_summary(rot_df, title="Rotating Drift: ΔP and Path Length")
plt.show()

rot_df.head()

In [None]:
# Demo 2: Stochastic drift in parameters (random per-step perturbation)
N = 2000
stream = get_synthetic_linear_stream(
    dim=20,
    seed=7,
    noise_std=0.1,
    use_event_schema=True,
    cond_number=50.0,
    feature_scale=1.0,
    path_type="drift",
    path_control=True,
    drift_rate=0.003,           # per-step drift scale
    w_scale=2.0,
    fix_w_norm=True,
    strong_convexity_estimation=True,
)

drift_events = take_events(stream, N)
drift_df = events_to_frame(drift_events)
fig, axes = plot_drift_summary(drift_df, title="Random Drift: ΔP and Path Length")
plt.show()

drift_df.head()

## Explore Further
- Increase `rotate_angle` (e.g., 0.03) to amplify per-step drift in the rotating path; `P_T_true` should grow near-linearly with a higher slope.
- Increase `drift_rate` (e.g., 0.01) to make `delta_P` noisier and `P_T_true` grow faster in the random drift case.
- Toggle `fix_w_norm` to see how changing ‖w*‖ interacts with targets and diagnostics; when `False`, expect `w_star_norm` to vary.
- Adjust `cond_number` to shape `x_norm` behavior via anisotropy; larger values concentrate variance along top directions.
- Set `feature_scale` > 1 to scale feature norms and widen the range of `y`.
- Disable `strong_convexity_estimation` to remove `lambda_est` and reduce overhead if not needed.

Tip: The iterator is infinite, so always bound how much you consume (e.g., with `take_events(stream, N)`). Start with a small `N` for quick experiments and scale up as needed.
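
As an alternative to a manual loop, `itertools.islice` from the standard library bounds any infinite iterator. The `toy_stream` generator below is a hypothetical stand-in so the snippet runs on its own; in practice the real stream would take its place.

```python
from itertools import count, islice

def toy_stream():
    """Stand-in infinite iterator (hypothetical; not the real loader)."""
    for event_id in count():
        yield {"event_id": event_id, "y": 0.0}

# Take exactly five records without exhausting (or hanging on) the stream.
first_five = list(islice(toy_stream(), 5))
print([e["event_id"] for e in first_five])
```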