# 11 – Time Series EDA & Window Design

This notebook focuses on **exploratory data analysis (EDA)** and **window design** for time-series data using the `ml_tabular` template.

We will:

1. Load configuration and raw time-series data
2. Inspect schema, coverage, and frequency
3. Analyze target behaviour (trend, seasonality, volatility)
4. Inspect missing timestamps and gaps
5. Explore candidate window parameters (`lookback`, `horizon`, `stride`)
6. Prototype windows with `TimeSeriesSequenceDataset`
7. Export a **candidate window config** back to YAML

The goal is to treat **windowing as an explicit design decision**, not a magic number in code. This notebook helps you:

- Justify your choice of lookback/horizon
- Understand whether you have enough history per series
- Identify frequency / seasonality assumptions
- Feed those decisions back into your configuration and pipelines.

## 0. Setup

Assumptions:

- Project installed (from repo root):

  ```bash
  pip install -e ".[dev]"
  ```

- Time-series baseline config exists at:

  - `configs/time_series/train_ts_baseline.yaml`

- `TimeSeriesSequenceDataset` is implemented in `ml_tabular.torch.datasets.time_series`. The setup cell imports it through the top-level `ml_tabular` package, so it must be exported there as well.

We run this notebook from the **project root** so relative paths resolve correctly.

In [None]:
%load_ext autoreload
%autoreload 2

from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from ml_tabular import (
    get_config,
    get_paths,
    get_logger,
    TimeSeriesSequenceDataset,
)

LOGGER = get_logger(__name__)
PROJECT_ROOT = Path.cwd()
CONFIG_PATH = PROJECT_ROOT / "configs" / "time_series" / "train_ts_baseline.yaml"
assert CONFIG_PATH.exists(), f"Config not found: {CONFIG_PATH}"

cfg = get_config(config_path=CONFIG_PATH, env="dev", force_reload=True)
paths = get_paths(config_path=CONFIG_PATH, env="dev", force_reload=True)

cfg_dict = cfg.to_dict()
ts_cfg = cfg_dict.get("time_series", {})
ts_cfg

We assume `time_series` config fields similar to:

```yaml
time_series:
  dataset_csv: "timeseries.csv"
  id_column: "series_id"        # or null for single series
  time_column: "timestamp"
  target_column: "y"
  feature_columns: ["x1", "x2", ...]
  lookback: 24
  horizon: 1
  val_fraction: 0.2
  # (optional) stride: 1
```
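
Before loading data, it can help to sanity-check these fields. The sketch below is a hypothetical helper (`validate_ts_config` is not part of `ml_tabular`), and which fields count as required is an assumption based on the example above.

```python
from typing import Any


def validate_ts_config(ts_cfg: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue was found.

    Hypothetical helper for illustration; not part of ml_tabular.
    """
    problems: list[str] = []
    # Assumption: these three fields are always required.
    for key in ("dataset_csv", "time_column", "target_column"):
        if not ts_cfg.get(key):
            problems.append(f"missing required field: {key}")
    # Window sizes must be positive integers.
    for key in ("lookback", "horizon"):
        value = ts_cfg.get(key)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    stride = ts_cfg.get("stride", 1)
    if not isinstance(stride, int) or stride < 1:
        problems.append(f"stride must be a positive integer, got {stride!r}")
    val_fraction = ts_cfg.get("val_fraction", 0.2)
    if not 0.0 < float(val_fraction) < 1.0:
        problems.append(f"val_fraction must be in (0, 1), got {val_fraction!r}")
    return problems


good = {
    "dataset_csv": "timeseries.csv",
    "time_column": "timestamp",
    "target_column": "y",
    "lookback": 24,
    "horizon": 1,
    "val_fraction": 0.2,
}
bad = dict(good, lookback=0, val_fraction=1.5)
print(validate_ts_config(good))
print(validate_ts_config(bad))
```

Returning a list of problems rather than raising on the first one lets you see every issue in a single pass.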

## 1. Load data

We read the configured CSV and ensure key columns exist, then parse the time column and sort appropriately.

In [None]:
dataset_csv = ts_cfg["dataset_csv"]
data_path = paths.data_dir / dataset_csv
assert data_path.exists(), f"Dataset not found: {data_path}"

df = pd.read_csv(data_path)
LOGGER.info("Loaded dataset with shape: %s", df.shape)
df.head()

In [None]:
id_col = ts_cfg.get("id_column")  # may be None for single series
time_col = ts_cfg["time_column"]
target_col = ts_cfg["target_column"]
feature_cols = ts_cfg.get("feature_columns") or []

print("ID column:", id_col)
print("Time column:", time_col)
print("Target column:", target_col)
print("Feature columns:", feature_cols)

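# Include the id column (when configured) so a typo there fails here, not at sort time.
required = [c for c in [id_col, time_col, target_col] if c is not None] + list(feature_cols)
missing = [c for c in required if c not in df.columns]
print("Missing columns:", missing)
assert not missing, f"Config references columns not found in dataset: {missing}"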

In [None]:
df[time_col] = pd.to_datetime(df[time_col], errors="raise")

if id_col is not None:
    df = df.sort_values([id_col, time_col]).reset_index(drop=True)
else:
    df = df.sort_values(time_col).reset_index(drop=True)

df[[c for c in [id_col, time_col, target_col] if c is not None]].head(10)

## 2. Coverage and series overview

We examine:

- Number of series
- Time coverage per series
- Number of points per series

This helps us understand whether our planned lookback/horizon are realistic given the per-series history length.

In [None]:
if id_col is not None:
    n_series = df[id_col].nunique()
    print("Number of series:", n_series)
    coverage = df.groupby(id_col)[time_col].agg(["min", "max", "count"]).rename(
        columns={"min": "time_min", "max": "time_max", "count": "n_points"}
    )
    # A bare expression inside an `if` block is not auto-displayed, so call display().
    display(coverage.head(10))

In [None]:
if id_col is None:
    print("Single series dataset.")
    print("Time range:", df[time_col].min(), "->", df[time_col].max())
    print("Number of points:", len(df))
else:
    print("Global time range:", df[time_col].min(), "->", df[time_col].max())
    print("Total points:", len(df))

> **Questions to ask:**
> - Do all series have enough points to support your planned `lookback` and `horizon`?
> - Are some series too short and likely to be dropped when windowing?
> - Is the time coverage roughly consistent, or do some series start later/end earlier?
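
The first two questions can be checked directly. The sketch below flags series with fewer than `lookback + horizon` points, assuming one window needs that many consecutive rows. `flag_short_series` and the toy data are illustrative, not part of `ml_tabular`.

```python
import pandas as pd


def flag_short_series(df: pd.DataFrame, id_col: str, lookback: int, horizon: int) -> pd.DataFrame:
    """Per-series point counts plus a flag for series too short to yield one window."""
    # Assumption: one window needs `lookback` input rows plus `horizon` target rows.
    min_len = lookback + horizon
    out = df.groupby(id_col).size().rename("n_points").to_frame()
    out["too_short"] = out["n_points"] < min_len
    return out


# Toy data: series "a" has 30 points, series "b" has only 10.
toy = pd.DataFrame({"series_id": ["a"] * 30 + ["b"] * 10, "y": range(40)})
report = flag_short_series(toy, "series_id", lookback=24, horizon=1)
print(report)
```

On the real data you would pass `df`, `id_col`, and the configured `lookback`/`horizon`.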

## 3. Time index and frequency analysis

We now inspect the **time step distribution** to understand the implicit sampling frequency:

- Are timestamps regularly spaced (e.g., hourly, daily, minutely)?
- Are there large gaps?
- Is frequency consistent across series?

This strongly influences **how you choose lookback/horizon**:

- 24-step lookback at hourly resolution ≈ 1 day of history
- 24-step lookback at daily resolution ≈ 24 days
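
The same arithmetic as a small sketch, assuming a fixed step size (`lookback_duration` is a hypothetical helper):

```python
import pandas as pd


def lookback_duration(lookback: int, step_seconds: float) -> pd.Timedelta:
    """Wall-clock span covered by `lookback` steps at a fixed step size."""
    return pd.Timedelta(seconds=lookback * step_seconds)


print(lookback_duration(24, 3600))   # hourly data
print(lookback_duration(24, 86400))  # daily data
```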

We’ll compute deltas in seconds and look at their distribution.

In [None]:
def compute_time_deltas(df: pd.DataFrame, time_col: str, id_col: str | None = None) -> pd.Series:
    """Compute time deltas (in seconds) between consecutive rows.

    For multi-series data, deltas are computed **within each series** and concatenated.
    """
    if id_col is None:
        return df[time_col].diff().dropna().dt.total_seconds()
    else:
        deltas = []
        for _, group in df.groupby(id_col, sort=False):
            d = group[time_col].diff().dropna().dt.total_seconds()
            deltas.append(d)
        if not deltas:
            return pd.Series(dtype=float)
        return pd.concat(deltas, ignore_index=True)

deltas_sec = compute_time_deltas(df, time_col=time_col, id_col=id_col)
print("Number of deltas:", len(deltas_sec))
deltas_sec.describe()

In [None]:
if len(deltas_sec) > 0:
    plt.figure(figsize=(6, 4))
    plt.hist(deltas_sec, bins=30, edgecolor="black", alpha=0.7)
    plt.xlabel("Delta between timestamps (seconds)")
    plt.ylabel("Count")
    plt.title("Distribution of time deltas")
    plt.grid(axis="y", linestyle="--", alpha=0.5)
    plt.show()

    common_deltas = deltas_sec.value_counts().head(10)
    print("Most common deltas (seconds):")
    display(common_deltas.to_frame("count"))
else:
    print("Not enough points to compute time deltas.")

> **Interpretation tips:**
> - A single dominant delta (e.g., 3600 seconds) suggests a clear base frequency (hourly).
> - Multiple modes may indicate irregular sampling or mixed frequencies.
> - Large outliers in deltas suggest big gaps (e.g., outages, missing days).
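
One way to quantify "a single dominant delta" is to report the most common delta together with the share of deltas equal to it. This is a minimal sketch on toy data; `dominant_delta` is a hypothetical helper, and it expects a non-empty series.

```python
import pandas as pd


def dominant_delta(deltas_sec: pd.Series) -> tuple[float, float]:
    """Return (most common delta in seconds, share of all deltas equal to it)."""
    counts = deltas_sec.value_counts()  # sorted by count, most common first
    return float(counts.index[0]), float(counts.iloc[0] / counts.sum())


# Toy deltas: mostly hourly, with one 2-hour and one 1-day gap.
toy_deltas = pd.Series([3600.0] * 8 + [7200.0, 86400.0])
step, share = dominant_delta(toy_deltas)
print(f"dominant step: {step:.0f}s, share: {share:.0%}")
```

A share close to 100% supports treating the data as regularly sampled; a low share suggests resampling or irregular-time handling before windowing.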

## 4. Missing timestamps and gaps

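We look for **large gaps** in the time series, where:

- The delta between consecutive timestamps exceeds a threshold
- Here the threshold is 1.5 × the median delta, so a single missing step (2 × the median) already counts as a gap

This is especially important for forecasting models: a window that spans a large gap treats observations far apart in time as consecutive steps, which may call for interpolation, masking, or splitting the series at the gap.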

In [None]:
if len(deltas_sec) > 0:
    median_delta = deltas_sec.median()
    threshold = median_delta * 1.5
    large_gaps = deltas_sec[deltas_sec > threshold]

    print("Median delta (sec):", median_delta)
    print("Gap threshold (sec):", threshold)
    print("Number of large gaps:", len(large_gaps))

    if len(large_gaps) > 0:
        display(large_gaps.describe())
else:
    print("Skipping gap analysis; insufficient deltas.")

> **Actionable outcomes:**
> - Decide whether to **interpolate**, **mask**, or **drop** segments around large gaps.
> - Potentially add a **config flag** like `drop_large_gaps: true` or `max_gap_seconds` in `time_series` config for your data pipeline.
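
If you choose to split rather than interpolate, one option is to label contiguous segments and window each segment separately. The sketch below assumes a sorted single series and uses a `max_gap_seconds` threshold like the config flag suggested above. `assign_segments` is a hypothetical helper.

```python
import pandas as pd


def assign_segments(times: pd.Series, max_gap_seconds: float) -> pd.Series:
    """Label each row with a segment id that increments after every gap > max_gap_seconds.

    Assumes `times` is sorted and belongs to a single series.
    """
    # The first row has no predecessor: its delta is NaN, and NaN > x is False.
    gap = times.diff().dt.total_seconds() > max_gap_seconds
    return gap.cumsum().rename("segment")


# Toy hourly timestamps with a 7-hour gap between 02:00 and 09:00.
toy_times = pd.Series(pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00",
    "2024-01-01 09:00", "2024-01-01 10:00",
]))
segments = assign_segments(toy_times, max_gap_seconds=1.5 * 3600)
print(segments.tolist())
```

For multi-series data you could apply this within each group and use `(series id, segment)` as the windowing key.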

## 5. Target behaviour: trend and seasonality

We now look more closely at the target series:

- Raw time-series plots for sample series
- Rolling means / standard deviations
- Simple seasonal patterns (e.g. day-of-week, hour-of-day) if applicable

This helps justify the **lookback window**: you want enough history to capture relevant patterns (seasonality, trend) without exploding model complexity.

In [None]:
def plot_series(df_series: pd.DataFrame, time_col: str, target_col: str, title: str = "") -> None:
    plt.figure(figsize=(10, 3))
    plt.plot(df_series[time_col], df_series[target_col], marker=".", linestyle="-", alpha=0.7)
    plt.xlabel("Time")
    plt.ylabel(target_col)
    plt.title(title or f"Series: {target_col}")
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.tight_layout()
    plt.show()

if id_col is None:
    plot_series(df, time_col, target_col, title="Target over time (single series)")
else:
    # Plot the first few series (in sorted order, not a random sample)
    unique_ids = df[id_col].unique()
    n_plot = min(3, len(unique_ids))
    for sid in unique_ids[:n_plot]:
        subset = df[df[id_col] == sid]
        plot_series(subset, time_col, target_col, title=f"Series {sid} – target over time")

### 5.1 Rolling statistics

We compute rolling mean and standard deviation to see how the **local level and volatility** change over time. This can influence:

- Whether a **fixed lookback** is adequate
- Whether we should consider **differencing** or **detrending**

We’ll demo on a single series (or the single global series).

In [None]:
def plot_rolling(df_series: pd.DataFrame, time_col: str, target_col: str, window: int = 24) -> None:
    s = df_series[target_col].astype(float)
    roll_mean = s.rolling(window=window, min_periods=1).mean()
    roll_std = s.rolling(window=window, min_periods=1).std()

    plt.figure(figsize=(10, 4))
    plt.plot(df_series[time_col], s, label="target", alpha=0.5)
    plt.plot(df_series[time_col], roll_mean, label=f"rolling mean (window={window})")
    plt.fill_between(
        df_series[time_col].values,
        (roll_mean - roll_std).values,
        (roll_mean + roll_std).values,
        color="gray",
        alpha=0.2,
        label="rolling ±1 std",
    )
    plt.xlabel("Time")
    plt.ylabel(target_col)
    plt.title("Rolling statistics of target")
    plt.legend()
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.tight_layout()
    plt.show()

example_series = df if id_col is None else df[df[id_col] == df[id_col].iloc[0]]
plot_rolling(example_series, time_col, target_col, window=min(48, len(example_series)))
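
If the rolling mean drifts, differencing is one of the options mentioned above. The sketch below uses synthetic data (a linear trend plus a 24-step cycle, both made up for illustration) to show what first and seasonal differencing each remove.

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend + 24-step cycle (illustrative data only).
t = np.arange(24 * 10)
s = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 24))

first_diff = s.diff().dropna()       # removes the linear trend, keeps the cycle
seasonal_diff = s.diff(24).dropna()  # removes the 24-step cycle; the trend becomes a constant

print("mean of first difference:", round(first_diff.mean(), 3))
print("std of seasonal difference:", round(seasonal_diff.std(), 6))
```

If you difference the target for training, forecasts have to be converted back to the original scale afterwards.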

### 5.2 Simple seasonal patterns (optional)

If your data has a clear base frequency (e.g. hourly, daily), you can inspect target by:

- Hour of day
- Day of week

This helps identify whether `lookback` should span a full **daily or weekly season**.

> These plots assume that such notions make sense for your data. If your timestamps are irregular or sparse, they may be less informative.

In [None]:
# Note: this pools every series together; filter to one series first if levels differ a lot.
series_for_seasonality = df.copy()
series_for_seasonality["hour"] = series_for_seasonality[time_col].dt.hour
series_for_seasonality["dow"] = series_for_seasonality[time_col].dt.dayofweek

hourly = series_for_seasonality.groupby("hour")[target_col].mean()
dow = series_for_seasonality.groupby("dow")[target_col].mean()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(hourly.index, hourly.values, marker="o")
axes[0].set_xlabel("Hour of day")
axes[0].set_ylabel(target_col)
axes[0].set_title("Average target by hour")
axes[0].grid(True, linestyle="--", alpha=0.4)

axes[1].plot(dow.index, dow.values, marker="o")
axes[1].set_xlabel("Day of week (0=Mon)")
axes[1].set_ylabel(target_col)
axes[1].set_title("Average target by day of week")
axes[1].grid(True, linestyle="--", alpha=0.4)

plt.tight_layout()
plt.show()

> **Interpretation tips:**
> - If the pattern repeats daily on hourly data, you might want `lookback` ≥ 24; if it repeats weekly on daily data, `lookback` ≥ 7. Either way, the window should cover at least one full cycle.
> - For multiple overlapping seasonalities (e.g. daily + weekly), you may need a longer lookback or explicit calendar features (e.g. hour-of-day and day-of-week encodings).
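
Autocorrelation gives a numeric complement to these plots: the lag with the strongest autocorrelation is a candidate cycle length. This is a sketch on synthetic data using pandas' `Series.autocorr`; `best_seasonal_lag` is a hypothetical helper. Very short lags are excluded through `min_lag` because smooth series correlate strongly with their immediate past.

```python
import numpy as np
import pandas as pd


def best_seasonal_lag(s: pd.Series, min_lag: int, max_lag: int) -> int:
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation."""
    acf = pd.Series({lag: s.autocorr(lag=lag) for lag in range(min_lag, max_lag + 1)})
    return int(acf.idxmax())


# Synthetic series with a 24-step cycle (illustrative data only).
t = np.arange(24 * 10)
toy = pd.Series(np.sin(2 * np.pi * t / 24))
print("strongest lag:", best_seasonal_lag(toy, min_lag=2, max_lag=36))
```

On real data, apply this to one regularly sampled series at a time and treat the result as a hint, not a decision: trend and noise can shift the peak.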

## 6. Window design: exploring lookback & horizon

We now focus on **window design**:

- `lookback`: how many past time steps the model sees
- `horizon`: how many future steps the model predicts
- (Optional) `stride`: how far we move the window each time

Design tradeoffs:

- Larger lookback → more context, but a larger input (more parameters for many models, higher risk of overfitting) and fewer windows per series
- Longer horizon → more difficult prediction, may need more sophisticated model
- Smaller stride → more training samples (but more correlation between samples)
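
The effect of `stride` on sample count follows from simple arithmetic. The sketch below assumes window `i` starts at row `i * stride` of a gap-free series and needs `lookback + horizon` consecutive points. `TimeSeriesSequenceDataset` may count differently, so treat `count_windows` as a hypothetical reference.

```python
def count_windows(n_points: int, lookback: int, horizon: int, stride: int = 1) -> int:
    """Number of (input, target) windows from one gap-free series of n_points.

    Assumes window i starts at i * stride and needs lookback + horizon consecutive points.
    """
    usable = n_points - lookback - horizon
    return 0 if usable < 0 else usable // stride + 1


for stride in (1, 4, 24):
    print(stride, count_windows(n_points=1000, lookback=24, horizon=1, stride=stride))
```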

We’ll try a few candidate `(lookback, horizon)` pairs and inspect:

- Number of training sequences generated
- Sample shapes
- Whether short series are dropped.

In [None]:
candidate_windows = [
    {"lookback": int(ts_cfg.get("lookback", 24)), "horizon": int(ts_cfg.get("horizon", 1))},
    {"lookback": int(ts_cfg.get("lookback", 24)) * 2, "horizon": int(ts_cfg.get("horizon", 1))},
    {"lookback": int(ts_cfg.get("lookback", 24)), "horizon": max(1, int(ts_cfg.get("horizon", 1)) * 2)},
]

candidate_windows

In [None]:
results = []
for win in candidate_windows:
    lb = win["lookback"]
    hz = win["horizon"]
    try:
        ds = TimeSeriesSequenceDataset.from_dataframe(
            df,
            id_column=id_col,
            time_column=time_col,
            feature_columns=feature_cols,
            target_column=target_col,
            lookback=lb,
            horizon=hz,
        )
        n_seq = len(ds)
        example_x, example_y = ds[0]
        results.append({
            "lookback": lb,
            "horizon": hz,
            "n_sequences": n_seq,
            "x_shape": tuple(example_x.shape),
            "y_shape": tuple(example_y.shape),
        })
    except Exception as exc:
        LOGGER.warning("Failed to build dataset for (lookback=%d, horizon=%d): %s", lb, hz, exc)
        results.append({
            "lookback": lb,
            "horizon": hz,
            "n_sequences": 0,
            "x_shape": None,
            "y_shape": None,
        })

pd.DataFrame(results)

> **How to read this:**
> - If `n_sequences` collapses to a very small number for a candidate, that window is probably too ambitious.
> - Check that `x_shape` and `y_shape` match your expectations (e.g., `(lookback, n_features)` and `(horizon,)`).
> - Choose a **primary** `(lookback, horizon)` based on a balance between context and dataset size.
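
To make the expected shapes concrete, here is a minimal NumPy reference for sliding windows over a single series. It is a sketch under the assumption that window `i` uses rows `[i, i + lookback)` as input and the next `horizon` target values as output. It is not the implementation of `TimeSeriesSequenceDataset`, and `make_windows` is a hypothetical name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(features: np.ndarray, target: np.ndarray, lookback: int, horizon: int):
    """Build X of shape (n, lookback, n_features) and y of shape (n, horizon)."""
    n = len(target) - lookback - horizon + 1
    if n <= 0:
        raise ValueError("series too short for this lookback/horizon")
    # sliding_window_view puts the window axis last: (n_windows, n_features, lookback).
    x = sliding_window_view(features, lookback, axis=0)[:n].transpose(0, 2, 1)
    # Row k of the target view is target[k : k + horizon]; window i needs k = i + lookback.
    y = sliding_window_view(target, horizon)[lookback:lookback + n]
    return x, y


# Toy series: 10 time steps, 2 features.
features = np.arange(20, dtype=float).reshape(10, 2)
target = np.arange(10, dtype=float)
x, y = make_windows(features, target, lookback=4, horizon=2)
print(x.shape, y.shape)
```

Comparing these shapes and the first window's values against `ds[0]` is a quick way to confirm which convention your dataset class follows.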

### 6.1 Visualizing a few windows

To build intuition, we can visualize how windows look for a given `(lookback, horizon)`:

- Plot the input sequence (past) and target (future) on the same axis
- Show a few consecutive windows to see how they overlap

In [None]:
chosen_lb = int(ts_cfg.get("lookback", 24))
chosen_hz = int(ts_cfg.get("horizon", 1))

ds_chosen = TimeSeriesSequenceDataset.from_dataframe(
    df,
    id_column=id_col,
    time_column=time_col,
    feature_columns=feature_cols,
    target_column=target_col,
    lookback=chosen_lb,
    horizon=chosen_hz,
)

print("Chosen lookback:", chosen_lb)
print("Chosen horizon:", chosen_hz)
print("Number of sequences (chosen):", len(ds_chosen))

n_preview = min(3, len(ds_chosen))
for idx in range(n_preview):
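    x_seq, y_target = ds_chosen[idx]
    x_seq = x_seq.numpy()
    # atleast_1d guards against a 0-d target when horizon == 1
    y_target = np.atleast_1d(y_target.numpy())

    # For visualization, plot only the first feature as the input. It shares an axis
    # with the target, so the two lines are comparable only if they are on similar scales.
    past = x_seq[:, 0]  # first feature
    future = y_target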

    t_past = np.arange(len(past))
    t_future = np.arange(len(past), len(past) + len(future))

    plt.figure(figsize=(8, 3))
    plt.plot(t_past, past, marker="o", label="input (feature 0)")
    plt.plot(t_future, future, marker="x", label="target future")
    plt.xlabel("Relative time index")
    plt.ylabel("Value")
    plt.title(f"Window #{idx} – lookback={chosen_lb}, horizon={chosen_hz}")
    plt.legend()
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.tight_layout()
    plt.show()

> This visualization helps you see exactly what the model will observe and what it is asked to predict, using your chosen window configuration.

## 7. Export a candidate window configuration

Based on the EDA and window exploration, you can now record your decisions in a small YAML file, e.g.:

- Confirmed `lookback` and `horizon`
- Optional `stride`
- Notes on frequency / seasonality

You can then merge this into your main `train_ts_baseline.yaml` or keep it as a separate profile. Below is a helper to write a **candidate** config.

In [None]:
import yaml

window_config = {
    "lookback": chosen_lb,
    "horizon": chosen_hz,
    # stride is recorded for downstream pipelines; the dataset calls above do not pass it
    "stride": int(ts_cfg.get("stride", 1)),
    # Free-form notes to remind future-you why you chose these values
    "notes": {
        "frequency_hint": "e.g., hourly/daily – infer from deltas_sec distribution",
        "seasonality_considered": "e.g., daily/weekly; see section 5.2 plots",
        "design_rationale": "describe why this lookback/horizon are appropriate given data coverage",
    },
}

output_config_path = PROJECT_ROOT / "configs" / "time_series" / "time_series_windows_candidate.yaml"
output_config_path.parent.mkdir(parents=True, exist_ok=True)

with output_config_path.open("w", encoding="utf-8") as f:
    yaml.safe_dump(window_config, f, sort_keys=False)

LOGGER.info("Wrote candidate time-series window config to: %s", output_config_path)
output_config_path
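
Merging the candidate back into the main config can be sketched with plain dictionaries. The helper below is hypothetical: it copies only the window keys into the `time_series` section and leaves the free-form `notes` out, which is an assumption about what the training config should carry.

```python
from typing import Any

WINDOW_KEYS = ("lookback", "horizon", "stride")  # keys copied from the candidate file


def merge_window_config(base_cfg: dict[str, Any], candidate: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of base_cfg whose time_series section takes the candidate's window keys."""
    merged = dict(base_cfg)
    ts_section = dict(merged.get("time_series", {}))
    ts_section.update({k: candidate[k] for k in WINDOW_KEYS if k in candidate})
    merged["time_series"] = ts_section
    return merged


base = {"time_series": {"lookback": 24, "horizon": 1, "val_fraction": 0.2}}
candidate = {"lookback": 48, "horizon": 1, "stride": 1, "notes": {"frequency_hint": "hourly"}}
merged = merge_window_config(base, candidate)
print(merged)
```

In practice both dictionaries would come from `yaml.safe_load` on the respective files, and the merged result could be written back with `yaml.safe_dump` as in the cell above.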

## 8. Summary

In this notebook, you:

- Loaded time-series config and data using the same configuration system as your training code
- Analyzed sampling frequency and time coverage per series
- Inspected missing timestamps and large gaps
- Explored target behaviour via raw plots, rolling stats, and simple seasonal patterns
- Experimented with different `(lookback, horizon)` window designs using `TimeSeriesSequenceDataset`
- Visualized example windows to sanity-check model inputs and targets
- Exported a **candidate window configuration** to YAML for use in training scripts

This makes your time-series modelling process **explicit, reproducible, and defensible**:

- Window choices are backed by EDA, not guesswork
- Config and code remain in sync
- Notebooks tell a coherent story to reviewers and hiring managers about how you think about temporal data.