# 90 – Dev Scratchpad

This notebook is a **developer scratchpad** for the `ml_tabular` project.

Its narrative is intentionally loose and unpolished, but its **engineering** structure is
deliberate. The goals are:

- Give you a safe place to **experiment** with code, configs, datasets, and models
- Provide a few **ready-made smoke tests** for key components
- Make it easy to debug issues without editing your core package

You can freely modify cells in this notebook during development. If it gets too messy,
you can always reset it from version control.

> Guiding principle: this notebook is **for you**, not for end users. Use it to move fast
> and debug deeply, while keeping your main notebooks clean and narrative-focused.

## 0. Setup & imports

We enable autoreload so that edits to the `ml_tabular` package (and submodules) are picked
up without restarting the kernel.

We also import the key building blocks that you are most likely to poke at:

- Config & paths helpers: `get_config`, `get_paths`
- Logging helper: `get_logger`
- Datasets: `TabularDataset`, `TimeSeriesSequenceDataset`
- Models: `TabularMLP`
- Training utilities: `train_one_epoch`, `evaluate`, `fit`, `EarlyStopping`
- Optional MLOps helpers: MLflow utilities (if available)

You can extend this section with any additional utilities you add later.
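
When deciding whether an optional dependency is worth importing at all, it can help to probe for it first. The following is a minimal sketch using only the standard library; `has_module` is a hypothetical helper introduced here for illustration and is not part of `ml_tabular`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be located by the import system.

    Hypothetical helper: it only locates the module (parent packages of a
    dotted name are imported as a side effect), it does not run the module.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError);
        # some modules without a spec raise ValueError.
        return False


print("mlflow importable:", has_module("mlflow"))
```

A probe like this only says that the package can be found; the `try`/`except` import in the setup cell is still what decides whether the project's own MLflow helpers load.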

In [None]:
%load_ext autoreload
%autoreload 2

from pathlib import Path

import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader

import matplotlib.pyplot as plt

from ml_tabular import (
    get_config,
    get_paths,
    get_logger,
    TabularDataset,
    TimeSeriesSequenceDataset,
    TabularMLP,
    train_one_epoch,
    evaluate,
    fit,
    EarlyStopping,
)

try:
    from ml_tabular.mlops.mlflow_utils import (
        is_mlflow_available,
        mlflow_run,
        log_params,
        log_metrics,
        log_artifact,
    )
    HAS_MLFLOW_UTILS = True
except Exception:
    HAS_MLFLOW_UTILS = False

LOGGER = get_logger(__name__)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LOGGER.info("Scratchpad using device: %s", DEVICE)

PROJECT_ROOT = Path.cwd()
print("PROJECT_ROOT:", PROJECT_ROOT)

TABULAR_CONFIG = PROJECT_ROOT / "configs" / "tabular" / "train_tabular_baseline.yaml"
TS_CONFIG = PROJECT_ROOT / "configs" / "time_series" / "train_ts_baseline.yaml"
print("Tabular config:", TABULAR_CONFIG)
print("Time-series config:", TS_CONFIG)

print("Tabular config exists:", TABULAR_CONFIG.exists())
print("Time-series config exists:", TS_CONFIG.exists())

> **Tip:** `PROJECT_ROOT` is the kernel's working directory, which is not necessarily the
> repository root. If either `exists` check above prints `False`:
>
> - Adjust `PROJECT_ROOT`, `TABULAR_CONFIG`, or `TS_CONFIG` to match your layout, or
> - Create the baseline configs from your template files before using this notebook.
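
If the notebook is launched from a subdirectory, one way to adjust the paths is to walk upward until a marker directory appears. This is a minimal sketch, assuming the repository root is the nearest ancestor containing a `configs` directory; `find_project_root` is a hypothetical helper, not something `ml_tabular` provides.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "configs") -> Path:
    """Return the nearest ancestor of `start` (inclusive) containing `marker`.

    Falls back to `start` itself when no ancestor contains the marker, so the
    caller always gets a usable path and can still run the `exists` checks.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return start


print("Detected project root:", find_project_root(Path.cwd()))
```

Assigning the result to `PROJECT_ROOT` before building `TABULAR_CONFIG` and `TS_CONFIG` would make the setup cell less sensitive to where Jupyter was started.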

## 1. Quick config inspection

This section is for quickly inspecting your config objects and confirming that environment
variables / YAML are being applied as you expect.

- Load each config (if present)
- Print out key sections
- Inspect resolved paths

You can freely tweak and re-run this cell as you iterate on config design.
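
To see exactly which settings an environment variable or override changed, it can help to diff two config dictionaries leaf by leaf. The following is a sketch on plain nested dicts with string keys (for example, two results of `to_dict()`); `diff_nested` and the sample values are hypothetical and only illustrate the idea.

```python
def diff_nested(a: dict, b: dict, prefix: str = "") -> dict:
    """Return {dotted_key: (value_in_a, value_in_b)} for every differing leaf."""
    missing = "<missing>"  # placeholder for keys present on one side only
    out = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        va, vb = a.get(key, missing), b.get(key, missing)
        if isinstance(va, dict) and isinstance(vb, dict):
            # Recurse into sub-sections and keep the dotted path.
            out.update(diff_nested(va, vb, prefix=f"{path}."))
        elif va != vb:
            out[path] = (va, vb)
    return out


# Made-up values purely for illustration.
base = {"training": {"learning_rate": 0.1, "batch_size": 32}}
override = {"training": {"learning_rate": 0.01, "batch_size": 32}, "env": "dev"}
print(diff_nested(base, override))
```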

In [None]:
if TABULAR_CONFIG.exists():
    cfg_tab = get_config(config_path=TABULAR_CONFIG, env="dev", force_reload=True)
    paths_tab = get_paths(config_path=TABULAR_CONFIG, env="dev", force_reload=True)

    print("[TABULAR CONFIG]")
    print("env:", cfg_tab.env)
    print("log_level:", cfg_tab.log_level)
    print("experiment_name:", cfg_tab.experiment_name)
    print("paths:", cfg_tab.paths)
    print("training:", cfg_tab.training)
    print("resolved_paths:", paths_tab)
else:
    print("Tabular config not found; skip or adjust TABULAR_CONFIG path.")

print("\n---\n")

if TS_CONFIG.exists():
    cfg_ts = get_config(config_path=TS_CONFIG, env="dev", force_reload=True)
    paths_ts = get_paths(config_path=TS_CONFIG, env="dev", force_reload=True)

    print("[TIME-SERIES CONFIG]")
    print("env:", cfg_ts.env)
    print("log_level:", cfg_ts.log_level)
    print("experiment_name:", cfg_ts.experiment_name)
    print("paths:", cfg_ts.paths)
    print("training:", cfg_ts.training)
    print("time_series section:", cfg_ts.to_dict().get("time_series", {}))
    print("resolved_paths:", paths_ts)
else:
    print("Time-series config not found; skip or adjust TS_CONFIG path.")

## 2. Tabular pipeline smoke test

This is a **lightweight smoke test** for the tabular path:

1. Load the baseline tabular config
2. Load the configured CSV into a pandas DataFrame
3. Create a `TabularDataset` and `DataLoader`
4. Instantiate a small `TabularMLP`
5. Run one forward pass to verify shapes and data wiring

If anything breaks here, you can debug directly from within the notebook without touching
your training scripts.
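
A common failure in steps 2 and 3 is a mismatch between the configured column names and the CSV header. The following is a minimal sketch of a pre-check that can run before the dataset is built; `check_columns` is a hypothetical helper, and it assumes that an unset feature list is acceptable (whether `TabularDataset` infers features in that case is not confirmed here).

```python
from typing import Iterable, List, Optional


def check_columns(
    available: Iterable[str],
    feature_columns: Optional[Iterable[str]],
    target_column: Optional[str],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    available_set = set(available)
    problems = []
    if target_column is None:
        problems.append("target column is not set")
    elif target_column not in available_set:
        problems.append(f"target column missing from data: {target_column!r}")
    # Assumption: feature_columns=None means "not configured", so skip it.
    for col in feature_columns or []:
        if col not in available_set:
            problems.append(f"feature column missing from data: {col!r}")
        if col == target_column:
            problems.append(f"target column also listed as a feature: {col!r}")
    return problems


print(check_columns(["age", "income", "label"], ["age", "incme"], "label"))
```

In the smoke test this could be called as `check_columns(df_tab.columns, feature_cols, target_col)` right after the CSV is loaded.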

In [None]:
if not TABULAR_CONFIG.exists():
    print("Tabular config missing; skipping tabular smoke test.")
else:
    cfg_tab = get_config(config_path=TABULAR_CONFIG, env="dev", force_reload=True)
    paths_tab = get_paths(config_path=TABULAR_CONFIG, env="dev", force_reload=True)
    cfg_tab_dict = cfg_tab.to_dict()

    tab_cfg = cfg_tab_dict.get("tabular", {})
    dataset_csv = tab_cfg.get("dataset_csv")
    if dataset_csv is None:
        raise ValueError("tabular.dataset_csv not set in config.")

    data_path = paths_tab.data_dir / dataset_csv
    print("Tabular data path:", data_path)
    assert data_path.exists(), f"Dataset not found: {data_path}"

    df_tab = pd.read_csv(data_path)
    print("Loaded tabular dataset:")
    display(df_tab.head())
    print("Shape:", df_tab.shape)

    target_col = tab_cfg.get("target_column")
    feature_cols = tab_cfg.get("feature_columns")
    problem_type = tab_cfg.get("problem_type", "regression")

    print("Target column:", target_col)
    print("Feature columns:", feature_cols)
    print("Problem type:", problem_type)

    dataset_tab = TabularDataset.from_dataframe(
        df_tab,
        feature_columns=feature_cols,
        target_column=target_col,
        problem_type=problem_type,
    )

    print("Tabular dataset metadata:", dataset_tab.metadata)

    batch_size = cfg_tab.training.batch_size
    loader_tab = DataLoader(dataset_tab, batch_size=batch_size, shuffle=True)

    x_batch, y_batch = next(iter(loader_tab))
    print("x_batch shape:", x_batch.shape)
    print("y_batch shape:", y_batch.shape)

    model_cfg = cfg_tab_dict.get("tabular_model", {})
    input_dim = x_batch.shape[1]
    hidden_dims = model_cfg.get("hidden_dims", [128, 64])
    dropout = float(model_cfg.get("dropout", 0.1))

    num_classes = None
    if problem_type == "classification":
        num_classes = int(dataset_tab.metadata.get("num_classes", 2))

    model_tab = TabularMLP(
        input_dim=input_dim,
        hidden_dims=hidden_dims,
        output_dim=1 if problem_type == "regression" else num_classes,
        dropout=dropout,
        problem_type=problem_type,
    ).to(DEVICE)

    model_tab.eval()  # disable dropout so the forward pass is deterministic
    with torch.no_grad():
        preds = model_tab(x_batch.to(DEVICE))
    print("Forward pass OK – preds shape:", preds.shape)

## 3. Time-series pipeline smoke test

Similar to the tabular smoke test, but for the time-series path:

1. Load the baseline time-series config
2. Load the configured CSV
3. Build a `TimeSeriesSequenceDataset`
4. Inspect shapes for a few sequences

You can extend this by plugging in a sequence model (e.g. GRU, TCN) and running a small
training loop directly from here when debugging.
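
When the dataset produces fewer sequences than expected, the window arithmetic is usually the first thing to check. The sketch below assumes each sequence takes `lookback` consecutive input steps followed by `horizon` target steps, with a stride of 1 inside a single series; `TimeSeriesSequenceDataset` may window differently, so treat `count_windows` and `window_bounds` as hypothetical illustrations.

```python
from typing import Iterator, Tuple


def count_windows(series_length: int, lookback: int, horizon: int) -> int:
    """Number of stride-1 windows that fit in one series (never negative)."""
    return max(0, series_length - lookback - horizon + 1)


def window_bounds(
    series_length: int, lookback: int, horizon: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (input_start, input_end, target_end) with end-exclusive indices.

    Inputs cover [input_start, input_end); targets cover [input_end, target_end).
    """
    for start in range(count_windows(series_length, lookback, horizon)):
        yield start, start + lookback, start + lookback + horizon


print(count_windows(series_length=30, lookback=24, horizon=1))
print(list(window_bounds(series_length=5, lookback=3, horizon=1)))
```

Under this assumption a series shorter than `lookback + horizon` yields no windows at all, which is one way to end up with an empty dataset.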

In [None]:
if not TS_CONFIG.exists():
    print("Time-series config missing; skipping time-series smoke test.")
else:
    cfg_ts = get_config(config_path=TS_CONFIG, env="dev", force_reload=True)
    paths_ts = get_paths(config_path=TS_CONFIG, env="dev", force_reload=True)
    cfg_ts_dict = cfg_ts.to_dict()
    ts_cfg = cfg_ts_dict.get("time_series", {})

    dataset_csv = ts_cfg.get("dataset_csv")
    if dataset_csv is None:
        raise ValueError("time_series.dataset_csv not set in config.")

    data_path = paths_ts.data_dir / dataset_csv
    print("Time-series data path:", data_path)
    assert data_path.exists(), f"Dataset not found: {data_path}"

    df_ts = pd.read_csv(data_path)
    print("Loaded time-series dataset:")
    display(df_ts.head())
    print("Shape:", df_ts.shape)

    id_col = ts_cfg.get("id_column")
    time_col = ts_cfg["time_column"]
    target_col = ts_cfg["target_column"]
    feature_cols = ts_cfg.get("feature_columns") or []
    lookback = int(ts_cfg.get("lookback", 24))
    horizon = int(ts_cfg.get("horizon", 1))

    print("ID column:", id_col)
    print("Time column:", time_col)
    print("Target column:", target_col)
    print("Feature columns:", feature_cols)
    print("lookback:", lookback, "horizon:", horizon)

    df_ts[time_col] = pd.to_datetime(df_ts[time_col], errors="raise")
    if id_col is not None:
        df_ts = df_ts.sort_values([id_col, time_col]).reset_index(drop=True)
    else:
        df_ts = df_ts.sort_values(time_col).reset_index(drop=True)

    dataset_ts = TimeSeriesSequenceDataset.from_dataframe(
        df_ts,
        id_column=id_col,
        time_column=time_col,
        feature_columns=feature_cols,
        target_column=target_col,
        lookback=lookback,
        horizon=horizon,
    )

    print("Time-series dataset metadata:", dataset_ts.metadata)
    print("Number of sequences:", len(dataset_ts))

    if len(dataset_ts) > 0:
        x_seq, y_target = dataset_ts[0]
        print("First seq X shape:", x_seq.shape)
        print("First seq y shape:", y_target.shape)
    else:
        print("WARNING: dataset produced zero sequences – check lookback/horizon vs series length.")

## 4. Minimal training loop sanity check (tabular)

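This optional section runs a **very small training loop** on the tabular pipeline to check that:

- Loss decreases
- `train_one_epoch` / `evaluate` / `fit` work end-to-end

The validation loader reuses the training data, so the reported validation loss only
confirms wiring, not generalization.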

You can keep epochs tiny (e.g. 2–3) to make this fast and use it to debug training code
without touching your main scripts.
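
The "loss decreases" check can be made explicit instead of read off the printed history. This is a minimal sketch that assumes you can pull a plain list of per-epoch loss values out of whatever `fit` returns; the structure of that return value is not specified here, and `loss_decreased` is a hypothetical helper.

```python
from typing import Sequence


def loss_decreased(losses: Sequence[float], min_rel_drop: float = 0.0) -> bool:
    """True if the last loss is lower than the first by more than `min_rel_drop`.

    `min_rel_drop` is a fraction of the first loss, e.g. 0.05 for a 5% drop.
    Fewer than two epochs cannot show a decrease, so that case returns False.
    """
    if len(losses) < 2:
        return False
    first, last = losses[0], losses[-1]
    return (first - last) > min_rel_drop * abs(first)


# Made-up numbers purely for illustration.
print(loss_decreased([1.00, 0.80, 0.70]))
print(loss_decreased([1.00, 0.95], min_rel_drop=0.10))
```

With only two or three epochs the loss can be noisy, so a failed check is a prompt to look closer rather than proof of a bug.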

> If you don't want to run training from the notebook, just skip this cell.

In [None]:
RUN_TRAINING_SANITY = False  # flip to True when you want to run this

if RUN_TRAINING_SANITY and TABULAR_CONFIG.exists():
    cfg_tab = get_config(config_path=TABULAR_CONFIG, env="dev", force_reload=True)
    paths_tab = get_paths(config_path=TABULAR_CONFIG, env="dev", force_reload=True)
    cfg_tab_dict = cfg_tab.to_dict()
    tab_cfg = cfg_tab_dict.get("tabular", {})

    dataset_csv = tab_cfg["dataset_csv"]
    data_path = paths_tab.data_dir / dataset_csv
    df_tab = pd.read_csv(data_path)

    target_col = tab_cfg["target_column"]
    feature_cols = tab_cfg["feature_columns"]
    problem_type = tab_cfg.get("problem_type", "regression")

    dataset_tab = TabularDataset.from_dataframe(
        df_tab,
        feature_columns=feature_cols,
        target_column=target_col,
        problem_type=problem_type,
    )

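    batch_size = cfg_tab.training.batch_size
    loader_tab = DataLoader(dataset_tab, batch_size=batch_size, shuffle=True)
    # Smoke test only: validation reuses the training data, so val loss is not a generalization estimate.
    val_loader_tab = DataLoader(dataset_tab, batch_size=batch_size, shuffle=False)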

    import torch.nn as nn
    from torch.optim import Adam

    x_sample, _ = dataset_tab[0]
    input_dim = x_sample.shape[0]
    model_cfg = cfg_tab_dict.get("tabular_model", {})
    hidden_dims = model_cfg.get("hidden_dims", [128, 64])
    dropout = float(model_cfg.get("dropout", 0.1))

    num_classes = None
    if problem_type == "classification":
        num_classes = int(dataset_tab.metadata.get("num_classes", 2))

    model_tab = TabularMLP(
        input_dim=input_dim,
        hidden_dims=hidden_dims,
        output_dim=1 if problem_type == "regression" else num_classes,
        dropout=dropout,
        problem_type=problem_type,
    ).to(DEVICE)

    if problem_type == "regression":
        loss_fn = nn.MSELoss()
    else:
        loss_fn = nn.CrossEntropyLoss()

    lr = cfg_tab.training.learning_rate
    wd = getattr(cfg_tab.training, "weight_decay", 0.0)
    optimizer = Adam(model_tab.parameters(), lr=lr, weight_decay=wd)

    num_epochs = min(cfg_tab.training.num_epochs, 3)
    early_stopping = EarlyStopping(patience=2, min_delta=1e-4, mode="min")

    history = fit(
        model=model_tab,
        train_loader=loader_tab,
        val_loader=val_loader_tab,
        optimizer=optimizer,
        loss_fn=loss_fn,
        num_epochs=num_epochs,
        device=DEVICE,
        early_stopping=early_stopping,
    )

    print("History:", history)
else:
    print("RUN_TRAINING_SANITY is False or tabular config missing; skipping training sanity check.")

## 5. MLflow smoke test (optional)

If you have MLflow installed and configured, you can use this section to sanity-check
your **`ml_tabular.mlops.mlflow_utils`** helpers:

- Confirm that a run can be created
- Log params / metrics / artifacts

You can keep this test extremely simple (e.g., log a dummy metric and a text file) just to
validate wiring.
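
For the dummy text file, one option is to create it in a temporary directory so nothing is left behind in the project root. The following is a sketch using only the standard library; `with_temp_artifact` is a hypothetical helper, and `log_fn` stands in for whatever artifact-logging callable you use (for example, the project's `log_artifact`).

```python
import tempfile
from pathlib import Path
from typing import Callable


def with_temp_artifact(
    text: str, log_fn: Callable[[Path], None], filename: str = "dummy.txt"
) -> Path:
    """Write `text` to a temporary file, pass its path to `log_fn`, then clean up.

    The file only exists while `log_fn` runs, so `log_fn` must copy or upload
    it before returning. The (now deleted) path is returned for reference.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / filename
        path.write_text(text, encoding="utf-8")
        log_fn(path)
    return path


# Stand-in logger for illustration: just report the file size.
with_temp_artifact("Hello from the scratchpad!", lambda p: print(p.name, p.stat().st_size))
```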

> If MLflow is not installed, this section will simply report that and do nothing.

In [None]:
RUN_MLFLOW_SMOKE = False  # flip to True when you want to test MLflow wiring

if RUN_MLFLOW_SMOKE and HAS_MLFLOW_UTILS and is_mlflow_available():
    import os

    tracking_uri = os.getenv("MLFLOW_TRACKING_URI") or (PROJECT_ROOT / "mlruns").as_uri()
    experiment_name = "dev_scratchpad_smoke"

    dummy_text_path = PROJECT_ROOT / "_tmp_mlflow_dummy.txt"
    dummy_text_path.write_text("Hello from ml_tabular dev scratchpad!", encoding="utf-8")

    with mlflow_run(
        enabled=True,
        experiment_name=experiment_name,
        run_name="scratchpad_smoke_test",
        tracking_uri=tracking_uri,
        tags={"context": "dev_scratchpad"},
    ):
        log_params({
            "debug_param": 123,
            "note": "Scratchpad MLflow smoke test",
        })

        log_metrics({
            "dummy_metric": 0.42,
        })

        log_artifact(dummy_text_path, artifact_path="scratchpad")

    dummy_text_path.unlink(missing_ok=True)  # remove the local dummy file once it has been logged
    print("MLflow smoke test completed. Check your tracking UI.")
elif RUN_MLFLOW_SMOKE:
    print("MLflow utils not available or mlflow not installed; skipping MLflow smoke test.")
else:
    print("RUN_MLFLOW_SMOKE is False; MLflow smoke test disabled.")

## 6. Freeform scratch area

Use the cells below as a **playground** for:

- Quick one-off experiments
- Debugging data issues
- Prototyping model changes before wiring them into scripts
- Manual profiling or benchmarking

You can add more sections (with headings) as your workflow evolves. This notebook is meant
to be **alive**, not frozen.

In [None]:
# --- Playground cell ---
# Write any ad-hoc experimentation code here.
# Example: quick check of dtypes in a dataset, or inspecting a specific feature.

print("Scratchpad ready. Add your dev code in this cell or below.")

# Example (commented out):
# if TABULAR_CONFIG.exists():
#     cfg_tab = get_config(config_path=TABULAR_CONFIG, env="dev")
#     paths_tab = get_paths(config_path=TABULAR_CONFIG, env="dev")
#     df_example = pd.read_csv(paths_tab.data_dir / cfg_tab.to_dict()["tabular"]["dataset_csv"])
#     print(df_example.dtypes)
#     display(df_example.describe(include="all").T)