# Tabular Quickstart: ml_tabular Template

This notebook shows how to use the **ml_tabular** template end-to-end for a small tabular problem:

1. Load configuration from a YAML file
2. Inspect the dataset
3. Build PyTorch `DataLoader`s using the template's `TabularDataset`
4. Define a `TabularMLP` model via `TabularMLPConfig`
5. Train and evaluate using the shared `fit` / `evaluate` utilities
6. Save model artifacts
7. (Optional) Enable MLflow experiment tracking

The goal is to demonstrate **how the pieces fit together**, not to get a state-of-the-art model.

## 0. Prerequisites

This notebook assumes:

- You have installed this project in your environment (e.g. from the repo root):

  ```bash
  pip install -e ".[dev,mlops]"
  ```

- You are running this notebook from the project root (or you adjust relative paths accordingly).
- You have a baseline config at `configs/tabular/train_tabular_baseline.yaml`.

If you followed the template, that config should already exist and point to a small CSV under `data/`.

In [None]:
%load_ext autoreload
%autoreload 2

import os
from pathlib import Path

import pandas as pd
import torch
from torch.utils.data import DataLoader

from ml_tabular import (
    AppConfig,
    PathsConfig,
    get_config,
    get_paths,
    get_logger,
    TabularDataset,
    TabularMLP,
    TabularMLPConfig,
    train_one_epoch,
    evaluate,
    fit,
    EarlyStopping,
)

LOGGER = get_logger(__name__)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LOGGER.info("Using device: %s", DEVICE)

## 1. Load configuration

We load the **tabular baseline config** from YAML using the shared `get_config` helper.

This config controls:
- Paths (data, models)
- Training hyperparameters (batch size, epochs, learning rate, etc.)
- Tabular-specific settings (target column, feature columns, task type, etc.)

In [None]:
PROJECT_ROOT = Path.cwd()
CONFIG_PATH = PROJECT_ROOT / "configs" / "tabular" / "train_tabular_baseline.yaml"

assert CONFIG_PATH.exists(), f"Config not found: {CONFIG_PATH}"

cfg: AppConfig = get_config(config_path=CONFIG_PATH, env="dev", force_reload=True)
paths: PathsConfig = get_paths(config_path=CONFIG_PATH, env="dev", force_reload=True)

cfg_dict = cfg.to_dict()
cfg_dict

You should see a nested dictionary that includes at least:

- `env`, `experiment_name`, `log_level`
- `paths` → `data_dir`, `models_dir`, etc.
- `training` → `batch_size`, `num_epochs`, `learning_rate`, `early_stopping`, ...
- `tabular` → `dataset_csv`, `target_column`, `feature_columns`, `task_type`, `num_classes`, ...
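
For orientation, the nested structure can be pictured as a plain dictionary and flattened into dotted keys for easier scanning. This is only a sketch: the key names mirror the list above, every value is a made-up placeholder rather than the template's baseline, and `flatten` is a hypothetical helper that is not part of `ml_tabular`.

```python
from typing import Any

# Hypothetical config: key names follow the list above, values are placeholders.
example_cfg = {
    "env": "dev",
    "experiment_name": "tabular_baseline",
    "log_level": "INFO",
    "paths": {"data_dir": "data", "models_dir": "models"},
    "training": {
        "batch_size": 32,
        "num_epochs": 5,
        "learning_rate": 1e-3,
        "early_stopping": {"patience": 5, "min_delta": 1e-4, "mode": "min"},
    },
    "tabular": {
        "dataset_csv": "example.csv",
        "target_column": "label",
        "feature_columns": ["f1", "f2", "f3"],
        "task_type": "binary",
        "num_classes": 1,
    },
}


def flatten(d: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat: dict[str, Any] = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


flat_cfg = flatten(example_cfg)
for name in sorted(flat_cfg):
    print(f"{name} = {flat_cfg[name]!r}")
```

A flat view like this is also convenient when logging parameters to a tracker, since most trackers expect scalar-like values under simple string keys.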

We can quickly inspect the tabular sub-config:

In [None]:
tab_cfg = cfg_dict.get("tabular", {})
tab_cfg

## 2. Inspect the dataset

We'll load the CSV referenced in the config and do a quick sanity check.

In [None]:
dataset_csv = tab_cfg["dataset_csv"]
data_path = paths.data_dir / dataset_csv
assert data_path.exists(), f"Dataset not found: {data_path}"

df = pd.read_csv(data_path)
LOGGER.info("Loaded dataset with shape %s", df.shape)

df.head()

We'll also verify that the columns referenced in config actually exist.

In [None]:
feature_columns = tab_cfg.get("feature_columns") or []
target_column = tab_cfg["target_column"]

missing_features = [c for c in feature_columns if c not in df.columns]
missing_target = target_column not in df.columns

print("Feature columns:", feature_columns)
print("Target column:", target_column)
print("Missing features:", missing_features)
print("Missing target?", missing_target)

assert not missing_features, "Config references feature columns not in dataset."
assert not missing_target, "Config references a target column not in dataset."

## 3. Build PyTorch DataLoaders

We now turn the `pandas.DataFrame` into a `TabularDataset`, and then into `DataLoader`s for training and validation.

The `TabularDataset` handles:
- Column selection
- Conversion to float tensors
- Simple metadata (feature names, target name, etc.)
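
To make those responsibilities concrete, here is a minimal sketch of a dataset with the same three duties. `SketchTabularDataset` is a hypothetical stand-in written for illustration; the template's `TabularDataset` may differ in names, dtypes, and target shape.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class SketchTabularDataset(Dataset):
    """Hypothetical stand-in, not the template's TabularDataset."""

    def __init__(self, df: pd.DataFrame, feature_columns: list[str], target_column: str):
        # Column selection, then conversion to float32 tensors.
        self.x = torch.tensor(df[feature_columns].to_numpy(dtype="float32"))
        # Targets are stored with shape (N, 1); this is an assumption of the sketch.
        self.y = torch.tensor(df[target_column].to_numpy(dtype="float32")).unsqueeze(1)
        # Simple metadata kept alongside the tensors.
        self.metadata = {
            "feature_names": list(feature_columns),
            "target_name": target_column,
            "num_samples": len(df),
        }

    def __len__(self) -> int:
        return self.x.shape[0]

    def __getitem__(self, idx: int):
        return self.x[idx], self.y[idx]


toy_df = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "f2": [1, 2, 3], "label": [0, 1, 0]})
toy_ds = SketchTabularDataset(toy_df, feature_columns=["f1", "f2"], target_column="label")
sample_x, sample_y = toy_ds[0]
print(len(toy_ds), sample_x.shape, sample_y.shape)
```

Float targets of shape `(N, 1)` suit binary and regression losses. For multiclass training with `nn.CrossEntropyLoss`, PyTorch expects integer class indices of shape `(N,)` instead, so a real implementation has to branch on the task type.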

We’ll use a simple **train/validation split** for this quickstart. For real projects you might:
- Use stratified splits
- Use cross-validation
- Reuse pre-split train/val files specified in the config
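
As a sketch of the first option, scikit-learn's `train_test_split` accepts a `stratify` argument that keeps class proportions roughly equal across the two splits. The toy frame and the `label` column name are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 15 negatives, 5 positives (25% positive).
toy = pd.DataFrame({
    "f1": range(20),
    "label": [0] * 15 + [1] * 5,
})

strat_train, strat_val = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=toy["label"],  # preserve the label distribution in both splits
)

print("train positive rate:", strat_train["label"].mean())
print("val positive rate:", strat_val["label"].mean())
```

Stratification matters most for small or imbalanced classification datasets, where a plain random split can leave the validation set with very few examples of the minority class. It does not apply directly to regression targets.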

In [None]:
from sklearn.model_selection import train_test_split

test_size = tab_cfg.get("val_size", 0.2)
random_seed = cfg.training.random_seed if hasattr(cfg, "training") else 42

train_df, val_df = train_test_split(
    df,
    test_size=test_size,
    random_state=random_seed,
    shuffle=True,
)

LOGGER.info("Train shape: %s, Val shape: %s", train_df.shape, val_df.shape)

train_df.head()

In [None]:
train_ds = TabularDataset.from_dataframe(
    train_df,
    feature_columns=feature_columns,
    target_column=target_column,
)

val_ds = TabularDataset.from_dataframe(
    val_df,
    feature_columns=feature_columns,
    target_column=target_column,
)

print("Train samples:", len(train_ds))
print("Val samples:", len(val_ds))
print("Metadata:", train_ds.metadata)

x0, y0 = train_ds[0]
print("Single sample X shape:", x0.shape)
print("Single sample y shape:", y0.shape)

In [None]:
batch_size = cfg.training.batch_size if hasattr(cfg, "training") else 32

train_loader = DataLoader(
    train_ds,
    batch_size=batch_size,
    shuffle=True,
)

val_loader = DataLoader(
    val_ds,
    batch_size=batch_size,
    shuffle=False,
)

batch_x, batch_y = next(iter(train_loader))
print("Batch X shape:", batch_x.shape)
print("Batch y shape:", batch_y.shape)

## 4. Define `TabularMLP` via `TabularMLPConfig`

We now set up the MLP model. We infer:

- `input_dim` from the number of features
- `output_dim` and `task_type` from the tabular config (binary, multiclass, or regression)

The idea is that **model config** is explicit, and driven by the config file + dataset metadata.
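
The mapping from task type to output width can be wrapped in a small helper that also rejects inconsistent settings. `infer_output_dim` is a hypothetical helper, not part of the template, and the rule that multiclass needs at least two classes is an added assumption about how the config should be validated.

```python
def infer_output_dim(task_type: str, num_classes: int = 1) -> int:
    """Return the number of model outputs for a task type (hypothetical helper)."""
    if task_type in ("binary", "regression"):
        # One logit (binary) or one real-valued prediction (regression).
        return 1
    if task_type == "multiclass":
        if num_classes < 2:
            raise ValueError(f"multiclass needs num_classes >= 2, got {num_classes}")
        # One logit per class.
        return int(num_classes)
    raise ValueError(f"Unknown task_type: {task_type}")


print(infer_output_dim("binary"))
print(infer_output_dim("multiclass", num_classes=4))
```

Failing early here is cheaper than discovering a shape mismatch inside the loss function several cells later.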

In [None]:
num_features = len(feature_columns)
task_type = tab_cfg.get("task_type", "binary")
num_classes = tab_cfg.get("num_classes", 1)

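if task_type in ("binary", "regression"):
    output_dim = 1
elif task_type == "multiclass":
    output_dim = int(num_classes)
    # A multiclass head needs one logit per class, so fewer than 2 is a config error
    if output_dim < 2:
        raise ValueError(f"multiclass requires num_classes >= 2, got {num_classes}")
else:
    raise ValueError(f"Unknown task_type in config: {task_type}")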

hidden_dims = tab_cfg.get("hidden_dims") or [64, 32]

model_cfg = TabularMLPConfig(
    input_dim=num_features,
    hidden_dims=hidden_dims,
    output_dim=output_dim,
    activation="relu",
    dropout=float(tab_cfg.get("dropout", 0.1)),
    batch_norm=bool(tab_cfg.get("batch_norm", False)),
    layer_norm=bool(tab_cfg.get("layer_norm", False)),
    task_type=task_type,
)

model = TabularMLP(model_cfg).to(DEVICE)
model

## 5. Train and evaluate with shared training utilities

We can now:

1. Define an optimizer and loss function
2. Optionally configure **early stopping**
3. Call `fit(...)` to run the training loop using the shared utilities from the template.

This mirrors what your `train_tabular_mlp.py` script does, but in an interactive setting.
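
For intuition about step 2, early stopping with `patience`, `min_delta`, and `mode` can be sketched in a few lines. `SketchEarlyStopping` is a hypothetical illustration of the general technique; the template's `EarlyStopping` may track state or expose methods differently.

```python
class SketchEarlyStopping:
    """Stop when the monitored value has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 1e-4, mode: str = "min"):
        if mode not in ("min", "max"):
            raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
        self.patience = patience
        self.min_delta = min_delta
        self.mode = mode
        self.best = None
        self.bad_epochs = 0

    def step(self, value: float) -> bool:
        """Record one epoch's value; return True if training should stop."""
        if self.best is None:
            improved = True
        elif self.mode == "min":
            # Only count a drop larger than min_delta as an improvement.
            improved = value < self.best - self.min_delta
        else:
            improved = value > self.best + self.min_delta
        if improved:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = SketchEarlyStopping(patience=2, min_delta=0.01, mode="min")
val_curve = [0.90, 0.70, 0.695, 0.70, 0.71]  # made-up validation losses
stopped_at = None
for epoch, loss in enumerate(val_curve):
    if stopper.step(loss):
        stopped_at = epoch
        break
print("stopped at epoch:", stopped_at, "best:", stopper.best)
```

In this toy curve the drop from 0.70 to 0.695 is smaller than `min_delta`, so it counts as a non-improving epoch. That is the purpose of `min_delta`: it keeps noise-level changes from resetting the patience counter.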

In [None]:
import torch.nn as nn
from torch.optim import Adam

learning_rate = cfg.training.learning_rate if hasattr(cfg, "training") else 1e-3
weight_decay = getattr(cfg.training, "weight_decay", 0.0) if hasattr(cfg, "training") else 0.0
num_epochs = cfg.training.num_epochs if hasattr(cfg, "training") else 5

# Regression: MSELoss; binary: BCEWithLogitsLoss; multiclass: CrossEntropyLoss on logits
if task_type == "regression":
    loss_fn = nn.MSELoss()
elif task_type == "binary":
    loss_fn = nn.BCEWithLogitsLoss()
else:
    loss_fn = nn.CrossEntropyLoss()

optimizer = Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

# Optional: early stopping based on validation loss
es_cfg = cfg_dict.get("training", {}).get("early_stopping", {})
if es_cfg:
    early_stopping = EarlyStopping(
        patience=int(es_cfg.get("patience", 5)),
        min_delta=float(es_cfg.get("min_delta", 1e-4)),
        mode=es_cfg.get("mode", "min"),
    )
else:
    early_stopping = None

history = fit(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    loss_fn=loss_fn,
    num_epochs=num_epochs,
    device=DEVICE,
    early_stopping=early_stopping,
)

history

You should see `train_losses` and `val_losses` over epochs. Let’s plot them quickly.

In [None]:
import matplotlib.pyplot as plt

train_losses = history["train_losses"]
val_losses = history["val_losses"]

plt.figure(figsize=(6, 4))
plt.plot(train_losses, marker="o", label="train")
plt.plot(val_losses, marker="o", label="val")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss")
plt.legend()
plt.grid(True)
plt.show()
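
Beyond the plot, the history can be summarized numerically. The sketch below assumes only that `history` maps `train_losses` and `val_losses` to per-epoch lists, as used above; the numbers are made up.

```python
toy_history = {
    "train_losses": [0.69, 0.52, 0.41, 0.35, 0.31],
    "val_losses": [0.68, 0.55, 0.47, 0.48, 0.50],
}

val = toy_history["val_losses"]
# 0-based index of the epoch with the lowest validation loss.
best_epoch = min(range(len(val)), key=val.__getitem__)
# Difference between final validation and training loss.
gap = val[-1] - toy_history["train_losses"][-1]

print(f"best epoch: {best_epoch}, best val loss: {val[best_epoch]:.2f}")
print(f"final val - train gap: {gap:.2f}")
```

A validation loss that rises after the best epoch while the training loss keeps falling is the usual sign of overfitting, which is the situation early stopping is meant to catch.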

## 6. Save model artifacts

Finally, we save the trained model to the `models` directory defined by the config.

This mirrors what your CLI script does, but gives you full control in the notebook.

In [None]:
paths.models_dir.mkdir(parents=True, exist_ok=True)
model_path = paths.models_dir / f"tabular_mlp_{cfg.experiment_name}.pt"

torch.save({
    "model_state_dict": model.state_dict(),
    "model_cfg": model_cfg.model_dump(),
    "config": cfg_dict,
}, model_path)

LOGGER.info("Saved model to: %s", model_path)
model_path
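
Reloading follows the reverse path: rebuild the model from the stored config, then restore the weights. The sketch below uses a plain `nn.Linear` and a temporary file so it runs on its own. With the template you would presumably rebuild `TabularMLPConfig` from the saved `model_cfg` dict instead, which is an assumption about its constructor.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn

# Stand-in model and config; the notebook would use TabularMLP / TabularMLPConfig.
toy_cfg = {"in_features": 4, "out_features": 1}
toy_model = nn.Linear(**toy_cfg)

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = Path(tmp) / "toy_checkpoint.pt"
    torch.save(
        {"model_state_dict": toy_model.state_dict(), "model_cfg": toy_cfg},
        ckpt_path,
    )
    # map_location="cpu" lets a checkpoint saved on GPU load on a CPU-only machine.
    ckpt = torch.load(ckpt_path, map_location="cpu")

# Rebuild the architecture from the saved config, then restore the weights.
restored = nn.Linear(**ckpt["model_cfg"])
restored.load_state_dict(ckpt["model_state_dict"])
restored.eval()  # inference mode for layers such as dropout and batch norm

same = all(
    torch.equal(a, b)
    for a, b in zip(toy_model.state_dict().values(), restored.state_dict().values())
)
print("weights restored:", same)
```

Storing the model config next to the weights is what makes this possible without the original notebook: the checkpoint alone carries enough information to reconstruct the architecture.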

## 7. (Optional) Enable MLflow tracking

If you installed the `mlops` extra and set up MLflow, you can wrap training with the template's MLflow helper.

This section is optional; comment it out if you don't have MLflow configured yet.

In [None]:
from ml_tabular.mlops.mlflow_utils import (
    is_mlflow_available,
    mlflow_run,
    log_params,
    log_metrics,
    log_artifact,
)

if is_mlflow_available():
    import mlflow

    # Path.as_uri() requires an absolute path, so resolve before converting
    tracking_uri = os.getenv("MLFLOW_TRACKING_URI") or (paths.base_dir / "mlruns").resolve().as_uri()
    experiment_name = cfg.experiment_name

    with mlflow_run(
        enabled=True,
        experiment_name=experiment_name,
        run_name="tabular_quickstart",
        tracking_uri=tracking_uri,
        tags={"template": "ml_tabular", "notebook": "00_tabular_quickstart"},
    ):
        # Log high-level config and hyperparameters
        log_params({
            "task_type": task_type,
            "input_dim": num_features,
            "hidden_dims": hidden_dims,
            "output_dim": output_dim,
            "learning_rate": learning_rate,
            "batch_size": batch_size,
            "num_epochs": 3,  # the demo run below trains for 3 epochs, not cfg.training.num_epochs
        })

        # Re-run a short training just for demonstration
        model2 = TabularMLP(model_cfg).to(DEVICE)
        optimizer2 = Adam(model2.parameters(), lr=learning_rate, weight_decay=weight_decay)

        history2 = fit(
            model=model2,
            train_loader=train_loader,
            val_loader=val_loader,
            optimizer=optimizer2,
            loss_fn=loss_fn,
            num_epochs=3,
            device=DEVICE,
            early_stopping=None,
        )

        # Log final metrics
        log_metrics({
            "final_train_loss": float(history2["train_losses"][-1]),
            "final_val_loss": float(history2["val_losses"][-1]),
        })

        # Log the model saved in section 6 (from the first training run, not model2) as an artifact
        if model_path.exists():
            log_artifact(model_path, artifact_path="models")
else:
    print("MLflow not available; skipping MLflow tracking demo.")
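
The cell above falls back to a local `file://` tracking URI when `MLFLOW_TRACKING_URI` is unset. One detail worth knowing for that fallback: `Path.as_uri()` only works on absolute paths and raises `ValueError` for relative ones. A small standalone sketch, with `mlruns` as an arbitrary directory name:

```python
from pathlib import Path

relative = Path("mlruns")

rejected = False
try:
    relative.as_uri()
except ValueError as exc:
    rejected = True
    print("relative path rejected:", exc)

# resolve() makes the path absolute, relative to the current working directory.
local_uri = relative.resolve().as_uri()
print(local_uri)
```

Because the resolved path depends on the working directory, running the notebook from somewhere other than the project root would create a separate `mlruns` directory there.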

## Summary

In this quickstart, you:

- Loaded configuration via a strongly-typed `AppConfig`
- Used `PathsConfig` to resolve data and model directories
- Built a `TabularDataset` and `DataLoader`s from a CSV
- Defined a `TabularMLP` using `TabularMLPConfig`
- Trained the model with the shared `fit` utility
- Saved model artifacts and (optionally) logged them to MLflow

This demonstrates the **end-to-end story** your template is designed to tell: from config-driven setup, through a clean data/model/training stack, to reproducible artifacts and optional experiment tracking.