
# Tabular ML End-to-End Demo

This notebook mirrors the stage-wise experience of the `tabular_ml` UI while keeping everything offline, reproducible, and keyboard-friendly. It walks you through a complete pipeline for both classification and regression tasks on tabular data.

## Table of Contents
- [1. Environment Check & Imports](#environment-check)
- [2. Helper Utilities & Global State](#helper-utilities)
- [3. Configuration](#configuration)
- [4. Data Loading](#data-loading)
- [5. Target Detection & Basic Cleaning](#target-detection)
- [6. Exploratory Data Analysis (EDA)](#eda)
- [7. Outlier Detection & Removal](#outliers)
- [8. Preprocessing Pipeline](#preprocessing)
- [9. Data Splitting Strategy](#splitting)
- [10. Model Zoo Setup](#model-zoo)
- [11. Training & Live Progress](#training)
- [12. Evaluation](#evaluation)
- [13. Inference Demo](#inference)
- [14. Run Summary](#run-summary)
- [15. Optional Export Cells](#optional-export)
- [16. Testing Checklist](#testing-checklist)

**Pipeline stages at a glance**
- Data ingestion → exploration → outlier handling → preprocessing → split → modeling → training → evaluation → inference.
- Designed for rapid iteration with sensible defaults and reproducible results.

> **What you'll need**  
> Required: Python ≥3.9, pandas, numpy, scikit-learn, matplotlib.  
> Optional (auto-detected): scipy, xgboost, lightgbm, torch.  
> Run the next cell to see which versions are available in your environment.

<details>
<summary><strong>Quick Demo Path vs Full Walkthrough</strong></summary>

- **Quick Demo:** Keep defaults, use the built-in Titanic-like dataset, and step through the cells sequentially.
- **Full Walkthrough:** Adjust the configuration cell, experiment with custom datasets, and explore optional models/exports.

</details>



## 1. Environment Check & Imports <a id="environment-check"></a>

**What this step does**
- Verifies required/optional dependencies and records their versions.
- Configures matplotlib for notebook-friendly visuals and sets global seeds.
- Establishes a capability matrix so later sections can adapt automatically.

➡️ **Run this cell next.**


In [None]:

from __future__ import annotations

import importlib
import importlib.util  # find_spec lives in this submodule; import it explicitly
import math
import os
import platform
import random
import sys
import time
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Tuple

import numpy as np
import pandas as pd
from IPython.display import Markdown, display, clear_output
import matplotlib.pyplot as plt

from sklearn import __version__ as sklearn_version

OPTIONAL_MODULES: Dict[str, Any] = {}
OPTIONAL_SPECS = {
    "xgboost": "xgboost",
    "lightgbm": "lightgbm",
    "torch": "torch",
}
for friendly_name, import_path in OPTIONAL_SPECS.items():
    if importlib.util.find_spec(import_path) is None:
        continue
    try:
        OPTIONAL_MODULES[friendly_name] = importlib.import_module(import_path)
    except Exception:  # noqa: BLE001 - a broken optional install should not stop the notebook
        pass

HAS_XGBOOST = "xgboost" in OPTIONAL_MODULES
HAS_LIGHTGBM = "lightgbm" in OPTIONAL_MODULES
HAS_TORCH = "torch" in OPTIONAL_MODULES
try:
    import scipy
    from scipy import stats as scipy_stats  # type: ignore
    HAS_SCIPY = True
except Exception:  # noqa: BLE001
    scipy = None
    scipy_stats = None
    HAS_SCIPY = False

version_rows = [
    ("python", platform.python_version()),
    ("pandas", pd.__version__),
    ("numpy", np.__version__),
    ("scikit-learn", sklearn_version),
    ("matplotlib", plt.matplotlib.__version__),
    ("scipy", getattr(scipy, "__version__", "❌ missing")),
]
version_rows.append(("xgboost", OPTIONAL_MODULES.get("xgboost").__version__ if HAS_XGBOOST else "❌ missing"))
version_rows.append(("lightgbm", OPTIONAL_MODULES.get("lightgbm").__version__ if HAS_LIGHTGBM else "❌ missing"))
version_rows.append(("torch", OPTIONAL_MODULES.get("torch").__version__ if HAS_TORCH else "❌ missing"))

version_df = pd.DataFrame(version_rows, columns=["Library", "Version / Status"])
display(Markdown("**Package availability overview**"))
display(version_df)

capability_rows = [
    ("GPU Available (torch)", "✅" if HAS_TORCH and OPTIONAL_MODULES["torch"].cuda.is_available() else "⚠️/❌"),
    ("XGBoost Enabled", "✅" if HAS_XGBOOST else "❌"),
    ("LightGBM Enabled", "✅" if HAS_LIGHTGBM else "❌"),
    ("SciPy Stats", "✅" if HAS_SCIPY else "⚠️ (skew/kurtosis disabled)"),
]
capability_df = pd.DataFrame(capability_rows, columns=["Capability", "Status"])
display(capability_df)

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
if HAS_TORCH:
    OPTIONAL_MODULES["torch"].manual_seed(RANDOM_SEED)

plt.style.use("default")
plt.rcParams.update(
    {
        "figure.figsize": (8, 5),
        "axes.titlesize": 14,
        "axes.labelsize": 12,
        "xtick.labelsize": 11,
        "ytick.labelsize": 11,
        "legend.fontsize": 11,
        "grid.alpha": 0.3,
        "axes.grid": True,
        "figure.autolayout": True,
    }
)
print(f"Global random seed set to {RANDOM_SEED}. Deterministic ops depend on library guarantees.")
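
The optional-dependency check in the cell above can be reduced to a small standalone pattern. The sketch below uses only the standard library; `detect_optional` is a hypothetical helper name that is not part of the notebook, and it assumes top-level module names only.

```python
import importlib.util
from typing import Dict, Iterable


def detect_optional(module_names: Iterable[str]) -> Dict[str, bool]:
    """Report which top-level modules can be located, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


# "json" ships with Python; the second name is deliberately made up.
availability = detect_optional(["json", "surely_not_installed_pkg_xyz"])
print(availability)
```

Note that `find_spec` only locates a module. The import itself can still fail on a broken install, so a located package is not guaranteed to be usable.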



## 2. Helper Utilities & Global State <a id="helper-utilities"></a>

**What this step does**
- Defines reusable helper functions for dataset synthesis, heuristics, plotting, and evaluation.
- Centralises mutable state (`PIPELINE_STATE`) shared across subsequent steps.
- Keeps each helper small and type-annotated so the later cells stay short and readable.

➡️ **Run this cell after the imports.**


In [None]:

from dataclasses import dataclass

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    log_loss,
    mean_absolute_error,
    mean_squared_error,
    precision_recall_curve,
    r2_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import StratifiedKFold, KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler
from sklearn.utils.multiclass import type_of_target

PIPELINE_STATE: Dict[str, Any] = {}


def synthesize_titanic_like(n_rows: int = 800, seed: int = RANDOM_SEED) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    pclass = rng.integers(1, 4, size=n_rows)
    sex = rng.choice(["male", "female"], size=n_rows)
    embarked = rng.choice(["S", "C", "Q"], size=n_rows, p=[0.72, 0.18, 0.10])
    age = np.clip(rng.normal(29.7, 13.0, size=n_rows), 0.4, 80)
    sibsp = rng.integers(0, 5, size=n_rows)
    parch = rng.integers(0, 4, size=n_rows)
    fare = np.round(np.clip(rng.normal(32.0, 49.0, size=n_rows), 4, 512), 2)
    cabin_known = rng.choice([0, 1], size=n_rows, p=[0.7, 0.3])
    title = np.where(age < 16, "Master", np.where(sex == "female", "Mrs", "Mr"))
    family_size = sibsp + parch + 1
    survived_logits = (
        -1.2 * (pclass - 1)
        + 0.8 * (sex == "female").astype(float)
        + 0.3 * cabin_known
        + 0.02 * (family_size <= 4)
        + rng.normal(0, 0.8, size=n_rows)
    )
    survived = (1 / (1 + np.exp(-survived_logits)) > 0.5).astype(int)

    df = pd.DataFrame(
        {
            "PassengerId": np.arange(1, n_rows + 1),
            "Survived": survived,
            "Pclass": pclass,
            "Name_Title": title,
            "Sex": sex,
            "Age": np.round(age, 1),
            "SibSp": sibsp,
            "Parch": parch,
            "FamilySize": family_size,
            "Fare": fare,
            "CabinKnown": cabin_known,
            "Embarked": embarked,
        }
    )

    mask_age = rng.random(n_rows) < 0.1
    mask_embarked = rng.random(n_rows) < 0.03
    df.loc[mask_age, "Age"] = np.nan
    df.loc[mask_embarked, "Embarked"] = np.nan
    df.loc[df.sample(frac=0.02, random_state=seed).index, "Fare"] = np.nan
    return df


def load_local_dataset(path_str: str, max_bytes: int = 50 * 1024 * 1024) -> pd.DataFrame:
    path = Path(path_str).expanduser()
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    if path.stat().st_size > max_bytes:
        raise ValueError(f"File is larger than {max_bytes / (1024 * 1024):.1f} MB; please load a smaller sample.")

    if path.suffix.lower() in {".xls", ".xlsx"}:
        return pd.read_excel(path)

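    try:
        df = pd.read_csv(path)
    except UnicodeDecodeError:
        df = pd.read_csv(path, encoding="latin-1")
    except pd.errors.ParserError:
        return pd.read_csv(path, sep=";")
    # A semicolon-delimited file usually parses without error into one fused column, so retry.
    if df.shape[1] == 1 and ";" in str(df.columns[0]):
        try:
            return pd.read_csv(path, sep=";")
        except UnicodeDecodeError:
            return pd.read_csv(path, sep=";", encoding="latin-1")
    return df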


def memory_usage_mb(df: pd.DataFrame) -> float:
    return float(df.memory_usage(deep=True).sum() / (1024 ** 2))


def suggest_target_column(df: pd.DataFrame) -> Optional[str]:
    priority_names = ["target", "label", "survived", "outcome", "y"]
    lower_name_map = {col.lower(): col for col in df.columns}
    for name in priority_names:
        if name in lower_name_map:
            return lower_name_map[name]
    binary_candidates = [
        col
        for col in df.columns
        if df[col].nunique(dropna=True) <= 2 and df[col].dtype != object and not col.lower().startswith("id")
    ]
    if binary_candidates:
        return binary_candidates[0]
    if df.columns.size >= 2:
        return df.columns[-1]
    return None


def drop_redundant_columns(df: pd.DataFrame) -> Tuple[pd.DataFrame, Dict[str, List[str]]]:
    dropped: Dict[str, List[str]] = {"constant": [], "duplicates": []}
    for col in df.columns:
        if df[col].nunique(dropna=False) <= 1:
            dropped["constant"].append(col)
    df = df.drop(columns=dropped["constant"], errors="ignore")
    duplicate_mask = df.T.duplicated()
    duplicate_cols = df.columns[duplicate_mask]
    if len(duplicate_cols) > 0:
        dropped["duplicates"] = list(duplicate_cols)
        df = df.loc[:, ~duplicate_mask]
    return df, dropped


def resolve_task_type(y: pd.Series, explicit: str = "auto") -> str:
    if explicit in {"classification", "regression"}:
        return explicit
    inferred = type_of_target(y)
    if inferred in {"binary", "multiclass", "multilabel-indicator"}:
        return "classification"
    return "regression"


def describe_numeric(df: pd.DataFrame) -> pd.DataFrame:
    summary = df.describe().T
    if HAS_SCIPY:
        summary["skew"] = df.apply(lambda s: scipy_stats.skew(s.dropna()) if s.dropna().size else np.nan)
        summary["kurtosis"] = df.apply(lambda s: scipy_stats.kurtosis(s.dropna()) if s.dropna().size else np.nan)
    summary["missing"] = df.isna().sum()
    return summary


def describe_categorical(df: pd.DataFrame, top_k: int = 5) -> Dict[str, pd.Series]:
    summaries: Dict[str, pd.Series] = {}
    for col in df.columns:
        counts = df[col].astype("object").value_counts(dropna=False).head(top_k)
        summaries[col] = counts
    return summaries


def plot_missingness(df: pd.DataFrame, top_n: int = 15) -> None:
    missing = df.isna().sum().sort_values(ascending=False)
    missing = missing[missing > 0].head(top_n)
    if missing.empty:
        print("No missing values to visualise.")
        return
    plt.figure(figsize=(8, 4))
    plt.bar(missing.index, missing.values, color="#386cb0")
    plt.xticks(rotation=45, ha="right")
    plt.ylabel("Missing count")
    plt.title("Top missing columns")
    plt.show()


def plot_numeric_distributions(df: pd.DataFrame, max_features: int = 4) -> None:
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if not numeric_cols:
        print("No numeric columns available for distribution plots.")
        return
    for col in numeric_cols[:max_features]:
        col_data = df[col].dropna()
        plt.figure(figsize=(7, 4))
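        counts, bin_edges, _ = plt.hist(col_data, bins=30, color="#7fc97f", alpha=0.7, density=False)
        # KDE needs at least two distinct values; a constant column has a singular covariance.
        if HAS_SCIPY and col_data.nunique() > 1:
            kde = scipy_stats.gaussian_kde(col_data)
            xs = np.linspace(col_data.min(), col_data.max(), 200)
            # Scale density to counts: density * n * histogram bin width (not the xs grid step).
            bin_width = bin_edges[1] - bin_edges[0]
            plt.plot(xs, kde(xs) * col_data.size * bin_width, color="#f0027f", linewidth=2, label="KDE")
            plt.legend()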
        plt.title(f"Distribution of {col}")
        plt.xlabel(col)
        plt.ylabel("Frequency")
        plt.show()


def plot_correlation_heatmap(df: pd.DataFrame, max_features: int = 20) -> None:
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if len(numeric_cols) <= 1:
        print("Not enough numeric columns for correlation heatmap.")
        return
    variances = df[numeric_cols].var().sort_values(ascending=False)
    top_cols = variances.head(max_features).index
    corr = df[top_cols].corr()
    plt.figure(figsize=(min(0.5 * len(top_cols) + 4, 12), min(0.5 * len(top_cols) + 4, 12)))
    plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
    plt.colorbar(label="Correlation")
    plt.xticks(range(len(top_cols)), top_cols, rotation=45, ha="right")
    plt.yticks(range(len(top_cols)), top_cols)
    plt.title("Correlation heatmap (top variance numerics)")
    plt.show()


def plot_target_relationships(X: pd.DataFrame, y: pd.Series, task: str, max_features: int = 3) -> None:
    numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
    if task == "classification":
        target_counts = y.value_counts()
        plt.figure(figsize=(6, 4))
        plt.bar(target_counts.index.astype(str), target_counts.values, color="#beaed4")
        plt.title("Target class distribution")
        plt.xlabel("Class")
        plt.ylabel("Count")
        plt.show()
        for col in numeric_cols[:max_features]:
            plt.figure(figsize=(7, 4))
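            classes = sorted(y.dropna().unique())
            data = [X.loc[y == cls, col].dropna() for cls in classes]
            plt.boxplot(data, showmeans=True)
            # Label via xticks: boxplot's `labels` argument is deprecated in matplotlib 3.9+.
            plt.xticks(range(1, len(classes) + 1), [str(cls) for cls in classes])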
            plt.title(f"{col} vs target")
            plt.xlabel("Target class")
            plt.ylabel(col)
            plt.show()
    else:
        target_vals = y.values
        plt.figure(figsize=(6, 4))
        plt.hist(target_vals, bins=30, color="#fdc086", alpha=0.8)
        plt.title("Target distribution (regression)")
        plt.xlabel("Target value")
        plt.ylabel("Frequency")
        plt.show()
        if numeric_cols:
            correlations = (
                X[numeric_cols]
                .apply(lambda col: np.corrcoef(col.fillna(col.mean()), y)[0, 1])
                .abs()
                .sort_values(ascending=False)
            )
            for col in correlations.head(max_features).index:
                plt.figure(figsize=(6, 4))
                plt.scatter(X[col], y, alpha=0.6, color="#386cb0")
                plt.title(f"{col} vs target")
                plt.xlabel(col)
                plt.ylabel("Target")
                plt.show()


def compute_outlier_mask(df: pd.DataFrame, method: str = "IQR", threshold: float = 1.5) -> pd.Series:
    numeric_df = df.select_dtypes(include=[np.number])
    if numeric_df.empty:
        return pd.Series(False, index=df.index)
    if method == "IQR":
        q1 = numeric_df.quantile(0.25)
        q3 = numeric_df.quantile(0.75)
        iqr = q3 - q1
        lower = q1 - threshold * iqr
        upper = q3 + threshold * iqr
        mask = ((numeric_df < lower) | (numeric_df > upper)).any(axis=1)
        return mask
    if method == "ZScore":
        means = numeric_df.mean()
        stds = numeric_df.std(ddof=0).replace(0, np.nan)
        zscores = (numeric_df - means) / stds
        mask = zscores.abs().gt(threshold).any(axis=1)
        return mask.fillna(False)
    if method == "IsolationForest":
        from sklearn.ensemble import IsolationForest

        iso = IsolationForest(random_state=RANDOM_SEED, contamination="auto")
        preds = iso.fit_predict(numeric_df.fillna(numeric_df.median()))
        return pd.Series(preds == -1, index=df.index)
    return pd.Series(False, index=df.index)


def build_preprocessing_pipeline(
    X: pd.DataFrame,
    scaling: Optional[str] = None,
    encoding: Optional[str] = "onehot",
    impute_strategy_num: str = "median",
    impute_strategy_cat: str = "most_frequent",
) -> ColumnTransformer:
    numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = [col for col in X.columns if col not in numeric_cols]

    transformers: List[Tuple[str, Pipeline, List[str]]] = []
    if numeric_cols:
        steps: List[Tuple[str, Any]] = [("imputer", SimpleImputer(strategy=impute_strategy_num))]
        if scaling == "standard":
            steps.append(("scaler", StandardScaler()))
        elif scaling == "minmax":
            steps.append(("scaler", MinMaxScaler()))
        transformers.append(("num", Pipeline(steps=steps), numeric_cols))

    if categorical_cols and encoding:
        if encoding == "onehot":
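            # `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4.
            try:
                encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
            except TypeError:  # scikit-learn < 1.2
                encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)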
        else:
            encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
        transformers.append(
            (
                "cat",
                Pipeline(
                    steps=[
                        ("imputer", SimpleImputer(strategy=impute_strategy_cat, fill_value="missing")),
                        ("encoder", encoder),
                    ]
                ),
                categorical_cols,
            )
        )

    if not transformers:
        raise ValueError("No transformers configured; check your feature columns.")

    return ColumnTransformer(transformers=transformers)


def format_seconds(seconds: float) -> str:
    if seconds < 60:
        return f"{seconds:.1f}s"
    minutes, secs = divmod(seconds, 60)
    if minutes < 60:
        return f"{int(minutes)}m {secs:.0f}s"
    hours, minutes = divmod(minutes, 60)
    return f"{int(hours)}h {int(minutes)}m"


@dataclass
class ModelResult:
    name: str
    pipeline: Pipeline
    train_time: float
    val_metric: Optional[float]
    test_metric: Optional[float]
    metrics: Dict[str, float]
    task_type: str
    additional: Dict[str, Any]

PIPELINE_STATE["helper_ready"] = True
print("Helper utilities loaded. STATE dictionary initialised as PIPELINE_STATE.")
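
`plot_numeric_distributions` draws a density curve over a count histogram, and the two only line up if the density is rescaled to counts. A minimal NumPy-only sketch of that relationship follows; the sample data is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=30.0, scale=10.0, size=500)

counts, edges = np.histogram(values, bins=30)
density, _ = np.histogram(values, bins=30, density=True)
bin_width = edges[1] - edges[0]

# A density integrates to 1, so multiplying by n * bin_width recovers raw counts.
rescaled = density * values.size * bin_width
print(np.allclose(rescaled, counts))
```

The same factor, sample size times histogram bin width, applies to any density estimate that is overlaid on a count-based histogram.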



## 3. Configuration <a id="configuration"></a>

**What this step does**
- Collects key knobs that control dataset handling, preprocessing, and modeling.
- Auto-populates defaults based on environment detection while allowing quick overrides.
- Displays the resolved configuration so every run is transparent and reproducible.

➡️ **Adjust values as needed, then run this cell.**


In [None]:

if not PIPELINE_STATE.get("helper_ready"):
    raise RuntimeError("Run the helper utilities cell before configuring the pipeline.")

config: Dict[str, Any] = {
    "task_type": "auto",  # 'classification' | 'regression' | 'auto'
    "target_column": None,
    "test_size": 0.2,
    "val_size": 0.1,
    "cv_folds": 0,
    "scaling": "standard",  # None | 'standard' | 'minmax'
    "outlier_method": None,  # None | 'IQR' | 'ZScore' | 'IsolationForest'
    "outlier_threshold": 1.5,  # IQR multiplier; for 'ZScore', 3.0 is the more conventional cutoff
    "categorical_encoding": "onehot",  # None | 'onehot' | 'ordinal'
    "enable_pytorch_nn": HAS_TORCH,
    "enable_xgboost": HAS_XGBOOST,
    "enable_lightgbm": HAS_LIGHTGBM,
    "max_rows_preview": 5,
    "impute_numeric": "median",
    "impute_categorical": "most_frequent",
    "allow_export": False,
}

PIPELINE_STATE["config"] = config

display(pd.DataFrame(list(config.items()), columns=["Parameter", "Value"]))
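
Because the configuration is a plain dictionary, a typo in a value only surfaces several cells later. As a sketch, a small validator could catch that early. `validate_config` and `ALLOWED` are hypothetical names introduced here for illustration, and the allowed values are copied from the inline comments in the cell above.

```python
from typing import Any, Dict, List

ALLOWED = {
    "task_type": {"auto", "classification", "regression"},
    "scaling": {None, "standard", "minmax"},
    "outlier_method": {None, "IQR", "ZScore", "IsolationForest"},
    "categorical_encoding": {None, "onehot", "ordinal"},
}


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for key, allowed in ALLOWED.items():
        if cfg.get(key) not in allowed:
            problems.append(f"{key}={cfg.get(key)!r} is not one of {sorted(map(str, allowed))}")
    test_size, val_size = cfg.get("test_size", 0.2), cfg.get("val_size", 0.1)
    if not (0 < test_size < 1) or val_size < 0 or test_size + val_size >= 1:
        problems.append("test_size and val_size must leave a non-empty training set")
    return problems


good = {"task_type": "auto", "scaling": "standard", "outlier_method": None,
        "categorical_encoding": "onehot", "test_size": 0.2, "val_size": 0.1}
bad = {**good, "scaling": "robust", "test_size": 0.95}
print(validate_config(good))
print(validate_config(bad))
```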



## 4. Data Loading <a id="data-loading"></a>

**What this step does**
- Offers a deterministic built-in Titanic-style dataset for offline demos.
- Provides a hook to load local CSV/Excel files with safety checks.
- Summarises the loaded data: shape, dtypes, memory footprint, missing values, and quick previews.

➡️ **Choose the data source and run the cell.**


In [None]:

config = PIPELINE_STATE.get("config", {})

# --- User toggles ---
USE_BUILTIN_DATA = True  # Set to False to load a local file.
CUSTOM_FILE_PATH = ""  # Provide a path when USE_BUILTIN_DATA is False.
PREVIEW_RANDOM_SEED = RANDOM_SEED

if USE_BUILTIN_DATA:
    df_loaded = synthesize_titanic_like(seed=RANDOM_SEED)
    data_source = "Built-in Titanic-like demo"
else:
    if not CUSTOM_FILE_PATH:
        raise ValueError("Please set CUSTOM_FILE_PATH when USE_BUILTIN_DATA is False.")
    df_loaded = load_local_dataset(CUSTOM_FILE_PATH)
    data_source = f"User file: {CUSTOM_FILE_PATH}"

PIPELINE_STATE["df_raw"] = df_loaded.copy()
PIPELINE_STATE["data_source"] = data_source

print(f"Data source: {data_source}")
print(f"Shape: {df_loaded.shape[0]} rows × {df_loaded.shape[1]} columns")
print(f"Memory usage: {memory_usage_mb(df_loaded):.2f} MB")
print("Dtypes:")
display(df_loaded.dtypes.to_frame(name="dtype"))

missing_counts = df_loaded.isna().sum()
if missing_counts.any():
    display(missing_counts[missing_counts > 0].to_frame(name="missing_values"))
else:
    print("No missing values detected.")

preview_rows = config.get("max_rows_preview", 5)
print(f"Top {preview_rows} rows:")
display(df_loaded.head(preview_rows))
print(f"Bottom {preview_rows} rows:")
display(df_loaded.tail(preview_rows))
print("Random sample:")
display(df_loaded.sample(min(preview_rows, len(df_loaded)), random_state=PREVIEW_RANDOM_SEED))
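
When loading your own file, keep in mind that a semicolon-delimited CSV does not necessarily raise a parsing error with the default comma separator. It can load "successfully" as a single fused column. The sketch below shows the symptom on an in-memory string; the column names and values are made up for illustration.

```python
import io

import pandas as pd

raw = "id;fare;port\n1;7.25;S\n2;71.28;C\n"

# Default comma separator: parsing succeeds but yields one fused column.
fused = pd.read_csv(io.StringIO(raw))
# Explicit separator: three proper columns.
parsed = pd.read_csv(io.StringIO(raw), sep=";")

print(fused.shape, parsed.shape)
```

A one-column result right after loading is therefore a useful cue to check the separator before moving on.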



## 5. Target Detection & Basic Cleaning <a id="target-detection"></a>

**What this step does**
- Suggests a target column if none is configured, based on names and cardinality.
- Removes constant and duplicate feature columns for a cleaner modeling matrix.
- Splits the dataset into features (`X`) and target (`y`), storing them in shared state.

➡️ **Run after loading data. Re-run if you adjust the configuration.**


In [None]:

df_raw = PIPELINE_STATE.get("df_raw")
if df_raw is None:
    raise RuntimeError("Load data before running the target detection step.")

config = PIPELINE_STATE["config"]
user_target = config.get("target_column")

if not user_target or user_target not in df_raw.columns:
    suggested = suggest_target_column(df_raw)
    if suggested:
        config["target_column"] = suggested
        print(f"Target column auto-detected: {suggested}")
    else:
        raise ValueError("Unable to infer target column. Set config['target_column'] and re-run.")
else:
    print(f"Using user-specified target column: {user_target}")

PIPELINE_STATE["config"] = config

target_col = config["target_column"]
X = df_raw.drop(columns=[target_col])
y = df_raw[target_col]

X_clean, dropped_columns = drop_redundant_columns(X)
if dropped_columns["constant"] or dropped_columns["duplicates"]:
    print("Dropped redundant columns:")
    for kind, cols in dropped_columns.items():
        if cols:
            print(f"  {kind}: {cols}")
else:
    print("No redundant columns removed.")

PIPELINE_STATE["X"] = X_clean
PIPELINE_STATE["y"] = y

resolved_task = resolve_task_type(y, config.get("task_type", "auto"))
PIPELINE_STATE["task_type"] = resolved_task
print(f"Resolved task type: {resolved_task}")
print(f"Feature matrix shape: {X_clean.shape}; target length: {len(y)}")
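
With `task_type` set to `"auto"`, `resolve_task_type` relies on scikit-learn's `type_of_target`. The short sketch below shows how that function labels a few toy targets; the lists are illustrative only.

```python
from sklearn.utils.multiclass import type_of_target

binary_kind = type_of_target([0, 1, 1, 0])
label_kind = type_of_target(["a", "b", "c", "a"])
float_kind = type_of_target([0.1, 0.6, 2.4])
# Whole-number targets are reported as multiclass even when they are quantities.
whole_number_kind = type_of_target([120, 340, 275, 410])

print(binary_kind, label_kind, float_kind, whole_number_kind)
```

The last case is the one to watch: an integer-valued regression target such as a count or a rounded price would be resolved as classification, so setting `config["task_type"] = "regression"` explicitly is the safer choice for such data.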



## 6. Exploratory Data Analysis (EDA) <a id="eda"></a>

**What this step does**
- Generates summary statistics for numeric and categorical columns.
- Visualises missingness, distributions, correlations, and target relationships.
- Prints a short label before each output so the results are easy to scan from top to bottom.

➡️ **Run to understand the dataset before modeling.**


In [None]:

X = PIPELINE_STATE.get("X")
y = PIPELINE_STATE.get("y")
resolved_task = PIPELINE_STATE.get("task_type")
if X is None or y is None:
    raise RuntimeError("Ensure the target detection step has been executed.")

numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [col for col in X.columns if col not in numeric_cols]

print("Numeric summary statistics:")
if numeric_cols:
    display(describe_numeric(X[numeric_cols]))
else:
    print("No numeric columns.")

print("Categorical frequency snapshots:")
if cat_cols:
    for col, summary in describe_categorical(X[cat_cols]).items():
        display(summary.to_frame(name="count"))
else:
    print("No categorical columns.")

print("Missingness overview:")
plot_missingness(pd.concat([X, y.rename("__target__")], axis=1))

print("Numeric distributions:")
plot_numeric_distributions(X)

print("Correlation heatmap:")
plot_correlation_heatmap(pd.concat([X[numeric_cols], y], axis=1))

print("Feature-target relationships:")
plot_target_relationships(X, y, resolved_task)
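
`describe_numeric` adds skew and kurtosis only when SciPy is installed. As a sketch of one possible fallback, pandas ships its own estimators. They are bias-corrected, so the numbers can differ slightly from the `scipy.stats` defaults. The values below are illustrative, not taken from the demo dataset.

```python
import pandas as pd

fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 71.28, 512.33])  # long right tail
ages = pd.Series([22.0, 28.0, 30.0, 32.0, 38.0])                  # symmetric around 30

summary = pd.DataFrame(
    {
        "skew": {"fare": fares.skew(), "age": ages.skew()},
        "kurtosis": {"fare": fares.kurt(), "age": ages.kurt()},
    }
)
print(summary.round(2))
```

A clearly positive skew, as for `fare` here, is a hint that a log transform or a robust scaler may be worth trying before modeling.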



## 7. Outlier Detection & Removal <a id="outliers"></a>

**What this step does**
- Flags potential outliers using configurable methods (IQR, Z-Score, IsolationForest).
- Reports how many rows would be removed before applying the filter.
- Lets you opt-in to removal explicitly so you stay in control of data hygiene.

➡️ **Adjust the toggles below, then re-run if you change your mind.**


In [None]:

X = PIPELINE_STATE.get("X")
y = PIPELINE_STATE.get("y")
config = PIPELINE_STATE.get("config", {})
if X is None or y is None:
    raise RuntimeError("Load data and identify the target before running outlier detection.")

OUTLIER_METHOD = config.get("outlier_method")  # Override here if desired
OUTLIER_THRESHOLD = config.get("outlier_threshold", 1.5)
APPLY_OUTLIER_FILTER = False  # Flip to True to remove flagged rows

if OUTLIER_METHOD:
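    # Score features only for classification: a rare class label has an IQR of 0 and would be flagged wholesale.
    task_type = PIPELINE_STATE.get("task_type")
    outlier_frame = X if task_type == "classification" else pd.concat([X, y], axis=1)
    mask = compute_outlier_mask(outlier_frame, method=OUTLIER_METHOD, threshold=OUTLIER_THRESHOLD)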
    num_outliers = int(mask.sum())
    print(f"Outlier method: {OUTLIER_METHOD} (threshold={OUTLIER_THRESHOLD}). Rows flagged: {num_outliers} / {len(mask)}")
    if num_outliers > 0:
        preview = pd.concat([X.loc[mask].head(5), y.loc[mask].head(5)], axis=1)
        print("Preview of rows flagged as outliers:")
        display(preview)
    if APPLY_OUTLIER_FILTER and num_outliers > 0:
        X_filtered = X.loc[~mask].copy()
        y_filtered = y.loc[~mask].copy()
        PIPELINE_STATE["X"] = X_filtered
        PIPELINE_STATE["y"] = y_filtered
        PIPELINE_STATE.setdefault("run_notes", []).append(f"Removed {num_outliers} rows via {OUTLIER_METHOD} outlier filter.")
        print(f"Applied outlier removal. New shape: {X_filtered.shape}")
    elif APPLY_OUTLIER_FILTER:
        print("No rows were flagged, so nothing was removed.")
    else:
        print("Outliers not removed (APPLY_OUTLIER_FILTER=False). Set it to True and re-run to drop them.")
else:
    print("Outlier detection skipped. Set config['outlier_method'] to enable (IQR, ZScore, IsolationForest).")
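
To make the IQR rule concrete, here is a minimal sketch on a toy frame. The column names and values are invented for illustration; the fences use the same 1.5 multiplier as the default `outlier_threshold`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "fare": [10, 11, 12, 13, 14, 100],
        "rare_flag": [0, 1, 0, 0, 0, 0],
    }
)

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
# A value is outside the fences if it lies more than 1.5 * IQR beyond a quartile.
outside = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
row_mask = outside.any(axis=1)

print(outside)
print(row_mask.tolist())
```

The `fare` column behaves as expected: only the value 100 falls outside the fences. The `rare_flag` column shows a caveat: when more than three quarters of a column share one value, the IQR is 0 and every other value is flagged, which is worth keeping in mind for binary or heavily imbalanced columns.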



## 8. Preprocessing Pipeline <a id="preprocessing"></a>

**What this step does**
- Configures imputers, encoders, and scalers using scikit-learn pipelines.
- Shows the final `ColumnTransformer` so you know exactly how features are processed.
- Stores the preprocessing object for reuse during training and inference.

➡️ **Run once you are satisfied with outlier handling.**


In [None]:

X = PIPELINE_STATE.get("X")
y = PIPELINE_STATE.get("y")
config = PIPELINE_STATE.get("config", {})
if X is None or y is None:
    raise RuntimeError("Run previous steps to set X and y.")

preprocessor = build_preprocessing_pipeline(
    X,
    scaling=config.get("scaling"),
    encoding=config.get("categorical_encoding"),
    impute_strategy_num=config.get("impute_numeric", "median"),
    impute_strategy_cat=config.get("impute_categorical", "most_frequent"),
)
PIPELINE_STATE["preprocessor"] = preprocessor

print("Preprocessing pipeline configured:")
print(preprocessor)
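
To see what a `ColumnTransformer` of this shape produces, the sketch below fits a comparable one on a four-row toy frame. It is a simplified stand-in for `build_preprocessing_pipeline`, assuming median imputation, standard scaling, and one-hot encoding; the column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 58.0],
        "port": pd.Series(["S", "C", np.nan, "S"], dtype=object),
    }
)

numeric = Pipeline([("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])
categorical = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)
pre = ColumnTransformer([("num", numeric, ["age"]), ("cat", categorical, ["port"])])

features = pre.fit_transform(toy)
print(features.shape)  # one scaled numeric column plus one column per observed category
```

The missing `port` value is filled with the most frequent category before encoding, so the output has one numeric column and two one-hot columns (`C` and `S`).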



## 9. Data Splitting Strategy <a id="splitting"></a>

**What this step does**
- Performs train/validation/test splits with stratification when appropriate.
- Summarises target distribution across splits (classification) or value range (regression).
- Prepares optional cross-validation folds if requested.

➡️ **Run after preprocessing is set.**


In [None]:

X = PIPELINE_STATE.get("X")
y = PIPELINE_STATE.get("y")
preprocessor = PIPELINE_STATE.get("preprocessor")
config = PIPELINE_STATE.get("config", {})
resolved_task = PIPELINE_STATE.get("task_type", "classification")
if X is None or y is None or preprocessor is None:
    raise RuntimeError("Ensure X, y, and the preprocessing pipeline are ready before splitting.")

test_size = config.get("test_size", 0.2)
val_size = config.get("val_size", 0.1)
stratify_arg = y if resolved_task == "classification" else None

X_temp, X_test, y_temp, y_test = train_test_split(
    X,
    y,
    test_size=test_size,
    random_state=RANDOM_SEED,
    stratify=stratify_arg,
)

if val_size and val_size > 0:
    adjusted_val_size = val_size / (1 - test_size)
    stratify_temp = y_temp if resolved_task == "classification" else None
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp,
        y_temp,
        test_size=adjusted_val_size,
        random_state=RANDOM_SEED,
        stratify=stratify_temp,
    )
else:
    X_train, y_train = X_temp, y_temp
    X_val = pd.DataFrame()
    y_val = pd.Series(dtype=y.dtype)

PIPELINE_STATE.update(
    {
        "X_train": X_train,
        "y_train": y_train,
        "X_val": X_val,
        "y_val": y_val,
        "X_test": X_test,
        "y_test": y_test,
    }
)

print(f"Train set: {X_train.shape}")
if not X_val.empty:
    print(f"Validation set: {X_val.shape}")
else:
    print("Validation set: not created (val_size=0 or not provided).")
print(f"Test set: {X_test.shape}")

if resolved_task == "classification":
    for split_name, target_split in {"train": y_train, "test": y_test, "val": y_val}.items():
        if target_split.empty:
            continue
        counts = target_split.value_counts()
        print(f"Class balance ({split_name}):")
        display(pd.DataFrame({"proportion": counts / counts.sum(), "count": counts}).rename_axis("class"))
else:
    for split_name, target_split in {"train": y_train, "test": y_test, "val": y_val}.items():
        if target_split.empty:
            continue
        plt.figure(figsize=(6, 3))
        plt.hist(target_split, bins=20, color="#bf5b17", alpha=0.75)
        plt.title(f"Target distribution – {split_name}")
        plt.xlabel("Value")
        plt.ylabel("Frequency")
        plt.show()

if config.get("cv_folds", 0) and config["cv_folds"] > 1:
    print(f"Cross-validation enabled: {config['cv_folds']} folds.")
else:
    print("Cross-validation disabled (cv_folds <= 1).")
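
The validation fraction is rescaled because the second split only sees the rows left after the test split. The arithmetic can be sketched in plain Python; `split_plan` is a hypothetical helper, and it uses simple rounding, whereas scikit-learn applies its own rounding, so real counts may differ by a row.

```python
from typing import Tuple


def split_plan(n_rows: int, test_size: float, val_size: float) -> Tuple[int, int, int, float]:
    """Approximate (train, val, test) row counts for a two-stage hold-out split."""
    n_test = round(n_rows * test_size)
    n_remaining = n_rows - n_test
    # val_size is a share of the full data, so rescale it to the remaining rows.
    adjusted_val = val_size / (1 - test_size) if val_size else 0.0
    n_val = round(n_remaining * adjusted_val)
    return n_remaining - n_val, n_val, n_test, adjusted_val


train_rows, val_rows, test_rows, adjusted = split_plan(800, test_size=0.2, val_size=0.1)
print(train_rows, val_rows, test_rows, round(adjusted, 3))
```

With the defaults on the 800-row demo data, 10% of the full dataset corresponds to 12.5% of the 640 rows that remain after the test split.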



## 10. Model Zoo Setup <a id="model-zoo"></a>

**What this step does**
- Builds a catalogue of baseline and advanced estimators based on task type and library availability.
- Keeps model configurations modest for quick experimentation on CPU.
- Stores ready-to-train pipelines that combine preprocessing with each estimator.

➡️ **Run before the training stage.**


In [None]:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge, LinearRegression

config = PIPELINE_STATE.get("config", {})
resolved_task = PIPELINE_STATE.get("task_type", "classification")
preprocessor = PIPELINE_STATE.get("preprocessor")
if preprocessor is None:
    raise RuntimeError("Preprocessor not configured. Run the preprocessing cell first.")

model_specs: List[Tuple[str, Any]] = []

if resolved_task == "classification":
    model_specs.extend(
        [
            ("LogisticRegression", LogisticRegression(max_iter=1000, class_weight="balanced")),
            ("RandomForestClassifier", RandomForestClassifier(n_estimators=200, random_state=RANDOM_SEED)),
        ]
    )
    if config.get("enable_xgboost") and HAS_XGBOOST:
        model_specs.append(
            (
                "XGBoostClassifier",
                OPTIONAL_MODULES["xgboost"].XGBClassifier(
                    n_estimators=200,
                    max_depth=4,
                    learning_rate=0.1,
                    subsample=0.8,
                    colsample_bytree=0.8,
                    eval_metric="logloss",
                    random_state=RANDOM_SEED,
                ),
            )
        )
    if config.get("enable_lightgbm") and HAS_LIGHTGBM:
        model_specs.append(
            (
                "LightGBMClassifier",
                OPTIONAL_MODULES["lightgbm"].LGBMClassifier(
                    n_estimators=200,
                    learning_rate=0.1,
                    subsample=0.8,
                    colsample_bytree=0.8,
                    random_state=RANDOM_SEED,
                ),
            )
        )
else:
    model_specs.extend(
        [
            ("LinearRegression", LinearRegression()),
            ("Ridge", Ridge(alpha=1.0, random_state=RANDOM_SEED)),
            ("RandomForestRegressor", RandomForestRegressor(n_estimators=250, random_state=RANDOM_SEED)),
        ]
    )
    if config.get("enable_xgboost") and HAS_XGBOOST:
        model_specs.append(
            (
                "XGBoostRegressor",
                OPTIONAL_MODULES["xgboost"].XGBRegressor(
                    n_estimators=300,
                    max_depth=4,
                    learning_rate=0.1,
                    subsample=0.8,
                    colsample_bytree=0.8,
                    random_state=RANDOM_SEED,
                ),
            )
        )
    if config.get("enable_lightgbm") and HAS_LIGHTGBM:
        model_specs.append(
            (
                "LightGBMRegressor",
                OPTIONAL_MODULES["lightgbm"].LGBMRegressor(
                    n_estimators=300,
                    learning_rate=0.1,
                    subsample=0.8,
                    colsample_bytree=0.8,
                    random_state=RANDOM_SEED,
                ),
            )
        )

models: Dict[str, Pipeline] = {}
for name, estimator in model_specs:
    models[name] = Pipeline(steps=[("preprocessor", preprocessor), ("model", estimator)])

PIPELINE_STATE["models"] = models
print("Models configured:")
for name in models:
    print(f"  - {name}")

# The training cell builds the network for both task types, so the message
# should not depend on the task.
if config.get("enable_pytorch_nn") and HAS_TORCH:
    print("PyTorch detected; the neural network will be initialised during training.")
elif config.get("enable_pytorch_nn"):
    print("PyTorch requested but not available; skipping neural network setup.")



## 11. Training & Live Progress <a id="training"></a>

**What this step does**
- Fits each configured model with timing information and optional cross-validation.
- Tracks validation/test metrics for leaderboard comparison.
- Provides a PyTorch MLP training loop with live loss curves when the library is available.
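
As a rough sketch of the fit, time, and score loop described above, the example below runs one model on synthetic data from `make_classification`. The `demo_*` names and the 75/25 split are assumptions for illustration; the notebook's own splits come from section 9.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the notebook's train/test split.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_model = LogisticRegression(max_iter=1000)

# Time only the fit on the training split.
fit_start = time.perf_counter()
demo_model.fit(X_tr, y_tr)
fit_seconds = time.perf_counter() - fit_start

# cross_val_score clones the estimator per fold, so the fitted model is untouched.
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_cv_scores = cross_val_score(demo_model, X_tr, y_tr, cv=demo_cv, scoring="f1_macro")

demo_metrics = {
    "train_time_s": float(fit_seconds),
    "cv_mean": float(np.mean(demo_cv_scores)),
    "cv_std": float(np.std(demo_cv_scores)),
    "test_f1_macro": float(f1_score(y_te, demo_model.predict(X_te), average="macro")),
}
print(demo_metrics)
```

Cross-validation runs on the training split only, so the test score stays an independent check.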

➡️ **Run to train all models. Adjust epochs or selection as needed.**


In [None]:

from sklearn.base import clone
from sklearn.preprocessing import LabelEncoder

models = PIPELINE_STATE.get("models", {})
config = PIPELINE_STATE.get("config", {})
resolved_task = PIPELINE_STATE.get("task_type", "classification")
X_train = PIPELINE_STATE.get("X_train")
y_train = PIPELINE_STATE.get("y_train")
X_val = PIPELINE_STATE.get("X_val")
y_val = PIPELINE_STATE.get("y_val")
X_test = PIPELINE_STATE.get("X_test")
y_test = PIPELINE_STATE.get("y_test")
preprocessor = PIPELINE_STATE.get("preprocessor")

if not models or X_train is None or y_train is None:
    raise RuntimeError("Ensure previous steps configured models and data splits.")

results: List[ModelResult] = []
leaderboard_rows: List[Dict[str, Any]] = []
cv_folds = config.get("cv_folds", 0)

PRIMARY_METRIC = "F1 (macro)" if resolved_task == "classification" else "R2"

for name, pipeline in models.items():
    start = time.time()
    pipeline.fit(X_train, y_train)
    train_time = time.time() - start

    metrics: Dict[str, float] = {"train_time_s": float(train_time)}
    val_metric = None
    test_metric = None

    if cv_folds and cv_folds > 1:
        scoring = "f1_macro" if resolved_task == "classification" else "r2"
        if resolved_task == "classification":
            cv_cls = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=RANDOM_SEED)
        else:
            cv_cls = KFold(n_splits=cv_folds, shuffle=True, random_state=RANDOM_SEED)
        cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv_cls, scoring=scoring)
        metrics["cv_mean"] = float(np.mean(cv_scores))
        metrics["cv_std"] = float(np.std(cv_scores))

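    # X_val / y_val come from PIPELINE_STATE.get(), so guard against None as well as empty frames.
    if X_val is not None and y_val is not None and not X_val.empty and not y_val.empty: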
        val_pred = pipeline.predict(X_val)
        if resolved_task == "classification":
            val_metric = f1_score(y_val, val_pred, average="macro")
            metrics["val_f1_macro"] = float(val_metric)
            metrics["val_accuracy"] = float(accuracy_score(y_val, val_pred))
        else:
            val_metric = r2_score(y_val, val_pred)
            metrics["val_r2"] = float(val_metric)
            # np.sqrt avoids `squared=False`, which newer scikit-learn releases no longer accept.
            metrics["val_rmse"] = float(np.sqrt(mean_squared_error(y_val, val_pred)))

    test_pred = pipeline.predict(X_test)
    if resolved_task == "classification":
        test_metric = f1_score(y_test, test_pred, average="macro")
        metrics["test_f1_macro"] = float(test_metric)
        metrics["test_accuracy"] = float(accuracy_score(y_test, test_pred))
        if hasattr(pipeline.named_steps["model"], "predict_proba"):
            try:
                proba = pipeline.predict_proba(X_test)
                metrics["test_log_loss"] = float(log_loss(y_test, proba))
            except ValueError:
                pass
    else:
        test_metric = r2_score(y_test, test_pred)
        metrics["test_r2"] = float(test_metric)
        # np.sqrt avoids `squared=False`, which newer scikit-learn releases no longer accept.
        metrics["test_rmse"] = float(np.sqrt(mean_squared_error(y_test, test_pred)))
        metrics["test_mae"] = float(mean_absolute_error(y_test, test_pred))

    results.append(
        ModelResult(
            name=name,
            pipeline=pipeline,
            train_time=train_time,
            val_metric=val_metric,
            test_metric=test_metric,
            metrics=metrics,
            task_type=resolved_task,
            additional={"predictions": test_pred},
        )
    )

    leaderboard_rows.append(
        {
            "Model": name,
            "Primary metric": float(test_metric) if test_metric is not None else (float(val_metric) if val_metric is not None else np.nan),
            "Train time": format_seconds(train_time),
        }
    )
    print(f"Trained {name} in {format_seconds(train_time)}")

# Optional PyTorch neural network
if config.get("enable_pytorch_nn") and HAS_TORCH and preprocessor is not None:
    torch = OPTIONAL_MODULES["torch"]
    nn = torch.nn
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    preprocessor_nn = clone(preprocessor)
    X_train_trans = preprocessor_nn.fit_transform(X_train)
    X_test_trans = preprocessor_nn.transform(X_test)
    if X_val is not None and not X_val.empty:
        X_val_trans = preprocessor_nn.transform(X_val)
    else:
        X_val_trans = None

    if resolved_task == "classification":
        label_encoder = LabelEncoder()
        y_train_enc = torch.tensor(label_encoder.fit_transform(y_train), dtype=torch.long)
        y_test_enc = torch.tensor(label_encoder.transform(y_test), dtype=torch.long)
        if X_val_trans is not None:
            y_val_enc = torch.tensor(label_encoder.transform(y_val), dtype=torch.long)
        output_dim = len(label_encoder.classes_)
    else:
        label_encoder = None
        y_train_enc = torch.tensor(y_train.values, dtype=torch.float32)
        y_test_enc = torch.tensor(y_test.values, dtype=torch.float32)
        if X_val_trans is not None:
            y_val_enc = torch.tensor(y_val.values, dtype=torch.float32)
        output_dim = 1

    X_train_tensor = torch.tensor(X_train_trans, dtype=torch.float32)
    X_test_tensor = torch.tensor(X_test_trans, dtype=torch.float32)
    if X_val_trans is not None:
        X_val_tensor = torch.tensor(X_val_trans, dtype=torch.float32)

    train_dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_enc)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

    class SimpleTabularNN(nn.Module):
        def __init__(self, input_dim: int, output_dim: int, hidden_layers: List[int], dropout: float = 0.1):
            super().__init__()
            layers: List[nn.Module] = []
            prev_dim = input_dim
            for hidden_dim in hidden_layers:
                layers.append(nn.Linear(prev_dim, hidden_dim))
                layers.append(nn.ReLU())
                layers.append(nn.Dropout(dropout))
                prev_dim = hidden_dim
            layers.append(nn.Linear(prev_dim, output_dim))
            self.network = nn.Sequential(*layers)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.network(x)

    input_dim = X_train_trans.shape[1]
    hidden_layers = [128, 64]
    model = SimpleTabularNN(input_dim, output_dim, hidden_layers, dropout=0.2).to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = torch.nn.CrossEntropyLoss() if resolved_task == "classification" else torch.nn.MSELoss()

    max_epochs = 25
    patience = 5
    best_score = -np.inf
    wait = 0
    history_train_loss: List[float] = []
    history_metric: List[float] = []
    torch_start = time.time()

    for epoch in range(max_epochs):
        model.train()
        running_loss = 0.0
        for batch_x, batch_y in train_loader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)
            optimizer.zero_grad()
            outputs = model(batch_x)
            if resolved_task == "classification":
                loss = loss_fn(outputs, batch_y)
            else:
                # squeeze(-1) keeps the batch dimension even when the last batch holds one row.
                loss = loss_fn(outputs.squeeze(-1), batch_y)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * batch_x.size(0)
        epoch_loss = running_loss / len(train_loader.dataset)
        history_train_loss.append(epoch_loss)

        model.eval()
        with torch.no_grad():
            if X_val_trans is not None:
                val_outputs = model(X_val_tensor.to(device))
                if resolved_task == "classification":
                    val_probs = torch.softmax(val_outputs, dim=1)
                    val_preds = val_probs.argmax(dim=1).cpu().numpy()
                    val_metric = f1_score(y_val, label_encoder.inverse_transform(val_preds), average="macro")
                else:
                    val_preds = val_outputs.squeeze(-1).cpu().numpy()
                    val_metric = r2_score(y_val, val_preds)
            else:
                train_outputs = model(X_train_tensor.to(device))
                if resolved_task == "classification":
                    train_probs = torch.softmax(train_outputs, dim=1)
                    train_preds = train_probs.argmax(dim=1).cpu().numpy()
                    val_metric = f1_score(y_train, label_encoder.inverse_transform(train_preds), average="macro")
                else:
                    train_preds = train_outputs.squeeze(-1).cpu().numpy()
                    val_metric = r2_score(y_train, train_preds)
        history_metric.append(val_metric)

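        clear_output(wait=True)
        fig, ax1 = plt.subplots(figsize=(6, 4))
        ax1.plot(history_train_loss, label="Train loss", color="#1b9e77")
        ax1.set_xlabel("Epoch")
        ax1.set_ylabel("Loss")
        ax1.set_title("PyTorch MLP training progress")
        ax1.grid(True)
        ax2 = ax1.twinx()
        # Without a validation split the metric is computed on the training data.
        metric_label = "Validation metric" if X_val_trans is not None else "Train metric"
        ax2.plot(history_metric, label=metric_label, color="#d95f02")
        ax2.set_ylabel("Metric")
        # Merge handles from both axes so the legend lists both curves.
        handles1, labels1 = ax1.get_legend_handles_labels()
        handles2, labels2 = ax2.get_legend_handles_labels()
        ax2.legend(handles1 + handles2, labels1 + labels2, loc="upper right")
        plt.show()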
        print(f"Epoch {epoch + 1}/{max_epochs} | loss={epoch_loss:.4f} | metric={val_metric:.4f}")

        if val_metric > best_score:
            best_score = val_metric
            wait = 0
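            # Snapshot the weights: state_dict() returns live references that
            # later optimizer steps would otherwise overwrite.
            best_state = {
                "model": {k: v.detach().clone() for k, v in model.state_dict().items()},
                "preprocessor": preprocessor_nn,
            }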
        else:
            wait += 1
            if wait >= patience:
                print("Early stopping triggered.")
                break

    if 'best_state' in locals():
        model.load_state_dict(best_state["model"])
        preprocessor_nn = best_state["preprocessor"]

    torch_train_time = time.time() - torch_start

    class TorchPipeline:
        def __init__(self, preprocessor, model, task_type, label_encoder, device):
            self.preprocessor = preprocessor
            self.model = model
            self.task_type = task_type
            self.label_encoder = label_encoder
            self.device = device
            self.model.eval()

        def predict(self, X_df: pd.DataFrame) -> np.ndarray:
            transformed = self.preprocessor.transform(X_df)
            tensor = torch.tensor(transformed, dtype=torch.float32).to(self.device)
            with torch.no_grad():
                outputs = self.model(tensor)
            if self.task_type == "classification":
                preds = outputs.argmax(dim=1).cpu().numpy()
                return self.label_encoder.inverse_transform(preds)
            # squeeze(-1) returns a 1-D array even for a single row, so `predict(df)[0]` works.
            return outputs.squeeze(-1).cpu().numpy()

        def predict_proba(self, X_df: pd.DataFrame) -> np.ndarray:
            if self.task_type != "classification":
                raise AttributeError("predict_proba is classification-only.")
            transformed = self.preprocessor.transform(X_df)
            tensor = torch.tensor(transformed, dtype=torch.float32).to(self.device)
            with torch.no_grad():
                outputs = self.model(tensor)
                probs = torch.softmax(outputs, dim=1).cpu().numpy()
            return probs

    torch_pipeline = TorchPipeline(preprocessor_nn, model, resolved_task, label_encoder, device)

    torch_test_pred = torch_pipeline.predict(X_test)
    torch_metrics: Dict[str, float] = {"train_time_s": float(torch_train_time)}
    if resolved_task == "classification":
        torch_test_metric = f1_score(y_test, torch_test_pred, average="macro")
        torch_metrics["test_f1_macro"] = float(torch_test_metric)
        torch_metrics["test_accuracy"] = float(accuracy_score(y_test, torch_test_pred))
    else:
        torch_test_metric = r2_score(y_test, torch_test_pred)
        torch_metrics["test_r2"] = float(torch_test_metric)
        # np.sqrt avoids `squared=False`, which newer scikit-learn releases no longer accept.
        torch_metrics["test_rmse"] = float(np.sqrt(mean_squared_error(y_test, torch_test_pred)))
        torch_metrics["test_mae"] = float(mean_absolute_error(y_test, torch_test_pred))

    results.append(
        ModelResult(
            name="PyTorchMLP",
            pipeline=torch_pipeline,
            train_time=torch_train_time,
            val_metric=best_score if 'best_state' in locals() else None,
            test_metric=torch_test_metric,
            metrics=torch_metrics,
            task_type=resolved_task,
            additional={"history_loss": history_train_loss, "history_metric": history_metric},
        )
    )

    leaderboard_rows.append(
        {
            "Model": "PyTorchMLP",
            "Primary metric": float(torch_test_metric),
            "Train time": format_seconds(torch_train_time),
        }
    )
    print("PyTorch MLP training complete.")
elif config.get("enable_pytorch_nn") and not HAS_TORCH:
    print("PyTorch was requested but is not installed; skipping neural network training.")

PIPELINE_STATE["model_results"] = results
PIPELINE_STATE["leaderboard_rows"] = leaderboard_rows

print("\nTraining complete. Proceed to the Evaluation section for detailed metrics.")



## 12. Evaluation <a id="evaluation"></a>

**What this step does**
- Computes rich metrics (classification or regression) on the held-out test set.
- Visualises confusion matrices, ROC/PR curves, or regression diagnostics.
- Produces a leaderboard sorted by the primary metric for quick comparison.
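
The leaderboard sort and confusion-matrix layout can be sketched with hand-written predictions. The labels, the two model names, and the `demo_leaderboard` variable are made up for this example; the real rows come from the training cell.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels and predictions for two made-up models.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
preds_by_model = {
    "model_a": [0, 0, 1, 1, 0, 0, 1, 1],
    "model_b": [0, 0, 1, 1, 1, 0, 1, 0],
}

rows = [
    {"Model": name, "Primary metric": f1_score(y_true, y_hat, average="macro")}
    for name, y_hat in preds_by_model.items()
]

# Highest primary metric first, with a clean 0..n-1 index for display.
demo_leaderboard = (
    pd.DataFrame(rows)
    .sort_values(by="Primary metric", ascending=False)
    .reset_index(drop=True)
)
print(demo_leaderboard)

# Rows are true classes, columns are predicted classes.
cm_a = confusion_matrix(y_true, preds_by_model["model_a"])
print(cm_a)  # → [[3 1], [1 3]] printed as a 2x2 array
```

Off-diagonal cells count the mistakes: here `model_a` has one false positive and one false negative.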

➡️ **Run after training to analyse performance.**


In [None]:

results: List[ModelResult] = PIPELINE_STATE.get("model_results", [])
leaderboard_rows = PIPELINE_STATE.get("leaderboard_rows", [])
resolved_task = PIPELINE_STATE.get("task_type", "classification")
X_test = PIPELINE_STATE.get("X_test")
y_test = PIPELINE_STATE.get("y_test")

if not results:
    raise RuntimeError("No trained models found. Run the training cell first.")

leaderboard = pd.DataFrame(leaderboard_rows)
leaderboard = leaderboard.sort_values(by="Primary metric", ascending=False)
print("Model leaderboard (sorted by primary metric):")
display(leaderboard.reset_index(drop=True))

for result in results:
    pipeline = result.pipeline
    name = result.name
    print("=" * 60)
    print(f"Model: {name}")
    print(f"Train time: {format_seconds(result.train_time)}")

    y_pred = pipeline.predict(X_test)

    if resolved_task == "classification":
        print(classification_report(y_test, y_pred))
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(5, 4))
        plt.imshow(cm, cmap="Blues")
        plt.title(f"Confusion matrix – {name}")
        plt.xlabel("Predicted")
        plt.ylabel("True")
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                plt.text(j, i, cm[i, j], ha="center", va="center", color="black")
        plt.colorbar()
        plt.show()

        if hasattr(pipeline, "predict_proba"):
            try:
                proba = pipeline.predict_proba(X_test)
            except Exception:
                proba = None
        else:
            proba = None

        if proba is not None:
            classes = np.unique(y_test)
            if len(classes) == 2:
                fpr, tpr, _ = roc_curve(y_test, proba[:, 1], pos_label=classes[1])
                plt.figure(figsize=(5, 4))
                plt.plot(fpr, tpr, label=f"ROC AUC = {roc_auc_score(y_test, proba[:, 1]):.3f}")
                plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
                plt.xlabel("False Positive Rate")
                plt.ylabel("True Positive Rate")
                plt.title(f"ROC Curve – {name}")
                plt.legend()
                plt.show()

                precision, recall, _ = precision_recall_curve(y_test, proba[:, 1], pos_label=classes[1])
                plt.figure(figsize=(5, 4))
                plt.plot(recall, precision, color="#7fc97f")
                plt.xlabel("Recall")
                plt.ylabel("Precision")
                plt.title(f"Precision-Recall Curve – {name}")
                plt.show()
            else:
                try:
                    roc_auc = roc_auc_score(y_test, proba, multi_class="ovr")
                    print(f"Macro ROC AUC: {roc_auc:.3f}")
                except ValueError:
                    pass

        print("Interpretation: high precision and recall, with confusion-matrix counts concentrated on the diagonal, indicate a reliable classifier.")
    else:
        mae = mean_absolute_error(y_test, y_pred)
        # np.sqrt avoids `squared=False`, which newer scikit-learn releases no longer accept.
        rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
        r2 = r2_score(y_test, y_pred)
        print(f"MAE: {mae:.3f} | RMSE: {rmse:.3f} | R2: {r2:.3f}")
        residuals = y_test - y_pred
        plt.figure(figsize=(5, 4))
        plt.scatter(y_pred, residuals, alpha=0.6, color="#386cb0")
        plt.axhline(0, color="black", linestyle="--")
        plt.title(f"Residuals vs Prediction – {name}")
        plt.xlabel("Predicted")
        plt.ylabel("Residuals")
        plt.show()

        plt.figure(figsize=(5, 4))
        plt.scatter(y_test, y_pred, alpha=0.6, color="#f0027f")
        min_val = min(y_test.min(), y_pred.min())
        max_val = max(y_test.max(), y_pred.max())
        plt.plot([min_val, max_val], [min_val, max_val], linestyle="--", color="gray")
        plt.title(f"Predicted vs True – {name}")
        plt.xlabel("True")
        plt.ylabel("Predicted")
        plt.show()
        print("Interpretation: Points hugging the diagonal indicate accurate regression predictions.")



## 13. Inference Demo <a id="inference"></a>

**What this step does**
- Shows how to generate predictions on the held-out test set.
- Demonstrates inline dictionary-to-DataFrame conversion for single-sample inference.
- Keeps everything in-memory for quick experimentation.
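
A minimal sketch of single-sample inference, assuming a small hypothetical pipeline over two made-up columns (`fare`, `embarked`). The key step is wrapping the dictionary in a list so pandas builds a one-row frame with the same column names the pipeline was trained on.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative training frame; values and column names are made up.
train_df = pd.DataFrame({
    "fare": [7.25, 71.3, 8.05, 53.1, 8.46, 51.9],
    "embarked": ["S", "C", "S", "S", "Q", "C"],
})
train_y = [0, 1, 0, 1, 0, 1]

demo_pipe = Pipeline(steps=[
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["embarked"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(train_df, train_y)

# A list holding one dict becomes a one-row DataFrame; reorder to match training columns.
single = {"embarked": "Q", "fare": 30.0}
single_df = pd.DataFrame([single])[list(train_df.columns)]

single_pred = demo_pipe.predict(single_df)[0]
single_proba = demo_pipe.predict_proba(single_df)[0]
print(single_pred, single_proba)
```

Passing a bare dict to `pd.DataFrame` without the list would fail for scalar values, which is why the extra brackets matter.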

➡️ **Run to validate prediction APIs for the trained models.**


In [None]:

results: List[ModelResult] = PIPELINE_STATE.get("model_results", [])
X_test = PIPELINE_STATE.get("X_test")
y_test = PIPELINE_STATE.get("y_test")

if not results:
    raise RuntimeError("Train models before running inference.")

sample_rows = X_test.head(3)
print("Test set sample predictions:")
for result in results:
    preds = result.pipeline.predict(sample_rows)
    print(f"{result.name}: {preds}")
    if result.task_type == "classification" and hasattr(result.pipeline, "predict_proba"):
        try:
            probas = result.pipeline.predict_proba(sample_rows)
            print("  Probabilities:")
            print(probas)
        except Exception:
            pass

custom_input = {col: sample_rows.iloc[0][col] for col in sample_rows.columns}
print("\nCustom single-row inference (edit `custom_input` as needed):")
custom_df = pd.DataFrame([custom_input])
for result in results:
    pred = result.pipeline.predict(custom_df)[0]
    print(f"{result.name}: {pred}")
    if result.task_type == "classification" and hasattr(result.pipeline, "predict_proba"):
        try:
            print("  Probability vector:", result.pipeline.predict_proba(custom_df)[0])
        except Exception:
            pass



## 14. Run Summary <a id="run-summary"></a>

**What this step does**
- Consolidates key run details: dataset source, rows used, preprocessing choices, and top model.
- Provides a lightweight log for reproducibility (no disk writes).
- Helps you verify the pipeline’s overall status at a glance.
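
Picking the top model from leaderboard rows can be sketched with plain Python. The rows and the `metric_key` helper below are hypothetical. The helper treats a missing or NaN score as the worst possible value, because comparisons against NaN are always false and can make `max()` depend on row order.

```python
import math

# Hypothetical leaderboard rows; model_b stands for a model whose metric could not be computed.
demo_rows = [
    {"Model": "model_a", "Primary metric": 0.81},
    {"Model": "model_b", "Primary metric": float("nan")},
    {"Model": "model_c", "Primary metric": 0.86},
]


def metric_key(row):
    """Sort key that ranks missing or NaN metrics below every real score."""
    value = row.get("Primary metric")
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return float("-inf")
    return value


best_row = max(demo_rows, key=metric_key)
summary_text = "\n".join([
    f"**Models trained:** {', '.join(row['Model'] for row in demo_rows)}",
    f"**Best model:** {best_row['Model']} (score={best_row['Primary metric']:.4f})",
])
print(summary_text)
```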

➡️ **Run after evaluation to summarise the workflow.**


In [None]:

config = PIPELINE_STATE.get("config", {})
results: List[ModelResult] = PIPELINE_STATE.get("model_results", [])
leaderboard = PIPELINE_STATE.get("leaderboard_rows", [])
X = PIPELINE_STATE.get("X")
y = PIPELINE_STATE.get("y")
run_notes = PIPELINE_STATE.get("run_notes", [])

if not results:
    raise RuntimeError("Run training and evaluation before summarising.")

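# Rank missing or NaN metrics last; comparisons against NaN would otherwise make max() order-dependent.
best_entry = max(
    leaderboard,
    key=lambda row: float("-inf") if pd.isna(row.get("Primary metric")) else row["Primary metric"],
)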
best_model_name = best_entry["Model"]
best_metric_value = best_entry.get("Primary metric")
if isinstance(best_metric_value, (float, int)):
    best_metric_display = f"{best_metric_value:.4f}"
else:
    best_metric_display = str(best_metric_value)

summary_lines = [
    f"**Data source:** {PIPELINE_STATE.get('data_source', 'Unknown')}",
    f"**Rows / columns:** {X.shape[0]} / {X.shape[1]}",
    f"**Target column:** {config.get('target_column')}",
    f"**Task type:** {PIPELINE_STATE.get('task_type')}",
    f"**Scaling:** {config.get('scaling')} | **Categorical encoding:** {config.get('categorical_encoding')}",
    f"**Outlier method:** {config.get('outlier_method') or 'None'}",
    f"**Models trained:** {', '.join([r.name for r in results])}",
    f"**Best model:** {best_model_name} (score={best_metric_display})",
]
if run_notes:
    summary_lines.append("**Notes:** " + " | ".join(run_notes))

display(Markdown("\n".join(summary_lines)))



## 15. Optional Export Cells <a id="optional-export"></a>

**What this step does**
- Provides opt-in utilities for serialising the best model and exporting figures.
- Only executes when `allow_export=True` to avoid unintended file writes.
- Demonstrates how to keep artefacts in-memory or persist them locally.
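
The in-memory round trip can be sketched as follows, assuming a small `Ridge` model on synthetic data in place of the best pipeline. `joblib.dump` and `joblib.load` both accept file-like objects, so nothing touches disk.

```python
import io

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Stand-in for the best pipeline: a small fitted model on synthetic data.
X_demo, y_demo = make_regression(n_samples=50, n_features=3, random_state=0)
fitted_model = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Serialise into a bytes buffer instead of a file on disk.
export_buffer = io.BytesIO()
joblib.dump(fitted_model, export_buffer)
size_kb = len(export_buffer.getvalue()) / 1024
print(f"Serialised model size: {size_kb:.1f} KB")

# Rewind before reading, then confirm the restored model predicts identically.
export_buffer.seek(0)
restored_model = joblib.load(export_buffer)
print(np.allclose(restored_model.predict(X_demo), fitted_model.predict(X_demo)))
```

Since joblib builds on pickle, only load buffers or files that come from a trusted source.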

➡️ **Enable exports in the config cell and run this block explicitly.**


In [None]:

config = PIPELINE_STATE.get("config", {})
allow_export = bool(config.get("allow_export"))
results: List[ModelResult] = PIPELINE_STATE.get("model_results", [])
leaderboard = PIPELINE_STATE.get("leaderboard_rows", [])

if not allow_export:
    print("Exports are disabled. Set config['allow_export']=True and re-run if needed.")
else:
    import io
    import joblib

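    # Rank missing or NaN metrics last; comparisons against NaN would otherwise make max() order-dependent.
    best_entry = max(
        leaderboard,
        key=lambda row: float("-inf") if pd.isna(row.get("Primary metric")) else row["Primary metric"],
    )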
    best_model_name = best_entry["Model"]
    best_result = next(r for r in results if r.name == best_model_name)

    buffer = io.BytesIO()
    joblib.dump(best_result.pipeline, buffer)
    buffer.seek(0)
    print(f"Serialized pipeline for {best_model_name} to in-memory bytes (size={len(buffer.getvalue())/1024:.1f} KB).")

    EXPORT_PATH = Path("best_model_pipeline.joblib")
    # Uncomment the next line to persist locally (opt-in, small file).
    # EXPORT_PATH.write_bytes(buffer.getvalue())
    print("To persist on disk, uncomment the write line in this cell.")



## 16. Testing Checklist <a id="testing-checklist"></a>

- [ ] Load built-in Titanic-like sample → complete full pipeline.
- [ ] Flip task to regression by choosing a numeric target (e.g., `Fare`) and rerun.
- [ ] Simulate missing values in a copy and verify preprocessing handles them.
- [ ] Toggle outlier removal methods and confirm row count changes and plots update.
- [ ] If torch is available: run 15–20 epochs, verify live curves & early stopping.
- [ ] Confirm no disk writes occur unless `allow_export=True` is set in the config and the export cell is executed.
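
One way to reason about the early-stopping item in the checklist without running torch is to replay the patience rule on a recorded metric trace. The helper `simulate_early_stopping` and the trace below are hypothetical. The rule mirrors the training loop: stop once `patience` consecutive epochs fail to beat the best score.

```python
from typing import List, Tuple


def simulate_early_stopping(metric_history: List[float], patience: int = 5) -> Tuple[int, float]:
    """Return (stop_epoch, best_score) for a higher-is-better metric trace."""
    best_score = float("-inf")
    wait = 0
    stop_epoch = len(metric_history)  # default: every epoch ran
    for epoch, score in enumerate(metric_history, start=1):
        if score > best_score:
            # Strict improvement resets the patience counter.
            best_score = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                stop_epoch = epoch
                break
    return stop_epoch, best_score


# Hypothetical trace: improves for three epochs, then plateaus and declines.
trace = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72, 0.71, 0.70, 0.69]
print(simulate_early_stopping(trace, patience=5))  # → (8, 0.75)
```

Note that a tie with the best score (epoch 5 here) does not reset the counter, because the comparison is strict.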
