leadforge-dev · shaypal5 · May 3, 2026 · May 2, 2026 · May 2, 2026 · May 3, 2026
diff --git a/.agent-plan.md b/.agent-plan.md
@@ -217,7 +217,7 @@ Documentation + CI:
 | M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag |
 | M12: CLI help text polish | Deferred | Low priority vs dataset |
 | M14: Sample bundle commit | Absorbed into v4-M2 | v4 dataset IS the sample |
-| M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships |
+| M14: Notebook 1 (inspecting world) | **Done** | `leadforge/examples/notebooks/01_inspect_world.ipynb` |
 | M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this |
 | M14: Notebook 3 (public vs instructor) | Discarded | No current audience |
 | M14: Notebook 4 (recipe customization) | Discarded | Premature |

diff --git a/leadforge/examples/notebooks/01_inspect_world.ipynb b/leadforge/examples/notebooks/01_inspect_world.ipynb
@@ -0,0 +1,273 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "# Inspecting a Generated World\n\nThis notebook walks you through generating a synthetic CRM dataset with **leadforge** and exploring what's inside the output bundle.\n\n**Prerequisites:** `pip install -e \".[dev]\"` from the repo root, plus a Jupyter environment (`pip install notebook` or `pip install jupyterlab`).\n\nWe'll cover:\n1. Generating a bundle via the Python API\n2. Exploring `manifest.json` — provenance, row counts, file hashes\n3. Loading the relational tables and examining FK relationships\n4. Inspecting the task splits (train/valid/test)\n5. Reading the dataset card and feature dictionary"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Generate a bundle\n",
+    "\n",
+    "We use `Generator.from_recipe()` to create a small world (500 leads) in `student_public` mode with `intro` difficulty. The bundle is written to a temporary directory, so nothing lingers once the notebook finishes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import atexit\nimport shutil\nimport tempfile\nfrom pathlib import Path\n\nfrom leadforge.api import Generator\n\ntmpdir = tempfile.mkdtemp(prefix=\"leadforge_demo_\")\n# Best-effort cleanup when the kernel exits normally (atexit does not run if the process is killed)\natexit.register(shutil.rmtree, tmpdir, ignore_errors=True)\nbundle_path = Path(tmpdir) / \"demo_bundle\"\n\ngen = Generator.from_recipe(\n    \"b2b_saas_procurement_v1\",\n    seed=42,\n    exposure_mode=\"student_public\",\n    difficulty=\"intro\",\n)\nbundle = gen.generate(n_leads=500)\nbundle.save(str(bundle_path))\n\nprint(f\"Bundle written to: {bundle_path}\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's see what files were created:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for p in sorted(bundle_path.rglob(\"*\")):\n",
+    "    if p.is_file():\n",
+    "        size_kb = p.stat().st_size / 1024\n",
+    "        print(f\"  {p.relative_to(bundle_path)}  ({size_kb:.1f} KB)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Explore the manifest\n",
+    "\n",
+    "`manifest.json` is the bundle's provenance record. It captures the recipe, seed, package version, exposure mode, row counts, and SHA-256 hashes for every data file — everything you need to reproduce or verify the dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "with open(bundle_path / \"manifest.json\") as f:\n",
+    "    manifest = json.load(f)\n",
+    "\n",
+    "# Top-level provenance fields\n",
+    "for key in [\n",
+    "    \"package_version\",\n",
+    "    \"recipe_id\",\n",
+    "    \"seed\",\n",
+    "    \"exposure_mode\",\n",
+    "    \"difficulty\",\n",
+    "    \"generation_timestamp\",\n",
+    "    \"bundle_schema_version\",\n",
+    "]:\n",
+    "    print(f\"{key}: {manifest.get(key)}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# Table inventory: row counts and file hashes\nprint(\"Relational tables:\")\nfor name, info in manifest[\"tables\"].items():\n    print(f\"  {name:20s}  {info['row_count']:>6,} rows  sha256={info['sha256'][:12]}...\")\n\nprint(\"\\nTask splits:\")\nfor task_id, task_info in manifest[\"tasks\"].items():\n    print(f\"  {task_id}:\")\n    for key in (\"train\", \"valid\", \"test\"):\n        rows_key = f\"{key}_rows\"\n        if rows_key in task_info:\n            print(f\"    {key:6s}  {task_info[rows_key]:>5,} rows\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Relational tables\n",
+    "\n",
+    "The bundle contains 9 relational tables stored as Parquet files under `tables/`. These represent the full CRM world: accounts, contacts, leads, their interactions (touches, sessions, sales activities), and outcomes (opportunities, customers, subscriptions)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "tables = {}\n",
+    "for parquet_file in sorted((bundle_path / \"tables\").glob(\"*.parquet\")):\n",
+    "    name = parquet_file.stem\n",
+    "    tables[name] = pd.read_parquet(parquet_file)\n",
+    "\n",
+    "# Summary of all tables\n",
+    "summary = pd.DataFrame(\n",
+    "    [{\"table\": name, \"rows\": len(df), \"columns\": len(df.columns)} for name, df in tables.items()]\n",
+    ")\n",
+    "print(summary.to_string(index=False))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sample rows from the leads table\n",
+    "tables[\"leads\"].head(3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sample rows from the touches table\n",
+    "tables[\"touches\"].head(3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### FK relationships\n",
+    "\n",
+    "The tables are linked by foreign keys (e.g., every lead references an account and a contact). Let's verify one relationship and see how the tables connect."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# Every lead's account_id should exist in the accounts table\nlead_account_ids = set(tables[\"leads\"][\"account_id\"])\naccount_ids = set(tables[\"accounts\"][\"account_id\"])\norphans = lead_account_ids - account_ids\nprint(f\"FK check: {len(orphans)} orphan account_ids (expect 0)\")\n\nprint(f\"Accounts: {len(account_ids)}\")\nprint(f\"Contacts: {len(tables['contacts'])}\")\nprint(f\"Leads: {len(tables['leads'])}\")\nprint(f\"Leads per account (mean): {len(tables['leads']) / len(account_ids):.1f}\")\nprint(f\"Touches per lead (mean): {len(tables['touches']) / len(tables['leads']):.1f}\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Task splits\n",
+    "\n",
+    "The primary task (`converted_within_90_days`) is exported as train/valid/test Parquet splits under `tasks/`. Each row is a lead snapshot — a flat, ML-ready feature vector anchored at the snapshot date. No post-snapshot data leaks into these features."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# Read task ID from the manifest rather than hardcoding\ntask_id = next(iter(manifest[\"tasks\"]))\ntask_dir = bundle_path / \"tasks\" / task_id\n\nsplits = {}\nfor split_file in sorted(task_dir.glob(\"*.parquet\")):\n    splits[split_file.stem] = pd.read_parquet(split_file)\n\n# Assumes the label column shares the task ID's name\nfor name, df in splits.items():\n    n_pos = df[task_id].sum()\n    rate = n_pos / len(df) * 100\n    print(f\"{name:6s}: {len(df):>4} rows, {n_pos:>3} converted ({rate:.1f}%)\")"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# Feature overview from the train split\ntrain = splits[\"train\"]\nprint(f\"Task: {task_id}\")\nprint(f\"Features: {len(train.columns)} columns\\n\")\ntrain.dtypes"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Quick summary statistics for numeric features\n",
+    "train.describe().T"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Task manifest\n",
+    "\n",
+    "`task_manifest.json` records the split ratios and label column for reproducibility."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(task_dir / \"task_manifest.json\") as f:\n",
+    "    task_manifest = json.load(f)\n",
+    "\n",
+    "task_manifest"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Dataset card and feature dictionary\n",
+    "\n",
+    "Every bundle includes a human-readable dataset card (Markdown) and a machine-readable feature dictionary (CSV) describing each column in the task table."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Dataset card (first 40 lines)\n",
+    "card_text = (bundle_path / \"dataset_card.md\").read_text(encoding=\"utf-8\")\n",
+    "print(\"\\n\".join(card_text.splitlines()[:40]))\n",
+    "print(f\"\\n... ({len(card_text.splitlines())} lines total)\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Feature dictionary\n",
+    "feat_dict = pd.read_csv(bundle_path / \"feature_dictionary.csv\")\n",
+    "feat_dict"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## What's next?\n",
+    "\n",
+    "This bundle was generated in **`student_public`** mode, which excludes the hidden causal structure behind the data. leadforge also supports a **`research_instructor`** mode that includes the full world graph, latent variable registry, and mechanism summaries — useful for teaching causal inference or evaluating model interpretability. That's a topic for a future notebook.\n",
+    "\n",
+    "For now, you have everything you need to start building models on the task splits!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cleanup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# Explicit cleanup (the atexit hook registered earlier covers a normal kernel shutdown)\nshutil.rmtree(tmpdir, ignore_errors=True)\nprint(f\"Cleaned up {tmpdir}\")"
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
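
Review note: the notebook prints the SHA-256 hashes from `manifest.json` but never checks them against the files on disk. The sketch below shows one way a reader could do that. It relies on the `manifest["tables"][name]["sha256"]` layout and the `tables/<name>.parquet` naming that the notebook itself uses. It also assumes that each hash covers the raw bytes of the Parquet file, which this diff does not confirm. The helper names `sha256_of` and `find_hash_mismatches` are hypothetical and not part of leadforge.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_hash_mismatches(bundle_path: Path) -> list[str]:
    """Return names of tables whose file is missing or whose hash differs.

    Assumption (hypothetical): each manifest hash is the SHA-256 of the raw
    bytes of tables/<name>.parquet.
    """
    manifest_text = (bundle_path / "manifest.json").read_text(encoding="utf-8")
    manifest = json.loads(manifest_text)

    mismatches = []
    for name, info in manifest["tables"].items():
        table_file = bundle_path / "tables" / f"{name}.parquet"
        if not table_file.is_file() or sha256_of(table_file) != info["sha256"]:
            mismatches.append(name)
    return mismatches
```

In the notebook this would be called as `find_hash_mismatches(bundle_path)` right after the manifest cell. An empty list would mean every table matches its recorded hash, provided the assumption about what is hashed holds.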