2 changes: 1 addition & 1 deletion .agent-plan.md
@@ -217,7 +217,7 @@ Documentation + CI:
| M12: CLI `--strict` flag | Deferred | Per-check control is better than global flag |
| M12: CLI help text polish | Deferred | Low priority vs dataset |
| M14: Sample bundle commit | Absorbed into v4-M2 | v4 dataset IS the sample |
| M14: Notebook 1 (inspecting world) | Deferred | Do after v4 ships |
| M14: Notebook 1 (inspecting world) | **Done** | `leadforge/examples/notebooks/01_inspect_world.ipynb` |
| M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this |
| M14: Notebook 3 (public vs instructor) | Discarded | No current audience |
| M14: Notebook 4 (recipe customization) | Discarded | Premature |
273 changes: 273 additions & 0 deletions leadforge/examples/notebooks/01_inspect_world.ipynb
@@ -0,0 +1,273 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# Inspecting a Generated World\n\nThis notebook walks you through generating a synthetic CRM dataset with **leadforge** and exploring what's inside the output bundle.\n\n**Prerequisites:** `pip install -e \".[dev]\"` from the repo root, plus a Jupyter environment (`pip install notebook` or `pip install jupyterlab`).\n\nWe'll cover:\n1. Generating a bundle via the Python API\n2. Exploring `manifest.json` — provenance, row counts, file hashes\n3. Loading the relational tables and examining FK relationships\n4. Inspecting the task splits (train/valid/test)\n5. Reading the dataset card and feature dictionary"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Generate a bundle\n",
"\n",
"We use `Generator.from_recipe()` to create a small world (500 leads) in `student_public` mode with `intro` difficulty. The bundle is written to a temporary directory so nothing lingers after the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import atexit\nimport shutil\nimport tempfile\nfrom pathlib import Path\n\nfrom leadforge.api import Generator\n\ntmpdir = tempfile.mkdtemp(prefix=\"leadforge_demo_\")\natexit.register(shutil.rmtree, tmpdir, True) # cleanup even on kernel restart\nbundle_path = Path(tmpdir) / \"demo_bundle\"\n\ngen = Generator.from_recipe(\n \"b2b_saas_procurement_v1\",\n seed=42,\n exposure_mode=\"student_public\",\n difficulty=\"intro\",\n)\nbundle = gen.generate(n_leads=500)\nbundle.save(str(bundle_path))\n\nprint(f\"Bundle written to: {bundle_path}\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see what files were created:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for p in sorted(bundle_path.rglob(\"*\")):\n",
" if p.is_file():\n",
" size_kb = p.stat().st_size / 1024\n",
" print(f\" {p.relative_to(bundle_path)} ({size_kb:.1f} KB)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Explore the manifest\n",
"\n",
"`manifest.json` is the bundle's provenance record. It captures the recipe, seed, package version, exposure mode, row counts, and SHA-256 hashes for every data file — everything you need to reproduce or verify the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"with open(bundle_path / \"manifest.json\") as f:\n",
" manifest = json.load(f)\n",
"\n",
"# Top-level provenance fields\n",
"for key in [\n",
" \"package_version\",\n",
" \"recipe_id\",\n",
" \"seed\",\n",
" \"exposure_mode\",\n",
" \"difficulty\",\n",
" \"generation_timestamp\",\n",
" \"bundle_schema_version\",\n",
"]:\n",
" print(f\"{key}: {manifest.get(key)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Table inventory: row counts and file hashes\nprint(\"Relational tables:\")\nfor name, info in manifest[\"tables\"].items():\n print(f\" {name:20s} {info['row_count']:>6,} rows sha256={info['sha256'][:12]}...\")\n\nprint(\"\\nTask splits:\")\nfor task_id, task_info in manifest[\"tasks\"].items():\n print(f\" {task_id}:\")\n for key in (\"train\", \"valid\", \"test\"):\n rows_key = f\"{key}_rows\"\n if rows_key in task_info:\n print(f\" {key:6s} {task_info[rows_key]:>5,} rows\")"
},
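{
"cell_type": "markdown",
"metadata": {},
"source": "The manifest's SHA-256 hashes let you verify that the table files on disk are exactly the ones that were generated. The next cell is a minimal sketch of such a check; it assumes each entry in `manifest[\"tables\"]` maps to `tables/<name>.parquet` and that the recorded hash covers the raw file bytes."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Integrity check (sketch): recompute each table file's SHA-256 and compare it\n# with the hash recorded in the manifest. Assumes manifest table names map to\n# tables/<name>.parquet and that hashes cover the raw file bytes.\nimport hashlib\n\nfor name, info in manifest[\"tables\"].items():\n    file_path = bundle_path / \"tables\" / f\"{name}.parquet\"\n    digest = hashlib.sha256(file_path.read_bytes()).hexdigest()\n    status = \"OK\" if digest == info[\"sha256\"] else \"MISMATCH\"\n    print(f\"  {name:20s} {status}\")"
},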
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Relational tables\n",
"\n",
"The bundle contains 9 relational tables stored as Parquet files under `tables/`. These represent the full CRM world: accounts, contacts, leads, their interactions (touches, sessions, sales activities), and outcomes (opportunities, customers, subscriptions)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"tables = {}\n",
"for parquet_file in sorted((bundle_path / \"tables\").glob(\"*.parquet\")):\n",
" name = parquet_file.stem\n",
" tables[name] = pd.read_parquet(parquet_file)\n",
"\n",
"# Summary of all tables\n",
"summary = pd.DataFrame(\n",
" [{\"table\": name, \"rows\": len(df), \"columns\": len(df.columns)} for name, df in tables.items()]\n",
")\n",
"print(summary.to_string(index=False))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sample rows from the leads table\n",
"tables[\"leads\"].head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sample rows from the touches table\n",
"tables[\"touches\"].head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### FK relationships\n",
"\n",
"The tables are linked by foreign keys (e.g., every lead references an account and a contact). Let's verify one relationship and see how the tables connect."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Every lead's account_id should exist in the accounts table\nlead_account_ids = set(tables[\"leads\"][\"account_id\"])\naccount_ids = set(tables[\"accounts\"][\"account_id\"])\norphans = lead_account_ids - account_ids\nprint(f\"FK check: {len(orphans)} orphan account_ids (expect 0)\")\n\nprint(f\"Accounts: {len(account_ids)}\")\nprint(f\"Contacts: {len(tables['contacts'])}\")\nprint(f\"Leads: {len(tables['leads'])}\")\nprint(f\"Leads per account (mean): {len(tables['leads']) / len(account_ids):.1f}\")\nprint(f\"Touches per lead (mean): {len(tables['touches']) / len(tables['leads']):.1f}\")"
},
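{
"cell_type": "markdown",
"metadata": {},
"source": "To make the account-to-lead link concrete, the next cell joins the two tables on `account_id`. This is a sketch that relies only on the key column we just verified; the other columns you'll see in the result depend on the recipe."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Join leads to their parent accounts on account_id (the FK verified above).\n# Suffixes disambiguate any column names shared by the two tables.\nleads_with_accounts = tables[\"leads\"].merge(\n    tables[\"accounts\"],\n    on=\"account_id\",\n    how=\"left\",\n    suffixes=(\"_lead\", \"_account\"),\n)\nleads_with_accounts.head(3)"
},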
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Task splits\n",
"\n",
"The primary task (`converted_within_90_days`) is exported as train/valid/test Parquet splits under `tasks/`. Each row is a lead snapshot — a flat, ML-ready feature vector anchored at the snapshot date. No post-snapshot data leaks into these features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Read task ID from the manifest rather than hardcoding\ntask_id = next(iter(manifest[\"tasks\"]))\ntask_dir = bundle_path / \"tasks\" / task_id\n\nsplits = {}\nfor split_file in sorted(task_dir.glob(\"*.parquet\")):\n splits[split_file.stem] = pd.read_parquet(split_file)\n\nfor name, df in splits.items():\n n_pos = df[task_id].sum()\n rate = n_pos / len(df) * 100\n print(f\"{name:6s}: {len(df):>4} rows, {n_pos:>3} converted ({rate:.1f}%)\")"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Feature overview from the train split\ntrain = splits[\"train\"]\nprint(f\"Task: {task_id}\")\nprint(f\"Features: {len(train.columns)} columns\\n\")\ntrain.dtypes"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick summary statistics for numeric features\n",
"train.describe().T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task manifest\n",
"\n",
"`task_manifest.json` records the split ratios and label column for reproducibility."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(task_dir / \"task_manifest.json\") as f:\n",
" task_manifest = json.load(f)\n",
"\n",
"task_manifest"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Dataset card and feature dictionary\n",
"\n",
"Every bundle includes a human-readable dataset card (Markdown) and a machine-readable feature dictionary (CSV) describing each column in the task table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Dataset card (first 40 lines)\n",
"card_text = (bundle_path / \"dataset_card.md\").read_text()\n",
"print(\"\\n\".join(card_text.splitlines()[:40]))\n",
"print(f\"\\n... ({len(card_text.splitlines())} lines total)\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Feature dictionary\n",
"feat_dict = pd.read_csv(bundle_path / \"feature_dictionary.csv\")\n",
"feat_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"\n",
"This bundle was generated in **`student_public`** mode, which excludes the hidden causal structure behind the data. leadforge also supports a **`research_instructor`** mode that includes the full world graph, latent variable registry, and mechanism summaries — useful for teaching causal inference or evaluating model interpretability. That's a topic for a future notebook.\n",
"\n",
"For now, you have everything you need to start building models on the task splits!"
]
},
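{
"cell_type": "markdown",
"metadata": {},
"source": "As a quick sanity check that the splits really are model-ready, here is a minimal baseline sketch. It assumes scikit-learn is installed (it is not a leadforge dependency), that the validation split file is named `valid.parquet` to match the manifest keys above, uses only the numeric feature columns, and fills missing values with 0; a real baseline would handle categoricals and imputation properly."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Baseline sketch (assumes scikit-learn is available; not a leadforge dependency).\n# Numeric columns only, missing values filled with 0; a proper pipeline would\n# encode categoricals and impute explicitly.\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import roc_auc_score\n\ndef xy(df):\n    X = df.select_dtypes(\"number\").drop(columns=[task_id], errors=\"ignore\").fillna(0)\n    return X, df[task_id]\n\nX_train, y_train = xy(train)\nX_valid, y_valid = xy(splits[\"valid\"])\n\nclf = LogisticRegression(max_iter=1000)\nclf.fit(X_train, y_train)\nauc = roc_auc_score(y_valid, clf.predict_proba(X_valid)[:, 1])\nprint(f\"Validation ROC AUC: {auc:.3f}\")"
},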
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleanup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Explicit cleanup (atexit also handles this if the kernel dies)\nshutil.rmtree(tmpdir, ignore_errors=True)\nprint(f\"Cleaned up {tmpdir}\")"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}