docs: add Notebook 1 — Inspecting a Generated World (M14) #48
Merged
+274
−1
3 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,273 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": "# Inspecting a Generated World\n\nThis notebook walks you through generating a synthetic CRM dataset with **leadforge** and exploring what's inside the output bundle.\n\n**Prerequisites:** `pip install -e \".[dev]\"` from the repo root, plus a Jupyter environment (`pip install notebook` or `pip install jupyterlab`).\n\nWe'll cover:\n1. Generating a bundle via the Python API\n2. Exploring `manifest.json` — provenance, row counts, file hashes\n3. Loading the relational tables and examining FK relationships\n4. Inspecting the task splits (train/valid/test)\n5. Reading the dataset card and feature dictionary" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## 1. Generate a bundle\n", | ||
| "\n", | ||
| "We use `Generator.from_recipe()` to create a small world (500 leads) in `student_public` mode with `intro` difficulty. The bundle is written to a temporary directory so nothing lingers after the notebook finishes." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "import atexit\nimport shutil\nimport tempfile\nfrom pathlib import Path\n\nfrom leadforge.api import Generator\n\ntmpdir = tempfile.mkdtemp(prefix=\"leadforge_demo_\")\n# Fallback cleanup: runs at normal interpreter exit (not if the kernel is killed)\natexit.register(shutil.rmtree, tmpdir, ignore_errors=True)\nbundle_path = Path(tmpdir) / \"demo_bundle\"\n\ngen = Generator.from_recipe(\n    \"b2b_saas_procurement_v1\",\n    seed=42,\n    exposure_mode=\"student_public\",\n    difficulty=\"intro\",\n)\nbundle = gen.generate(n_leads=500)\nbundle.save(str(bundle_path))\n\nprint(f\"Bundle written to: {bundle_path}\")" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "Let's see what files were created:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "for p in sorted(bundle_path.rglob(\"*\")):\n", | ||
| "    if p.is_file():\n", | ||
| "        size_kb = p.stat().st_size / 1024\n", | ||
| "        print(f\" {p.relative_to(bundle_path)} ({size_kb:.1f} KB)\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## 2. Explore the manifest\n", | ||
| "\n", | ||
| "`manifest.json` is the bundle's provenance record. It captures the recipe, seed, package version, exposure mode, row counts, and SHA-256 hashes for every data file — everything you need to reproduce or verify the dataset." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import json\n", | ||
| "\n", | ||
| "with open(bundle_path / \"manifest.json\") as f:\n", | ||
| "    manifest = json.load(f)\n", | ||
| "\n", | ||
| "# Top-level provenance fields\n", | ||
| "for key in [\n", | ||
| "    \"package_version\",\n", | ||
| "    \"recipe_id\",\n", | ||
| "    \"seed\",\n", | ||
| "    \"exposure_mode\",\n", | ||
| "    \"difficulty\",\n", | ||
| "    \"generation_timestamp\",\n", | ||
| "    \"bundle_schema_version\",\n", | ||
| "]:\n", | ||
| "    print(f\"{key}: {manifest.get(key)}\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# Table inventory: row counts and file hashes\nprint(\"Relational tables:\")\nfor name, info in manifest[\"tables\"].items():\n    print(f\" {name:20s} {info['row_count']:>6,} rows sha256={info['sha256'][:12]}...\")\n\nprint(\"\\nTask splits:\")\nfor task_id, task_info in manifest[\"tasks\"].items():\n    print(f\" {task_id}:\")\n    for key in (\"train\", \"valid\", \"test\"):\n        rows_key = f\"{key}_rows\"\n        if rows_key in task_info:\n            print(f\" {key:6s} {task_info[rows_key]:>5,} rows\")" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## 3. Relational tables\n", | ||
| "\n", | ||
| "The bundle contains 9 relational tables stored as Parquet files under `tables/`. These represent the full CRM world: accounts, contacts, leads, their interactions (touches, sessions, sales activities), and outcomes (opportunities, customers, subscriptions)." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import pandas as pd\n", | ||
| "\n", | ||
| "tables = {}\n", | ||
| "for parquet_file in sorted((bundle_path / \"tables\").glob(\"*.parquet\")):\n", | ||
| " name = parquet_file.stem\n", | ||
| " tables[name] = pd.read_parquet(parquet_file)\n", | ||
| "\n", | ||
| "# Summary of all tables\n", | ||
| "summary = pd.DataFrame(\n", | ||
| "    [{\"table\": name, \"rows\": len(df), \"columns\": len(df.columns)} for name, df in tables.items()]\n", | ||
| ")\n", | ||
| "print(summary.to_string(index=False))" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Sample rows from the leads table\n", | ||
| "tables[\"leads\"].head(3)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Sample rows from the touches table\n", | ||
| "tables[\"touches\"].head(3)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### FK relationships\n", | ||
| "\n", | ||
| "The tables are linked by foreign keys (e.g., every lead references an account and a contact). Let's verify one relationship and see how the tables connect." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# Every lead's account_id should exist in the accounts table\nlead_account_ids = set(tables[\"leads\"][\"account_id\"])\naccount_ids = set(tables[\"accounts\"][\"account_id\"])\norphans = lead_account_ids - account_ids\nprint(f\"FK check: {len(orphans)} orphan account_ids (expect 0)\")\n\nprint(f\"Accounts: {len(account_ids)}\")\nprint(f\"Contacts: {len(tables['contacts'])}\")\nprint(f\"Leads: {len(tables['leads'])}\")\nprint(f\"Leads per account (mean): {len(tables['leads']) / len(account_ids):.1f}\")\nprint(f\"Touches per lead (mean): {len(tables['touches']) / len(tables['leads']):.1f}\")" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## 4. Task splits\n", | ||
| "\n", | ||
| "The primary task (`converted_within_90_days`) is exported as train/valid/test Parquet splits under `tasks/`. Each row is a lead snapshot — a flat, ML-ready feature vector anchored at the snapshot date. No post-snapshot data leaks into these features." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# Read task ID from the manifest rather than hardcoding\ntask_id = next(iter(manifest[\"tasks\"]))\ntask_dir = bundle_path / \"tasks\" / task_id\n\nsplits = {}\nfor split_file in sorted(task_dir.glob(\"*.parquet\")):\n    splits[split_file.stem] = pd.read_parquet(split_file)\n\nfor name, df in splits.items():\n    n_pos = df[task_id].sum()\n    rate = n_pos / len(df) * 100\n    print(f\"{name:6s}: {len(df):>4} rows, {n_pos:>3} converted ({rate:.1f}%)\")" | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# Feature overview from the train split\ntrain = splits[\"train\"]\nprint(f\"Task: {task_id}\")\nprint(f\"Features: {len(train.columns)} columns\\n\")\ntrain.dtypes" | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Quick summary statistics for numeric features\n", | ||
| "train.describe().T" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Task manifest\n", | ||
| "\n", | ||
| "`task_manifest.json` records the split ratios and label column for reproducibility." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "with open(task_dir / \"task_manifest.json\") as f:\n", | ||
| "    task_manifest = json.load(f)\n", | ||
| "\n", | ||
| "task_manifest" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## 5. Dataset card and feature dictionary\n", | ||
| "\n", | ||
| "Every bundle includes a human-readable dataset card (Markdown) and a machine-readable feature dictionary (CSV) describing each column in the task table." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Dataset card (first 40 lines)\n", | ||
| "card_text = (bundle_path / \"dataset_card.md\").read_text(encoding=\"utf-8\")\n", | ||
| "print(\"\\n\".join(card_text.splitlines()[:40]))\n", | ||
| "print(f\"\\n... ({len(card_text.splitlines())} lines total)\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Feature dictionary\n", | ||
| "feat_dict = pd.read_csv(bundle_path / \"feature_dictionary.csv\")\n", | ||
| "feat_dict" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## What's next?\n", | ||
| "\n", | ||
| "This bundle was generated in **`student_public`** mode, which excludes the hidden causal structure behind the data. leadforge also supports a **`research_instructor`** mode that includes the full world graph, latent variable registry, and mechanism summaries — useful for teaching causal inference or evaluating model interpretability. That's a topic for a future notebook.\n", | ||
| "\n", | ||
| "For now, you have everything you need to start building models on the task splits!" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Cleanup" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# Explicit cleanup (the atexit hook is only a fallback at normal interpreter exit)\nshutil.rmtree(tmpdir, ignore_errors=True)\nprint(f\"Cleaned up {tmpdir}\")" | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "name": "python", | ||
| "version": "3.11.0" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 4 | ||
| } | ||
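
The manifest cell prints a truncated SHA-256 prefix for each table but never checks it against the file on disk. Below is a minimal sketch of such a check. It assumes that each key under `manifest["tables"]` corresponds to `tables/<name>.parquet` and that the `sha256` field is the hex digest of the file's raw bytes; the notebook shows only the `sha256` and `row_count` keys, so both points are assumptions. `sha256_of_file` and `verify_table_hashes` are hypothetical helpers, not part of leadforge.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_table_hashes(bundle_path: Path, manifest: dict) -> dict:
    """Map each table name to True if its file exists and its hash matches.

    Assumption: table `name` is stored at tables/<name>.parquet.
    """
    results = {}
    for name, info in manifest["tables"].items():
        table_file = Path(bundle_path) / "tables" / f"{name}.parquet"
        results[name] = (
            table_file.is_file()
            and sha256_of_file(table_file) == info["sha256"]
        )
    return results
```

In the notebook this could run right after the manifest is loaded, as `verify_table_hashes(bundle_path, manifest)`. Any `False` value would flag a table that is missing or whose contents differ from what the manifest recorded.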