Merged
8 changes: 7 additions & 1 deletion .agent-plan.md
@@ -148,6 +148,12 @@ Documentation + CI:
- [x] `leadforge/render/snapshots.py` — added comment documenting the feature-label temporal mismatch when `label_window_days < horizon_days` (features aggregate over full horizon; label uses shorter window)
- [x] 7 new tests: default-90 unchanged, shorter window fewer conversions, 1-day window zero conversions, late conversions excluded (with day-offset verification), conversion_timestamp still set outside window, event counts unchanged by window, bundle round-trip integration; total 788 passing

### M15: README + CHANGELOG polish (PR #46)

- [x] `README.md` — removed all "coming in vX.Y.Z" placeholders; added working CLI quickstart, Python API example, exposure modes table, difficulty profiles table, output bundle layout, key design principles
- [x] `CHANGELOG.md` — created; covers v0.1.0 through post-v0.5.0 improvements; user-facing descriptions grouped by version
- [x] `.agent-plan.md` — updated deferred items table

### Fix: direct conversion bypass for pre-SQL leads (PR #45, closes #44)

- [x] `leadforge/simulation/engine.py` — added `_DIRECT_CONVERSION_STAGES` and `_DIRECT_CONVERSION_DISCOUNT` (0.01) constants; pre-SQL leads (`mql`, `sal`) now have a small daily probability of converting directly, bypassing the full funnel
@@ -207,7 +213,7 @@ Documentation + CI:
| M14: Notebook 2 (lead scoring baseline) | Deferred | v4 validation script covers this |
| M14: Notebook 3 (public vs instructor) | Discarded | No current audience |
| M14: Notebook 4 (recipe customization) | Discarded | Premature |
| M15: Docs polish + v1.0 RC | Deferred | Do after v4 ships |
| M15: Docs polish + v1.0 RC | **In progress** | README + CHANGELOG done; architecture diagram and notebooks remain |

### From post-v1 list

54 changes: 54 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,54 @@
# Changelog

All notable changes to leadforge are documented here.
Format inspired by [Keep a Changelog](https://keepachangelog.com/).

---

## Unreleased

- **Direct conversion bypass** (PR #45): pre-SQL leads can now convert via a rare direct path, so the `is_sql → converts` relationship is no longer deterministic.
- **Configurable label window** (PR #43): `label_window_days` controls conversion label derivation in the simulation.
- **Generalized task support** (PR #40, #42): `primary_task` threaded through bundle, validation, and pipelines; dataset card prose adapts to non-conversion tasks.
- **Pipeline extraction** (PR #29, #34): build pipeline functions extracted into `leadforge.pipelines` with proper RNG conventions.
- **Latent-aware touch intensity** (PR #31): `LatentDecayIntensity` mechanism creates causal link between latent traits and touch patterns.
- **Canonical validation module** (PR #26): reusable lead scoring validation with sklearn pipeline.
- **v4–v6 dataset pipelines**: progressive dataset versions with leakage traps, student/instructor splits, value-aware scoring, and GBM improvement validation.
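
As a rough illustration of the configurable label window entry above, the sketch below derives a conversion label from a simulated conversion day. The `LeadOutcome` record, its field names, and the inclusive boundary are assumptions made for this example; they are not taken from leadforge's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeadOutcome:
    # Hypothetical record; field names are illustrative, not leadforge's schema.
    lead_id: str
    conversion_day: Optional[int]  # days since the snapshot; None = never converted


def derive_label(outcome: LeadOutcome, label_window_days: int = 90) -> bool:
    """Return True only if the conversion falls inside the label window.

    ASSUMPTION: the window boundary is inclusive.
    """
    if outcome.conversion_day is None:
        return False
    return outcome.conversion_day <= label_window_days


outcomes = [
    LeadOutcome("a", 12),
    LeadOutcome("b", 75),
    LeadOutcome("c", None),
]
labels_90 = [derive_label(o, 90) for o in outcomes]
labels_30 = [derive_label(o, 30) for o in outcomes]
```

Shortening the window can only turn positive labels into negatives; the conversion day itself stays on the record either way.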

---

## Milestone 0.5.0 — Validation Harness & CLI Complete (2026-04-29)

- Full validation harness: determinism checks, exposure monotonicity, realism bounds, difficulty validation, cross-seed drift detection.
- `leadforge validate` command with artifact checks, FK integrity, leakage detection, and task split validation.
- Parquet metadata used for row counts (no full table reads during validation).

## Milestone 0.4.0 — Simulation Engine & End-to-End Generation (2026-04-28)

- 90-day daily-step simulation engine with churn, stage advancement, conversion hazards, and touch emission.
- Population generation: accounts (3 latent traits), contacts (4 traits), leads (1 trait) with motif-family biases.
- Full render pipeline: 9-table relational output, leakage-free lead snapshots, deterministic train/valid/test splits.
- Exposure filtering: `student_public` and `research_instructor` modes with truth redaction.
- CLI commands: `generate`, `inspect`, `validate`, `list-recipes` — all fully wired.
- Bundle manifest with provenance, row counts, and SHA-256 file hashes.
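
A minimal sketch of how SHA-256 file hashes in a manifest could be checked after the fact. The manifest schema used here (a top-level `"files"` mapping of relative path to hex digest) and the function names are assumptions for illustration; the real `manifest.json` layout may differ.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large tables are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_hash_mismatches(bundle_root: Path) -> list[str]:
    """Return relative paths whose current hash differs from the manifest."""
    # ASSUMPTION: manifest.json holds {"files": {"<relative path>": "<sha256 hex>"}}.
    manifest = json.loads((bundle_root / "manifest.json").read_text())
    return [
        rel
        for rel, expected in manifest["files"].items()
        if sha256_of(bundle_root / rel) != expected
    ]
```

An empty result means every listed file still matches its recorded digest.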

## Milestone 0.3.0 — World Structure & Mechanisms (2026-04-25)

- Hidden world graph (DAG) with 5 motif families: fit-dominant, intent-dominant, sales-execution-sensitive, demo/trial-mediated, buying-committee-friction.
- Stochastic graph rewiring: optional-node dropping, edge-weight jitter, latent-confounder injection.
- Mechanism layer: latent scores, conversion hazards, stage transitions, Poisson intensities, categorical influences, noisy proxies.
- Motif-aware mechanism assignment policies.
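
To make the term "conversion hazard" concrete, here is a small sketch that turns a per-day hazard into a cumulative conversion probability. It assumes a constant daily hazard, which is a simplification chosen for the example; the mechanisms listed above presumably vary the hazard by stage and latent traits.

```python
def cumulative_conversion_probability(daily_hazard: float, days: int) -> float:
    """Probability of at least one conversion in `days` independent daily trials.

    ASSUMPTION: the hazard is constant across days.
    """
    if not 0.0 <= daily_hazard <= 1.0:
        raise ValueError("daily_hazard must be in [0, 1]")
    return 1.0 - (1.0 - daily_hazard) ** days


p_two_days = cumulative_conversion_probability(0.5, 2)
```

Even a small daily hazard compounds over a long horizon, which is why a daily-step engine needs per-day probabilities well below the target conversion rate.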

## Milestone 0.2.0 — Config, Recipes & Narrative (2026-04-20)

- Typed `GenerationConfig`, `Recipe`, `WorldSpec` models with full precedence resolution (kwargs > override > recipe > defaults).
- Seeded RNG with SHA-256-derived named substreams for reproducibility.
- Narrative layer: company, product, market, GTM motion, personas, funnel stages — loaded from recipe YAML.
- Schema layer: 9 entity dataclasses, 10 FK constraints, 29 snapshot features, feature dictionary writer.
- Dataset card renderer from narrative + world spec.
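
The named-substream idea can be sketched with the standard library alone. This is an assumed construction (hash the root seed together with a stream name, then seed a generator from the digest), not leadforge's actual RNG code; the `substream` function and the stream names are hypothetical.

```python
import hashlib
import random


def substream(root_seed: int, name: str) -> random.Random:
    """Derive an independent, reproducible generator for a named component."""
    # Hash seed and name together so each name maps to its own seed.
    material = f"{root_seed}:{name}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # ASSUMPTION: the first 8 bytes of the digest are enough seed material.
    return random.Random(int.from_bytes(digest[:8], "big"))


accounts_rng = substream(42, "population.accounts")
touches_rng = substream(42, "simulation.touches")
```

Because each component draws from its own stream, adding draws in one component does not shift the random numbers seen by another, which helps keep output stable across code changes.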

## Milestone 0.1.0 — Project Foundation (2026-04-18)

- Package skeleton, CLI entry point (`leadforge list-recipes`), CI pipeline.
- Recipe registry with `b2b_saas_procurement_v1` recipe.
- GitHub Actions: lint, typecheck, test matrix (Python 3.11 + 3.12).
107 changes: 92 additions & 15 deletions README.md
@@ -2,16 +2,20 @@

**Opinionated framework for generating synthetic CRM and GTM datasets from simulated commercial worlds.**

`leadforge` generates narrative-grounded synthetic revenue datasets starting with lead scoring, designed to support teaching, portfolio projects, and research. Rather than sampling rows from a distribution, it simulates a commercial world (a specific company, selling a specific product, to a specific kind of buyer) and renders realistic CRM-style outputs from that world.
`leadforge` generates narrative-grounded synthetic revenue datasets starting with lead scoring, designed for teaching, portfolio projects, and research. Rather than sampling rows from a distribution, it simulates a commercial world: a specific company, selling a specific product, to a specific kind of buyer, and renders realistic CRM-style outputs from that world.

---

## Installation

Requires **Python 3.11+**.

```bash
pip install leadforge
pip install git+https://github.com/leadforge-dev/leadforge.git
```

> PyPI package coming with the v1.0 release.

For development:

```bash
```

@@ -25,27 +29,29 @@ pre-commit install

## Quickstart

### CLI

```bash
# List available recipes
leadforge list-recipes

# Coming in v0.2.0: generate a dataset bundle
# leadforge generate \
# --recipe b2b_saas_procurement_v1 \
# --seed 42 \
# --mode student_public \
# --difficulty intermediate \
# --n-leads 5000 \
# --out ./out/demo_bundle
# Generate a dataset bundle
leadforge generate \
  --recipe b2b_saas_procurement_v1 \
  --seed 42 \
  --mode student_public \
  --difficulty intermediate \
  --n-leads 5000 \
  --out ./out/demo_bundle

# Coming in v0.4.0: inspect a generated bundle
# leadforge inspect ./out/demo_bundle
# Inspect bundle metadata
leadforge inspect ./out/demo_bundle

# Coming in v0.5.0: validate a generated bundle
# leadforge validate ./out/demo_bundle
# Validate bundle integrity
leadforge validate ./out/demo_bundle
```

**Python API** (coming in v0.2.0):
### Python API

```python
from leadforge.api import Generator
```

@@ -61,11 +67,82 @@ bundle.save("./out/demo_bundle")

---

## Exposure Modes

Control what truth is visible in the output bundle:

| Mode | Purpose | Includes |
|------|---------|----------|
| `student_public` | Teaching / portfolio use | Tables, features, task splits, dataset card |
| `research_instructor` | Full truth for instructors / researchers | All of the above + hidden graph, world spec, latent registry, mechanism summary |

Set via `--mode` on the CLI or `exposure_mode=` in the Python API.

---

## Difficulty Profiles

Each recipe ships with difficulty profiles that control signal-to-noise ratio:

| Profile | Description |
|---------|-------------|
| `intro` | Strong signal, low noise — good for first-time learners |
| `intermediate` | Moderate signal, realistic noise |
| `advanced` | Weak signal, high noise — challenges experienced practitioners |

Set via `--difficulty` on the CLI or `difficulty=` in `generate()`.

---

## Output Bundle

```
bundle_root/
manifest.json # provenance, row counts, file hashes
dataset_card.md # human-readable dataset documentation
feature_dictionary.csv # feature names, types, descriptions
tables/ # 9 relational Parquet tables
tasks/
converted_within_90_days/
train.parquet
valid.parquet
test.parquet
task_manifest.json
metadata/ # (research_instructor only) hidden graph, world spec, latents
```
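
A quick way to sanity-check a bundle against the layout above is to look for the expected entries on disk. The sketch below is illustrative only: it assumes the split files sit under `tasks/converted_within_90_days/`, leaves out `task_manifest.json` and `metadata/` (their placement or presence depends on details not shown here), and is no substitute for `leadforge validate`.

```python
from pathlib import Path

TASK = "converted_within_90_days"

# Entries read off the layout listing; nesting of the split files is assumed.
EXPECTED = [
    "manifest.json",
    "dataset_card.md",
    "feature_dictionary.csv",
    "tables",
    f"tasks/{TASK}/train.parquet",
    f"tasks/{TASK}/valid.parquet",
    f"tasks/{TASK}/test.parquet",
]


def missing_entries(bundle_root: Path) -> list[str]:
    """Return expected relative paths that do not exist under bundle_root."""
    return [rel for rel in EXPECTED if not (bundle_root / rel).exists()]
```

Calling `missing_entries(Path("./out/demo_bundle"))` on a freshly generated bundle should return an empty list if the assumed nesting holds.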

---

## Key Design Principles

- **Deterministic**: same (recipe, seed, version) → identical output.
- **Relational-first**: 9 normalized tables; flat ML exports are derived.
- **No external APIs**: core generation never requires network access.
- **Simulation-driven labels**: `converted_within_90_days` emerges from simulated events, not sampled directly.
- **Leakage-safe**: no feature uses events after the snapshot anchor.
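
The leakage-safe principle can be illustrated with a feature that only counts events up to the snapshot anchor. The event shape, the function name, and the inclusive comparison are assumptions for this sketch, not leadforge's snapshot code.

```python
from datetime import datetime

# Hypothetical event shape: (lead_id, timestamp). Not leadforge's actual schema.
Event = tuple[str, datetime]


def touches_up_to_anchor(events: list[Event], lead_id: str, anchor: datetime) -> int:
    """Count a lead's events at or before the snapshot anchor."""
    # ASSUMPTION: an event exactly at the anchor is still allowed.
    return sum(1 for lid, ts in events if lid == lead_id and ts <= anchor)


events = [
    ("lead-1", datetime(2026, 1, 5)),
    ("lead-1", datetime(2026, 1, 20)),
    ("lead-1", datetime(2026, 2, 10)),  # after the anchor: must not be counted
    ("lead-2", datetime(2026, 1, 8)),
]
anchor = datetime(2026, 2, 1)
n_touches = touches_up_to_anchor(events, "lead-1", anchor)
```

Filtering on the anchor inside the feature function, rather than trusting the caller to pre-filter, keeps later events from leaking into a snapshot by accident.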

---

## Documentation

- [Design document](docs/leadforge_design_doc.md)
- [Architecture spec](docs/leadforge_architecture_spec.md)
- [Implementation plan](docs/leadforge_implementation_plan.md)
- [v4 dataset design](docs/v4/design.md)
- [Changelog](CHANGELOG.md)

---

## Development

```bash
pip install -e ".[dev]"
pytest # run all tests (~800)
ruff check . # lint
ruff format . # format
mypy leadforge/ # type check
pre-commit run --all-files # full pre-commit suite
```

---
