Context
PR #40 threaded primary_task and label_window_days through naming and metadata (directory names, manifest keys, validation paths). However, label_window_days currently has no effect on the actual generated data:
- The simulation engine runs for config.horizon_days (default 90) regardless of label_window_days.
- The LeadRow.converted_within_90_days field name is hardcoded; it doesn't change when the window changes.
- The snapshot builder uses horizon_days for the observation window, not label_window_days.
- The conversion label is derived from events within the full simulation horizon, not a configurable window.
What needs to change
For label_window_days to truly work, the conversion label should be derived from events within [0, label_window_days] rather than [0, horizon_days]. This likely requires:
- simulation/engine.py: either the engine or a post-processing step needs to compute the label based on label_window_days, not horizon_days. One approach: the simulation still runs for horizon_days (to generate realistic event histories), but the label is set based on whether conversion happened within label_window_days.
- schema/entities.py: consider whether LeadRow.converted_within_90_days should be renamed to a generic converted or parameterized. This is a large schema change with wide blast radius.
- render/snapshots.py: the snapshot builder may need to use label_window_days for label derivation rather than horizon_days.
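The "simulate for horizon_days, label within label_window_days" approach could look roughly like the sketch below. It is illustrative only: WindowConfig is a hypothetical stand-in for the real config object, and derive_conversion_label and the conversion_day argument are assumed names, not existing code in simulation/engine.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WindowConfig:
    """Hypothetical stand-in for the real generation config."""

    horizon_days: int = 90        # how long the simulation runs
    label_window_days: int = 90   # window used to derive the label

    def __post_init__(self) -> None:
        # Assumption: a label window longer than the simulated horizon
        # cannot be observed, so reject it up front.
        if not 0 < self.label_window_days <= self.horizon_days:
            raise ValueError(
                "label_window_days must be in (0, horizon_days], got "
                f"{self.label_window_days} with horizon_days={self.horizon_days}"
            )


def derive_conversion_label(
    conversion_day: Optional[int], config: WindowConfig
) -> bool:
    """Return True only if conversion fell within [0, label_window_days].

    conversion_day is the simulated day of the conversion event, or None
    if the lead never converted during the full horizon.
    """
    if conversion_day is None:
        return False
    return 0 <= conversion_day <= config.label_window_days


if __name__ == "__main__":
    cfg = WindowConfig(horizon_days=90, label_window_days=30)
    # A lead converting on day 45 has a full event history in the
    # simulation, but is a negative example under a 30-day window.
    print(derive_conversion_label(45, cfg))
    print(derive_conversion_label(20, cfg))
```

Done as a post-processing step like this, the engine's event generation stays untouched and only the label assignment changes.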
Design considerations
- The simulation should probably still run for horizon_days to produce rich event data. The label window is a separate concept from the simulation horizon.
- LeadRow field renaming is a large refactor touching entities, engine, snapshots, features, pipeline rename maps, and all tests. A backward-compatible approach (e.g., keeping converted_within_90_days as the internal field name but documenting it as "conversion label") may be pragmatic.
- This interacts with the snapshot_day parameter already used by the v4/v5/v6 build pipelines.
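One way the backward-compatible option might work is to leave the internal field name alone and translate it only at the export boundary. The sketch below assumes a dict-based rename map; label_column_name, build_rename_map, and rename_row are hypothetical helpers, and the converted_within_{N}_days output naming is an assumption rather than an agreed convention.

```python
from typing import Any, Dict

# Internal field name stays fixed to avoid a repo-wide rename.
INTERNAL_LABEL_FIELD = "converted_within_90_days"


def label_column_name(label_window_days: int) -> str:
    """Public column name for the conversion label (assumed convention)."""
    return f"converted_within_{label_window_days}_days"


def build_rename_map(label_window_days: int) -> Dict[str, str]:
    """Map the internal label field to its public name, if they differ."""
    public_name = label_column_name(label_window_days)
    if public_name == INTERNAL_LABEL_FIELD:
        return {}
    return {INTERNAL_LABEL_FIELD: public_name}


def rename_row(row: Dict[str, Any], rename_map: Dict[str, str]) -> Dict[str, Any]:
    """Apply the rename map to one exported row, leaving other keys as-is."""
    return {rename_map.get(key, key): value for key, value in row.items()}


if __name__ == "__main__":
    row = {"lead_id": 1, "converted_within_90_days": True}
    print(rename_row(row, build_rename_map(30)))
```

This keeps entities, engine, and tests on the existing name while exported data carries a name that matches the configured window. The trade-off is that the internal name becomes misleading whenever label_window_days is not 90, which is why documenting it as the "conversion label" matters.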
References
GenerationConfig)