perf(worker-ingest): scope orphan-sweep node_pipeline to phase-1 seeds #297
charlie83Gs merged 1 commit into main
Phase 2 of orphan_fact_sweep_wf was dispatching node_pipeline_wf for up
to 500 seeds every 10 minutes via list_seeds(exclude_merged=True), even
though only the seeds touched by phase-1 entity extraction had new
facts. The merged status excluded by list_seeds is an epistemological
identity marker (loser folded into winner via alias), not a "needs
rebuild" signal — the broad sweep was conflating the two.
Mirror the decompose_sources pattern: dispatch only the seeds returned
from store_seeds_from_extracted_nodes, after resolving each through the
SeedDedupBatchOutput.merges chain (loser → winner, transitively) so the
canonical winner gets rebuilt instead of a now-merged loser.
- Capture seed_dedup_batch output (was fire-and-forget) and parse merges
- Add _resolve_through_merges helper with cycle guard
- Replace list_seeds(exclude_merged=True, limit=500) with
get_seeds_by_keys_batch(canonical_keys)
- Defensively skip status in ('garbage', 'merged') after re-fetch
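The merge resolution described above can be sketched roughly as follows. This is an illustration only, not the body of _resolve_through_merges: the function name resolve_through_merges and the plain loser-to-winner dict are assumptions, since the real shape of SeedDedupBatchOutput.merges is not shown in this PR text.

```python
from typing import Mapping


def resolve_through_merges(key: str, merges: Mapping[str, str]) -> str:
    """Follow loser -> winner links until reaching a key that was not merged.

    `merges` is assumed to map a merged (loser) key to its winner key.
    """
    seen = {key}
    current = key
    while current in merges:
        winner = merges[current]
        if winner in seen:
            # Cycle guard: a malformed chain (a -> b -> a) must not loop forever.
            break
        seen.add(winner)
        current = winner
    return current


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    merges = {"acme-inc": "acme-corp", "acme-corp": "acme"}
    print(resolve_through_merges("acme-inc", merges))  # acme
    print(resolve_through_merges("acme", merges))      # acme
```

Keeping a visited set makes the walk terminate even if the merge output ever contains a loop; in that case the sketch simply returns the last key reached before the repeat.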
Result: dispatch count drops from up to 500 to N (typically the entity
count from this orphan batch). Stale-graph re-enrichment is a separate
maintenance concern, not this cron's job.
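To make the "500 to N" claim concrete, here is a minimal sketch of the selection step under stated assumptions. The Seed dataclass, the "active" status, and the seeds_by_key dict (standing in for whatever get_seeds_by_keys_batch(canonical_keys) returns) are all hypothetical; only the 'garbage' and 'merged' statuses come from this PR.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Mapping


@dataclass
class Seed:  # hypothetical stand-in for the real seed record
    key: str
    status: str


SKIP_STATUSES = ("garbage", "merged")


def follow_merges(key: str, merges: Mapping[str, str]) -> str:
    """Walk loser -> winner links, stopping on a repeat (cycle guard)."""
    seen = {key}
    while key in merges and merges[key] not in seen:
        key = merges[key]
        seen.add(key)
    return key


def seeds_to_dispatch(
    touched_keys: Iterable[str],
    merges: Mapping[str, str],
    seeds_by_key: Dict[str, Seed],
) -> List[Seed]:
    """Resolve touched keys to winners, dedupe, then drop unusable seeds."""
    canonical: List[str] = []
    seen = set()
    for key in touched_keys:
        winner = follow_merges(key, merges)
        if winner not in seen:  # two losers may share one winner
            seen.add(winner)
            canonical.append(winner)

    selected = []
    for key in canonical:
        seed = seeds_by_key.get(key)
        # Defensive re-check after the re-fetch: skip missing or dead seeds.
        if seed is None or seed.status in SKIP_STATUSES:
            continue
        selected.append(seed)
    return selected


if __name__ == "__main__":
    store = {
        "acme": Seed("acme", "active"),
        "acme-inc": Seed("acme-inc", "merged"),
        "widgets": Seed("widgets", "active"),
        "junk": Seed("junk", "garbage"),
    }
    touched = ["acme-inc", "acme", "widgets", "junk"]
    picked = seeds_to_dispatch(touched, {"acme-inc": "acme"}, store)
    print([s.key for s in picked])  # ['acme', 'widgets']
```

The dispatch count is bounded by the number of distinct winners among the touched seeds, which is why it tracks the entity count of the orphan batch rather than the size of the graph.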
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI status note
Backend Lint failed on pre-existing F821 errors, not on anything this PR changes: my changed files lint clean. Unit/Integration/Frontend tests are skipped because they're gated on Backend Lint. The local worker-ingest test suite passes (32/32). Recommend merging the lint fix on main first, then re-running this PR's CI.
Summary
- orphan_fact_sweep_wf Phase 2 was dispatching up to 500 node_pipeline_wf runs every 10 min via list_seeds(exclude_merged=True, limit=500): a broad sweep over the whole graph, regardless of whether seeds had new facts.
- It now dispatches only the seeds returned from store_seeds_from_extracted_nodes, each resolved through SeedDedupBatchOutput.merges (loser → winner, transitively) so canonical winners get rebuilt.
- merged is an epistemological identity marker (loser folded into winner via alias trail), not a "needs rebuild" signal. The previous code was conflating the two.

Key changes
services/worker-ingest/src/kt_worker_ingest/workflows/orphan_fact_sweeper.py
- Capture seed_dedup_batch output (was fire-and-forget)
- Add _resolve_through_merges helper with cycle guard
- Swap list_seeds(exclude_merged=True, limit=500) for get_seeds_by_keys_batch(canonical_keys)
- Defensively skip status in ('garbage', 'merged') after re-fetch
services/worker-ingest/tests/test_orphan_fact_sweeper.py

Test plan
- uv run --project services/worker-ingest pytest services/worker-ingest/tests/ -x -v (32 passed)
- orphan_fact_sweep_wf run: child dispatch count should drop from ~500 to N (typically <50)

🤖 Generated with Claude Code