perf(worker-ingest): scope orphan-sweep node_pipeline to phase-1 seeds#297

Merged
charlie83Gs merged 1 commit into main from worktree-fix-orphan-sweep-targeted
Apr 26, 2026

@charlie83Gs
Contributor

Summary

  • orphan_fact_sweep_wf Phase 2 was dispatching up to 500 node_pipeline_wf runs every 10 min via list_seeds(exclude_merged=True, limit=500) — broad sweep over the whole graph, regardless of whether seeds had new facts.
  • Replaced with targeted dispatch on the seeds actually touched by Phase 1 entity extraction, resolved through SeedDedupBatchOutput.merges (loser → winner, transitively) so canonical winners get rebuilt.
  • merged is an epistemological identity marker (loser folded into winner via alias trail) — not a "needs rebuild" signal. The previous code was conflating the two.
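The loser → winner resolution described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual helper: it assumes the merges from SeedDedupBatchOutput can be viewed as a plain loser-key to winner-key mapping, and the function name, signature, and behaviour on a cycle (stop at the last unvisited key) are assumptions.

```python
from typing import Mapping


def resolve_through_merges(key: str, merges: Mapping[str, str]) -> str:
    """Follow loser -> winner links until reaching a key that was not merged away.

    Hypothetical sketch: `merges` is assumed to map each loser key to its winner.
    """
    seen = {key}
    current = key
    while current in merges:
        nxt = merges[current]
        if nxt in seen:
            # Cycle guard: stop instead of looping forever on malformed merge data.
            break
        seen.add(nxt)
        current = nxt
    return current


# Example: "a" was merged into "b", and "b" was later merged into "c".
merges = {"a": "b", "b": "c"}
touched = ["a", "b", "d"]  # seed keys touched by phase-1 extraction (made up)

# Several touched seeds can collapse onto the same canonical winner.
canonical = sorted({resolve_through_merges(k, merges) for k in touched})
```

Here both "a" and "b" resolve to "c", so only "c" and "d" remain as rebuild candidates.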

Key changes

  • services/worker-ingest/src/kt_worker_ingest/workflows/orphan_fact_sweeper.py
    • Capture seed_dedup_batch output (was fire-and-forget)
    • Add _resolve_through_merges helper with cycle guard
    • Swap list_seeds(exclude_merged=True, limit=500) for get_seeds_by_keys_batch(canonical_keys)
    • Defensively skip status in ('garbage', 'merged') after re-fetch
  • services/worker-ingest/tests/test_orphan_fact_sweeper.py
    • 5 new tests covering merge-chain resolution (no-op, single hop, transitive, cycle, phase-1 collapse)
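The defensive status check after the re-fetch might look something like the sketch below. The Seed dataclass, the "active" status value, and select_dispatch_targets are hypothetical stand-ins for illustration; the real return type of get_seeds_by_keys_batch is not shown in this PR description, so the re-fetched rows are modelled here as a plain list.

```python
from dataclasses import dataclass
from typing import Iterable

# Statuses that should never trigger a rebuild, per the PR description.
SKIP_STATUSES = frozenset({"garbage", "merged"})


@dataclass(frozen=True)
class Seed:
    """Hypothetical stand-in for a re-fetched seed row."""

    key: str
    status: str


def select_dispatch_targets(seeds: Iterable[Seed]) -> list[str]:
    """Return unique keys of seeds that should get a node_pipeline_wf run."""
    targets: list[str] = []
    seen: set[str] = set()
    for seed in seeds:
        # Skip seeds that became garbage or merged between dedup and re-fetch,
        # and skip keys that were already selected.
        if seed.status in SKIP_STATUSES or seed.key in seen:
            continue
        seen.add(seed.key)
        targets.append(seed.key)
    return targets


# Made-up re-fetch result: one live seed (listed twice), one merged, one garbage.
refetched = [
    Seed("c", "active"),
    Seed("d", "merged"),
    Seed("e", "garbage"),
    Seed("c", "active"),
]
targets = select_dispatch_targets(refetched)
```

Only "c" survives the filter in this example, so a single child workflow would be dispatched instead of one per re-fetched row.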

Test plan

  • uv run --project services/worker-ingest pytest services/worker-ingest/tests/ -x -v (32 passed)
  • Watch Hatchet UI for next orphan_fact_sweep_wf run — child dispatch count should drop from ~500 to N (typically <50)
  • Confirm orphan-derived facts still produce graph nodes (smoke test against default graph)

🤖 Generated with Claude Code

Phase 2 of orphan_fact_sweep_wf was dispatching node_pipeline_wf for up
to 500 seeds every 10 minutes via list_seeds(exclude_merged=True), even
though only the seeds touched by phase-1 entity extraction had new
facts. The merged status excluded by list_seeds is an epistemological
identity marker (loser folded into winner via alias), not a "needs
rebuild" signal — the broad sweep was conflating the two.

Mirror the decompose_sources pattern: dispatch only the seeds returned
from store_seeds_from_extracted_nodes, after resolving each through the
SeedDedupBatchOutput.merges chain (loser → winner, transitively) so the
canonical winner gets rebuilt instead of a now-merged loser.

- Capture seed_dedup_batch output (was fire-and-forget) and parse merges
- Add _resolve_through_merges helper with cycle guard
- Replace list_seeds(exclude_merged=True, limit=500) with
  get_seeds_by_keys_batch(canonical_keys)
- Defensively skip status in ('garbage', 'merged') after re-fetch

Result: dispatch count drops from up to 500 to N (typically the entity
count from this orphan batch). Stale-graph re-enrichment is a separate
maintenance concern, not this cron's job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@charlie83Gs
Contributor Author

CI status note

Backend Lint failed on pre-existing F821 errors in worker-synthesis files I did not touch:

  • services/worker-synthesis/src/kt_worker_synthesis/workflows/super_synthesizer.py:399
  • services/worker-synthesis/src/kt_worker_synthesis/workflows/synthesizer.py:215

Both errors exist on main HEAD (8b453875): I verified that the same agent.get_model_id() reference is present there unchanged. The same lint failure has been hitting recent main releases (runs 24913588083, 24914692793, 24915126401 all failed).

My changed files lint clean:

$ uv run ruff check services/worker-ingest/src/kt_worker_ingest/workflows/orphan_fact_sweeper.py services/worker-ingest/tests/test_orphan_fact_sweeper.py
All checks passed!

Unit/Integration/Frontend tests are skipped because they're gated on Backend Lint. Local worker-ingest test suite passes (32/32).

Recommend merging the lint fix on main first, then re-running this PR's CI.

@charlie83Gs charlie83Gs merged commit 9f07cd5 into main Apr 26, 2026
7 of 8 checks passed
@charlie83Gs charlie83Gs deleted the worktree-fix-orphan-sweep-targeted branch April 26, 2026 00:20
@github-actions

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
