feat(worker-synthesis): store synthesis errors and add regeneration#58

Closed
charlie83Gs wants to merge 1 commit into main from feat/synthesis-error-state-regenerate

@charlie83Gs
Contributor

Summary

  • When a synthesizer agent fails to produce text, the workflow now stores the failure as an error state (metadata["synthesis_error"]) instead of a fake fallback string. The original input is preserved in metadata["synthesis_input"] for regeneration.
  • Adds POST /syntheses/{id}/regenerate API endpoint that dispatches a new regenerate_synthesis_wf Hatchet workflow to re-run the agent on the existing node in-place.
  • For sub-syntheses: regenerating automatically triggers recombine_supersynthesis_wf on the parent super-synthesis via parent_supersynthesis_id back-references.
  • The frontend shows an error banner with a "Regenerate" button on failed syntheses, shows a "Failed" badge in the investigations list, and polls for regeneration progress.

Changes

Backend

  • _helpers.py (new) — Extracted run_synthesis_agent(), process_and_store_synthesis(), store_synthesis_error(), and run_super_synthesis_combine() helpers to share between original and regeneration workflows
  • regenerate.py (new) — regenerate_synthesis_wf and recombine_supersynthesis_wf Hatchet workflows
  • synthesizer.py — Uses helpers, stores error state on failure instead of fallback text
  • super_synthesizer.py — Same error handling, writes parent_supersynthesis_id back to sub-syntheses
  • syntheses.py (API) — New regenerate endpoint, status/error_message fields in response schemas
  • models.py (kt-hatchet) — RegenerateSynthesisInput, RecombineSuperSynthesisInput

Frontend

  • Error banner with regenerate button in SynthesisDocument.tsx
  • "Failed" badge in investigations list
  • Regeneration progress polling in detail page

Test plan

  • Frontend: lint, type-check, and 123 tests pass
  • Backend: worker-synthesis 9 tests pass, API 92 tests pass
  • Manual: trigger a synthesis failure (tiny budget, no matching nodes), verify error state is shown, click regenerate, verify document is produced

🤖 Generated with Claude Code

When a synthesizer agent fails to produce text, instead of storing a
fake fallback string as the definition, the workflow now:
- Creates the node with no definition
- Stores error info in metadata["synthesis_error"]
- Stores the original SynthesizerInput in metadata["synthesis_input"]
  for later regeneration
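
To illustrate the error state described above, here is a minimal sketch of what storing the failure could look like. The `Node` dataclass, the function signature, and the shape of the `synthesis_error` payload are assumptions made for illustration; only the helper name and the metadata keys `synthesis_error` and `synthesis_input` come from this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a synthesis node, not the project's model."""
    id: str
    definition: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)


def store_synthesis_error(
    node: Node, error: Exception, synthesis_input: dict[str, Any]
) -> Node:
    # Leave the definition empty instead of writing a fallback string.
    node.definition = None
    # Assumed payload shape: exception type name plus message.
    node.metadata["synthesis_error"] = {
        "type": type(error).__name__,
        "message": str(error),
    }
    # Keep a copy of the original input so the node can be regenerated later.
    node.metadata["synthesis_input"] = dict(synthesis_input)
    return node


node = store_synthesis_error(
    Node(id="syn-1"),
    RuntimeError("agent produced no text"),
    {"topic": "example", "budget": 1},
)
```

Keeping the input next to the error means a regeneration run needs only the node ID to reconstruct what the agent was originally asked to do.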

Adds two new Hatchet workflows:
- regenerate_synthesis_wf: re-runs the agent on a failed synthesis node
  in-place, preserving the node ID. If the node is a sub-synthesis with
  a parent super-synthesis, automatically dispatches recombine.
- recombine_supersynthesis_wf: re-runs only the combine step on an
  existing super-synthesis using its current sub-synthesis documents.

The super-synthesizer now writes parent_supersynthesis_id back to each
sub-synthesis node's metadata, enabling the cascade.
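
The cascade above can be sketched roughly as follows. This is not the workflow code from this PR: nodes are plain dicts, `run_agent` and `dispatch_recombine` are hypothetical callables standing in for the agent call and the Hatchet dispatch, and clearing `synthesis_error` on success is an assumption.

```python
from typing import Any, Callable, Optional


def regenerate_synthesis(
    node: dict[str, Any],
    run_agent: Callable[[dict[str, Any]], Optional[str]],
    dispatch_recombine: Callable[[str], None],
) -> Optional[str]:
    """Re-run the agent in place; return the parent ID if a recombine was dispatched."""
    text = run_agent(node["metadata"]["synthesis_input"])
    if not text:
        # Still failing: keep the error state and skip the cascade (assumed behavior).
        node["metadata"]["synthesis_error"] = {"message": "agent produced no text"}
        return None
    # Same node, same ID: only the definition and error marker change.
    node["definition"] = text
    node["metadata"].pop("synthesis_error", None)
    parent_id = node["metadata"].get("parent_supersynthesis_id")
    if parent_id is not None:
        dispatch_recombine(parent_id)
    return parent_id


dispatched: list[str] = []
sub = {
    "id": "sub-1",
    "definition": None,
    "metadata": {
        "synthesis_input": {"topic": "x"},
        "synthesis_error": {"message": "agent produced no text"},
        "parent_supersynthesis_id": "super-1",
    },
}
regenerate_synthesis(sub, lambda _input: "regenerated text", dispatched.append)
```

A top-level synthesis without `parent_supersynthesis_id` in its metadata goes through the same path and simply dispatches nothing.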

API: adds POST /syntheses/{id}/regenerate endpoint and status/error_message
fields to response schemas.
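
As a rough illustration of the new response fields, the sketch below derives `status` and `error_message` from node metadata. The `SynthesisResponse` class, the `to_response` helper, and the status values `"failed"` and `"completed"` are hypothetical; the PR names only the two fields and the `synthesis_error` metadata key.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SynthesisResponse:
    """Hypothetical response shape, not the project's actual schema."""
    id: str
    definition: Optional[str]
    status: str
    error_message: Optional[str]


def to_response(
    node_id: str, definition: Optional[str], metadata: dict[str, Any]
) -> SynthesisResponse:
    error = metadata.get("synthesis_error")
    if error is None:
        return SynthesisResponse(node_id, definition, "completed", None)
    # Accept either a dict with a "message" key or a bare string (both assumed).
    message = error.get("message") if isinstance(error, dict) else str(error)
    return SynthesisResponse(node_id, definition, "failed", message)


failed = to_response("syn-1", None, {"synthesis_error": {"message": "no text"}})
ok = to_response("syn-2", "A finished document.", {})
```

A frontend can then show the error banner and the "Failed" badge from `status` alone, without inspecting metadata.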

Frontend: shows error banner with regenerate button on failed syntheses,
error badge in the investigations list, and polls for regeneration progress.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@charlie83Gs
Contributor Author

Abandoning stale work based on an old version.

@github-actions


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@charlie83Gs
Contributor Author

Closed in favor of #229 — a narrower, additive re-scoping of the same functionality against current main. Reference commit 98d7a3e preserved on feat/synthesis-error-state-regenerate.

charlie83Gs added a commit that referenced this pull request Apr 20, 2026
…d refresh, typed status, tiebreak

Addresses the second-pass review's four flagged items:

- **find_in_flight_for_graph()** (🔴): new method filtering status IN
  ('pending', 'running') — the honest variant of most_recent_for_graph
  for the Phase 7 #58 auto-dispatch "is this graph already processing
  a migration?" question. most_recent_for_graph keeps its name but
  the docstring now warns "regardless of status" and points callers
  asking about in-flight state at the new method.

- **mark_running workflow_run_id refresh** (🟡): optional
  ``workflow_run_id`` param refreshes the column when re-dispatching
  a failed hop under a new Hatchet run. Omitting it preserves the
  existing pointer (retry-without-redispatch). Audit row → live
  workflow navigation survives the re-dispatch flow.

- **ORDER BY tiebreak** (🟡): list_for_graph + most_recent_for_graph +
  find_in_flight_for_graph all add id.desc() as secondary sort.
  Batch inserts with microsecond-resolution collisions now return
  deterministic orderings.

- **MigrationRunStatus Literal type** (🟡): new type alias in
  ``kt_db.models`` narrowing status to the closed set
  ``{'pending','running','success','failed','skipped'}``. GraphMigrationRun
  column is now ``Mapped[MigrationRunStatus]`` so pyright catches
  ``"succeded"``-class typos at every write site. No DB migration
  needed — column stays VARCHAR(16); adding a new state is a
  code-only change (update Literal + worker).
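
A minimal sketch of the typed-status idea, using only `typing`. The alias name and the five values come from the commit message above; the `set_status` helper and the runtime guard are assumptions added here to show how a plain VARCHAR column could also be protected at write time.

```python
from typing import Literal, get_args

# Closed set of states, mirroring the alias described in the commit message.
MigrationRunStatus = Literal["pending", "running", "success", "failed", "skipped"]

# Runtime view of the same set, derived from the alias so the two cannot drift.
VALID_STATUSES: frozenset[str] = frozenset(get_args(MigrationRunStatus))


def set_status(row: dict[str, str], status: MigrationRunStatus) -> None:
    """Hypothetical write site: statically typed, with a runtime guard as backup."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown migration run status: {status!r}")
    row["status"] = status


row: dict[str, str] = {}
set_status(row, "success")
# set_status(row, "succeded")  # a type checker such as pyright should reject this
```

Because `VALID_STATUSES` is computed from the alias, adding a state stays a one-line change to the `Literal`.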

Tests (7 new): find_in_flight returns None / pending / running,
skips terminal states, prefers newest when multiple; mark_running
refreshes workflow_run_id when provided, preserves when omitted.
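
The behaviors listed in the tests can be reproduced with a small standalone sketch. It uses the standard-library `sqlite3` module instead of the project's ORM, and the table and column names are hypothetical; only the method name, the status filter, and the `id` descending tiebreak come from the commit message.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE graph_migration_runs (
           id INTEGER PRIMARY KEY,
           graph_id TEXT NOT NULL,
           status VARCHAR(16) NOT NULL,
           created_at TEXT NOT NULL
       )"""
)


def find_in_flight_for_graph(
    conn: sqlite3.Connection, graph_id: str
) -> Optional[tuple]:
    """Return the newest pending/running run for a graph, or None."""
    return conn.execute(
        "SELECT id, status FROM graph_migration_runs "
        "WHERE graph_id = ? AND status IN ('pending', 'running') "
        # id DESC breaks ties when created_at values collide.
        "ORDER BY created_at DESC, id DESC LIMIT 1",
        (graph_id,),
    ).fetchone()


conn.executemany(
    "INSERT INTO graph_migration_runs VALUES (?, ?, ?, ?)",
    [
        (1, "g1", "success", "2026-04-20T10:00:00.000001"),
        (2, "g1", "pending", "2026-04-20T10:00:00.000002"),
        (3, "g1", "running", "2026-04-20T10:00:00.000002"),  # same timestamp as id 2
        (4, "g1", "failed", "2026-04-20T10:00:00.000003"),   # newest, but terminal
        (5, "g2", "skipped", "2026-04-20T10:00:00.000004"),
    ],
)

in_flight_g1 = find_in_flight_for_graph(conn, "g1")
in_flight_g2 = find_in_flight_for_graph(conn, "g2")
```

Here `g1` resolves to run 3: the newer failed run is skipped as terminal, and the timestamp tie between runs 2 and 3 is settled by the higher `id`. `g2` has only a terminal run, so the lookup returns `None`.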

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
charlie83Gs added a commit that referenced this pull request Apr 20, 2026
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
charlie83Gs added a commit that referenced this pull request Apr 20, 2026
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
charlie83Gs added a commit that referenced this pull request Apr 20, 2026
Three bugs from PR review:

1. **mark_failed audit row never persisted on hop crash.** ``run_hop``
   flushes the failed row but doesn't commit — committing is the
   caller's job. The workflow's outer ``async with`` closed on the
   exception path before we committed, rolling back the flushed write,
   so the history API never surfaced failures from crashes. Fix: catch
   inside the async-with, commit, then re-raise into the outer handler
   that flips ``fail_migration`` on the graph row. New test
   ``test_failed_hop_persists_failed_audit_row`` reads the audit row
   from a fresh session to prove durability.

2. **Target-ahead-of-plugin silently stamped wrong version.** If the
   dispatcher asked for v3 but the plugin topped out at v2, the workflow
   trimmed the plan to v1→v2 yet still stamped
   ``graph_type_version=3`` at commit — sync worker would read v3 on
   v2 data. Fix: abort up-front when ``target_version >
   plugin.current_version``, leaving the graph untouched. Updated
   test ``test_target_ahead_of_plugin_aborts_before_any_hop`` asserts
   the abort path end-to-end (no hop invoked, no version bump, no
   read_only flip).

3. **Misleading "per-hop refresh" comment.** Removed. There was no
   ``ctx.refresh_timeout`` call to justify the comment. Per-hop timeout
   refresh is a future enhancement that can be wired via a callback
   if/when we see real long-running hops.
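
The commit-then-re-raise fix from item 1 can be demonstrated in isolation. This sketch swaps in the standard-library `sqlite3` module and a synchronous context manager for the real async session; `session_scope`, `workflow`, and the `audit_rows` table are hypothetical names, and only `run_hop` and the overall shape of the fix come from the description above.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def session_scope(conn: sqlite3.Connection):
    """Mimic a session context that rolls back when an exception escapes."""
    try:
        yield conn
    except Exception:
        conn.rollback()
        raise


def run_hop(conn: sqlite3.Connection) -> None:
    # Writes the failed audit row but leaves committing to the caller, then crashes.
    conn.execute("INSERT INTO audit_rows (status) VALUES ('failed')")
    raise RuntimeError("hop crashed")


def workflow(conn: sqlite3.Connection, commit_on_failure: bool) -> None:
    try:
        with session_scope(conn):
            try:
                run_hop(conn)
            except Exception:
                if commit_on_failure:
                    conn.commit()  # persist the audit row before the scope unwinds
                raise
    except RuntimeError:
        pass  # outer handler; the real workflow would mark the migration failed here


def count_failed(conn: sqlite3.Connection) -> int:
    query = "SELECT COUNT(*) FROM audit_rows WHERE status = 'failed'"
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_rows (id INTEGER PRIMARY KEY, status TEXT)")
conn.commit()

workflow(conn, commit_on_failure=False)
lost = count_failed(conn)  # row was rolled back with the scope

workflow(conn, commit_on_failure=True)
kept = count_failed(conn)  # row survives because it was committed first
```

The real test reads from a fresh session to prove durability. Counting after the rollback on the same connection is a simplification that shows the same effect.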

Reviewer's minor items left for follow-ups:
- Advisory-lock gap between begin/commit/fail is noted for #58/#59
  integration (``find_in_flight_for_graph`` mitigates at dispatch time).
- ``repr(exc)`` sanitization on the SSE stream is fine as-is for the
  internal operator view; expose a sanitized version if/when we surface
  this on a user-facing stream.
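
For item 2 in the bug list above, the up-front abort could look roughly like the sketch below. `Plugin`, `MigrationAborted`, and `plan_hops` are hypothetical names; only the `target_version > plugin.current_version` check and the requirement that no hop runs on abort come from the commit message.

```python
from dataclasses import dataclass


class MigrationAborted(Exception):
    """Raised before any hop runs when the target cannot be reached."""


@dataclass
class Plugin:
    """Hypothetical plugin descriptor exposing the highest version it supports."""
    current_version: int


def plan_hops(
    current_version: int, target_version: int, plugin: Plugin
) -> list[tuple[int, int]]:
    # Abort instead of trimming the plan, so a version that was never
    # reached is not stamped on the graph.
    if target_version > plugin.current_version:
        raise MigrationAborted(
            f"target v{target_version} is ahead of plugin v{plugin.current_version}"
        )
    return [(v, v + 1) for v in range(current_version, target_version)]


plan = plan_hops(1, 2, Plugin(current_version=2))

try:
    plan_hops(1, 3, Plugin(current_version=2))
    aborted = False
except MigrationAborted:
    aborted = True
```

Raising before the plan is built keeps the graph untouched: no hop is invoked and nothing downstream sees a version bump.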