fix: improve failover resilience and observability #49

Merged
psteinroe merged 1 commit into main from
fix/incident-failover-improvements
Apr 5, 2026

Conversation

Owner

@psteinroe psteinroe commented Apr 5, 2026

  • Graceful mid-replay error handling: Sink errors during failover replay now abort the replay cleanly instead of propagating, keeping the stream in failover for retry on the next batch
  • Skip table sync copies: Changed TableSyncCopyConfig to SkipAllTables since write_table_rows is a no-op — avoids unnecessary work during initial sync
  • Improved failure-path logging: Upgraded logging from info to warn with structured fields (checkpoint_event_id, error) for all failover/failure paths, and added a startup warning when the stream begins in failover mode

Port incident fixes from getmateo fork: gracefully handle sink errors
during failover replay, skip unnecessary table sync copies, and upgrade
failure-path logging from info to warn with structured fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@psteinroe psteinroe merged commit 40a0115 into main Apr 5, 2026
6 checks passed
psteinroe added a commit that referenced this pull request Apr 5, 2026
- Reverts the error-swallowing behavior introduced in #49 for sink
failures during failover replay
- Returning `Ok(())` on sink error would let the stream complete
recovery and mark itself `Healthy`, even though not all events were
replayed — causing data loss with flaky destinations
- Propagating the error with `?` lets the destination's retry logic
handle transient failures
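
The difference between the two behaviors can be illustrated with a minimal, self-contained sketch. This is not the code from this repository: `StreamState`, `SinkError`, `FlakySink`, `replay_swallowing`, `replay_propagating`, and `recover` are all hypothetical names chosen for illustration, and the sketch assumes the caller marks the stream `Healthy` whenever the replay returns `Ok(())`.

```rust
// Hypothetical types for illustration only; none of these names come from the PR.
#[derive(Debug, PartialEq, Clone, Copy)]
enum StreamState {
    Failover,
    Healthy,
}

#[derive(Debug, PartialEq)]
struct SinkError(String);

/// A sink that rejects one specific event, standing in for a flaky destination.
struct FlakySink {
    fail_on: u64,
    delivered: Vec<u64>,
}

impl FlakySink {
    fn publish(&mut self, event_id: u64) -> Result<(), SinkError> {
        if event_id == self.fail_on {
            return Err(SinkError(format!("sink rejected event {event_id}")));
        }
        self.delivered.push(event_id);
        Ok(())
    }
}

/// Swallowing variant: the replay stops at the first sink error
/// but still reports success to the caller.
fn replay_swallowing(sink: &mut FlakySink, events: &[u64]) -> Result<(), SinkError> {
    for &id in events {
        if sink.publish(id).is_err() {
            return Ok(()); // the error is hidden from the caller
        }
    }
    Ok(())
}

/// Propagating variant: `?` returns the first sink error to the caller.
fn replay_propagating(sink: &mut FlakySink, events: &[u64]) -> Result<(), SinkError> {
    for &id in events {
        sink.publish(id)?;
    }
    Ok(())
}

/// Assumed caller logic: leave failover only when the replay reports success.
fn recover(result: Result<(), SinkError>) -> StreamState {
    match result {
        Ok(()) => StreamState::Healthy,
        Err(e) => {
            eprintln!("replay failed: {}", e.0);
            StreamState::Failover
        }
    }
}

fn main() {
    let events = [1, 2, 3, 4];

    let mut sink = FlakySink { fail_on: 3, delivered: Vec::new() };
    let state = recover(replay_swallowing(&mut sink, &events));
    println!("swallowing: state={state:?}, delivered={:?}", sink.delivered);

    let mut sink = FlakySink { fail_on: 3, delivered: Vec::new() };
    let state = recover(replay_propagating(&mut sink, &events));
    println!("propagating: state={state:?}, delivered={:?}", sink.delivered);
}
```

In this sketch both variants deliver only events 1 and 2, but the swallowing variant ends up `Healthy` with events 3 and 4 never replayed, while the propagating variant stays in `Failover` so a later retry can pick them up.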

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>