
Add transport failure simulation coverage #389

Merged
kacy merged 3 commits into main from improve/transport-failure-sims
Apr 5, 2026
kacy (Owner) commented Apr 5, 2026

Summary

This PR expands single-machine raft failure simulation coverage around transport-style faults and hardens one leader-side reply handling edge case that the new tests exposed.

What Changed

  • Added raft unit coverage for:
    • stale failed append_entries_reply not regressing next_index
    • legitimate failed append_entries_reply backtracking only above the matched prefix
    • higher-term install_snapshot_reply forcing leader stepdown
  • Hardened leader replication reply handling so a delayed failed append_entries_reply cannot backtrack below the follower's known matched prefix
  • Extended the cluster simulation harness to model:
    • dropped requests
    • dropped replies
    • duplicate delivery
  • Added simulation coverage for:
    • dropped append_entries_reply recovering on heartbeat retry without duplicating log entries
    • duplicate append_entries delivery being idempotent
    • dropped install_snapshot_reply recovering on retry without regressing follower state
    • duplicate install_snapshot delivery being idempotent
    • dropped request_vote_reply messages recovering on the next election round
    • 5-node quorum commit under mixed dropped replies, duplicate delivery, and lagging follower repair
    • snapshot retry after follower restart resuming replication cleanly
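The reply-handling hardening described above can be illustrated with a minimal sketch. This is not the project's code: it is written in Python for brevity, the names `FollowerProgress`, `on_append_entries_success`, and `on_append_entries_failure` are hypothetical, and it assumes the leader can tell which `prev_log_index` a rejected request probed. The idea is that a failed reply probing at or below the follower's known matched prefix must be stale, and a legitimate failure may only move `next_index` down to `match_index + 1`.

```python
from dataclasses import dataclass


@dataclass
class FollowerProgress:
    """Leader-side replication state for one follower (hypothetical names)."""
    next_index: int   # next log index the leader will send
    match_index: int  # highest index known to be replicated on the follower


def on_append_entries_success(p: FollowerProgress, last_sent_index: int) -> None:
    # Success replies only ever move the matched prefix forward.
    p.match_index = max(p.match_index, last_sent_index)
    p.next_index = max(p.next_index, p.match_index + 1)


def on_append_entries_failure(p: FollowerProgress, prev_log_index: int) -> None:
    # Assumption: prev_log_index is the probe point of the rejected request.
    if prev_log_index <= p.match_index:
        # The follower already acknowledged this prefix, so the rejection
        # was delayed in transit; ignore it instead of regressing.
        return
    # Legitimate rejection: back up to the rejected probe point, but never
    # below the first index after the matched prefix.
    p.next_index = max(p.match_index + 1, min(p.next_index, prev_log_index))
```

With `match_index = 7` and `next_index = 8`, a delayed rejection for a probe at index 4 leaves both fields unchanged, which is the behavior the new unit tests are described as checking.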

Why

The earlier simulation work covered dropped delivery and restart continuity, but it still left important transport-fault classes under-tested:

  • delayed negative replies arriving after newer successful replication
  • duplicated RPC delivery
  • lost vote replies and lost snapshot replies
  • mixed-fault conditions where quorum should still commit while laggards recover later
  • snapshot retry behavior across follower restarts

These tests improve confidence in raft behavior on a single machine without needing real distributed fault injection.
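As a rough illustration of how such faults can be modeled in-process, the sketch below routes one `append_entries` call through a link that can drop the request, drop the reply, or deliver the request twice. It is a simplified Python sketch under stated assumptions, not the harness in this PR: `Fault`, `Follower`, and `deliver` are hypothetical names, log entries are plain values without terms, and the consistency check is reduced to a length check.

```python
from dataclasses import dataclass, field
from enum import Enum


class Fault(Enum):
    NONE = "none"
    DROP_REQUEST = "drop_request"
    DROP_REPLY = "drop_reply"
    DUPLICATE = "duplicate"


@dataclass
class Follower:
    log: list = field(default_factory=list)

    def append_entries(self, prev_index, entries):
        """Idempotent append; returns the last matched index, or None on rejection."""
        if prev_index > len(self.log):
            return None  # gap before prev_index: reject
        for offset, entry in enumerate(entries):
            slot = prev_index + offset  # 0-based position of this entry
            if slot < len(self.log):
                if self.log[slot] != entry:
                    del self.log[slot:]  # drop the conflicting suffix
                    self.log.append(entry)
            else:
                self.log.append(entry)
        return prev_index + len(entries)


def deliver(follower, prev_index, entries, fault):
    """Send one RPC through a faulty link; None means the leader saw no reply."""
    if fault is Fault.DROP_REQUEST:
        return None
    reply = follower.append_entries(prev_index, entries)
    if fault is Fault.DUPLICATE:
        reply = follower.append_entries(prev_index, entries)
    if fault is Fault.DROP_REPLY:
        return None
    return reply


follower = Follower()
deliver(follower, 0, ["a", "b"], Fault.DROP_REPLY)  # applied, but the ack is lost
deliver(follower, 0, ["a", "b"], Fault.DUPLICATE)   # retry arrives twice
assert follower.log == ["a", "b"]                   # no duplicated entries
```

Because the follower writes entries by position rather than blindly appending, a retry after a lost reply and a duplicated delivery both leave the log unchanged, which is the idempotence property the simulation tests above target.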

Validation

All test runs were executed serially with YOQ_SKIP_SLOW_TESTS=1.

  • zig build test -Doptimize=ReleaseSafe -Dtest-filter='append_entries_reply'
  • zig build test -Doptimize=ReleaseSafe -Dtest-filter='leader steps down on higher term in install_snapshot_reply'
  • zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='dropped append_entries replies recover on heartbeat retry without duplicating log entries'
  • zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='duplicate append_entries delivery is idempotent'
  • zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='dropped install_snapshot reply recovers on retry without regressing follower state'
  • zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='duplicate install_snapshot delivery is idempotent'
  • zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='dropped request_vote replies recover on the next election round'
  • zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='5-node mixed transport faults still commit with quorum and repair laggards later'
  • zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='snapshot retry after follower restart resumes replication cleanly'

kacy marked this pull request as ready for review April 5, 2026 04:26
kacy merged commit 81bb414 into main Apr 5, 2026
6 of 7 checks passed
kacy deleted the improve/transport-failure-sims branch April 5, 2026 04:26