
Expand raft restart and contract coverage #388

Merged: kacy merged 1 commit into main from improve/raft-restart-and-contract-tests on Apr 4, 2026


@kacy commented Apr 4, 2026

Summary

This PR expands contract and simulation coverage around two reliability-sensitive areas:

  • S3 contract behavior for deletes, bucket emptiness, prefix listing, and multipart upload validation
  • Raft snapshot/restart continuity, including persisted state-machine apply progress and restart-time snapshot boundaries

What Changed

  • Added S3 contract tests for:
    • deleting a missing object idempotently
    • rejecting bucket deletion when objects remain
    • prefix-filtered object listing
    • rejecting multipart upload IDs used with the wrong bucket or key
  • Fixed S3 delete behavior so deleting a missing object returns 204 (No Content) instead of an error
  • Fixed nested object deletion cleanup so empty parent directories are removed and bucket deletion works after the last nested object is deleted
  • Added raft simulation coverage for:
    • leader backtracking to repair followers that missed committed history
    • stale leader divergence repair after higher-term leadership changes
    • snapshot catch-up followed by fresh replication
    • 5-node quorum behavior with lagging minority repair
    • restarted leaders reloading snapshot metadata before repairing followers
  • Persisted state-machine last_applied in the on-disk database so replay after restart continues from the correct index
  • Ensured snapshot restore persists recovered apply progress
  • Recovered raft snapshot boundaries into restart-time commit_index and last_applied
  • Added node-level restart regressions proving recovered nodes continue applying new entries and ignore stale snapshots older than the recovered boundary
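The S3 delete fixes listed above can be illustrated with a minimal sketch. This is not the project's code: it assumes objects are stored as files under a per-bucket directory, and `delete_object`, `delete_bucket`, and the 409 status for a non-empty bucket are hypothetical choices for illustration (409 matches what Amazon S3 returns for `BucketNotEmpty`).

```python
import tempfile
from pathlib import Path

NO_CONTENT = 204
CONFLICT = 409  # assumption: status used when the bucket is not empty


def delete_object(bucket_dir, key):
    """Delete an object; a missing key is still a success (idempotent)."""
    target = bucket_dir / key
    if target.is_file():
        target.unlink()
        # Prune now-empty parent directories, stopping at the bucket root.
        parent = target.parent
        while parent != bucket_dir and not any(parent.iterdir()):
            parent.rmdir()
            parent = parent.parent
    return NO_CONTENT


def delete_bucket(bucket_dir):
    """Refuse to delete a bucket that still contains anything."""
    if any(bucket_dir.iterdir()):
        return CONFLICT
    bucket_dir.rmdir()
    return NO_CONTENT


def demo():
    with tempfile.TemporaryDirectory() as root:
        bucket = Path(root) / "bucket"
        (bucket / "a" / "b").mkdir(parents=True)
        (bucket / "a" / "b" / "obj.txt").write_text("data")
        return [
            delete_bucket(bucket),                 # objects remain
            delete_object(bucket, "a/b/obj.txt"),  # removes object and empty parents
            delete_object(bucket, "a/b/obj.txt"),  # already gone, still 204
            delete_bucket(bucket),                 # bucket is now truly empty
        ]
```

Without the parent pruning step, the last `delete_bucket` call would see the leftover `a/` directory and report the bucket as non-empty, which is the failure mode the nested-deletion fix addresses. The sketch does not validate keys (for example against `..` segments); a real server would need to.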

Why

Two restart-path gaps showed up while expanding the tests:

  • the state machine could restart with durable snapshot/database state but without last_applied, which could stall or mis-sequence later applies
  • raft could restart with snapshot metadata on disk but reset commit_index to 0, which left a restarted follower willing to roll its state back via an older snapshot
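As a rough illustration of the first gap, the sketch below persists `last_applied` in the same transaction as the applied entry and reloads it on startup. It assumes a SQLite-style store purely for illustration; the PR only says "on-disk database", and the `StateMachine` class, table names, and `apply` signature are hypothetical.

```python
import sqlite3


class StateMachine:
    """Hypothetical key-value state machine with durable apply progress."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS meta (name TEXT PRIMARY KEY, value INTEGER)"
        )
        conn.commit()
        row = conn.execute(
            "SELECT value FROM meta WHERE name = 'last_applied'"
        ).fetchone()
        # Resume from durable progress instead of starting again at 0.
        self.last_applied = row[0] if row else 0

    def apply(self, index, key, value):
        """Apply one log entry; return False if it was already applied."""
        if index <= self.last_applied:
            return False
        with self.conn:  # one transaction: data and progress commit together
            self.conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
            self.conn.execute(
                "INSERT OR REPLACE INTO meta VALUES ('last_applied', ?)", (index,)
            )
        self.last_applied = index
        return True


def demo():
    conn = sqlite3.connect(":memory:")  # stands in for the on-disk database
    sm = StateMachine(conn)
    sm.apply(1, "a", "x")
    sm.apply(2, "b", "y")

    restarted = StateMachine(conn)  # simulate a restart over the same storage
    replayed = restarted.apply(2, "b", "stale")  # replay of an old entry is skipped
    fresh = restarted.apply(3, "c", "z")         # new entries continue from index 3
    return restarted.last_applied, replayed, fresh
```

Writing the entry and the progress marker in one transaction means a crash cannot leave the data ahead of `last_applied` or the other way around.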

The added tests reproduce both cases, and the fixes make restart behavior line up with the persisted snapshot boundary.
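The second gap can be sketched in the same spirit. The example assumes snapshot metadata exposes a `last_included_index`, as in the usual Raft formulation; `SnapshotMeta`, `RaftNode`, and `install_snapshot` are hypothetical names, not the project's API. Treating a snapshot at exactly the recovered boundary as redundant is also an assumption here.

```python
from dataclasses import dataclass


@dataclass
class SnapshotMeta:
    last_included_index: int
    last_included_term: int


class RaftNode:
    """Hypothetical node showing only the restart-related fields."""

    def __init__(self, persisted_snapshot=None):
        self.snapshot = persisted_snapshot
        # Everything up to the snapshot boundary is known committed and applied,
        # so restart from that boundary rather than from 0.
        boundary = persisted_snapshot.last_included_index if persisted_snapshot else 0
        self.commit_index = boundary
        self.last_applied = boundary

    def install_snapshot(self, incoming):
        """Return True if installed, False if ignored as stale."""
        if incoming.last_included_index <= self.commit_index:
            return False  # accepting it would roll state back
        self.snapshot = incoming
        self.commit_index = incoming.last_included_index
        self.last_applied = incoming.last_included_index
        return True


def demo():
    node = RaftNode(SnapshotMeta(last_included_index=10, last_included_term=3))
    stale = node.install_snapshot(SnapshotMeta(7, 2))   # older than the boundary
    newer = node.install_snapshot(SnapshotMeta(15, 4))  # advances the boundary
    return stale, newer, node.commit_index, node.last_applied
```

If `commit_index` were reset to 0 on restart, the stale-snapshot check above would pass for the index-7 snapshot and the node would move backwards, which is the behavior the restart regressions guard against.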

Validation

  • YOQ_SKIP_SLOW_TESTS=1 zig build test-contract -Doptimize=ReleaseSafe -Dtest-filter='contract: s3'
  • YOQ_SKIP_SLOW_TESTS=1 zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='sim:'
  • YOQ_SKIP_SLOW_TESTS=1 zig build test -Doptimize=ReleaseSafe -Dtest-filter='restart preserves'
  • YOQ_SKIP_SLOW_TESTS=1 zig build test -Doptimize=ReleaseSafe -Dtest-filter='init reloads snapshot boundary into commit_index and last_applied'
  • YOQ_SKIP_SLOW_TESTS=1 zig build test -Doptimize=ReleaseSafe -Dtest-filter='install_snapshot restart ignores stale snapshot older than recovered boundary'
  • YOQ_SKIP_SLOW_TESTS=1 zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='restarted leader reloads snapshot metadata and catches up lagging follower'

@kacy marked this pull request as ready for review April 4, 2026 17:59
@kacy merged commit 3011422 into main Apr 4, 2026
6 of 7 checks passed
@kacy deleted the improve/raft-restart-and-contract-tests branch April 4, 2026 17:59