Expand raft restart and contract coverage by kacy · Pull Request #388 · kacy/yoq

kacy · 2026-04-04T17:47:49Z

Summary

This PR expands contract and simulation coverage around two reliability-sensitive areas:

- S3 contract behavior for deletes, bucket emptiness, prefix listing, and multipart upload validation
- Raft snapshot/restart continuity, including persisted state-machine apply progress and restart-time snapshot boundaries

- Added S3 contract tests for:
  - deleting a missing object idempotently
  - rejecting bucket deletion when objects remain
  - prefix-filtered object listing
  - rejecting multipart upload IDs used with the wrong bucket or key
- Fixed S3 delete behavior so deleting a missing object returns 204
- Fixed nested object deletion cleanup so empty parent directories are removed and bucket deletion works after the last nested object is deleted
- Added raft simulation coverage for:
  - leader backtracking to repair followers that missed committed history
  - stale leader divergence repair after higher-term leadership changes
  - snapshot catch-up followed by fresh replication
  - 5-node quorum behavior with lagging minority repair
  - restarted leaders reloading snapshot metadata before repairing followers
- Persisted state-machine last_applied in the on-disk database so replay after restart continues from the correct index
- Ensured snapshot restore persists recovered apply progress
- Recovered raft snapshot boundaries into restart-time commit_index and last_applied
- Added node-level restart regressions proving recovered nodes continue applying new entries and ignore stale snapshots older than the recovered boundary
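
The S3 delete fixes listed above can be illustrated with a small sketch. This is not the yoq implementation (which is written in Zig); it is a minimal Python model that assumes objects are stored as files under a per-bucket directory. The names `delete_object` and `delete_bucket` are hypothetical. It shows the two behaviors the contract tests pin down: deleting a missing object still returns 204, and removing the last nested object prunes its empty parent directories so the bucket can then be deleted.

```python
import os
import tempfile


def delete_object(bucket_root: str, key: str) -> int:
    """Delete an object and return an HTTP-style status (hypothetical helper)."""
    path = os.path.join(bucket_root, key)
    try:
        os.remove(path)
    except (FileNotFoundError, NotADirectoryError):
        return 204  # deleting a missing object is idempotent
    # Prune now-empty parent directories, stopping at the bucket root.
    root = os.path.abspath(bucket_root)
    parent = os.path.dirname(os.path.abspath(path))
    while parent != root:
        try:
            os.rmdir(parent)  # fails if the directory still has entries
        except OSError:
            break
        parent = os.path.dirname(parent)
    return 204


def delete_bucket(bucket_root: str) -> int:
    """Delete a bucket only if it is empty (hypothetical helper)."""
    try:
        os.rmdir(bucket_root)
    except FileNotFoundError:
        return 404  # no such bucket
    except OSError:
        return 409  # bucket not empty
    return 204


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bucket = os.path.join(tmp, "bucket")
        os.makedirs(os.path.join(bucket, "a", "b"))
        with open(os.path.join(bucket, "a", "b", "c.txt"), "w") as f:
            f.write("data")
        print(delete_bucket(bucket))              # rejected while an object remains
        print(delete_object(bucket, "a/b/c.txt"))
        print(delete_object(bucket, "a/b/c.txt")) # second delete of the same key
        print(delete_bucket(bucket))              # succeeds once parents are pruned
```

Without the pruning loop, the leftover `a/b` directories would keep the bucket directory non-empty, which is the failure mode the nested-deletion fix describes.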

Two restart-path gaps showed up while expanding the tests:

- the state machine could restart with a durable snapshot/database state but lose last_applied, which could stall or mis-sequence future apply
- raft could restart with snapshot metadata on disk but reset commit_index to 0, which left a restarted follower willing to roll its state back via an older snapshot

The added tests reproduce those cases and the fixes make restart behavior line up with the persisted snapshot boundary.
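
The two gaps can be modeled in a few lines. The sketch below is an illustration only, not yoq's code: `DurableStore`, `SnapshotMeta`, `Node`, `apply_committed`, and `install_snapshot` are hypothetical names, and a Python object stands in for the on-disk database. It assumes the node recovers `commit_index` and `last_applied` from the persisted snapshot boundary, and that the state machine persists its own apply progress alongside its data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SnapshotMeta:
    last_included_index: int
    last_included_term: int


@dataclass
class DurableStore:
    """Stand-in for state that survives a restart (hypothetical)."""
    snapshot: Optional[SnapshotMeta] = None
    sm_last_applied: int = 0  # state-machine apply progress, persisted
    applied: List[str] = field(default_factory=list)


class Node:
    def __init__(self, store: DurableStore) -> None:
        self.store = store
        boundary = store.snapshot.last_included_index if store.snapshot else 0
        # Recover the snapshot boundary instead of restarting from 0.
        self.commit_index = boundary
        self.last_applied = boundary

    def apply_committed(self, index: int, command: str) -> bool:
        """Apply a committed entry unless the state machine already has it."""
        if index <= self.store.sm_last_applied:
            return False  # already applied before the restart; skip replay
        self.store.applied.append(command)
        self.store.sm_last_applied = index  # persist progress with the data
        self.last_applied = index
        self.commit_index = max(self.commit_index, index)
        return True

    def install_snapshot(self, meta: SnapshotMeta) -> bool:
        """Accept a snapshot only if it is newer than the recovered boundary."""
        if meta.last_included_index <= self.commit_index:
            return False  # stale snapshot: would roll state backwards
        # Restoring the snapshot's data is omitted; only progress is tracked.
        self.store.snapshot = meta
        self.store.sm_last_applied = meta.last_included_index
        self.commit_index = meta.last_included_index
        self.last_applied = meta.last_included_index
        return True
```

If `commit_index` were reset to 0 on restart, the staleness check in `install_snapshot` would pass for any snapshot, which is the rollback risk described above. Likewise, dropping `sm_last_applied` would let `apply_committed` re-apply entries the database already contains.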

YOQ_SKIP_SLOW_TESTS=1 zig build test-contract -Doptimize=ReleaseSafe -Dtest-filter='contract: s3'
YOQ_SKIP_SLOW_TESTS=1 zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='sim:'
YOQ_SKIP_SLOW_TESTS=1 zig build test -Doptimize=ReleaseSafe -Dtest-filter='restart preserves'
YOQ_SKIP_SLOW_TESTS=1 zig build test -Doptimize=ReleaseSafe -Dtest-filter='init reloads snapshot boundary into commit_index and last_applied'
YOQ_SKIP_SLOW_TESTS=1 zig build test -Doptimize=ReleaseSafe -Dtest-filter='install_snapshot restart ignores stale snapshot older than recovered boundary'
YOQ_SKIP_SLOW_TESTS=1 zig build test-sim -Doptimize=ReleaseSafe -Dtest-filter='restarted leader reloads snapshot metadata and catches up lagging follower'

Expand raft restart and contract coverage

8f67602

kacy marked this pull request as ready for review April 4, 2026 17:59

kacy merged commit 3011422 into main Apr 4, 2026
6 of 7 checks passed

kacy deleted the improve/raft-restart-and-contract-tests branch April 4, 2026 17:59