test(e2e): wire mixed-version-soak Job + add tsoracle admin activate-format #503

Merged: SebastianThiebaud merged 18 commits into main from test/kube-e2e-mixed-version-soak on May 26, 2026
@SebastianThiebaud
Contributor

Closes #484.

Summary

Wires e2e/kube/driver/job-mixed-version-soak.yaml (added in #470) into a CI lane that proves the openraft format-activation control plane works under sustained client load and a partial-then-full image rollout. The issue's "Suggested orchestration" requires an externally callable activation surface, which didn't exist: StandaloneHost::initiate_format_activation was a Rust method with no gRPC or CLI surface. This branch therefore also lands that surface end-to-end.

Three independent assertions per CI run:

  1. Activation mid-partial-rollout is safely rejected — exit code 2 (MEMBERS_BELOW_TARGET, gate fired) OR exit code 4 (TARGET_OUT_OF_RANGE, local-range check fired on a baseline-binary leader). Both prove the activation control plane refused an unsafe state. Local kind verification confirmed the latter shape fires when ordinal 0 leads the partial-rollout cluster.
  2. Activation post-full-rollout succeeds — apply-keyed flip lands; active write version flips 4 → 5.
  3. Soak Job: zero monotonicity violations + < 0.5% error rate across the whole sequence.
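For illustration, the pass criterion in assertion 3 can be written as an integer-only shell check. This is a minimal sketch, not the soak Job's actual tracker logic; the name soak_passes and the decision to treat a run with zero calls as a failure are assumptions.

```shell
# Hypothetical helper: soak_passes CALLS ERRORS VIOLATIONS
# Succeeds only when there are zero monotonicity violations and the error
# rate is strictly below 0.5%. The comparison errors/calls < 5/1000 is
# rewritten as errors*1000 < calls*5 so it stays in integer arithmetic.
soak_passes() {
  calls=$1
  errors=$2
  violations=$3
  [ "$calls" -gt 0 ] || return 1        # assumption: an empty run is not a pass
  [ "$violations" -eq 0 ] || return 1
  [ $((errors * 1000)) -lt $((calls * 5)) ]
}
```

With the figures from the local kind run reported below (calls=52947 errors=0 monotonicity_violations=0), `soak_passes 52947 0 0` succeeds. A run at exactly 0.5% errors fails because the bound is strict.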

What's in the diff (20 files, +956/−67, 18 commits)

Library surface (crates/tsoracle-standalone/)

  • New ActivateFormat RPC on the MembershipAdmin proto service + three new AdminErrorKind variants (MEMBERS_BELOW_TARGET, TARGET_OUT_OF_RANGE, MEMBERSHIP_CHANGED)
  • MembershipAdmin::activate_format trait method + OpenraftMembershipAdmin impl (delegates to StandaloneHost::initiate_format_activation); UnsupportedAdmin returns Unsupported for paxos/file
  • build_openraft now wraps StandaloneHost in Arc and constructs PeerCapabilitySource (first production caller — previously #[allow(dead_code)])
  • gRPC handler with u8::try_from overflow validation
  • Six handler-mapping unit tests + six map_activation_error unit tests

CLI (crates/tsoracle-bin/)

  • New tsoracle admin activate-format --endpoint <addr> --target <N> subcommand
  • Exit-code contract with four script-meaningful codes plus a catch-all: 0 success, 2 MEMBERS_BELOW_TARGET, 3 NOT_LEADER, 4 TARGET_OUT_OF_RANGE, 1 other. Lets the shell do ordinal-walk leader discovery and accept either rejection shape without parsing stderr
  • Pure classify_activation + impure report_activation split for testability; six unit tests
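On the consuming side, a script can branch on the exit status alone. A minimal sketch of that mapping follows. The function names are hypothetical, and treating every status outside 0/2/3/4 (not only 1) as "other" is an assumption.

```shell
# Hypothetical helpers; only the exit-code values come from the contract above.

# activation_outcome EXIT_CODE: print a label for the CLI's exit status.
activation_outcome() {
  case "$1" in
    0) echo success ;;
    2) echo members_below_target ;;
    3) echo not_leader ;;
    4) echo target_out_of_range ;;
    *) echo other ;;                  # assumption: 1 and anything unexpected
  esac
}

# is_safe_rejection EXIT_CODE: succeed for either rejection shape (2 or 4).
is_safe_rejection() {
  case "$1" in
    2|4) return 0 ;;
    *)   return 1 ;;
  esac
}
```

A caller would run `tsoracle admin activate-format --endpoint <addr> --target <N>`, capture `$?`, and pass it to these helpers. `is_safe_rejection` is the shape of check the partial-rollout assertion needs.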

Cargo features

  • e2e-max-readable-next pass-through tsoracle-bin → tsoracle-standalone → tsoracle-driver-openraft → tsoracle-openraft-toolkit, gated via Cargo's weak-dependency ? syntax (dep?/feature) so enabling the feature alone can't silently activate the optional driver dep

Chart admin-port wiring (deploy/)

  • entrypoint.sh now passes --admin-listen 127.0.0.1:51002 for openraft (loopback-only by design — routable bind would trip the secure-by-default AdminInsecureRoutable guard). Closes the long-standing TODO(admin-listen) comment
  • statefulset.yaml: new ADMIN_PORT env + admin containerPort
  • values.yaml: new ports.admin: 51002

E2E shell + workflow (e2e/kube/, .github/workflows/)

  • New _assertions_lib.sh extracted from run-assertions.sh; wait_soak_live now takes an explicit sentinel arg; new wait_pod_image_and_ready helper; wait_soak_live short-circuits on Job failure
  • New run-mixed-version-assertions.sh orchestrator with ordinal-walk leader discovery
  • New kube-e2e-mixed-version.yml workflow (workflow_dispatch only) — builds two cluster images per run (:e2e-baseline openraft-only, :e2e-next openraft+e2e-max-readable-next), loads into kind, runs the assertions script. 60-min timeout absorbs both builds + the rollout sequence
  • Drive-by fix: kube-e2e.yml now passes --set tls.allowInsecurePeer=true to satisfy the secure-by-default chart guard from #452 (fix(deploy): require peer mTLS or explicit opt-in for HA Helm deployments); the existing lane has likely been silently broken on first re-run since #452
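The ordinal-walk leader discovery mentioned above can be sketched as follows. This is an illustration built only on the exit-code contract (3 means NOT_LEADER), not the orchestrator's actual code. walk_ordinals and the probe callback are hypothetical names, and returning 3 when no ordinal answers as leader is an assumption.

```shell
# Hypothetical helper: walk_ordinals REPLICAS PROBE
# Calls "PROBE <ordinal>" for ordinals 0..REPLICAS-1. Exit 3 (NOT_LEADER)
# means "try the next pod"; any other status is taken as the leader's answer
# and returned unchanged. Returns 3 if every ordinal reported NOT_LEADER.
#
# A real PROBE would wrap the admin CLI call for the pod at that ordinal,
# e.g. a kubectl exec running `tsoracle admin activate-format ...`.
walk_ordinals() {
  replicas=$1
  probe=$2
  i=0
  while [ "$i" -lt "$replicas" ]; do
    walk_rc=0
    "$probe" "$i" || walk_rc=$?
    [ "$walk_rc" -eq 3 ] || return "$walk_rc"
    i=$((i + 1))
  done
  return 3
}
```

Because the probe's status is passed through untouched, the caller can still distinguish success (0) from either rejection shape (2 or 4) after the walk finds the leader.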

Pre-merge verification

  • Full workspace cargo test --workspace --features openraft,test-support: 120 test binaries, 0 failures
  • cargo clippy --workspace --all-features -- -D warnings: clean
  • actionlint .github/workflows/kube-e2e-mixed-version.yml: clean
  • helm template tsoracle deploy/charts/tsoracle --set driver=openraft,replicas=3,tls.allowInsecurePeer=true: renders cleanly with ADMIN_PORT=51002 env + admin containerPort
  • Local kind end-to-end (./e2e/kube/run-mixed-version-assertions.sh against a real 3-node cluster): PASS. Soak Job tracker: calls=52947 errors=0 error_rate=0.0000% monotonicity_violations=0 → PASS. The local run also surfaced two latent assertion-harness defects that would have failed CI on first run (kubelet's canonical image-name rewriting + macOS-only gtimeout fallback); both are fixed in the last two commits.

Test plan

  • CI: workflow_dispatch on kube-e2e-mixed-version.yml against this branch → expect the same PASS shape as local kind (~16 min wall-clock)
  • CI: workflow_dispatch on the existing kube-e2e.yml → confirm the tls.allowInsecurePeer=true drive-by fix unbroke it
  • Spec + plan included under docs/superpowers/{specs,plans}/2026-05-26-mixed-version-soak-wiring*.md for posterity (NOT load-bearing on the implementation)

Out of scope (sibling open issues)

Adds ActivateFormat to the MembershipAdmin proto service + Rust
trait + openraft impl (delegating to StandaloneHost::initiate_format_activation),
with gRPC handler error mapping to dedicated AdminErrorKind variants
(MEMBERS_BELOW_TARGET, TARGET_OUT_OF_RANGE, MEMBERSHIP_CHANGED).
First production caller of PeerCapabilitySource (was dead_code in
network.rs). UnsupportedAdmin (paxos/file) returns Unsupported.

Mirrors the existing map_write_error test pattern. Six tests, one per
FormatActivationError variant, asserting the AdminError mapping.
Also clarifies the doc comment to remove the false implication that
the CLI's redirect path is bypassed for activation; only the
leader-endpoint hint is absent (NotLeader is a unit variant).

Six tests covering every AdminError -> AdminErrorKind transition the
spec defines, plus the u8-overflow Status::invalid_argument case.
test-support feature gates a pub admin_service() constructor so
production builds don't compile the test seam.

Comment incorrectly claimed the test would 'fail later by reading the
wrong kind back' if the handler called through. It actually fails
immediately at expect_err() because the handler returns Ok(Response).
Clarify that the Unsupported in the slot is just a safe fallback for
the take() guard.

AdminCmd::ActivateFormat variant + dispatch with four-way exit
contract: 0 success, 2 MEMBERS_BELOW_TARGET, 3 NOT_LEADER,
4 TARGET_OUT_OF_RANGE, 1 other. Lets the kube-e2e shell do
ordinal-walk leader discovery and accept either rejection shape
(2 or 4) for the partial-rollout safety assertion without parsing
stderr.

Original comment claimed the CLI 'does NOT follow leader-redirect for
activation' but the dispatch arm uses with_redirect like every other
admin subcommand. The behavior is correct (activation NotLeader
carries an empty leader_admin_endpoint so with_redirect short-circuits
and exits 3), but the prior comment misdescribed the mechanism. The
remaining IMPORTANT review finding (report_activation calling
std::process::exit directly) is addressed by Task 5's planned
refactor into a pure classify_activation function + impure caller.

Lifted classify_activation/ActivationOutcome out of dispatch_admin's
inner scope so the test module reaches them. ChangeResponse +
AdminErrorKind imports lifted to module scope under the openraft
feature gate. Six tests covering every script-meaningful outcome
(Success / GateRejected / NotLeader / TargetOutOfRange / Other (Driver)
/ Other (MembershipChanged)).

…anged test

Fixes two follow-up code-review findings on d9141d7:
- classify_activation called AdminErrorKind::try_from twice on resp.error
  (the kind_string computed unconditionally but only used in the Other
  arm). Collapse into one match with split Ok(k)/Err(_) arms.
- membership_changed_classifies_as_other dropped the kind/message fields
  with .., making the test discriminate only at variant level. Now
  asserts kind == "MembershipChanged" + message == "drift", matching
  driver_error_classifies_as_other's pattern.

Enables `cargo build -p tsoracle --features openraft,e2e-max-readable-next`
and therefore the `--build-arg FEATURES=...` path in deploy/Dockerfile.
Required for the mixed-version kube-e2e lane to build its `:next` image.

Bare `tsoracle-driver-openraft/e2e-max-readable-next` forcibly activated
the optional dep under `--features e2e-max-readable-next` alone (cargo
2024 namespaced-features rule), silently pulling in half the openraft
stack despite no `openraft` request. `?` makes the pass-through a no-op
unless openraft (or another feature) already pulled the dep in. Doc
comment corrected to match the actual mechanism.

Closes the long-standing TODO(admin-listen) in entrypoint.sh. Loopback
bind keeps the secure-by-default AdminInsecureRoutable guard happy;
routable admin (with TLS) is a future operator-side opt-in by editing
the entrypoint.

Both lanes (cold-start+soak, mixed-version-soak) share wait_job /
wait_soak_live / wait_pod_image_and_ready. wait_soak_live now takes
an explicit sentinel so a future rename in one mode doesn't silently
match the other's log via substring.

Drives partial-then-full rollout with two activate-format calls;
ordinal-walk leader discovery via the CLI's exit codes (0/2/3/4/1).
Mid-rollout assertion accepts either MEMBERS_BELOW_TARGET (exit 2)
or TARGET_OUT_OF_RANGE (exit 4); both prove the activation control
plane refused an unsafe state. Pass = both activation assertions +
soak Job's monotonicity invariant.

Builds two cluster images per run (:e2e-baseline lean openraft,
:e2e-next openraft+e2e-max-readable-next), loads into kind, runs
the mixed-version assertion script. 60min timeout absorbs two
build steps + the partial-rollout sequence.

Required by the secure-by-default guard added in #452
(_helpers.tpl:12). The existing workflow has likely been silently
broken on first re-run since #452; fix in passing alongside the
mixed-version lane (kube-e2e-mixed-version.yml) which already passes
this flag.

(1) Cargo.toml: add [[test]] required-features for activation_admin_rpc
so `cargo test -p tsoracle-standalone --features openraft` (without
test-support) silently skips the test binary instead of compiling it
to zero test items. Without this entry, a developer running the
default-feature test set saw the binary build but produce 0 tests
with no diagnostic.

(2) _assertions_lib.sh: wait_soak_live now short-circuits on Job
failure, mirroring wait_job's pattern. An immediately-failing soak
Job (image pull, driver panic, cluster unreachable) previously sat
for the full TIMEOUT_S (60s in the mixed-version lane) before
surfacing as "never became live" — now fails immediately with the
typed FAILED message + log dump.
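The short-circuit in (2) amounts to checking for failure before checking for liveness on every poll. A minimal sketch of that loop shape follows, with hypothetical callbacks standing in for the real kubectl queries. The function name, return codes, messages, and one-second poll interval are illustrative and not taken from _assertions_lib.sh.

```shell
# Hypothetical helper: wait_live_or_fail TIMEOUT_S IS_FAILED IS_LIVE
# IS_FAILED and IS_LIVE are command names that succeed when the condition holds.
# Returns 0 once live, 2 as soon as a failure is seen, 1 on timeout.
wait_live_or_fail() {
  timeout_s=$1
  is_failed=$2
  is_live=$3
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$is_failed"; then
      echo "FAILED: job failed before becoming live" >&2
      return 2
    fi
    if "$is_live"; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "never became live within ${timeout_s}s" >&2
  return 1
}
```

Checking the failure condition first is what turns an immediately failing Job into an immediate error instead of a full-timeout wait.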
…age_and_ready

Kubelet rewrites Docker Hub short-form image refs (e.g. tsoracle:e2e-next)
into their canonical form (docker.io/library/tsoracle:e2e-next) before
surfacing them in .status.containerStatuses[*].image. Our strict equality
check would never match, so wait_pod_image_and_ready always timed out
on a freshly-rolled pod — blocking the kube-e2e-mixed-version lane on
its very first run.

Accept either an exact match OR a '*/IMAGE' path-boundary suffix match
via a POSIX case glob. Path-boundary preserves the safety property:
a requested 'tsoracle:e2e-next' is satisfied by the canonicalized form
but NOT by an unrelated 'foo-tsoracle:e2e-next'.
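A minimal sketch of that match, assuming a hypothetical helper name (the real check lives inside wait_pod_image_and_ready and may be structured differently):

```shell
# Hypothetical helper: image_matches WANT GOT
# Succeeds when GOT equals WANT exactly, or ends in "/WANT".
# Quoting "$want" inside the pattern makes it match literally, so only the
# leading "*" acts as a glob.
image_matches() {
  want=$1
  got=$2
  case "$got" in
    "$want" | */"$want") return 0 ;;
    *) return 1 ;;
  esac
}
```

The `/` in the second pattern is what enforces the path boundary: `docker.io/library/tsoracle:e2e-next` matches a request for `tsoracle:e2e-next`, while `foo-tsoracle:e2e-next` does not.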

Found by Task 12's local kind verification on the third attempt.

Bare 'timeout' is GNU coreutils only. CI (Ubuntu) has it; stock macOS
does not; coreutils via brew installs it as 'gtimeout'. The wrapped
'timeout 30s kubectl exec ...' is load-bearing because the admin CLI
has no per-RPC deadline; without the wrapper a hung admin RPC blocks
the orchestrator until the Job's activeDeadlineSeconds.

Detect 'timeout' first (Linux/CI), fall back to 'gtimeout' (macOS),
fail loud with installation guidance if neither is on PATH. Found
during local kind validation on macOS.
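The detection order described above can be sketched like this. The function name and the exact wording of the error message are assumptions, not the script's actual text.

```shell
# Hypothetical helper: pick_timeout_cmd
# Prints the name of a usable timeout binary, preferring 'timeout' (Linux/CI)
# over 'gtimeout' (macOS with coreutils), and fails loudly if neither exists.
pick_timeout_cmd() {
  if command -v timeout >/dev/null 2>&1; then
    echo timeout
  elif command -v gtimeout >/dev/null 2>&1; then
    echo gtimeout
  else
    echo "error: neither 'timeout' nor 'gtimeout' is on PATH; on macOS, install coreutils via brew" >&2
    return 1
  fi
}
```

A caller would resolve the command once, for example `TIMEOUT_CMD=$(pick_timeout_cmd) || exit 1`, and then prefix each admin call with `"$TIMEOUT_CMD" 30s`.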
@coveralls

Coverage Report for CI Build 26467016207

Coverage increased (+0.1%) to 95.25%

Details

  • Coverage increased (+0.1%) from the base build.
  • Patch coverage: 8 uncovered changes across 1 file (108 of 116 lines covered, 93.1%).
  • 1 coverage regression across 1 file.

Uncovered Changes

File | Changed | Covered | %
crates/tsoracle-standalone/src/admin/openraft.rs | 74 | 66 | 89.19%
Total (4 files) | 116 | 108 | 93.1%

Coverage Regressions

1 previously-covered line in 1 file lost coverage.

File | Lines Losing Coverage | Coverage
crates/tsoracle-standalone/src/admin/service.rs | 1 | 74.79%

Coverage Stats

Relevant Lines: 14084
Covered Lines: 13415
Line Coverage: 95.25%
Coverage Strength: 368259.1 hits per line

💛 - Coveralls

@SebastianThiebaud merged commit f4b3abf into main on May 26, 2026 (37 checks passed)
@SebastianThiebaud deleted the test/kube-e2e-mixed-version-soak branch on May 26, 2026 at 18:31
Linked issue closed by this pull request: e2e(kube): wire mixed-version-soak Job into run-assertions.sh

2 participants