test(e2e): wire mixed-version-soak Job + add tsoracle admin activate-format #503
Merged
Adds ActivateFormat to the MembershipAdmin proto service + Rust trait + openraft impl (delegating to StandaloneHost::initiate_format_activation), with gRPC handler error mapping to dedicated AdminErrorKind variants (MEMBERS_BELOW_TARGET, TARGET_OUT_OF_RANGE, MEMBERSHIP_CHANGED). First production caller of PeerCapabilitySource (was dead_code in network.rs). UnsupportedAdmin (paxos/file) returns Unsupported.
Mirrors the existing map_write_error test pattern. Six tests, one per FormatActivationError variant, asserting the AdminError mapping. Also clarifies the doc comment to remove the false implication that the CLI's redirect path is bypassed for activation — only the leader-endpoint hint is absent (NotLeader is a unit variant).
Six tests covering every AdminError -> AdminErrorKind transition the spec defines, plus the u8-overflow Status::invalid_argument case. test-support feature gates a pub admin_service() constructor so production builds don't compile the test seam.
Comment incorrectly claimed the test would 'fail later by reading the wrong kind back' if the handler called through. It actually fails immediately at expect_err() because the handler returns Ok(Response). Clarify that the Unsupported in the slot is just a safe fallback for the take() guard.
AdminCmd::ActivateFormat variant + dispatch with five-way exit contract: 0 success, 2 MEMBERS_BELOW_TARGET, 3 NOT_LEADER, 4 TARGET_OUT_OF_RANGE, 1 other. Lets the kube-e2e shell do ordinal-walk leader discovery and accept either rejection shape (2 or 4) for the partial-rollout safety assertion without parsing stderr.
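From the shell side, this exit contract can be consumed with a plain `case` on the exit code, with no stderr parsing. A minimal sketch; `classify_activation_exit` and its labels are hypothetical names used for illustration, not functions from this branch:

```sh
# Hypothetical helper: map an activate-format exit code to a label.
# Codes follow the contract described in this commit (0/2/3/4);
# anything else, including 1, is treated as "other".
classify_activation_exit() {
  case "$1" in
    0) echo "success" ;;
    2) echo "members_below_target" ;;
    3) echo "not_leader" ;;
    4) echo "target_out_of_range" ;;
    *) echo "other" ;;
  esac
}
```

Keeping "other" as the catch-all means a future exit code fails closed instead of being mistaken for one of the script-meaningful outcomes.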
Original comment claimed the CLI 'does NOT follow leader-redirect for activation' but the dispatch arm uses with_redirect like every other admin subcommand. The behavior is correct (activation NotLeader carries an empty leader_admin_endpoint so with_redirect short-circuits and exits 3), but the prior comment misdescribed the mechanism. The remaining IMPORTANT review finding (report_activation calling std::process::exit directly) is addressed by Task 5's planned refactor into a pure classify_activation function + impure caller.
Lifted classify_activation/ActivationOutcome out of dispatch_admin's inner scope so the test module reaches them. ChangeResponse + AdminErrorKind imports lifted to module scope under the openraft feature gate. Six tests covering every script-meaningful outcome (Success / GateRejected / NotLeader / TargetOutOfRange / Other (Driver) / Other (MembershipChanged)).
…anged test
Fixes two follow-up code-review findings on d9141d7:
- classify_activation called AdminErrorKind::try_from twice on resp.error (the kind_string was computed unconditionally but only used in the Other arm). Collapse into one match with split Ok(k)/Err(_) arms.
- membership_changed_classifies_as_other dropped the kind/message fields with .., making the test discriminate only at variant level. Now asserts kind == "MembershipChanged" + message == "drift", matching driver_error_classifies_as_other's pattern.
Enables `cargo build -p tsoracle --features openraft,e2e-max-readable-next` and therefore the `--build-arg FEATURES=...` path in deploy/Dockerfile. Required for the mixed-version kube-e2e lane to build its `:next` image.
Bare `tsoracle-driver-openraft/e2e-max-readable-next` forcibly activated the optional dep under `--features e2e-max-readable-next` alone (cargo 2024 namespaced-features rule), silently pulling in half the openraft stack despite no `openraft` request. `?` makes the pass-through a no-op unless openraft (or another feature) already pulled the dep in. Doc comment corrected to match the actual mechanism.
Closes the long-standing TODO(admin-listen) in entrypoint.sh. Loopback bind keeps the secure-by-default AdminInsecureRoutable guard happy; routable admin (with TLS) is a future operator-side opt-in by editing the entrypoint.
Both lanes (cold-start+soak, mixed-version-soak) share wait_job / wait_soak_live / wait_pod_image_and_ready. wait_soak_live now takes an explicit sentinel so a future rename in one mode doesn't silently match the other's log via substring.
Drives partial-then-full rollout with two activate-format calls; ordinal-walk leader discovery via the CLI's exit codes (0/2/3/4/1). Mid-rollout assertion accepts either MEMBERS_BELOW_TARGET (exit 2) or TARGET_OUT_OF_RANGE (exit 4) — both prove the activation control plane refused an unsafe state. Pass = both activation assertions + soak Job's monotonicity invariant.
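The ordinal walk and the either-shape acceptance can be sketched as below. This is an illustration under assumptions, not the orchestrator's code: `walk_ordinals`, `is_safe_rejection`, and the probe argument are hypothetical names. The probe stands in for whatever command runs activate-format against one ordinal and exits with the CLI's contract.

```sh
# Hypothetical sketch. PROBE is any command or function that takes an
# ordinal and exits with the activate-format contract (0/2/3/4/1).
# Prints the first ordinal that does not answer NOT_LEADER (3) and returns
# that ordinal's exit code; returns 3 if every ordinal answered NOT_LEADER.
walk_ordinals() {
  probe=$1
  replicas=$2
  ordinal=0
  while [ "$ordinal" -lt "$replicas" ]; do
    rc=0
    "$probe" "$ordinal" || rc=$?
    if [ "$rc" -ne 3 ]; then
      echo "$ordinal"
      return "$rc"
    fi
    ordinal=$((ordinal + 1))
  done
  return 3
}

# Mid-rollout check: either rejection shape (2 or 4) counts as a refusal
# of the unsafe activation; success or any other code does not.
is_safe_rejection() {
  case "$1" in
    2 | 4) return 0 ;;
    *) return 1 ;;
  esac
}
```

Because exit 3 is the only code that means "ask the next ordinal", the walk stops on the first node that gives a real answer, whether that answer is a success or a rejection.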
Builds two cluster images per run (:e2e-baseline lean openraft, :e2e-next openraft+e2e-max-readable-next), loads into kind, runs the mixed-version assertion script. 60min timeout absorbs two build steps + the partial-rollout sequence.
(1) Cargo.toml: add [[test]] required-features for activation_admin_rpc so `cargo test -p tsoracle-standalone --features openraft` (without test-support) silently skips the test binary instead of compiling it to zero test items. Without this entry, a developer running the default-feature test set saw the binary build but produce 0 tests with no diagnostic. (2) _assertions_lib.sh: wait_soak_live now short-circuits on Job failure, mirroring wait_job's pattern. An immediately-failing soak Job (image pull, driver panic, cluster unreachable) previously sat for the full TIMEOUT_S (60s in the mixed-version lane) before surfacing as "never became live" — now fails immediately with the typed FAILED message + log dump.
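The short-circuit in (2) amounts to checking for Job failure on every poll before looking for the sentinel. A rough sketch of that shape; `wait_live_sketch` and its function-name arguments are hypothetical stand-ins for the real kubectl-backed checks, and the messages are placeholders rather than the helper's actual output:

```sh
# Hypothetical sketch of a fail-fast liveness wait.
#   $1 FAILED_FN  command that exits 0 once the Job has failed
#   $2 LOGS_FN    command that prints the Job's current log
#   $3 SENTINEL   explicit fixed string that marks "live"
#   $4 TIMEOUT_S  upper bound in seconds
# Returns 0 when live, 2 on Job failure (immediately), 1 on timeout.
wait_live_sketch() {
  failed_fn=$1
  logs_fn=$2
  sentinel=$3
  timeout_s=$4
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$failed_fn"; then
      echo "FAILED: job failed before becoming live" >&2
      return 2
    fi
    # -F: treat the sentinel as a fixed string, not a regex.
    if "$logs_fn" | grep -F -q -- "$sentinel"; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "TIMEOUT: never became live" >&2
  return 1
}
```

Distinct return codes for failure and timeout let the caller report which of the two happened.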
…age_and_ready
Kubelet rewrites Docker Hub short-form image refs (e.g. tsoracle:e2e-next) into their canonical form (docker.io/library/tsoracle:e2e-next) before surfacing them in .status.containerStatuses[*].image. Our strict equality check would never match, so wait_pod_image_and_ready always timed out on a freshly-rolled pod, blocking the kube-e2e-mixed-version lane on its very first run. Accept either an exact match OR a '*/IMAGE' path-boundary suffix match via a POSIX case glob. Path-boundary preserves the safety property: a requested 'tsoracle:e2e-next' is satisfied by the canonicalized form but NOT by an unrelated 'foo-tsoracle:e2e-next'. Found by Task 12's local kind verification on the third attempt.
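The exact-or-path-boundary match can be expressed as a small POSIX `case`. A minimal sketch; `image_matches` is a hypothetical name, and the real helper may structure the check differently:

```sh
# Hypothetical sketch: succeed when GOT is WANT exactly, or ends in "/WANT".
# Quoting "$want" inside the pattern keeps it literal, so only the leading
# '*' acts as a glob.
image_matches() {
  want=$1
  got=$2
  case "$got" in
    "$want" | */"$want") return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `image_matches tsoracle:e2e-next docker.io/library/tsoracle:e2e-next` succeeds, while `image_matches tsoracle:e2e-next foo-tsoracle:e2e-next` fails because the suffix does not start at a `/` boundary.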
Bare 'timeout' is GNU coreutils only. CI (Ubuntu) has it; stock macOS does not — coreutils via brew installs it as 'gtimeout'. The wrapped 'timeout 30s kubectl exec ...' is load-bearing because the admin CLI has no per-RPC deadline; without the wrapper a hung admin RPC blocks the orchestrator until the Job's activeDeadlineSeconds. Detect 'timeout' first (Linux/CI), fall back to 'gtimeout' (macOS), fail loud with installation guidance if neither is on PATH. Found during local kind validation on macOS.
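The detection order described here (`timeout`, then `gtimeout`, then a loud failure) might look like the sketch below. `detect_timeout` and the wording of the error message are assumptions for illustration; only the lookup order comes from this commit.

```sh
# Hypothetical sketch: print the name of a usable timeout binary, or fail
# with installation guidance when neither is on PATH.
detect_timeout() {
  if command -v timeout >/dev/null 2>&1; then
    echo "timeout"
  elif command -v gtimeout >/dev/null 2>&1; then
    echo "gtimeout"
  else
    echo "error: neither 'timeout' nor 'gtimeout' found on PATH" >&2
    echo "hint: on macOS, 'brew install coreutils' provides gtimeout" >&2
    return 1
  fi
}
```

A caller would resolve the binary once, e.g. `TIMEOUT_BIN=$(detect_timeout) || exit 1`, and then prefix each wrapped call with `"$TIMEOUT_BIN" 30s`.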
Coverage report for CI build 26467016207 (Coveralls): coverage increased (+0.1%) to 95.25%.
Coverage regressions: 1 previously-covered line in 1 file lost coverage.
Closes #484.
Summary
Wires `e2e/kube/driver/job-mixed-version-soak.yaml` (added in #470) into a CI lane that proves the openraft format-activation control plane works under sustained client load + a partial-then-full image rollout. The issue's "Suggested orchestration" required an externally-callable activation surface, which didn't exist: `StandaloneHost::initiate_format_activation` was a Rust method with no gRPC or CLI surface. So this branch also lands that surface end-to-end.

Three independent assertions per CI run:
- … `MEMBERS_BELOW_TARGET`, gate fired) OR exit code 4 (`TARGET_OUT_OF_RANGE`, local-range check fired on a baseline-binary leader). Both prove the activation control plane refused an unsafe state. Local kind verification confirmed the latter shape fires when ordinal 0 leads the partial-rollout cluster.

What's in the diff (20 files, +956/−67, 18 commits)

Library surface (`crates/tsoracle-standalone/`)
- `ActivateFormat` RPC on the `MembershipAdmin` proto service + three new `AdminErrorKind` variants (`MEMBERS_BELOW_TARGET`, `TARGET_OUT_OF_RANGE`, `MEMBERSHIP_CHANGED`)
- `MembershipAdmin::activate_format` trait method + `OpenraftMembershipAdmin` impl (delegates to `StandaloneHost::initiate_format_activation`); `UnsupportedAdmin` returns `Unsupported` for paxos/file
- `build_openraft` now wraps `StandaloneHost` in `Arc` and constructs `PeerCapabilitySource` (first production caller; previously `#[allow(dead_code)]`)
- `u8::try_from` overflow validation
- `map_activation_error` unit tests

CLI (`crates/tsoracle-bin/`)
- `tsoracle admin activate-format --endpoint <addr> --target <N>` subcommand
- `0` success, `2` `MEMBERS_BELOW_TARGET`, `3` `NOT_LEADER`, `4` `TARGET_OUT_OF_RANGE`, `1` other. Lets the shell do ordinal-walk leader discovery and accept either rejection shape without parsing stderr
- `classify_activation` + impure `report_activation` split for testability; six unit tests

Cargo features
- `e2e-max-readable-next` pass-through `tsoracle-bin → tsoracle-standalone → tsoracle-driver-openraft → tsoracle-openraft-toolkit`, gated via `?` so the feature can't silently activate the optional driver dep

Chart admin-port wiring (`deploy/`)
- `entrypoint.sh` now passes `--admin-listen 127.0.0.1:51002` for openraft (loopback-only by design; a routable bind would trip the secure-by-default `AdminInsecureRoutable` guard). Closes the long-standing `TODO(admin-listen)` comment
- `statefulset.yaml`: new `ADMIN_PORT` env + `admin` containerPort
- `values.yaml`: new `ports.admin: 51002`

E2E shell + workflow (`e2e/kube/`, `.github/workflows/`)
- `_assertions_lib.sh` extracted from `run-assertions.sh`; `wait_soak_live` now takes an explicit sentinel arg; new `wait_pod_image_and_ready` helper; `wait_soak_live` short-circuits on Job failure
- `run-mixed-version-assertions.sh` orchestrator with ordinal-walk leader discovery
- `kube-e2e-mixed-version.yml` workflow (`workflow_dispatch` only): builds two cluster images per run (`:e2e-baseline` openraft-only, `:e2e-next` openraft+e2e-max-readable-next), loads into kind, runs the assertions script. 60-min timeout absorbs both builds + the rollout sequence
- `kube-e2e.yml` now passes `--set tls.allowInsecurePeer=true` to satisfy the secure-by-default chart guard from fix(deploy): require peer mTLS or explicit opt-in for HA Helm deployments #452 (the existing lane has likely been silently broken on first re-run since #452)

Pre-merge verification
- `cargo test --workspace --features openraft,test-support`: 120 test binaries, 0 failures
- `cargo clippy --workspace --all-features -- -D warnings`: clean
- `actionlint .github/workflows/kube-e2e-mixed-version.yml`: clean
- `helm template tsoracle deploy/charts/tsoracle --set driver=openraft,replicas=3,tls.allowInsecurePeer=true`: renders cleanly with `ADMIN_PORT=51002` env + `admin` containerPort
- … (`./e2e/kube/run-mixed-version-assertions.sh` against a real 3-node cluster): PASS. Soak Job tracker: `calls=52947 errors=0 error_rate=0.0000% monotonicity_violations=0 → PASS`. The local run also surfaced two latent assertion-harness defects that would have failed CI on first run (kubelet's canonical image-name rewriting + macOS-only `gtimeout` fallback); both are fixed in the last two commits.

Test plan
- `kube-e2e-mixed-version.yml` against this branch → expect the same PASS shape as local kind (~16 min wall-clock)
- `kube-e2e.yml` → confirm the `tls.allowInsecurePeer=true` drive-by fix unbroke it
- `docs/superpowers/{specs,plans}/2026-05-26-mixed-version-soak-wiring*.md` for posterity (NOT load-bearing on the implementation)

Out of scope (sibling open issues)
- `AdminError::Unsupported` for activation
- `workflow_dispatch` only here