tools/stress: physical-device harness driver #3829
Merged
Force-pushed from 1915a91 to 3b1498e
elitegreg (Contributor) approved these changes on Jun 3, 2026 and left a comment:
Reviewed. The orchestrator change is minimal and cleanly backward compatible — with all three new flags at their defaults the agent command is byte-identical to the existing containerized path. The driver script is a faithful sibling of run-stress-local.sh. Approving.
Two items worth addressing (left inline):
- `--max-users` is only set in the device-create branch, so reruns with a larger `--target-users` won't scale the onchain cap.
- The controller `EXIT` trap disarms as soon as the gRPC port is up, orphaning the `go run` controller if a later phase fails.
Force-pushed from 3b1498e to ab48f61
nikw9944 added a commit that referenced this pull request on Jun 3, 2026
Smoke-testing the harness against the real Arista device surfaced four defaults that were wrong for non-containerized hardware, and Greg pointed out two robustness issues in the script's controller lifecycle.
Smoke-test fixes:
- DUT defaults: DUT_HOST=10.0.0.15 (the real chi-dn-dzd5 address; I had misread "the device can reach 10.0.0.141" as the device's address, but 10.0.0.141 is the host running the controller). DUT_SSH_USER defaults to `nik` and DUT_SSH_KEY to `$HOME/.ssh/nik@malbeclabs.com`.
- AGENT_BINARY defaults to /mnt/flash/doublezero-agent: persistent across EOS reboots and writable by sudo. AGENT_COMMAND_PREFIX now defaults to `bash sudo /sbin/ip netns exec ns-management`: `bash` escapes EOS's RunCli login shell into the underlying shell, and `sudo` provides the CAP_SYS_ADMIN that `ip netns exec` requires.
- DEVICE_PUBLIC_IP defaults to 9.210.180.5: a globally-routable IP (the program's CreateDevice rejects RFC1918) outside the device's own dz_prefix /29 (the program also rejects overlap). The previous default ($DUT_HOST, an RFC1918 address) failed both checks.
- device-tunnel-block changed from 9.210.180.0/24 to 172.16.0.0/16 (the standard from smartcontract/test/start-test.sh). The program auto-allocates loopback IPs from this block on plain-loopback creates (interface/create.rs:213-218) and validates `is_private()` on the result (interface.rs:494). 9.210.180.0/24 is globally routable, so the validator rejected every allocation; the controller then refused to render config with `no or invalid VPNv4 loopback interface`.
- SSH command prefixed with `bash`: EOS pins admin's NSS shell to RunCli, so `ssh nik@dut "test -x ..."` would be parsed by Cli, not bash. `bash test -x …` runs the rest under bash.
- ssh.go grows an ssh-agent fall-back: when the configured key file is passphrase-protected, the runner now connects to $SSH_AUTH_SOCK and uses the agent's signers. The old code aborted with "this private key is passphrase protected" because Go's ssh.ParsePrivateKey can't prompt for a passphrase. An operator with an encrypted key just needs `ssh-add <key>` once.
- Loopback create no longer passes --ip-net: the program rejects user-supplied ip_net on plain loopbacks (interface/create.rs:155-162). Also the loopback-create branch now actually checks exit status / error markers rather than blindly logging "registered" on any non-"already exists" output.
Review feedback (#3829):
- @elitegreg: `dz device update --max-users` only ran inside the `if ! dz device get` create branch, so reruns with a larger --target-users would leave the onchain cap stuck. Moved the update outside the create guard so every run resyncs it.
- @elitegreg: `trap - EXIT` was disarming the controller-cleanup trap the moment the gRPC port came up, leaving subsequent phases (access-pass setup, build, orchestrator + observer launch) able to orphan the controller under `set -e`. The disarm now happens only after the orchestrator has launched, so the trap covers the entire pre-orchestrator window.
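The trap lifecycle described in the second review item can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: `start_controller`, `run_pre_orchestrator_phases`, `launch_orchestrator`, and `run_harness` are hypothetical names, `sleep` stands in for the real controller process, and failures are propagated with an explicit `|| exit 1` where the real script relies on `set -e`.

```shell
#!/usr/bin/env bash
# Sketch only: every function name here is a hypothetical stand-in.

CONTROLLER_PID=""

cleanup_controller() {
  # An empty pid makes this a no-op, so the trap can fire at any point.
  if [ -n "${CONTROLLER_PID:-}" ]; then
    kill "$CONTROLLER_PID" 2>/dev/null || true
    wait "$CONTROLLER_PID" 2>/dev/null || true
    CONTROLLER_PID=""
  fi
}

start_controller() {
  sleep 300 &                     # stand-in for the backgrounded controller
  CONTROLLER_PID=$!
}

run_pre_orchestrator_phases() { :; }   # stand-in: access-pass setup, build
launch_orchestrator() { :; }           # stand-in: orchestrator + observer launch

run_harness() {
  trap cleanup_controller EXIT         # armed for the whole pre-orchestrator window
  start_controller
  run_pre_orchestrator_phases || exit 1
  launch_orchestrator || exit 1
  trap - EXIT                          # disarm only after the orchestrator is up
}
```

If any phase before the final `trap - EXIT` fails, the shell exits with the trap still armed, so the controller stand-in is killed rather than orphaned.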
Force-pushed from 8bb0df4 to 4e3441d
Adds a sister script to run-stress-local.sh that exercises the same
orchestrator + observer against a real Arista EOS device. The physical
case differs from the containerized one in three substantive ways:
- Ledger / serviceability live on devnet (DZ_RPC_URL + DZ_PROGRAM_ID),
with the program pre-deployed but not yet initialized. The script
walks the same init steps as e2e/internal/devnet/smartcontract_init.go
(global-config, one location, one exchange, contributor co01), each
guarded with a `get`-then-skip so reruns are safe.
- The controller runs natively on the host via
`go run controlplane/controller/cmd/controller start ...` rather than
in a container; the script handles startup/teardown and waits for the
gRPC port. CONTROLLER_ADVERTISE_ADDR auto-detects the host's address
routable from the DUT (override if needed).
- The DUT has no agent wrapper, so the orchestrator must build the
full agent command itself. Three additive flags get this done while
leaving the containerized harness path completely unchanged when none
are set:
  --agent-binary          path of doublezero-agent on the DUT
  --agent-command-prefix  prepended verbatim (e.g.
                          "/sbin/ip netns exec ns-management")
  --agent-pubkey          appended as `-pubkey <value>` to the
                          agent command
The containerized wrapper script that injects -pubkey from
/etc/doublezero/agent/pubkey continues to work because the three
flags default to "doublezero-agent" / empty / empty respectively.
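A minimal sketch of how the three flags could combine into an agent command line. The real composition lives in the Go orchestrator; the shell function `build_agent_command` and its positional argument order are hypothetical, and any other arguments the orchestrator already passes to the agent are omitted.

```shell
# Sketch only: build_agent_command is a hypothetical helper.
# Arguments: <command-prefix> <binary> <pubkey>; empty strings mean "flag not set".
build_agent_command() {
  prefix="${1:-}"
  binary="${2:-doublezero-agent}"   # default binary name when the flag is unset
  pubkey="${3:-}"

  cmd="$binary"
  if [ -n "$prefix" ]; then
    cmd="$prefix $cmd"              # prefix is prepended verbatim
  fi
  if [ -n "$pubkey" ]; then
    cmd="$cmd -pubkey $pubkey"      # pubkey is appended as -pubkey <value>
  fi
  printf '%s\n' "$cmd"
}
```

With all three arguments empty the function prints just `doublezero-agent`, which mirrors the claim that the containerized path is unchanged when no flag is set.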
The script is structured to be familiar to anyone who has read
run-stress-local.sh — same `log()` helper, same `xargs -P` access-pass
loop (target swapped from `docker exec manager doublezero` to
`doublezero --url ... --program-id ...`), same final stanza printing
pids + log paths.
This PR targets nikw9944/stress-test-harness so the existing reviewed
PRs (#3796 / #3820 / #3821) aren't perturbed.
… the doublezero CLI default
Program IDs aren't checked in. The stress-test serviceability program ID
lives in the private infra repo; require operators to export
DZ_PROGRAM_ID explicitly and fail fast with a clear message when it's
unset (`${DZ_PROGRAM_ID:?...}`). The README's quick-start no longer
shows the previous program ID — it'll get stale anyway since the
program needs to be redeployed (`device_tunnel_block` is immutable
after init).
SOLANA_KEYPAIR now defaults to ~/.config/doublezero/id.json (the
doublezero CLI's default keypair location) instead of
~/.config/solana/id.json. The script's `dz` wrapper reads the
doublezero default implicitly, so matching its location keeps every
signing call — init, device create, access passes, orchestrator — on
the same operator authority by default. The solana CLI default may
hold a different key entirely.
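A small sketch of the two environment rules this commit describes: fail fast when `DZ_PROGRAM_ID` is unset, and default `SOLANA_KEYPAIR` to the doublezero CLI's keypair location. The function name `resolve_operator_env` and the wording of the error message are assumptions for illustration.

```shell
# Sketch only: resolve_operator_env and the message text are hypothetical.
resolve_operator_env() {
  # ${VAR:?msg} aborts a non-interactive shell with msg when VAR is unset or empty.
  : "${DZ_PROGRAM_ID:?export DZ_PROGRAM_ID before running the physical harness}"

  # Keep an operator-supplied path; otherwise use the doublezero CLI default.
  SOLANA_KEYPAIR="${SOLANA_KEYPAIR:-$HOME/.config/doublezero/id.json}"
}
```

Using `:?` keeps the check to one line and puts the explanation directly in the error output, so an operator who forgot the export sees what to do without reading the script.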
Two issues surfaced in the smoke run against chi-dn-dzd5.
Port-conflict fail-fast: the script's `nohup go run controller ...` will silently fail to bind if a leftover controller from a previous run is still squatting on the listen port. The readiness check then succeeds against the stale process, the script proceeds, and the orchestrator talks to a controller pointing at the wrong program / RPC. Add an `ss -ltn 'sport = :$PORT'` check before launch: if any listener is present, die with a hint pointing at the prior run's $CONTROLLER_PID_FILE.
Agent metrics flag: the containerized harness ships a device-side wrapper that injects `-metrics-enable -metrics-addr :50100`. The physical harness has no wrapper, so the orchestrator's SSH-exec'd agent command runs without metrics; the observer correctly reported "connection refused" scraping :50100. Add a new `--agent-metrics-addr` orchestrator flag mirroring the additive shape of `--agent-pubkey` and `--agent-command-prefix`: empty default preserves the containerized path (wrapper still handles it); the physical script sets it to `:$AGENT_METRICS_PORT` so the agent listens where the observer scrapes.
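The port-conflict check could look roughly like the sketch below. The helper names `has_listener` and `check_port_free` are hypothetical, and the parsing assumes that `ss -ltn` prints one header line followed by one row per listening socket.

```shell
# Sketch only: has_listener and check_port_free are hypothetical helpers.

# Reads `ss -ltn` output on stdin; succeeds when at least one non-empty row
# follows the header line.
has_listener() {
  awk 'NR > 1 && NF > 0 { found = 1 } END { exit (found ? 0 : 1) }'
}

# Returns non-zero, with a hint on stderr, if something already listens on the port.
check_port_free() {
  port="$1"
  pid_file="$2"
  if ss -ltn "sport = :$port" | has_listener; then
    echo "port $port already has a listener; check the prior run's pid file: $pid_file" >&2
    return 1
  fi
}
```

Splitting the parse from the `ss` call keeps the row-detection logic checkable with canned input on hosts where `ss` is unavailable.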
Real EOS devices require HTTP basic auth on eAPI; the containerized cEOS accepts an empty password and the script was hardcoding that. Add EAPI_USER (defaults to $DUT_SSH_USER) and EAPI_PASS (empty by default — preserves the containerized path) so the observer can authenticate against a physical DUT without code changes.
…EAPI_USER to admin
Adds a "Prerequisites on the physical DUT" section that lists what the operator must have configured on the device before running the script: SSH access + bash user with passwordless sudo, the doublezero-agent binary on disk (with build command), the EOS SDK RPC agent (daemon eapilocal + management api eos-sdk-rpc) in VRF management, eAPI http-commands with a usable login, and the ns-management netns. The containerized harness gets away without these because cEOS provides the SDK implicitly and the harness renders its own startup-config; on a physical device they must already be there.
Also documents EAPI_USER / EAPI_PASS in the env-var table, fixes the EAPI_USER default from $DUT_SSH_USER (a bash-shell operator account, typically not in eAPI's user table) to `admin` (the convention), and mentions reading EAPI_PASS interactively in the quick-start.
Two issues from running against chi-dn-dzd5.
Pidfile bug: the script's controller startup is `nohup go run ...`, which compiles a temp binary and exec's it as a child of `go run`. We stored `$!` (the `go run` parent's PID) in the pidfile, but the port listener is the child. `kill $(cat controller.pid)` then killed only the shim and left the listener, making subsequent runs trip the new port-conflict fail-fast check. After the port comes up, use `ss -ltnp` to discover the listener's PID and overwrite the pidfile with it. Cleanup now tries the listener pid first (preferred, since killing it makes `go run` exit too) and falls back to the parent.
mgmt-vrf mismatch: I had `--mgmt-vrf mgmt` hardcoded (matching cEOS), so the controller rendered `ntp server vrf mgmt ...` and real EOS rejected it at command 23/253 (the device's VRF is named `management` in full). Pull `DEVICE_MGMT_VRF` out as an env var defaulting to `management`, use it in both create and update, and include it in the rerun-resync block (same pattern as max-users). This recovers from a wrong-vrf onchain device on the next run without requiring an out-of-band fix.
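The pidfile fix can be sketched as two small helpers. Both names (`listener_pid_from_ss`, `stop_controller`) are hypothetical, and the parse assumes `ss -ltnp` renders the owning process as `users:(("name",pid=N,fd=M))`.

```shell
# Sketch only: both helper names are hypothetical.

# Extracts the first pid=<n> found in `ss -ltnp` output read from stdin.
# Prints nothing when no row carries process information.
listener_pid_from_ss() {
  sed -n 's/.*pid=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Kills the listener first (the `go run` parent exits with it) and falls back
# to the parent pid when the listener pid is unknown or already gone.
stop_controller() {
  listener_pid="${1:-}"
  parent_pid="${2:-}"
  if [ -n "$listener_pid" ] && kill "$listener_pid" 2>/dev/null; then
    return 0
  fi
  if [ -n "$parent_pid" ]; then
    kill "$parent_pid" 2>/dev/null || true
  fi
}
```

A caller would pipe something like `ss -ltnp "sport = :$PORT"` into `listener_pid_from_ss` once the port is up and write the result to the pidfile, keeping the original `$!` as the fallback argument.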
…ort path too
The 512-user run against chi-dn-dzd5 stranded 256 tunnels on the
device even though deprovision completed onchain. Sequence:
1. observer's device_tunnel_gap trigger fired
(false positive — `show gre tunnel static` returns empty when
tunnels are down, even though they exist in running-config)
2. orchestrator detects the abort sentinel; provision returns
context.Canceled (err != nil)
3. deprovision runs to completion under WithoutCancel — 512
deprovision_activates onchain
4. controller polls onchain state, generates a final config
(~14k lines, ~400 KB) and pushes it to the agent
5. agent receives the config (visible in orchestrator.agent.log)
6. orchestrator's quiescence wait is GUARDED by
`if err == nil && depErr == nil && agentErr == nil { wait }`.
err == context.Canceled, so wait is skipped.
7. agentCancel() runs immediately; SSH session dies; the agent
never gets to commit the 14k-line config.
8. Tunnel500..Tunnel755 stay in the running-config.
This is the same class of bug commit 0791129 originally fixed
for the success path. The abort triggers exist to flag suspicious
off-device conditions ("agent silent for 120s", "device tunnel
count doesn't match orchestrator") — they don't mean "kill the
agent mid-commit". When deprovision has already completed onchain,
we should give the agent its normal quiescence window to drain
whatever the controller pushed before tearing down the SSH session.
Two-line change:
- Drop the `err == nil` requirement from the wait predicate
(only `depErr == nil && agentErr == nil` matter — deprovision
completed and the stream is healthy).
- When ctx is already cancelled (abort path), pass a
WithoutCancel'd context to the wait so it doesn't immediately
bail at the first `<-ctx.Done()`. The
AgentQuiescenceTimeout still caps total wait, and a signal-
impatient operator can re-Ctrl-C the orchestrator.
Adds TestRun_WaitsForAgentQuiescenceEvenOnAbort proving the wait
engages on the ctx-cancellation path while deprovision still
completes.
Quiescence-wait floor (the (a) fix from the smoke-test rerun): the existing predicate "agent silent for ≥ quietWindow" returned instantly when the agent had been silent BEFORE the wait engaged, exactly the shape seen against chi-dn-dzd5 where slow EOS commits take 60+ s and the agent emits no log lines during them. Add an elapsed-since-wait-start floor so the wait always blocks at least `quietWindow` after deprovision returns, giving the agent its full window to commit any post-deprovision controller config push. Updates the matching test to assert both branches are observed.
Review feedback (#3829 follow-up):
- M1: `agentAuthMethod` was leaking its ssh-agent unix-socket conn. Return the conn alongside the AuthMethod, stash it on the SSH struct, close it in `shutdown()` (which is already idempotent).
- M2: an ssh-agent that's reachable but has zero loaded keys now errors out fast with a "ssh-add your key first" hint instead of silently producing an AuthMethod that fails opaquely later in the SSH handshake.
- M3: the `ss -ltnp pid=` parse for the controller listener PID needs GNU grep AND socket ownership. Add a `pgrep -P` fallback that walks the process tree from the `go run` parent; it works on BSD/macOS and in unprivileged shells where ss returns nothing.
- L1: README now calls out the `bash sudo` prefix in the agent-invocation row (the `bash` keyword escapes EOS Cli into the shell; `sudo` provides CAP_SYS_ADMIN for `ip netns exec`).
- L2: README's "Management netns" prereq now explains that the netns name tracks the VRF name, with `vrf management → ns-management` and `vrf mgmt → ns-mgmt` examples, and cross-references the `DZ_STRESS_DEVICE_MGMT_VRF` env var.
- L3: rewrote `TestRun_WaitsForAgentQuiescenceEvenOnAbort` to emit the synthetic event well before cancellation and assert `elapsed ≥ quietWindow/2`. The previous version would have passed against a regression that reverted either the `err == nil` guard removal or the new elapsed-since-wait-start floor.
- L4: moved `trap cleanup_controller EXIT` to before the `nohup go run ... &` so the trap arms before launch, closing a tiny window where a `set -e` failure between launch and trap arming could orphan the controller. `cleanup_controller` now reads both pid vars with `:-` defaults so a pre-launch fire is a no-op rather than a `kill ""` error.
- N1: comment on the orchestrator's new agent flags now references `tools/stress/docker/device/agent-wrapper.sh` for readers following the wrapper trail.
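The shape of the quiescence-wait floor can be illustrated with a tiny predicate. The real logic is Go code in the orchestrator; this shell function, its name `agent_quiescent`, and the use of whole seconds are assumptions made only to show that both conditions must hold before the wait ends.

```shell
# Sketch only: agent_quiescent is a hypothetical helper using whole seconds.
#   $1  seconds since the agent's last observed activity
#   $2  seconds since the quiescence wait started
#   $3  quiet window in seconds
# Succeeds only when the agent has been silent for the window AND the wait
# itself has lasted at least the window.
agent_quiescent() {
  silent_s="$1"
  elapsed_s="$2"
  window_s="$3"
  [ "$silent_s" -ge "$window_s" ] && [ "$elapsed_s" -ge "$window_s" ]
}
```

An agent that was already silent for 90 s when the wait began still fails the check 5 s into a 60 s window (`agent_quiescent 90 5 60`), which is the instant-return case the floor is meant to close.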
Force-pushed from 5d0734e to 31e8177
Summary of Changes
- `tools/stress/scripts/run-stress-physical.sh`: driver that runs the same orchestrator + observer (from #3796 "PR 3747-4: abort decider + sentinel (~200 LOC code)", #3820 "tools/stress: fixes surfaced by running the 1024-user stress harness", and #3821 "tools/stress: local containerized harness script") against a physical Arista EOS device instead of the containerized cEOS. Idempotent serviceability init, idempotent device + loopback create, controller launch via `go run` against the devnet ledger, parallel access-pass setup, orchestrator + observer launch with a port-conflict fail-fast and a trap that cleans up the controller on any pre-launch error.
- Four additive orchestrator flags: `--agent-binary`, `--agent-command-prefix`, `--agent-pubkey`, `--agent-metrics-addr`. All default to empty so the containerized harness path is byte-for-byte unchanged.
- ssh-agent fallback: when the key file is passphrase-protected (`ssh.ParsePrivateKey` can't prompt), the runner connects to `$SSH_AUTH_SOCK` and uses the agent's loaded signers. Empty-agent and unreachable-agent paths fail fast with actionable diagnostics instead of opaque SSH handshake errors.
Diff Breakdown
Mostly scaffolding (the new driver script is ~500 of the 857 net lines); core-logic changes are concentrated in two files (ssh-agent fallback + quiescence-wait predicate) and accompanied by a test rewrite covering the new branch.
Key files
- `tools/stress/scripts/run-stress-physical.sh`: new driver with idempotent init, device + loopbacks, controller startup, parallel access passes, orchestrator + observer launch, port-conflict fail-fast.
- `tools/stress/scripts/README.md`: physical-harness section with prereqs and the env-var knob table.
- `tools/stress/device-orchestrator/pkg/sweep/sweep_test.go`: new tests for the abort-path quiescence wait and the elapsed-since-wait-start floor.
- `tools/stress/device-orchestrator/pkg/agent/ssh.go`: `loadAuthMethod` adds an ssh-agent fallback that's used when the key file is passphrase-protected or empty; conn is stashed on the SSH struct and closed in `shutdown`.
- `tools/stress/device-orchestrator/pkg/sweep/sweep.go`: post-deprovision quiescence wait now engages on the abort path (`context.WithoutCancel` for the wait), and requires both `silent ≥ window` and `elapsed-since-wait-start ≥ window`.
- `tools/stress/device-orchestrator/cmd/device-orchestrator/main.go`: four additive flags threaded through `orchestratorConfig` and `selectAgentRunner`.
Testing Verification
- `TestRun_WaitsForAgentQuiescenceEvenOnAbort` exercises both axes of the new branch: that the wait engages when ctx is cancelled (abort path), and that the elapsed-since-wait-start floor blocks even when the absolute "silent for ≥ window" predicate would return instantly.
- `TestRun_WaitsForAgentQuiescenceAfterDeprovision` and `TestRun_SkipsQuiescenceWaitWhenNoAppliedObserved` continue to assert the success and no-event paths respectively.
- … `StartUserTunnelNum=500` that's 524 usable users on chi-dn-dzd5 specifically; different physical DUTs have higher caps).