tools/stress: physical-device harness driver by nikw9944 · Pull Request #3829 · malbeclabs/doublezero

nikw9944 · 2026-06-03T16:21:05Z

Summary of Changes

New tools/stress/scripts/run-stress-physical.sh driver that runs the same orchestrator + observer (from #3796 "PR 3747-4: abort decider + sentinel (~200 LOC code)", #3820 "tools/stress: fixes surfaced by running the 1024-user stress harness", and #3821 "tools/stress: local containerized harness script") against a physical Arista EOS device instead of the containerized cEOS. Idempotent serviceability init, idempotent device + loopback create, controller launch via go run against the devnet ledger, parallel access-pass setup, orchestrator + observer launch with a port-conflict fail-fast and a trap that cleans up the controller on any pre-launch error.
Four additive flags on the device-orchestrator so the SSH-exec'd agent command can be made self-contained on a DUT with no wrapper script: --agent-binary, --agent-command-prefix, --agent-pubkey, --agent-metrics-addr. All default to empty so the containerized harness path is byte-for-byte unchanged.
ssh-agent fallback in the orchestrator's SSH runner: when the configured key file is passphrase-protected (Go's ssh.ParsePrivateKey can't prompt), the runner connects to $SSH_AUTH_SOCK and uses the agent's loaded signers. Empty-agent and unreachable-agent paths fail fast with actionable diagnostics instead of opaque SSH handshake errors.
Post-deprovision agent quiescence wait now engages on the abort path too (not just clean success) and enforces a minimum window after deprovision returns. Without this, the observer's abort sentinel (legitimate or false-positive) could kill the SSH session mid-commit and strand tunnels in the device's running-config even though onchain deprovision completed.
README appendix on physical-DUT prerequisites: SSH user with passwordless sudo + bash shell, the agent binary on disk, EOS SDK RPC daemon in the management VRF, eAPI http-commands with credentials, and the management netns. Includes the env-var knob table.

Diff Breakdown

Category	Files	Lines (+/-)	Net
Core logic	2	+121 / -26	+95
Scaffolding	2	+534 / -5	+529
Tests	1	+91 / -0	+91
Docs	1	+142 / -0	+142
Total	6	+888 / -31	+857

Mostly scaffolding (the new driver script is ~500 of the 857 net lines); core-logic changes are concentrated in two files (ssh-agent fallback + quiescence-wait predicate) and accompanied by a test rewrite covering the new branch.

Key files

tools/stress/scripts/run-stress-physical.sh — new driver: idempotent init, device + loopbacks, controller startup, parallel access passes, orchestrator + observer launch, port-conflict fail-fast.
tools/stress/scripts/README.md — physical-harness section with prereqs and the env-var knob table.
tools/stress/device-orchestrator/pkg/sweep/sweep_test.go — new tests for the abort-path quiescence wait and the elapsed-since-wait-start floor.
tools/stress/device-orchestrator/pkg/agent/ssh.go — loadAuthMethod adds an ssh-agent fallback that's used when the key file is passphrase-protected or empty; conn is stashed on the SSH struct and closed in shutdown.
tools/stress/device-orchestrator/pkg/sweep/sweep.go — post-deprovision quiescence wait now engages on the abort path (context.WithoutCancel for the wait), and requires both silent ≥ window and elapsed-since-wait-start ≥ window.
tools/stress/device-orchestrator/cmd/device-orchestrator/main.go — four additive flags threaded through orchestratorConfig and selectAgentRunner.

Testing Verification

TestRun_WaitsForAgentQuiescenceEvenOnAbort exercises both axes of the new branch: that the wait engages when ctx is cancelled (abort path), and that the elapsed-since-wait-start floor blocks even when the absolute "silent for ≥ window" predicate would return instantly.
TestRun_WaitsForAgentQuiescenceAfterDeprovision and TestRun_SkipsQuiescenceWaitWhenNoAppliedObserved continue to assert the success and no-event paths respectively.
Validated end-to-end against chi-dn-dzd5: 4-user sweep completes cleanly with 0 stranded tunnels; 512-user sweep provisions + deprovisions all 512 onchain. A 1024-user sweep is out of reach on this DUT model because of a hardware limit: Tunnel ID names cap at 1023, so with StartUserTunnelNum=500 only 524 users are usable on chi-dn-dzd5 specifically (different physical DUTs have higher caps).
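
The arithmetic behind the 524 figure can be sketched as follows. This is illustrative Go, not code from the PR: `usableUsers` is a hypothetical helper, and the inclusive range (one tunnel per user, numbered from the start value up to the cap) is an assumption consistent with the numbers quoted above.

```go
package main

import "fmt"

// usableUsers returns how many user tunnels fit between the first user tunnel
// number and the highest tunnel ID the device accepts, both ends inclusive.
// Hypothetical helper for illustration only.
func usableUsers(startTunnelNum, maxTunnelID int) int {
	if startTunnelNum > maxTunnelID {
		return 0
	}
	return maxTunnelID - startTunnelNum + 1
}

func main() {
	// Values quoted for chi-dn-dzd5: StartUserTunnelNum=500, cap at 1023.
	capacity := usableUsers(500, 1023)
	fmt.Println("usable users:", capacity)
	fmt.Println("512-user sweep fits:", capacity >= 512)
	fmt.Println("1024-user sweep fits:", capacity >= 1024)
}
```

Under these assumptions the 512-user sweep fits and a 1024-user sweep does not, which matches the validation results reported above.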

elitegreg

Reviewed. The orchestrator change is minimal and cleanly backward compatible — with all three new flags at their defaults the agent command is byte-identical to the existing containerized path. The driver script is a faithful sibling of run-stress-local.sh. Approving.

Two items worth addressing (left inline):

--max-users is only set in the device-create branch, so reruns with a larger --target-users won't scale the onchain cap.
The controller EXIT trap disarms as soon as the gRPC port is up, orphaning the go run controller if a later phase fails.

Smoke-testing the harness against the real Arista device surfaced four defaults that were wrong for non-containerized hardware, and Greg pointed out two robustness issues in the script's controller lifecycle.

Smoke-test fixes:

- DUT defaults: DUT_HOST=10.0.0.15 (the real chi-dn-dzd5 address; I had misread "the device can reach 10.0.0.141" as the device's address, but 10.0.0.141 is the host running the controller). DUT_SSH_USER defaults to `nik` and DUT_SSH_KEY to `$HOME/.ssh/nik@malbeclabs.com`.
- AGENT_BINARY defaults to /mnt/flash/doublezero-agent: persistent across EOS reboots and writable by sudo. AGENT_COMMAND_PREFIX now defaults to `bash sudo /sbin/ip netns exec ns-management`: `bash` escapes EOS's RunCli login shell into the underlying shell, and `sudo` provides the CAP_SYS_ADMIN that `ip netns exec` requires.
- DEVICE_PUBLIC_IP defaults to 9.210.180.5: a globally-routable IP (the program's CreateDevice rejects RFC1918) outside the device's own dz_prefix /29 (the program also rejects overlap). The previous default ($DUT_HOST, an RFC1918 address) failed both checks.
- device-tunnel-block changed from 9.210.180.0/24 to 172.16.0.0/16 (the standard from smartcontract/test/start-test.sh). The program auto-allocates loopback IPs from this block on plain-loopback creates (interface/create.rs:213-218) and validates `is_private()` on the result (interface.rs:494). 9.210.180.0/24 is globally routable, so the validator rejected every allocation, and the controller then refused to render config with `no or invalid VPNv4 loopback interface`.
- SSH command prefixed with `bash`: EOS pins admin's NSS shell to RunCli, so `ssh nik@dut "test -x ..."` would be parsed by Cli, not bash. `bash test -x …` runs the rest under bash.
- ssh.go grows an ssh-agent fall-back: when the configured key file is passphrase-protected, the runner now connects to $SSH_AUTH_SOCK and uses the agent's signers. The old code aborted with "this private key is passphrase protected" because Go's ssh.ParsePrivateKey can't prompt for a passphrase. An operator with an encrypted key just needs `ssh-add <key>` once.
- Loopback create no longer passes --ip-net: the program rejects user-supplied ip_net on plain loopbacks (interface/create.rs:155-162). Also the loopback-create branch now actually checks exit status / error markers rather than blindly logging "registered" on any non-"already exists" output.

Review feedback (#3829):

- @elitegreg: `dz device update --max-users` only ran inside the `if ! dz device get` create branch, so reruns with a larger --target-users would leave the onchain cap stuck. Moved the update outside the create guard so every run resyncs it.
- @elitegreg: `trap - EXIT` was disarming the controller-cleanup trap the moment the gRPC port came up, leaving subsequent phases (access-pass setup, build, orchestrator + observer launch) able to orphan the controller under `set -e`. The disarm now happens only after the orchestrator has launched, so the trap covers the entire pre-orchestrator window.
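
The two address checks described in the smoke-test fixes above (a private tunnel block, a non-private device public IP) can be illustrated with Go's standard `net/netip` package. This is only a sketch of the idea, not the onchain program's Rust validator: `firstAddrIsPrivate` is a hypothetical helper, and checking the block's base address stands in for checking each allocated loopback IP.

```go
package main

import (
	"fmt"
	"net/netip"
)

// firstAddrIsPrivate reports whether the base address of a CIDR block is a
// private address (RFC 1918 for IPv4, unique-local for IPv6). Hypothetical
// helper: it approximates validating an IP allocated from the block.
func firstAddrIsPrivate(cidr string) (bool, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false, err
	}
	return prefix.Masked().Addr().IsPrivate(), nil
}

func main() {
	// The old and new device-tunnel-block values from the commit message.
	for _, block := range []string{"9.210.180.0/24", "172.16.0.0/16"} {
		private, err := firstAddrIsPrivate(block)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s private=%t\n", block, private)
	}

	// The old and new DEVICE_PUBLIC_IP defaults; a public IP must NOT be private.
	for _, ip := range []string{"10.0.0.15", "9.210.180.5"} {
		fmt.Printf("%s private=%t\n", ip, netip.MustParseAddr(ip).IsPrivate())
	}
}
```

The same predicate is used in opposite directions: loopback allocations must satisfy it, while the device's public IP must fail it.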

Adds a sister script to run-stress-local.sh that exercises the same orchestrator + observer against a real Arista EOS device. The physical case differs from the containerized one in three substantive ways:

- Ledger / serviceability live on devnet (DZ_RPC_URL + DZ_PROGRAM_ID), with the program pre-deployed but not yet initialized. The script walks the same init steps as e2e/internal/devnet/smartcontract_init.go (global-config, one location, one exchange, contributor co01), each guarded with a `get`-then-skip so reruns are safe.
- The controller runs natively on the host via `go run controlplane/controller/cmd/controller start ...` rather than in a container; the script handles startup/teardown and waits for the gRPC port. CONTROLLER_ADVERTISE_ADDR auto-detects the host's address routable from the DUT (override if needed).
- The DUT has no agent wrapper, so the orchestrator must build the full agent command itself. Three additive flags get this done while leaving the containerized harness path completely unchanged when none are set:

    --agent-binary          path of doublezero-agent on the DUT
    --agent-command-prefix  prepended verbatim (e.g. "/sbin/ip netns exec ns-management")
    --agent-pubkey          appended as `-pubkey <value>` to the agent command

  The containerized wrapper script that injects -pubkey from /etc/doublezero/agent/pubkey continues to work because all three flags default to empty / "doublezero-agent" / empty.

The script is structured to be familiar to anyone who has read run-stress-local.sh: same `log()` helper, same `xargs -P` access-pass loop (target swapped from `docker exec manager doublezero` to `doublezero --url ... --program-id ...`), same final stanza printing pids + log paths.

This PR targets nikw9944/stress-test-harness so the existing reviewed PRs (#3796 / #3820 / #3821) aren't perturbed.
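
How additive flags of this kind can leave the default command untouched is sketched below. It is a guess at the shape, not the orchestrator's actual code: `agentFlags` and `buildAgentCommand` are hypothetical names, the argument order is assumed, the metrics arguments follow the `-metrics-enable -metrics-addr` form that the PR attributes to the containerized wrapper, and no shell quoting is attempted.

```go
package main

import (
	"fmt"
	"strings"
)

// agentFlags holds the values of the additive orchestrator flags.
// Hypothetical struct; field comments name the flag each one models.
type agentFlags struct {
	Binary        string // --agent-binary
	CommandPrefix string // --agent-command-prefix
	Pubkey        string // --agent-pubkey
	MetricsAddr   string // --agent-metrics-addr
}

// buildAgentCommand assembles the command string to run over SSH. Every
// unset flag contributes nothing, so the zero value yields the bare binary.
func buildAgentCommand(f agentFlags) string {
	var parts []string
	if f.CommandPrefix != "" {
		parts = append(parts, f.CommandPrefix) // prepended verbatim
	}
	binary := f.Binary
	if binary == "" {
		binary = "doublezero-agent" // assumed fallback name
	}
	parts = append(parts, binary)
	if f.Pubkey != "" {
		parts = append(parts, "-pubkey", f.Pubkey)
	}
	if f.MetricsAddr != "" {
		parts = append(parts, "-metrics-enable", "-metrics-addr", f.MetricsAddr)
	}
	return strings.Join(parts, " ")
}

func main() {
	// Containerized path: no flags set.
	fmt.Println(buildAgentCommand(agentFlags{}))

	// Physical path, using defaults mentioned in this PR; the pubkey is a placeholder.
	fmt.Println(buildAgentCommand(agentFlags{
		Binary:        "/mnt/flash/doublezero-agent",
		CommandPrefix: "bash sudo /sbin/ip netns exec ns-management",
		Pubkey:        "EXAMPLEPUBKEY",
		MetricsAddr:   ":50100",
	}))
}
```

Keeping every flag a no-op at its zero value is what lets the reviewer's "byte-identical to the existing containerized path" property hold.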

… the doublezero CLI default

Program IDs aren't checked in. The stress-test serviceability program ID lives in the private infra repo; require operators to export DZ_PROGRAM_ID explicitly and fail fast with a clear message when it's unset (`${DZ_PROGRAM_ID:?...}`). The README's quick-start no longer shows the previous program ID; it'll get stale anyway since the program needs to be redeployed (`device_tunnel_block` is immutable after init).

SOLANA_KEYPAIR now defaults to ~/.config/doublezero/id.json (the doublezero CLI's default keypair location) instead of ~/.config/solana/id.json. The script's `dz` wrapper reads the doublezero default implicitly, so matching its location keeps every signing call (init, device create, access passes, orchestrator) on the same operator authority by default. The solana CLI default may hold a different key entirely.

Two issues surfaced in the smoke run against chi-dn-dzd5.

Port-conflict fail-fast: the script's `nohup go run controller ...` will silently fail to bind if a leftover controller from a previous run is still squatting on the listen port. The readiness check then succeeds against the stale process, the script proceeds, and the orchestrator talks to a controller pointing at the wrong program / RPC. Add an `ss -ltn 'sport = :$PORT'` check before launch: if any listener is present, die with a hint pointing at the prior run's $CONTROLLER_PID_FILE.

Agent metrics flag: the containerized harness ships a device-side wrapper that injects `-metrics-enable -metrics-addr :50100`. The physical harness has no wrapper, so the orchestrator's SSH-exec'd agent command runs without metrics, and the observer correctly reported "connection refused" scraping :50100. Add a new `--agent-metrics-addr` orchestrator flag mirroring the additive shape of `--agent-pubkey` and `--agent-command-prefix`: empty default preserves the containerized path (wrapper still handles it); the physical script sets it to `:$AGENT_METRICS_PORT` so the agent listens where the observer scrapes.
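
The port-conflict check above is done in the script with `ss`. For readers who think in Go, the same "is anything already listening?" question can be approximated with a TCP dial probe. That is a different technique from the script's socket-table lookup, shown only as an analogue; `listenerPresent` and `staleListenerDemo` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// listenerPresent reports whether something accepts TCP connections at addr.
// A successful dial means a process (possibly a stale controller) owns the port.
func listenerPresent(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 500*time.Millisecond)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// staleListenerDemo opens a throwaway listener to stand in for a leftover
// controller, probes it, closes it, and probes again.
func staleListenerDemo() (whileOpen, afterClose bool, err error) {
	stale, err := net.Listen("tcp", "127.0.0.1:0") // port 0: kernel picks a free port
	if err != nil {
		return false, false, err
	}
	addr := stale.Addr().String()
	whileOpen = listenerPresent(addr)
	stale.Close()
	afterClose = listenerPresent(addr)
	return whileOpen, afterClose, nil
}

func main() {
	whileOpen, afterClose, err := staleListenerDemo()
	if err != nil {
		fmt.Println("could not open demo listener:", err)
		return
	}
	fmt.Println("listener detected while open:", whileOpen)
	fmt.Println("listener detected after close:", afterClose)
}
```

A launcher built this way would refuse to start when the probe succeeds, which is the fail-fast behavior the commit describes. Unlike `ss`, a dial probe cannot say which process owns the port.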

Real EOS devices require HTTP basic auth on eAPI; the containerized cEOS accepts an empty password and the script was hardcoding that. Add EAPI_USER (defaults to $DUT_SSH_USER) and EAPI_PASS (empty by default — preserves the containerized path) so the observer can authenticate against a physical DUT without code changes.

…EAPI_USER to admin

Adds a "Prerequisites on the physical DUT" section that lists what the operator must have configured on the device before running the script: SSH access + bash user with passwordless sudo, the doublezero-agent binary on disk (with build command), the EOS SDK RPC agent (daemon eapilocal + management api eos-sdk-rpc) in VRF management, eAPI http-commands with a usable login, and the ns-management netns. The containerized harness gets away without these because cEOS provides the SDK implicitly and the harness renders its own startup-config; on a physical device they must already be there.

Also documents EAPI_USER / EAPI_PASS in the env-var table, fixes the EAPI_USER default from $DUT_SSH_USER (a bash-shell operator account, typically not in eAPI's user table) to `admin` (the convention), and mentions reading EAPI_PASS interactively in the quick-start.

Two issues from running against chi-dn-dzd5.

Pidfile bug: the script's controller startup is `nohup go run ...`, which compiles a temp binary and runs it as a child of `go run`. We stored `$!` (the `go run` parent's PID) in the pidfile, but the port listener is the child. `kill $(cat controller.pid)` then killed only the shim and left the listener, making subsequent runs trip the new port-conflict fail-fast check. After the port comes up, the script now uses `ss -ltnp` to discover the listener's PID and overwrites the pidfile with it. Cleanup now tries the listener pid first (preferred, since killing it makes `go run` exit too) and falls back to the parent.

mgmt-vrf mismatch: I had `--mgmt-vrf mgmt` hardcoded (matching cEOS), so the controller rendered `ntp server vrf mgmt ...` and real EOS rejected it at command 23/253 (the device's VRF is named `management` in full). Pull `DEVICE_MGMT_VRF` out as an env var defaulting to `management`, use it in both create and update, and include it in the rerun-resync block (same pattern as max-users). This recovers from a wrong-vrf onchain device on the next run without requiring an out-of-band fix.
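
The listener-PID discovery described above amounts to pulling the `pid=` field out of a line of `ss -ltnp` output. A minimal Go sketch of that parsing step follows; the script itself does this in shell, `listenerPID` and `pidToKill` are hypothetical helpers, and the sample line (port, PID, and fd) is made up to match the usual `users:(("name",pid=N,fd=M))` layout.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// pidField matches the pid=N field that ss -p prints for a socket's owner.
var pidField = regexp.MustCompile(`pid=(\d+)`)

// listenerPID extracts the first pid= value from one line of ss -ltnp output.
// It returns false when the line carries no owner information, which is what
// an unprivileged shell may see.
func listenerPID(ssLine string) (int, bool) {
	m := pidField.FindStringSubmatch(ssLine)
	if m == nil {
		return 0, false
	}
	pid, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return pid, true
}

// pidToKill prefers the listener's PID and falls back to the go run parent.
func pidToKill(ssLine string, parentPID int) int {
	if pid, ok := listenerPID(ssLine); ok {
		return pid
	}
	return parentPID
}

func main() {
	// Illustrative line only; the real port and PID will differ.
	line := `LISTEN 0 4096 *:7000 *:* users:(("controller",pid=4242,fd=7))`
	fmt.Println(pidToKill(line, 4100)) // owner visible: listener PID wins
	fmt.Println(pidToKill("", 4100))   // no owner info: fall back to parent
}
```

The fallback branch mirrors the motivation for the later `pgrep -P` fix: when `ss` yields no PID, something other than the socket table has to identify the process.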

…ort path too

The 512-user run against chi-dn-dzd5 stranded 256 tunnels on the device even though deprovision completed onchain. Sequence:

1. observer's device_tunnel_gap trigger fired (false positive: `show gre tunnel static` returns empty when tunnels are down, even though they exist in running-config)
2. orchestrator detects the abort sentinel; provision returns context.Canceled (err != nil)
3. deprovision runs to completion under WithoutCancel: 512 deprovision_activates onchain
4. controller polls onchain state, generates a final config (~14k lines, ~400 KB) and pushes it to the agent
5. agent receives the config (visible in orchestrator.agent.log)
6. orchestrator's quiescence wait is GUARDED by `if err == nil && depErr == nil && agentErr == nil { wait }`. err == context.Canceled, so wait is skipped.
7. agentCancel() runs immediately; SSH session dies; the agent never gets to commit the 14k-line config.
8. Tunnel500..Tunnel755 stay in the running-config.

This is the same class of bug commit 0791129 originally fixed for the success path. The abort triggers exist to flag suspicious off-device conditions ("agent silent for 120s", "device tunnel count doesn't match orchestrator"); they don't mean "kill the agent mid-commit". When deprovision has already completed onchain, we should give the agent its normal quiescence window to drain whatever the controller pushed before tearing down the SSH session.

Two-line change:

- Drop the `err == nil` requirement from the wait predicate (only `depErr == nil && agentErr == nil` matter: deprovision completed and the stream is healthy).
- When ctx is already cancelled (abort path), pass a WithoutCancel'd context to the wait so it doesn't immediately bail at the first `<-ctx.Done()`.

The AgentQuiescenceTimeout still caps total wait, and a signal-impatient operator can re-Ctrl-C the orchestrator.

Adds TestRun_WaitsForAgentQuiescenceEvenOnAbort proving the wait engages on the ctx-cancellation path while deprovision still completes.
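
The two-line change can be pictured with a small standalone program. This is a sketch of the idea rather than the code in sweep.go: the function names are hypothetical, and only `context.WithoutCancel` (standard library, Go 1.21+) is a real API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// shouldWaitForQuiescence models the relaxed guard: the provision error is
// deliberately not consulted, only deprovision and agent-stream health.
func shouldWaitForQuiescence(depErr, agentErr error) bool {
	return depErr == nil && agentErr == nil
}

// quiescenceContext returns a context the wait can block on. If the parent
// was already cancelled (abort path), cancellation is detached so the wait
// does not return at its first <-ctx.Done() check.
func quiescenceContext(ctx context.Context) context.Context {
	if ctx.Err() != nil {
		return context.WithoutCancel(ctx)
	}
	return ctx
}

// abortPathDemo simulates the abort sentinel firing before the wait begins.
func abortPathDemo() (provisionCanceled, waitEngages, waitCtxAlive bool) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // abort sentinel detected

	provisionErr := ctx.Err() // what provision would surface
	provisionCanceled = errors.Is(provisionErr, context.Canceled)

	// Deprovision completed and the agent stream is healthy.
	waitEngages = shouldWaitForQuiescence(nil, nil)
	waitCtxAlive = quiescenceContext(ctx).Err() == nil
	return provisionCanceled, waitEngages, waitCtxAlive
}

func main() {
	canceled, engages, alive := abortPathDemo()
	fmt.Println("provision saw cancellation:", canceled)
	fmt.Println("quiescence wait engages:", engages)
	fmt.Println("wait context still usable:", alive)
}
```

Because the detached context can no longer be cancelled by the parent, a separate upper bound is needed; in the PR that role is played by AgentQuiescenceTimeout.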

Quiescence-wait floor (the (a) fix from the smoke-test rerun): the existing predicate "agent silent for ≥ quietWindow" returned instantly when the agent had been silent BEFORE the wait engaged, exactly the shape seen against chi-dn-dzd5 where slow EOS commits take 60+ s and the agent emits no log lines during them. Add an elapsed-since-wait-start floor so the wait always blocks at least `quietWindow` after deprovision returns, giving the agent its full window to commit any post-deprovision controller config push. Updates the matching test to assert both branches are observed.

Review feedback (#3829 follow-up):

- M1: `agentAuthMethod` was leaking its ssh-agent unix-socket conn. Return the conn alongside the AuthMethod, stash it on the SSH struct, close it in `shutdown()` (which is already idempotent).
- M2: an ssh-agent that's reachable but has zero loaded keys now errors out fast with a "ssh-add your key first" hint instead of silently producing an AuthMethod that fails opaquely later in the SSH handshake.
- M3: the `ss -ltnp pid=` parse for the controller listener PID needs GNU grep AND socket ownership. Add a `pgrep -P` fallback that walks the process tree from the `go run` parent; it works on BSD/macOS and in unprivileged shells where ss returns nothing.
- L1: README now calls out the `bash sudo` prefix in the agent-invocation row (the `bash` keyword escapes EOS Cli into the shell; `sudo` provides CAP_SYS_ADMIN for `ip netns exec`).
- L2: README's "Management netns" prereq now explains that the netns name tracks the VRF name, with `vrf management → ns-management` and `vrf mgmt → ns-mgmt` examples, and cross-references the `DZ_STRESS_DEVICE_MGMT_VRF` env var.
- L3: rewrote `TestRun_WaitsForAgentQuiescenceEvenOnAbort` to emit the synthetic event well before cancellation and assert `elapsed ≥ quietWindow/2`. The previous version would have passed against a regression that reverted either the `err == nil` guard removal or the new elapsed-since-wait-start floor.
- L4: moved `trap cleanup_controller EXIT` to before the `nohup go run ... &` so the trap arms before launch, which closes a tiny window where a `set -e` failure between launch and trap arming could orphan the controller. `cleanup_controller` now reads both pid vars with `:-` defaults so a pre-launch fire is a no-op rather than a `kill ""` error.
- N1: comment on the orchestrator's new agent flags now references `tools/stress/docker/device/agent-wrapper.sh` for readers following the wrapper trail.
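
The quiescence-wait floor is easiest to see as a predicate over three timestamps. The sketch below assumes that shape; `quiescent`, `at`, and the 60-second `quietWindow` are hypothetical and chosen only to make the three cases easy to read.

```go
package main

import (
	"fmt"
	"time"
)

// quietWindow is an illustrative value, not the harness's configured window.
const quietWindow = 60 * time.Second

// at converts a second offset into a timestamp so the cases read as a timeline.
func at(sec int) time.Time { return time.Unix(int64(sec), 0) }

// quiescent requires BOTH conditions described above: the agent has been
// silent for at least the window, and the wait itself has been running for
// at least the window.
func quiescent(now, lastEvent, waitStart time.Time, window time.Duration) bool {
	silentLongEnough := now.Sub(lastEvent) >= window
	waitedLongEnough := now.Sub(waitStart) >= window
	return silentLongEnough && waitedLongEnough
}

func main() {
	// Agent last logged at t=0; the wait starts at t=65 during a slow commit.
	// At t=70 the silence test alone would pass, but the floor blocks it.
	fmt.Println(quiescent(at(70), at(0), at(65), quietWindow))

	// At t=125 the wait has run a full window and the agent is still silent.
	fmt.Println(quiescent(at(125), at(0), at(65), quietWindow))

	// A fresh agent event at t=100 restarts the silence clock.
	fmt.Println(quiescent(at(130), at(100), at(65), quietWindow))
}
```

The first case is the regression the floor guards against: an agent that went quiet before the wait began would otherwise end the wait immediately.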

nikw9944 added the skip-changelog label Jun 3, 2026

nikw9944 marked this pull request as draft June 3, 2026 16:22

nikw9944 requested a review from elitegreg June 3, 2026 16:29

nikw9944 linked an issue Jun 3, 2026 that may be closed by this pull request

stress: implement tools/stress/scripts/test-setup #3748

Closed

nikw9944 force-pushed the nikw9944/stress-test-harness branch from 1915a91 to 3b1498e Compare June 3, 2026 18:18

elitegreg approved these changes Jun 3, 2026

Comment thread tools/stress/scripts/run-stress-physical.sh Outdated

Comment thread tools/stress/scripts/run-stress-physical.sh Outdated

nikw9944 force-pushed the nikw9944/stress-test-harness branch from 3b1498e to ab48f61 Compare June 3, 2026 18:39

nikw9944 force-pushed the nikw9944/stress-physical-harness branch from 8bb0df4 to 4e3441d Compare June 3, 2026 18:48

nikw9944 marked this pull request as ready for review June 3, 2026 18:50

Base automatically changed from nikw9944/stress-test-harness to main June 3, 2026 18:56

nikw9944 added 9 commits June 4, 2026 01:22

nikw9944 force-pushed the nikw9944/stress-physical-harness branch from 5d0734e to 31e8177 Compare June 4, 2026 01:22

nikw9944 enabled auto-merge (squash) June 4, 2026 12:45

nikw9944 merged commit fbce2bf into main Jun 4, 2026
38 of 39 checks passed

nikw9944 deleted the nikw9944/stress-physical-harness branch June 4, 2026 12:53
