tools/stress: physical-device harness driver #3829

Merged
nikw9944 merged 9 commits into main from nikw9944/stress-physical-harness
Jun 4, 2026

Contributor

@nikw9944 nikw9944 commented Jun 3, 2026

Summary of Changes

  • New tools/stress/scripts/run-stress-physical.sh driver that runs the same orchestrator + observer (from #3796 / #3820 / #3821) against a physical Arista EOS device instead of the containerized cEOS. Idempotent serviceability init, idempotent device + loopback create, controller launch via go run against the devnet ledger, parallel access-pass setup, orchestrator + observer launch with a port-conflict fail-fast and a trap that cleans up the controller on any pre-launch error.
  • Four additive flags on the device-orchestrator so the SSH-exec'd agent command can be made self-contained on a DUT with no wrapper script: --agent-binary, --agent-command-prefix, --agent-pubkey, --agent-metrics-addr. All default to empty so the containerized harness path is byte-for-byte unchanged.
  • ssh-agent fallback in the orchestrator's SSH runner: when the configured key file is passphrase-protected (Go's ssh.ParsePrivateKey can't prompt), the runner connects to $SSH_AUTH_SOCK and uses the agent's loaded signers. Empty-agent and unreachable-agent paths fail fast with actionable diagnostics instead of opaque SSH handshake errors.
  • Post-deprovision agent quiescence wait now engages on the abort path too (not just clean success) and enforces a minimum window after deprovision returns. Without this, the observer's abort sentinel (legitimate or false-positive) could kill the SSH session mid-commit and strand tunnels in the device's running-config even though onchain deprovision completed.
  • README appendix on physical-DUT prerequisites: SSH user with passwordless sudo + bash shell, the agent binary on disk, EOS SDK RPC daemon in the management VRF, eAPI http-commands with credentials, and the management netns. Includes the env-var knob table.

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
| --- | --- | --- | --- |
| Core logic | 2 | +121 / -26 | +95 |
| Scaffolding | 2 | +534 / -5 | +529 |
| Tests | 1 | +91 / -0 | +91 |
| Docs | 1 | +142 / -0 | +142 |
| Total | 6 | +888 / -31 | +857 |

Mostly scaffolding (the new driver script is ~500 of the 857 net lines); core-logic changes are concentrated in two files (ssh-agent fallback + quiescence-wait predicate) and accompanied by a test rewrite covering the new branch.

Testing Verification

  • TestRun_WaitsForAgentQuiescenceEvenOnAbort exercises both axes of the new branch: that the wait engages when ctx is cancelled (abort path), and that the elapsed-since-wait-start floor blocks even when the absolute "silent for ≥ window" predicate would return instantly.
  • TestRun_WaitsForAgentQuiescenceAfterDeprovision and TestRun_SkipsQuiescenceWaitWhenNoAppliedObserved continue to assert the success and no-event paths respectively.
  • Validated end-to-end against chi-dn-dzd5: 4-user sweep completes cleanly with 0 stranded tunnels; 512-user sweep provisions + deprovisions all 512 onchain. The 1024-user ceiling is a hardware limit on this DUT model (Tunnel ID names cap at 1023; with StartUserTunnelNum=500 that's 524 usable users on chi-dn-dzd5 specifically — different physical DUTs have higher caps).

@nikw9944 nikw9944 marked this pull request as draft June 3, 2026 16:22
@nikw9944 nikw9944 requested a review from elitegreg June 3, 2026 16:29
@nikw9944 nikw9944 linked an issue Jun 3, 2026 that may be closed by this pull request
@nikw9944 nikw9944 force-pushed the nikw9944/stress-test-harness branch from 1915a91 to 3b1498e Compare June 3, 2026 18:18
Contributor

@elitegreg elitegreg left a comment

Reviewed. The orchestrator change is minimal and cleanly backward compatible — with all three new flags at their defaults the agent command is byte-identical to the existing containerized path. The driver script is a faithful sibling of run-stress-local.sh. Approving.

Two items worth addressing (left inline):

  • --max-users is only set in the device-create branch, so reruns with a larger --target-users won't scale the onchain cap.
  • The controller EXIT trap disarms as soon as the gRPC port is up, orphaning the go run controller if a later phase fails.

Comment thread tools/stress/scripts/run-stress-physical.sh Outdated
Comment thread tools/stress/scripts/run-stress-physical.sh Outdated
@nikw9944 nikw9944 force-pushed the nikw9944/stress-test-harness branch from 3b1498e to ab48f61 Compare June 3, 2026 18:39
nikw9944 added a commit that referenced this pull request Jun 3, 2026
Smoke-testing the harness against the real Arista device surfaced four
defaults that were wrong for non-containerized hardware, and Greg
pointed out two robustness issues in the script's controller lifecycle.

Smoke-test fixes:

- DUT defaults: DUT_HOST=10.0.0.15 (the real chi-dn-dzd5 address — I
  had misread "the device can reach 10.0.0.141" as the device's
  address; 10.0.0.141 is the host running the controller). DUT_SSH_USER
  defaults to `nik` and DUT_SSH_KEY to `$HOME/.ssh/nik@malbeclabs.com`.

- AGENT_BINARY defaults to /mnt/flash/doublezero-agent: persistent
  across EOS reboots and writable by sudo. AGENT_COMMAND_PREFIX now
  defaults to `bash sudo /sbin/ip netns exec ns-management` — `bash`
  escapes EOS's RunCli login shell into the underlying shell, and
  `sudo` provides the CAP_SYS_ADMIN that `ip netns exec` requires.

- DEVICE_PUBLIC_IP defaults to 9.210.180.5: a globally-routable IP
  (the program's CreateDevice rejects RFC1918) outside the device's
  own dz_prefix /29 (the program also rejects overlap). The previous
  default ($DUT_HOST, an RFC1918 address) failed both checks.

- device-tunnel-block changed from 9.210.180.0/24 to 172.16.0.0/16
  (the standard from smartcontract/test/start-test.sh). The program
  auto-allocates loopback IPs from this block on plain-loopback
  creates (interface/create.rs:213-218) and validates `is_private()`
  on the result (interface.rs:494). 9.210.180.0/24 is globally
  routable, so the validator rejected every allocation — the
  controller then refused to render config with `no or invalid VPNv4
  loopback interface`.

- SSH command prefixed with `bash`: EOS pins admin's NSS shell to
  RunCli, so `ssh nik@dut "test -x ..."` would be parsed by Cli, not
  bash. `bash test -x …` runs the rest under bash.

- ssh.go grows an ssh-agent fall-back: when the configured key file
  is passphrase-protected, the runner now connects to
  $SSH_AUTH_SOCK and uses the agent's signers. The old code aborted
  with "this private key is passphrase protected" because Go's
  ssh.ParsePrivateKey can't prompt for a passphrase. An operator
  with an encrypted key just needs `ssh-add <key>` once.

- Loopback create no longer passes --ip-net: the program rejects
  user-supplied ip_net on plain loopbacks (interface/create.rs:155-
  162). Also the loopback-create branch now actually checks exit
  status / error markers rather than blindly logging "registered"
  on any non-"already exists" output.

Review feedback (#3829):

- @elitegreg: `dz device update --max-users` only ran inside the
  `if ! dz device get` create branch, so reruns with a larger
  --target-users would leave the onchain cap stuck. Moved the
  update outside the create guard so every run resyncs it.

- @elitegreg: `trap - EXIT` was disarming the controller-cleanup
  trap the moment the gRPC port came up, leaving subsequent phases
  (access-pass setup, build, orchestrator + observer launch) able
  to orphan the controller under `set -e`. The disarm now happens
  only after the orchestrator has launched, so the trap covers
  the entire pre-orchestrator window.
@nikw9944 nikw9944 force-pushed the nikw9944/stress-physical-harness branch from 8bb0df4 to 4e3441d Compare June 3, 2026 18:48
@nikw9944 nikw9944 marked this pull request as ready for review June 3, 2026 18:50
Base automatically changed from nikw9944/stress-test-harness to main June 3, 2026 18:56
nikw9944 added 9 commits June 4, 2026 01:22
Adds a sister script to run-stress-local.sh that exercises the same
orchestrator + observer against a real Arista EOS device. The physical
case differs from the containerized one in three substantive ways:

- Ledger / serviceability live on devnet (DZ_RPC_URL + DZ_PROGRAM_ID),
  with the program pre-deployed but not yet initialized. The script
  walks the same init steps as e2e/internal/devnet/smartcontract_init.go
  (global-config, one location, one exchange, contributor co01), each
  guarded with a `get`-then-skip so reruns are safe.

- The controller runs natively on the host via
  `go run controlplane/controller/cmd/controller start ...` rather than
  in a container; the script handles startup/teardown and waits for the
  gRPC port. CONTROLLER_ADVERTISE_ADDR auto-detects the host's address
  routable from the DUT (override if needed).

- The DUT has no agent wrapper, so the orchestrator must build the
  full agent command itself. Three additive flags get this done while
  leaving the containerized harness path completely unchanged when none
  are set:

    --agent-binary           path of doublezero-agent on the DUT
    --agent-command-prefix   prepended verbatim (e.g.
                             "/sbin/ip netns exec ns-management")
    --agent-pubkey           appended as `-pubkey <value>` to the
                             agent command

  The containerized wrapper script that injects -pubkey from
  /etc/doublezero/agent/pubkey continues to work because all three
  flags default to empty / "doublezero-agent" / empty.

The script is structured to be familiar to anyone who has read
run-stress-local.sh — same `log()` helper, same `xargs -P` access-pass
loop (target swapped from `docker exec manager doublezero` to
`doublezero --url ... --program-id ...`), same final stanza printing
pids + log paths.

This PR targets nikw9944/stress-test-harness so the existing reviewed
PRs (#3796 / #3820 / #3821) aren't perturbed.
… the doublezero CLI default

Program IDs aren't checked in. The stress-test serviceability program ID
lives in the private infra repo; require operators to export
DZ_PROGRAM_ID explicitly and fail fast with a clear message when it's
unset (`${DZ_PROGRAM_ID:?...}`). The README's quick-start no longer
shows the previous program ID — it'll get stale anyway since the
program needs to be redeployed (`device_tunnel_block` is immutable
after init).

SOLANA_KEYPAIR now defaults to ~/.config/doublezero/id.json (the
doublezero CLI's default keypair location) instead of
~/.config/solana/id.json. The script's `dz` wrapper reads the
doublezero default implicitly, so matching its location keeps every
signing call — init, device create, access passes, orchestrator — on
the same operator authority by default. The solana CLI default may
hold a different key entirely.
Two issues surfaced in the smoke run against chi-dn-dzd5.

Port-conflict fail-fast: the script's `nohup go run controller ...`
will silently fail to bind if a leftover controller from a previous
run is still squatting on the listen port. The readiness check then
succeeds against the stale process, the script proceeds, and the
orchestrator talks to a controller pointing at the wrong program /
RPC. Add an `ss -ltn 'sport = :$PORT'` check before launch — if any
listener is present, die with a hint pointing at the prior run's
$CONTROLLER_PID_FILE.

Agent metrics flag: the containerized harness ships a device-side
wrapper that injects `-metrics-enable -metrics-addr :50100`. The
physical harness has no wrapper, so the orchestrator's SSH-exec'd
agent command runs without metrics — the observer correctly
reported "connection refused" scraping :50100. Add a new
`--agent-metrics-addr` orchestrator flag mirroring the additive
shape of `--agent-pubkey` and `--agent-command-prefix`: empty
default preserves the containerized path (wrapper still handles
it); the physical script sets it to `:$AGENT_METRICS_PORT` so the
agent listens where the observer scrapes.
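
How the four additive flags might combine into one SSH-exec'd command line can be sketched as follows. This is a guess at the shape, not the orchestrator's code: the struct, the function name `buildAgentCommand`, the argument ordering, and the assumption that `--agent-metrics-addr` expands to `-metrics-enable -metrics-addr <addr>` (the pair the containerized wrapper is said to inject) are all mine. `EXAMPLE_PUBKEY` is a placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// AgentFlags mirrors the four orchestrator flags named in this PR.
type AgentFlags struct {
	Binary        string // --agent-binary
	CommandPrefix string // --agent-command-prefix (prepended verbatim)
	Pubkey        string // --agent-pubkey
	MetricsAddr   string // --agent-metrics-addr
}

// buildAgentCommand assembles the command string. baseArgs stands for
// whatever arguments the orchestrator already passes to the agent.
func buildAgentCommand(f AgentFlags, baseArgs []string) string {
	var parts []string
	if f.CommandPrefix != "" {
		parts = append(parts, f.CommandPrefix)
	}
	bin := f.Binary
	if bin == "" {
		bin = "doublezero-agent" // containerized path: wrapper on PATH
	}
	parts = append(parts, bin)
	parts = append(parts, baseArgs...)
	if f.Pubkey != "" {
		parts = append(parts, "-pubkey", f.Pubkey)
	}
	if f.MetricsAddr != "" {
		parts = append(parts, "-metrics-enable", "-metrics-addr", f.MetricsAddr)
	}
	return strings.Join(parts, " ")
}

func main() {
	// All flags at their defaults: the command is unchanged.
	fmt.Println(buildAgentCommand(AgentFlags{}, nil))

	// Physical DUT: every piece is spelled out because there is no wrapper.
	fmt.Println(buildAgentCommand(AgentFlags{
		Binary:        "/mnt/flash/doublezero-agent",
		CommandPrefix: "bash sudo /sbin/ip netns exec ns-management",
		Pubkey:        "EXAMPLE_PUBKEY",
		MetricsAddr:   ":50100",
	}, nil))
}
```

Keeping each flag a no-op when empty is what lets the containerized path stay unchanged while the physical path becomes self-contained.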
Real EOS devices require HTTP basic auth on eAPI; the containerized
cEOS accepts an empty password and the script was hardcoding that.
Add EAPI_USER (defaults to $DUT_SSH_USER) and EAPI_PASS (empty by
default — preserves the containerized path) so the observer can
authenticate against a physical DUT without code changes.
…EAPI_USER to admin

Adds a "Prerequisites on the physical DUT" section that lists what the
operator must have configured on the device before running the script:
SSH access + bash user with passwordless sudo, the doublezero-agent
binary on disk (with build command), the EOS SDK RPC agent
(daemon eapilocal + management api eos-sdk-rpc) in VRF management,
eAPI http-commands with a usable login, and the ns-management netns.
The containerized harness gets away without these because cEOS
provides the SDK implicitly and the harness renders its own
startup-config; on a physical device they must already be there.

Also documents EAPI_USER / EAPI_PASS in the env-var table, fixes the
EAPI_USER default from $DUT_SSH_USER (a bash-shell operator account,
typically not in eAPI's user table) to `admin` (the convention), and
mentions reading EAPI_PASS interactively in the quick-start.
Two issues from running against chi-dn-dzd5.

Pidfile bug: the script's controller startup is `nohup go run ...`,
which compiles a temp binary and exec's it as a child of `go run`.
We stored `$!` (the `go run` parent's PID) in the pidfile, but the
port listener is the child. `kill $(cat controller.pid)` then killed
only the shim and left the listener — making subsequent runs trip
the new port-conflict fail-fast check. After the port comes up,
`ss -ltnp` to discover the listener's PID and overwrite the pidfile
with it. Cleanup now tries the listener pid first (preferred —
killing it makes `go run` exit too) and falls back to the parent.

mgmt-vrf mismatch: I had `--mgmt-vrf mgmt` hardcoded (matching
cEOS), so the controller rendered `ntp server vrf mgmt ...` and
real EOS rejected it at command 23/253 (the device's VRF is named
`management` in full). Pull `DEVICE_MGMT_VRF` out as an env var
defaulting to `management`, use it in both create and update, and
include it in the rerun-resync block (same pattern as max-users).
This recovers from a wrong-vrf onchain device on the next run
without requiring an out-of-band fix.
…ort path too

The 512-user run against chi-dn-dzd5 stranded 256 tunnels on the
device even though deprovision completed onchain. Sequence:

  1. observer's device_tunnel_gap trigger fired
     (false positive — `show gre tunnel static` returns empty when
     tunnels are down, even though they exist in running-config)
  2. orchestrator detects the abort sentinel; provision returns
     context.Canceled (err != nil)
  3. deprovision runs to completion under WithoutCancel — 512
     deprovision_activates onchain
  4. controller polls onchain state, generates a final config
     (~14k lines, ~400 KB) and pushes it to the agent
  5. agent receives the config (visible in orchestrator.agent.log)
  6. orchestrator's quiescence wait is GUARDED by
     `if err == nil && depErr == nil && agentErr == nil { wait }`.
     err == context.Canceled, so wait is skipped.
  7. agentCancel() runs immediately; SSH session dies; the agent
     never gets to commit the 14k-line config.
  8. Tunnel500..Tunnel755 stay in the running-config.

This is the same class of bug commit 0791129 originally fixed
for the success path. The abort triggers exist to flag suspicious
off-device conditions ("agent silent for 120s", "device tunnel
count doesn't match orchestrator") — they don't mean "kill the
agent mid-commit". When deprovision has already completed onchain,
we should give the agent its normal quiescence window to drain
whatever the controller pushed before tearing down the SSH session.

Two-line change:
  - Drop the `err == nil` requirement from the wait predicate
    (only `depErr == nil && agentErr == nil` matter — deprovision
    completed and the stream is healthy).
  - When ctx is already cancelled (abort path), pass a
    WithoutCancel'd context to the wait so it doesn't immediately
    bail at the first `<-ctx.Done()`. The
    AgentQuiescenceTimeout still caps total wait, and a signal-
    impatient operator can re-Ctrl-C the orchestrator.

Adds TestRun_WaitsForAgentQuiescenceEvenOnAbort proving the wait
engages on the ctx-cancellation path while deprovision still
completes.
Quiescence-wait floor (the (a) fix from the smoke-test rerun): the
existing predicate "agent silent for ≥ quietWindow" returned instantly
when the agent had been silent BEFORE the wait engaged — exactly the
shape seen against chi-dn-dzd5 where slow EOS commits take 60+ s and
the agent emits no log lines during them. Add an elapsed-since-wait-
start floor so the wait always blocks at least `quietWindow` after
deprovision returns, giving the agent its full window to commit any
post-deprovision controller config push. Updates the matching test to
assert both branches are observed.

Review feedback (#3829 follow-up):

- M1: `agentAuthMethod` was leaking its ssh-agent unix-socket conn.
  Return the conn alongside the AuthMethod, stash it on the SSH
  struct, close it in `shutdown()` (which is already idempotent).
- M2: an ssh-agent that's reachable but has zero loaded keys now
  errors out fast with a "ssh-add your key first" hint instead of
  silently producing an AuthMethod that fails opaquely later in the
  SSH handshake.
- M3: the `ss -ltnp pid=` parse for the controller listener PID
  needs GNU grep AND socket ownership. Add a `pgrep -P` fallback
  that walks the process tree from the `go run` parent — works on
  BSD/macOS and in unprivileged shells where ss returns nothing.
- L1: README now calls out the `bash sudo` prefix in the agent-
  invocation row (the `bash` keyword escapes EOS Cli into the shell;
  `sudo` provides CAP_SYS_ADMIN for `ip netns exec`).
- L2: README's "Management netns" prereq now explains that the
  netns name tracks the VRF name, with `vrf management → ns-management`
  and `vrf mgmt → ns-mgmt` examples, and cross-references the
  `DZ_STRESS_DEVICE_MGMT_VRF` env var.
- L3: rewrote `TestRun_WaitsForAgentQuiescenceEvenOnAbort` to emit
  the synthetic event well before cancellation and assert
  `elapsed ≥ quietWindow/2`. The previous version would have
  passed against a regression that reverted either the `err == nil`
  guard removal or the new elapsed-since-wait-start floor.
- L4: moved `trap cleanup_controller EXIT` to before the
  `nohup go run ... &` so the trap arms before launch — closes a
  tiny window where a `set -e` failure between launch and trap
  arming could orphan the controller. `cleanup_controller` now
  reads both pid vars with `:-` defaults so a pre-launch fire is
  a no-op rather than a `kill ""` error.
- N1: comment on the orchestrator's new agent flags now references
  `tools/stress/docker/device/agent-wrapper.sh` for readers
  following the wrapper trail.
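
The M1 and M2 items concern the ssh-agent socket. A stdlib-only sketch of the connection half is shown here: fail fast with a hint when the socket is unset or unreachable, and hand the conn back to the caller so it can be closed on shutdown. The function name `dialAgentSocket` and the error wording are assumptions; the key-listing step (the "zero loaded keys" check) would need `golang.org/x/crypto/ssh/agent` and is left out to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

// dialAgentSocket connects to the ssh-agent unix socket at sock.
// The caller owns the returned conn and must close it when done.
func dialAgentSocket(sock string) (net.Conn, error) {
	if sock == "" {
		return nil, errors.New("SSH_AUTH_SOCK is not set: start ssh-agent and run ssh-add <key>")
	}
	conn, err := net.Dial("unix", sock)
	if err != nil {
		return nil, fmt.Errorf("ssh-agent at %q is unreachable: %w", sock, err)
	}
	return conn, nil
}

func main() {
	conn, err := dialAgentSocket(os.Getenv("SSH_AUTH_SOCK"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Closing here stands in for the runner's shutdown path.
	defer conn.Close()
	fmt.Println("connected to ssh-agent socket")
}
```

Returning the conn rather than hiding it inside the auth method is what makes the leak in M1 fixable: whoever holds it can close it.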
@nikw9944 nikw9944 force-pushed the nikw9944/stress-physical-harness branch from 5d0734e to 31e8177 Compare June 4, 2026 01:22
@nikw9944 nikw9944 enabled auto-merge (squash) June 4, 2026 12:45
@nikw9944 nikw9944 merged commit fbce2bf into main Jun 4, 2026
38 of 39 checks passed
@nikw9944 nikw9944 deleted the nikw9944/stress-physical-harness branch June 4, 2026 12:53

Development

Successfully merging this pull request may close these issues.

stress: implement tools/stress/scripts/test-setup
