feat(launch): POST /api/admin-pubkey to flip on passwordless gateway auth#6
Conversation
…auth Followup #2 from PR #5: lets a tenant admin bind their Noise static pubkey to enable passwordless ZTLP gateway auth without SSHing into the host to hand-edit instance.env. POST /api/admin-pubkey body (x-www-form-urlencoded): token=<claim_token>&pubkey_hex=<64 lowercase hex> returns 200 application/json: {status:ok, slug, applied:true} returns 400 / 401 / 404 / 500 with {error,detail?} Auth model: holder of the original claim_token. The token was shown once to the admin at /start time and is stored only as an HMAC digest in the DB, so reusing it as the auth credential here does not widen the trust boundary — anyone with the raw token already has full control of this tenant. Validation mirrors the Rust gateway's --admin-pubkey-email check: exactly 64 lowercase hex chars decoding to a 32-byte X25519 public key. Uppercase hex is normalized; off-by-one lengths, non-hex chars, and missing/invalid tokens all fail fast with clear errors. Side effects on success: 1. Rewrites the ZTLP_ADMIN_PUBKEY_HEX= line in instance.env in place (preserves all other operator-set keys). 2. Runs 'docker compose up -d --force-recreate gateway' in the instance dir. --force-recreate is required — a plain 'up -d' is a no-op when the image+config hash is unchanged and changing an env_file value alone does not bust that hash for already- running containers. Followup #1: also document inline why the chmod 600 on secrets.env is compatible with the gateway's env_file mount. docker-compose v2 reads env_file in the CLI process (same user that wrote it), not in the container, so the gateway never needs read perms on the file — only the Launch process user invoking 'docker compose up' does. Refactors: extract _slug_for_row and _instance_dir_for_slug helpers so the new handler and _provision_zone_dockers compute paths the exact same way. The test fixture's _fake_run now records calls into self._subprocess_calls so we can assert on the docker recreate. Tests: 7 new cases (45 total, was 38) covering missing/invalid tokens, bad-hex shapes, uppercase normalization, env+recreate side effects, rebind overwriting prior value, and 404 when the admin hits the endpoint before clicking the claim link.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Detailed description: - Updated the handoff document to capture the state of the passwordless authentication project. - Documented that PR #5 and PR #6 were completed and merged. - Logged the outstanding block: the 'Nebula dumb-pipe' architecture is routing around the L7 HTTP Injection layer (), breaking auto-login. - Established the immediate next steps: do not guess code fixes; instead, turn on aggressive trace logging / tcpdump on the AWS test server to measure exactly how TCP packets behave across the tunnel. - Documented the uncommitted hot-patch variable escape remaining on the AWS server.
What changed: - Define new ZTLP control frame `FRAME_CLIENT_ROUTE` (magic 0x5A 0x37 0x0B) in proto/src/bin/ztlp-cli.rs alongside existing GATEWAY_REGISTER constants. - Add `build_client_route_packet` builder + `parse_client_route_packet` helper (test-only) with HMAC-SHA256 reusing the GATEWAY_REGISTER scheme. - Validate svc_len bounds (1..=63), magic, type, and truncation. - 8 unit tests cover round-trip in dev and HMAC modes plus all rejection paths (empty/oversize service name, bad magic, bad type, truncated bytes). Why: - α-relay milestone (per handoff Task 1) — this frame replaces the rejected ALPN-byte-regex / dummy-Noise-UDP-HELLO approach (handoff Decisions #3). - Reusing the GATEWAY_REGISTER HMAC scheme keeps new crypto surface area to zero (handoff Decisions #4). Implementation details: - Wire layout: magic(2) | type(1) | node_id(16) | svc_len(1) | service(N) | timestamp(8 BE i64) | hmac(32). Min 61 bytes, max 124. - HMAC covers [type | node_id | svc_len | service | timestamp]; zeroed in dev mode, computed via hmac+sha2 crates (already deps) when a secret is provided. - Includes the client's NodeID for ACL/audit (Open Question #1, Steve's tenant-isolation preference). Tests added: - client_route_packet_roundtrip_dev_mode - client_route_packet_hmac_is_set_with_secret - client_route_rejects_empty_service_name - client_route_rejects_oversized_service_name - client_route_parse_rejects_bad_magic - client_route_parse_rejects_bad_type - client_route_parse_rejects_truncated_packet - client_route_max_length_service_name_succeeds Validation performed: - cargo test -p ztlp-proto --bin ztlp client_route → 8/8 passed - cargo test -p ztlp-proto --bin ztlp → 13/13 passed - cargo test -p ztlp-proto --lib → 857/857 passed - cargo build -p ztlp-proto --bin ztlp → clean (31 pre-existing warnings, no new ones) Follow-up needed: - Wire `Config.registration_secret`-equivalent into the CLI so production HMAC isn't always zeroed. - Phase C multi-service routing. - v0.27.1 parity audit. Known limitations: - iOS NE path still uses legacy Noise-UDP — out of scope for this change (handoff Decisions #6).
What: - Bump ztlp-proto from 0.28.5 -> 0.29.0 (proto/Cargo.toml) - Bump ztlp_relay from 0.24.0 -> 0.29.0 (relay/mix.exs) to align with the Rust crate and the GitHub release tag. - Restore proto/tests/handshake_tests.rs by extending HandshakeResult with `initiator_session` / `responder_session` aliases so the integration tests compile against the current shape without losing the in-process responder-side key check that the legacy harness asserted. - Apply rustfmt to ztlp-cli.rs / quic_transport.rs / tunnel.rs / quic_transport_test.rs so CI's \`cargo fmt -- --check\` step stays green. Why: - v0.29.0 ships the FRAME_CLIENT_ROUTE α-relay routing primitive end-to-end: CLI now installs a client_map ETS entry on the relay before the QUIC handshake, replacing the rejected dummy-Noise-UDP-HELLO hack from the previous session (handoff Decisions #3). Manual E2E validated against staging relay 34.218.240.106 plus tenant gateway gw-hermese2e-v at 35.91.88.177:40416, with curl returning HTTP 200 from the Rails bootstrap through the tunnel. - The handshake_tests.rs file pre-dated the current HandshakeResult shape and was failing `cargo clippy --all-targets` on this branch; ignoring it would have broken CI immediately after merge. Implementation details: - HandshakeResult keeps `session` for back-compat (any existing caller still works) and adds `initiator_session` (alias for the same in-process slot) and `responder_session` (populated only when both sides exist locally). Network-only handshakes (quic_transport.rs) fill both alias fields with the local session - those code paths only ever read `.session` / `.session_id` so the extra copies are inert. Tests: - cargo test --lib: 857 passed, 0 failed, 12 ignored - cargo clippy --all-targets: clean (warnings only, no errors) - cargo fmt -- --check: clean - mix test in relay/: 588 passed, 0 failed Validation performed: - α-relay E2E (THIS RELEASE'S MILESTONE): ztlp connect 34.218.240.106:23095 --service gw-hermese2e-v --quic -L 8083:127.0.0.1:3000 curl http://localhost:8083/ -> HTTP/1.1 200 OK + autologin session cookie Relay log: \"[UdpListener] CLIENT_ROUTE accepted: client=... service=gw-hermese2e-v gateway={{35, 91, 88, 177}, 40416}\" - β-direct regression: ztlp connect 35.91.88.177:40416 --service gw-hermese2e-v --quic -L 8084:127.0.0.1:3000 curl http://localhost:8084/ -> HTTP/1.1 200 OK Follow-up needed: - Phase C multi-service ServiceRegistry routing (handoff Task 2) - v0.27.1 parity audit (handoff Task 3) - `pick_gateway_for_service` round-robin fallback when the requested service name has no live gateway picks any registered gateway. That makes routing non-deterministic when the requested tenant is momentarily not heartbeating. Tighten to "error out and let the client retry" once we have client-side service-not-registered handling. Known limitations: - iOS NE path still uses legacy Noise-UDP (per handoff Decisions #6).
… with regression pins (#14) What - Bumps `gateway/mix.exs` and `ns/mix.exs` from 0.24.0 → 0.29.4. - Bumps `proto/Cargo.toml` from 0.29.3 → 0.29.4. - Adds the same "version reporting (regression pin)" describe block to `gateway/test/ztlp_gateway/release_test.exs` and `ns/test/ztlp_ns/release_test.exs` that PR #13 added for relay (semver shape + runtime-vs-declared drift + ≥0.29.4 floor guard). - Adds `proto/tests/version_pin_test.rs` with two equivalent Rust-side pins (parseable semver + ≥0.29.4 floor guard) using the in-crate `ztlp_proto::updater::SemVer` to avoid taking a new external dependency. Why - PR #13 fixed the relay's version-string drift but explicitly left the rest of the tree as follow-up scope (handoff Known Problems #6: "gateway/mix.exs and ns/mix.exs at 0.24.0; proto/Cargo.toml at 0.29.3"). Gateway and NS had been pinned at 0.24.0 for five minor versions, so `Application.spec(:ztlp_gateway, :vsn)` and `Application.spec(:ztlp_ns, :vsn)` were lying about which tag was actually deployed — the exact bug class PR #13 fixed for relay. Same lie, three more components. - Without regression pins on these other components, the same drift can silently recur on the next tag cut. Details - Floor guard uses `Version.compare(declared, "0.29.4") in [:gt, :eq]` (Elixir) / `actual.cmp(&floor)` (Rust) rather than asserting a literal version string. This is deliberate: literal-string assertions become test maintenance burden on every routine bump, whereas a floor guard only fails on accidental down-bumps below the v0.29.4 strict-routing tag. - Runtime-vs-declared drift test catches both directions of the bug PR #13 found: (a) mix.exs bumped but .app cache stale, and (b) tag cut without bumping mix.exs. - Rust-side test re-uses the in-crate `updater::SemVer` parser rather than pulling in the external `semver` crate. This exercises the same code path the self-updater uses, so if that parser ever regresses both will catch it. - mix.exs comment blocks explain the bump inline so future readers see the on-call story at the point of change, not just in the test file or handoff. - No source-code logic changed; this is a version-string + test-only PR. Public API, wire format, and runtime behavior are untouched. Tests (TDD discipline followed) - RED step verified per-component before bump: * gateway: `mix test test/ztlp_gateway/release_test.exs --seed 0` → "mix.exs version 0.24.0 is older than the v0.29.4 strict-routing tag" (1/15 failures, all in the new floor-guard test, exactly as designed). * ns: `mix test test/ztlp_ns/release_test.exs --seed 0` → "mix.exs version 0.24.0 is older than the v0.29.4 strict-routing tag" (1/15 failures, identical message). * proto: `cargo test --test version_pin_test` → "proto/Cargo.toml version 0.29.3 is older than the v0.29.4 strict-routing tag" (1/2 failures, parseable-semver test still green so we know the fixture is sane). - GREEN step after bumps: * gateway: 15/15 in release_test.exs; 835/835 in full `mix test` (seed 1). Note: seed 55290 surfaces two pre-existing test-ordering flakes (TLS port reuse in tls_phase2_test.exs and a GenServer teardown race in crl_server_test.exs) — both also reproduce on plain `main` and are unrelated to this PR. Confirmed by stash + checkout `main`-only files + rerun on the same seed. * ns: 15/15 in release_test.exs; 729/729 in full `mix test`. * proto: 2/2 in version_pin_test; 858/858 in `cargo test --lib --release` (matches the pre-branch baseline). Validation (non-test) - `cargo check --release` clean (31 pre-existing dead-code warnings in `proto/src/bin/ztlp-cli.rs` unchanged — Known Problems #4 scope). - Relay suite re-run unchanged: 597/597 (confirms no cross-component collateral damage). - ztlp.net Python suite unchanged: 48/48 in `tests.test_launch_app`. - Infra untouched. No relay restart, no gateway restart, no NS restart. Live binaries still on v0.29.3 (relay) / v0.24.0 (gateway, ns) — actual deployment of the bumped versions is a separate, Steve-gated step. Follow-up - After merge, decide tag strategy with v0.29.4 already in the past: either (a) cut v0.29.5 that includes this + PR #13's mix.exs bump (cleanest), or (b) accept "tag is source of truth; mix.exs is best-effort" for the trailing v0.29.4 and reset on v0.30.0. - Task #3 (per-zone HMAC `Config.registration_secret/0`) is still the prod-readiness blocker and is the hard dependency for the Bootstrap workstream (handoff §"HARD DEPENDENCY uncovered while locking in #5"). - Known Problems #4 (`cargo fix` pass on 31 dead-code warnings in `proto/src/bin/ztlp-cli.rs`) still open; left out of scope here to keep this PR tight. Refs - PR #12 (v0.29.4 strict-routing): 829abdf - PR #13 (relay mix.exs vsn pin): d22afbf - Handoff: ~/hermes_session_handoff.md "Known Problems #6", "Open Question #1"
…lean v0.29.5 tag (#15) What - Bumps `relay/mix.exs`, `gateway/mix.exs`, `ns/mix.exs` from 0.29.4 → 0.29.5. - Bumps `proto/Cargo.toml` from 0.29.4 → 0.29.5. - No test changes — the regression pins added in PR #13 / PR #14 still catch drift (semver-shape + runtime-vs-declared + ≥0.29.4 floor). Why - v0.29.4 tag was cut at commit `829abdf` BEFORE PR #13 and PR #14 bumped the four declared versions. That means a container built from the v0.29.4 tag reports its runtime vsn as the PRE-bump value, which is exactly the on-call-lying bug PR #13 introduced regression tests for. - The fix is to cut a NEW tag whose commit POST-DATES the bumps. v0.29.5 is the smallest possible bump that achieves this — pure version-string alignment, no functional change. - After this PR merges and v0.29.5 is tagged: git tag == relay vsn == gateway vsn == ns vsn == proto CARGO_PKG_VERSION == "0.29.5". The existing `release_test.exs` runtime-vs-declared drift test, run inside any container built from this tag, will report agreement instead of drift. Details - Floor guard in `release_test.exs` and `version_pin_test.rs` stays at 0.29.4. This is deliberate: the floor is a stable lower bound, not the current version. Ratcheting it on every routine bump is the exact maintenance burden the floor-guard pattern was designed to avoid (see the PR #13/#14 commit messages). 0.29.5 satisfies ≥0.29.4, and an accidental down-bump below 0.29.4 (which would lose the strict-routing fix) is what the floor exists to catch. - mix.exs comment blocks updated in-place to reference both PR #14 and this PR, so the version-line provenance is readable at the point of change without chasing PRs in git log. - No source-code logic changed. Public API, wire format, runtime behavior unchanged. Tests (green on this branch) - `cd proto && cargo test --lib --release`: 858 passed - `cd proto && cargo test --test version_pin_test`: 2 passed - `cd relay && mix test`: 597 passed - `cd gateway && mix test --seed 1`: 835 passed (seed 1 avoids the two pre-existing seed-55290 flakes in tls_phase2_test and crl_server_test — both documented in the handoff as Task #6 and unrelated to this PR.) - `cd ns && mix test`: 729 passed - `cd ztlp.net && python3 -m unittest tests.test_launch_app`: 48 passed - Total: 3,069 tests, 0 failures. The runtime-vs-declared drift test in each release_test.exs was the key verification — it exercises the same `Application.spec/2` RPC that lies on a misbuilt container. All three Elixir suites pass after the bump, which means mix recompiled the .app files to match the new declared version. This is the in-source-tree equivalent of the `docker exec ... rpc 'Application.spec(_, :vsn)'` check that found the original PR #13 bug. Validation - `git diff --stat` matches expectation: 4 files, ~23 insertions, ~19 deletions (the changes are mostly comment-block updates around the one-character version bumps). - No source code changes; no risk of breaking runtime behavior. - No infra touched. Follow-up (post-merge) - Tag `v0.29.5` on main, push tag with openclaw SSH key, watch the Release workflow to green. - Deploy `priceflex/ztlp-relay:v0.29.5` to staging relay (`34.218.240.106`) when Steve OKs the restart (iOS bench risk). A container built from v0.29.5 will report `:vsn = "0.29.5"` correctly, closing out the original PR #13 defect end-to-end. - Task #2 (per-zone HMAC `Config.registration_secret/0`) remains the prod-readiness blocker and hard dependency for the Bootstrap workstream. Refs - PR #12 (v0.29.4 strict-routing): 829abdf - PR #13 (relay mix.exs vsn pin): d22afbf - PR #14 (gateway/ns/proto vsn pin): 0a6fbb6 - Handoff: ~/hermes_session_handoff.md "Active Tasks → Task 0" decision list (Option 1 — cut v0.29.5 immediately).
… adding missing PolicyEngine setup (#16) What: - tls_phase2_test.exs:276 ("injects identity headers for mTLS connections in identity mode") was tagged @:flaky and reliably failed when run alone or as part of the suite at seed 55290. - Root cause: the test sends a client cert, which trips TlsSession.check_policy/1 → PolicyEngine.authorize?/2. Without a matching rule the call fell back to "default" → returned false → session rejected with :policy_denied. The TLS client never got a response, the backend's spawn_link :gen_tcp.accept/2 timed out at 15s, and the failure surfaced as a confusing `MatchError {:error, :timeout}` at line 286 — pointing at the acceptor, not at the missing policy rule. - Bug was hidden in full-suite runs at "good" seeds because earlier tests (tls_phase2_test.exs:430, :472, :766) insert `PolicyEngine.put_rule( "localhost", :all)` and that ETS state leaked forward. Why: - Tests must be self-contained. Relying on a previous test's PolicyEngine state is exactly the kind of order-dependency that produces "works on my machine" failures and makes seed 55290 look like a flake when it was actually a deterministic missing-setup bug. - Removing `@tag :flaky` is correct because the test is no longer flaky — the previous behavior was a latent test bug, not real non-determinism. Details: - Added `PolicyEngine.put_rule("localhost", :all)` as the first line of the test body, matching the pattern in :430, :472, :520, :562, :766. - Added an explanatory comment on the describe block documenting the PolicyEngine dependency for identity-mode tests so the next person who writes one of these tests doesn't repeat the omission. - Removed `@tag :flaky` — the test now passes deterministically across all seeds tried (1, 2, 7, 42, 100, 55290) in isolation and as part of the full suite. Tests: - `mix test test/ztlp_gateway/tls_phase2_test.exs:276 --seed 1` → 1 passed (previously: 1 failure / MatchError) - Same with seeds 2, 7, 42, 100, 55290 — all green - `mix test --seed 1` full gateway suite → 835/835 (unchanged from before) - `mix test --seed 55290` full gateway suite → 835/835 (previously 834/835) Validation: - Reproduced the failure deterministically at seed 1 by running the test in isolation (handoff said it failed only at seed 55290; it actually fails on every seed in isolation — the suite seeds 1 ordering happens to leak compatible state forward). - Instrumented TlsSession.check_policy/1 to confirm authorize? returned false in 0ms when no rule was set; instrumentation backed out before commit. Follow-up: - Task #6 from hermes_session_handoff.md is partially resolved (this is the tls_phase2 half of the flake pair). The crl_server_test.exs:39 OnExitHandler race documented in the same task did NOT reproduce in this session at seed 55290; tracking separately. - The broader hardening — `setup` block that resets PolicyEngine ETS state between tests — is deferred. The current fix follows the established pattern (each test inserts its own rules) so it composes cleanly with the rest of the file. Refs: - hermes_session_handoff.md Task #6: "Fix two pre-existing test-ordering flakes in gateway suite" - v0.29.5 baseline (no library code changed)
…ents (BS-PR-6) (#28) What ----- Customer-facing end-to-end documentation and ready-to-copy reference clients for any external system (initial target: Z2LS) that wants to mint enrollment tokens via the ZTLP-secured Bootstrap API. Three artifacts: 1. bootstrap/docs/z2ls_enrollment_runbook.md (574 lines) The single-source guide for integrators: pre-reqs, provisioning an api_clients row, the per-zone HMAC signing contract, the full POST /api/v1/enrollment_tokens flow, redemption paths (CLI / macOS / iOS), an end-to-end smoke procedure, and a comprehensive troubleshooting section keyed on the actual 401 reason codes the authenticator logs. 2. bootstrap/script/z2ls_request_token.py (153 lines) Self-contained Python reference client. Stdlib-only (urllib + hmac + hashlib). Drop into any Z2LS codebase and adapt. 3. bootstrap/script/z2ls_request_token.rb (128 lines) Self-contained Ruby reference client. Stdlib-only (net/http + openssl). Mirrors the Python reference behavior byte-for-byte. Why --- Steve's 2026-05-23 brief item #5/#6: * "Build an API endpoint where Z2LS can request an enrollment token for a new system." (shipped in BS-PR-3 / #26) * "[Workflow] Z2LS sends the computer name to ZTLP Bootstrap [...] ZTLP Bootstrap validates that Z2LS is allowed to communicate with the API using ZTLP-secured communication." BS-PR-3 shipped the server side. This PR ships the documentation + reference signers so customers (and our own Z2LS instances) can hit that endpoint correctly on the first try. The most common integration failure for HMAC-signed APIs is the canonical-message format: this runbook spells out the 6-line layout and the two reference clients demonstrate it concretely. Details ------- * Canonical 6-line signed message matches `Ztlp::ApiAuthenticator#canonical_message`: METHOD\nFULLPATH\nZONE\nCLIENT\nTIMESTAMP\nSHA256_HEX(body) * Per-zone secret env-var slug rule (`zone.upcase.gsub(/[^A-Z0-9]+/, "_")`) matches `ApiAuthenticator.slugify_zone/1` exactly — verified by replicating the rule in both reference clients. * 64-char pure-hex secret values are decoded to 32 raw bytes BEFORE signing (the Bootstrap authenticator does the same; mismatched decoding is the second most common failure mode and has a dedicated troubleshooting bullet). * Runbook documents the 503 case (api_clients row exists, no Network row for the zone) and forward-refs BS-PR-4 for the auto-creation fix. * Helper scripts use only stdlib so they can be dropped into any Z2LS host without dependency installation. Tests ----- No new automated tests — this is documentation + reference scripts. However, both helpers were validated to produce signatures byte-identical to the server-side `Ztlp::ApiAuthenticator`: python sig: c019132d66f52d85173ca9d8811c093b754999afcc8646dd00016e1954ee5201 ruby sig: c019132d66f52d85173ca9d8811c093b754999afcc8646dd00016e1954ee5201 reference: c019132d66f52d85173ca9d8811c093b754999afcc8646dd00016e1954ee5201 (64-char hex key, POST /api/v1/enrollment_tokens, zone=acme.ztlp, client=z2ls.acme, ts=1700000000, body='{"computer_name":"alice-laptop"}') Pre-existing test suite untouched: * bootstrap full suite still 1020/1017 (same 3 pre-existing SshProvisionerTest port-mismatch failures predating this work). * `api/v1/enrollment_tokens_controller_test.rb` — 13/13 pass. Validation ---------- * `python3 -c 'compile(...)'` — script parses. * `ruby -c script/z2ls_request_token.rb` — Syntax OK. * Cross-language signature equivalence verified (see Tests). * Runbook proofread for accuracy against the actual auth + token code paths (slugify rule, hex-decode rule, canonical-message order, fullpath, RFC1035 validator, audit log action names). Follow-up --------- * BS-PR-4 (next): auto-create the per-tenant Network row + matching api_clients row during ztlp.net onboarding so customers don't hit the 503 path the runbook currently documents as a workaround. * If Steve wants a `bin/z2ls-request-token` packaged binary later we can roll the Python helper into a small CLI; for now keep it as a copy-paste reference.
## What CodeRabbit flagged 5 Major findings on PR #91; this commit addresses all of them. Test counts and integration confirmation are unchanged. ### #6 Require explicit opt-in flag for NS self-registration `--service-name` has a default of `ztlp-gateway` for relay routing, so the previous logic auto-activated NS registration on any `--ns-server + --zone` listener — including listeners that should have stayed under external (Chef) NS publication. Added a new `--ns-register-name <fqsn>` flag that is the explicit opt-in: - When omitted: listener does NOT touch NS (previous behavior for non-NS callers preserved). - When set: listener requires all three of `--ns-server`, `--zone`, and `--ns-register-name` before spawning the heartbeat. ### #7 Don't publish unspecified bind addresses as SVC With the default `--bind 0.0.0.0:23095`, `local_addr()` yields `0.0.0.0:23095` and that exact (unroutable) value was being published as the SVC record. Now the listener detects unspecified bind via `listener_addr.ip().is_unspecified()` and publishes KEY only in that case, with a warning at startup pointing operators at the `--bind` flag. ### #8 Fail listener startup when the initial sync publish fails The spec said "fails fast at Windows service startup" but the code only logged a warning and continued. The Err branch now propagates a descriptive error up through `cmd_listen`, which aborts the listener — leaving the Windows service event log with a clear "initial NS publish failed" line instead of a silently-up-but-undiscoverable process. ### #9 Stub NS readiness race in integration test The integration script's fixed `sleep 0.5` after launching the Python stub NS races on slow CI runners. Replaced with a readiness probe that polls the stub's stdout for the "listening on 127.0.0.1:PORT" log line, retries up to 4s, and FAILS LOUDLY if the stub never reports ready (also detects stub process death mid-startup). ### #10 Integration script never fails when expected log line missing The wait loop polled for "NS heartbeat enrolled" or "initial NS publish failed" but had no fallthrough behavior if neither appeared. Added a `FOUND_MSG` flag and explicit failure path with full listener stderr dump when the listener hangs or logs something unexpected. ## Why CodeRabbit reasoning was correct for all five. #6/#7 prevent production junk records, #8 makes the fail-fast contract real, #9/#10 make the integration script trustworthy on CI. ## Details - New CLI flag `--ns-register-name <NS_REGISTER_NAME>` on `ztlp listen`, threaded through `Commands::Listen` destructuring → `cmd_listen` signature → the heartbeat block. - `--service-name` no longer plays double duty; it stays a relay-only flag. - The heartbeat gate now requires all three of (ns_register_name, ns_server, zone) to be `Some`. If any one is missing, the heartbeat is not spawned. - `advertise_addr: Option<String>` propagates Some-only for routable binds; `ns_publish_self(..., advertise_addr.as_ref())` cleanly degrades to KEY-only when the option is None. - Spec doc `docs/plans/2026-06-01-ns-self-register-heartbeat.md` updated to document the explicit opt-in flag and the KEY-only fallback. - Integration script updated to use the new flag. ## Validation ``` cargo fmt --check -> clean cargo test --bin ztlp ns_ -> 24 passed; 0 failed scripts/test-ns-heartbeat-integration.sh: ✓ stub NS ready after 2*0.1s ✓ initial publish observed after 2*0.5s NS stub saw: 4 registers, 2 queries PASS: heartbeat published KEY+SVC to NS stub ``` The 2 existing T3 wire-shape regression guards still pass: ``` test tests::ns_register_key_record_includes_node_id ... ok test tests::ns_register_svc_record_includes_node_id ... ok ``` ## Backward compatibility Existing `ztlp ns register` CLI subcommand unchanged. Existing Chef cookbook runs unchanged (they call `ztlp ns register` explicitly, not `ztlp listen --ns-register-name`). This commit is purely additive at the external surface: a new optional flag and a new failure mode that only triggers when the new flag is set.
* feat(proto): NS self-registration heartbeat in `ztlp listen`
## What
Listener processes now own their own NS presence. When started with
`--ns-server`, `--service-name`, and `--zone`, `ztlp listen` spawns a
background task that publishes its KEY + SVC records to ZTLP-NS at boot
and re-publishes every 8h ± 10min jitter — well below the 24h NS TTL.
External factors (Chef cookbook runs, cron, marker files) are no longer
required to keep a listener reachable via NS lookup.
## Why
Production NS records were silently expiring every 24h because the Chef
cookbook only re-published on hash change, but NS records carry a 24h TTL
(86400s). Real-world impact: many desktops in `tech-rockstars.trs.ztlp`
had KEY records that returned `:not_found` via `Store.lookup` despite
appearing in `:mnesia.foldl` walks — the records had expired and only
their tuples lingered as phantoms.
This shifts liveness-coupled discovery onto the running service itself
(the standard pattern from Consul, etcd, k8s endpoints, Nomad) rather
than relying on external configuration management to remember to
republish.
## Details
- New helper `ns_publish_self(name, zone, identity, ns_server, address)`
factored out of `cmd_ns_register`. Used by both the CLI subcommand
(unchanged behavior) and the new heartbeat task.
- New task `ns_heartbeat_task(name, zone, identity, ns_server, address,
interval, jitter_max)` runs an infinite republish loop with uniform
random jitter to avoid stampedes on shared NS infrastructure.
- `cmd_listen` spawns the heartbeat task when all three of
`--ns-server`, `--service-name`, `--zone` are passed. Initial publish
is synchronous so misconfiguration fails fast at Windows service
startup; subsequent failures are logged at WARN and retried.
- The existing T3 CBOR-shape regression guards
(`ns_register_*_record_includes_node_id`) still pass — the wire format
is unchanged.
Spec: `docs/plans/2026-06-01-ns-self-register-heartbeat.md`
## Tests
5 new unit tests under `proto/src/bin/ztlp-cli.rs` (`mod tests`):
- `ns_publish_self_validates_name_in_zone` — rejects mismatched
name/zone before touching the network.
- `ns_publish_self_constructs_key_packet_with_node_id` — KEY-record
CBOR includes `algorithm`, `node_id`, `public_key`. Decodes via
`ciborium` against a captured UDP packet.
- `ns_publish_self_constructs_svc_packet_when_address_given` — SVC
record only emitted when `address` is `Some`; second packet has
`type_byte == 2`.
- `ns_publish_self_omits_svc_when_no_address` — exactly one register
packet sent in KEY-only mode.
- `ns_heartbeat_task_republishes_on_tick` — heartbeat with a 150ms
test interval emits ≥4 register packets over 700ms (initial + 2
ticks × 2 records each).
Test helper `spawn_capture_ns` wraps a `tokio::net::UdpSocket` that
captures every packet sent to it and replies `0x06` (register ACK) or
`0x02` (lookup found) so the helper's happy path completes.
## Validation
```
cargo test --bin ztlp ns_ -> 24 passed; 0 failed
cargo test --bin ztlp -> 59 passed; 0 failed
cargo test --lib -> 1084 passed; 0 failed; 12 ignored
```
Zero regressions vs. baseline. The 2 existing T3 CBOR-shape guards
(`ns_register_key_record_includes_node_id`,
`ns_register_svc_record_includes_node_id`) still pass — the wire format
is byte-for-byte identical.
## Integration confirmation
`scripts/test-ns-heartbeat-integration.sh` boots the real `ztlp listen`
binary against a Python UDP NS stub on loopback and asserts the wire
behavior end-to-end:
```
[stub_ns] listening on 127.0.0.1:23196
Waiting for initial NS heartbeat...
✓ initial publish observed after 2*0.5s
NS stub saw: 4 registers, 2 queries
PASS: heartbeat published KEY+SVC to NS stub
=== listener stderr ===
ZTLP QUIC server listening on UDP 127.0.0.1:23195
✓ NS heartbeat enrolled: test-node.example.ztlp @ 127.0.0.1:23195 -> 127.0.0.1:23196
→ [ns_heartbeat] published test-node.example.ztlp to 127.0.0.1:23196
=== NS stub log ===
ts opcode=0x09 len=192 from=('127.0.0.1', 40399) <- KEY
ts opcode=0x09 len=115 from=('127.0.0.1', 40399) <- SVC
ts opcode=0x01 len=26 from=('127.0.0.1', 40399) <- verify query
ts opcode=0x09 len=192 from=('127.0.0.1', 39094) <- KEY (next cycle)
ts opcode=0x09 len=115 from=('127.0.0.1', 39094) <- SVC (next cycle)
ts opcode=0x01 len=26 from=('127.0.0.1', 39094) <- verify (next cycle)
```
Two heartbeat cycles within 1 second using a test override; production
cadence is 8h ± 10min.
## Follow-up
- Phase 2 (separate PR): delete the Chef cookbook's `ztlp_ns_register`
resource and the `svc-registered.sha256` marker file. The listener now
handles its own NS presence; cookbook should only install the binary,
config, identity, and Windows service.
- Phase 3 (separate PR): same heartbeat pattern in `ztlp agent start` so
agent processes self-register too.
## TDD note
Strict RED-GREEN-REFACTOR was interrupted mid-cycle by a disk-full
incident (target/ + /tmp filled the root volume), which forced the
working tree to be `git checkout`-restored and the patches re-applied
in a single pass instead of one-at-a-time. The first run of the new
tests still caught real bugs (wrong register opcode 0x05 vs actual
0x09; wrong data-length encoding u32 vs u16) before any production
code changes were made permanent, so the spirit of "watch the test
fail for the right reason before locking the implementation" was
preserved even though the literal ordering was compressed.
* style: cargo fmt
* fix(proto): address CodeRabbit review feedback
## What
CodeRabbit flagged 5 Major findings on PR #91; this commit addresses all of
them. Test counts and integration confirmation are unchanged.
### #6 Require explicit opt-in flag for NS self-registration
`--service-name` has a default of `ztlp-gateway` for relay routing, so the
previous logic auto-activated NS registration on any `--ns-server + --zone`
listener — including listeners that should have stayed under external (Chef)
NS publication. Added a new `--ns-register-name <fqsn>` flag that is the
explicit opt-in:
- When omitted: listener does NOT touch NS (previous behavior for non-NS
callers preserved).
- When set: listener requires all three of `--ns-server`, `--zone`, and
`--ns-register-name` before spawning the heartbeat.
### #7 Don't publish unspecified bind addresses as SVC
With the default `--bind 0.0.0.0:23095`, `local_addr()` yields
`0.0.0.0:23095` and that exact (unroutable) value was being published as the
SVC record. Now the listener detects unspecified bind via
`listener_addr.ip().is_unspecified()` and publishes KEY only in that case,
with a warning at startup pointing operators at the `--bind` flag.
### #8 Fail listener startup when the initial sync publish fails
The spec said "fails fast at Windows service startup" but the code only
logged a warning and continued. The Err branch now propagates a descriptive
error up through `cmd_listen`, which aborts the listener — leaving the
Windows service event log with a clear "initial NS publish failed" line
instead of a silently-up-but-undiscoverable process.
### #9 Stub NS readiness race in integration test
The integration script's fixed `sleep 0.5` after launching the Python stub
NS races on slow CI runners. Replaced with a readiness probe that polls the
stub's stdout for the "listening on 127.0.0.1:PORT" log line, retries up to
4s, and FAILS LOUDLY if the stub never reports ready (also detects stub
process death mid-startup).
### #10 Integration script never fails when expected log line missing
The wait loop polled for "NS heartbeat enrolled" or "initial NS publish
failed" but had no fallthrough behavior if neither appeared. Added a
`FOUND_MSG` flag and explicit failure path with full listener stderr dump
when the listener hangs or logs something unexpected.
## Why
CodeRabbit reasoning was correct for all five. #6/#7 prevent production
junk records, #8 makes the fail-fast contract real, #9/#10 make the
integration script trustworthy on CI.
## Details
- New CLI flag `--ns-register-name <NS_REGISTER_NAME>` on `ztlp listen`,
threaded through `Commands::Listen` destructuring → `cmd_listen` signature
→ the heartbeat block.
- `--service-name` no longer plays double duty; it stays a relay-only flag.
- The heartbeat gate now requires all three of (ns_register_name, ns_server,
zone) to be `Some`. If any one is missing, the heartbeat is not spawned.
- `advertise_addr: Option<String>` propagates Some-only for routable binds;
`ns_publish_self(..., advertise_addr.as_ref())` cleanly degrades to
KEY-only when the option is None.
- Spec doc `docs/plans/2026-06-01-ns-self-register-heartbeat.md` updated to
document the explicit opt-in flag and the KEY-only fallback.
- Integration script updated to use the new flag.
## Validation
```
cargo fmt --check -> clean
cargo test --bin ztlp ns_ -> 24 passed; 0 failed
scripts/test-ns-heartbeat-integration.sh:
✓ stub NS ready after 2*0.1s
✓ initial publish observed after 2*0.5s
NS stub saw: 4 registers, 2 queries
PASS: heartbeat published KEY+SVC to NS stub
```
The 2 existing T3 wire-shape regression guards still pass:
```
test tests::ns_register_key_record_includes_node_id ... ok
test tests::ns_register_svc_record_includes_node_id ... ok
```
## Backward compatibility
Existing `ztlp ns register` CLI subcommand unchanged. Existing Chef
cookbook runs unchanged (they call `ztlp ns register` explicitly, not
`ztlp listen --ns-register-name`). This commit is purely additive at the
external surface: a new optional flag and a new failure mode that only
triggers when the new flag is set.
Summary
Followups #1 and #2 from PR #5.
Followup #2 (the meat) —
POST /api/admin-pubkeyA tenant admin can now enable passwordless ZTLP gateway auth without SSHing into the host. Send the original claim token + their Noise pubkey, and the Launch app rewrites the per-tenant
instance.envand force-recreates the gateway container.Auth model: holder of the original claim_token. It was shown once at
/start, stored as an HMAC digest in the DB, and the admin already used it to claim the tenant. Reusing it here doesn't widen the trust boundary.Validation: mirrors the Rust gateway's
--admin-pubkey-emailcheck — 64 lowercase hex chars decoding to a 32-byte X25519 public key. Uppercase normalized; off-by-one lengths, non-hex chars, missing token all fail fast with clear errors.Side effects on success:
ZTLP_ADMIN_PUBKEY_HEX=ininstance.env(preserves all other operator-set keys).docker compose up -d --force-recreate gatewayin the instance dir.--force-recreateis required — plainup -dis a no-op when the image+config hash is unchanged, and changing anenv_filevalue alone doesn't bust that hash for already-running containers.Followup #1 — secrets.env / env_file readability
Was flagged in the original handoff as needing live-tenant verification. Investigation shows there's nothing to verify: docker-compose v2 reads
env_filein the CLI process (same user that wrote the file with chmod 600), then injects the parsed KEY=VALUE pairs into the container env at start time. The container never sees the file. Documented inline inapp.pyso the question doesn't get re-raised.Refactors
_slug_for_rowand_instance_dir_for_slughelpers so the new handler and_provision_zone_dockerscompute paths identically (was inlined slug logic)._fake_runnow records calls intoself._subprocess_callsso tests can assert on the docker recreate invocation.Test Plan
python3 -m unittest discover -s tests→ 45 passed, 0 failed (was 38; added 7).