Skip to content

feat(launch): POST /api/admin-pubkey to flip on passwordless gateway auth#6

Merged
priceflex merged 1 commit into
mainfrom
feature/launch-admin-pubkey-endpoint
May 20, 2026
Merged

feat(launch): POST /api/admin-pubkey to flip on passwordless gateway auth#6
priceflex merged 1 commit into
mainfrom
feature/launch-admin-pubkey-endpoint

Conversation

@priceflex
Copy link
Copy Markdown
Owner

Summary

Followups #1 and #2 from PR #5.

Followup #2 (the meat) — POST /api/admin-pubkey

A tenant admin can now enable passwordless ZTLP gateway auth without SSHing into the host. Send the original claim token + their Noise pubkey, and the Launch app rewrites the per-tenant instance.env and force-recreates the gateway container.

POST /api/admin-pubkey
  Content-Type: application/x-www-form-urlencoded
  token=<claim_token>&pubkey_hex=<64 lowercase hex>

→ 200 application/json {"status":"ok","slug":"acme-corp","applied":true}
→ 400 / 401 / 404 / 500 {"error":"...","detail":"..."}

Auth model: holder of the original claim_token. It was shown once at /start, stored as an HMAC digest in the DB, and the admin already used it to claim the tenant. Reusing it here doesn't widen the trust boundary.

Validation: mirrors the Rust gateway's --admin-pubkey-email check — 64 lowercase hex chars decoding to a 32-byte X25519 public key. Uppercase normalized; off-by-one lengths, non-hex chars, missing token all fail fast with clear errors.

Side effects on success:

  1. In-place rewrite of ZTLP_ADMIN_PUBKEY_HEX= in instance.env (preserves all other operator-set keys).
  2. docker compose up -d --force-recreate gateway in the instance dir. --force-recreate is required — plain up -d is a no-op when the image+config hash is unchanged, and changing an env_file value alone doesn't bust that hash for already-running containers.

Followup #1 — secrets.env / env_file readability

Was flagged in the original handoff as needing live-tenant verification. Investigation shows there's nothing to verify: docker-compose v2 reads env_file in the CLI process (same user that wrote the file with chmod 600), then injects the parsed KEY=VALUE pairs into the container env at start time. The container never sees the file. Documented inline in app.py so the question doesn't get re-raised.

Refactors

  • Extracted _slug_for_row and _instance_dir_for_slug helpers so the new handler and _provision_zone_dockers compute paths identically (was inlined slug logic).
  • Test fixture's _fake_run now records calls into self._subprocess_calls so tests can assert on the docker recreate invocation.

Test Plan

  • python3 -m unittest discover -s tests45 passed, 0 failed (was 38; added 7).
  • New cases: missing token, invalid token, bad-hex shapes (empty/short/non-hex/off-by-one), uppercase normalization, env-rewrite + force-recreate side effects, rebind overwrites prior value, 404 when instance not provisioned yet.

…auth

Followup #2 from PR #5: lets a tenant admin bind their Noise static
pubkey to enable passwordless ZTLP gateway auth without SSHing into
the host to hand-edit instance.env.

  POST /api/admin-pubkey
    body (x-www-form-urlencoded):
      token=<claim_token>&pubkey_hex=<64 lowercase hex>
    returns 200 application/json: {status:ok, slug, applied:true}
    returns 400 / 401 / 404 / 500 with {error,detail?}

Auth model: holder of the original claim_token. The token was shown
once to the admin at /start time and is stored only as an HMAC digest
in the DB, so reusing it as the auth credential here does not widen
the trust boundary — anyone with the raw token already has full
control of this tenant.

Validation mirrors the Rust gateway's --admin-pubkey-email check:
exactly 64 lowercase hex chars decoding to a 32-byte X25519 public
key. Uppercase hex is normalized; off-by-one lengths, non-hex chars,
and missing/invalid tokens all fail fast with clear errors.

Side effects on success:
  1. Rewrites the ZTLP_ADMIN_PUBKEY_HEX= line in instance.env in
     place (preserves all other operator-set keys).
  2. Runs 'docker compose up -d --force-recreate gateway' in the
     instance dir. --force-recreate is required — a plain 'up -d'
     is a no-op when the image+config hash is unchanged and changing
     an env_file value alone does not bust that hash for already-
     running containers.

Followup #1: also document inline why the chmod 600 on secrets.env
is compatible with the gateway's env_file mount. docker-compose v2
reads env_file in the CLI process (same user that wrote it), not in
the container, so the gateway never needs read perms on the file —
only the Launch process user invoking 'docker compose up' does.

Refactors: extract _slug_for_row and _instance_dir_for_slug helpers
so the new handler and _provision_zone_dockers compute paths the
exact same way. The test fixture's _fake_run now records calls into
self._subprocess_calls so we can assert on the docker recreate.

Tests: 7 new cases (45 total, was 38) covering missing/invalid
tokens, bad-hex shapes, uppercase normalization, env+recreate
side effects, rebind overwriting prior value, and 404 when the
admin hits the endpoint before clicking the claim link.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Warning

Rate limit exceeded

@priceflex has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7551bdb-5c9f-46b1-aebd-403393796d3c

📥 Commits

Reviewing files that changed from the base of the PR and between 35c5d48 and a6b0374.

📒 Files selected for processing (2)
  • ztlp.net/launch_app/app.py
  • ztlp.net/tests/test_launch_app.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/launch-admin-pubkey-endpoint

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@priceflex priceflex merged commit 996c9c6 into main May 20, 2026
9 checks passed
@priceflex priceflex deleted the feature/launch-admin-pubkey-endpoint branch May 20, 2026 08:08
priceflex added a commit that referenced this pull request May 20, 2026
Ships PRs #5 + #6:
- HTTP header injection on ztlp listen for passwordless gateway auth
- POST /api/admin-pubkey endpoint on Launch app to bind admin pubkeys
priceflex added a commit that referenced this pull request May 20, 2026
Detailed description:
- Updated the handoff document to capture the state of the passwordless authentication project.
- Documented that PR #5 and PR #6 were completed and merged.
- Logged the outstanding block: the 'Nebula dumb-pipe' architecture is routing around the L7 HTTP Injection layer (), breaking auto-login.
- Established the immediate next steps: do not guess code fixes; instead, turn on aggressive trace logging / tcpdump on the AWS test server to measure exactly how TCP packets behave across the tunnel.
- Documented the uncommitted hot-patch  variable escape remaining on the AWS server.
priceflex added a commit that referenced this pull request May 22, 2026
What changed:
- Define new ZTLP control frame `FRAME_CLIENT_ROUTE` (magic 0x5A 0x37 0x0B)
  in proto/src/bin/ztlp-cli.rs alongside existing GATEWAY_REGISTER constants.
- Add `build_client_route_packet` builder + `parse_client_route_packet`
  helper (test-only) with HMAC-SHA256 reusing the GATEWAY_REGISTER scheme.
- Validate svc_len bounds (1..=63), magic, type, and truncation.
- 8 unit tests cover round-trip in dev and HMAC modes plus all rejection
  paths (empty/oversize service name, bad magic, bad type, truncated bytes).

Why:
- α-relay milestone (per handoff Task 1) — this frame replaces the rejected
  ALPN-byte-regex / dummy-Noise-UDP-HELLO approach (handoff Decisions #3).
- Reusing the GATEWAY_REGISTER HMAC scheme keeps new crypto surface area
  to zero (handoff Decisions #4).

Implementation details:
- Wire layout: magic(2) | type(1) | node_id(16) | svc_len(1) | service(N)
  | timestamp(8 BE i64) | hmac(32). Min 61 bytes, max 124.
- HMAC covers [type | node_id | svc_len | service | timestamp]; zeroed in
  dev mode, computed via hmac+sha2 crates (already deps) when a secret is
  provided.
- Includes the client's NodeID for ACL/audit (Open Question #1, Steve's
  tenant-isolation preference).

Tests added:
- client_route_packet_roundtrip_dev_mode
- client_route_packet_hmac_is_set_with_secret
- client_route_rejects_empty_service_name
- client_route_rejects_oversized_service_name
- client_route_parse_rejects_bad_magic
- client_route_parse_rejects_bad_type
- client_route_parse_rejects_truncated_packet
- client_route_max_length_service_name_succeeds

Validation performed:
- cargo test -p ztlp-proto --bin ztlp client_route → 8/8 passed
- cargo test -p ztlp-proto --bin ztlp → 13/13 passed
- cargo test -p ztlp-proto --lib → 857/857 passed
- cargo build -p ztlp-proto --bin ztlp → clean (31 pre-existing warnings,
  no new ones)

Follow-up needed:
- Wire `Config.registration_secret`-equivalent into the CLI so production
  HMAC isn't always zeroed.
- Phase C multi-service routing.
- v0.27.1 parity audit.

Known limitations:
- iOS NE path still uses legacy Noise-UDP — out of scope for this change
  (handoff Decisions #6).
priceflex added a commit that referenced this pull request May 22, 2026
What:
- Bump ztlp-proto from 0.28.5 -> 0.29.0 (proto/Cargo.toml)
- Bump ztlp_relay from 0.24.0 -> 0.29.0 (relay/mix.exs) to align with the
  Rust crate and the GitHub release tag.
- Restore proto/tests/handshake_tests.rs by extending HandshakeResult with
  `initiator_session` / `responder_session` aliases so the integration
  tests compile against the current shape without losing the in-process
  responder-side key check that the legacy harness asserted.
- Apply rustfmt to ztlp-cli.rs / quic_transport.rs / tunnel.rs /
  quic_transport_test.rs so CI's \`cargo fmt -- --check\` step stays green.

Why:
- v0.29.0 ships the FRAME_CLIENT_ROUTE α-relay routing primitive end-to-end:
  CLI now installs a client_map ETS entry on the relay before the QUIC
  handshake, replacing the rejected dummy-Noise-UDP-HELLO hack from the
  previous session (handoff Decisions #3). Manual E2E validated against
  staging relay 34.218.240.106 plus tenant gateway gw-hermese2e-v at
  35.91.88.177:40416, with curl returning HTTP 200 from the Rails
  bootstrap through the tunnel.
- The handshake_tests.rs file pre-dated the current HandshakeResult shape
  and was failing `cargo clippy --all-targets` on this branch; ignoring it
  would have broken CI immediately after merge.

Implementation details:
- HandshakeResult keeps `session` for back-compat (any existing caller still
  works) and adds `initiator_session` (alias for the same in-process slot)
  and `responder_session` (populated only when both sides exist locally).
  Network-only handshakes (quic_transport.rs) fill both alias fields with
  the local session - those code paths only ever read `.session` /
  `.session_id` so the extra copies are inert.

Tests:
- cargo test --lib: 857 passed, 0 failed, 12 ignored
- cargo clippy --all-targets: clean (warnings only, no errors)
- cargo fmt -- --check: clean
- mix test in relay/: 588 passed, 0 failed

Validation performed:
- α-relay E2E (THIS RELEASE'S MILESTONE):
  ztlp connect 34.218.240.106:23095 --service gw-hermese2e-v --quic
    -L 8083:127.0.0.1:3000
  curl http://localhost:8083/  ->  HTTP/1.1 200 OK + autologin session cookie
  Relay log: \"[UdpListener] CLIENT_ROUTE accepted: client=...
    service=gw-hermese2e-v gateway={{35, 91, 88, 177}, 40416}\"
- β-direct regression:
  ztlp connect 35.91.88.177:40416 --service gw-hermese2e-v --quic
    -L 8084:127.0.0.1:3000
  curl http://localhost:8084/  ->  HTTP/1.1 200 OK

Follow-up needed:
- Phase C multi-service ServiceRegistry routing (handoff Task 2)
- v0.27.1 parity audit (handoff Task 3)
- `pick_gateway_for_service` round-robin fallback when the requested
  service name has no live gateway picks any registered gateway. That
  makes routing non-deterministic when the requested tenant is
  momentarily not heartbeating. Tighten to "error out and let the client
  retry" once we have client-side service-not-registered handling.

Known limitations:
- iOS NE path still uses legacy Noise-UDP (per handoff Decisions #6).
priceflex added a commit that referenced this pull request May 23, 2026
… with regression pins (#14)

What
- Bumps `gateway/mix.exs` and `ns/mix.exs` from 0.24.0 → 0.29.4.
- Bumps `proto/Cargo.toml` from 0.29.3 → 0.29.4.
- Adds the same "version reporting (regression pin)" describe block to
  `gateway/test/ztlp_gateway/release_test.exs` and
  `ns/test/ztlp_ns/release_test.exs` that PR #13 added for relay
  (semver shape + runtime-vs-declared drift + ≥0.29.4 floor guard).
- Adds `proto/tests/version_pin_test.rs` with two equivalent Rust-side pins
  (parseable semver + ≥0.29.4 floor guard) using the in-crate
  `ztlp_proto::updater::SemVer` to avoid taking a new external dependency.

Why
- PR #13 fixed the relay's version-string drift but explicitly left the rest
  of the tree as follow-up scope (handoff Known Problems #6: "gateway/mix.exs
  and ns/mix.exs at 0.24.0; proto/Cargo.toml at 0.29.3"). Gateway and NS had
  been pinned at 0.24.0 for five minor versions, so
  `Application.spec(:ztlp_gateway, :vsn)` and `Application.spec(:ztlp_ns, :vsn)`
  were lying about which tag was actually deployed — the exact bug class
  PR #13 fixed for relay. Same lie, three more components.
- Without regression pins on these other components, the same drift can
  silently recur on the next tag cut.

Details
- Floor guard uses `Version.compare(declared, "0.29.4") in [:gt, :eq]` (Elixir)
  / `actual.cmp(&floor)` (Rust) rather than asserting a literal version string.
  This is deliberate: literal-string assertions become test maintenance burden
  on every routine bump, whereas a floor guard only fails on accidental
  down-bumps below the v0.29.4 strict-routing tag.
- Runtime-vs-declared drift test catches both directions of the bug PR #13
  found: (a) mix.exs bumped but .app cache stale, and (b) tag cut without
  bumping mix.exs.
- Rust-side test re-uses the in-crate `updater::SemVer` parser rather than
  pulling in the external `semver` crate. This exercises the same code path
  the self-updater uses, so if that parser ever regresses both will catch it.
- mix.exs comment blocks explain the bump inline so future readers see the
  on-call story at the point of change, not just in the test file or handoff.
- No source-code logic changed; this is a version-string + test-only PR.
  Public API, wire format, and runtime behavior are untouched.

Tests (TDD discipline followed)
- RED step verified per-component before bump:
  * gateway: `mix test test/ztlp_gateway/release_test.exs --seed 0` →
    "mix.exs version 0.24.0 is older than the v0.29.4 strict-routing tag"
    (1/15 failures, all in the new floor-guard test, exactly as designed).
  * ns: `mix test test/ztlp_ns/release_test.exs --seed 0` →
    "mix.exs version 0.24.0 is older than the v0.29.4 strict-routing tag"
    (1/15 failures, identical message).
  * proto: `cargo test --test version_pin_test` →
    "proto/Cargo.toml version 0.29.3 is older than the v0.29.4 strict-routing
    tag" (1/2 failures, parseable-semver test still green so we know the
    fixture is sane).
- GREEN step after bumps:
  * gateway: 15/15 in release_test.exs; 835/835 in full `mix test` (seed 1).
    Note: seed 55290 surfaces two pre-existing test-ordering flakes
    (TLS port reuse in tls_phase2_test.exs and a GenServer teardown race in
    crl_server_test.exs) — both also reproduce on plain `main` and are
    unrelated to this PR. Confirmed by stash + checkout `main`-only files +
    rerun on the same seed.
  * ns: 15/15 in release_test.exs; 729/729 in full `mix test`.
  * proto: 2/2 in version_pin_test; 858/858 in `cargo test --lib --release`
    (matches the pre-branch baseline).

Validation (non-test)
- `cargo check --release` clean (31 pre-existing dead-code warnings in
  `proto/src/bin/ztlp-cli.rs` unchanged — Known Problems #4 scope).
- Relay suite re-run unchanged: 597/597 (confirms no cross-component
  collateral damage).
- ztlp.net Python suite unchanged: 48/48 in `tests.test_launch_app`.
- Infra untouched. No relay restart, no gateway restart, no NS restart.
  Live binaries still on v0.29.3 (relay) / v0.24.0 (gateway, ns) — actual
  deployment of the bumped versions is a separate, Steve-gated step.

Follow-up
- After merge, decide tag strategy with v0.29.4 already in the past:
  either (a) cut v0.29.5 that includes this + PR #13's mix.exs bump
  (cleanest), or (b) accept "tag is source of truth; mix.exs is
  best-effort" for the trailing v0.29.4 and reset on v0.30.0.
- Task #3 (per-zone HMAC `Config.registration_secret/0`) is still the
  prod-readiness blocker and is the hard dependency for the Bootstrap
  workstream (handoff §"HARD DEPENDENCY uncovered while locking in #5").
- Known Problems #4 (`cargo fix` pass on 31 dead-code warnings in
  `proto/src/bin/ztlp-cli.rs`) still open; left out of scope here to keep
  this PR tight.

Refs
- PR #12 (v0.29.4 strict-routing): 829abdf
- PR #13 (relay mix.exs vsn pin): d22afbf
- Handoff: ~/hermes_session_handoff.md "Known Problems #6", "Open Question #1"
priceflex added a commit that referenced this pull request May 23, 2026
…lean v0.29.5 tag (#15)

What
- Bumps `relay/mix.exs`, `gateway/mix.exs`, `ns/mix.exs` from 0.29.4 → 0.29.5.
- Bumps `proto/Cargo.toml` from 0.29.4 → 0.29.5.
- No test changes — the regression pins added in PR #13 / PR #14 still
  catch drift (semver-shape + runtime-vs-declared + ≥0.29.4 floor).

Why
- v0.29.4 tag was cut at commit `829abdf` BEFORE PR #13 and PR #14 bumped
  the four declared versions. That means a container built from the
  v0.29.4 tag reports its runtime vsn as the PRE-bump value, which is
  exactly the on-call-lying bug PR #13 introduced regression tests for.
- The fix is to cut a NEW tag whose commit POST-DATES the bumps. v0.29.5
  is the smallest possible bump that achieves this — pure version-string
  alignment, no functional change.
- After this PR merges and v0.29.5 is tagged: git tag == relay vsn ==
  gateway vsn == ns vsn == proto CARGO_PKG_VERSION == "0.29.5". The
  existing `release_test.exs` runtime-vs-declared drift test, run inside
  any container built from this tag, will report agreement instead of
  drift.

Details
- Floor guard in `release_test.exs` and `version_pin_test.rs` stays at
  0.29.4. This is deliberate: the floor is a stable lower bound, not the
  current version. Ratcheting it on every routine bump is the exact
  maintenance burden the floor-guard pattern was designed to avoid (see
  the PR #13/#14 commit messages). 0.29.5 satisfies ≥0.29.4, and an
  accidental down-bump below 0.29.4 (which would lose the strict-routing
  fix) is what the floor exists to catch.
- mix.exs comment blocks updated in-place to reference both PR #14 and
  this PR, so the version-line provenance is readable at the point of
  change without chasing PRs in git log.
- No source-code logic changed. Public API, wire format, runtime behavior
  unchanged.

Tests (green on this branch)
- `cd proto && cargo test --lib --release`: 858 passed
- `cd proto && cargo test --test version_pin_test`: 2 passed
- `cd relay && mix test`: 597 passed
- `cd gateway && mix test --seed 1`: 835 passed
  (seed 1 avoids the two pre-existing seed-55290 flakes in tls_phase2_test
  and crl_server_test — both documented in the handoff as Task #6 and
  unrelated to this PR.)
- `cd ns && mix test`: 729 passed
- `cd ztlp.net && python3 -m unittest tests.test_launch_app`: 48 passed
- Total: 3,069 tests, 0 failures.

The runtime-vs-declared drift test in each release_test.exs was the key
verification — it exercises the same `Application.spec/2` RPC that lies
on a misbuilt container. All three Elixir suites pass after the bump,
which means mix recompiled the .app files to match the new declared
version. This is the in-source-tree equivalent of the
`docker exec ... rpc 'Application.spec(_, :vsn)'` check that found the
original PR #13 bug.

Validation
- `git diff --stat` matches expectation: 4 files, ~23 insertions, ~19
  deletions (the changes are mostly comment-block updates around the
  one-character version bumps).
- No source code changes; no risk of breaking runtime behavior.
- No infra touched.

Follow-up (post-merge)
- Tag `v0.29.5` on main, push tag with openclaw SSH key, watch the
  Release workflow to green.
- Deploy `priceflex/ztlp-relay:v0.29.5` to staging relay
  (`34.218.240.106`) when Steve OKs the restart (iOS bench risk). A
  container built from v0.29.5 will report `:vsn = "0.29.5"` correctly,
  closing out the original PR #13 defect end-to-end.
- Task #2 (per-zone HMAC `Config.registration_secret/0`) remains the
  prod-readiness blocker and hard dependency for the Bootstrap workstream.

Refs
- PR #12 (v0.29.4 strict-routing): 829abdf
- PR #13 (relay mix.exs vsn pin): d22afbf
- PR #14 (gateway/ns/proto vsn pin): 0a6fbb6
- Handoff: ~/hermes_session_handoff.md "Active Tasks → Task 0" decision
  list (Option 1 — cut v0.29.5 immediately).
priceflex added a commit that referenced this pull request May 23, 2026
… adding missing PolicyEngine setup (#16)

What:
- tls_phase2_test.exs:276 ("injects identity headers for mTLS connections in
  identity mode") was tagged @:flaky and reliably failed when run alone or as
  part of the suite at seed 55290.
- Root cause: the test sends a client cert, which trips
  TlsSession.check_policy/1 → PolicyEngine.authorize?/2. Without a matching
  rule the call fell back to "default" → returned false → session rejected
  with :policy_denied. The TLS client never got a response, the backend's
  spawn_link :gen_tcp.accept/2 timed out at 15s, and the failure surfaced as
  a confusing `MatchError {:error, :timeout}` at line 286 — pointing at the
  acceptor, not at the missing policy rule.
- Bug was hidden in full-suite runs at "good" seeds because earlier tests
  (tls_phase2_test.exs:430, :472, :766) insert `PolicyEngine.put_rule(
  "localhost", :all)` and that ETS state leaked forward.

Why:
- Tests must be self-contained. Relying on a previous test's PolicyEngine
  state is exactly the kind of order-dependency that produces "works on my
  machine" failures and makes seed 55290 look like a flake when it was
  actually a deterministic missing-setup bug.
- Removing `@tag :flaky` is correct because the test is no longer flaky —
  the previous behavior was a latent test bug, not real non-determinism.

Details:
- Added `PolicyEngine.put_rule("localhost", :all)` as the first line of the
  test body, matching the pattern in :430, :472, :520, :562, :766.
- Added an explanatory comment on the describe block documenting the
  PolicyEngine dependency for identity-mode tests so the next person who
  writes one of these tests doesn't repeat the omission.
- Removed `@tag :flaky` — the test now passes deterministically across all
  seeds tried (1, 2, 7, 42, 100, 55290) in isolation and as part of the
  full suite.

Tests:
- `mix test test/ztlp_gateway/tls_phase2_test.exs:276 --seed 1` → 1 passed
  (previously: 1 failure / MatchError)
- Same with seeds 2, 7, 42, 100, 55290 — all green
- `mix test --seed 1` full gateway suite → 835/835 (unchanged from before)
- `mix test --seed 55290` full gateway suite → 835/835 (previously 834/835)

Validation:
- Reproduced the failure deterministically at seed 1 by running the test
  in isolation (handoff said it failed only at seed 55290; it actually
  fails on every seed in isolation — the suite seeds 1 ordering happens
  to leak compatible state forward).
- Instrumented TlsSession.check_policy/1 to confirm authorize? returned
  false in 0ms when no rule was set; instrumentation backed out before
  commit.

Follow-up:
- Task #6 from hermes_session_handoff.md is partially resolved (this is
  the tls_phase2 half of the flake pair). The crl_server_test.exs:39
  OnExitHandler race documented in the same task did NOT reproduce in
  this session at seed 55290; tracking separately.
- The broader hardening — `setup` block that resets PolicyEngine ETS
  state between tests — is deferred. The current fix follows the
  established pattern (each test inserts its own rules) so it composes
  cleanly with the rest of the file.

Refs:
- hermes_session_handoff.md Task #6: "Fix two pre-existing test-ordering
  flakes in gateway suite"
- v0.29.5 baseline (no library code changed)
priceflex added a commit that referenced this pull request May 23, 2026
…ents (BS-PR-6) (#28)

What
-----
Customer-facing end-to-end documentation and ready-to-copy reference
clients for any external system (initial target: Z2LS) that wants to
mint enrollment tokens via the ZTLP-secured Bootstrap API.

Three artifacts:

  1. bootstrap/docs/z2ls_enrollment_runbook.md (574 lines)
     The single-source guide for integrators: pre-reqs, provisioning
     an api_clients row, the per-zone HMAC signing contract, the full
     POST /api/v1/enrollment_tokens flow, redemption paths
     (CLI / macOS / iOS), an end-to-end smoke procedure, and a
     comprehensive troubleshooting section keyed on the actual 401
     reason codes the authenticator logs.

  2. bootstrap/script/z2ls_request_token.py (153 lines)
     Self-contained Python reference client. Stdlib-only (urllib +
     hmac + hashlib). Drop into any Z2LS codebase and adapt.

  3. bootstrap/script/z2ls_request_token.rb (128 lines)
     Self-contained Ruby reference client. Stdlib-only (net/http +
     openssl). Mirrors the Python reference behavior byte-for-byte.

Why
---
Steve's 2026-05-23 brief item #5/#6:
  * "Build an API endpoint where Z2LS can request an enrollment token
     for a new system."  (shipped in BS-PR-3 / #26)
  * "[Workflow] Z2LS sends the computer name to ZTLP Bootstrap [...]
     ZTLP Bootstrap validates that Z2LS is allowed to communicate with
     the API using ZTLP-secured communication."

BS-PR-3 shipped the server side. This PR ships the documentation +
reference signers so customers (and our own Z2LS instances) can hit
that endpoint correctly on the first try.  The most common integration
failure for HMAC-signed APIs is the canonical-message format: this
runbook spells out the 6-line layout and the two reference clients
demonstrate it concretely.

Details
-------
* Canonical 6-line signed message matches
  `Ztlp::ApiAuthenticator#canonical_message`:
      METHOD\nFULLPATH\nZONE\nCLIENT\nTIMESTAMP\nSHA256_HEX(body)
* Per-zone secret env-var slug rule
  (`zone.upcase.gsub(/[^A-Z0-9]+/, "_")`) matches
  `ApiAuthenticator.slugify_zone/1` exactly — verified by replicating
  the rule in both reference clients.
* 64-char pure-hex secret values are decoded to 32 raw bytes BEFORE
  signing (the Bootstrap authenticator does the same; mismatched
  decoding is the second most common failure mode and has a dedicated
  troubleshooting bullet).
* Runbook documents the 503 case (api_clients row exists, no Network
  row for the zone) and forward-refs BS-PR-4 for the auto-creation
  fix.
* Helper scripts use only stdlib so they can be dropped into any
  Z2LS host without dependency installation.

Tests
-----
No new automated tests — this is documentation + reference scripts.
However, both helpers were validated to produce signatures
byte-identical to the server-side `Ztlp::ApiAuthenticator`:

    python sig: c019132d66f52d85173ca9d8811c093b754999afcc8646dd00016e1954ee5201
    ruby sig:   c019132d66f52d85173ca9d8811c093b754999afcc8646dd00016e1954ee5201
    reference:  c019132d66f52d85173ca9d8811c093b754999afcc8646dd00016e1954ee5201

(64-char hex key, POST /api/v1/enrollment_tokens, zone=acme.ztlp,
client=z2ls.acme, ts=1700000000, body='{"computer_name":"alice-laptop"}')

Pre-existing test suite untouched:
* bootstrap full suite still 1020/1017 (same 3 pre-existing
  SshProvisionerTest port-mismatch failures predating this work).
* `api/v1/enrollment_tokens_controller_test.rb` — 13/13 pass.

Validation
----------
* `python3 -c 'compile(...)'` — script parses.
* `ruby -c script/z2ls_request_token.rb` — Syntax OK.
* Cross-language signature equivalence verified (see Tests).
* Runbook proofread for accuracy against the actual auth + token code
  paths (slugify rule, hex-decode rule, canonical-message order,
  fullpath, RFC1035 validator, audit log action names).

Follow-up
---------
* BS-PR-4 (next): auto-create the per-tenant Network row + matching
  api_clients row during ztlp.net onboarding so customers don't hit
  the 503 path the runbook currently documents as a workaround.
* If Steve wants a `bin/z2ls-request-token` packaged binary later we
  can roll the Python helper into a small CLI; for now keep it as a
  copy-paste reference.
priceflex added a commit that referenced this pull request Jun 1, 2026
## What

CodeRabbit flagged 5 Major findings on PR #91; this commit addresses all of
them. Test counts and integration confirmation are unchanged.

### #6 Require explicit opt-in flag for NS self-registration

`--service-name` has a default of `ztlp-gateway` for relay routing, so the
previous logic auto-activated NS registration on any `--ns-server + --zone`
listener — including listeners that should have stayed under external (Chef)
NS publication. Added a new `--ns-register-name <fqsn>` flag that is the
explicit opt-in:

  - When omitted: listener does NOT touch NS (previous behavior for non-NS
    callers preserved).
  - When set: listener requires all three of `--ns-server`, `--zone`, and
    `--ns-register-name` before spawning the heartbeat.

### #7 Don't publish unspecified bind addresses as SVC

With the default `--bind 0.0.0.0:23095`, `local_addr()` yields
`0.0.0.0:23095` and that exact (unroutable) value was being published as the
SVC record. Now the listener detects unspecified bind via
`listener_addr.ip().is_unspecified()` and publishes KEY only in that case,
with a warning at startup pointing operators at the `--bind` flag.

### #8 Fail listener startup when the initial sync publish fails

The spec said "fails fast at Windows service startup" but the code only
logged a warning and continued. The Err branch now propagates a descriptive
error up through `cmd_listen`, which aborts the listener — leaving the
Windows service event log with a clear "initial NS publish failed" line
instead of a silently-up-but-undiscoverable process.

### #9 Stub NS readiness race in integration test

The integration script's fixed `sleep 0.5` after launching the Python stub
NS races on slow CI runners. Replaced with a readiness probe that polls the
stub's stdout for the "listening on 127.0.0.1:PORT" log line, retries up to
4s, and FAILS LOUDLY if the stub never reports ready (also detects stub
process death mid-startup).

### #10 Integration script never fails when expected log line missing

The wait loop polled for "NS heartbeat enrolled" or "initial NS publish
failed" but had no fallthrough behavior if neither appeared. Added a
`FOUND_MSG` flag and explicit failure path with full listener stderr dump
when the listener hangs or logs something unexpected.

## Why

CodeRabbit reasoning was correct for all five. #6/#7 prevent production
junk records, #8 makes the fail-fast contract real, #9/#10 make the
integration script trustworthy on CI.

## Details

- New CLI flag `--ns-register-name <NS_REGISTER_NAME>` on `ztlp listen`,
  threaded through `Commands::Listen` destructuring → `cmd_listen` signature
  → the heartbeat block.
- `--service-name` no longer plays double duty; it stays a relay-only flag.
- The heartbeat gate now requires all three of (ns_register_name, ns_server,
  zone) to be `Some`. If any one is missing, the heartbeat is not spawned.
- `advertise_addr: Option<String>` propagates Some-only for routable binds;
  `ns_publish_self(..., advertise_addr.as_ref())` cleanly degrades to
  KEY-only when the option is None.
- Spec doc `docs/plans/2026-06-01-ns-self-register-heartbeat.md` updated to
  document the explicit opt-in flag and the KEY-only fallback.
- Integration script updated to use the new flag.

## Validation

```
cargo fmt --check          -> clean
cargo test --bin ztlp ns_  -> 24 passed; 0 failed
scripts/test-ns-heartbeat-integration.sh:
  ✓ stub NS ready after 2*0.1s
  ✓ initial publish observed after 2*0.5s
  NS stub saw: 4 registers, 2 queries
  PASS: heartbeat published KEY+SVC to NS stub
```

The 2 existing T3 wire-shape regression guards still pass:

```
test tests::ns_register_key_record_includes_node_id ... ok
test tests::ns_register_svc_record_includes_node_id ... ok
```

## Backward compatibility

Existing `ztlp ns register` CLI subcommand unchanged. Existing Chef
cookbook runs unchanged (they call `ztlp ns register` explicitly, not
`ztlp listen --ns-register-name`). This commit is purely additive at the
external surface: a new optional flag and a new failure mode that only
triggers when the new flag is set.
priceflex added a commit that referenced this pull request Jun 2, 2026
* feat(proto): NS self-registration heartbeat in `ztlp listen`

## What

Listener processes now own their own NS presence. When started with
`--ns-server`, `--service-name`, and `--zone`, `ztlp listen` spawns a
background task that publishes its KEY + SVC records to ZTLP-NS at boot
and re-publishes every 8h ± 10min jitter — well below the 24h NS TTL.

External factors (Chef cookbook runs, cron, marker files) are no longer
required to keep a listener reachable via NS lookup.

## Why

Production NS records were silently expiring every 24h because the Chef
cookbook only re-published on hash change, but NS records carry a 24h TTL
(86400s). Real-world impact: many desktops in `tech-rockstars.trs.ztlp`
had KEY records that returned `:not_found` via `Store.lookup` despite
appearing in `:mnesia.foldl` walks — the records had expired and only
their tuples lingered as phantoms.

This shifts liveness-coupled discovery onto the running service itself
(the standard pattern from Consul, etcd, k8s endpoints, Nomad) rather
than relying on external configuration management to remember to
republish.

## Details

- New helper `ns_publish_self(name, zone, identity, ns_server, address)`
  factored out of `cmd_ns_register`. Used by both the CLI subcommand
  (unchanged behavior) and the new heartbeat task.
- New task `ns_heartbeat_task(name, zone, identity, ns_server, address,
  interval, jitter_max)` runs an infinite republish loop with uniform
  random jitter to avoid stampedes on shared NS infrastructure.
- `cmd_listen` spawns the heartbeat task when all three of
  `--ns-server`, `--service-name`, `--zone` are passed. Initial publish
  is synchronous so misconfiguration fails fast at Windows service
  startup; subsequent failures are logged at WARN and retried.
- The existing T3 CBOR-shape regression guards
  (`ns_register_*_record_includes_node_id`) still pass — the wire format
  is unchanged.

Spec: `docs/plans/2026-06-01-ns-self-register-heartbeat.md`

## Tests

5 new unit tests under `proto/src/bin/ztlp-cli.rs` (`mod tests`):

  - `ns_publish_self_validates_name_in_zone` — rejects mismatched
    name/zone before touching the network.
  - `ns_publish_self_constructs_key_packet_with_node_id` — KEY-record
    CBOR includes `algorithm`, `node_id`, `public_key`. Decodes via
    `ciborium` against a captured UDP packet.
  - `ns_publish_self_constructs_svc_packet_when_address_given` — SVC
    record only emitted when `address` is `Some`; second packet has
    `type_byte == 2`.
  - `ns_publish_self_omits_svc_when_no_address` — exactly one register
    packet sent in KEY-only mode.
  - `ns_heartbeat_task_republishes_on_tick` — heartbeat with a 150ms
    test interval emits ≥4 register packets over 700ms (initial + 2
    ticks × 2 records each).

Test helper `spawn_capture_ns` wraps a `tokio::net::UdpSocket` that
captures every packet sent to it and replies `0x06` (register ACK) or
`0x02` (lookup found) so the helper's happy path completes.

## Validation

```
cargo test --bin ztlp ns_     -> 24 passed; 0 failed
cargo test --bin ztlp         -> 59 passed; 0 failed
cargo test --lib              -> 1084 passed; 0 failed; 12 ignored
```

Zero regressions vs. baseline. The 2 existing T3 CBOR-shape guards
(`ns_register_key_record_includes_node_id`,
`ns_register_svc_record_includes_node_id`) still pass — the wire format
is byte-for-byte identical.

## Integration confirmation

`scripts/test-ns-heartbeat-integration.sh` boots the real `ztlp listen`
binary against a Python UDP NS stub on loopback and asserts the wire
behavior end-to-end:

```
[stub_ns] listening on 127.0.0.1:23196
Waiting for initial NS heartbeat...
  ✓ initial publish observed after 2*0.5s
NS stub saw: 4 registers, 2 queries
PASS: heartbeat published KEY+SVC to NS stub

=== listener stderr ===
  ZTLP QUIC server listening on UDP 127.0.0.1:23195
  ✓ NS heartbeat enrolled: test-node.example.ztlp @ 127.0.0.1:23195 -> 127.0.0.1:23196
  → [ns_heartbeat] published test-node.example.ztlp to 127.0.0.1:23196

=== NS stub log ===
  ts opcode=0x09 len=192 from=('127.0.0.1', 40399)  <- KEY
  ts opcode=0x09 len=115 from=('127.0.0.1', 40399)  <- SVC
  ts opcode=0x01 len=26  from=('127.0.0.1', 40399)  <- verify query
  ts opcode=0x09 len=192 from=('127.0.0.1', 39094)  <- KEY (next cycle)
  ts opcode=0x09 len=115 from=('127.0.0.1', 39094)  <- SVC (next cycle)
  ts opcode=0x01 len=26  from=('127.0.0.1', 39094)  <- verify (next cycle)
```

Two heartbeat cycles within 1 second using a test override; production
cadence is 8h ± 10min.

## Follow-up

- Phase 2 (separate PR): delete the Chef cookbook's `ztlp_ns_register`
  resource and the `svc-registered.sha256` marker file. The listener now
  handles its own NS presence; cookbook should only install the binary,
  config, identity, and Windows service.
- Phase 3 (separate PR): same heartbeat pattern in `ztlp agent start` so
  agent processes self-register too.

## TDD note

Strict RED-GREEN-REFACTOR was interrupted mid-cycle by a disk-full
incident (target/ + /tmp filled the root volume), which forced the
working tree to be `git checkout`-restored and the patches re-applied
in a single pass instead of one-at-a-time. The first run of the new
tests still caught real bugs (wrong register opcode 0x05 vs actual
0x09; wrong data-length encoding u32 vs u16) before any production
code changes were made permanent, so the spirit of "watch the test
fail for the right reason before locking the implementation" was
preserved even though the literal ordering was compressed.

* style: cargo fmt

* fix(proto): address CodeRabbit review feedback

## What

CodeRabbit flagged 5 Major findings on PR #91; this commit addresses all of
them. Test counts and integration confirmation are unchanged.

### #6 Require explicit opt-in flag for NS self-registration

`--service-name` has a default of `ztlp-gateway` for relay routing, so the
previous logic auto-activated NS registration on any `--ns-server + --zone`
listener — including listeners that should have stayed under external (Chef)
NS publication. Added a new `--ns-register-name <fqsn>` flag that is the
explicit opt-in:

  - When omitted: listener does NOT touch NS (previous behavior for non-NS
    callers preserved).
  - When set: listener requires all three of `--ns-server`, `--zone`, and
    `--ns-register-name` before spawning the heartbeat.

### #7 Don't publish unspecified bind addresses as SVC

With the default `--bind 0.0.0.0:23095`, `local_addr()` yields
`0.0.0.0:23095` and that exact (unroutable) value was being published as the
SVC record. Now the listener detects unspecified bind via
`listener_addr.ip().is_unspecified()` and publishes KEY only in that case,
with a warning at startup pointing operators at the `--bind` flag.

### #8 Fail listener startup when the initial sync publish fails

The spec said "fails fast at Windows service startup" but the code only
logged a warning and continued. The Err branch now propagates a descriptive
error up through `cmd_listen`, which aborts the listener — leaving the
Windows service event log with a clear "initial NS publish failed" line
instead of a silently-up-but-undiscoverable process.

### #9 Stub NS readiness race in integration test

The integration script's fixed `sleep 0.5` after launching the Python stub
NS races on slow CI runners. Replaced with a readiness probe that polls the
stub's stdout for the "listening on 127.0.0.1:PORT" log line, retries up to
4s, and FAILS LOUDLY if the stub never reports ready (also detects stub
process death mid-startup).

### #10 Integration script never fails when expected log line missing

The wait loop polled for "NS heartbeat enrolled" or "initial NS publish
failed" but had no fallthrough behavior if neither appeared. Added a
`FOUND_MSG` flag and explicit failure path with full listener stderr dump
when the listener hangs or logs something unexpected.

## Why

CodeRabbit reasoning was correct for all five. #6/#7 prevent production
junk records, #8 makes the fail-fast contract real, #9/#10 make the
integration script trustworthy on CI.

## Details

- New CLI flag `--ns-register-name <NS_REGISTER_NAME>` on `ztlp listen`,
  threaded through `Commands::Listen` destructuring → `cmd_listen` signature
  → the heartbeat block.
- `--service-name` no longer plays double duty; it stays a relay-only flag.
- The heartbeat gate now requires all three of (ns_register_name, ns_server,
  zone) to be `Some`. If any one is missing, the heartbeat is not spawned.
- `advertise_addr: Option<String>` propagates Some-only for routable binds;
  `ns_publish_self(..., advertise_addr.as_ref())` cleanly degrades to
  KEY-only when the option is None.
- Spec doc `docs/plans/2026-06-01-ns-self-register-heartbeat.md` updated to
  document the explicit opt-in flag and the KEY-only fallback.
- Integration script updated to use the new flag.

## Validation

```
cargo fmt --check          -> clean
cargo test --bin ztlp ns_  -> 24 passed; 0 failed
scripts/test-ns-heartbeat-integration.sh:
  ✓ stub NS ready after 2*0.1s
  ✓ initial publish observed after 2*0.5s
  NS stub saw: 4 registers, 2 queries
  PASS: heartbeat published KEY+SVC to NS stub
```

The 2 existing T3 wire-shape regression guards still pass:

```
test tests::ns_register_key_record_includes_node_id ... ok
test tests::ns_register_svc_record_includes_node_id ... ok
```

## Backward compatibility

Existing `ztlp ns register` CLI subcommand unchanged. Existing Chef
cookbook runs unchanged (they call `ztlp ns register` explicitly, not
`ztlp listen --ns-register-name`). This commit is purely additive at the
external surface: a new optional flag and a new failure mode that only
triggers when the new flag is set.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant