fix(relay): strict-routing — pick_gateway_for_service returns :error on unknown service#12
Conversation
… service What ---- `ZtlpRelay.GatewayForwarder.pick_gateway_for_service/1` no longer silently round-robins to whichever other tenant is next in the index when the requested service hash/name does not match any registered gateway. It now returns `:error` for explicit unmatched-service intents. The all-zero 16-byte "no preference" sentinel keeps the legacy round-robin fallback for back-compat. Why --- Carry-over Task #2 from the v0.29.3 handoff. The previous fallback was a silent cross-tenant routing footgun: a CLI asking for `gw-test-org-2` could land on `gw-hermese2e-1779353410` and render the wrong tenant's Bootstrap UI when the gateway-index happened to point that way. With 9 tenant gateways in production, this was a security smell on the multi-tenant relay path. Details ------- * relay/lib/ztlp_relay/gateway_forwarder.ex - `handle_call({:pick_gateway_for_service, ...})`: when no live gateway matches the requested service AND the request was not the all-zero sentinel, return `{:reply, :error, state}` directly. - All-zero `<<0::128>>` sentinel still falls through to the original static-then-dynamic round-robin path. - In-code comment documents the v0.29.4 strict-routing contract and cross-references the two call sites (CLIENT_ROUTE in udp_listener and `forward_hello_to_gateway`). * relay/test/ztlp_relay/gateway_forwarder_test.exs (+2 tests, +0 -0 removed; the prior "falls back instead of crashing" test was rewritten to assert the strict :error contract): - "unknown 16-byte hash returns :error and does NOT round-robin" - "unknown service NAME returns :error" (legacy-string caller path) - "all-zero hash falls back to round-robin" (back-compat sanity) Tests ----- * `cd relay && mix test`: 594 passed (was 592; +2 strict tests, the prior fallback test is rewritten not added). * `mix format --check-formatted` clean. * No Rust changes; CLI experience of `:error` is unchanged from the existing no-gateway case (client times out instead of being silently routed to a different tenant). Validation ---------- * RED phase confirmed: the two new strict tests failed pre-fix with the expected silent-round-robin diagnosis (`Got: {:ok, gw_a}`). * GREEN phase confirmed: all 11 tests in the gateway_forwarder suite pass after the implementation tweak. * No callsite changes required: `udp_listener.ex:518` (CLIENT_ROUTE, QUIC α-relay path) already logs `"rejected: no gateway registered for service=..."` on `:error`. `udp_listener.ex:1106` (legacy HELLO `forward_hello_to_gateway`) already drops into half-open on `:error` so the client times out cleanly instead of being silently misrouted. Follow-up --------- * Task #3 (per-zone HMAC `Config.registration_secret/0`) is still the prod-readiness blocker on the relay. * Consider exposing a discrete `{:error, :unknown_service}` reason tuple later if anything wants to distinguish "no service match" from "no gateways at all" at the API layer — not needed yet because current callers just log/timeout in either case.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe gateway forwarder implements stricter routing semantics for service selection, distinguishing an all-zero "no preference" sentinel from explicit service intent. When explicit intent is provided and no matching live gateway exists, the function now returns ChangesStrict Gateway Routing
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ion (#13) What ---- Bumps `relay/mix.exs` `version` from "0.29.3" to "0.29.4" so the OTP release reports `Application.spec(:ztlp_relay, :vsn) == "0.29.4"` and matches the git tag `v0.29.4` already pushed on `main` at `829abdf`. Adds three regression tests under `describe "version reporting (regression pin)"` in `relay/test/ztlp_relay/release_test.exs`: 1. Semver-shape guard on the declared version string. 2. Runtime-vs-declared drift detector — fails if the loaded `:ztlp_relay` application's vsn doesn't match `mix.exs`. 3. Floor guard at "0.29.4" — prevents an accidental down-bump that would silently re-ship the pre-strict-routing relay. Why --- The v0.29.4 strict-routing tag (PR #12, `829abdf`) was cut without bumping `relay/mix.exs`, so a container built directly from the tag reports `vsn = '0.29.3'`. Health checks and on-call diagnostics run `docker exec ztlp-relay /app/bin/ztlp_relay rpc 'Application.spec(...)'` and that value goes straight into status dashboards. It must not lie about which tag is actually running, especially when we have multiple relay versions in the fleet during a rollout. Details ------- - `relay/mix.exs`: single-line version bump. - `relay/test/ztlp_relay/release_test.exs`: new describe block with comments explaining why each assertion exists and how to debug it if it fails. - No source/runtime changes to the relay. The contract on `pick_gateway_for_service/1` from v0.29.4 strict-routing is preserved exactly. Tests ----- - `cd relay && mix test`: 597 passed, 0 failures (was 594 on main; +3 for the new vsn-pin tests). - TDD verified: with `mix.exs` still at "0.29.3", the floor-guard test fails with the expected message: `mix.exs version 0.29.3 is older than the v0.29.4 strict-routing tag` After the bump: 15/15 in `release_test.exs`. Validation ---------- - Locally built `priceflex/ztlp-relay:v0.29.4-vsnbump` from this branch. - Started the container and ran: docker exec ... /app/bin/ztlp_relay rpc \ 'IO.inspect({:vsn, to_string(Application.spec(:ztlp_relay, :vsn))})' Output: `{:vsn, "0.29.4"}` — fix verified end-to-end on the actual deploy artifact shape, not just the unit test. - Local cargo test --lib --release (proto) unchanged: 858 passed. - Local ztlp.net launch tests unchanged: 48 passed. Follow-up --------- - The existing pushed `v0.29.4` tag points at `829abdf`, which still has the pre-bump mix.exs. The bumped version will land on `main` via this PR; Steve to decide whether to cut a fresh tag (v0.29.5) that includes the bump, or to fold this into the next functional release. No rush — the strict-routing fix is the substantive change; this PR is the cosmetic/operational cleanup. - `gateway/mix.exs` and `ns/mix.exs` are still at "0.24.0" — pre-existing drift, NOT addressed here to keep this PR scoped. - `proto/Cargo.toml` is at "0.29.3" — also pre-existing drift on the Rust workspace, NOT addressed here.
… with regression pins (#14) What - Bumps `gateway/mix.exs` and `ns/mix.exs` from 0.24.0 → 0.29.4. - Bumps `proto/Cargo.toml` from 0.29.3 → 0.29.4. - Adds the same "version reporting (regression pin)" describe block to `gateway/test/ztlp_gateway/release_test.exs` and `ns/test/ztlp_ns/release_test.exs` that PR #13 added for relay (semver shape + runtime-vs-declared drift + ≥0.29.4 floor guard). - Adds `proto/tests/version_pin_test.rs` with two equivalent Rust-side pins (parseable semver + ≥0.29.4 floor guard) using the in-crate `ztlp_proto::updater::SemVer` to avoid taking a new external dependency. Why - PR #13 fixed the relay's version-string drift but explicitly left the rest of the tree as follow-up scope (handoff Known Problems #6: "gateway/mix.exs and ns/mix.exs at 0.24.0; proto/Cargo.toml at 0.29.3"). Gateway and NS had been pinned at 0.24.0 for five minor versions, so `Application.spec(:ztlp_gateway, :vsn)` and `Application.spec(:ztlp_ns, :vsn)` were lying about which tag was actually deployed — the exact bug class PR #13 fixed for relay. Same lie, three more components. - Without regression pins on these other components, the same drift can silently recur on the next tag cut. Details - Floor guard uses `Version.compare(declared, "0.29.4") in [:gt, :eq]` (Elixir) / `actual.cmp(&floor)` (Rust) rather than asserting a literal version string. This is deliberate: literal-string assertions become test maintenance burden on every routine bump, whereas a floor guard only fails on accidental down-bumps below the v0.29.4 strict-routing tag. - Runtime-vs-declared drift test catches both directions of the bug PR #13 found: (a) mix.exs bumped but .app cache stale, and (b) tag cut without bumping mix.exs. - Rust-side test re-uses the in-crate `updater::SemVer` parser rather than pulling in the external `semver` crate. This exercises the same code path the self-updater uses, so if that parser ever regresses both will catch it. - mix.exs comment blocks explain the bump inline so future readers see the on-call story at the point of change, not just in the test file or handoff. - No source-code logic changed; this is a version-string + test-only PR. Public API, wire format, and runtime behavior are untouched. Tests (TDD discipline followed) - RED step verified per-component before bump: * gateway: `mix test test/ztlp_gateway/release_test.exs --seed 0` → "mix.exs version 0.24.0 is older than the v0.29.4 strict-routing tag" (1/15 failures, all in the new floor-guard test, exactly as designed). * ns: `mix test test/ztlp_ns/release_test.exs --seed 0` → "mix.exs version 0.24.0 is older than the v0.29.4 strict-routing tag" (1/15 failures, identical message). * proto: `cargo test --test version_pin_test` → "proto/Cargo.toml version 0.29.3 is older than the v0.29.4 strict-routing tag" (1/2 failures, parseable-semver test still green so we know the fixture is sane). - GREEN step after bumps: * gateway: 15/15 in release_test.exs; 835/835 in full `mix test` (seed 1). Note: seed 55290 surfaces two pre-existing test-ordering flakes (TLS port reuse in tls_phase2_test.exs and a GenServer teardown race in crl_server_test.exs) — both also reproduce on plain `main` and are unrelated to this PR. Confirmed by stash + checkout `main`-only files + rerun on the same seed. * ns: 15/15 in release_test.exs; 729/729 in full `mix test`. * proto: 2/2 in version_pin_test; 858/858 in `cargo test --lib --release` (matches the pre-branch baseline). Validation (non-test) - `cargo check --release` clean (31 pre-existing dead-code warnings in `proto/src/bin/ztlp-cli.rs` unchanged — Known Problems #4 scope). - Relay suite re-run unchanged: 597/597 (confirms no cross-component collateral damage). - ztlp.net Python suite unchanged: 48/48 in `tests.test_launch_app`. - Infra untouched. No relay restart, no gateway restart, no NS restart. Live binaries still on v0.29.3 (relay) / v0.24.0 (gateway, ns) — actual deployment of the bumped versions is a separate, Steve-gated step. Follow-up - After merge, decide tag strategy with v0.29.4 already in the past: either (a) cut v0.29.5 that includes this + PR #13's mix.exs bump (cleanest), or (b) accept "tag is source of truth; mix.exs is best-effort" for the trailing v0.29.4 and reset on v0.30.0. - Task #3 (per-zone HMAC `Config.registration_secret/0`) is still the prod-readiness blocker and is the hard dependency for the Bootstrap workstream (handoff §"HARD DEPENDENCY uncovered while locking in #5"). - Known Problems #4 (`cargo fix` pass on 31 dead-code warnings in `proto/src/bin/ztlp-cli.rs`) still open; left out of scope here to keep this PR tight. Refs - PR #12 (v0.29.4 strict-routing): 829abdf - PR #13 (relay mix.exs vsn pin): d22afbf - Handoff: ~/hermes_session_handoff.md "Known Problems #6", "Open Question #1"
…lean v0.29.5 tag (#15) What - Bumps `relay/mix.exs`, `gateway/mix.exs`, `ns/mix.exs` from 0.29.4 → 0.29.5. - Bumps `proto/Cargo.toml` from 0.29.4 → 0.29.5. - No test changes — the regression pins added in PR #13 / PR #14 still catch drift (semver-shape + runtime-vs-declared + ≥0.29.4 floor). Why - v0.29.4 tag was cut at commit `829abdf` BEFORE PR #13 and PR #14 bumped the four declared versions. That means a container built from the v0.29.4 tag reports its runtime vsn as the PRE-bump value, which is exactly the on-call-lying bug PR #13 introduced regression tests for. - The fix is to cut a NEW tag whose commit POST-DATES the bumps. v0.29.5 is the smallest possible bump that achieves this — pure version-string alignment, no functional change. - After this PR merges and v0.29.5 is tagged: git tag == relay vsn == gateway vsn == ns vsn == proto CARGO_PKG_VERSION == "0.29.5". The existing `release_test.exs` runtime-vs-declared drift test, run inside any container built from this tag, will report agreement instead of drift. Details - Floor guard in `release_test.exs` and `version_pin_test.rs` stays at 0.29.4. This is deliberate: the floor is a stable lower bound, not the current version. Ratcheting it on every routine bump is the exact maintenance burden the floor-guard pattern was designed to avoid (see the PR #13/#14 commit messages). 0.29.5 satisfies ≥0.29.4, and an accidental down-bump below 0.29.4 (which would lose the strict-routing fix) is what the floor exists to catch. - mix.exs comment blocks updated in-place to reference both PR #14 and this PR, so the version-line provenance is readable at the point of change without chasing PRs in git log. - No source-code logic changed. Public API, wire format, runtime behavior unchanged. Tests (green on this branch) - `cd proto && cargo test --lib --release`: 858 passed - `cd proto && cargo test --test version_pin_test`: 2 passed - `cd relay && mix test`: 597 passed - `cd gateway && mix test --seed 1`: 835 passed (seed 1 avoids the two pre-existing seed-55290 flakes in tls_phase2_test and crl_server_test — both documented in the handoff as Task #6 and unrelated to this PR.) - `cd ns && mix test`: 729 passed - `cd ztlp.net && python3 -m unittest tests.test_launch_app`: 48 passed - Total: 3,069 tests, 0 failures. The runtime-vs-declared drift test in each release_test.exs was the key verification — it exercises the same `Application.spec/2` RPC that lies on a misbuilt container. All three Elixir suites pass after the bump, which means mix recompiled the .app files to match the new declared version. This is the in-source-tree equivalent of the `docker exec ... rpc 'Application.spec(_, :vsn)'` check that found the original PR #13 bug. Validation - `git diff --stat` matches expectation: 4 files, ~23 insertions, ~19 deletions (the changes are mostly comment-block updates around the one-character version bumps). - No source code changes; no risk of breaking runtime behavior. - No infra touched. Follow-up (post-merge) - Tag `v0.29.5` on main, push tag with openclaw SSH key, watch the Release workflow to green. - Deploy `priceflex/ztlp-relay:v0.29.5` to staging relay (`34.218.240.106`) when Steve OKs the restart (iOS bench risk). A container built from v0.29.5 will report `:vsn = "0.29.5"` correctly, closing out the original PR #13 defect end-to-end. - Task #2 (per-zone HMAC `Config.registration_secret/0`) remains the prod-readiness blocker and hard dependency for the Bootstrap workstream. Refs - PR #12 (v0.29.4 strict-routing): 829abdf - PR #13 (relay mix.exs vsn pin): d22afbf - PR #14 (gateway/ns/proto vsn pin): 0a6fbb6 - Handoff: ~/hermes_session_handoff.md "Active Tasks → Task 0" decision list (Option 1 — cut v0.29.5 immediately).
Summary
Carry-over Task #2 from the v0.29.3 handoff.
ZtlpRelay.GatewayForwarder.pick_gateway_for_service/1no longer silently round-robins to whichever other tenant is next in the index when the requested service hash/name doesn't match a live gateway. It now returns:errorfor explicit unmatched-service intents. The all-zero 16-byte "no preference" sentinel keeps the legacy round-robin fallback for back-compat.Why
Pre-v0.29.4 fallback was a silent cross-tenant routing footgun: a CLI asking for
gw-test-org-2could land ongw-hermese2e-1779353410and render the wrong tenant's Bootstrap UI when the gateway-index happened to point that way. With 9 tenant gateways in production, this was a security smell on the multi-tenant relay path.Changes
relay/lib/ztlp_relay/gateway_forwarder.exhandle_call({:pick_gateway_for_service, ...}): when no live gateway matches the requested service AND the request was not the all-zero sentinel, return{:reply, :error, state}directly.<<0::128>>sentinel still falls through to the original static-then-dynamic round-robin path.Tests
Followed TDD RED → GREEN.
pick_gateway_for_service/1 with a 16-byte hash that does not match any registered service returns :error and does NOT silently round-robin to a different tenantpick_gateway_for_service/1 with an unknown service NAME (string form) also returns :errorpick_gateway_for_service/1 with an all-zero (no-preference) hash falls back to round-robin"...falls back instead of selecting the wrong gateway by accident"test (it tolerated either outcome — now asserts strict:error).Test counts:
cd relay && mix test: 594 passed (was 592; +2 strict tests, prior fallback test rewritten not added).mix format --check-formatted: clean.Validation
Got: {:ok, gw_a}— exactly the silent-round-robin diagnosis we wanted to lock out.udp_listener.ex:518(CLIENT_ROUTE, QUIC α-relay path) already logs"rejected: no gateway registered for service=..."on:error.udp_listener.ex:1106(legacy HELLOforward_hello_to_gateway) already drops into half-open on:error, so the client times out cleanly instead of being silently misrouted.Follow-up
Config.registration_secret/0) is still the prod-readiness blocker on the relay.{:error, :unknown_service}reason tuple later if anything wants to distinguish "no service match" from "no gateways at all" at the API layer — not needed yet because current callers just log/timeout in either case.Summary by CodeRabbit
Bug Fixes