Skip to content

fix(relay): strict-routing — pick_gateway_for_service returns :error on unknown service#12

Merged
priceflex merged 1 commit into
mainfrom
feature/relay-strict-service-routing
May 22, 2026
Merged

fix(relay): strict-routing — pick_gateway_for_service returns :error on unknown service#12
priceflex merged 1 commit into
mainfrom
feature/relay-strict-service-routing

Conversation

@priceflex
Copy link
Copy Markdown
Owner

@priceflex priceflex commented May 22, 2026

Summary

Carry-over Task #2 from the v0.29.3 handoff. ZtlpRelay.GatewayForwarder.pick_gateway_for_service/1 no longer silently round-robins to whichever other tenant is next in the index when the requested service hash/name doesn't match a live gateway. It now returns :error for explicit unmatched-service intents. The all-zero 16-byte "no preference" sentinel keeps the legacy round-robin fallback for back-compat.

Why

Pre-v0.29.4 fallback was a silent cross-tenant routing footgun: a CLI asking for gw-test-org-2 could land on gw-hermese2e-1779353410 and render the wrong tenant's Bootstrap UI when the gateway-index happened to point that way. With 9 tenant gateways in production, this was a security smell on the multi-tenant relay path.

Changes

  • relay/lib/ztlp_relay/gateway_forwarder.ex
    • handle_call({:pick_gateway_for_service, ...}): when no live gateway matches the requested service AND the request was not the all-zero sentinel, return {:reply, :error, state} directly.
    • All-zero <<0::128>> sentinel still falls through to the original static-then-dynamic round-robin path.
    • In-code comment documents the v0.29.4 strict-routing contract and cross-references the two call sites.

Tests

Followed TDD RED → GREEN.

  • New: pick_gateway_for_service/1 with a 16-byte hash that does not match any registered service returns :error and does NOT silently round-robin to a different tenant
  • New: pick_gateway_for_service/1 with an unknown service NAME (string form) also returns :error
  • New: pick_gateway_for_service/1 with an all-zero (no-preference) hash falls back to round-robin
  • Rewrote prior "...falls back instead of selecting the wrong gateway by accident" test (it tolerated either outcome — now asserts strict :error).

Test counts:

  • cd relay && mix test: 594 passed (was 592; +2 strict tests, prior fallback test rewritten not added).
  • mix format --check-formatted: clean.

Validation

  • RED phase confirmed: pre-fix the new strict tests failed with Got: {:ok, gw_a} — exactly the silent-round-robin diagnosis we wanted to lock out.
  • GREEN phase confirmed: post-fix all 11 tests in the suite pass.
  • No call-site changes required:
    • udp_listener.ex:518 (CLIENT_ROUTE, QUIC α-relay path) already logs "rejected: no gateway registered for service=..." on :error.
    • udp_listener.ex:1106 (legacy HELLO forward_hello_to_gateway) already drops into half-open on :error, so the client times out cleanly instead of being silently misrouted.

Follow-up

  • Task docs: automatic documentation maintenance #3 (per-zone HMAC Config.registration_secret/0) is still the prod-readiness blocker on the relay.
  • Consider exposing a discrete {:error, :unknown_service} reason tuple later if anything wants to distinguish "no service match" from "no gateways at all" at the API layer — not needed yet because current callers just log/timeout in either case.

Summary by CodeRabbit

Bug Fixes

  • Enhanced gateway routing to return explicit errors when a specific service request cannot be matched with an available gateway, preventing silent fallback behavior. The system now differentiates between explicit "no preference" selection and specific service requests, ensuring more predictable and secure routing when services are unavailable.

Review Change Stack

… service

What
----
`ZtlpRelay.GatewayForwarder.pick_gateway_for_service/1` no longer silently
round-robins to whichever other tenant is next in the index when the
requested service hash/name does not match any registered gateway. It
now returns `:error` for explicit unmatched-service intents. The
all-zero 16-byte "no preference" sentinel keeps the legacy round-robin
fallback for back-compat.

Why
---
Carry-over Task #2 from the v0.29.3 handoff. The previous fallback was
a silent cross-tenant routing footgun: a CLI asking for `gw-test-org-2`
could land on `gw-hermese2e-1779353410` and render the wrong tenant's
Bootstrap UI when the gateway-index happened to point that way. With
9 tenant gateways in production, this was a security smell on the
multi-tenant relay path.

Details
-------
* relay/lib/ztlp_relay/gateway_forwarder.ex
  - `handle_call({:pick_gateway_for_service, ...})`: when no live
    gateway matches the requested service AND the request was not the
    all-zero sentinel, return `{:reply, :error, state}` directly.
  - All-zero `<<0::128>>` sentinel still falls through to the original
    static-then-dynamic round-robin path.
  - In-code comment documents the v0.29.4 strict-routing contract and
    cross-references the two call sites (CLIENT_ROUTE in udp_listener
    and `forward_hello_to_gateway`).

* relay/test/ztlp_relay/gateway_forwarder_test.exs (+2 tests, +0 -0
  removed; the prior "falls back instead of crashing" test was rewritten
  to assert the strict :error contract):
  - "unknown 16-byte hash returns :error and does NOT round-robin"
  - "unknown service NAME returns :error" (legacy-string caller path)
  - "all-zero hash falls back to round-robin" (back-compat sanity)

Tests
-----
* `cd relay && mix test`: 594 passed (was 592; +2 strict tests, the
  prior fallback test is rewritten not added).
* `mix format --check-formatted` clean.
* No Rust changes; CLI experience of `:error` is unchanged from the
  existing no-gateway case (client times out instead of being silently
  routed to a different tenant).

Validation
----------
* RED phase confirmed: the two new strict tests failed pre-fix with
  the expected silent-round-robin diagnosis (`Got: {:ok, gw_a}`).
* GREEN phase confirmed: all 11 tests in the gateway_forwarder suite
  pass after the implementation tweak.
* No callsite changes required: `udp_listener.ex:518` (CLIENT_ROUTE,
  QUIC α-relay path) already logs `"rejected: no gateway registered for
  service=..."` on `:error`. `udp_listener.ex:1106` (legacy HELLO
  `forward_hello_to_gateway`) already drops into half-open on `:error`
  so the client times out cleanly instead of being silently misrouted.

Follow-up
---------
* Task #3 (per-zone HMAC `Config.registration_secret/0`) is still the
  prod-readiness blocker on the relay.
* Consider exposing a discrete `{:error, :unknown_service}` reason
  tuple later if anything wants to distinguish "no service match" from
  "no gateways at all" at the API layer — not needed yet because
  current callers just log/timeout in either case.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a53bb1cf-745b-4ca7-ad59-743b4a2b6ce7

📥 Commits

Reviewing files that changed from the base of the PR and between 28d1fac and 9f7226d.

📒 Files selected for processing (2)
  • relay/lib/ztlp_relay/gateway_forwarder.ex
  • relay/test/ztlp_relay/gateway_forwarder_test.exs

📝 Walkthrough

Walkthrough

The gateway forwarder implements stricter routing semantics for service selection, distinguishing an all-zero "no preference" sentinel from explicit service intent. When explicit intent is provided and no matching live gateway exists, the function now returns :error instead of silently falling back. Tests verify strict error behavior and legacy fallback for the all-zero sentinel.

Changes

Strict Gateway Routing

Layer / File(s) Summary
Strict routing logic and sentinel detection
relay/lib/ztlp_relay/gateway_forwarder.ex
pick_gateway_for_service/1 introduces all_zero_hash? detection to distinguish the <<0::128>> no-preference sentinel from real service intent. When a non-zero intent has no matching live dynamic gateway, the function now returns :error instead of falling back to any available gateway; the all-zero sentinel retains legacy fallback behavior. Spec formatting and variable assignment formatting are adjusted for clarity.
Test coverage for strict routing semantics
relay/test/ztlp_relay/gateway_forwarder_test.exs
Tests are updated to enforce strict routing: non-matching 16-byte hashes and unknown service names must return :error (preventing silent cross-tenant round-robin), while the all-zero "no preference" sentinel falls back to round-robin behavior by returning {:ok, _}.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A rabbit routes with stricter care,
Where service intent must have its pair—
No silent fallback, code runs clear,
Yet "no preference" finds gateway near.
Sentinel zero lets fate decide,
Which gateway serves the other side! 🛣️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change: implementing strict routing behavior where pick_gateway_for_service returns :error for unknown services instead of falling back silently.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/relay-strict-service-routing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@priceflex priceflex merged commit 829abdf into main May 22, 2026
8 checks passed
@priceflex priceflex deleted the feature/relay-strict-service-routing branch May 22, 2026 17:06
priceflex added a commit that referenced this pull request May 22, 2026
…ion (#13)

What
----
Bumps `relay/mix.exs` `version` from "0.29.3" to "0.29.4" so the OTP
release reports `Application.spec(:ztlp_relay, :vsn) == "0.29.4"` and
matches the git tag `v0.29.4` already pushed on `main` at `829abdf`.

Adds three regression tests under `describe "version reporting (regression pin)"`
in `relay/test/ztlp_relay/release_test.exs`:
  1. Semver-shape guard on the declared version string.
  2. Runtime-vs-declared drift detector — fails if the loaded
     `:ztlp_relay` application's vsn doesn't match `mix.exs`.
  3. Floor guard at "0.29.4" — prevents an accidental down-bump
     that would silently re-ship the pre-strict-routing relay.

Why
---
The v0.29.4 strict-routing tag (PR #12, `829abdf`) was cut without
bumping `relay/mix.exs`, so a container built directly from the tag
reports `vsn = '0.29.3'`. Health checks and on-call diagnostics run
`docker exec ztlp-relay /app/bin/ztlp_relay rpc 'Application.spec(...)'`
and that value goes straight into status dashboards. It must not lie
about which tag is actually running, especially when we have multiple
relay versions in the fleet during a rollout.

Details
-------
- `relay/mix.exs`: single-line version bump.
- `relay/test/ztlp_relay/release_test.exs`: new describe block with
  comments explaining why each assertion exists and how to debug
  it if it fails.
- No source/runtime changes to the relay. The contract on
  `pick_gateway_for_service/1` from v0.29.4 strict-routing is
  preserved exactly.

Tests
-----
- `cd relay && mix test`: 597 passed, 0 failures
  (was 594 on main; +3 for the new vsn-pin tests).
- TDD verified: with `mix.exs` still at "0.29.3", the floor-guard
  test fails with the expected message:
    `mix.exs version 0.29.3 is older than the v0.29.4 strict-routing tag`
  After the bump: 15/15 in `release_test.exs`.

Validation
----------
- Locally built `priceflex/ztlp-relay:v0.29.4-vsnbump` from this branch.
- Started the container and ran:
    docker exec ... /app/bin/ztlp_relay rpc \
      'IO.inspect({:vsn, to_string(Application.spec(:ztlp_relay, :vsn))})'
  Output: `{:vsn, "0.29.4"}` — fix verified end-to-end on the actual
  deploy artifact shape, not just the unit test.
- Local cargo test --lib --release (proto) unchanged: 858 passed.
- Local ztlp.net launch tests unchanged: 48 passed.

Follow-up
---------
- The existing pushed `v0.29.4` tag points at `829abdf`, which still
  has the pre-bump mix.exs. The bumped version will land on `main`
  via this PR; Steve to decide whether to cut a fresh tag (v0.29.5)
  that includes the bump, or to fold this into the next functional
  release. No rush — the strict-routing fix is the substantive change;
  this PR is the cosmetic/operational cleanup.
- `gateway/mix.exs` and `ns/mix.exs` are still at "0.24.0" — pre-existing
  drift, NOT addressed here to keep this PR scoped.
- `proto/Cargo.toml` is at "0.29.3" — also pre-existing drift on the
  Rust workspace, NOT addressed here.
priceflex added a commit that referenced this pull request May 23, 2026
… with regression pins (#14)

What
- Bumps `gateway/mix.exs` and `ns/mix.exs` from 0.24.0 → 0.29.4.
- Bumps `proto/Cargo.toml` from 0.29.3 → 0.29.4.
- Adds the same "version reporting (regression pin)" describe block to
  `gateway/test/ztlp_gateway/release_test.exs` and
  `ns/test/ztlp_ns/release_test.exs` that PR #13 added for relay
  (semver shape + runtime-vs-declared drift + ≥0.29.4 floor guard).
- Adds `proto/tests/version_pin_test.rs` with two equivalent Rust-side pins
  (parseable semver + ≥0.29.4 floor guard) using the in-crate
  `ztlp_proto::updater::SemVer` to avoid taking a new external dependency.

Why
- PR #13 fixed the relay's version-string drift but explicitly left the rest
  of the tree as follow-up scope (handoff Known Problems #6: "gateway/mix.exs
  and ns/mix.exs at 0.24.0; proto/Cargo.toml at 0.29.3"). Gateway and NS had
  been pinned at 0.24.0 for five minor versions, so
  `Application.spec(:ztlp_gateway, :vsn)` and `Application.spec(:ztlp_ns, :vsn)`
  were lying about which tag was actually deployed — the exact bug class
  PR #13 fixed for relay. Same lie, three more components.
- Without regression pins on these other components, the same drift can
  silently recur on the next tag cut.

Details
- Floor guard uses `Version.compare(declared, "0.29.4") in [:gt, :eq]` (Elixir)
  / `actual.cmp(&floor)` (Rust) rather than asserting a literal version string.
  This is deliberate: literal-string assertions become test maintenance burden
  on every routine bump, whereas a floor guard only fails on accidental
  down-bumps below the v0.29.4 strict-routing tag.
- Runtime-vs-declared drift test catches both directions of the bug PR #13
  found: (a) mix.exs bumped but .app cache stale, and (b) tag cut without
  bumping mix.exs.
- Rust-side test re-uses the in-crate `updater::SemVer` parser rather than
  pulling in the external `semver` crate. This exercises the same code path
  the self-updater uses, so if that parser ever regresses both will catch it.
- mix.exs comment blocks explain the bump inline so future readers see the
  on-call story at the point of change, not just in the test file or handoff.
- No source-code logic changed; this is a version-string + test-only PR.
  Public API, wire format, and runtime behavior are untouched.

Tests (TDD discipline followed)
- RED step verified per-component before bump:
  * gateway: `mix test test/ztlp_gateway/release_test.exs --seed 0` →
    "mix.exs version 0.24.0 is older than the v0.29.4 strict-routing tag"
    (1/15 failures, all in the new floor-guard test, exactly as designed).
  * ns: `mix test test/ztlp_ns/release_test.exs --seed 0` →
    "mix.exs version 0.24.0 is older than the v0.29.4 strict-routing tag"
    (1/15 failures, identical message).
  * proto: `cargo test --test version_pin_test` →
    "proto/Cargo.toml version 0.29.3 is older than the v0.29.4 strict-routing
    tag" (1/2 failures, parseable-semver test still green so we know the
    fixture is sane).
- GREEN step after bumps:
  * gateway: 15/15 in release_test.exs; 835/835 in full `mix test` (seed 1).
    Note: seed 55290 surfaces two pre-existing test-ordering flakes
    (TLS port reuse in tls_phase2_test.exs and a GenServer teardown race in
    crl_server_test.exs) — both also reproduce on plain `main` and are
    unrelated to this PR. Confirmed by stash + checkout `main`-only files +
    rerun on the same seed.
  * ns: 15/15 in release_test.exs; 729/729 in full `mix test`.
  * proto: 2/2 in version_pin_test; 858/858 in `cargo test --lib --release`
    (matches the pre-branch baseline).

Validation (non-test)
- `cargo check --release` clean (31 pre-existing dead-code warnings in
  `proto/src/bin/ztlp-cli.rs` unchanged — Known Problems #4 scope).
- Relay suite re-run unchanged: 597/597 (confirms no cross-component
  collateral damage).
- ztlp.net Python suite unchanged: 48/48 in `tests.test_launch_app`.
- Infra untouched. No relay restart, no gateway restart, no NS restart.
  Live binaries still on v0.29.3 (relay) / v0.24.0 (gateway, ns) — actual
  deployment of the bumped versions is a separate, Steve-gated step.

Follow-up
- After merge, decide tag strategy with v0.29.4 already in the past:
  either (a) cut v0.29.5 that includes this + PR #13's mix.exs bump
  (cleanest), or (b) accept "tag is source of truth; mix.exs is
  best-effort" for the trailing v0.29.4 and reset on v0.30.0.
- Task #3 (per-zone HMAC `Config.registration_secret/0`) is still the
  prod-readiness blocker and is the hard dependency for the Bootstrap
  workstream (handoff §"HARD DEPENDENCY uncovered while locking in #5").
- Known Problems #4 (`cargo fix` pass on 31 dead-code warnings in
  `proto/src/bin/ztlp-cli.rs`) still open; left out of scope here to keep
  this PR tight.

Refs
- PR #12 (v0.29.4 strict-routing): 829abdf
- PR #13 (relay mix.exs vsn pin): d22afbf
- Handoff: ~/hermes_session_handoff.md "Known Problems #6", "Open Question #1"
priceflex added a commit that referenced this pull request May 23, 2026
…lean v0.29.5 tag (#15)

What
- Bumps `relay/mix.exs`, `gateway/mix.exs`, `ns/mix.exs` from 0.29.4 → 0.29.5.
- Bumps `proto/Cargo.toml` from 0.29.4 → 0.29.5.
- No test changes — the regression pins added in PR #13 / PR #14 still
  catch drift (semver-shape + runtime-vs-declared + ≥0.29.4 floor).

Why
- v0.29.4 tag was cut at commit `829abdf` BEFORE PR #13 and PR #14 bumped
  the four declared versions. That means a container built from the
  v0.29.4 tag reports its runtime vsn as the PRE-bump value, which is
  exactly the on-call-lying bug PR #13 introduced regression tests for.
- The fix is to cut a NEW tag whose commit POST-DATES the bumps. v0.29.5
  is the smallest possible bump that achieves this — pure version-string
  alignment, no functional change.
- After this PR merges and v0.29.5 is tagged: git tag == relay vsn ==
  gateway vsn == ns vsn == proto CARGO_PKG_VERSION == "0.29.5". The
  existing `release_test.exs` runtime-vs-declared drift test, run inside
  any container built from this tag, will report agreement instead of
  drift.

Details
- Floor guard in `release_test.exs` and `version_pin_test.rs` stays at
  0.29.4. This is deliberate: the floor is a stable lower bound, not the
  current version. Ratcheting it on every routine bump is the exact
  maintenance burden the floor-guard pattern was designed to avoid (see
  the PR #13/#14 commit messages). 0.29.5 satisfies ≥0.29.4, and an
  accidental down-bump below 0.29.4 (which would lose the strict-routing
  fix) is what the floor exists to catch.
- mix.exs comment blocks updated in-place to reference both PR #14 and
  this PR, so the version-line provenance is readable at the point of
  change without chasing PRs in git log.
- No source-code logic changed. Public API, wire format, runtime behavior
  unchanged.

Tests (green on this branch)
- `cd proto && cargo test --lib --release`: 858 passed
- `cd proto && cargo test --test version_pin_test`: 2 passed
- `cd relay && mix test`: 597 passed
- `cd gateway && mix test --seed 1`: 835 passed
  (seed 1 avoids the two pre-existing seed-55290 flakes in tls_phase2_test
  and crl_server_test — both documented in the handoff as Task #6 and
  unrelated to this PR.)
- `cd ns && mix test`: 729 passed
- `cd ztlp.net && python3 -m unittest tests.test_launch_app`: 48 passed
- Total: 3,069 tests, 0 failures.

The runtime-vs-declared drift test in each release_test.exs was the key
verification — it exercises the same `Application.spec/2` RPC that lies
on a misbuilt container. All three Elixir suites pass after the bump,
which means mix recompiled the .app files to match the new declared
version. This is the in-source-tree equivalent of the
`docker exec ... rpc 'Application.spec(_, :vsn)'` check that found the
original PR #13 bug.

Validation
- `git diff --stat` matches expectation: 4 files, ~23 insertions, ~19
  deletions (the changes are mostly comment-block updates around the
  one-character version bumps).
- No source code changes; no risk of breaking runtime behavior.
- No infra touched.

Follow-up (post-merge)
- Tag `v0.29.5` on main, push tag with openclaw SSH key, watch the
  Release workflow to green.
- Deploy `priceflex/ztlp-relay:v0.29.5` to staging relay
  (`34.218.240.106`) when Steve OKs the restart (iOS bench risk). A
  container built from v0.29.5 will report `:vsn = "0.29.5"` correctly,
  closing out the original PR #13 defect end-to-end.
- Task #2 (per-zone HMAC `Config.registration_secret/0`) remains the
  prod-readiness blocker and hard dependency for the Bootstrap workstream.

Refs
- PR #12 (v0.29.4 strict-routing): 829abdf
- PR #13 (relay mix.exs vsn pin): d22afbf
- PR #14 (gateway/ns/proto vsn pin): 0a6fbb6
- Handoff: ~/hermes_session_handoff.md "Active Tasks → Task 0" decision
  list (Option 1 — cut v0.29.5 immediately).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant