
feat: rotation-SNI discovery + rapid-eviction pin set #603

Merged
Evrard-Nil merged 2 commits into main from feat/rotation-discovery
May 18, 2026

@Evrard-Nil
Collaborator

Why

Cumulative TLS-SPKI discovery used to stall on multi-backend models because model-proxy's least-connections LB collapses our fresh-TCP probes onto a stable subset of backends. We worked around it with parallelism (5 calls per new provider, 2 per refresh cycle) and inter-model staggering, but the shape was fundamentally O(luck): some replicas kept missing some backends forever and got TLS-handshake-rejected when the LB later routed them there.

Customer-visible symptom: ~42 % of /v1/attestation/report?model=zai-org/GLM-latest calls fail with error sending request. Investigation thread in PR #595 + analysis in nearai/model-proxy#27.

model-proxy PR #27 (merged, deployed to staging) published two routing knobs:

  • GET /backends/count?domain=<host> → reports the current healthy backend count for a domain.
  • Synthetic SNI <canonical>-i<N>.<base> → routes to healthy_backends_sorted_by_address[N % healthy], bypassing the LB.
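
The index rule in the second knob can be modelled in a few lines. This is a minimal sketch of the selection rule as described above, not model-proxy's actual code; the function names and the backend addresses used in examples are hypothetical.

```rust
/// Hypothetical model of the proxy-side rule described above: rotation index
/// `n` selects healthy_backends_sorted_by_address[n % healthy].
fn backend_for_index<'a>(healthy_sorted: &[&'a str], n: usize) -> Option<&'a str> {
    if healthy_sorted.is_empty() {
        return None; // no healthy backends, so there is nothing to route to
    }
    Some(healthy_sorted[n % healthy_sorted.len()])
}

/// Backends reached by probing indices 0..count, in probe order.
fn probe_targets<'a>(healthy_sorted: &[&'a str], count: usize) -> Vec<&'a str> {
    (0..count)
        .filter_map(|n| backend_for_index(healthy_sorted, n))
        .collect()
}
```

With three healthy backends, probing indices 0, 1 and 2 reaches each backend exactly once, and index 4 wraps around to the second backend in address order. The one-probe-per-backend property only holds while the healthy set stays unchanged between the count fetch and the probes.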

This PR rewrites cloud-api's discover_model on top of those two pieces.

What changes

Each discovery cycle now:

  1. Fetches the healthy backend count from /backends/count (single round-trip, ~ms, no caching).
  2. Fans out one fresh-TCP attestation call per backend index, in parallel, no stagger. Each call lands on a distinct backend by construction.
  3. Applies the verified fingerprints with the new apply_pin_update policy:
    • Complete coverage (no failures, verify_failures == 0, distinct observed fingerprints == backend_count): REPLACE the pin set with the observed set. A backend that just went unhealthy or had its cert rotated drops out within one refresh interval — rapid eviction.
    • Anything less: additive merge. A transient hiccup never evicts verified fingerprints we just couldn't reconfirm this cycle.
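
The replace-or-merge policy above can be sketched as follows. This is a simplified illustration, not the PR's implementation: it uses a plain `HashSet<String>` where the real code works on a fingerprint state with Bootstrap, Pinned and Blocked variants, and the `fp_set` helper is hypothetical. The `PinUpdate` field names and the coverage condition follow the snippets and review notes quoted later in this thread.

```rust
use std::collections::HashSet;

/// Summary of one pin update.
#[derive(Debug, PartialEq)]
struct PinUpdate {
    newly_pinned: Vec<String>,
    evicted: Vec<String>,
    total_pinned: usize,
    replaced: bool,
}

/// Hypothetical test helper: build a fingerprint set from string literals.
fn fp_set(xs: &[&str]) -> HashSet<String> {
    xs.iter().map(|s| s.to_string()).collect()
}

/// Simplified policy: replace on complete coverage, otherwise merge additively.
fn apply_pin_update(
    pinned: &mut HashSet<String>,
    verified: &HashSet<String>,
    backend_count: usize,
    failed_calls: usize,
    verify_failures: usize,
) -> PinUpdate {
    // Complete coverage: every backend answered, every answer verified,
    // and the distinct fingerprints account for every backend.
    let complete_coverage = failed_calls == 0
        && verify_failures == 0
        && backend_count > 0
        && verified.len() == backend_count;

    let mut newly_pinned: Vec<String> = verified.difference(&*pinned).cloned().collect();
    newly_pinned.sort(); // stable order for logging and tests

    let mut evicted: Vec<String> = Vec::new();
    if complete_coverage {
        evicted = pinned.difference(verified).cloned().collect();
        evicted.sort();
        *pinned = verified.clone(); // wholesale replacement (rapid eviction)
    } else {
        pinned.extend(verified.iter().cloned()); // additive merge, never evicts
    }

    PinUpdate {
        newly_pinned,
        evicted,
        total_pinned: pinned.len(),
        replaced: complete_coverage,
    }
}
```

For example, starting from pins `{a, b, c}`, a clean cycle that observes `{a, b}` with `backend_count == 2` evicts `c`, while a cycle with one failed call that observes `{a, d}` only adds `d`.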

What goes away

  • ATTESTATION_DISCOVERY_PARALLELISM = 5
  • CUMULATIVE_DISCOVERY_CALLS = 2
  • STAGGER_MS = 200 (intra-model)
  • MODEL_DISCOVERY_STAGGER_MS = 2_000 (inter-model)

discover_model loses its num_calls parameter; both call sites (new-provider phase + cumulative refresh) become identical.

What's not in this PR (out of scope)

  • No feature flag.
  • No env-var override for the base domain — it's derived from each inference URL's host.
  • No /backends/count caching across cycles — every cycle re-fetches.
  • No bucket-warm rotation — that's issue #600.

Observability

DiscoveryOutcome adds two fields, surfaced on the existing INFO logs from #595:

  • backend_count: usize — healthy count this cycle (0 if the fetch failed; see failure_reasons).
  • replaced_state: bool — true iff complete coverage was achieved and the pin set was wholesale replaced.

DD query to validate rapid eviction once staging is on:

service:cloud-api host:*-staging
@fields.message:"Cumulative discovery cycle (no new fingerprints)"
@fields.model:zai-org/GLM-5.1-FP8

Expect backend_count == observed_fingerprints.len() == total_pinned, replaced_state: true.

URL handling

The base domain is derived per URL by stripping the leftmost DNS label of the host. Works for every URL we have today (*.completions{,-stg}.near.ai). URLs that don't fit (one-label hosts, IP literals) return an empty outcome with a url_parse: failure reason; existing fail-closed logic handles them.
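
A minimal sketch of that derivation, assuming the host has already been extracted from the inference URL. The helper names `split_host` and `rotation_host` are hypothetical, lowercasing is an assumed way to get case insensitivity, and port handling is omitted.

```rust
use std::net::IpAddr;

/// Hypothetical helper: split a host into (canonical, base) by stripping the
/// leftmost DNS label. IP literals and one-label hosts are rejected with a
/// `url_parse:` reason, mirroring the behaviour described above.
fn split_host(host: &str) -> Result<(String, String), String> {
    let host = host.to_ascii_lowercase();
    // IPv6 literals are bracketed inside URLs, so strip the brackets first.
    let bare = host.trim_start_matches('[').trim_end_matches(']');
    if bare.parse::<IpAddr>().is_ok() {
        return Err(format!("url_parse: IP literal host {host}"));
    }
    match host.split_once('.') {
        Some((canonical, base)) if !canonical.is_empty() && !base.is_empty() => {
            Ok((canonical.to_string(), base.to_string()))
        }
        _ => Err(format!("url_parse: one-label host {host}")),
    }
}

/// Build the synthetic SNI `<canonical>-i<N>.<base>` for rotation index `index`.
fn rotation_host(host: &str, index: usize) -> Result<String, String> {
    let (canonical, base) = split_host(host)?;
    Ok(format!("{canonical}-i{index}.{base}"))
}
```

Under these assumptions, a made-up host `glm-5.completions.near.ai` with index 3 becomes `glm-5-i3.completions.near.ai`; internal dashes in the canonical label are left untouched.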

Tests

  • spki_verifier: 4 new state-transition tests for replace_with (Bootstrap → Pinned, Pinned shrink, Blocked → Pinned recovery, empty set).
  • rotation: 10 URL-helper tests covering canonicals with internal dashes, case insensitivity, port preservation, IP-host / one-label rejection, count-URL shape.
  • inference_provider_pool::pin_update_*: 8 policy tests covering steady state, eviction on shrinking count, partial-cycle additive preservation, duplicate-observation safety, verify_failures blocking replacement, zero-count safety, bootstrap first cycle.

511 workspace lib tests pass. Clippy clean.

Rollout

PR #595 already shipped the observability needed to validate this. After merge:

  1. Build pushes staging image to docker hub.
  2. Existing rolling deploy applies it.
  3. Verify per the DD query above on each replica.
  4. Smoke /v1/attestation/report?model=zai-org/GLM-latest in a loop and confirm 0 failures.

Rollback is a normal revert + redeploy. No state migration, no config change.

Follow-ups

  • #600 — apply the same rotation pattern to bucket pre-warm.
  • Lower MODEL_DISCOVERY_REFRESH_INTERVAL (currently 300s) once we have a few days of clean data. The new path is cheap enough that a 60 s interval would give us ~1 min eviction latency at a still-tiny load.

cumulative discovery used to stall on multi-backend models because the
proxy's least-connections LB collapses fresh-TCP probes onto a stable
subset of backends. We worked around this with parallelism (5 calls per
new provider, 2 per refresh cycle) and inter-model staggering, but the
shape was fundamentally O(luck): some replicas kept missing some
backends forever and got TLS-handshake-rejected when the LB later
routed them there. The customer-visible symptom is ~42% of
/v1/attestation/report calls for GLM-5.1-FP8 failing with
"error sending request".

model-proxy PR #27 published a deterministic routing knob: rotation SNI
'<canonical>-i<N>.<base>' routes to 'healthy_backends_sorted[N % healthy]',
and GET /backends/count?domain=<host> reports the current healthy count.
This PR rewrites discover_model on top of those two pieces.

Per-cycle flow:
- Fetch the healthy backend count from /backends/count.
- Fan out one fresh-TCP attestation call per backend index, in parallel,
  no stagger. Each call lands on a distinct backend by construction, so
  per-backend GPU evidence pressure per cycle is exactly one attestation
  regardless of how many models refresh together.
- Apply the cycle's verified fingerprints to the shared pin set
  according to apply_pin_update():
    * Complete coverage (no failures, verify_failures == 0, distinct
      observed fingerprints == backend_count): REPLACE the pin set with
      the observed set. A backend that just went unhealthy or had its
      cert rotated drops out of the pin set within one refresh
      interval — rapid eviction.
    * Anything less: additive merge. A transient hiccup never evicts
      verified fingerprints we just couldn't reconfirm.

Eliminates:
- ATTESTATION_DISCOVERY_PARALLELISM (was 5)
- CUMULATIVE_DISCOVERY_CALLS (was 2)
- STAGGER_MS (intra-model, was 200)
- MODEL_DISCOVERY_STAGGER_MS (inter-model, was 2_000)

discover_model loses the num_calls parameter. Both call sites (the
new-provider phase in load_inference_url_models and the cumulative
refresh path) become identical.

DiscoveryOutcome gains:
- backend_count: healthy count from /backends/count this cycle, 0 if
  the fetch failed (failure_reasons[0] then carries the reason).
- replaced_state: true iff this cycle achieved complete coverage and
  the pin set was wholesale replaced rather than additively merged.

Both fields are surfaced on the existing INFO log lines (initial
discovery, cumulative expansion, cumulative no-new-fingerprints) for
DD-side observability.

URL handling derives the base domain by stripping the leftmost DNS
label of the inference URL host. Works for every URL we have today
('*.completions{,-stg}.near.ai'); URLs that don't fit (one-label hosts,
IP literals) return an empty outcome with a 'url_parse:' failure
reason and the existing fail-closed path handles eviction.

Tests:
- spki_verifier: replace_with state transitions (Bootstrap->Pinned,
  Pinned shrink, Blocked->Pinned recovery, empty set).
- rotation: 10 URL-helper tests covering canonicals with internal
  dashes, case insensitivity, port preservation, IP/one-label
  rejection, count-URL shape.
- inference_provider_pool: 8 apply_pin_update policy tests covering
  steady state, eviction on shrinking count, partial-cycle additive
  preservation, duplicate-observation safety, verify_failure
  blocking replacement, zero-count safety, bootstrap first cycle.

Followup #600: rotation SNI for chat-completion bucket pre-warm.
Copilot AI review requested due to automatic review settings May 18, 2026 15:39
Evrard-Nil temporarily deployed to Cloud API test env May 18, 2026 15:39 with GitHub Actions (Inactive)
@claude

claude Bot commented May 18, 2026

Review — rotation-SNI discovery + rapid-eviction

Reviewed against PR description and CLAUDE.md privacy rules. Strong tests, clean diff (net –160 lines), no privacy-sensitive logging added (only URLs / fingerprints / counts).

Critical issues

None blocking. A few notes worth considering before/after merge:

Notable observations

1. No upper bound on backend_count fan-out (crates/services/src/inference_provider_pool/mod.rs:817)

```rust
let futures = (0..backend_count).map(|i| { ... })
```

If model-proxy ever reports a bogus large count (registry race, mis-deploy, partial split), discover_model will spawn that many parallel reqwest::Clients + TLS handshakes at once, for every model, on every refresh. Trust boundary is small (proxy is internal), but a sanity cap (e.g. min(backend_count, 256)) would be cheap defense-in-depth and bound the worst case.
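
A minimal sketch of such a cap. The value 256 comes from the suggestion above and the constant name matches the one the author reports in the follow-up commit later in this thread; the function name and return shape are hypothetical.

```rust
/// Upper bound on rotation probes per model per cycle.
const MAX_ROTATION_FANOUT: usize = 256;

/// Returns how many probes to issue and whether the reported count was capped.
/// The caller can use the flag to log a warning and record a failure reason.
fn capped_fanout(reported: usize) -> (usize, bool) {
    (reported.min(MAX_ROTATION_FANOUT), reported > MAX_ROTATION_FANOUT)
}
```

Returning the flag alongside the count keeps the decision in one place. A capped cycle cannot reach complete coverage, since fewer fingerprints are observed than the reported backend count, so it falls through to the additive merge.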

2. CountFetch::Ok(0) vs error are only distinguishable via failure_reasons (discover_model, lines 778–793)

When proxy authoritatively reports 0 healthy backends, the INFO log shows backend_count=0, failure_reasons=[]. When the count fetch failed, the same log shows backend_count=0, failure_reasons=["count_*: ..."]. Distinguishable but subtle — a DD query for "model has zero backends right now" needs to match on empty failure_reasons. Probably fine, just worth knowing.

3. pin_update_verify_failure_blocks_replacement test inputs are logically impossible (mod.rs:3133)

Test passes backend_count=4, verify_failures=1, observed=4 unique fps. In practice the cycle fans out exactly backend_count calls, so 4 backends + 1 verify failure = at most 3 successful verifications. The policy assertion is correct (verify_failures > 0 ⇒ no replace regardless), but the test scenario doesn't reflect what discover_model actually produces. Consider observed=["a","b","c"], backend_count=4, verify_failures=1 for realism — the assertion still holds.

4. replace_with(empty) is dead in production (spki_verifier.rs:74)

apply_pin_update only calls replace_with when complete_coverage is true, which requires backend_count > 0 && verified.len() == backend_count. So verified is never empty at the call site, and test_replace_with_empty_set_is_permitted exercises a path the production caller can't reach. The defensive doc + test are fine but worth noting if future refactors lean on the "callers may pass empty" guarantee.

5. No retry on count fetch failure (rotation::fetch_backend_count)

A transient blip on /backends/count (3s timeout, single attempt) skips the entire cycle. Existing pin state is preserved (additive path is the floor), so this fails safely — but with the new wholesale-replacement design, multiple consecutive count failures across cycles delay rapid eviction. Acceptable given 300s refresh interval and the issue surfaces in failure_reasons for observability.

Looks good

  • Lock discipline in apply_pin_update (single write-lock for read-then-write — no TOCTOU)
  • Coverage policy is correctly gated on failed_calls == 0 && verify_failures == 0 && backend_count > 0 && verified.len() == backend_count — duplicate observations from registry race fall through to additive (correctly tested)
  • Privacy: new INFO logs only carry infra metadata (URL, fingerprint, index, counts). Compliant with CLAUDE.md
  • Backward compatibility: rollout/rollback is a normal revert; no state migration, no config change
  • Bootstrap → Pinned → (replace) → Pinned and Blocked → Pinned recovery paths are explicitly tested

✅ Approved — non-blocking suggestions only.

Contributor

Copilot AI left a comment


Pull request overview

This PR updates InferenceProviderPool’s attestation discovery to use model-proxy’s rotation-SNI scheme plus a per-cycle /backends/count fetch, enabling deterministic “one probe per backend” coverage and a new pin-update policy that can rapidly evict fingerprints when full coverage is achieved.

Changes:

  • Add rotation URL parsing/building + backend-count fetching helpers to drive deterministic per-backend discovery fanout.
  • Introduce apply_pin_update policy: replace the pinned set on complete coverage; otherwise do additive merge.
  • Extend discovery outcome logging fields (backend_count, replaced_state) and remove legacy discovery parallelism/stagger constants.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| crates/services/src/inference_provider_pool/rotation.rs | New helpers for splitting inference URLs, constructing rotation/count URLs, and fetching healthy backend count. |
| crates/services/src/inference_provider_pool/mod.rs | Rework discover_model to use rotation SNI + count fanout; add pin replacement policy and new outcome fields/logging. |
| crates/services/src/attestation/verification.rs | Remove legacy discovery constants no longer used by rotation-driven discovery. |
| crates/inference_providers/src/spki_verifier.rs | Add FingerprintState::replace_with and tests to support pin-set replacement on complete coverage. |
Comments suppressed due to low confidence (2)

crates/services/src/inference_provider_pool/mod.rs:789

  • When /backends/count returns Ok(0), the cycle bails out via empty_outcome without recording any failure reason, and backend_count is also 0 (same value used for "count fetch failed"). For observability it would help to record an explicit reason (e.g. count_zero) and/or propagate the authoritative count into the outcome so downstream logs can distinguish "no healthy backends" from "couldn't fetch count".
        let backend_count =
            match rotation::fetch_backend_count(&count_client, &parts, COUNT_TIMEOUT).await {
                rotation::CountFetch::Ok(0) => {
                    // Authoritatively no healthy backends right now. Don't issue
                    // calls; don't replace the pin set (transient registry hiccup
                    // shouldn't evict verified state). Provider-level fail-closed
                    // paths handle the no-backend case at request time.
                    return Self::empty_outcome(&fingerprint_state, 0, failure_reasons);
                }

crates/services/src/inference_provider_pool/mod.rs:880

  • failure_reasons currently includes format!("{category}: {e}") for reqwest send errors. reqwest::Error display strings often embed the full request URL; here that URL includes a random nonce query param, which can create extremely high-cardinality INFO logs (and potential cost/ingest issues) when failure_reasons is emitted. Consider redacting the URL (e.g. replace e.url() with <redacted> or drop query params) and logging only stable detail fields at INFO, keeping full error text at DEBUG.
                            let category = if e.is_connect() {
                                "connect"
                            } else if e.is_timeout() {
                                "send_timeout"
                            } else if e.is_request() {
                                "request"
                            } else {
                                "send"
                            };
                            return Err(format!("{category}: {e}"));
                        }


Comment on lines +47 to +49
/// `0` means we couldn't get a count (model-proxy unreachable, 404, etc.)
/// — see `failure_reasons` for the category. `discover_model` returns
/// without issuing any rotation calls in that case.
Comment on lines +72 to +73
/// is one of: `count_connect`, `count_timeout`, `count_send`,
/// `count_status`, `count_decode`, `client_build`, `query_encode`,
Comment on lines +137 to +155
if complete_coverage {
    let newly_pinned: Vec<String> = verified.difference(&before).cloned().collect();
    let evicted: Vec<String> = before.difference(verified).cloned().collect();
    state.replace_with(verified.clone());
    PinUpdate {
        newly_pinned,
        evicted,
        total_pinned: state.pinned_count(),
        replaced: true,
    }
} else {
    let mut newly_pinned: Vec<String> = Vec::new();
    for fp in verified {
        let before_count = state.pinned_count();
        state.add_fingerprint(fp.clone());
        if state.pinned_count() > before_count {
            newly_pinned.push(fp.clone());
        }
    }
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a rotation-SNI scheme for attestation discovery, enabling deterministic routing to backends and more efficient fingerprint management. Key changes include the introduction of a replace_with method for wholesale fingerprint state updates, a new rotation module for URL handling and backend count fetching, and a refactored discovery process that achieves full coverage in a single cycle. Feedback identifies a potential regression in single-backend models where only one signing algorithm would be discovered, potentially breaking end-to-end encryption for some clients. Additionally, an optimization was suggested for the fingerprint merging logic to reduce redundant state checks and simplify the implementation.

// Step 2: fan out one attestation call per backend index, in
// parallel (no stagger — each call lands on a distinct backend, so
// per-backend pressure is exactly one attestation per cycle).
let futures = (0..backend_count)
Contributor


high

For models with only a single healthy backend (backend_count == 1), this loop will only issue one call, typically using the ecdsa algorithm. This results in a regression where the ed25519 signing public key is never discovered for single-backend models, breaking E2EE routing for clients requesting that algorithm. Consider ensuring at least two calls are made (alternating algorithms) even when only one backend is reported, as the rotation SNI scheme correctly handles wrapping indices.

        let futures = (0..backend_count.max(2))

Comment on lines +148 to +155
let mut newly_pinned: Vec<String> = Vec::new();
for fp in verified {
    let before_count = state.pinned_count();
    state.add_fingerprint(fp.clone());
    if state.pinned_count() > before_count {
        newly_pinned.push(fp.clone());
    }
}
Contributor


medium

The additive merge path can be simplified and made more efficient by using the newly_pinned logic already present in the complete_coverage branch. This avoids multiple calls to state.pinned_count() inside the loop, consolidating duplicated logic and ensuring shared data is fetched once before the loop for better performance.

        let newly_pinned: Vec<String> = verified.difference(&before).cloned().collect();
        for fp in &newly_pinned {
            state.add_fingerprint(fp.clone());
        }
        PinUpdate {
            newly_pinned,
            evicted: Vec::new(),
            total_pinned: state.pinned_count(),
            replaced: false,
        }
References
  1. To avoid logic drift and simplify maintenance, consolidate duplicated logic into a single source of truth.
  2. When processing a batch of items, fetch shared data once before the loop and pass it into the processing logic to avoid redundant database queries (N queries for a batch of N items).

Address bot review feedback on #603:

- count_zero vs count-fetch-failure are now distinguishable in
  failure_reasons. Previously both rendered as empty / generic
  count_*:; now Ok(0) records 'count_zero: proxy reports 0 healthy
  backends' explicitly.

- Sanity-cap rotation fan-out at 256 backends per model per cycle.
  A bogus registry reading (race during deploy, partial split) would
  otherwise spawn an unbounded number of fresh-TCP TLS handshakes.
  Hitting the cap is logged and recorded in failure_reasons.

- Strip the request URL from every reqwest error in failure_reasons
  via Error::without_url(). The URLs embed our random per-call nonce,
  which would otherwise create unbounded label cardinality in DD when
  any reqwest error path fires. Full error stays available at DEBUG
  via the existing debug! lines.

- pin_update_verify_failure_blocks_replacement test now uses an input
  shape that the production caller can actually produce
  (backend_count=4, verified=3, verify_failures=1). The policy
  assertion is unchanged.
Evrard-Nil temporarily deployed to Cloud API test env May 18, 2026 15:51 with GitHub Actions (Inactive)
@Evrard-Nil
Collaborator Author

Pushed 786f31c addressing the bot review feedback:

  • count_zero vs count-fetch-failure: Ok(0) now records count_zero: proxy reports 0 healthy backends in failure_reasons so DD can distinguish it from a count-fetch error (which carries a count_*: reason). (copilot #1 / claude #2)
  • Sanity cap on fan-out: hard-capped at MAX_ROTATION_FANOUT = 256 per model per cycle. A bogus registry reading would otherwise spawn an unbounded fan-out of fresh-TCP TLS handshakes; hitting the cap warns + records count_capped in failure_reasons. (claude #1)
  • URL redaction in failure_reasons: every reqwest::Error formatted into failure_reasons now goes through Error::without_url(). The URLs embed our per-call random nonce, which would have created unbounded DD label cardinality on any failing path. Full error remains available at DEBUG via the existing debug! lines. (copilot #2)
  • Realistic test inputs: pin_update_verify_failure_blocks_replacement now uses backend_count=4, verified=3, verify_failures=1, which is what the production caller can actually produce. The policy assertion is unchanged. (claude #3)
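
To illustrate how the reason categories could be told apart downstream, here is a hedged sketch that classifies a single failure_reasons entry by its prefix. The enum and function are hypothetical, and the exact text after each category (including whether count_capped carries a colon-separated detail) is an assumption.

```rust
/// Hypothetical classification of one failure_reasons entry.
#[derive(Debug, PartialEq)]
enum CountSignal {
    Zero,        // proxy authoritatively reported 0 healthy backends
    Capped,      // reported count exceeded the fan-out cap
    FetchFailed, // count_connect, count_timeout, count_status, ...
    Other,       // per-call failures such as connect or send_timeout
}

fn classify_reason(reason: &str) -> CountSignal {
    // The category is taken to be everything before the first ':'.
    let category = reason.split(':').next().unwrap_or("").trim();
    match category {
        "count_zero" => CountSignal::Zero,
        "count_capped" => CountSignal::Capped,
        c if c.starts_with("count_") => CountSignal::FetchFailed,
        _ => CountSignal::Other,
    }
}
```

The specific categories are matched before the generic `count_` prefix, so an authoritative zero is never mistaken for a fetch failure.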

Not addressed (intentional):

  • No retry on count fetch failure (claude #5): the cycle fails safely (additive path is the floor, existing pin set preserved), and the failure surfaces in failure_reasons. Existing 5-min refresh interval is the retry. Worth revisiting separately if we lower the interval.
  • replace_with(empty) is unreachable from apply_pin_update (claude #4): defensive doc + test stay. They have a small footprint and are useful insurance for refactors.
