feat: rotation-SNI discovery + rapid-eviction pin set#603
Conversation
cumulative discovery used to stall on multi-backend models because the proxy's least-connections LB collapses fresh-TCP probes onto a stable subset of backends. We worked around this with parallelism (5 calls per new provider, 2 per refresh cycle) and inter-model staggering, but the shape was fundamentally O(luck): some replicas kept missing some backends forever and got TLS-handshake-rejected when the LB later routed them there. The customer-visible symptom is ~42% of /v1/attestation/report calls for GLM-5.1-FP8 failing with "error sending request". model-proxy PR #27 published a deterministic routing knob: rotation SNI '<canonical>-i<N>.<base>' routes to 'healthy_backends_sorted[N % healthy]', and GET /backends/count?domain=<host> reports the current healthy count. This PR rewrites discover_model on top of those two pieces. Per-cycle flow: - Fetch the healthy backend count from /backends/count. - Fan out one fresh-TCP attestation call per backend index, in parallel, no stagger. Each call lands on a distinct backend by construction, so per-backend GPU evidence pressure per cycle is exactly one attestation regardless of how many models refresh together. - Apply the cycle's verified fingerprints to the shared pin set according to apply_pin_update(): * Complete coverage (no failures, verify_failures == 0, distinct observed fingerprints == backend_count): REPLACE the pin set with the observed set. A backend that just went unhealthy or had its cert rotated drops out of the pin set within one refresh interval — rapid eviction. * Anything less: additive merge. A transient hiccup never evicts verified fingerprints we just couldn't reconfirm. Eliminates: - ATTESTATION_DISCOVERY_PARALLELISM (was 5) - CUMULATIVE_DISCOVERY_CALLS (was 2) - STAGGER_MS (intra-model, was 200) - MODEL_DISCOVERY_STAGGER_MS (inter-model, was 2_000) discover_model loses the num_calls parameter. Both call sites (the new-provider phase in load_inference_url_models and the cumulative refresh path) become identical. DiscoveryOutcome gains: - backend_count: healthy count from /backends/count this cycle, 0 if the fetch failed (failure_reasons[0] then carries the reason). - replaced_state: true iff this cycle achieved complete coverage and the pin set was wholesale replaced rather than additively merged. Both fields are surfaced on the existing INFO log lines (initial discovery, cumulative expansion, cumulative no-new-fingerprints) for DD-side observability. URL handling derives the base domain by stripping the leftmost DNS label of the inference URL host. Works for every URL we have today ('*.completions{,-stg}.near.ai'); URLs that don't fit (one-label hosts, IP literals) return an empty outcome with a 'url_parse:' failure reason and the existing fail-closed path handles eviction. Tests: - spki_verifier: replace_with state transitions (Bootstrap->Pinned, Pinned shrink, Blocked->Pinned recovery, empty set). - rotation: 10 URL-helper tests covering canonicals with internal dashes, case insensitivity, port preservation, IP/one-label rejection, count-URL shape. - inference_provider_pool: 8 apply_pin_update policy tests covering steady state, eviction on shrinking count, partial-cycle additive preservation, duplicate-observation safety, verify_failure blocking replacement, zero-count safety, bootstrap first cycle. Followup #600: rotation SNI for chat-completion bucket pre-warm.
Review — rotation-SNI discovery + rapid-evictionReviewed against PR description and CLAUDE.md privacy rules. Strong tests, clean diff (net –160 lines), no privacy-sensitive logging added (only URLs / fingerprints / counts). Critical issuesNone blocking. A few notes worth considering before/after merge: Notable observations1. No upper bound on ```rust If model-proxy ever reports a bogus large count (registry race, mis-deploy, partial split), 2. When proxy authoritatively reports 0 healthy backends, the INFO log shows 3. Test passes 4.
5. No retry on count fetch failure — A transient blip on Looks good
✅ Approved — non-blocking suggestions only. |
There was a problem hiding this comment.
Pull request overview
This PR updates InferenceProviderPool’s attestation discovery to use model-proxy’s rotation-SNI scheme plus a per-cycle /backends/count fetch, enabling deterministic “one probe per backend” coverage and a new pin-update policy that can rapidly evict fingerprints when full coverage is achieved.
Changes:
- Add rotation URL parsing/building + backend-count fetching helpers to drive deterministic per-backend discovery fanout.
- Introduce
apply_pin_updatepolicy: replace the pinned set on complete coverage; otherwise do additive merge. - Extend discovery outcome logging fields (
backend_count,replaced_state) and remove legacy discovery parallelism/stagger constants.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crates/services/src/inference_provider_pool/rotation.rs | New helpers for splitting inference URLs, constructing rotation/count URLs, and fetching healthy backend count. |
| crates/services/src/inference_provider_pool/mod.rs | Rework discover_model to use rotation SNI + count fanout; add pin replacement policy and new outcome fields/logging. |
| crates/services/src/attestation/verification.rs | Remove legacy discovery constants no longer used by rotation-driven discovery. |
| crates/inference_providers/src/spki_verifier.rs | Add FingerprintState::replace_with and tests to support pin-set replacement on complete coverage. |
Comments suppressed due to low confidence (2)
crates/services/src/inference_provider_pool/mod.rs:789
- When
/backends/countreturnsOk(0), the cycle bails out viaempty_outcomewithout recording any failure reason, andbackend_countis also 0 (same value used for "count fetch failed"). For observability it would help to record an explicit reason (e.g.count_zero) and/or propagate the authoritative count into the outcome so downstream logs can distinguish "no healthy backends" from "couldn't fetch count".
let backend_count =
match rotation::fetch_backend_count(&count_client, &parts, COUNT_TIMEOUT).await {
rotation::CountFetch::Ok(0) => {
// Authoritatively no healthy backends right now. Don't issue
// calls; don't replace the pin set (transient registry hiccup
// shouldn't evict verified state). Provider-level fail-closed
// paths handle the no-backend case at request time.
return Self::empty_outcome(&fingerprint_state, 0, failure_reasons);
}
crates/services/src/inference_provider_pool/mod.rs:880
failure_reasonscurrently includesformat!("{category}: {e}")for reqwest send errors.reqwest::Errordisplay strings often embed the full request URL; here that URL includes a randomnoncequery param, which can create extremely high-cardinality INFO logs (and potential cost/ingest issues) whenfailure_reasonsis emitted. Consider redacting the URL (e.g. replacee.url()with<redacted>or drop query params) and logging only stable detail fields at INFO, keeping full error text at DEBUG.
let category = if e.is_connect() {
"connect"
} else if e.is_timeout() {
"send_timeout"
} else if e.is_request() {
"request"
} else {
"send"
};
return Err(format!("{category}: {e}"));
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// `0` means we couldn't get a count (model-proxy unreachable, 404, etc.) | ||
| /// — see `failure_reasons` for the category. `discover_model` returns | ||
| /// without issuing any rotation calls in that case. |
| /// is one of: `count_connect`, `count_timeout`, `count_send`, | ||
| /// `count_status`, `count_decode`, `client_build`, `query_encode`, |
| if complete_coverage { | ||
| let newly_pinned: Vec<String> = verified.difference(&before).cloned().collect(); | ||
| let evicted: Vec<String> = before.difference(verified).cloned().collect(); | ||
| state.replace_with(verified.clone()); | ||
| PinUpdate { | ||
| newly_pinned, | ||
| evicted, | ||
| total_pinned: state.pinned_count(), | ||
| replaced: true, | ||
| } | ||
| } else { | ||
| let mut newly_pinned: Vec<String> = Vec::new(); | ||
| for fp in verified { | ||
| let before_count = state.pinned_count(); | ||
| state.add_fingerprint(fp.clone()); | ||
| if state.pinned_count() > before_count { | ||
| newly_pinned.push(fp.clone()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Code Review
This pull request implements a rotation-SNI scheme for attestation discovery, enabling deterministic routing to backends and more efficient fingerprint management. Key changes include the introduction of a replace_with method for wholesale fingerprint state updates, a new rotation module for URL handling and backend count fetching, and a refactored discovery process that achieves full coverage in a single cycle. Feedback identifies a potential regression in single-backend models where only one signing algorithm would be discovered, potentially breaking end-to-end encryption for some clients. Additionally, an optimization was suggested for the fingerprint merging logic to reduce redundant state checks and simplify the implementation.
| // Step 2: fan out one attestation call per backend index, in | ||
| // parallel (no stagger — each call lands on a distinct backend, so | ||
| // per-backend pressure is exactly one attestation per cycle). | ||
| let futures = (0..backend_count) |
There was a problem hiding this comment.
For models with only a single healthy backend (backend_count == 1), this loop will only issue one call, typically using the ecdsa algorithm. This results in a regression where the ed25519 signing public key is never discovered for single-backend models, breaking E2EE routing for clients requesting that algorithm. Consider ensuring at least two calls are made (alternating algorithms) even when only one backend is reported, as the rotation SNI scheme correctly handles wrapping indices.
let futures = (0..backend_count.max(2))| let mut newly_pinned: Vec<String> = Vec::new(); | ||
| for fp in verified { | ||
| let before_count = state.pinned_count(); | ||
| state.add_fingerprint(fp.clone()); | ||
| if state.pinned_count() > before_count { | ||
| newly_pinned.push(fp.clone()); | ||
| } | ||
| } |
There was a problem hiding this comment.
The additive merge path can be simplified and made more efficient by using the newly_pinned logic already present in the complete_coverage branch. This avoids multiple calls to state.pinned_count() inside the loop, consolidating duplicated logic and ensuring shared data is fetched once before the loop for better performance.
let newly_pinned: Vec<String> = verified.difference(&before).cloned().collect();
for fp in &newly_pinned {
state.add_fingerprint(fp.clone());
}
PinUpdate {
newly_pinned,
evicted: Vec::new(),
total_pinned: state.pinned_count(),
replaced: false,
}References
- To avoid logic drift and simplify maintenance, consolidate duplicated logic into a single source of truth.
- When processing a batch of items, fetch shared data once before the loop and pass it into the processing logic to avoid redundant database queries (N queries for a batch of N items).
Address bot review feedback on #603: - count_zero vs count-fetch-failure are now distinguishable in failure_reasons. Previously both rendered as empty / generic count_*:; now Ok(0) records 'count_zero: proxy reports 0 healthy backends' explicitly. - Sanity-cap rotation fan-out at 256 backends per model per cycle. A bogus registry reading (race during deploy, partial split) would otherwise spawn an unbounded number of fresh-TCP TLS handshakes. Hitting the cap is logged and recorded in failure_reasons. - Strip the request URL from every reqwest error in failure_reasons via Error::without_url(). The URLs embed our random per-call nonce, which would otherwise create unbounded label cardinality in DD when any reqwest error path fires. Full error stays available at DEBUG via the existing debug! lines. - pin_update_verify_failure_blocks_replacement test now uses an input shape that the production caller can actually produce (backend_count=4, verified=3, verify_failures=1). The policy assertion is unchanged.
|
Pushed 786f31c addressing the bot review feedback:
Not addressed (intentional):
|
Why
Cumulative TLS-SPKI discovery used to stall on multi-backend models because model-proxy's least-connections LB collapses our fresh-TCP probes onto a stable subset of backends. We worked around it with parallelism (5 calls per new provider, 2 per refresh cycle) and inter-model staggering, but the shape was fundamentally O(luck): some replicas kept missing some backends forever and got TLS-handshake-rejected when the LB later routed them there.
Customer-visible symptom: ~42 % of
/v1/attestation/report?model=zai-org/GLM-latestcalls fail witherror sending request. Investigation thread in PR #595 + analysis in nearai/model-proxy#27.model-proxy PR #27 (merged, deployed to staging) published two routing knobs:
GET /backends/count?domain=<host>→ reports the current healthy backend count for a domain.<canonical>-i<N>.<base>→ routes tohealthy_backends_sorted_by_address[N % healthy], bypassing the LB.This PR rewrites cloud-api's
discover_modelon top of those two pieces.What changes
Each discovery cycle now:
/backends/count(single round-trip, ~ms, no caching).apply_pin_updatepolicy:verify_failures == 0, distinct observed fingerprints ==backend_count): REPLACE the pin set with the observed set. A backend that just went unhealthy or had its cert rotated drops out within one refresh interval — rapid eviction.What goes away
ATTESTATION_DISCOVERY_PARALLELISM = 5CUMULATIVE_DISCOVERY_CALLS = 2STAGGER_MS = 200(intra-model)MODEL_DISCOVERY_STAGGER_MS = 2_000(inter-model)discover_modelloses itsnum_callsparameter; both call sites (new-provider phase + cumulative refresh) become identical.What's not in this PR (out of scope)
/backends/countcaching across cycles — every cycle re-fetches.Observability
DiscoveryOutcomeadds two fields, surfaced on the existing INFO logs from #595:backend_count: usize— healthy count this cycle (0if the fetch failed; seefailure_reasons).replaced_state: bool— true iff complete coverage was achieved and the pin set was wholesale replaced.DD query to validate rapid eviction once staging is on:
Expect
backend_count == observed_fingerprints.len() == total_pinned,replaced_state: true.URL handling
The base domain is derived per URL by stripping the leftmost DNS label of the host. Works for every URL we have today (
*.completions{,-stg}.near.ai). URLs that don't fit (one-label hosts, IP literals) return an empty outcome with aurl_parse:failure reason; existing fail-closed logic handles them.Tests
spki_verifier: 4 new state-transition tests forreplace_with(Bootstrap → Pinned, Pinned shrink, Blocked → Pinned recovery, empty set).rotation: 10 URL-helper tests covering canonicals with internal dashes, case insensitivity, port preservation, IP-host / one-label rejection, count-URL shape.inference_provider_pool::pin_update_*: 8 policy tests covering steady state, eviction on shrinking count, partial-cycle additive preservation, duplicate-observation safety,verify_failuresblocking replacement, zero-count safety, bootstrap first cycle.511 workspace lib tests pass. Clippy clean.
Rollout
PR #595 already shipped the observability needed to validate this. After merge:
stagingimage to docker hub./v1/attestation/report?model=zai-org/GLM-latestin a loop and confirm 0 failures.Rollback is a normal revert + redeploy. No state migration, no config change.
Follow-ups
MODEL_DISCOVERY_REFRESH_INTERVAL(currently300s) once we have a few days of clean data. The new path is cheap enough that a 60 s interval would give us ~1 min eviction latency at a still-tiny load.