feat: rotation-SNI discovery + rapid-eviction pin set by Evrard-Nil · Pull Request #603 · nearai/cloud-api

Evrard-Nil · 2026-05-18T15:39:28Z

Why

Cumulative TLS-SPKI discovery used to stall on multi-backend models because model-proxy's least-connections LB collapses our fresh-TCP probes onto a stable subset of backends. We worked around it with parallelism (5 calls per new provider, 2 per refresh cycle) and inter-model staggering, but the shape was fundamentally O(luck): some replicas kept missing some backends forever and got TLS-handshake-rejected when the LB later routed them there.

Customer-visible symptom: ~42% of /v1/attestation/report?model=zai-org/GLM-latest calls fail with "error sending request". Investigation thread in PR #595; analysis in nearai/model-proxy#27.

model-proxy PR #27 (merged, deployed to staging) published two routing knobs:

GET /backends/count?domain=<host> → reports the current healthy backend count for a domain.
Synthetic SNI <canonical>-i<N>.<base> → routes to healthy_backends_sorted_by_address[N % healthy], bypassing the LB.
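The two knobs can be sketched in Rust under stated assumptions: `rotation_host` and `pick_backend` are hypothetical names, and backend addresses are modeled as plain strings sorted lexicographically, which may differ from how model-proxy actually orders them. What the sketch illustrates is that indices `0..healthy` land on distinct backends, while larger indices wrap around.

```rust
/// Hypothetical helper: build the synthetic SNI host `<canonical>-i<N>.<base>`.
pub fn rotation_host(canonical: &str, index: usize, base: &str) -> String {
    format!("{canonical}-i{index}.{base}")
}

/// Hypothetical model of the proxy-side selection: sort the healthy backends
/// by address, then pick `sorted[index % healthy]`.
/// Returns None when no backend is healthy (the modulo would be undefined).
pub fn pick_backend<'a>(healthy: &[&'a str], index: usize) -> Option<&'a str> {
    if healthy.is_empty() {
        return None;
    }
    let mut sorted: Vec<&'a str> = healthy.to_vec();
    sorted.sort_unstable(); // assumption: plain lexicographic address order
    Some(sorted[index % sorted.len()])
}
```

Probing `0..healthy` therefore touches every backend exactly once per cycle, which is the property the rest of this PR relies on.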

This PR rewrites cloud-api's discover_model on top of those two pieces.

What changes

Each discovery cycle now:

Fetches the healthy backend count from /backends/count (single round-trip, ~ms, no caching).
Fans out one fresh-TCP attestation call per backend index, in parallel, no stagger. Each call lands on a distinct backend by construction.
Applies the verified fingerprints with the new apply_pin_update policy:
- Complete coverage (no failures, verify_failures == 0, distinct observed fingerprints == backend_count): REPLACE the pin set with the observed set. A backend that just went unhealthy or had its cert rotated drops out within one refresh interval — rapid eviction.
- Anything less: additive merge. A transient hiccup never evicts verified fingerprints we just couldn't reconfirm this cycle.
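A minimal sketch of the replace-vs-merge policy, assuming a plain `HashSet<String>` stands in for the real shared `FingerprintState` and its lock. The coverage predicate matches the one quoted in the review below (`failed_calls == 0 && verify_failures == 0 && backend_count > 0 && verified.len() == backend_count`); the function signature is hypothetical.

```rust
use std::collections::HashSet;

/// Outcome of one cycle's pin update (field names mirror the PR's PinUpdate).
#[derive(Debug)]
pub struct PinUpdate {
    pub newly_pinned: Vec<String>,
    pub evicted: Vec<String>,
    pub total_pinned: usize,
    pub replaced: bool,
}

/// Sketch of the policy over a plain set (assumption: `pinned` stands in
/// for the real FingerprintState).
pub fn apply_pin_update(
    pinned: &mut HashSet<String>,
    verified: &HashSet<String>,
    backend_count: usize,
    failed_calls: usize,
    verify_failures: usize,
) -> PinUpdate {
    let complete_coverage = failed_calls == 0
        && verify_failures == 0
        && backend_count > 0
        && verified.len() == backend_count;

    let mut newly_pinned: Vec<String> = verified.difference(&*pinned).cloned().collect();
    newly_pinned.sort(); // deterministic order for logs and tests

    if complete_coverage {
        // Every backend answered and verified: the observed set is authoritative.
        let mut evicted: Vec<String> = pinned.difference(verified).cloned().collect();
        evicted.sort();
        *pinned = verified.clone();
        PinUpdate { newly_pinned, evicted, total_pinned: pinned.len(), replaced: true }
    } else {
        // Partial cycle: only add, never evict.
        pinned.extend(newly_pinned.iter().cloned());
        PinUpdate { newly_pinned, evicted: Vec::new(), total_pinned: pinned.len(), replaced: false }
    }
}
```

Because `verified` is a set, two backends returning the same fingerprint make `verified.len()` smaller than `backend_count`, so a duplicate observation falls through to the additive path instead of evicting anything.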

What goes away

ATTESTATION_DISCOVERY_PARALLELISM = 5
CUMULATIVE_DISCOVERY_CALLS = 2
STAGGER_MS = 200 (intra-model)
MODEL_DISCOVERY_STAGGER_MS = 2_000 (inter-model)

discover_model loses its num_calls parameter; both call sites (new-provider phase + cumulative refresh) become identical.

What's not in this PR (out of scope)

No feature flag.
No env-var override for the base domain — it's derived from each inference URL's host.
No /backends/count caching across cycles — every cycle re-fetches.
No bucket-warm rotation — that's issue #600.

Observability

DiscoveryOutcome adds two fields, surfaced on the existing INFO logs from #595:

backend_count: usize — healthy count this cycle (0 if the fetch failed; see failure_reasons).
replaced_state: bool — true iff complete coverage was achieved and the pin set was wholesale replaced.

DD query to validate rapid eviction once staging is on:

service:cloud-api host:*-staging
@fields.message:"Cumulative discovery cycle (no new fingerprints)"
@fields.model:zai-org/GLM-5.1-FP8

Expect backend_count == observed_fingerprints.len() == total_pinned, replaced_state: true.

URL handling

The base domain is derived per URL by stripping the leftmost DNS label of the host. Works for every URL we have today (*.completions{,-stg}.near.ai). URLs that don't fit (one-label hosts, IP literals) return an empty outcome with a url_parse: failure reason; existing fail-closed logic handles them.
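The leftmost-label rule can be sketched with the standard library alone, assuming the authority (`host[:port]`) has already been extracted from the inference URL. `split_authority` and `HostParts` are hypothetical names, and the error strings only imitate the `url_parse:` reason prefix described above.

```rust
use std::net::IpAddr;

/// Hypothetical result of splitting an inference URL's authority.
#[derive(Debug, PartialEq)]
pub struct HostParts {
    pub canonical: String, // leftmost DNS label, lowercased
    pub base: String,      // everything after the first dot
    pub port: Option<u16>, // preserved so rotation URLs target the same port
}

/// Sketch: split `host[:port]` into canonical label + base domain.
/// IP literals and one-label hosts are rejected with a `url_parse:` reason.
pub fn split_authority(authority: &str) -> Result<HostParts, String> {
    let lower = authority.to_ascii_lowercase(); // case-insensitive hosts
    if lower.starts_with('[') {
        return Err("url_parse: IP-literal host".into()); // bracketed IPv6
    }
    let (host, port) = match lower.rsplit_once(':') {
        Some((h, p)) => {
            let port = p.parse::<u16>().map_err(|_| format!("url_parse: bad port {p:?}"))?;
            (h.to_string(), Some(port))
        }
        None => (lower.clone(), None),
    };
    if host.parse::<IpAddr>().is_ok() {
        return Err("url_parse: IP-literal host".into());
    }
    match host.split_once('.') {
        Some((canonical, base)) if !canonical.is_empty() && !base.is_empty() => Ok(HostParts {
            canonical: canonical.to_string(),
            base: base.to_string(),
            port,
        }),
        _ => Err("url_parse: host has fewer than two labels".into()),
    }
}
```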

Tests

spki_verifier: 4 new state-transition tests for replace_with (Bootstrap → Pinned, Pinned shrink, Blocked → Pinned recovery, empty set).
rotation: 10 URL-helper tests covering canonicals with internal dashes, case insensitivity, port preservation, IP-host / one-label rejection, count-URL shape.
inference_provider_pool::pin_update_*: 8 policy tests covering steady state, eviction on shrinking count, partial-cycle additive preservation, duplicate-observation safety, verify_failures blocking replacement, zero-count safety, bootstrap first cycle.

511 workspace lib tests pass. Clippy clean.
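For readers unfamiliar with the verifier, here is a toy model of what the `replace_with` tests exercise. The variant names come from the test list above; the payloads, and the choice that every prior state (including `Blocked`) moves to `Pinned` and that an empty set is accepted, are assumptions inferred from the test names rather than taken from `spki_verifier.rs`.

```rust
use std::collections::HashSet;

/// Hypothetical model of the verifier's pin state; payloads are assumptions.
#[derive(Debug, PartialEq)]
pub enum FingerprintState {
    Bootstrap,
    Pinned(HashSet<String>),
    Blocked,
}

impl FingerprintState {
    /// Wholesale replacement: whatever the prior state, the given set becomes
    /// the pinned set. An empty set is accepted (cf. the defensive empty-set test).
    pub fn replace_with(&mut self, fingerprints: HashSet<String>) {
        *self = FingerprintState::Pinned(fingerprints);
    }

    /// Number of pinned fingerprints; 0 outside the Pinned state.
    pub fn pinned_count(&self) -> usize {
        match self {
            FingerprintState::Pinned(set) => set.len(),
            _ => 0,
        }
    }
}
```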

Rollout

PR #595 already shipped the observability needed to validate this. After merge:

Build pushes staging image to docker hub.
Existing rolling deploy applies it.
Verify per the DD query above on each replica.
Smoke /v1/attestation/report?model=zai-org/GLM-latest in a loop and confirm 0 failures.

Rollback is a normal revert + redeploy. No state migration, no config change.

Follow-ups

#600 — apply the same rotation pattern to bucket pre-warm.
Lower MODEL_DISCOVERY_REFRESH_INTERVAL (currently 300s) once we have a few days of clean data. The new path is cheap enough that a 60s interval would give us ~1 min eviction latency at a still-tiny load.

Cumulative discovery used to stall on multi-backend models because the proxy's least-connections LB collapses fresh-TCP probes onto a stable subset of backends. We worked around this with parallelism (5 calls per new provider, 2 per refresh cycle) and inter-model staggering, but the shape was fundamentally O(luck): some replicas kept missing some backends forever and got TLS-handshake-rejected when the LB later routed them there.

The customer-visible symptom is ~42% of /v1/attestation/report calls for GLM-5.1-FP8 failing with "error sending request".

model-proxy PR #27 published a deterministic routing knob: rotation SNI '<canonical>-i<N>.<base>' routes to 'healthy_backends_sorted[N % healthy]', and GET /backends/count?domain=<host> reports the current healthy count. This PR rewrites discover_model on top of those two pieces.

Per-cycle flow:
- Fetch the healthy backend count from /backends/count.
- Fan out one fresh-TCP attestation call per backend index, in parallel, no stagger. Each call lands on a distinct backend by construction, so per-backend GPU evidence pressure per cycle is exactly one attestation regardless of how many models refresh together.
- Apply the cycle's verified fingerprints to the shared pin set according to apply_pin_update():
  * Complete coverage (no failures, verify_failures == 0, distinct observed fingerprints == backend_count): REPLACE the pin set with the observed set. A backend that just went unhealthy or had its cert rotated drops out of the pin set within one refresh interval (rapid eviction).
  * Anything less: additive merge. A transient hiccup never evicts verified fingerprints we just couldn't reconfirm.

Eliminates:
- ATTESTATION_DISCOVERY_PARALLELISM (was 5)
- CUMULATIVE_DISCOVERY_CALLS (was 2)
- STAGGER_MS (intra-model, was 200)
- MODEL_DISCOVERY_STAGGER_MS (inter-model, was 2_000)

discover_model loses the num_calls parameter. Both call sites (the new-provider phase in load_inference_url_models and the cumulative refresh path) become identical.

DiscoveryOutcome gains:
- backend_count: healthy count from /backends/count this cycle, 0 if the fetch failed (failure_reasons[0] then carries the reason).
- replaced_state: true iff this cycle achieved complete coverage and the pin set was wholesale replaced rather than additively merged.

Both fields are surfaced on the existing INFO log lines (initial discovery, cumulative expansion, cumulative no-new-fingerprints) for DD-side observability.

URL handling derives the base domain by stripping the leftmost DNS label of the inference URL host. This works for every URL we have today ('*.completions{,-stg}.near.ai'); URLs that don't fit (one-label hosts, IP literals) return an empty outcome with a 'url_parse:' failure reason, and the existing fail-closed path handles eviction.

Tests:
- spki_verifier: replace_with state transitions (Bootstrap->Pinned, Pinned shrink, Blocked->Pinned recovery, empty set).
- rotation: 10 URL-helper tests covering canonicals with internal dashes, case insensitivity, port preservation, IP/one-label rejection, count-URL shape.
- inference_provider_pool: 8 apply_pin_update policy tests covering steady state, eviction on shrinking count, partial-cycle additive preservation, duplicate-observation safety, verify_failure blocking replacement, zero-count safety, bootstrap first cycle.

Followup #600: rotation SNI for chat-completion bucket pre-warm.

claude · 2026-05-18T15:42:29Z

Review — rotation-SNI discovery + rapid-eviction

Reviewed against PR description and CLAUDE.md privacy rules. Strong tests, clean diff (net –160 lines), no privacy-sensitive logging added (only URLs / fingerprints / counts).

Critical issues

None blocking. A few notes worth considering before/after merge:

Notable observations

1. No upper bound on backend_count fan-out — crates/services/src/inference_provider_pool/mod.rs:817

```rust
let futures = (0..backend_count).map(|i| { ... })
```

If model-proxy ever reports a bogus large count (registry race, mis-deploy, partial split), discover_model will spawn that many parallel reqwest::Clients + TLS handshakes at once, for every model, on every refresh. Trust boundary is small (proxy is internal), but a sanity cap (e.g. min(backend_count, 256)) would be cheap defense-in-depth and bound the worst case.
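A sketch of the suggested cap, assuming a constant like the `MAX_ROTATION_FANOUT = 256` that the author adopted later in this thread and a `count_capped` reason prefix; `fanout_indices` and the exact message text are hypothetical.

```rust
/// Cap on rotation probes per model per cycle (value taken from the follow-up commit).
pub const MAX_ROTATION_FANOUT: usize = 256;

/// Sketch: clamp the proxy-reported count and record when clamping happened,
/// so a bogus registry reading cannot trigger an unbounded TLS-handshake fan-out.
/// Returns the range of rotation indices to probe.
pub fn fanout_indices(backend_count: usize, failure_reasons: &mut Vec<String>) -> std::ops::Range<usize> {
    let capped = backend_count.min(MAX_ROTATION_FANOUT);
    if capped < backend_count {
        failure_reasons.push(format!(
            "count_capped: proxy reported {backend_count}, probing {capped}"
        ));
    }
    0..capped
}
```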

2. CountFetch::Ok(0) vs error are only distinguishable via failure_reasons — discover_model lines 778–793

When proxy authoritatively reports 0 healthy backends, the INFO log shows backend_count=0, failure_reasons=[]. When the count fetch failed, the same log shows backend_count=0, failure_reasons=["count_*: ..."]. Distinguishable but subtle — a DD query for "model has zero backends right now" needs to match on empty failure_reasons. Probably fine, just worth knowing.
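One way to make the two cases distinguishable is to give `Ok(0)` its own reason string. In this sketch only the `Ok(n)` variant comes from the PR excerpt; the `Failed` variant and `classify` are hypothetical, and the `count_zero` text mirrors the wording the author later used.

```rust
/// Hypothetical shape of the count-fetch result. Only `Ok(n)` appears in the
/// PR excerpt; the `Failed` variant is an assumption.
pub enum CountFetch {
    Ok(usize),
    Failed { category: &'static str, detail: String },
}

/// Sketch: map a fetch result to (backend_count, optional failure reason).
/// "Zero healthy backends" and "could not fetch the count" both yield 0,
/// but now carry different reasons.
pub fn classify(fetch: CountFetch) -> (usize, Option<String>) {
    match fetch {
        CountFetch::Ok(0) => (0, Some("count_zero: proxy reports 0 healthy backends".to_string())),
        CountFetch::Ok(n) => (n, None),
        CountFetch::Failed { category, detail } => (0, Some(format!("{category}: {detail}"))),
    }
}
```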

3. pin_update_verify_failure_blocks_replacement test inputs are logically impossible — mod.rs:3133

Test passes backend_count=4, verify_failures=1, observed=4 unique fps. In practice the cycle fans out exactly backend_count calls, so 4 backends + 1 verify failure = at most 3 successful verifications. The policy assertion is correct (verify_failures > 0 ⇒ no replace regardless), but the test scenario doesn't reflect what discover_model actually produces. Consider observed=["a","b","c"], backend_count=4, verify_failures=1 for realism — the assertion still holds.

4. replace_with(empty) is dead in production — spki_verifier.rs:74

apply_pin_update only calls replace_with when complete_coverage is true, which requires backend_count > 0 && verified.len() == backend_count. So verified is never empty at the call site, and test_replace_with_empty_set_is_permitted exercises a path the production caller can't reach. The defensive doc + test are fine but worth noting if future refactors lean on the "callers may pass empty" guarantee.

5. No retry on count fetch failure — rotation::fetch_backend_count

A transient blip on /backends/count (3s timeout, single attempt) skips the entire cycle. Existing pin state is preserved (the additive path is the floor), so this fails safely, but with the new wholesale-replacement design, multiple consecutive count failures across cycles delay rapid eviction. Acceptable given the 300s refresh interval, and the issue surfaces in failure_reasons for observability.

Looks good

Lock discipline in apply_pin_update (single write-lock for read-then-write — no TOCTOU)
Coverage policy is correctly gated on failed_calls == 0 && verify_failures == 0 && backend_count > 0 && verified.len() == backend_count — duplicate observations from registry race fall through to additive (correctly tested)
Privacy: new INFO logs only carry infra metadata (URL, fingerprint, index, counts). Compliant with CLAUDE.md
Backward compatibility: rollout/rollback is a normal revert; no state migration, no config change
Bootstrap → Pinned → (replace) → Pinned and Blocked → Pinned recovery paths are explicitly tested

✅ Approved — non-blocking suggestions only.
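The lock-discipline point in the review above can be illustrated with `std::sync::RwLock`. This is a hypothetical stand-in (`replace_pins` over a `HashSet<String>`), not the PR's code: the snapshot used to compute evictions and the replacement both happen under one write guard.

```rust
use std::collections::HashSet;
use std::sync::RwLock;

/// Sketch of "single write-lock for read-then-write": computing the evicted
/// set and installing the new set happen under the same guard, so no other
/// task can modify the pin set in between (no TOCTOU window).
pub fn replace_pins(state: &RwLock<HashSet<String>>, verified: HashSet<String>) -> Vec<String> {
    let mut guard = state.write().expect("pin-set lock poisoned");
    let mut evicted: Vec<String> = guard.difference(&verified).cloned().collect();
    evicted.sort();
    *guard = verified;
    // The guard is dropped when the function returns, releasing the lock.
    evicted
}
```

The pattern to avoid is taking a read lock to compute the difference, dropping it, then taking a write lock to apply it; a concurrent refresh could change the set between those two steps.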

Copilot

Pull request overview

This PR updates InferenceProviderPool’s attestation discovery to use model-proxy’s rotation-SNI scheme plus a per-cycle /backends/count fetch, enabling deterministic “one probe per backend” coverage and a new pin-update policy that can rapidly evict fingerprints when full coverage is achieved.

Changes:

Add rotation URL parsing/building + backend-count fetching helpers to drive deterministic per-backend discovery fanout.
Introduce apply_pin_update policy: replace the pinned set on complete coverage; otherwise do additive merge.
Extend discovery outcome logging fields (backend_count, replaced_state) and remove legacy discovery parallelism/stagger constants.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
crates/services/src/inference_provider_pool/rotation.rs	New helpers for splitting inference URLs, constructing rotation/count URLs, and fetching healthy backend count.
crates/services/src/inference_provider_pool/mod.rs	Rework `discover_model` to use rotation SNI + count fanout; add pin replacement policy and new outcome fields/logging.
crates/services/src/attestation/verification.rs	Remove legacy discovery constants no longer used by rotation-driven discovery.
crates/inference_providers/src/spki_verifier.rs	Add `FingerprintState::replace_with` and tests to support pin-set replacement on complete coverage.

Comments suppressed due to low confidence (2)

crates/services/src/inference_provider_pool/mod.rs:789

When /backends/count returns Ok(0), the cycle bails out via empty_outcome without recording any failure reason, and backend_count is also 0 (same value used for "count fetch failed"). For observability it would help to record an explicit reason (e.g. count_zero) and/or propagate the authoritative count into the outcome so downstream logs can distinguish "no healthy backends" from "couldn't fetch count".

        let backend_count =
            match rotation::fetch_backend_count(&count_client, &parts, COUNT_TIMEOUT).await {
                rotation::CountFetch::Ok(0) => {
                    // Authoritatively no healthy backends right now. Don't issue
                    // calls; don't replace the pin set (transient registry hiccup
                    // shouldn't evict verified state). Provider-level fail-closed
                    // paths handle the no-backend case at request time.
                    return Self::empty_outcome(&fingerprint_state, 0, failure_reasons);
                }

crates/services/src/inference_provider_pool/mod.rs:880

failure_reasons currently includes format!("{category}: {e}") for reqwest send errors. reqwest::Error display strings often embed the full request URL; here that URL includes a random nonce query param, which can create extremely high-cardinality INFO logs (and potential cost/ingest issues) when failure_reasons is emitted. Consider redacting the URL (e.g. replace e.url() with <redacted> or drop query params) and logging only stable detail fields at INFO, keeping full error text at DEBUG.

                            let category = if e.is_connect() {
                                "connect"
                            } else if e.is_timeout() {
                                "send_timeout"
                            } else if e.is_request() {
                                "request"
                            } else {
                                "send"
                            };
                            return Err(format!("{category}: {e}"));
                        }
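The author ultimately used reqwest's `Error::without_url()`, which removes the URL from the error entirely. The other option Copilot mentions, dropping query parameters, can be sketched with the standard library; `redact_query` and `failure_reason` are hypothetical helpers operating on plain strings, not reqwest APIs.

```rust
/// Sketch: drop the query string (where the per-call nonce lives) and any
/// fragment before a URL is embedded in an INFO-level failure reason.
pub fn redact_query(url: &str) -> &str {
    let end = url.find(|c: char| c == '?' || c == '#').unwrap_or(url.len());
    &url[..end]
}

/// Sketch: build a low-cardinality failure reason from a category, a short
/// detail, and the redacted URL.
pub fn failure_reason(category: &str, url: &str, detail: &str) -> String {
    format!("{category}: {detail} ({})", redact_query(url))
}
```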


+    /// `0` means we couldn't get a count (model-proxy unreachable, 404, etc.)
+    /// — see `failure_reasons` for the category. `discover_model` returns
+    /// without issuing any rotation calls in that case.


+    /// is one of: `count_connect`, `count_timeout`, `count_send`,
+    /// `count_status`, `count_decode`, `client_build`, `query_encode`,


+    if complete_coverage {
+        let newly_pinned: Vec<String> = verified.difference(&before).cloned().collect();
+        let evicted: Vec<String> = before.difference(verified).cloned().collect();
+        state.replace_with(verified.clone());
+        PinUpdate {
+            newly_pinned,
+            evicted,
+            total_pinned: state.pinned_count(),
+            replaced: true,
+        }
+    } else {
+        let mut newly_pinned: Vec<String> = Vec::new();
+        for fp in verified {
+            let before_count = state.pinned_count();
+            state.add_fingerprint(fp.clone());
+            if state.pinned_count() > before_count {
+                newly_pinned.push(fp.clone());
+            }
+        }


gemini-code-assist

Code Review

This pull request implements a rotation-SNI scheme for attestation discovery, enabling deterministic routing to backends and more efficient fingerprint management. Key changes include the introduction of a replace_with method for wholesale fingerprint state updates, a new rotation module for URL handling and backend count fetching, and a refactored discovery process that achieves full coverage in a single cycle. Feedback identifies a potential regression in single-backend models where only one signing algorithm would be discovered, potentially breaking end-to-end encryption for some clients. Additionally, an optimization was suggested for the fingerprint merging logic to reduce redundant state checks and simplify the implementation.

gemini-code-assist · 2026-05-18T15:50:10Z

+        // Step 2: fan out one attestation call per backend index, in
+        // parallel (no stagger — each call lands on a distinct backend, so
+        // per-backend pressure is exactly one attestation per cycle).
+        let futures = (0..backend_count)


For models with only a single healthy backend (backend_count == 1), this loop will only issue one call, typically using the ecdsa algorithm. This results in a regression where the ed25519 signing public key is never discovered for single-backend models, breaking E2EE routing for clients requesting that algorithm. Consider ensuring at least two calls are made (alternating algorithms) even when only one backend is reported, as the rotation SNI scheme correctly handles wrapping indices.

let futures = (0..backend_count.max(2))
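A sketch of the suggested minimum-two-probe plan. `SigningAlgo`, `plan_probes`, and alternating the algorithm by index parity are assumptions made for illustration; the review only says the single call typically uses ecdsa. The zero-count guard matters: without it, `i % backend_count` would panic when the proxy reports no healthy backends.

```rust
/// Hypothetical signing algorithms; alternating them by index parity is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SigningAlgo {
    Ecdsa,
    Ed25519,
}

/// Sketch: at least two probes, alternating algorithms, with each rotation
/// index wrapping onto backend slot `index % backend_count`.
/// Each entry is (rotation index, backend slot, algorithm).
pub fn plan_probes(backend_count: usize) -> Vec<(usize, usize, SigningAlgo)> {
    if backend_count == 0 {
        return Vec::new(); // no healthy backends: issue no calls
    }
    (0..backend_count.max(2))
        .map(|i| {
            let algo = if i % 2 == 0 { SigningAlgo::Ecdsa } else { SigningAlgo::Ed25519 };
            (i, i % backend_count, algo)
        })
        .collect()
}
```

With `backend_count >= 2` the plan is the same `0..backend_count` range as before; only the single-backend case gains a second, wrapped probe to backend slot 0.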

gemini-code-assist · 2026-05-18T15:50:10Z

+        let mut newly_pinned: Vec<String> = Vec::new();
+        for fp in verified {
+            let before_count = state.pinned_count();
+            state.add_fingerprint(fp.clone());
+            if state.pinned_count() > before_count {
+                newly_pinned.push(fp.clone());
+            }
+        }


The additive merge path can be simplified and made more efficient by using the newly_pinned logic already present in the complete_coverage branch. This avoids multiple calls to state.pinned_count() inside the loop, consolidating duplicated logic and ensuring shared data is fetched once before the loop for better performance.

    let newly_pinned: Vec<String> = verified.difference(&before).cloned().collect();
    for fp in &newly_pinned {
        state.add_fingerprint(fp.clone());
    }
    PinUpdate {
        newly_pinned,
        evicted: Vec::new(),
        total_pinned: state.pinned_count(),
        replaced: false,
    }

References

To avoid logic drift and simplify maintenance, consolidate duplicated logic into a single source of truth.

When processing a batch of items, fetch shared data once before the loop and pass it into the processing logic to avoid redundant database queries (N queries for a batch of N items).
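To check that the suggested rewrite preserves behavior, both merge shapes can be compared on a set-like stand-in. This assumes `add_fingerprint` grows the pin count exactly when the fingerprint was not already pinned; whether that holds for every `FingerprintState` variant (for example `Blocked`) is not established here. Both function names are hypothetical.

```rust
use std::collections::HashSet;

/// Original loop shape: add each fingerprint, record it if the count grew.
pub fn merge_by_count(pinned: &mut HashSet<String>, verified: &HashSet<String>) -> Vec<String> {
    let mut newly = Vec::new();
    for fp in verified {
        let before_count = pinned.len();
        pinned.insert(fp.clone());
        if pinned.len() > before_count {
            newly.push(fp.clone());
        }
    }
    newly.sort(); // HashSet iteration order is unspecified
    newly
}

/// Suggested shape: compute the set difference once, then add only those.
pub fn merge_by_difference(pinned: &mut HashSet<String>, verified: &HashSet<String>) -> Vec<String> {
    let mut newly: Vec<String> = verified.difference(&*pinned).cloned().collect();
    newly.sort();
    pinned.extend(newly.iter().cloned());
    newly
}
```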

Address bot review feedback on #603:
- count_zero vs count-fetch-failure are now distinguishable in failure_reasons. Previously both rendered as empty / generic count_*:; now Ok(0) records 'count_zero: proxy reports 0 healthy backends' explicitly.
- Sanity-cap rotation fan-out at 256 backends per model per cycle. A bogus registry reading (race during deploy, partial split) would otherwise spawn an unbounded number of fresh-TCP TLS handshakes. Hitting the cap is logged and recorded in failure_reasons.
- Strip the request URL from every reqwest error in failure_reasons via Error::without_url(). The URLs embed our random per-call nonce, which would otherwise create unbounded label cardinality in DD when any reqwest error path fires. Full error stays available at DEBUG via the existing debug! lines.
- pin_update_verify_failure_blocks_replacement test now uses an input shape that the production caller can actually produce (backend_count=4, verified=3, verify_failures=1). The policy assertion is unchanged.

Evrard-Nil · 2026-05-18T15:52:15Z

Pushed 786f31c addressing the bot review feedback:

count_zero vs count-fetch-failure: Ok(0) now records count_zero: proxy reports 0 healthy backends in failure_reasons so DD can distinguish it from a count-fetch error (which carries a count_*: reason). (copilot #1 / claude #2)
Sanity cap on fan-out: hard-capped at MAX_ROTATION_FANOUT = 256 per model per cycle. A bogus registry reading would otherwise spawn an unbounded fan-out of fresh-TCP TLS handshakes; hitting the cap warns + records count_capped in failure_reasons. (claude #1)
URL redaction in failure_reasons: every reqwest::Error formatted into failure_reasons now goes through Error::without_url(). The URLs embed our per-call random nonce, which would have created unbounded DD label cardinality on any failing path. Full error remains available at DEBUG via the existing debug! lines. (copilot #2)
Realistic test inputs: pin_update_verify_failure_blocks_replacement now uses backend_count=4, verified=3, verify_failures=1, which is what the production caller can actually produce. The policy assertion is unchanged. (claude #3)

Not addressed (intentional):

No retry on count fetch failure (claude #5): the cycle fails safely (additive path is the floor, existing pin set preserved), and the failure surfaces in failure_reasons. Existing 5-min refresh interval is the retry. Worth revisiting separately if we lower the interval.
replace_with(empty) is unreachable from apply_pin_update (claude #4): defensive doc + test stay; small footprint, and useful insurance for refactors.

Copilot AI review requested due to automatic review settings May 18, 2026 15:39

Evrard-Nil temporarily deployed to Cloud API test env May 18, 2026 15:39 with GitHub Actions

Copilot started reviewing on behalf of Evrard-Nil May 18, 2026 15:40

Copilot AI reviewed May 18, 2026

gemini-code-assist Bot reviewed May 18, 2026

Evrard-Nil temporarily deployed to Cloud API test env May 18, 2026 15:51 with GitHub Actions

Evrard-Nil merged commit ca6cc73 into main May 18, 2026
3 checks passed

Evrard-Nil mentioned this pull request May 19, 2026

fix: refresh tls_cert_fingerprint when certbot rotates the cert nearai/inference-proxy#125

Merged

lloydmak99 mentioned this pull request May 20, 2026

perf(inference_provider_pool): skip attestation on cold-bucket-fill when fingerprint already pinned #626

Merged

Evrard-Nil merged 2 commits into main from feat/rotation-discovery