observability: surface cumulative discovery cycles at INFO by Evrard-Nil · Pull Request #595 · nearai/cloud-api

Evrard-Nil · 2026-05-15T10:19:54Z

Why

Production cloud-api intermittently returns 503 for GET /v1/attestation/report?model=zai-org/GLM-latest (and other multi-backend models). Root cause is that some replicas only have a partial TLS SPKI pin set for glm-5-1.completions.near.ai (5 backends behind the L4 SNI proxy), and cumulative discovery is not expanding the set after startup.

After startup we see zero "Pinned new TLS SPKI fingerprint" or "Cumulative discovery expanded pinned backend set" events for GLM on either prod host in 6+ hours, even though "Loaded inference_url models" fires every 5 min. The no-new-fingerprints branch logs at DEBUG (filtered in prod), so we can't tell from logs whether cumulative actually ran, which backends the L4 LB routed it to, or whether per-call failures (TLS reject / timeout / 5xx) are silently eating cycles.

What this PR does

Adds two fields to DiscoveryOutcome so every discovery pass produces enough INFO-level data to find the root cause on prod:

observed_fingerprints: Vec<String> — every verified TLS SPKI fingerprint a cycle saw, in call order. Repeats mean the L4 LB sent multiple calls to the same backend (load-balancing collapse). Missing-from-prod-pin-set entries mean the LB did route to a new backend but something else (e.g. verify) failed.
failure_reasons: Vec<String> — a category-prefixed reason per call that didn't produce a fingerprint. Categories: connect / send_timeout / timeout / request / send / status:NNN / malformed_json / query_encode / client_build / verify.

These fields are surfaced on three existing log events:

Cumulative discovery cycle (no new fingerprints) — promoted from DEBUG to INFO and renamed. This is the key signal — it now fires once per provider per refresh cycle in prod.
Cumulative discovery expanded pinned backend set — already INFO; gains observed_fingerprints and failure_reasons.
Initial attestation discovery complete — already INFO; same additions, so startup diagnostics show what each replica saw.

Per-call DEBUG logs are unchanged.

Volume

Aggregate INFO log fires once per provider per cumulative cycle. With default refresh_interval_secs = 300 and ~11 reused models, that's ~2-3 log-lines/sec per replica during steady state — well within budget.

Privacy

No customer data. Fingerprints are server-cert SPKI hashes (already logged at INFO in "Pinned new TLS SPKI fingerprint"). Failure reasons are bounded categorical strings + reqwest error display, no request/response bodies.

What we expect to see once this ships

For the GLM stall, one of:

observed_fingerprints contains only the already-pinned values every cycle → L4 LB is collapsing onto known backends; fix is on the model-proxy / discovery-routing side (PR for select_backend_excluding).
failure_reasons contains connect / timeout / verify consistently → upstream backend or attestation path is the blocker.
Mixed → we can quantify each.

Without this PR we cannot distinguish these on prod.

Test plan

cargo check -p services clean
cargo test --lib -p services inference_provider_pool — 31/31 passing
Deploy to staging, confirm Cumulative discovery cycle (no new fingerprints) events appear with non-empty observed_fingerprints for healthy multi-backend models
Deploy to prod, query DD for service:cloud-api @fields.model:zai-org/GLM-5.1-FP8 @fields.message:"Cumulative discovery cycle", confirm whether observations repeat or rotate

Follow-ups (out of scope)

Once we have the data we can pick the right fix from:

Bump CUMULATIVE_DISCOVERY_CALLS from 2 → 5 to match initial parallelism.
Have discovery request select_backend_excluding(already_pinned) so model-proxy rotates to undiscovered backends.
Run one cumulative cycle ~30s after startup instead of waiting for the first interval tick.

The cumulative discovery refresh on prod is silently failing to expand the TLS SPKI pin set for multi-backend models (notably GLM-5.1-FP8 with 5 backends behind the L4 SNI proxy). After startup, no further "Pinned new TLS SPKI fingerprint" or "Cumulative discovery expanded pinned backend set" events appear, even though load_inference_url_models runs every 5 min. The no-new-fingerprints branch logged at DEBUG and was filtered out in production, so the cycle was invisible — we could not tell whether discovery was running, what backends the L4 LB routed to, or whether calls were failing. This commit adds two fields to DiscoveryOutcome: - observed_fingerprints: every verified TLS SPKI fingerprint a cumulative cycle saw, in call order. Repeats mean the L4 LB sent multiple calls to the same backend — the load-balancing-bias signal needed to diagnose why pin-set expansion stalled. - failure_reasons: a category-prefixed reason per call that didn't produce a fingerprint (connect / send_timeout / timeout / status / malformed_json / query_encode / client_build / verify). The per-call DEBUG logs are unchanged. The aggregate is logged once per provider per refresh cycle, so volume is bounded by refresh_interval × model count (with default 300s and ~11 models this is ~10 INFO log-lines/minute per replica). The "Cumulative discovery cycle" log is promoted from DEBUG to INFO and renamed to "Cumulative discovery cycle (no new fingerprints)" to distinguish it from the expansion event. The same observed/failure fields are also added to the expansion event and the initial-discovery event so the same diagnostic data is available at startup and during recovery. No customer data is logged: fingerprints are server cert hashes, failure reasons are categorized error types.

claude · 2026-05-15T10:22:01Z

Code Review

Observability-only change, low risk. A couple of correctness nits worth fixing before merge so the new logs read the way the PR description claims:

⚠️ Issues

1. Doc comment on observed_fingerprints has the ordering backwards (crates/services/src/inference_provider_pool/mod.rs:55-60)

/// Order is the order in which discovery's parallel calls returned (not the
/// order they were launched).

futures::future::join_all preserves the order of the input vector — that's launch order, not completion order. This matters for the PR's stated debug strategy: the count of duplicates is still meaningful for spotting LB collapse, but anyone reading the log and reasoning about timing ("call #2 hit backend X, call #3 also hit X") will be misled. Suggest:

/// Per-call verified TLS fingerprints observed in this pass, in launch order
/// (i.e. by `attempt` index — `join_all` preserves input order, not completion
/// order). Repeats mean the L4 LB routed multiple calls to the same backend.

2. failure_reasons and failed_calls count different things — debug logs will look inconsistent

verify failures get pushed into failure_reasons (line 821) but the call has already been counted as successful_calls on line 777 (HTTP success ⇒ successful_calls += 1 before verification runs). So in the INFO line for Cumulative discovery cycle (no new fingerprints), an operator can legitimately see failed_calls=0, failure_reasons=[\"verify: ...\"] and wonder which counter is wrong.

Either:

Add a verify_failures counter alongside failed_calls, or
Note this explicitly in the failure_reasons doc comment ("includes post-HTTP verify failures, so failure_reasons.len() >= failed_calls")

Given this PR exists because the previous logs were ambiguous, I'd lean toward separate counters.

Nits

PR description says category status:NNN but code emits \"status: {status}\" (line 728) with a space. Trivial, but log parsers / DD facets keyed off the prefix will need to know.
observed_fingerprints.push(vfp.clone()) (line 782) double-clones with the existing verified_this_round.insert(vfp.clone()) on the next line. SHA-256 hex strings are tiny, so this is fine — flagging only because the existing code already had one clone here.

Privacy / safety

✅ No customer data. Fingerprints are SPKI hashes (already INFO-logged via Pinned new TLS SPKI fingerprint).
✅ format!(\"{category}: {e}\") on reqwest errors: reqwest's Display may include the request URL, but that URL is the internal inference endpoint, not user content. Acceptable.
✅ Volume claim is conservative — PR says ~2-3 lines/sec but the math (11 models / 300s) is ~0.04 lines/sec per replica. Plenty of headroom.

Verdict

⚠️ Two doc/counter-clarity fixes requested. Once those are addressed, this is good to ship.

Copilot

Pull request overview

Adds INFO-level observability for attestation discovery so production can diagnose why cumulative TLS SPKI discovery is not expanding backend pin sets.

Changes:

Tracks observed TLS fingerprints and per-call discovery failure reasons in DiscoveryOutcome.
Surfaces those fields on initial discovery and cumulative discovery log events.
Promotes the no-new-fingerprints cumulative discovery log from DEBUG to INFO.

Comments suppressed due to low confidence (1)

crates/services/src/inference_provider_pool/mod.rs:63

These reasons are not recorded in occurrence order: the futures are awaited with join_all and then processed in input/attempt order. Update the doc or collect results as each future completes if occurrence order is important for the diagnostic.

    /// Per-call failure reasons that prevented a fingerprint observation, in
    /// the order they occurred. Empty when all calls succeeded.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

                Ok(verified) => {
                    if let Some(ref vfp) = verified.tls_cert_fingerprint {
+                        observed_fingerprints.push(vfp.clone());


+                            } else {
+                                "send"
+                            };
+                            return Err(format!("{category}: {e}"));


+    /// Per-call verified TLS fingerprints observed in this pass, in call order.
+    /// Order is the order in which discovery's parallel calls returned (not the
+    /// order they were launched). Same fingerprint appears multiple times when
+    /// the L4 LB routed several calls to the same backend — exactly the signal
+    /// needed to debug why cumulative discovery isn't expanding the pin set.
+    observed_fingerprints: Vec<String>,
+    /// Per-call failure reasons that prevented a fingerprint observation, in
+    /// the order they occurred. Empty when all calls succeeded.


gemini-code-assist

Code Review

This pull request enhances the discovery process within the inference provider pool by introducing tracking for observed TLS fingerprints and specific failure reasons. It updates discovery calls to return detailed error categories instead of simple options, facilitating better debugging of why the pinned backend set might not be expanding. The review feedback highlights a correction needed in the documentation regarding the order of parallel call results and recommends sanitizing error messages to prevent sensitive information leakage in logs.

gemini-code-assist · 2026-05-15T10:26:28Z

+    /// Order is the order in which discovery's parallel calls returned (not the
+    /// order they were launched). Same fingerprint appears multiple times when


The futures::future::join_all function used in discover_model preserves the order of the input futures. Therefore, the results will be in the order they were launched (call 0, call 1, etc.), not the order in which they returned. This is actually beneficial for debugging as it allows correlating results with the specific algorithm used for each call index.

/// Order is the order in which discovery's parallel calls were launched /// (preserved by join_all). Same fingerprint appears multiple times when

gemini-code-assist · 2026-05-15T10:26:28Z

+                            } else {
+                                "send"
+                            };
+                            return Err(format!("{category}: {e}"));


The error message from reqwest::Error may contain sensitive information such as internal URLs or IP addresses. Since these failure reasons are now surfaced in INFO logs, it is recommended to use the existing sanitize_error_message utility to redact sensitive details.

return Err(format!("{category}: {}", Self::sanitize_error_message(&e.to_string())));

References

To avoid logic drift and simplify maintenance, consolidate duplicated logic into a single source of truth.

gemini-code-assist · 2026-05-15T10:26:28Z

                        error = %e,
                        "Attestation verification failed for discovered backend"
                    );
+                    failure_reasons.push(format!("verify: {e}"));


Attestation verification errors can sometimes embed details about the environment or specific backend responses. It is safer to sanitize these strings before including them in the DiscoveryOutcome which is logged at INFO level.

failure_reasons.push(format!("verify: {}", Self::sanitize_error_message(&e.to_string())));

References

To avoid logic drift and simplify maintenance, consolidate duplicated logic into a single source of truth.

Address two issues from the bot reviews: 1. The doc on observed_fingerprints/failure_reasons said the order was the order calls *returned*. join_all preserves input order, so it's actually launch order (by attempt index). Updated both docs. 2. failed_calls only counted transport-level failures, but failure_reasons also included post-HTTP verify failures. An operator reading the INFO log could see failed_calls=0 alongside failure_reasons=["verify: ..."] and reasonably wonder which counter was correct. Added a separate verify_failures counter so the relationship is explicit: failure_reasons.len() == failed_calls + verify_failures and surfaced it on all three INFO/WARN log sites (initial discovery, cumulative expansion, cumulative no-new-fingerprints). Also documented the category vocabulary on failure_reasons inline so operators don't have to grep the source to know what 'client_build:' or 'send_timeout:' mean.

Evrard-Nil · 2026-05-15T10:40:56Z

Pushed ca45bd8 addressing the review feedback:

Ordering doc fix (claude / copilot / gemini): futures::future::join_all preserves input order, not completion order. Updated both observed_fingerprints and failure_reasons docs to say "launch order (by attempt index)".
Counter clarity (claude): added a separate verify_failures: usize to DiscoveryOutcome so post-HTTP verification failures are not silently rolled into the failure_reasons list while failed_calls shows 0. The relationship is now explicit:
```
failure_reasons.len() == failed_calls + verify_failures
```
verify_failures is surfaced on the three INFO/WARN log sites (initial discovery, cumulative expansion, cumulative no-new-fingerprints). Also documented the category vocabulary on failure_reasons inline.

Not addressed (intentional):

Status format with space (claude nit): kept "{category}: {detail}" consistent across all categories rather than removing the space for status:NNN only. PR description was wrong; categories all use "{cat}: ".
Double clone on push+insert (claude nit): left as-is — SHA-256 hex strings are 64 bytes, cost is negligible.

Copilot AI review requested due to automatic review settings May 15, 2026 10:19

Evrard-Nil temporarily deployed to Cloud API test env May 15, 2026 10:19 — with GitHub Actions Inactive

Copilot started reviewing on behalf of Evrard-Nil May 15, 2026 10:20 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 15, 2026

View reviewed changes

Evrard-Nil temporarily deployed to Cloud API test env May 15, 2026 10:40 — with GitHub Actions Inactive

Merge branch 'main' into fix/observability-cumulative-discovery

1273f98

Evrard-Nil temporarily deployed to Cloud API test env May 15, 2026 10:46 — with GitHub Actions Inactive

Evrard-Nil merged commit 86a4a8b into main May 15, 2026
3 checks passed

This was referenced May 18, 2026

Use rotation SNI for chat-completion bucket pre-warm #600

Open

feat: rotation-SNI discovery + rapid-eviction pin set #603

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

observability: surface cumulative discovery cycles at INFO#595

observability: surface cumulative discovery cycles at INFO#595
Evrard-Nil merged 3 commits into
mainfrom
fix/observability-cumulative-discovery

Evrard-Nil commented May 15, 2026

Uh oh!

claude Bot commented May 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

Evrard-Nil commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		/// Order is the order in which discovery's parallel calls returned (not the
		/// order they were launched). Same fingerprint appears multiple times when

Conversation

Evrard-Nil commented May 15, 2026

Why

What this PR does

Volume

Privacy

What we expect to see once this ships

Test plan

Follow-ups (out of scope)

Uh oh!

claude Bot commented May 15, 2026

Code Review

⚠️ Issues

Nits

Privacy / safety

Verdict

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Evrard-Nil commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants