Skip to content

observability: surface cumulative discovery cycles at INFO#595

Merged
Evrard-Nil merged 3 commits into
mainfrom
fix/observability-cumulative-discovery
May 15, 2026
Merged

observability: surface cumulative discovery cycles at INFO#595
Evrard-Nil merged 3 commits into
mainfrom
fix/observability-cumulative-discovery

Conversation

@Evrard-Nil
Copy link
Copy Markdown
Collaborator

Why

Production cloud-api intermittently returns 503 for GET /v1/attestation/report?model=zai-org/GLM-latest (and other multi-backend models). Root cause is that some replicas only have a partial TLS SPKI pin set for glm-5-1.completions.near.ai (5 backends behind the L4 SNI proxy), and cumulative discovery is not expanding the set after startup.

After startup we see zero "Pinned new TLS SPKI fingerprint" or "Cumulative discovery expanded pinned backend set" events for GLM on either prod host in 6+ hours, even though "Loaded inference_url models" fires every 5 min. The no-new-fingerprints branch logs at DEBUG (filtered in prod), so we can't tell from logs whether cumulative actually ran, which backends the L4 LB routed it to, or whether per-call failures (TLS reject / timeout / 5xx) are silently eating cycles.

What this PR does

Adds two fields to DiscoveryOutcome so every discovery pass produces enough INFO-level data to find the root cause on prod:

  • observed_fingerprints: Vec<String> — every verified TLS SPKI fingerprint a cycle saw, in call order. Repeats mean the L4 LB sent multiple calls to the same backend (load-balancing collapse). Missing-from-prod-pin-set entries mean the LB did route to a new backend but something else (e.g. verify) failed.
  • failure_reasons: Vec<String> — a category-prefixed reason per call that didn't produce a fingerprint. Categories: connect / send_timeout / timeout / request / send / status:NNN / malformed_json / query_encode / client_build / verify.

These fields are surfaced on three existing log events:

  1. Cumulative discovery cycle (no new fingerprints) — promoted from DEBUG to INFO and renamed. This is the key signal — it now fires once per provider per refresh cycle in prod.
  2. Cumulative discovery expanded pinned backend set — already INFO; gains observed_fingerprints and failure_reasons.
  3. Initial attestation discovery complete — already INFO; same additions, so startup diagnostics show what each replica saw.

Per-call DEBUG logs are unchanged.

Volume

Aggregate INFO log fires once per provider per cumulative cycle. With default refresh_interval_secs = 300 and ~11 reused models, that's ~2-3 log-lines/sec per replica during steady state — well within budget.

Privacy

No customer data. Fingerprints are server-cert SPKI hashes (already logged at INFO in "Pinned new TLS SPKI fingerprint"). Failure reasons are bounded categorical strings + reqwest error display, no request/response bodies.

What we expect to see once this ships

For the GLM stall, one of:

  • observed_fingerprints contains only the already-pinned values every cycle → L4 LB is collapsing onto known backends; fix is on the model-proxy / discovery-routing side (PR for select_backend_excluding).
  • failure_reasons contains connect / timeout / verify consistently → upstream backend or attestation path is the blocker.
  • Mixed → we can quantify each.

Without this PR we cannot distinguish these on prod.

Test plan

  • cargo check -p services clean
  • cargo test --lib -p services inference_provider_pool — 31/31 passing
  • Deploy to staging, confirm Cumulative discovery cycle (no new fingerprints) events appear with non-empty observed_fingerprints for healthy multi-backend models
  • Deploy to prod, query DD for service:cloud-api @fields.model:zai-org/GLM-5.1-FP8 @fields.message:"Cumulative discovery cycle", confirm whether observations repeat or rotate

Follow-ups (out of scope)

Once we have the data we can pick the right fix from:

  • Bump CUMULATIVE_DISCOVERY_CALLS from 2 → 5 to match initial parallelism.
  • Have discovery request select_backend_excluding(already_pinned) so model-proxy rotates to undiscovered backends.
  • Run one cumulative cycle ~30s after startup instead of waiting for the first interval tick.

The cumulative discovery refresh on prod is silently failing to expand
the TLS SPKI pin set for multi-backend models (notably GLM-5.1-FP8 with
5 backends behind the L4 SNI proxy). After startup, no further
"Pinned new TLS SPKI fingerprint" or "Cumulative discovery expanded
pinned backend set" events appear, even though load_inference_url_models
runs every 5 min. The no-new-fingerprints branch logged at DEBUG and was
filtered out in production, so the cycle was invisible — we could not
tell whether discovery was running, what backends the L4 LB routed to,
or whether calls were failing.

This commit adds two fields to DiscoveryOutcome:
  - observed_fingerprints: every verified TLS SPKI fingerprint a
    cumulative cycle saw, in call order. Repeats mean the L4 LB sent
    multiple calls to the same backend — the load-balancing-bias signal
    needed to diagnose why pin-set expansion stalled.
  - failure_reasons: a category-prefixed reason per call that didn't
    produce a fingerprint (connect / send_timeout / timeout / status /
    malformed_json / query_encode / client_build / verify).

The per-call DEBUG logs are unchanged. The aggregate is logged once per
provider per refresh cycle, so volume is bounded by refresh_interval ×
model count (with default 300s and ~11 models this is ~10 INFO
log-lines/minute per replica).

The "Cumulative discovery cycle" log is promoted from DEBUG to INFO
and renamed to "Cumulative discovery cycle (no new fingerprints)" to
distinguish it from the expansion event. The same observed/failure
fields are also added to the expansion event and the initial-discovery
event so the same diagnostic data is available at startup and during
recovery.

No customer data is logged: fingerprints are server cert hashes,
failure reasons are categorized error types.
Copilot AI review requested due to automatic review settings May 15, 2026 10:19
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env May 15, 2026 10:19 — with GitHub Actions Inactive
@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Code Review

Observability-only change, low risk. A couple of correctness nits worth fixing before merge so the new logs read the way the PR description claims:

⚠️ Issues

1. Doc comment on observed_fingerprints has the ordering backwards (crates/services/src/inference_provider_pool/mod.rs:55-60)

/// Order is the order in which discovery's parallel calls returned (not the
/// order they were launched).

futures::future::join_all preserves the order of the input vector — that's launch order, not completion order. This matters for the PR's stated debug strategy: the count of duplicates is still meaningful for spotting LB collapse, but anyone reading the log and reasoning about timing ("call #2 hit backend X, call #3 also hit X") will be misled. Suggest:

/// Per-call verified TLS fingerprints observed in this pass, in launch order
/// (i.e. by `attempt` index — `join_all` preserves input order, not completion
/// order). Repeats mean the L4 LB routed multiple calls to the same backend.

2. failure_reasons and failed_calls count different things — debug logs will look inconsistent

verify failures get pushed into failure_reasons (line 821) but the call has already been counted as successful_calls on line 777 (HTTP success ⇒ successful_calls += 1 before verification runs). So in the INFO line for Cumulative discovery cycle (no new fingerprints), an operator can legitimately see failed_calls=0, failure_reasons=[\"verify: ...\"] and wonder which counter is wrong.

Either:

  • Add a verify_failures counter alongside failed_calls, or
  • Note this explicitly in the failure_reasons doc comment ("includes post-HTTP verify failures, so failure_reasons.len() >= failed_calls")

Given this PR exists because the previous logs were ambiguous, I'd lean toward separate counters.

Nits

  • PR description says category status:NNN but code emits \"status: {status}\" (line 728) with a space. Trivial, but log parsers / DD facets keyed off the prefix will need to know.
  • observed_fingerprints.push(vfp.clone()) (line 782) double-clones with the existing verified_this_round.insert(vfp.clone()) on the next line. SHA-256 hex strings are tiny, so this is fine — flagging only because the existing code already had one clone here.

Privacy / safety

  • ✅ No customer data. Fingerprints are SPKI hashes (already INFO-logged via Pinned new TLS SPKI fingerprint).
  • format!(\"{category}: {e}\") on reqwest errors: reqwest's Display may include the request URL, but that URL is the internal inference endpoint, not user content. Acceptable.
  • ✅ Volume claim is conservative — PR says ~2-3 lines/sec but the math (11 models / 300s) is ~0.04 lines/sec per replica. Plenty of headroom.

Verdict

⚠️ Two doc/counter-clarity fixes requested. Once those are addressed, this is good to ship.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds INFO-level observability for attestation discovery so production can diagnose why cumulative TLS SPKI discovery is not expanding backend pin sets.

Changes:

  • Tracks observed TLS fingerprints and per-call discovery failure reasons in DiscoveryOutcome.
  • Surfaces those fields on initial discovery and cumulative discovery log events.
  • Promotes the no-new-fingerprints cumulative discovery log from DEBUG to INFO.
Comments suppressed due to low confidence (1)

crates/services/src/inference_provider_pool/mod.rs:63

  • These reasons are not recorded in occurrence order: the futures are awaited with join_all and then processed in input/attempt order. Update the doc or collect results as each future completes if occurrence order is important for the diagnostic.
    /// Per-call failure reasons that prevented a fingerprint observation, in
    /// the order they occurred. Empty when all calls succeeded.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 780 to +782
Ok(verified) => {
if let Some(ref vfp) = verified.tls_cert_fingerprint {
observed_fingerprints.push(vfp.clone());
} else {
"send"
};
return Err(format!("{category}: {e}"));
Comment on lines +56 to +63
/// Per-call verified TLS fingerprints observed in this pass, in call order.
/// Order is the order in which discovery's parallel calls returned (not the
/// order they were launched). Same fingerprint appears multiple times when
/// the L4 LB routed several calls to the same backend — exactly the signal
/// needed to debug why cumulative discovery isn't expanding the pin set.
observed_fingerprints: Vec<String>,
/// Per-call failure reasons that prevented a fingerprint observation, in
/// the order they occurred. Empty when all calls succeeded.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the discovery process within the inference provider pool by introducing tracking for observed TLS fingerprints and specific failure reasons. It updates discovery calls to return detailed error categories instead of simple options, facilitating better debugging of why the pinned backend set might not be expanding. The review feedback highlights a correction needed in the documentation regarding the order of parallel call results and recommends sanitizing error messages to prevent sensitive information leakage in logs.

Comment on lines +57 to +58
/// Order is the order in which discovery's parallel calls returned (not the
/// order they were launched). Same fingerprint appears multiple times when
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The futures::future::join_all function used in discover_model preserves the order of the input futures. Therefore, the results will be in the order they were launched (call 0, call 1, etc.), not the order in which they returned. This is actually beneficial for debugging as it allows correlating results with the specific algorithm used for each call index.

    /// Order is the order in which discovery's parallel calls were launched
    /// (preserved by join_all). Same fingerprint appears multiple times when

} else {
"send"
};
return Err(format!("{category}: {e}"));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The error message from reqwest::Error may contain sensitive information such as internal URLs or IP addresses. Since these failure reasons are now surfaced in INFO logs, it is recommended to use the existing sanitize_error_message utility to redact sensitive details.

                            return Err(format!("{category}: {}", Self::sanitize_error_message(&e.to_string())));
References
  1. To avoid logic drift and simplify maintenance, consolidate duplicated logic into a single source of truth.

error = %e,
"Attestation verification failed for discovered backend"
);
failure_reasons.push(format!("verify: {e}"));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Attestation verification errors can sometimes embed details about the environment or specific backend responses. It is safer to sanitize these strings before including them in the DiscoveryOutcome which is logged at INFO level.

                    failure_reasons.push(format!("verify: {}", Self::sanitize_error_message(&e.to_string())));
References
  1. To avoid logic drift and simplify maintenance, consolidate duplicated logic into a single source of truth.

Address two issues from the bot reviews:

1. The doc on observed_fingerprints/failure_reasons said the order was the
   order calls *returned*. join_all preserves input order, so it's actually
   launch order (by attempt index). Updated both docs.

2. failed_calls only counted transport-level failures, but failure_reasons
   also included post-HTTP verify failures. An operator reading the INFO
   log could see failed_calls=0 alongside failure_reasons=["verify: ..."]
   and reasonably wonder which counter was correct. Added a separate
   verify_failures counter so the relationship is explicit:
     failure_reasons.len() == failed_calls + verify_failures
   and surfaced it on all three INFO/WARN log sites (initial discovery,
   cumulative expansion, cumulative no-new-fingerprints).

Also documented the category vocabulary on failure_reasons inline so
operators don't have to grep the source to know what 'client_build:' or
'send_timeout:' mean.
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env May 15, 2026 10:40 — with GitHub Actions Inactive
@Evrard-Nil
Copy link
Copy Markdown
Collaborator Author

Pushed ca45bd8 addressing the review feedback:

  • Ordering doc fix (claude / copilot / gemini): futures::future::join_all preserves input order, not completion order. Updated both observed_fingerprints and failure_reasons docs to say "launch order (by attempt index)".
  • Counter clarity (claude): added a separate verify_failures: usize to DiscoveryOutcome so post-HTTP verification failures are not silently rolled into the failure_reasons list while failed_calls shows 0. The relationship is now explicit:
    failure_reasons.len() == failed_calls + verify_failures
    
    verify_failures is surfaced on the three INFO/WARN log sites (initial discovery, cumulative expansion, cumulative no-new-fingerprints). Also documented the category vocabulary on failure_reasons inline.

Not addressed (intentional):

  • Status format with space (claude nit): kept "{category}: {detail}" consistent across all categories rather than removing the space for status:NNN only. PR description was wrong; categories all use "{cat}: ".
  • Double clone on push+insert (claude nit): left as-is — SHA-256 hex strings are 64 bytes, cost is negligible.

@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env May 15, 2026 10:46 — with GitHub Actions Inactive
@Evrard-Nil Evrard-Nil merged commit 86a4a8b into main May 15, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants