observability: surface cumulative discovery cycles at INFO #595
The cumulative discovery refresh on prod is silently failing to expand
the TLS SPKI pin set for multi-backend models (notably GLM-5.1-FP8 with
5 backends behind the L4 SNI proxy). After startup, no further
"Pinned new TLS SPKI fingerprint" or "Cumulative discovery expanded
pinned backend set" events appear, even though load_inference_url_models
runs every 5 min. The no-new-fingerprints branch logged at DEBUG and was
filtered out in production, so the cycle was invisible — we could not
tell whether discovery was running, what backends the L4 LB routed to,
or whether calls were failing.
This commit adds two fields to DiscoveryOutcome:
- observed_fingerprints: every verified TLS SPKI fingerprint a
cumulative cycle saw, in call order. Repeats mean the L4 LB sent
multiple calls to the same backend — the load-balancing-bias signal
needed to diagnose why pin-set expansion stalled.
- failure_reasons: a category-prefixed reason per call that didn't
produce a fingerprint (connect / send_timeout / timeout / status /
malformed_json / query_encode / client_build / verify).
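To illustrate how repeats in observed_fingerprints expose load-balancing bias, here is a minimal sketch. The struct is a hypothetical, trimmed-down stand-in for the real DiscoveryOutcome (only the two new fields are modeled), and fingerprint_hits is an illustrative helper that is not part of this change.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the real `DiscoveryOutcome`; only the two
/// fields described above are modeled here.
#[derive(Debug, Default)]
struct DiscoveryOutcome {
    observed_fingerprints: Vec<String>,
    failure_reasons: Vec<String>,
}

/// Illustrative helper: count how many calls landed on each fingerprint.
/// A count above 1 means the LB routed several calls to the same backend.
fn fingerprint_hits(outcome: &DiscoveryOutcome) -> BTreeMap<&str, usize> {
    let mut hits = BTreeMap::new();
    for fp in &outcome.observed_fingerprints {
        *hits.entry(fp.as_str()).or_insert(0) += 1;
    }
    hits
}

fn main() {
    // Made-up fingerprints and reason, for illustration only.
    let outcome = DiscoveryOutcome {
        observed_fingerprints: vec!["aa11".to_string(), "aa11".to_string(), "bb22".to_string()],
        failure_reasons: vec!["timeout: deadline elapsed".to_string()],
    };
    let hits = fingerprint_hits(&outcome);
    println!("hits={hits:?} failures={:?}", outcome.failure_reasons);
}
```

A cycle whose hit map has a single key with a high count would suggest that every call reached the same backend.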
The per-call DEBUG logs are unchanged. The aggregate is logged once per
provider per refresh cycle, so volume is bounded by model count /
refresh_interval (with default 300s and ~11 models this is ~2 INFO
log-lines/minute per replica).
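The arithmetic behind the volume estimate can be checked with a short sketch. It assumes exactly one aggregate INFO line per model per refresh cycle; the function name is hypothetical.

```rust
/// Steady-state aggregate INFO lines per minute for one replica, assuming
/// one aggregate line per model per refresh cycle.
fn info_lines_per_minute(model_count: u32, refresh_interval_secs: u32) -> f64 {
    f64::from(model_count) * 60.0 / f64::from(refresh_interval_secs)
}

fn main() {
    // 11 models, 300 s refresh interval: 11 * 60 / 300 = 2.2 lines per minute.
    let rate = info_lines_per_minute(11, 300);
    println!("{rate:.1} lines/minute");
}
```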
The "Cumulative discovery cycle" log is promoted from DEBUG to INFO
and renamed to "Cumulative discovery cycle (no new fingerprints)" to
distinguish it from the expansion event. The same observed/failure
fields are also added to the expansion event and the initial-discovery
event so the same diagnostic data is available at startup and during
recovery.
No customer data is logged: fingerprints are server cert hashes,
failure reasons are categorized error types.
Code Review
Observability-only change, low risk. A couple of correctness nits worth fixing before merge so the new logs read the way the PR description claims:
Pull request overview
Adds INFO-level observability for attestation discovery so production can diagnose why cumulative TLS SPKI discovery is not expanding backend pin sets.
Changes:
- Tracks observed TLS fingerprints and per-call discovery failure reasons in DiscoveryOutcome.
- Surfaces those fields on initial discovery and cumulative discovery log events.
- Promotes the no-new-fingerprints cumulative discovery log from DEBUG to INFO.
Comments suppressed due to low confidence (1)
crates/services/src/inference_provider_pool/mod.rs:63
- These reasons are not recorded in occurrence order: the futures are awaited with join_all and then processed in input/attempt order. Update the doc or collect results as each future completes if occurrence order is important for the diagnostic.
/// Per-call failure reasons that prevented a fingerprint observation, in
/// the order they occurred. Empty when all calls succeeded.
Ok(verified) => {
    if let Some(ref vfp) = verified.tls_cert_fingerprint {
        observed_fingerprints.push(vfp.clone());

} else {
    "send"
};
return Err(format!("{category}: {e}"));
/// Per-call verified TLS fingerprints observed in this pass, in call order.
/// Order is the order in which discovery's parallel calls returned (not the
/// order they were launched). Same fingerprint appears multiple times when
/// the L4 LB routed several calls to the same backend — exactly the signal
/// needed to debug why cumulative discovery isn't expanding the pin set.
observed_fingerprints: Vec<String>,
/// Per-call failure reasons that prevented a fingerprint observation, in
/// the order they occurred. Empty when all calls succeeded.
Code Review
This pull request enhances the discovery process within the inference provider pool by introducing tracking for observed TLS fingerprints and specific failure reasons. It updates discovery calls to return detailed error categories instead of simple options, facilitating better debugging of why the pinned backend set might not be expanding. The review feedback highlights a correction needed in the documentation regarding the order of parallel call results and recommends sanitizing error messages to prevent sensitive information leakage in logs.
/// Order is the order in which discovery's parallel calls returned (not the
/// order they were launched). Same fingerprint appears multiple times when
The futures::future::join_all function used in discover_model preserves the order of the input futures. Therefore, the results will be in the order they were launched (call 0, call 1, etc.), not the order in which they returned. This is actually beneficial for debugging as it allows correlating results with the specific algorithm used for each call index.
/// Order is the order in which discovery's parallel calls were launched
/// (preserved by join_all). Same fingerprint appears multiple times when
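The ordering point can be illustrated without external crates. The sketch below is an analogy using std threads, not futures::future::join_all itself: results are collected by joining handles in launch order, so the output is in launch order even though the calls finish in reverse. The function name and the delays are made up for illustration.

```rust
use std::thread;
use std::time::Duration;

/// Spawn one worker per delay and collect results by joining the handles in
/// launch order. Each worker returns its own call index.
fn run_in_launch_order(delays_ms: &[u64]) -> Vec<usize> {
    let handles: Vec<thread::JoinHandle<usize>> = delays_ms
        .iter()
        .enumerate()
        .map(|(idx, &ms)| {
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(ms));
                idx
            })
        })
        .collect();

    // All workers are already running; joining in input order fixes the
    // result order regardless of which worker finished first.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

fn main() {
    // Call 0 is the slowest and call 2 the fastest, so they complete as 2, 1, 0.
    let results = run_in_launch_order(&[30, 20, 10]);
    println!("{results:?}");
}
```

Because the position in the result vector matches the call index, each result can be correlated with the attempt that produced it.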
} else {
    "send"
};
return Err(format!("{category}: {e}"));
The error message from reqwest::Error may contain sensitive information such as internal URLs or IP addresses. Since these failure reasons are now surfaced in INFO logs, it is recommended to use the existing sanitize_error_message utility to redact sensitive details.
return Err(format!("{category}: {}", Self::sanitize_error_message(&e.to_string())));
References
- To avoid logic drift and simplify maintenance, consolidate duplicated logic into a single source of truth.
    error = %e,
    "Attestation verification failed for discovered backend"
);
failure_reasons.push(format!("verify: {e}"));
Attestation verification errors can sometimes embed details about the environment or specific backend responses. It is safer to sanitize these strings before including them in the DiscoveryOutcome which is logged at INFO level.
failure_reasons.push(format!("verify: {}", Self::sanitize_error_message(&e.to_string())));
References
- To avoid logic drift and simplify maintenance, consolidate duplicated logic into a single source of truth.
Address two issues from the bot reviews:
1. The doc on observed_fingerprints/failure_reasons said the order was the
order calls *returned*. join_all preserves input order, so it's actually
launch order (by attempt index). Updated both docs.
2. failed_calls only counted transport-level failures, but failure_reasons
also included post-HTTP verify failures. An operator reading the INFO
log could see failed_calls=0 alongside failure_reasons=["verify: ..."]
and reasonably wonder which counter was correct. Added a separate
verify_failures counter so the relationship is explicit:
failure_reasons.len() == failed_calls + verify_failures
and surfaced it on all three INFO/WARN log sites (initial discovery,
cumulative expansion, cumulative no-new-fingerprints).
Also documented the category vocabulary on failure_reasons inline so
operators don't have to grep the source to know what 'client_build:' or
'send_timeout:' mean.
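The counter relationship from item 2 can be sketched as a consistency check. The struct below is a hypothetical stand-in carrying only the three fields named in this commit message; the method names and the prefix-based cross-check are illustrative assumptions, not code from the PR.

```rust
/// Hypothetical stand-in holding the two counters and the reason list.
struct CycleCounters {
    failed_calls: usize,
    verify_failures: usize,
    failure_reasons: Vec<String>,
}

impl CycleCounters {
    /// True when every recorded reason is accounted for by one of the counters:
    /// failure_reasons.len() == failed_calls + verify_failures.
    fn is_consistent(&self) -> bool {
        self.failure_reasons.len() == self.failed_calls + self.verify_failures
    }

    /// Cross-check: count reasons carrying the `verify:` category prefix.
    fn verify_reasons(&self) -> usize {
        self.failure_reasons
            .iter()
            .filter(|reason| reason.starts_with("verify:"))
            .count()
    }
}

fn main() {
    // Made-up reasons, for illustration only.
    let counters = CycleCounters {
        failed_calls: 1,
        verify_failures: 1,
        failure_reasons: vec![
            "connect: connection refused".to_string(),
            "verify: report rejected".to_string(),
        ],
    };
    assert!(counters.is_consistent());
    assert_eq!(counters.verify_reasons(), counters.verify_failures);
    println!("counters are consistent");
}
```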
Pushed ca45bd8 addressing the review feedback:
Not addressed (intentional):
Why
Production cloud-api intermittently returns 503 for GET /v1/attestation/report?model=zai-org/GLM-latest (and other multi-backend models). Root cause is that some replicas only have a partial TLS SPKI pin set for glm-5-1.completions.near.ai (5 backends behind the L4 SNI proxy), and cumulative discovery is not expanding the set after startup.

After startup we see zero "Pinned new TLS SPKI fingerprint" or "Cumulative discovery expanded pinned backend set" events for GLM on either prod host in 6+ hours, even though "Loaded inference_url models" fires every 5 min. The no-new-fingerprints branch logs at DEBUG (filtered in prod), so we can't tell from logs whether cumulative actually ran, which backends the L4 LB routed it to, or whether per-call failures (TLS reject / timeout / 5xx) are silently eating cycles.

What this PR does
Adds two fields to DiscoveryOutcome so every discovery pass produces enough INFO-level data to find the root cause on prod:
- observed_fingerprints: Vec<String>. Every verified TLS SPKI fingerprint a cycle saw, in call order. Repeats mean the L4 LB sent multiple calls to the same backend (load-balancing collapse). Missing-from-prod-pin-set entries mean the LB did route to a new backend but something else (e.g. verify) failed.
- failure_reasons: Vec<String>. A category-prefixed reason per call that didn't produce a fingerprint. Categories: connect / send_timeout / timeout / request / send / status:NNN / malformed_json / query_encode / client_build / verify.

These fields are surfaced on three existing log events:
- Cumulative discovery cycle (no new fingerprints): promoted from DEBUG to INFO and renamed. This is the key signal; it now fires once per provider per refresh cycle in prod.
- Cumulative discovery expanded pinned backend set: already INFO; gains observed_fingerprints and failure_reasons.
- Initial attestation discovery complete: already INFO; same additions, so startup diagnostics show what each replica saw.

Per-call DEBUG logs are unchanged.
Volume
Aggregate INFO log fires once per provider per cumulative cycle. With default refresh_interval_secs = 300 and ~11 reused models, that's ~2-3 log-lines/min per replica during steady state, well within budget.

Privacy
No customer data. Fingerprints are server-cert SPKI hashes (already logged at INFO in "Pinned new TLS SPKI fingerprint"). Failure reasons are bounded categorical strings + reqwest error display, no request/response bodies.

What we expect to see once this ships
For the GLM stall, one of:
- observed_fingerprints contains only the already-pinned values every cycle → L4 LB is collapsing onto known backends; fix is on the model-proxy / discovery-routing side (PR for select_backend_excluding).
- failure_reasons contains connect / timeout / verify consistently → upstream backend or attestation path is the blocker.

Without this PR we cannot distinguish these on prod.
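The two expected outcomes can be told apart mechanically from a single cycle's log fields. The sketch below is a simplified illustration under assumptions: the Diagnosis enum and diagnose function are hypothetical, it looks at one cycle only (the text above asks for the pattern to hold consistently across cycles), and a cycle with both known fingerprints and failures is classified by its fingerprints.

```rust
use std::collections::HashSet;

/// Hypothetical single-cycle classification of the logged fields.
#[derive(Debug, PartialEq)]
enum Diagnosis {
    /// At least one observed fingerprint is not in the pinned set yet.
    NewBackendSeen,
    /// Every observed fingerprint was already pinned: LB keeps hitting known backends.
    LbCollapse,
    /// Nothing was observed and at least one call reported a failure reason.
    CallsFailing,
    /// Nothing observed and nothing failed; the cycle carries no signal.
    NoData,
}

fn diagnose(pinned: &HashSet<&str>, observed: &[&str], failure_reasons: &[&str]) -> Diagnosis {
    if observed.iter().any(|fp| !pinned.contains(fp)) {
        return Diagnosis::NewBackendSeen;
    }
    if !observed.is_empty() {
        return Diagnosis::LbCollapse;
    }
    if !failure_reasons.is_empty() {
        return Diagnosis::CallsFailing;
    }
    Diagnosis::NoData
}

fn main() {
    // Made-up fingerprints, for illustration only.
    let pinned: HashSet<&str> = ["aa11", "bb22"].iter().copied().collect();
    let no_failures: [&str; 0] = [];
    let verdict = diagnose(&pinned, &["aa11", "aa11"], &no_failures);
    println!("{verdict:?}");
}
```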
Test plan
- cargo check -p services clean
- cargo test --lib -p services inference_provider_pool: 31/31 passing
- Cumulative discovery cycle (no new fingerprints) events appear with non-empty observed_fingerprints for healthy multi-backend models
- service:cloud-api @fields.model:zai-org/GLM-5.1-FP8 @fields.message:"Cumulative discovery cycle", confirm whether observations repeat or rotate

Follow-ups (out of scope)
Once we have the data we can pick the right fix from:
- CUMULATIVE_DISCOVERY_CALLS from 2 → 5 to match initial parallelism.
- select_backend_excluding(already_pinned) so model-proxy rotates to undiscovered backends.