feat(telemetry): per-LLM-call metrics, structured logs, and tool tracking#5

Merged
offendingcommit merged 5 commits into sync/upstream-2026-05-03 from claude/llm-call-observability-metrics
May 4, 2026

Conversation

@offendingcommit (Owner)

Summary

Adds observability around every honcho_llm_call invocation so we can answer "which model is effective for which feature" directly from Prometheus, without bouncing through Langfuse or parsing logs.

Motivated by ops pain in our k8s deploy: dream cycles on glm-5.1:cloud were averaging ~22 minutes wall time with frequent 10/12-iteration tool-loop saturation and silent timeouts, and we had no metric that distinguished "model didn't converge" from "infra broke", or that revealed when a call had silently failed over to the Gemini backup.

What changes

Six new Prometheus series (cardinality-bounded; no workspace_name label since these answer model effectiveness, not per-tenant slowness):

| Series | Labels | Question it answers |
| --- | --- | --- |
| llm_calls (counter) | feature, provider, model, outcome | "Which model handled which feature, and did it work?" |
| llm_call_duration_seconds (histogram) | feature, provider, model, outcome | p50/p95/p99 per feature × model |
| llm_tokens (counter) | feature, provider, model, token_type | Cost & throughput per model — the existing *_tokens_processed counters aren't model-labeled |
| llm_tool_calls (counter) | feature, tool_name, outcome | Which tools the model invokes, and their success rate |
| llm_iterations (histogram) | feature, outcome | Detects max-iteration saturation (the glm-5.1 issue) |
| llm_backup_used (counter) | feature, primary/backup provider+model | When the retry chain falls over to the backup — currently invisible |
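
For orientation, here is a minimal sketch of what the declarations could look like with prometheus_client. The real series hang off the project's PrometheusMetrics singleton; label names are read from the table above, while the backup-series label set is a guess:

```python
# Sketch only: the real declarations live on the PrometheusMetrics
# singleton. Label names follow the table above; the llm_backup_used
# label split is an assumption.
from prometheus_client import Counter, Histogram

LLM_CALLS = Counter(
    "llm_calls", "LLM calls per feature/provider/model/outcome",
    ["feature", "provider", "model", "outcome"])
LLM_CALL_DURATION = Histogram(
    "llm_call_duration_seconds", "Wall time per LLM call",
    ["feature", "provider", "model", "outcome"])
LLM_TOKENS = Counter(
    "llm_tokens", "Tokens per model, split by token_type",
    ["feature", "provider", "model", "token_type"])
LLM_TOOL_CALLS = Counter(
    "llm_tool_calls", "Tool invocations inside the tool loop",
    ["feature", "tool_name", "outcome"])
LLM_ITERATIONS = Histogram(
    "llm_iterations", "Tool-loop iterations consumed per call",
    ["feature", "outcome"])
LLM_BACKUP_USED = Counter(
    "llm_backup_used", "Failovers from primary to backup",
    ["feature", "primary_provider", "primary_model",
     "backup_provider", "backup_model"])
```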

Plus one logfmt log line per call (honcho.llm.call feature=… provider=… model=… latency_ms=… outcome=… …) for Loki / kubectl logs filtering.

The outcome taxonomy splits "model didn't converge" (error_max_iterations) from "infra broke" (error_timeout / error_validation / error_other); success_via_backup and success_after_retry get their own buckets so the silent reliability tax stays observable.
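
The taxonomy maps naturally onto an enum. The PR exports an LLMCallOutcome enum from src.telemetry.prometheus; the member spelling below is inferred from the bucket names in this description:

```python
# Inferred shape of the outcome taxonomy; member names come from the
# buckets named in this PR, member ordering/casing is a guess.
from enum import Enum

class LLMCallOutcome(str, Enum):
    SUCCESS = "success"
    SUCCESS_AFTER_RETRY = "success_after_retry"    # primary won on a later attempt
    SUCCESS_VIA_BACKUP = "success_via_backup"      # backup provider answered
    ERROR_MAX_ITERATIONS = "error_max_iterations"  # model didn't converge
    ERROR_TIMEOUT = "error_timeout"                # infra broke: deadline hit
    ERROR_VALIDATION = "error_validation"          # infra broke: bad response shape
    ERROR_OTHER = "error_other"
```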

Implementation

Five atomic commits, each leaving the tree green (lint + typecheck + LLM test suite pass):

  1. feat(llm): add hit_max_iterations flag on tool-loop response — pure data-model addition; sets the flag on the synthesis-fallback path so downstream observability can label that case distinctly.
  2. feat(telemetry): declare per-LLM-call Prometheus series — declares 6 new series + recorder methods on the existing PrometheusMetrics singleton (sentry-wrapped error handling). Pure declarations, no callers yet.
  3. feat(telemetry): add observe_llm_call helper — context-manager-based wrapper (src/telemetry/llm_call_metrics.py); helpers for finalize/max-iter; feature-label normalizer ("Dreamer/deduction" → dream_deduction). A usage sketch follows this list.
  4. feat(llm): emit per-tool-call metrics inside the tool execution loop — threads track_name/trace_name through execute_tool_loop's signature (default None) and emits a per-tool counter in both the success and error branches.
  5. feat(llm): wire observe_llm_call into honcho_llm_call — wraps the call body; captures the winning AttemptPlan via a last_plan cell so primary-vs-backup is recorded correctly.
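
Roughly, the helper-plus-wiring shape (commits 3 and 5 together) reduces to the sketch below. The real observe_llm_call, _CallState, and finalize_success live in src/telemetry/llm_call_metrics.py; everything here is an illustrative stand-in, not the project's code:

```python
# Illustrative stand-in for the wiring, not the project's code.
from contextlib import contextmanager
from dataclasses import dataclass

@dataclass
class _CallState:
    feature: str
    outcome: str = "error_other"   # pessimistic default until finalized

@contextmanager
def observe_llm_call(feature: str):
    state = _CallState(feature=feature)
    try:
        yield state                # caller mutates state over the call's lifetime
    finally:
        # one logfmt line (and, in the real module, one set of Prom samples)
        print(f"honcho.llm.call feature={state.feature} outcome={state.outcome}")

def honcho_llm_call(feature: str) -> str:
    with observe_llm_call(feature) as state:
        response = "model answer"  # stand-in for the real attempt/tool loop
        state.outcome = "success"  # the real helper does this via finalize_success(...)
        return response
```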

Cardinality budget: feature × provider × model × outcome ≈ 1.7k series cap.
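
For scale, with assumed (not measured) label-set sizes of 9 features, 3 providers, 8 models, and 8 outcomes: 9 × 3 × 8 × 8 = 1,728 label combinations, the order of magnitude behind the ≈1.7k figure.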

Sample queries

# Latency p95 per feature × model — the headline question
histogram_quantile(0.95, sum by (le, feature, model) (rate(llm_call_duration_seconds_bucket[5m])))

# Backup-failover rate (the "is the primary failing silently" signal)
sum by (feature, primary_model) (rate(llm_backup_used_total[1h]))

# Max-iter saturation per model
sum by (feature, model) (rate(llm_calls_total{outcome="error_max_iterations"}[10m]))

# Tokens through Gemini vs glm-5.1 in the last hour
sum by (model) (increase(llm_tokens_total{feature=~"dream.*"}[1h]))

Test plan

  • uv run ruff check src/ — clean
  • uv run basedpyright src/llm src/telemetry — 0 errors / 0 warnings
  • uv run pytest tests/llm/ — 79 pass
  • End-to-end smoke (synthetic ModelConfig + ResolvedFallbackConfig): all 6 series fire correctly across success, success_via_backup, error_max_iterations, error_timeout, and tool-call paths
  • Deploy to homelab and verify metrics show up in Prometheus + the structured log line lands cleanly in kubectl logs
  • Follow-up: Grafana dashboard JSON tuned to real cardinality

Targeting

Based on sync/upstream-2026-05-03 (the in-flight upstream sync, PR #4) so this PR's diff is just the 5 metrics commits. GitHub will retarget this PR to main automatically once #4 merges.

Not in scope

  • Grafana dashboard JSON — coming as a follow-up after we see real series flowing
  • Helm chart bump — infra/honcho/values.yaml still pins tag: latest, so ArgoCD will pick up the new image once it's pushed to GHCR

Commit messages

Commit 1: feat(llm): add hit_max_iterations flag on tool-loop response

Telemetry-only signal: True when the loop exited via the max-iterations
synthesis path rather than the model deciding to stop. Distinguishes
"model didn't converge" from natural termination so downstream
observability can label the two cases differently.

No emitter changes — flag is set but no consumer reads it yet.
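
A minimal sketch of the data-model change this describes (the flag name is from the commit; the response type and its other fields are hypothetical):

```python
# Sketch: "ToolLoopResponse" and its fields besides hit_max_iterations
# are hypothetical stand-ins for the real response model.
from dataclasses import dataclass

@dataclass
class ToolLoopResponse:
    content: str
    iterations: int = 0
    hit_max_iterations: bool = False  # True only on the synthesis-fallback exit

def synthesis_fallback(partial_answer: str, iterations: int) -> ToolLoopResponse:
    # Max-iterations exit: mark the response so downstream observability
    # can bucket it as error_max_iterations instead of a clean success.
    return ToolLoopResponse(content=partial_answer, iterations=iterations,
                            hit_max_iterations=True)
```
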
Commit 2: feat(telemetry): declare per-LLM-call Prometheus series

Adds six new metrics + recorder methods on the existing PrometheusMetrics
singleton; no callers yet, so this commit is purely declarative.

Series:
- llm_calls / llm_call_duration_seconds — counter + histogram per call,
  labeled by feature × provider × model × outcome.
- llm_tokens — input/output/cache_read/cache_creation per
  feature × provider × model.
- llm_tool_calls — per-tool invocation outcome inside the tool loop.
- llm_iterations — histogram of iterations consumed per call/outcome.
- llm_backup_used — counts failovers from primary to backup provider.

Cardinality-bounded: feature × provider × model × outcome ≈ 1.7k series
cap. Deliberately no workspace_name label here — these answer "is this
model effective for this feature", not "is workspace X slow".

LLMCallOutcome enum exported from src.telemetry.prometheus so callers can
reference the canonical values without importing from the metrics module
directly.
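
The recorder-method pattern implied here, with emission guarded so a metrics bug can't break the call, might look like the following sketch; the method name record_llm_call and the guard details are assumptions:

```python
# Sketch of a guarded recorder method; names are assumptions. The real
# code routes failures to Sentry rather than a plain logger.
import logging
from prometheus_client import Counter

logger = logging.getLogger(__name__)

LLM_CALLS = Counter("llm_calls", "LLM calls",
                    ["feature", "provider", "model", "outcome"])

class PrometheusMetrics:
    def record_llm_call(self, feature: str, provider: str,
                        model: str, outcome: str) -> None:
        try:
            LLM_CALLS.labels(feature=feature, provider=provider,
                             model=model, outcome=outcome).inc()
        except Exception:
            # A broken metric must never take down the LLM call it observes.
            logger.exception("failed to record llm_calls sample")
```
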
Commit 3: feat(telemetry): add observe_llm_call helper

Introduces src/telemetry/llm_call_metrics.py — a context-manager-based
wrapper that turns one LLM call into one set of Prometheus samples and
one logfmt log line.

Surface:
- observe_llm_call(...) — context manager yielding a mutable _CallState
  the caller populates over the call's lifetime.
- finalize_success(...) — populate state from a successful response and
  pick the outcome bucket (success / success_after_retry / success_via_backup).
- mark_max_iterations(...) — flip the state to error_max_iterations when
  the tool loop exited via the synthesis path.
- normalize_feature_label(...) — maps caller's track_name/trace_name to
  a low-cardinality Prom label (e.g. "Dreamer/deduction" -> dream_deduction).

No callers wired in yet — this commit is the helper module on its own
so the diff stays reviewable. Wiring into honcho_llm_call and the tool
loop lands in subsequent commits.

Errors raised inside the wrapped call are classified into outcome
buckets (timeout / validation / other) and re-raised; the wrapper never
swallows or transforms exceptions.
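
The classification contract could be sketched as below. The concrete exception types mapped to each bucket are assumptions; the load-bearing property is classify-then-re-raise:

```python
# Classification sketch; exception-to-bucket mapping is assumed.
from contextlib import contextmanager

@contextmanager
def observe_llm_call(feature: str):
    state = {"outcome": "success"}
    try:
        yield state
    except TimeoutError:
        state["outcome"] = "error_timeout"
        raise                          # never swallowed, never transformed
    except ValueError:                 # stand-in for response-validation errors
        state["outcome"] = "error_validation"
        raise
    except Exception:
        state["outcome"] = "error_other"
        raise
    finally:
        # samples + logfmt line are emitted whether the call succeeded or not
        print(f"honcho.llm.call feature={feature} outcome={state['outcome']}")
```
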
Commit 4: feat(llm): emit per-tool-call metrics inside the tool execution loop

Adds prometheus_metrics.record_llm_tool_call() calls in both the
success and error branches of execute_tool_loop's per-tool dispatch.
Threads track_name / trace_name through the function signature so the
emitted metric carries the same feature label that the call-level
metrics will use.

Both new params default to None (current callers don't pass them yet),
so feature label resolves to "unknown" until honcho_llm_call is wired
in the next commit. Metric emission is wrapped in PrometheusMetrics'
sentry-captured error handler — a metric bug can never break a real
tool call.
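
A sketch of that dispatch-loop emission (the tool registry, the stub metrics object, and the exact record_llm_tool_call signature are assumptions):

```python
# Sketch of per-tool emission inside the dispatch loop; names assumed.
class _Metrics:
    """Stub standing in for the PrometheusMetrics singleton."""
    def record_llm_tool_call(self, feature: str, tool_name: str,
                             outcome: str) -> None:
        print(f"llm_tool_calls feature={feature} tool={tool_name} outcome={outcome}")

TOOLS = {"echo": lambda text: text}    # hypothetical tool registry

def dispatch_tool(metrics: _Metrics, feature: str, tool_name: str, args: dict):
    try:
        result = TOOLS[tool_name](**args)
        metrics.record_llm_tool_call(feature, tool_name, "success")
        return result
    except Exception:
        # Error branch also emits, then propagates; emission itself is
        # guarded upstream so it can't break the tool call.
        metrics.record_llm_tool_call(feature, tool_name, "error")
        raise
```
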
Commit 5: feat(llm): wire observe_llm_call into honcho_llm_call

Wraps the body of honcho_llm_call (both tool-less and tool-loop paths)
in observe_llm_call(...) so every invocation produces one set of
Prometheus samples and one logfmt log line.

Captures the AttemptPlan that produced the most-recent (and on success,
the winning) call via a `last_plan` cell updated inside _get_attempt_plan,
so the recorded provider/model is the one that actually answered —
primary on early attempts, backup on the final retry. This makes
backup-on-final-attempt observable directly from llm_calls / llm_tokens
without parsing logs.
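
The `last_plan` cell is a one-slot closure; a simplified stand-in (AttemptPlan's fields and the retry loop are assumed):

```python
# Sketch of the `last_plan` cell; fields and loop are simplified
# stand-ins for the real fallback machinery.
from dataclasses import dataclass

@dataclass
class AttemptPlan:
    provider: str
    model: str
    is_backup: bool

def call_model(plan: AttemptPlan) -> str:
    return f"answer via {plan.model}"          # stand-in for the provider call

def run_with_fallback(plans: list[AttemptPlan]) -> tuple[str, AttemptPlan]:
    last_plan: list[AttemptPlan] = []          # one-slot mutable cell

    def get_attempt_plan(i: int) -> AttemptPlan:
        last_plan[:] = [plans[i]]              # record whichever plan runs
        return plans[i]

    for i in range(len(plans)):
        try:
            answer = call_model(get_attempt_plan(i))
            return answer, last_plan[0]        # the plan that actually answered
        except TimeoutError:
            continue
    raise RuntimeError("all attempts failed")
```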

Passes track_name and trace_name through to execute_tool_loop so its
per-tool counter (added in the previous commit) carries the same
feature label as the call-level metrics.

When the tool loop returns response.hit_max_iterations=True, the call's
outcome is overridden to error_max_iterations via mark_max_iterations
so dashboards can split "model didn't converge" from clean success
without the tool-loop having to know about outcome semantics.

Streaming responses don't carry token counts at the entry point —
the recorded call still emits, but the token counters skip those rows
(record_llm_tokens silently no-ops on count <= 0). An acceptable partial
signal until the streaming refactor surfaces tokens earlier.

ruff + basedpyright clean. End-to-end smoke verified all six series
fire correctly across success, success_via_backup, error_max_iterations,
error_timeout, and tool-call paths.
offendingcommit merged commit 937627d into sync/upstream-2026-05-03 on May 4, 2026 (1 check failed).