feat(telemetry): per-LLM-call metrics, structured logs, and tool tracking#5

Merged
offendingcommit merged 5 commits into sync/upstream-2026-05-03 from claude/llm-call-observability-metrics
May 4, 2026

Conversation

@offendingcommit (Owner)

Summary

Adds observability around every honcho_llm_call invocation so we can answer "which model is effective for which feature" directly from Prometheus, without bouncing through Langfuse or parsing logs.

Motivated by ops pain in our k8s deploy: dream cycles on glm-5.1:cloud were averaging ~22 minutes wall time with frequent 10/12-iteration tool-loop saturation and silent timeouts, and we had no metric that distinguished "model didn't converge" from "infra broke", or that revealed when a call had silently failed over to the Gemini backup.

What changes

Six new Prometheus series (cardinality-bounded; no workspace_name label since these answer model effectiveness, not per-tenant slowness):

| Series | Labels | Question it answers |
| --- | --- | --- |
| llm_calls (counter) | feature, provider, model, outcome | "Which model handled which feature, and did it work?" |
| llm_call_duration_seconds (histogram) | feature, provider, model, outcome | p50/p95/p99 per feature × model |
| llm_tokens (counter) | feature, provider, model, token_type | Cost & throughput per model — the existing *_tokens_processed counters aren't model-labeled |
| llm_tool_calls (counter) | feature, tool_name, outcome | Which tools the model invokes, and their success rate |
| llm_iterations (histogram) | feature, outcome | Detects max-iteration saturation (the glm-5.1 issue) |
| llm_backup_used (counter) | feature, primary/backup provider+model | When the retry chain falls over to the backup — currently invisible |
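
For orientation, here is a minimal sketch of what the declarations could look like with prometheus_client. The real series hang off the project's PrometheusMetrics singleton; label names are read from the table above, while the backup-series label set is a guess:

```python
# Sketch only: the real declarations live on the PrometheusMetrics
# singleton. Label names follow the table above; the llm_backup_used
# label split is an assumption.
from prometheus_client import Counter, Histogram

LLM_CALLS = Counter(
    "llm_calls", "LLM calls per feature/provider/model/outcome",
    ["feature", "provider", "model", "outcome"])
LLM_CALL_DURATION = Histogram(
    "llm_call_duration_seconds", "Wall time per LLM call",
    ["feature", "provider", "model", "outcome"])
LLM_TOKENS = Counter(
    "llm_tokens", "Tokens per model, split by token_type",
    ["feature", "provider", "model", "token_type"])
LLM_TOOL_CALLS = Counter(
    "llm_tool_calls", "Tool invocations inside the tool loop",
    ["feature", "tool_name", "outcome"])
LLM_ITERATIONS = Histogram(
    "llm_iterations", "Tool-loop iterations consumed per call",
    ["feature", "outcome"])
LLM_BACKUP_USED = Counter(
    "llm_backup_used", "Failovers from primary to backup",
    ["feature", "primary_provider", "primary_model",
     "backup_provider", "backup_model"])
```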

Plus one logfmt log line per call (honcho.llm.call feature=… provider=… model=… latency_ms=… outcome=… …) for Loki / kubectl logs filtering.

The outcome taxonomy splits "model didn't converge" (error_max_iterations) from "infra broke" (error_timeout / error_validation / error_other); success_via_backup and success_after_retry get their own buckets so the silent reliability tax stays observable.
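
The taxonomy maps naturally onto an enum. The PR exports an LLMCallOutcome enum from src.telemetry.prometheus; the member spelling below is inferred from the bucket names in this description:

```python
# Inferred shape of the outcome taxonomy; member names come from the
# buckets named in this PR, member ordering/casing is a guess.
from enum import Enum

class LLMCallOutcome(str, Enum):
    SUCCESS = "success"
    SUCCESS_AFTER_RETRY = "success_after_retry"    # primary won on a later attempt
    SUCCESS_VIA_BACKUP = "success_via_backup"      # backup provider answered
    ERROR_MAX_ITERATIONS = "error_max_iterations"  # model didn't converge
    ERROR_TIMEOUT = "error_timeout"                # infra broke: deadline hit
    ERROR_VALIDATION = "error_validation"          # infra broke: bad response shape
    ERROR_OTHER = "error_other"
```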

Implementation

Five atomic commits, each leaving the tree green (lint + typecheck + LLM test suite pass):

  1. feat(llm): add hit_max_iterations flag on tool-loop response — pure data-model addition; sets the flag on the synthesis-fallback path so downstream observability can label that case distinctly.
  2. feat(telemetry): declare per-LLM-call Prometheus series — declares 6 new series + recorder methods on the existing PrometheusMetrics singleton (sentry-wrapped error handling). Pure declarations, no callers yet.
  3. feat(telemetry): add observe_llm_call helper — context-manager-based wrapper (src/telemetry/llm_call_metrics.py); helpers for finalize/max-iter; feature-label normalizer ("Dreamer/deduction" → dream_deduction). A usage sketch follows this list.
  4. feat(llm): emit per-tool-call metrics inside the tool execution loop — threads track_name/trace_name through execute_tool_loop's signature (default None) and emits a per-tool counter in both the success and error branches.
  5. feat(llm): wire observe_llm_call into honcho_llm_call — wraps the call body; captures the winning AttemptPlan via a last_plan cell so primary-vs-backup is recorded correctly.
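
Roughly, the helper-plus-wiring shape (commits 3 and 5 together) reduces to the sketch below. The real observe_llm_call, _CallState, and finalize_success live in src/telemetry/llm_call_metrics.py; everything here is an illustrative stand-in, not the project's code:

```python
# Illustrative stand-in for the wiring, not the project's code.
from contextlib import contextmanager
from dataclasses import dataclass

@dataclass
class _CallState:
    feature: str
    outcome: str = "error_other"   # pessimistic default until finalized

@contextmanager
def observe_llm_call(feature: str):
    state = _CallState(feature=feature)
    try:
        yield state                # caller mutates state over the call's lifetime
    finally:
        # one logfmt line (and, in the real module, one set of Prom samples)
        print(f"honcho.llm.call feature={state.feature} outcome={state.outcome}")

def honcho_llm_call(feature: str) -> str:
    with observe_llm_call(feature) as state:
        response = "model answer"  # stand-in for the real attempt/tool loop
        state.outcome = "success"  # the real helper does this via finalize_success(...)
        return response
```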

Cardinality budget: feature × provider × model × outcome ≈ 1.7k series cap.
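
For scale, with assumed (not measured) label-set sizes of 9 features, 3 providers, 8 models, and 8 outcomes: 9 × 3 × 8 × 8 = 1,728 label combinations, the order of magnitude behind the ≈1.7k figure.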

Sample queries

# Latency p95 per feature × model — the headline question
histogram_quantile(0.95, sum by (le, feature, model) (rate(llm_call_duration_seconds_bucket[5m])))

# Backup-failover rate (the "is the primary failing silently" signal)
sum by (feature, primary_model) (rate(llm_backup_used_total[1h]))

# Max-iter saturation per model
sum by (feature, model) (rate(llm_calls_total{outcome="error_max_iterations"}[10m]))

# Tokens through Gemini vs glm-5.1 in the last hour
sum by (model) (increase(llm_tokens_total{feature=~"dream.*"}[1h]))

Test plan

  • uv run ruff check src/ — clean
  • uv run basedpyright src/llm src/telemetry — 0 errors / 0 warnings
  • uv run pytest tests/llm/ — 79 pass
  • End-to-end smoke (synthetic ModelConfig + ResolvedFallbackConfig): all 6 series fire correctly across success, success_via_backup, error_max_iterations, error_timeout, and tool-call paths
  • Deploy to homelab and verify metrics show up in Prometheus + the structured log line lands cleanly in kubectl logs
  • Follow-up: Grafana dashboard JSON tuned to real cardinality

Targeting

Based on sync/upstream-2026-05-03 (the in-flight upstream sync, PR #4) so this PR's diff is just the 5 metrics commits. GitHub will retarget this PR to main automatically once #4 merges.

Not in scope

  • Grafana dashboard JSON — coming as a follow-up after we see real series flowing
  • Helm chart bump — infra/honcho/values.yaml still pins tag: latest, so ArgoCD will pick up the new image once it's pushed to GHCR

Commit messages

Commit 1: feat(llm): add hit_max_iterations flag on tool-loop response

Telemetry-only signal: True when the loop exited via the max-iterations
synthesis path rather than the model deciding to stop. Distinguishes
"model didn't converge" from natural termination so downstream
observability can label the two cases differently.

No emitter changes — flag is set but no consumer reads it yet.
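
A minimal sketch of the data-model change this describes (the flag name is from the commit; the response type and its other fields are hypothetical):

```python
# Sketch: "ToolLoopResponse" and its fields besides hit_max_iterations
# are hypothetical stand-ins for the real response model.
from dataclasses import dataclass

@dataclass
class ToolLoopResponse:
    content: str
    iterations: int = 0
    hit_max_iterations: bool = False  # True only on the synthesis-fallback exit

def synthesis_fallback(partial_answer: str, iterations: int) -> ToolLoopResponse:
    # Max-iterations exit: mark the response so downstream observability
    # can bucket it as error_max_iterations instead of a clean success.
    return ToolLoopResponse(content=partial_answer, iterations=iterations,
                            hit_max_iterations=True)
```
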
Commit 2: feat(telemetry): declare per-LLM-call Prometheus series

Adds six new metrics + recorder methods on the existing PrometheusMetrics
singleton; no callers yet, so this commit is purely declarative.

Series:
- llm_calls / llm_call_duration_seconds — counter + histogram per call,
  labeled by feature × provider × model × outcome.
- llm_tokens — input/output/cache_read/cache_creation per
  feature × provider × model.
- llm_tool_calls — per-tool invocation outcome inside the tool loop.
- llm_iterations — histogram of iterations consumed per call/outcome.
- llm_backup_used — counts failovers from primary to backup provider.

Cardinality-bounded: feature × provider × model × outcome ≈ 1.7k series
cap. Deliberately no workspace_name label here — these answer "is this
model effective for this feature", not "is workspace X slow".

LLMCallOutcome enum exported from src.telemetry.prometheus so callers can
reference the canonical values without importing from the metrics module
directly.
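
The recorder-method pattern implied here, with emission guarded so a metrics bug can't break the call, might look like the following sketch; the method name record_llm_call and the guard details are assumptions:

```python
# Sketch of a guarded recorder method; names are assumptions. The real
# code routes failures to Sentry rather than a plain logger.
import logging
from prometheus_client import Counter

logger = logging.getLogger(__name__)

LLM_CALLS = Counter("llm_calls", "LLM calls",
                    ["feature", "provider", "model", "outcome"])

class PrometheusMetrics:
    def record_llm_call(self, feature: str, provider: str,
                        model: str, outcome: str) -> None:
        try:
            LLM_CALLS.labels(feature=feature, provider=provider,
                             model=model, outcome=outcome).inc()
        except Exception:
            # A broken metric must never take down the LLM call it observes.
            logger.exception("failed to record llm_calls sample")
```
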
Commit 3: feat(telemetry): add observe_llm_call helper

Introduces src/telemetry/llm_call_metrics.py — a context-manager-based
wrapper that turns one LLM call into one set of Prometheus samples and
one logfmt log line.

Surface:
- observe_llm_call(...) — context manager yielding a mutable _CallState
  the caller populates over the call's lifetime.
- finalize_success(...) — populate state from a successful response and
  pick the outcome bucket (success / success_after_retry / success_via_backup).
- mark_max_iterations(...) — flip the state to error_max_iterations when
  the tool loop exited via the synthesis path.
- normalize_feature_label(...) — maps caller's track_name/trace_name to
  a low-cardinality Prom label (e.g. "Dreamer/deduction" -> dream_deduction).

No callers wired in yet — this commit is the helper module on its own
so the diff stays reviewable. Wiring into honcho_llm_call and the tool
loop lands in subsequent commits.

Errors raised inside the wrapped call are classified into outcome
buckets (timeout / validation / other) and re-raised; the wrapper never
swallows or transforms exceptions.
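
The classification contract could be sketched as below. The concrete exception types mapped to each bucket are assumptions; the load-bearing property is classify-then-re-raise:

```python
# Classification sketch; exception-to-bucket mapping is assumed.
from contextlib import contextmanager

@contextmanager
def observe_llm_call(feature: str):
    state = {"outcome": "success"}
    try:
        yield state
    except TimeoutError:
        state["outcome"] = "error_timeout"
        raise                          # never swallowed, never transformed
    except ValueError:                 # stand-in for response-validation errors
        state["outcome"] = "error_validation"
        raise
    except Exception:
        state["outcome"] = "error_other"
        raise
    finally:
        # samples + logfmt line are emitted whether the call succeeded or not
        print(f"honcho.llm.call feature={feature} outcome={state['outcome']}")
```
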
Commit 4: feat(llm): emit per-tool-call metrics inside the tool execution loop

Adds prometheus_metrics.record_llm_tool_call() calls in both the
success and error branches of execute_tool_loop's per-tool dispatch.
Threads track_name / trace_name through the function signature so the
emitted metric carries the same feature label that the call-level
metrics will use.

Both new params default to None (current callers don't pass them yet),
so feature label resolves to "unknown" until honcho_llm_call is wired
in the next commit. Metric emission is wrapped in PrometheusMetrics'
sentry-captured error handler — a metric bug can never break a real
tool call.
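
A sketch of that dispatch-loop emission (the tool registry, the stub metrics object, and the exact record_llm_tool_call signature are assumptions):

```python
# Sketch of per-tool emission inside the dispatch loop; names assumed.
class _Metrics:
    """Stub standing in for the PrometheusMetrics singleton."""
    def record_llm_tool_call(self, feature: str, tool_name: str,
                             outcome: str) -> None:
        print(f"llm_tool_calls feature={feature} tool={tool_name} outcome={outcome}")

TOOLS = {"echo": lambda text: text}    # hypothetical tool registry

def dispatch_tool(metrics: _Metrics, feature: str, tool_name: str, args: dict):
    try:
        result = TOOLS[tool_name](**args)
        metrics.record_llm_tool_call(feature, tool_name, "success")
        return result
    except Exception:
        # Error branch also emits, then propagates; emission itself is
        # guarded upstream so it can't break the tool call.
        metrics.record_llm_tool_call(feature, tool_name, "error")
        raise
```
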
Commit 5: feat(llm): wire observe_llm_call into honcho_llm_call

Wraps the body of honcho_llm_call (both tool-less and tool-loop paths)
in observe_llm_call(...) so every invocation produces one set of
Prometheus samples and one logfmt log line.

Captures the AttemptPlan that produced the most-recent (and on success,
the winning) call via a `last_plan` cell updated inside _get_attempt_plan,
so the recorded provider/model is the one that actually answered —
primary on early attempts, backup on the final retry. This makes
backup-on-final-attempt observable directly from llm_calls / llm_tokens
without parsing logs.
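
The `last_plan` cell is a one-slot closure; a simplified stand-in (AttemptPlan's fields and the retry loop are assumed):

```python
# Sketch of the `last_plan` cell; fields and loop are simplified
# stand-ins for the real fallback machinery.
from dataclasses import dataclass

@dataclass
class AttemptPlan:
    provider: str
    model: str
    is_backup: bool

def call_model(plan: AttemptPlan) -> str:
    return f"answer via {plan.model}"          # stand-in for the provider call

def run_with_fallback(plans: list[AttemptPlan]) -> tuple[str, AttemptPlan]:
    last_plan: list[AttemptPlan] = []          # one-slot mutable cell

    def get_attempt_plan(i: int) -> AttemptPlan:
        last_plan[:] = [plans[i]]              # record whichever plan runs
        return plans[i]

    for i in range(len(plans)):
        try:
            answer = call_model(get_attempt_plan(i))
            return answer, last_plan[0]        # the plan that actually answered
        except TimeoutError:
            continue
    raise RuntimeError("all attempts failed")
```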

Passes track_name and trace_name through to execute_tool_loop so its
per-tool counter (added in the previous commit) carries the same
feature label as the call-level metrics.

When the tool loop returns response.hit_max_iterations=True, the call's
outcome is overridden to error_max_iterations via mark_max_iterations
so dashboards can split "model didn't converge" from clean success
without the tool-loop having to know about outcome semantics.

Streaming responses don't carry token counts at the entry point —
the recorded call still emits, but the token counters skip those rows
(record_llm_tokens silently no-ops on count <= 0). An acceptable partial
signal until the streaming refactor surfaces tokens earlier.

ruff + basedpyright clean. End-to-end smoke verified all six series
fire correctly across success, success_via_backup, error_max_iterations,
error_timeout, and tool-call paths.
offendingcommit merged commit 937627d into sync/upstream-2026-05-03 on May 4, 2026 (1 check failed).