feat(telemetry): per-LLM-call metrics, structured logs, and tool tracking #5
Merged
offendingcommit merged 5 commits into sync/upstream-2026-05-03 on May 4, 2026
Conversation
Telemetry-only signal: True when the loop exited via the max-iterations synthesis path rather than the model deciding to stop. Distinguishes "model didn't converge" from natural termination so downstream observability can label the two cases differently. No emitter changes — flag is set but no consumer reads it yet.
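In practice this is just a defaulted boolean plus one assignment on the synthesis-fallback path. A minimal sketch, assuming the tool-loop response is a Pydantic model (field and function names below are illustrative, not the actual code):

```python
# Minimal sketch; assumes a Pydantic response model. Names are illustrative.
from pydantic import BaseModel


class ToolLoopResponse(BaseModel):
    content: str
    iterations_used: int = 0
    # Telemetry-only flag: True when the loop hit max iterations and fell back
    # to synthesizing an answer, rather than the model deciding to stop.
    hit_max_iterations: bool = False


def synthesize_fallback(partial_content: str, max_iterations: int) -> ToolLoopResponse:
    # Synthesis path: label the response so downstream observability can
    # distinguish "model didn't converge" from natural termination.
    # No emitter reads the flag yet.
    return ToolLoopResponse(
        content=partial_content,
        iterations_used=max_iterations,
        hit_max_iterations=True,
    )
```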
Adds six new metrics + recorder methods on the existing PrometheusMetrics singleton; no callers yet, so this commit is purely declarative. Series:

- llm_calls / llm_call_duration_seconds — counter + histogram per call, labeled by feature × provider × model × outcome.
- llm_tokens — input/output/cache_read/cache_creation per feature × provider × model.
- llm_tool_calls — per-tool invocation outcome inside the tool loop.
- llm_iterations — histogram of iterations consumed per call/outcome.
- llm_backup_used — counts failovers from primary to backup provider.

Cardinality-bounded: feature × provider × model × outcome ≈ 1.7k series cap. Deliberately no workspace_name label here — these answer "is this model effective for this feature", not "is workspace X slow". LLMCallOutcome enum exported from src.telemetry.prometheus so callers can reference the canonical values without importing from the metrics module directly.
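As a rough sketch of the shape of these declarations (the real recorder methods hang off the PrometheusMetrics singleton with sentry-wrapped error handling; the names, label sets, and buckets below are approximations):

```python
# Approximate shape only; the real series are recorder methods on the existing
# PrometheusMetrics singleton, and exact label names/buckets may differ.
from enum import Enum

from prometheus_client import Counter, Histogram


class LLMCallOutcome(str, Enum):
    SUCCESS = "success"
    SUCCESS_AFTER_RETRY = "success_after_retry"
    SUCCESS_VIA_BACKUP = "success_via_backup"
    ERROR_MAX_ITERATIONS = "error_max_iterations"
    ERROR_TIMEOUT = "error_timeout"
    ERROR_VALIDATION = "error_validation"
    ERROR_OTHER = "error_other"


CALL_LABELS = ["feature", "provider", "model", "outcome"]

llm_calls = Counter(
    "llm_calls", "LLM calls per feature/provider/model/outcome", CALL_LABELS
)
llm_call_duration = Histogram(
    "llm_call_duration_seconds", "Wall-clock duration of a single LLM call", CALL_LABELS
)
llm_tokens = Counter(
    "llm_tokens",
    "Token counts (input/output/cache_read/cache_creation)",
    ["feature", "provider", "model", "token_type"],
)
llm_tool_calls = Counter(
    "llm_tool_calls",
    "Per-tool invocation outcome inside the tool loop",
    ["feature", "provider", "model", "tool", "outcome"],
)
llm_iterations = Histogram(
    "llm_iterations", "Tool-loop iterations consumed per call", CALL_LABELS
)
llm_backup_used = Counter(
    "llm_backup_used",
    "Failovers from primary to backup provider",
    ["feature", "provider", "model"],
)
```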
Introduces src/telemetry/llm_call_metrics.py — a context-manager-based wrapper that turns one LLM call into one set of Prometheus samples and one logfmt log line. Surface:

- observe_llm_call(...) — context manager yielding a mutable _CallState the caller populates over the call's lifetime.
- finalize_success(...) — populate state from a successful response and pick the outcome bucket (success / success_after_retry / success_via_backup).
- mark_max_iterations(...) — flip the state to error_max_iterations when the tool loop exited via the synthesis path.
- normalize_feature_label(...) — maps caller's track_name/trace_name to a low-cardinality Prom label (e.g. "Dreamer/deduction" -> dream_deduction).

No callers wired in yet — this commit is the helper module on its own so the diff stays reviewable. Wiring into honcho_llm_call and the tool loop lands in subsequent commits. Errors raised inside the wrapped call are classified into outcome buckets (timeout / validation / other) and re-raised; the wrapper never swallows or transforms exceptions.
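The wrapper's shape, as a simplified sketch (the real module's signatures, error classification, and logfmt fields are richer; _emit and the ValueError-as-validation mapping here are stand-ins):

```python
# Simplified sketch of the wrapper's shape, not the actual module contents.
import re
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class _CallState:
    feature: str
    provider: str = "unknown"
    model: str = "unknown"
    outcome: str = "error_other"
    input_tokens: int = 0
    output_tokens: int = 0
    iterations: int = 0


_FEATURE_MAP = {"Dreamer/deduction": "dream_deduction"}  # real table is larger


def normalize_feature_label(track_name: str | None, trace_name: str | None) -> str:
    # Map the caller's track/trace name to a low-cardinality Prometheus label.
    raw = track_name or trace_name or "unknown"
    return _FEATURE_MAP.get(raw) or re.sub(r"[^a-z0-9]+", "_", raw.lower()).strip("_") or "unknown"


def finalize_success(state: _CallState, *, retried: bool = False, used_backup: bool = False) -> None:
    # Pick the success bucket based on how the call got there.
    state.outcome = (
        "success_via_backup" if used_backup
        else "success_after_retry" if retried
        else "success"
    )


def mark_max_iterations(state: _CallState) -> None:
    # Tool loop exited via the synthesis path: override the outcome.
    state.outcome = "error_max_iterations"


@contextmanager
def observe_llm_call(*, track_name: str | None = None, trace_name: str | None = None):
    state = _CallState(feature=normalize_feature_label(track_name, trace_name))
    start = time.monotonic()
    try:
        yield state  # caller populates state over the call's lifetime
    except TimeoutError:
        state.outcome = "error_timeout"  # classify, then re-raise untouched
        raise
    except ValueError:
        state.outcome = "error_validation"  # validation errors assumed to surface as ValueError here
        raise
    except Exception:
        state.outcome = "error_other"
        raise
    finally:
        _emit(state, duration_s=time.monotonic() - start)


def _emit(state: _CallState, *, duration_s: float) -> None:
    ...  # one set of Prometheus samples + one `honcho.llm.call ...` logfmt line
```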
Adds prometheus_metrics.record_llm_tool_call() calls in both the success and error branches of execute_tool_loop's per-tool dispatch. Threads track_name / trace_name through the function signature so the emitted metric carries the same feature label that the call-level metrics will use. Both new params default to None (current callers don't pass them yet), so feature label resolves to "unknown" until honcho_llm_call is wired in the next commit. Metric emission is wrapped in PrometheusMetrics' sentry-captured error handler — a metric bug can never break a real tool call.
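The emission pattern, sketched with stand-in callables (the real code calls prometheus_metrics.record_llm_tool_call directly and leans on PrometheusMetrics' sentry-captured handler rather than a local try/except):

```python
# Sketch of the per-tool emission pattern; `run_tool` and `record_tool_call`
# are stand-ins for the real dispatcher and prometheus_metrics.record_llm_tool_call.
from typing import Any, Callable


def dispatch_tool(
    run_tool: Callable[[str, dict[str, Any]], Any],
    record_tool_call: Callable[..., None],
    tool_name: str,
    args: dict[str, Any],
    *,
    feature: str = "unknown",  # stays "unknown" until callers pass track_name/trace_name
) -> Any:
    try:
        result = run_tool(tool_name, args)
    except Exception:
        _safe_record(record_tool_call, feature=feature, tool=tool_name, outcome="error")
        raise
    _safe_record(record_tool_call, feature=feature, tool=tool_name, outcome="success")
    return result


def _safe_record(record_tool_call: Callable[..., None], **labels: str) -> None:
    # A metric bug must never break a real tool call; in the real code this
    # guarantee comes from PrometheusMetrics' sentry-captured error handler.
    try:
        record_tool_call(**labels)
    except Exception:
        pass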
Wraps the body of honcho_llm_call (both tool-less and tool-loop paths) in observe_llm_call(...) so every invocation produces one set of Prometheus samples and one logfmt log line.

Captures the AttemptPlan that produced the most-recent (and, on success, the winning) call via a `last_plan` cell updated inside _get_attempt_plan, so the recorded provider/model is the one that actually answered — primary on early attempts, backup on the final retry. This makes backup-on-final-attempt observable directly from llm_calls / llm_tokens without parsing logs.

Passes track_name and trace_name through to execute_tool_loop so its per-tool counter (added in the previous commit) carries the same feature label as the call-level metrics. When the tool loop returns response.hit_max_iterations=True, the call's outcome is overridden to error_max_iterations via mark_max_iterations so dashboards can split "model didn't converge" from clean success without the tool loop having to know about outcome semantics.

Streaming responses don't carry token counts at the entry point — the recorded call still emits, but token counters skip those rows (record_llm_tokens silently no-ops on count<=0). Acceptable partial signal until the streaming refactor surfaces tokens earlier.

ruff + basedpyright clean. End-to-end smoke verified all six series fire correctly across success, success_via_backup, error_max_iterations, error_timeout, and tool-call paths.
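The wiring pattern, reusing the names from the observe_llm_call sketch above (AttemptPlan's fields and the attempt driver are illustrative stand-ins for the real honcho_llm_call internals):

```python
# Illustrative wiring sketch; reuses observe_llm_call / finalize_success /
# mark_max_iterations from the sketch above. AttemptPlan and the attempt driver
# are stand-ins, not the real honcho_llm_call code.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class AttemptPlan:
    provider: str
    model: str
    is_backup: bool = False


def honcho_llm_call_sketch(
    prompt: str,
    plans: list[AttemptPlan],
    attempt_call: Callable[[str, AttemptPlan], Any],
    *,
    track_name: str | None = None,
    trace_name: str | None = None,
) -> Any:
    last_plan: list[AttemptPlan] = []  # cell updated every time a plan is chosen

    def _get_attempt_plan(attempt: int) -> AttemptPlan:
        plan = plans[min(attempt, len(plans) - 1)]  # backup only on the final retry
        last_plan[:] = [plan]  # on success this is the plan that actually answered
        return plan

    with observe_llm_call(track_name=track_name, trace_name=trace_name) as state:
        response: Any = None
        for attempt in range(len(plans)):
            plan = _get_attempt_plan(attempt)
            try:
                response = attempt_call(prompt, plan)
                break
            except Exception:
                if attempt == len(plans) - 1:
                    raise  # the wrapper classifies and re-raises; nothing is swallowed
        # Record the provider/model that actually answered (primary or backup).
        state.provider, state.model = last_plan[0].provider, last_plan[0].model
        finalize_success(state, retried=attempt > 0, used_backup=last_plan[0].is_backup)
        if getattr(response, "hit_max_iterations", False):
            mark_max_iterations(state)  # split "didn't converge" from clean success
        return response
```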
Summary
Adds observability around every `honcho_llm_call` invocation so we can answer "which model is effective for which feature" directly from Prometheus, without bouncing through Langfuse or parsing logs.

Motivated by ops pain in our k8s deploy: dream cycles on `glm-5.1:cloud` were averaging ~22 minutes wall time with frequent 10/12-iteration tool-loop saturation and silent timeouts, and we had no metric that distinguished "model didn't converge" from "infra broke", or that revealed when a call had silently failed over to the Gemini backup.

What changes

Six new Prometheus series (cardinality-bounded; no `workspace_name` label since these answer model effectiveness, not per-tenant slowness):

- `llm_calls` (counter)
- `llm_call_duration_seconds` (histogram)
- `llm_tokens` (counter) — the existing `*_tokens_processed` counters aren't model-labeled
- `llm_tool_calls` (counter)
- `llm_iterations` (histogram) — makes tool-loop saturation visible (the `glm-5.1` issue)
- `llm_backup_used` (counter)

Plus one logfmt log line per call (`honcho.llm.call feature=… provider=… model=… latency_ms=… outcome=… …`) for Loki / `kubectl logs` filtering.

Outcome taxonomy splits "model didn't converge" (`error_max_iterations`) from "infra broke" (`error_timeout` / `error_validation` / `error_other`); `success_via_backup` and `success_after_retry` are their own buckets so the silent reliability tax is observable.

Implementation
Five atomic commits, each leaving the tree green (lint + typecheck + LLM test suite pass):
1. `feat(llm): add hit_max_iterations flag on tool-loop response` — pure data-model addition; sets the flag on the synthesis-fallback path so downstream observability can label that case distinctly.
2. `feat(telemetry): declare per-LLM-call Prometheus series` — declares 6 new series + recorder methods on the existing `PrometheusMetrics` singleton (sentry-wrapped error handling). Pure declarations, no callers yet.
3. `feat(telemetry): add observe_llm_call helper` — context-manager-based wrapper (`src/telemetry/llm_call_metrics.py`); helpers for finalize/max-iter; feature-label normalizer (`"Dreamer/deduction"` → `dream_deduction`).
4. `feat(llm): emit per-tool-call metrics inside the tool execution loop` — threads `track_name`/`trace_name` through `execute_tool_loop`'s signature (default `None`) and emits the per-tool counter in success + error branches.
5. `feat(llm): wire observe_llm_call into honcho_llm_call` — wraps the call body; captures the winning `AttemptPlan` via a `last_plan` cell so primary-vs-backup is recorded correctly.

Cardinality budget: feature × provider × model × outcome ≈ 1.7k series cap.
Sample queries
Test plan
- `uv run ruff check src/` — clean
- `uv run basedpyright src/llm src/telemetry` — 0 errors / 0 warnings
- `uv run pytest tests/llm/` — 79 pass
- End-to-end smoke: all six series fire across `success`, `success_via_backup`, `error_max_iterations`, `error_timeout`, and tool-call paths
- Logfmt `honcho.llm.call` line verified in `kubectl logs` output

Targeting
Based on `sync/upstream-2026-05-03` (the in-flight upstream sync, PR #4) so this PR's diff is just the 5 metrics commits. GitHub will auto-rebase to `main` once #4 merges.

Not in scope

- `infra/honcho/values.yaml` still pins `tag: latest`, so ArgoCD will pick up the new image once it's pushed to GHCR.