Today the only LLM-call observability is _write_llm_log (llmcore.py:790), which appends the raw prompt/response to temp/model_responses/*.txt — fine for quick debugging, but there is no way to group calls by task, replay bad cases, or compare model configs on the same run. I'd like to contribute an opt-in Langfuse integration: each agent_runner_loop call becomes a trace, each chat() call becomes a generation (with token usage), and each tool dispatch becomes a tool span.
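To make the intended hierarchy concrete, here is a minimal sketch of how one agent run would map onto a trace tree (plain dicts stand in for Langfuse objects; the field names are illustrative, only agent_runner_loop, chat(), and tool dispatch come from this proposal):

```python
# Illustrative only: the shape of one agent run as a Langfuse trace tree.
# Dicts stand in for SDK objects; keys here are hypothetical.

def build_trace_sketch():
    return {
        "trace": "agent_runner_loop",  # one trace per top-level agent run
        "observations": [
            {
                "type": "generation",  # one generation per chat() call
                "name": "chat",
                "usage": {"input_tokens": 1200, "output_tokens": 300},
            },
            {
                "type": "span",        # one span per tool dispatch
                "name": "tool:read_file",
            },
        ],
    }
```

Token usage attaches to each generation, so per-task cost aggregation falls out of the trace structure itself.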
Why this helps
Cost visibility. Aggregate token usage per task / model / time range — see which tasks burn the most tokens and which model gets the best cache-hit rate.
Bad-case replay. Expand a trace tree in the UI with full context (messages, tool calls, outputs) instead of grepping 10 MB text logs.
Dataset curation. Successful traces export directly to eval sets or fine-tuning corpora — a natural data source for the agent's self-evolution loop.
Model/prompt comparison. Run the same prompt across GLM / MiniMax / Claude and diff the traces side-by-side.
Tool-level analytics. Per-tool failure rate, average latency, call frequency — direct input for evolving L2/L3 memory and Skills.
Collaboration. Paste a trace URL into chat instead of copy-pasting a whole conversation.
Design boundaries
Zero intrusion when disabled. If mykey.py has no langfuse_config, langfuse is never imported — no new required dependency, no behavior change.
Minimal change surface. A few hook points in llmcore.py and agent_loop.py only. No new directory, no new module, no OpenTelemetry abstraction layer.
Failure isolation. Any Langfuse-side exception is swallowed; it never propagates to the agent loop.
Subagents run as separate processes; v1 treats each as its own top-level trace (no cross-process parent linking). Happy to add that linking as a follow-up.
Verified locally in both the enabled and disabled states. If this direction is acceptable, I'll open the PR with UI screenshots.