Summary
Extend the Token Cost Observatory (token-metrics.sh, token_report.sh) to capture LSP-specific metrics: tool-call counts by type (LSP navigation vs grep/read), tokens consumed per navigation approach, and a "grounding ratio" (findings verified by LSP / total findings). This provides the data needed for a go/no-go decision on fleet-wide LSP rollout and for ongoing cost optimization, so that the LSP pilot is measured, not just deployed.
Market Signal
ManoMano's Project AEGIS benchmark showed Serena-equipped agents used 4 subagents vs 12 for vanilla Claude on the same task — but actual token costs were similar ($27.30 vs $23.54). The nuance matters: raw token count alone does not capture quality. A cheaper run that fails is worse than a slightly more expensive run that succeeds. The industry is moving toward quality-adjusted cost metrics (tokens per correct finding, cost per verified fix) rather than raw token counts. MCPBench and MCPAgentBench both evaluate tool-use efficiency, not just task success — the benchmarking discipline is maturing.
User Signal
Discussion #578 explicitly recommends measuring "tokens/run, tool-call count, review precision — pilot vs control." The Token Cost Observatory (#332, #464) already captures per-call JSONL with workflow/tier/model/input/output/cache tokens. The ET (Effective Token) formula weights output 4x input — LSP's benefit is reducing output (fewer false findings to write) and reducing input (targeted navigation vs loading whole files into context).
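To make the weighting concrete, here is a minimal sketch of the ET calculation, assuming the formula is simply input plus four times output. How the Observatory treats cache tokens is not stated here, so they are left out; the function name and the token counts are hypothetical and chosen only for illustration.

```python
OUTPUT_WEIGHT = 4  # from the discussion: output is weighted 4x input


def effective_tokens(input_tokens: int, output_tokens: int) -> int:
    """Weighted token cost: input counted once, output counted four times.

    Assumption: cache tokens are ignored, since their weighting is not
    described in this discussion.
    """
    return input_tokens + OUTPUT_WEIGHT * output_tokens


# Hypothetical runs: targeted navigation loads less context (input) and
# writes fewer false findings (output).
baseline = effective_tokens(input_tokens=40_000, output_tokens=3_000)
with_lsp = effective_tokens(input_tokens=25_000, output_tokens=2_000)
print(baseline, with_lsp)
```

Because output carries the larger weight, a reduction of 1,000 output tokens moves ET as much as a reduction of 4,000 input tokens.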
Technical Opportunity
In token-metrics.sh, emit_token_record() already logs per-call JSONL. The proposed extensions:
mcp_tool_calls: optional JSON object mapping tool names to call counts (e.g., {"lsp_find_references": 4, "grep": 2})
lsp_enabled: boolean flag indicating whether the run had MCP/LSP tools available
grounding_ratio: float (verified findings / total findings), computed by the LSP verification step (Discussion #594)
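The shape of an extended record might look like the sketch below. It is written in Python purely for illustration and is not the shell implementation of emit_token_record(). The base key names (workflow, tier, model, input_tokens, output_tokens) and the helper extend_record are assumptions; only mcp_tool_calls, lsp_enabled, and grounding_ratio come from the proposal.

```python
import json
from typing import Optional


def extend_record(base: dict,
                  mcp_tool_calls: Optional[dict] = None,
                  lsp_enabled: Optional[bool] = None,
                  verified_findings: Optional[int] = None,
                  total_findings: Optional[int] = None) -> str:
    """Return one JSONL line; optional fields appear only when supplied."""
    record = dict(base)  # copy so the caller's dict is not mutated
    if mcp_tool_calls is not None:
        record["mcp_tool_calls"] = mcp_tool_calls
    if lsp_enabled is not None:
        record["lsp_enabled"] = lsp_enabled
    # The ratio is undefined when there are no findings, so omit it then.
    if verified_findings is not None and total_findings:
        record["grounding_ratio"] = round(verified_findings / total_findings, 3)
    return json.dumps(record, separators=(",", ":"))


# Hypothetical base record; key names are illustrative only.
base = {"workflow": "review", "tier": "standard", "model": "m1",
        "input_tokens": 1000, "output_tokens": 250}
print(extend_record(base, {"lsp_find_references": 4, "grep": 2}, True, 3, 4))
```

Keeping every new field optional means records written before the pilot stay valid, and readers of the JSONL can treat a missing lsp_enabled as "not enabled".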
token_report.sh's render_* functions gain a new section: "LSP Efficiency" comparing:
LSP-enabled vs LSP-disabled runs (same tier, same model)
Tool-call distribution (navigation tools vs raw file reads)
ET per verified finding
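The three comparisons listed above could be computed roughly as follows. This is a sketch under stated assumptions rather than the token_report.sh implementation: the function lsp_efficiency, the key names, and a per-record verified_findings count are all hypothetical (the proposal only defines grounding_ratio, so a raw count would have to come from the verification step).

```python
from collections import Counter, defaultdict

OUTPUT_WEIGHT = 4  # output weighted 4x input, per the ET formula


def lsp_efficiency(records):
    """Aggregate records per (tier, model, lsp_enabled) bucket."""
    buckets = defaultdict(
        lambda: {"runs": 0, "et": 0, "verified": 0, "tools": Counter()}
    )
    for rec in records:
        # A missing lsp_enabled field is treated as a non-LSP run.
        key = (rec["tier"], rec["model"], bool(rec.get("lsp_enabled", False)))
        b = buckets[key]
        b["runs"] += 1
        b["et"] += rec["input_tokens"] + OUTPUT_WEIGHT * rec["output_tokens"]
        b["verified"] += rec.get("verified_findings", 0)
        b["tools"].update(rec.get("mcp_tool_calls", {}))

    report = {}
    for key, b in buckets.items():
        report[key] = {
            "runs": b["runs"],
            "tool_calls": dict(b["tools"]),
            # None when nothing was verified, to avoid dividing by zero.
            "et_per_verified": b["et"] / b["verified"] if b["verified"] else None,
        }
    return report


# Hypothetical records for one tier and model, with and without LSP.
records = [
    {"tier": "standard", "model": "m1", "lsp_enabled": True,
     "input_tokens": 1000, "output_tokens": 250, "verified_findings": 4,
     "mcp_tool_calls": {"lsp_find_references": 4, "grep": 2}},
    {"tier": "standard", "model": "m1",
     "input_tokens": 3000, "output_tokens": 500, "verified_findings": 2,
     "mcp_tool_calls": {"grep": 6}},
]
for key, row in lsp_efficiency(records).items():
    print(key, row)
```

Bucketing by tier and model before splitting on lsp_enabled keeps the comparison like-for-like, which is what "same tier, same model" requires.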
The existing per-repo breakdown in token_report.sh already supports pilot-vs-control comparison: run the pilot on one repo (.github-private) while the fleet continues without LSP, then compare the Observatory reports side-by-side.
Assessment
| Dimension | Score | Rationale |
| --- | --- | --- |
| Feasibility | high | Extends existing JSONL schema + rendering functions with optional fields |
| Impact | med | Enables data-driven go/no-go for fleet-wide LSP; prevents rollout based on vendor narratives alone |
| Urgency | med | Should be ready before the LSP pilot begins so day-1 data is captured |
Adversarial Review
Strongest objection: Adding per-tool-call metrics increases JSONL artifact size and report complexity. The grounding ratio requires parsing agent output to count findings, which is fragile. Comparing LSP vs non-LSP runs requires A/B capability that does not exist in the current pipeline.
Rebuttal: JSONL growth is minimal (one optional JSON object field per record — ~100 bytes). The grounding ratio is computed from the verification step's structured output, not from parsing free-text agent responses. A/B comparison requires no new infrastructure: run the pilot on one repo while the fleet continues without LSP, then compare the Observatory's per-repo breakdown. The existing annotate_records() function can filter by lsp_enabled to split the data.
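The size claim is easy to check against real records. The sketch below measures how many bytes an optional field adds to one compact JSONL line; the helper field_overhead_bytes and the base record are hypothetical, and the tool-call object is the example given in the proposal.

```python
import json


def field_overhead_bytes(base: dict, extra: dict) -> int:
    """Bytes added to one compact JSONL line by merging `extra` into `base`."""
    def size(d: dict) -> int:
        return len(json.dumps(d, separators=(",", ":")).encode("utf-8"))
    return size({**base, **extra}) - size(base)


# Hypothetical base record; the extra field is the example from the proposal.
base = {"workflow": "review", "input_tokens": 1000, "output_tokens": 250}
extra = {"mcp_tool_calls": {"lsp_find_references": 4, "grep": 2}}
print(field_overhead_bytes(base, extra))
```

Overhead grows with the number of distinct tool names per record, so runs that touch many tools will sit above this two-tool example.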
Suggested Next Step
Add optional mcp_tool_calls and lsp_enabled fields to emit_token_record() in token-metrics.sh. Update token_report.sh to render an "MCP Tool Usage" section when MCP data is present. Define the grounding ratio metric schema and add it to the weekly report template. Target: metrics infrastructure ready before LSP pilot begins.
🤖 Proposed by Mary (BMAD Strategic Business Analyst) · companion to Discussion #578