Measured Serena on two self-hosted models with deterministic gates: no effect on a strong model, but it rescued a weak one on the dangerous task #1573
-
Your benchmark is not at all representative of a real-world situation.
-
You are right, and the rename case is the weakest place to look for token savings: with files this small there is nothing to avoid reading, so the harness structurally cannot show the win you describe. I have added an update to the writeup saying exactly that and crediting your point: https://sovgrid.org/blog/serena-local-benchmark/

To be precise about what this benchmark does and does not claim: it measures correctness under deterministic gates (typecheck plus an actual post-rename verification, no LLM judging), not the token-reduction claim. The token deltas in the tables are a property of these specific small tasks, not a verdict on Serena's economics at scale. Where it did find a signal was behavioral, not economic: the strong model gained nothing, but the weak model was rescued on the ambiguous rename (the one where a global text replace clobbers an unrelated Logger.save).

The token-savings claim needs the opposite test bed: a real repo with large, cross-referenced files, measuring input tokens with and without the symbol tooling on the same operation. That is a fair follow-up and I will run it as a separate test rather than fold it into these numbers. If you have a repo or operation you consider a good showcase, I will use it so the comparison is on Serena's home ground.
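For the follow-up, the comparison would reduce to a relative delta in input tokens between the two arms on the same operation. A minimal sketch of that arithmetic, using the ts-rename figures from the original post (75.6k baseline, 195.2k with Serena); the function name is illustrative and not part of the harness:

```python
def relative_overhead(baseline_tokens: float, tooled_tokens: float) -> float:
    """Percent change in input tokens when the tooling is enabled.

    Positive means the tooled arm consumed more input tokens than baseline;
    negative would mean the tooling saved tokens.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (tooled_tokens - baseline_tokens) / baseline_tokens * 100.0


# ts-rename, N=5: mean input tokens per arm as reported in the original post.
ts_rename = relative_overhead(75_600, 195_200)
print(f"ts-rename: {ts_rename:+.0f}% input tokens")
```

On a large-repo test bed the same calculation would be expected to go negative if the symbol tooling really does avoid reading whole files; that is the claim the separate test would check.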
-
Independent measurement, N=3-5 per cell, deterministic pass/fail gates (typecheck + actual rename verification, no LLM judging), opencode headless on a DGX Spark. Models: Qwen3.6-35B (vLLM) and Mistral-Small-4 (SGLang). Baseline arm = native grep/read/edit, serena arm = same plus Serena MCP (ide-assistant context).

The headline result you may want to know about: on an ambiguous rename (rename UserRepository.save, leave an unrelated Logger.save alone), the weak model (Mistral-Small-4) with native tools failed 3 of 3, every time doing a global rename that clobbered the unrelated symbol across 8 files, and the broken result passed typecheck and lint. With Serena it went to 1 of 3 correct and the mean damage dropped from 8 files to 1.7. Serena acted as a semantic guardrail exactly where plain text replacement is confidently wrong.

The flip side, equally measured: the strong model never needed it. Qwen3.6 passed everything in both arms with identical surgical diffs, and Serena's tool schemas added +15% to +158% input tokens depending on task (ts-rename N=5: 75.6k baseline vs 195.2k with Serena). If schema size could shrink, the tax on capable models would drop and the guardrail story would be close to free.
Verdict we published: situational, guardrail for weak models, overhead for strong ones. Full method and raw runs: harness https://github.com/cipherfoxie/agent-bench, writeup https://sovgrid.org/blog/serena-local-benchmark/
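To make the ambiguous-rename failure mode concrete, here is a minimal sketch of why a global text replace passes a typecheck yet fails a post-rename verification gate. It is not the agent-bench harness: the three-file fixture, the target name persist, and both function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical fixture: two classes that both expose a method named `save`.
FIXTURE = {
    "user_repository.ts": "export class UserRepository {\n  save(u: User) {}\n}\n",
    "logger.ts": "export class Logger {\n  save(line: string) {}\n}\n",
    "app.ts": "repo.save(user);\nlogger.save('done');\n",
}


def global_text_rename(files, old, new):
    """Rename every whole-word occurrence of `old`, ignoring which symbol it is."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return {path: pattern.sub(new, text) for path, text in files.items()}


def rename_gate(after, must_contain, must_not_contain):
    """Deterministic check: return a list of (path, reason) failures."""
    failures = []
    for path, needle in must_contain:
        if needle not in after[path]:
            failures.append((path, f"missing {needle!r}"))
    for path, needle in must_not_contain:
        if needle in after[path]:
            failures.append((path, f"unexpected {needle!r}"))
    return failures


after = global_text_rename(FIXTURE, "save", "persist")
failures = rename_gate(
    after,
    must_contain=[
        ("user_repository.ts", "persist(u"),  # target method was renamed
        ("app.ts", "repo.persist("),          # target call site was renamed
        ("logger.ts", "save(line"),           # unrelated method must survive
        ("app.ts", "logger.save("),           # unrelated call site must survive
    ],
    must_not_contain=[("app.ts", "repo.save(")],
)
damaged = sorted({path for path, _ in failures})
print(f"gate failures: {len(failures)}, damaged files: {damaged}")
```

Because the naive rename changes the Logger declaration and its call site together, the result stays internally consistent, which is why a typecheck alone does not catch it and an explicit verification of the untouched symbol is needed.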