Measured Serena on two self-hosted models with deterministic gates: no effect on a strong model, but it rescued a weak one on the dangerous task #1573

cipherfoxie · 2026-06-12T18:40:57Z

Independent measurement: N=3-5 runs per cell, deterministic pass/fail gates (typecheck plus an actual rename verification, no LLM judging), run with opencode headless on a DGX Spark. Models: Qwen3.6-35B (vLLM) and Mistral-Small-4 (SGLang). Baseline arm = native grep/read/edit; Serena arm = the same plus Serena MCP (ide-assistant context).

The headline result you may want to know about: on an ambiguous rename (rename UserRepository.save, leave an unrelated Logger.save alone), the weak model (Mistral-Small-4) with native tools failed 3 of 3 runs. Every time it did a global rename that clobbered the unrelated symbol across 8 files, and the broken result still passed typecheck and lint. With Serena it got 1 of 3 correct, and the mean damage dropped from 8 files to 1.7. Serena acted as a semantic guardrail exactly where plain text replacement is confidently wrong.
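A minimal sketch of that failure mode, using hypothetical TypeScript fixtures held as Python strings (these are not the actual agent-bench files, and `persist` is an invented target name). A text-level rename of `save` rewrites `Logger.save` as well, and because every definition and call site changes consistently, nothing is left for a typechecker to flag.

```python
import re

# Hypothetical fixtures for illustration only, not the real benchmark project.
FILES = {
    "user_repository.ts": "export class UserRepository {\n  save(u: string): void {}\n}\n",
    "logger.ts": "export class Logger {\n  save(line: string): void {}\n}\n",
    "app.ts": (
        "const repo = new UserRepository();\n"
        "const log = new Logger();\n"
        "repo.save('alice');\n"
        "log.save('saved alice');\n"
    ),
}


def global_rename(files, old, new):
    """Text-level rename: rewrites every identifier named `old`,
    regardless of which class it belongs to."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return {path: pattern.sub(new, text) for path, text in files.items()}


def changed_files(before, after):
    """Paths whose content differs between the two snapshots."""
    return sorted(path for path in before if before[path] != after[path])


renamed = global_rename(FILES, "save", "persist")
print(changed_files(FILES, renamed))
# The unrelated Logger method was renamed too:
print("Logger.save survived:", "save(line" in renamed["logger.ts"])
```

A reference-aware rename would instead resolve each `.save` call to its declaring class and touch only the ones bound to `UserRepository`, which is the step a plain regex cannot do.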

The flip side, equally measured: the strong model never needed it. Qwen3.6 passed everything in both arms with identical surgical diffs, while Serena's tool schemas added +15% to +158% input tokens depending on the task (ts-rename, N=5: 75.6k baseline vs 195.2k with Serena). If the schemas could be made smaller, the tax on capable models would drop and the guardrail would be close to free.
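For reference, the +158% figure follows directly from the two ts-rename token counts quoted above. A small sketch of the arithmetic (the function name is just illustrative):

```python
def relative_overhead(baseline_tokens: float, with_tool_tokens: float) -> float:
    """Extra input tokens as a percentage of the baseline."""
    return (with_tool_tokens - baseline_tokens) / baseline_tokens * 100.0


# ts-rename, N=5: 75.6k baseline vs 195.2k with Serena (values in thousands of tokens)
overhead = relative_overhead(75.6, 195.2)
print(f"{overhead:+.0f}%")  # +158%
```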

Verdict we published: situational. A guardrail for weak models, overhead for strong ones. Full method and raw runs: harness at https://github.com/cipherfoxie/agent-bench, writeup at https://sovgrid.org/blog/serena-local-benchmark/

opcode81 · 2026-06-12T19:16:13Z
Maintainer

Your benchmark is not at all representative of a real-world situation.
You are testing a single operation on a tiny project comprising just a handful of extremely small files (at most 6 lines per file).
Serena will really shine in a real project - the more complex, the greater the potential for savings.
For the rename operation you tested, an agent using Serena does not even have to read any of the files in which the method being renamed actually appears, so the larger the files are, the more this matters.

cipherfoxie · 2026-06-13T08:27:14Z
Author

You are right, and the rename case is the weakest place to look for token savings: with files this small there is nothing to avoid reading, so the harness structurally cannot show the win you describe. I have added an update to the writeup saying exactly that and crediting your point: https://sovgrid.org/blog/serena-local-benchmark/

To be precise about what this benchmark does and does not claim: it measures correctness under deterministic gates (typecheck plus an actual post-rename verification, no LLM judging), not the token-reduction claim. The token deltas in the tables are a property of these specific small tasks, not a verdict on Serena's economics at scale.
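To make "post-rename verification" concrete, here is a rough sketch of what such a gate could look like. It is not the code from the agent-bench harness: the function names, the regex-based method extraction, and the `persist` target name are all assumptions for illustration, and the typecheck step (for example running `tsc --noEmit`) is not shown.

```python
import re


def method_names(class_source: str) -> set[str]:
    """Rough extraction of names followed by '(' at the start of an indented line.
    Sketch only: a real gate would use a TypeScript parser."""
    return set(re.findall(r"^[ \t]+(\w+)\s*\(", class_source, flags=re.MULTILINE))


def rename_gate(target_src: str, bystander_src: str, old: str, new: str) -> bool:
    """Pass only if the target class was renamed AND the unrelated class was left alone."""
    target = method_names(target_src)
    bystander = method_names(bystander_src)
    return (
        new in target
        and old not in target
        and old in bystander
        and new not in bystander
    )


# Hypothetical post-edit snapshots of the two classes.
repo_ok = "export class UserRepository {\n  persist(u: string): void {}\n}\n"
logger_ok = "export class Logger {\n  save(line: string): void {}\n}\n"
logger_clobbered = "export class Logger {\n  persist(line: string): void {}\n}\n"

print(rename_gate(repo_ok, logger_ok, "save", "persist"))         # surgical rename
print(rename_gate(repo_ok, logger_clobbered, "save", "persist"))  # global rename
```

The point of a check like this is that it fails the clobbering rename even though that result typechecks, which is why the gate cannot rely on the typechecker alone.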

Where it did find a signal was behavioral, not economic: the strong model gained nothing, but the weak model was rescued on the ambiguous rename (the one where a global text replace clobbers an unrelated Logger.save). That is exactly where a symbolic find-references step should matter most.

The token-savings claim needs the opposite test bed: a real repo with large, cross-referenced files, measuring input tokens with and without the symbol tooling on the same operation. That is a fair follow-up and I will run it as a separate test rather than fold it into these numbers. If you have a repo or operation you consider a good showcase, I will use it so the comparison is on Serena's home ground.

2 replies

MischaPanch Jun 13, 2026
Maintainer

We have been working on benchmarking internally for the last 3 weeks and will soon publish the first results.

MischaPanch Jun 13, 2026
Maintainer

Happy to continue the conversation after that. Then you can inspect our methodology, test the cases yourself if you want to, and contribute benchmark cases if you are interested :)
