paper "Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents" #192

tfriedel · 2026-06-06T14:29:39Z


relevant paper:
https://arxiv.org/abs/2602.07900

found via this discussion:
https://news.ycombinator.com/item?id=48419659

I was a big proponent of encoding the TDD red-green-refactor methodology into my agent workflows until recently, when I came to the same realization after reading this study: https://arxiv.org/pdf/2602.07900

TL;DR: it found that test-writing volume only weakly correlates with success, and that encoding test-writing principles did not move resolution rates but did materially change cost. Encouraging tests cost 19.8% more output tokens for 0% gain; discouraging them saved 33–49% of input tokens for an accuracy loss of at most 2.6 pp. Separately, imposing the TDD procedure specifically seems like it can backfire: it increased regressions from 6.08% to 9.94%.
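To put the regression numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the two percentages quoted above; the variable names are mine, and nothing here comes from the paper beyond those two figures.

```python
# Regression rates quoted in the summary above (percent of runs).
baseline_regressions = 6.08
tdd_regressions = 9.94

# Absolute change in percentage points.
absolute_pp = tdd_regressions - baseline_regressions

# Relative change compared with the baseline rate.
relative_pct = absolute_pp / baseline_regressions * 100

print(f"absolute: +{absolute_pp:.2f} pp")   # absolute: +3.86 pp
print(f"relative: +{relative_pct:.1f}%")    # relative: +63.5%
```

So the jump is under 4 percentage points in absolute terms, but it is roughly a 63% relative increase in regressions.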

IMO, where tests clearly help is primarily as an "oracle" applied after generation. They give the model a signal that lets it verify its output and self-correct if necessary.
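A minimal sketch of what "tests as an oracle after generation" could look like, as opposed to asking the agent to write tests first. Everything here is hypothetical and for illustration only: `repair_with_oracle`, `toy_generate`, and `toy_oracle` are made-up names, the generator is a stand-in for a model call, and the oracle is a stand-in for running an existing test suite. It is not the setup used in the paper.

```python
from typing import Callable, Optional


def repair_with_oracle(
    generate: Callable[[Optional[str]], str],
    oracle: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Generate a candidate, check it with the oracle, retry with its feedback.

    `oracle` returns None when the candidate passes, else a failure message.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(feedback)
        feedback = oracle(candidate)
        if feedback is None:
            return candidate
    return None  # no candidate passed within the attempt budget


def toy_generate(feedback: Optional[str]) -> str:
    # Hypothetical stand-in for a model call: the first attempt is buggy,
    # and any attempt that receives feedback returns the corrected version.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_oracle(source: str) -> Optional[str]:
    # Hypothetical stand-in for a pre-existing test suite.
    namespace: dict = {}
    exec(source, namespace)  # acceptable for a toy; do not exec untrusted code
    result = namespace["add"](2, 3)
    if result != 5:
        return f"add(2, 3) returned {result}, expected 5"
    return None


patch = repair_with_oracle(toy_generate, toy_oracle)
print(patch)
```

The point of the shape is that the tests sit outside the generation step: the model is not asked to author them, it only gets their pass/fail signal back.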
