You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I was a big proponent of encoding TDD red-green-refactor methodology into my agent workflows until recently when I made the same realization after reading this study: https://arxiv.org/pdf/2602.07900
TLDR; it found test-writing volume only weakly correlates with success and that encoding test-writing principles did not move resolution rates but did materially change cost. Encouraging tests cost +19.8% output tokens for 0% gain; discouraging them saved 33–49% input tokens for ≤2.6pp accuracy loss. Separately, imposing the TDD procedure specifically seems like it can backfire: it actually increased regressions from 6.08% to 9.94%.
IMO, where tests clearly help is primarily as an "oracle" applied after generation. It gives the models a signal that enables them to verify and self-correct if necessary.
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
Uh oh!
There was an error while loading. Please reload this page.
-
relevant paper:
https://arxiv.org/abs/2602.07900
found via this discussion:
https://news.ycombinator.com/item?id=48419659
Beta Was this translation helpful? Give feedback.
All reactions