How are you handling flaky LLM tests? #28

hoainho · 2026-06-01T12:40:44Z

hoainho
Jun 1, 2026
Maintainer

LLM tests are flaky in a way that classical unit tests aren't: same input, different output, no bug. How are you dealing with it?

eval-harness ships a 3-sample byte-identical stability check (it tags a case flaky: true when the samples diverge), but there's no industry consensus on what to do with a flaky result. Options I've seen:

Re-run with temperature=0 + retry-until-stable
3-sample majority vote (what llm_judge does)
5-sample pass@k (issue #21, "Add stochastic pass@k mode (T>0, multiple samples per case)")
Quarantine flaky cases out of CI gating (still record, don't block)
Just accept the flake and widen expect_regex
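
For anyone comparing the first two ideas, here is a minimal sketch of a byte-identical stability check and a majority vote. It is not the eval-harness or llm_judge implementation: `stability_check`, `majority_vote`, and the `generate` callable are hypothetical names, and the model call is assumed to be wrapped in a zero-argument function.

```python
from typing import Callable, List, Tuple


def stability_check(generate: Callable[[], str], samples: int = 3) -> Tuple[bool, List[bytes]]:
    """Call generate() `samples` times and compare the outputs byte for byte.

    Returns (flaky, outputs); flaky is True when any two samples differ.
    `generate` is a hypothetical wrapper around the model call.
    """
    outputs = [generate().encode("utf-8") for _ in range(samples)]
    flaky = len(set(outputs)) > 1
    return flaky, outputs


def majority_vote(verdicts: List[bool]) -> bool:
    """Pass only when strictly more than half of the verdicts are True."""
    return sum(verdicts) * 2 > len(verdicts)
```

With an odd sample count such as 3, the vote cannot tie, which is probably why 3 is a common choice.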

What's working for you? What's the cost in API spend?

sueun-dev · 2026-06-04T04:58:28Z

sueun-dev
Jun 4, 2026

For CI gating, I’d split “flake signal” from “release decision”. Keep temperature=0 plus deterministic checks as the blocking path. When a prose/LLM case flakes, don’t widen the regex first; record it as flaky and rerun it outside the blocking path with a small fixed budget.
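
A rough sketch of that split, assuming each case result carries a `passed` and a `flaky` flag. `CaseResult`, `split_gate`, `rerun_with_budget`, and the budget of 3 are made up for illustration and are not part of any harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class CaseResult:
    name: str
    passed: bool
    flaky: bool


def split_gate(results: List[CaseResult]) -> Tuple[List[str], List[str]]:
    """Separate the release decision from the flake signal.

    Returns (blocking_failures, rerun_queue). Only stable failures block;
    flaky cases are recorded for a rerun outside the blocking path.
    """
    blocking_failures = [r.name for r in results if not r.passed and not r.flaky]
    rerun_queue = [r.name for r in results if r.flaky]
    return blocking_failures, rerun_queue


def rerun_with_budget(run_case: Callable[[str], bool], names: List[str], budget: int = 3) -> Dict[str, float]:
    """Rerun each queued case exactly `budget` times and report its pass rate."""
    return {name: sum(run_case(name) for _ in range(budget)) / budget for name in names}
```

The fixed budget is the point: the rerun job costs at most `budget` calls per flaky case, so spend stays predictable.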

For intentionally stochastic behavior, pass@k is cleaner than retry-until-stable because it admits the test is probabilistic. I’d probably keep smoke at 3 samples, use 5 only for promoted/full cases, and avoid nesting 5 spawns x 3 judge samples unless the case is explicitly marked expensive. That cost gets weird fast.
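
To put numbers on that, here is a small sketch. `pass_at_k` uses the commonly cited unbiased estimator 1 - C(n-c, k) / C(n, k), which may not be the definition issue #21 ends up with. `calls_per_case` assumes every spawn is scored by every judge sample, which is my reading of "5 spawns x 3 judge samples" and not something confirmed here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes,
    given that c out of n recorded samples passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def calls_per_case(spawns: int, judge_samples: int = 0) -> int:
    """Model calls for one case: one generation per spawn, plus
    `judge_samples` judge calls per spawn (an assumption, see above)."""
    return spawns * (1 + judge_samples)
```

Under those assumptions, a 3-sample smoke case costs `calls_per_case(3)` = 3 calls, while 5 spawns with a 3-sample judge costs `calls_per_case(5, 3)` = 20 calls for a single case.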

So my default would be: quarantine from merge gating, trend it, then promote only cases that stay stable.
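
One way the "trend it, then promote" step could look, as a sketch only. `update_quarantine` and the `promote_after=5` threshold are hypothetical; the history is assumed to be a per-case list of flaky flags, oldest first.

```python
from typing import Dict, List, Set


def update_quarantine(
    history: Dict[str, List[bool]],
    quarantined: Set[str],
    promote_after: int = 5,
) -> Set[str]:
    """Return the new quarantine set.

    history maps case name -> flaky flags per run, oldest first.
    A case is quarantined as soon as its latest run is flaky, and is
    promoted back to gating only after `promote_after` consecutive
    stable runs.
    """
    updated = set(quarantined)
    for name, flags in history.items():
        if flags and flags[-1]:
            updated.add(name)
        elif (
            name in updated
            and len(flags) >= promote_after
            and not any(flags[-promote_after:])
        ):
            updated.discard(name)
    return updated
```

Quarantining is immediate and promotion is slow on purpose, so a case that flakes occasionally does not bounce in and out of the merge gate.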

