Replies: 1 comment
For CI gating, I'd split "flake signal" from "release decision". Keep temperature=0 plus deterministic checks as the blocking path. When a prose/LLM case flakes, don't widen the regex first; record it as flaky and rerun it outside the blocking path with a small fixed budget. For intentionally stochastic behavior, pass@k is cleaner than retry-until-stable because it admits the test is probabilistic.

I'd probably keep smoke at 3 samples, use 5 only for promoted/full cases, and avoid nesting 5 spawns x 3 judge samples unless the case is explicitly marked expensive. That is 15 judge samples for a single case, and the cost compounds quickly across a suite. So my default would be: quarantine from merge gating, trend it, then promote only cases that stay stable.
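To make the pass@k and budget points concrete, here is a rough sketch of what I mean. It is not taken from eval-harness: `run_pass_at_k`, `CaseResult`, and `model_calls` are hypothetical names, and the cost model assumes one generation call per spawn plus one call per judge sample.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CaseResult:
    passed: bool   # at least one of the k samples passed
    passes: int    # number of samples that passed
    samples: int   # k, the fixed sample budget
    flaky: bool    # samples disagreed with each other


def run_pass_at_k(check: Callable[[], bool], k: int = 3) -> CaseResult:
    """Run a stochastic check exactly k times (no retry-until-stable)."""
    outcomes = [bool(check()) for _ in range(k)]
    passes = sum(outcomes)
    return CaseResult(
        passed=passes > 0,
        passes=passes,
        samples=k,
        flaky=0 < passes < k,
    )


def model_calls(spawns: int, judge_samples: int = 0) -> int:
    """Assumed cost model: 1 generation call per spawn,
    plus judge_samples judge calls per spawn."""
    return spawns * (1 + judge_samples)
```

Under that assumed cost model, a smoke case at 3 samples with no judge is 3 calls, while 5 spawns x 3 judge samples is 20 calls for one case. The `flaky` field is what I would trend over time; it does not feed the merge gate.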
LLM tests are flaky in a way that classical unit tests aren't: same input, different output, no bug. How are you dealing with it?
eval-harness ships a 3-sample byte-identical stability check (tags `flaky: true` when samples diverge), but there's no industry consensus on what to do with a flaky result. Options I've seen:

- … (`llm_judge` does)
- … `expect_regex`

What's working for you? What's the cost in API spend?
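For anyone unfamiliar with the idea, a byte-identical stability check can be sketched roughly like this. It is an illustration of the technique only, not eval-harness's actual code: `stability_check` and the `generate` callable are hypothetical names.

```python
from typing import Callable, Dict, Union


def stability_check(
    generate: Callable[[], str], samples: int = 3
) -> Dict[str, Union[bool, int]]:
    """Call generate() a fixed number of times and compare raw bytes."""
    outputs = [generate().encode("utf-8") for _ in range(samples)]
    distinct = len(set(outputs))
    return {
        "flaky": distinct > 1,   # any byte-level divergence marks the case flaky
        "distinct": distinct,    # number of different outputs seen
        "samples": samples,
    }
```

Byte comparison is deliberately strict: a single changed whitespace character is enough to set `flaky` to true, which is why the question of what to do with a flaky result matters.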