feat(replay): add --strict-failure flag to fail response-divergent tests instead of demoting to OBSOLETE#4251
Conversation
|
CLA Assistant Lite bot: I have read the CLA Document and I hereby sign the CLA Charan Kamarapu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. |
|
The CLA check failed. Please ensure you have:
After fixing these issues, comment 'recheck' to trigger the workflow again. |
🚀 Keploy Performance Test ResultsMulti-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.
Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1% ✅ Result: PASSED - Only 0 out of 3 runs failed (threshold: 2) P50, P90, and P99 percentiles naturally filter out outliers |
…sts instead of demoting to OBSOLETE
When a test's response body differs from the recorded baseline AND the
consumed mock set diverged from the recorded mapping, replay currently
demotes the failure to OBSOLETE (replay.go:1828 / 2353) — the run-level
aggregate then reports PASSED because OBSOLETE doesn't count as failure.
The reasoning was "the mock contract changed, so the response diff is
a consequence — tell the user to re-record, don't fail loudly."
That's correct from a "did this test still validate the contract" angle,
but breaks two important workflows:
- CI: a real source regression that ALSO touches the mock set (e.g. a
SQL drift that changes both the wire query and the response) gets
silently classified as OBSOLETE → run passes → CI green → bug ships.
- Agentic flows: an autonomous fix agent watching for FAILED tests
has nothing to diagnose when the planted regression lands in
OBSOLETE. The signal-to-fix loop never starts.
This flag opts into a stricter classification: when the response diverges
from the recorded baseline (testPass=false), the test is FAILED — even
if mockSetMismatch=true would otherwise demote it. The per-test OBSOLETE
label is replaced with FAILED, but the mappingDiff (expected vs actual
mocks, missing calls) is still written to the report for diagnostics —
the user / agent loses NO information, just gets the right status.
Default behavior unchanged: omit the flag → existing OBSOLETE demotion
applies. Opt in via `--strict-failure` (CLI on `record`, `test`, and
`cloud replay`) or `strictFailure: true` in keploy.yml.
What it does NOT change:
- testPass=true + mockSetMismatch=true silent-ignore (controlled by
SchemaNoiseStrict, not this flag). The current behavior of treating
"response matched but mock didn't" as PASSED is a deliberate noise
tolerance; this PR doesn't touch it.
Sites:
- config/config.go: Test.StrictFailure field
- cli/provider/cmd.go: --strict-failure flag wired on the test/replay
command group (and cloud-replay inherits it via the same group)
- pkg/service/replay/replay.go: respect StrictFailure at the two
OBSOLETE-decision sites (lines 1828 and 2353 — HTTP and gRPC paths)
Signed-off-by: Charan Kamarapu <charan@keploy.io>
8f5d9c3 to
52034e1
Compare
|
The CLA check failed. Please ensure you have:
After fixing these issues, comment 'recheck' to trigger the workflow again. |
🚀 Keploy Performance Test ResultsMulti-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.
Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1% ✅ Result: PASSED - Only 0 out of 3 runs failed (threshold: 2) P50, P90, and P99 percentiles naturally filter out outliers |
charankamarapu
left a comment
There was a problem hiding this comment.
Principal engineer self-review
The change is correct and minimal — flag opt-in, default behaviour preserved, both demotion sites updated. A few things I'd want addressed before this lands:
Required
1. No test coverage for the new decision branch.
This PR changes replay-loop classification logic — the most consequential code path in the project. The matrix (testPass × mockSetMismatch × strictMockReject × StrictFailure) has 16 combinations; the change affects rows where testPass=false ∧ mockSetMismatch=true. There's no test asserting either:
- Default off → OBSOLETE (regression guard for existing users)
- Flag on → FAILED + mappingDiff preserved (the actual new behaviour)
A table-driven test on RunTestSet covering at minimum those two rows for each of the two sites (HTTP + gRPC) would catch the next person who edits the decision tree. Even a small replay_classification_test.go with a fakeReplayer would be enough.
Strongly recommended
2. Asymmetry between the two demotion sites is undocumented.
Site 1 (replay.go:1828, HTTP):
} else if mockSetMismatch && !strictMockReject && !r.config.Test.StrictFailure {Site 2 (replay.go:2352, gRPC):
} else if mockSetMismatch && !r.config.Test.StrictFailure {Site 2 doesn't consider strictMockReject at all. If that's because the gRPC path doesn't run the SchemaNoiseStrict pre-check that sets strictMockReject, say so in a comment. Otherwise the next reader (me, in 6 months) will assume one site forgot a guard and "fix" it, breaking the gRPC behaviour. A one-line comment on each branch costs nothing and prevents future bug introduction.
3. The two demotion sites are duplicate logic that should be extracted.
Both sites encode the same decision tree (PASSED → OBSOLETE → FAILED) with slightly different guards. Extract:
func (r *Replayer) classifyFailure(testPass, mockSetMismatch, strictMockReject bool) models.TestStatus {
if testPass { return models.TestStatusPassed }
if mockSetMismatch && !strictMockReject && !r.config.Test.StrictFailure {
return models.TestStatusObsolete
}
return models.TestStatusFailed
}That makes the asymmetry impossible (one source of truth) and makes the unit test from #1 trivial to write.
Nice to have
4. CLI help text leaks the implementation term "OBSOLETE".
--help users don't know what OBSOLETE means. Suggest reframing in user terms:
Mark response-divergent tests as FAILED even when their consumed mock set diverged from the recording (default: such tests are marked OBSOLETE so you can re-record without seeing a hard failure).
5. Add the diagnostic context to the commit body.
The commit body explains the WHY beautifully. Worth adding one more sentence noting the agentic-workflow scenario this unblocks — "diagnosed against the keploy validation harness scenario-1, where a planted Case-1 source regression landed in OBSOLETE → run reported PASSED → agent had no signal to engage." Future readers grepping for OBSOLETE will land on this commit and need the context.
Functionally LGTM after #1 + #2. The behaviour is correct and the flag-gated rollout is the right shape. Tests + the comment on the asymmetry are the blockers.
Summary
Adds an opt-in
--strict-failureflag (andstrictFailurekeploy.yml field). When set, a test whose response diverges from the recorded baseline is marked FAILED even if the consumed mock set also diverged from the recorded mapping. The default behavior — demoting such cases to OBSOLETE so the user can re-record without a hard failure — stays exactly the same.Why
The current
replay.go:1828(and twin site at 2353) decision tree:is correct from a "did this test still validate the recorded contract" angle — but breaks two real workflows:
Diagnosed 2026-06-08 against scenario-1 of our validation harness: a planted uncommitted
ORDER BY DESC LIMIT → ASC LIMITedit fell into OBSOLETE (because the recorded mock'ssql_ast_hashdidn't match the live ASC query — a mockSetMismatch), the cloud replay aggregate reported PASSED, and the agent never engaged. With this flag, the same scenario reports FAILED and the agent's Case 1 path runs cleanly: revert source → green.What it changes
testPassmockSetMismatch--strict-failureOnly row 6 changes. The OBSOLETE classification still exists in the data model and the per-test report; the run-level aggregate just rolls it into FAILED when
--strict-failureis set.The
mappingDiffblock (expected vs actual mocks, missing calls) is still written to the report regardless of the flag — the diagnostic content is preserved, only the status field changes.CLI surface
--strict-failureregistered onkeploy testandkeploy record(same flag group as--schema-noise-detection,--update-test-mapping,--must-pass).keploy cloud replayinherits the flag since it builds on the same command tree.strictFailure: true|false(defaultfalse).Verification
Validated against the orderflow scenario suite (S1-S7) — S1 reports FAILED instead of PASSED with the flag, exit code non-zero,
mappingDiffblock preserved.Test plan
🤖 Generated with Claude Code