feat(replay): add --strict-failure flag to fail response-divergent tests instead of demoting to OBSOLETE #4251

Merged: charankamarapu merged 1 commit into main from feat/strict-failure-flag on Jun 8, 2026

@charankamarapu

Contributor

Summary

Adds an opt-in --strict-failure flag (and strictFailure keploy.yml field). When set, a test whose response diverges from the recorded baseline is marked FAILED even if the consumed mock set also diverged from the recorded mapping. The default behavior — demoting such cases to OBSOLETE so the user can re-record without a hard failure — stays exactly the same.

Why

The current replay.go:1828 (and twin site at 2353) decision tree:

if testPass {
    PASSED
} else if mockSetMismatch && !strictMockReject {
    OBSOLETE
} else {
    FAILED
}

is correct from a "did this test still validate the recorded contract" angle — but breaks two real workflows:

  • CI: a source regression that ALSO touches the mock set (e.g. a SQL drift changing both wire query and response) lands in OBSOLETE → run reports PASSED → CI stays green → bug ships.
  • Agentic flows: an autonomous fix agent watching for FAILED tests has nothing to diagnose when the planted regression lands in OBSOLETE. The fix loop never starts.

Diagnosed 2026-06-08 against scenario-1 of our validation harness: a planted uncommitted ORDER BY DESC LIMIT → ASC LIMIT edit fell into OBSOLETE (because the recorded mock's sql_ast_hash didn't match the live ASC query — a mockSetMismatch), the cloud replay aggregate reported PASSED, and the agent never engaged. With this flag, the same scenario reports FAILED and the agent's Case 1 path runs cleanly: revert source → green.

What it changes

| Case | testPass | mockSetMismatch | --strict-failure | Before | After |
|---|---|---|---|---|---|
| 1 | true | false | any | PASSED | PASSED |
| 2 | true | true | false | PASSED (silent) | PASSED (silent, unchanged) |
| 3 | true | true | true | FAILED (via SchemaNoiseStrict) | FAILED (via SchemaNoiseStrict, unchanged) |
| 4 | false | false | any | FAILED | FAILED |
| 5 | false | true | false | OBSOLETE | OBSOLETE (unchanged, default opt-in only) |
| 6 | false | true | true | OBSOLETE | FAILED ← new behaviour |

Only row 6 changes. The OBSOLETE classification still exists in the data model and the per-test report; the run-level aggregate just rolls it into FAILED when --strict-failure is set.

The mappingDiff block (expected vs actual mocks, missing calls) is still written to the report regardless of the flag — the diagnostic content is preserved, only the status field changes.
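
One reading of the table and the two paragraphs above, as a minimal Go sketch. It assumes the per-test status itself becomes FAILED under the flag, that the run-level aggregate fails only on FAILED, and it does not model the SchemaNoiseStrict path of rows 2 and 3. `TestResult`, `finalStatus`, and `aggregate` are hypothetical names used for illustration, not identifiers from replay.go.

```go
package main

import "fmt"

type Status string

const (
	Passed   Status = "PASSED"
	Obsolete Status = "OBSOLETE"
	Failed   Status = "FAILED"
)

// TestResult is a hypothetical stand-in for one per-test report entry.
type TestResult struct {
	Status      Status
	MappingDiff string // filled whenever the mock set diverged, whatever the status
}

// finalStatus covers rows 1 and 4-6 of the table (SchemaNoiseStrict is not modelled).
func finalStatus(testPass, mockSetMismatch, strictFailure bool) TestResult {
	res := TestResult{Status: Failed}
	if mockSetMismatch {
		res.MappingDiff = "expected vs actual mocks, missing calls"
	}
	switch {
	case testPass:
		res.Status = Passed
	case mockSetMismatch && !strictFailure:
		res.Status = Obsolete
	}
	return res
}

// aggregate assumes the run fails only when at least one test is FAILED,
// so an OBSOLETE test leaves the run PASSED.
func aggregate(results []TestResult) Status {
	for _, r := range results {
		if r.Status == Failed {
			return Failed
		}
	}
	return Passed
}

func main() {
	for _, strict := range []bool{false, true} {
		row6 := finalStatus(false, true, strict)
		run := aggregate([]TestResult{finalStatus(true, false, strict), row6})
		fmt.Printf("strictFailure=%v test=%s run=%s mappingDiff=%q\n",
			strict, row6.Status, run, row6.MappingDiff)
	}
}
```

In both iterations the `MappingDiff` field is populated; only the status, and with it the run-level result, differs.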

CLI surface

  • --strict-failure registered on keploy test and keploy record (same flag group as --schema-noise-detection, --update-test-mapping, --must-pass).
  • keploy cloud replay inherits the flag since it builds on the same command tree.
  • keploy.yml field: strictFailure: true|false (default false).

Verification

Validated against the orderflow scenario suite (S1-S7) — S1 reports FAILED instead of PASSED with the flag, exit code non-zero, mappingDiff block preserved.

Test plan

  • CI: existing replay tests (no change expected — flag defaults off)
  • Manual: run a test where response diverges + mock set mismatches; verify FAILED with flag, OBSOLETE without flag, mappingDiff present in both

🤖 Generated with Claude Code

@github-actions

github-actions Bot commented Jun 7, 2026

CLA Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


Charan Kamarapu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request

@github-actions

github-actions Bot commented Jun 7, 2026

The CLA check failed. Please ensure you have:

  • Signed the CLA by commenting 'I have read the CLA Document and I hereby sign the CLA.'
  • Used the correct email address in your commits (matches the one you used to sign the CLA).

After fixing these issues, comment 'recheck' to trigger the workflow again.

@github-actions

github-actions Bot commented Jun 7, 2026

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
|---|---|---|---|---|---|---|
| 1 | 2.59ms | 3.27ms | 5.01ms | 100.02 | 0.00% | ✅ PASS |
| 2 | 2.53ms | 3.13ms | 4.7ms | 100.02 | 0.00% | ✅ PASS |
| 3 | 2.56ms | 3.35ms | 4.83ms | 100.02 | 0.00% | ✅ PASS |

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers
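
The multi-run gate described in this comment could look roughly like the Go sketch below. It is not the bot's implementation: `Run`, `regressed`, and `pipelineFails` are hypothetical names, and reading "±1% tolerance" as "RPS may fall up to 1% below 100" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Run holds the metrics reported for one performance run.
type Run struct {
	P50, P90, P99 time.Duration
	RPS           float64
	ErrorRatePct  float64
}

// ms converts fractional milliseconds into a time.Duration.
func ms(v float64) time.Duration {
	return time.Duration(v * float64(time.Millisecond))
}

// regressed reports whether a run violates any of the stated thresholds.
func regressed(r Run) bool {
	const targetRPS, tolerance = 100.0, 0.01 // tolerance reading is an assumption
	return r.P50 >= 5*time.Millisecond ||
		r.P90 >= 15*time.Millisecond ||
		r.P99 >= 70*time.Millisecond ||
		r.RPS < targetRPS*(1-tolerance) ||
		r.ErrorRatePct >= 1.0
}

// pipelineFails applies the "fail only if threshold or more runs regress" rule.
func pipelineFails(runs []Run, threshold int) bool {
	failed := 0
	for _, r := range runs {
		if regressed(r) {
			failed++
		}
	}
	return failed >= threshold
}

func main() {
	runs := []Run{
		{ms(2.59), ms(3.27), ms(5.01), 100.02, 0},
		{ms(2.53), ms(3.13), ms(4.70), 100.02, 0},
		{ms(2.56), ms(3.35), ms(4.83), 100.02, 0},
	}
	fmt.Println("pipeline fails:", pipelineFails(runs, 2))
}
```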

…sts instead of demoting to OBSOLETE

When a test's response body differs from the recorded baseline AND the
consumed mock set diverged from the recorded mapping, replay currently
demotes the failure to OBSOLETE (replay.go:1828 / 2353) — the run-level
aggregate then reports PASSED because OBSOLETE doesn't count as failure.
The reasoning was "the mock contract changed, so the response diff is
a consequence — tell the user to re-record, don't fail loudly."

That's correct from a "did this test still validate the contract" angle,
but breaks two important workflows:

  - CI: a real source regression that ALSO touches the mock set (e.g. a
    SQL drift that changes both the wire query and the response) gets
    silently classified as OBSOLETE → run passes → CI green → bug ships.
  - Agentic flows: an autonomous fix agent watching for FAILED tests
    has nothing to diagnose when the planted regression lands in
    OBSOLETE. The signal-to-fix loop never starts.

This flag opts into a stricter classification: when the response diverges
from the recorded baseline (testPass=false), the test is FAILED — even
if mockSetMismatch=true would otherwise demote it. The per-test OBSOLETE
label is replaced with FAILED, but the mappingDiff (expected vs actual
mocks, missing calls) is still written to the report for diagnostics —
the user / agent loses NO information, just gets the right status.

Default behavior unchanged: omit the flag → existing OBSOLETE demotion
applies. Opt in via `--strict-failure` (CLI on `record`, `test`, and
`cloud replay`) or `strictFailure: true` in keploy.yml.

What it does NOT change:
  - testPass=true + mockSetMismatch=true silent-ignore (controlled by
    SchemaNoiseStrict, not this flag). The current behavior of treating
    "response matched but mock didn't" as PASSED is a deliberate noise
    tolerance; this PR doesn't touch it.

Sites:
  - config/config.go: Test.StrictFailure field
  - cli/provider/cmd.go: --strict-failure flag wired on the test/replay
    command group (and cloud-replay inherits it via the same group)
  - pkg/service/replay/replay.go: respect StrictFailure at the two
    OBSOLETE-decision sites (lines 1828 and 2353 — HTTP and gRPC paths)

Signed-off-by: Charan Kamarapu <charan@keploy.io>

@charankamarapu force-pushed the feat/strict-failure-flag branch from 8f5d9c3 to 52034e1 on June 7, 2026 23:21

@github-actions

github-actions Bot commented Jun 7, 2026

The CLA check failed. Please ensure you have:

  • Signed the CLA by commenting 'I have read the CLA Document and I hereby sign the CLA.'
  • Used the correct email address in your commits (matches the one you used to sign the CLA).

After fixing these issues, comment 'recheck' to trigger the workflow again.

@github-actions

github-actions Bot commented Jun 7, 2026

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
|---|---|---|---|---|---|---|
| 1 | 2.7ms | 3.4ms | 4.97ms | 100.02 | 0.00% | ✅ PASS |
| 2 | 2.65ms | 3.31ms | 4.9ms | 100.02 | 0.00% | ✅ PASS |
| 3 | 2.76ms | 3.68ms | 5.87ms | 100.02 | 0.00% | ✅ PASS |

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

@charankamarapu left a comment

Contributor Author

Principal engineer self-review

The change is correct and minimal — flag opt-in, default behaviour preserved, both demotion sites updated. A few things I'd want addressed before this lands:

Required

1. No test coverage for the new decision branch.
This PR changes replay-loop classification logic — the most consequential code path in the project. The matrix (testPass × mockSetMismatch × strictMockReject × StrictFailure) has 16 combinations; the change affects rows where testPass=false ∧ mockSetMismatch=true. There's no test asserting either:

  • Default off → OBSOLETE (regression guard for existing users)
  • Flag on → FAILED + mappingDiff preserved (the actual new behaviour)

A table-driven test on RunTestSet covering at minimum those two rows for each of the two sites (HTTP + gRPC) would catch the next person who edits the decision tree. Even a small replay_classification_test.go with a fakeReplayer would be enough.
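
As a rough illustration of the regression guard requested here, the sketch below walks all 16 combinations of a pure classification function. In the repository this would more likely be a `_test.go` file using the `testing` package and driving `RunTestSet` through a fake. Here `legacy`, `classify`, and `checkMatrix` are hypothetical standalone functions, with `classify` modelled on the HTTP-site guard quoted in item 2 of this review.

```go
package main

import "fmt"

type Status string

const (
	Passed   Status = "PASSED"
	Obsolete Status = "OBSOLETE"
	Failed   Status = "FAILED"
)

// legacy is the decision tree quoted in the PR description, before the flag.
func legacy(testPass, mockSetMismatch, strictMockReject bool) Status {
	switch {
	case testPass:
		return Passed
	case mockSetMismatch && !strictMockReject:
		return Obsolete
	default:
		return Failed
	}
}

// classify adds the StrictFailure guard to the same tree.
func classify(testPass, mockSetMismatch, strictMockReject, strictFailure bool) Status {
	switch {
	case testPass:
		return Passed
	case mockSetMismatch && !strictMockReject && !strictFailure:
		return Obsolete
	default:
		return Failed
	}
}

// checkMatrix walks all 16 combinations and describes every violation it finds.
func checkMatrix() []string {
	var problems []string
	bools := []bool{false, true}
	for _, tp := range bools {
		for _, mm := range bools {
			for _, smr := range bools {
				// Flag off: behaviour must match the legacy tree exactly.
				if got, want := classify(tp, mm, smr, false), legacy(tp, mm, smr); got != want {
					problems = append(problems,
						fmt.Sprintf("flag off (%v,%v,%v): got %s, want %s", tp, mm, smr, got, want))
				}
				// Flag on: a diverging response must never be demoted.
				if got := classify(tp, mm, smr, true); !tp && got != Failed {
					problems = append(problems,
						fmt.Sprintf("flag on (%v,%v,%v): got %s, want %s", tp, mm, smr, got, Failed))
				}
			}
		}
	}
	return problems
}

func main() {
	problems := checkMatrix()
	for _, p := range problems {
		fmt.Println("violation:", p)
	}
	fmt.Println("violations found:", len(problems))
}
```

Comparing against `legacy` rather than hand-listing expected values keeps the "default off" guard honest: any future edit that changes behaviour for existing users shows up as a violation.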

Strongly recommended

2. Asymmetry between the two demotion sites is undocumented.

Site 1 (replay.go:1828, HTTP):

} else if mockSetMismatch && !strictMockReject && !r.config.Test.StrictFailure {

Site 2 (replay.go:2352, gRPC):

} else if mockSetMismatch && !r.config.Test.StrictFailure {

Site 2 doesn't consider strictMockReject at all. If that's because the gRPC path doesn't run the SchemaNoiseStrict pre-check that sets strictMockReject, say so in a comment. Otherwise the next reader (me, in 6 months) will assume one site forgot a guard and "fix" it, breaking the gRPC behaviour. A one-line comment on each branch costs nothing and prevents future bug introduction.
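
To make the asymmetry concrete, the sketch below enumerates the inputs on which the two quoted guards disagree. It assumes both sites share the surrounding `if testPass` / `else` structure shown in the PR description and that `strictMockReject` can be true on the gRPC path at all, which is exactly the open question. `inputs`, `classifyHTTP`, `classifyGRPC`, and `divergences` are hypothetical names.

```go
package main

import "fmt"

// inputs bundles the four booleans that feed the decision.
type inputs struct {
	testPass, mockSetMismatch, strictMockReject, strictFailure bool
}

// classifyHTTP mirrors the guard quoted for site 1.
func classifyHTTP(in inputs) string {
	switch {
	case in.testPass:
		return "PASSED"
	case in.mockSetMismatch && !in.strictMockReject && !in.strictFailure:
		return "OBSOLETE"
	default:
		return "FAILED"
	}
}

// classifyGRPC mirrors the guard quoted for site 2, which ignores strictMockReject.
func classifyGRPC(in inputs) string {
	switch {
	case in.testPass:
		return "PASSED"
	case in.mockSetMismatch && !in.strictFailure:
		return "OBSOLETE"
	default:
		return "FAILED"
	}
}

// divergences returns every input combination where the two sites disagree.
func divergences() []inputs {
	var out []inputs
	bools := []bool{false, true}
	for _, tp := range bools {
		for _, mm := range bools {
			for _, smr := range bools {
				for _, sf := range bools {
					in := inputs{tp, mm, smr, sf}
					if classifyHTTP(in) != classifyGRPC(in) {
						out = append(out, in)
					}
				}
			}
		}
	}
	return out
}

func main() {
	for _, in := range divergences() {
		fmt.Printf("%+v: HTTP=%s gRPC=%s\n", in, classifyHTTP(in), classifyGRPC(in))
	}
}
```

Under these assumptions the guards differ on a single combination: response diverged, mock set mismatched, strictMockReject set, StrictFailure unset. There the HTTP guard yields FAILED and the gRPC guard yields OBSOLETE.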

3. The two demotion sites are duplicate logic that should be extracted.

Both sites encode the same decision tree (PASSED → OBSOLETE → FAILED) with slightly different guards. Extract:

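// classifyFailure is the single source of truth for the PASSED / OBSOLETE / FAILED decision.
func (r *Replayer) classifyFailure(testPass, mockSetMismatch, strictMockReject bool) models.TestStatus {
    if testPass {
        return models.TestStatusPassed
    }
    // Demote to OBSOLETE only when neither strict mode asks for a hard failure.
    if mockSetMismatch && !strictMockReject && !r.config.Test.StrictFailure {
        return models.TestStatusObsolete
    }
    return models.TestStatusFailed
}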

That makes the asymmetry impossible (one source of truth) and makes the unit test from #1 trivial to write.

Nice to have

4. CLI help text leaks the implementation term "OBSOLETE".

--help users don't know what OBSOLETE means. Suggest reframing in user terms:

Mark response-divergent tests as FAILED even when their consumed mock set diverged from the recording (default: such tests are marked OBSOLETE so you can re-record without seeing a hard failure).

5. Add the diagnostic context to the commit body.

The commit body explains the WHY beautifully. Worth adding one more sentence noting the agentic-workflow scenario this unblocks — "diagnosed against the keploy validation harness scenario-1, where a planted Case-1 source regression landed in OBSOLETE → run reported PASSED → agent had no signal to engage." Future readers grepping for OBSOLETE will land on this commit and need the context.


Functionally LGTM after #1 + #2. The behaviour is correct and the flag-gated rollout is the right shape. Tests + the comment on the asymmetry are the blockers.

@charankamarapu merged commit d0fbdbd into main on Jun 8, 2026
138 of 141 checks passed
@charankamarapu deleted the feat/strict-failure-flag branch on June 8, 2026 09:12
@github-actions bot locked and limited conversation to collaborators on Jun 8, 2026