Create LICENSE#2
Merged
Merged
Conversation
PranshuSrivastava
pushed a commit
that referenced
this pull request
Mar 22, 2024
Add hardening options
manasmanohar
pushed a commit
to manasmanohar/keploy
that referenced
this pull request
Jan 10, 2025
* feat: add wrapers around interfaces of keploy oss Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com> * build: add github action for build check golang Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com> * refactor: remove replace statement in go mod Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com> * fix: upgrade to newer version in go mod Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com> * refactor: remove goveralls for test-coverage Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com> * refactor: remove blobs from goreleaser Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com> --------- Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com>
1 task
gnana1905
pushed a commit
to gnana1905/keploy
that referenced
this pull request
Sep 17, 2025
ci: add github actions to deploy website
This was referenced Feb 22, 2026
9 tasks
slayerjain
added a commit
that referenced
this pull request
Apr 18, 2026
Four follow-up fixes on ee0332e plus the Windows runner disk-full that failed this PR's own build-and-upload job. 1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the pre-release read of m.outChan was still under m.mu alone, racing SetOutputChannel/CloseOutChan which write it under outChanMu. Switched to outChanStatus() — same RLock path the send uses — so the consistency check is now under the lock that protects the field. 2. SetOutputChannel idempotence: the hot path is buildRecordSession re-binding rule.MC (same pointer) on every accepted connection. Previously it unconditionally cleared outChanClosed, so a post-shutdown re-bind turned a closed channel back 'open' and the next send would panic. Now only clears the flag when the channel POINTER actually changes (new session / re-record). New test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins this. 3. dns.go kept calling SetOutputChannel on every mock. With the idempotence fix in #2 above it's now safe, but the comment block is updated to spell out why, and the call is ordered before AddMock/SendConfigMock so the first DNS query (which may fire before the first TCP connection triggers buildRecordSession) still binds the channel. 4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest is 'client-side only' — it isn't, dialPostgresSSLUpstream re-originates SSLRequest upstream when dest port is 5432. Rewrote the comment to match actual behaviour and point at the runtime_hooks.go docstring. 5. prepare_and_run_windows.yml build-and-upload failed on ee0332e with 'There is not enough space on the disk' writing go-build importcfg under SystemTemp. Ported the tiered disk cleanup from keploy/windows-redirector#2: fast path on >=4 GB free, light sweep (SystemTemp\go-build*, sibling runners' _diag / _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows Update cache), last-resort go module cache purge, hard-fail under 2 GB. Verification: go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass petclinic local 5-round stress, 1295 mocks captured, zero WARNING: DATA RACE, 4 s clean SIGTERM shutdown. Signed-off-by: slayerjain <shubham@keploy.io>
slayerjain
added a commit
that referenced
this pull request
Apr 19, 2026
Four follow-up fixes on ee0332e plus the Windows runner disk-full that failed this PR's own build-and-upload job. 1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the pre-release read of m.outChan was still under m.mu alone, racing SetOutputChannel/CloseOutChan which write it under outChanMu. Switched to outChanStatus() — same RLock path the send uses — so the consistency check is now under the lock that protects the field. 2. SetOutputChannel idempotence: the hot path is buildRecordSession re-binding rule.MC (same pointer) on every accepted connection. Previously it unconditionally cleared outChanClosed, so a post-shutdown re-bind turned a closed channel back 'open' and the next send would panic. Now only clears the flag when the channel POINTER actually changes (new session / re-record). New test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins this. 3. dns.go kept calling SetOutputChannel on every mock. With the idempotence fix in #2 above it's now safe, but the comment block is updated to spell out why, and the call is ordered before AddMock/SendConfigMock so the first DNS query (which may fire before the first TCP connection triggers buildRecordSession) still binds the channel. 4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest is 'client-side only' — it isn't, dialPostgresSSLUpstream re-originates SSLRequest upstream when dest port is 5432. Rewrote the comment to match actual behaviour and point at the runtime_hooks.go docstring. 5. prepare_and_run_windows.yml build-and-upload failed on ee0332e with 'There is not enough space on the disk' writing go-build importcfg under SystemTemp. Ported the tiered disk cleanup from keploy/windows-redirector#2: fast path on >=4 GB free, light sweep (SystemTemp\go-build*, sibling runners' _diag / _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows Update cache), last-resort go module cache purge, hard-fail under 2 GB. Verification: go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass petclinic local 5-round stress, 1295 mocks captured, zero WARNING: DATA RACE, 4 s clean SIGTERM shutdown. Signed-off-by: slayerjain <shubham@keploy.io>
7 tasks
slayerjain
added a commit
that referenced
this pull request
Apr 19, 2026
Four follow-up fixes on ee0332e plus the Windows runner disk-full that failed this PR's own build-and-upload job. 1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the pre-release read of m.outChan was still under m.mu alone, racing SetOutputChannel/CloseOutChan which write it under outChanMu. Switched to outChanStatus() — same RLock path the send uses — so the consistency check is now under the lock that protects the field. 2. SetOutputChannel idempotence: the hot path is buildRecordSession re-binding rule.MC (same pointer) on every accepted connection. Previously it unconditionally cleared outChanClosed, so a post-shutdown re-bind turned a closed channel back 'open' and the next send would panic. Now only clears the flag when the channel POINTER actually changes (new session / re-record). New test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins this. 3. dns.go kept calling SetOutputChannel on every mock. With the idempotence fix in #2 above it's now safe, but the comment block is updated to spell out why, and the call is ordered before AddMock/SendConfigMock so the first DNS query (which may fire before the first TCP connection triggers buildRecordSession) still binds the channel. 4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest is 'client-side only' — it isn't, dialPostgresSSLUpstream re-originates SSLRequest upstream when dest port is 5432. Rewrote the comment to match actual behaviour and point at the runtime_hooks.go docstring. 5. prepare_and_run_windows.yml build-and-upload failed on ee0332e with 'There is not enough space on the disk' writing go-build importcfg under SystemTemp. Ported the tiered disk cleanup from a companion repo workflow: fast path on >=4 GB free, light sweep (SystemTemp\go-build*, sibling runners' _diag / _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows Update cache), last-resort go module cache purge, hard-fail under 2 GB. Verification: go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass petclinic local 5-round stress, 1295 mocks captured, zero WARNING: DATA RACE, 4 s clean SIGTERM shutdown. Signed-off-by: slayerjain <shubham@keploy.io>
slayerjain
added a commit
that referenced
this pull request
Apr 19, 2026
Four follow-up fixes on ee0332e plus the Windows runner disk-full that failed this PR's own build-and-upload job. 1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the pre-release read of m.outChan was still under m.mu alone, racing SetOutputChannel/CloseOutChan which write it under outChanMu. Switched to outChanStatus() — same RLock path the send uses — so the consistency check is now under the lock that protects the field. 2. SetOutputChannel idempotence: the hot path is buildRecordSession re-binding rule.MC (same pointer) on every accepted connection. Previously it unconditionally cleared outChanClosed, so a post-shutdown re-bind turned a closed channel back 'open' and the next send would panic. Now only clears the flag when the channel POINTER actually changes (new session / re-record). New test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins this. 3. dns.go kept calling SetOutputChannel on every mock. With the idempotence fix in #2 above it's now safe, but the comment block is updated to spell out why, and the call is ordered before AddMock/SendConfigMock so the first DNS query (which may fire before the first TCP connection triggers buildRecordSession) still binds the channel. 4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest is 'client-side only' — it isn't, dialPostgresSSLUpstream re-originates SSLRequest upstream when dest port is 5432. Rewrote the comment to match actual behaviour and point at the runtime_hooks.go docstring. 5. prepare_and_run_windows.yml build-and-upload failed on ee0332e with 'There is not enough space on the disk' writing go-build importcfg under SystemTemp. Ported the tiered disk cleanup from a companion repo workflow: fast path on >=4 GB free, light sweep (SystemTemp\go-build*, sibling runners' _diag / _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows Update cache), last-resort go module cache purge, hard-fail under 2 GB. Verification: go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass petclinic local 5-round stress, 1295 mocks captured, zero WARNING: DATA RACE, 4 s clean SIGTERM shutdown. Signed-off-by: slayerjain <shubham@keploy.io>
slayerjain
added a commit
that referenced
this pull request
Apr 19, 2026
…4045) * feat(proxy): extensibility hooks + Postgres SSL, ALPN, MySQL fixes Adds generic extension points in the agent/proxy/hooks packages so downstream builds can plug in alternative capture layers (observe-only BPF, sidecar injection, etc.) without touching the OSS core. Also bundles four orthogonal bug fixes that keploy hits today in the default proxy path, plus a cgroup orphan-cleanup helper that fixes "program already attached" errors after a crash. No feature-specific code — the extension hooks are named after what they do (SkipProxyListener, AgentInfoCustomizer), not after any downstream product. - `pkg/agent/runtime_hooks.go`: add `SkipProxyListener bool` and `AgentInfoCustomizer func(*structs.AgentInfo)` hook variables. When SkipProxyListener is true, the proxy does not bind a TCP listener but DNS, parsers, and session management still run. AgentInfoCustomizer is called right before AgentInfo is written to the BPF map so downstream code can populate the new Flags field without touching the OSS load path. - `pkg/agent/hooks/structs/structs.go`: add `Flags uint32` field to AgentInfo — an extensible flag slot that downstream BPF code can consume (e.g. to branch cgroup/bind4 behavior). - `pkg/service/agent/hooks.go`: export `SetSkipProxyListener` and `SetAgentInfoCustomizer` setter helpers. - `pkg/agent/proxy/proxy.go`: honor SkipProxyListener by skipping `net.Listen` and blocking on ctx.Done instead. - `pkg/agent/hooks/linux/hooks.go`: invoke `AgentInfoCustomizer` on the populated `agentInfo` before writing it to the BPF map. - `pkg/agent/hooks/linux/cleanup.go` (**new**): helpers that find and detach BPF cgroup programs left behind by a crashed prior keploy run. `DetectAndCleanOrphanedCgroupPrograms` is called from the hooks load path BEFORE attaching new programs, fixing the common "program already attached" failure after a hard-kill. Scan is scoped to programs whose owning PID is dead, so it's safe when other live keploy instances are around. Also exposes `CleanOrphanedBPFPins` (pin cleanup) and an `ExtraBPFPinPatterns` slice so downstream builds can register additional glob patterns for their own pinned maps. **Postgres SSLRequest handshake** (`pkg/agent/proxy/proxy.go`) — `libpq` / `pgx` / `pq` with `sslmode=prefer|require|verify-*` sends an 8-byte SSLRequest packet **before** the TLS handshake begins, expecting the server to reply with a single byte `'S'` (accept) or `'N'` (refuse). The proxy previously just peeked 5 bytes and tried to detect a TLS record header, which fails on this packet — the Postgres client would hang waiting for the server reply, and the connection would drop after timeout. Now we peek 8 bytes, detect the SSLRequest by length=8 + code=0x04D2162F, discard it from the buffered reader, reply `'S'`, and re-peek for the TLS ClientHello that arrives next. The existing TLS upgrade path then runs verbatim. **TLS ALPN negotiation for non-HTTP clients** (`pkg/agent/proxy/tls/tls.go`) — The TLS handler previously forced `NextProtos = ["h2"]` for any client that didn't explicitly advertise `http/1.1`, which caused Go's `tls.Server` to raise `tls: client requested unsupported application protocols` for clients advertising things like `postgresql`, `mongodb`, or `redis`. Now the handler leaves NextProtos empty (letting the client's protocol selection stand) unless the client advertises `http/1.1` or `h2` explicitly. Fixes libpq TLS handshakes against the proxy. **MySQL `MatchType` content detection** (`pkg/agent/proxy/integrations/mysql/mysql.go`) — `MatchType` used to unconditionally return `false`, relying on the caller's port-based routing (3306). Capture layers that deliver bytes without an L4 destination port (TLS uprobe streams, alternative transports) need a content-based fallback so MySQL plaintext is routed to the MySQL parser instead of falling through to Generic. New heuristic matches (a) server HandshakeV10 (first body byte 0x0a) and (b) client HandshakeResponse41 (4-byte caps with `CLIENT_PROTOCOL_41 | CLIENT_PLUGIN_AUTH` + zero-filled reserved field), both restricted to small packets (len < 512) at low sequence numbers. Strict enough to avoid false positives on HTTP POST bodies. **MySQL `CLIENT_DEPRECATE_EOF` seeding in post-TLS record path** (`pkg/agent/proxy/integrations/mysql/recorder/conn.go`) — after a post-TLS client handshake is decoded, the recorder was not copying `decodeCtx.ClientCapabilities` into `decodeCtx.ClientCaps`, so `DeprecateEOF()` returned `false` and the TextResultSet reader expected an intermediate EOF packet between column definitions and row data. Modern drivers (mysql-connector-python/j, go-sql-driver, node mysql2) all advertise `CLIENT_DEPRECATE_EOF` and send rows immediately, and the recorder would abort with `expected EOF packet for column definition`. We now mirror what `handleInitialHandshake` does on the non-TLS path: seed `ClientCaps` from the decoded capabilities (and on the seq=0 "existing connection" branch, defensively seed `CLIENT_DEPRECATE_EOF` since we don't see a HandshakeResponse41 there). **MySQL recorder: non-fatal EOF on TLS-short-circuited handshake** (`pkg/agent/proxy/integrations/mysql/recorder/record.go`) — when a capture layer detects the MySQL `CLIENT_SSL` bit mid-handshake and closes the SimulatedConn so a TLS-aware consumer can take over, the recorder was returning the EOF as an error. Now it's logged at debug level and returned as nil. No observable behavior change for the default proxy path (EOF is preserved for real TCP hang-ups). **MySQL replayer: fall through to plaintext mock candidates when no SSLRequest-tagged mock exists** (`pkg/agent/proxy/integrations/mysql/replayer/conn.go`) — previously, if no mock had a recorded SSLRequest entry, the replayer would abort with `no mysql mocks matched the SSL request`. Now it falls through to the full mock set and lets the post-TLS HandshakeResponse41 match against plaintext candidates. The post-TLS wire format is identical, so a plaintext-recorded MySQL flow can still serve a TLS-requesting replay client. **Docker run: defensive guard against `-p 0:0`** (`pkg/platform/docker/util.go`) — when a caller intentionally sets `SetupOptions.ProxyPort = 0` (e.g. running the agent without a listening proxy socket), the alias builder was emitting `-p 0:0` which `docker run` rejects outright. Now we conditionally build the `-p <proxy>:<proxy>` fragment and skip it entirely when ProxyPort is zero. `pkg/models/const.go` now exposes `CapturedReqTimeKey`, `CapturedRespTimeKey`, `CapturedReqTime(ctx)`, `CapturedRespTime(ctx)`. These let a capture layer install a per-connection wall-clock source on the request context so integration parsers stamp `reqTimestampMock` / `resTimestampMock` with true packet arrival order instead of parser-goroutine scheduling order (the proxy path drifts under concurrent-connection load and can invert mock timestamps across exchanges). Both helpers fall back to `time.Now()` when no source is installed, so call sites can use them unconditionally. Applied inside the MySQL query recorder (`query.go`) as the first consumer. `github.com/cilium/ebpf` goes from v0.19.0 to v0.21.0. The newer release contains the CO-RE field-exists fixes that downstream capture-layer BPF code depends on; the bump is otherwise a drop-in. - [x] `go build ./...` — clean - [x] `go test ./pkg/agent/proxy/integrations/mysql/... ./pkg/models/... ./pkg/agent/... ./pkg/platform/docker/...` — pass - [x] Full grep for `proxyless` / `LowLatency` / `--low-latency` / `--container-name` across `*.go` — zero hits in new code - No config surface changes (`config.Record` is untouched) - No new CLI flags - No BPF C source changes (`bpf_x86_bpfel.go/.o` left at baseline — the obfuscated map-name cleanup can land in a separate PR) Signed-off-by: slayerjain <shubham@keploy.io> * perf(record): cache next test index per test-set to avoid O(n²) dir scan pprof showed FindLastIndex at 16% of the record client's CPU at ~3000 accumulated tests — every new test did a full getdents64 on the tests/ directory. Cache the counter per-testset via sync.Map + atomic.Int64; seed from FindLastIndex on first access, atomic increment thereafter. No external process mints names in the same directory during a record session. Signed-off-by: slayerjain <shubham@keploy.io> * perf(proxy): DirectChunkReader fast path in ReadBuffConn Skip the ReadBytes goroutine hop when the conn exposes pre-framed chunks directly (enterprise SimulatedConn backed by eBPF ringbuf). HasPending+carry-over Read preserves Prepend semantics for protocol- detection peeks. Non-implementing conns unchanged. Signed-off-by: slayerjain <shubham@keploy.io> * perf(record): async batched gob mock writer gated by KEPLOY_MOCK_FORMAT yaml.v3 encoder was 28-30% of the record client's CPU at high throughput (pprof c=20 PG bulk-insert). Replace it for low-latency mode with an async gob writer: - Single background goroutine per MockYaml instance owns the open file + bufio (256 KB) + gob.Encoder. The encoder stays alive across mocks so the per-mock cost is one Encode call plus a 64-byte type-table delta (not a fresh encoder setup). - InsertMock enqueues to a 4096-buffer chan; parser goroutine never blocks on disk or encoding. Queue-full falls back to a synchronous encode so mocks are never dropped — the drop count is tracked atomically for observability. - readGobMocks uses one decoder that keeps the type table live across Decode calls, matching the writer's single-stream output. - Close() drains the queue and flushes before returning — call it at record shutdown. Reader auto-detects mocks.gob first, falling back to mocks.yaml, so existing YAML recordings still replay. Round-trip tested across HTTP, Redis, Kafka, Generic, Postgres, MySQL, MongoDB, DNS, and multi-mock append. All 9 pass with reflect.DeepEqual between input and decoded Mock. gob handles the interface{} Message fields correctly via the gob.Register calls in pkg/models/*. Signed-off-by: slayerjain <shubham@keploy.io> * chore(bpf): regenerate x86 objects against ebpf#92 verifier fix Earlier regen tripped the kernel 6.12 verifier in k_connect6 ("dereference of modified ctx ptr R2 off=24 disallowed") after the agent_info flags branch. ebpf#92 has the C-source fix; this regen ships the Go + object generated from that ebpf revision. Signed-off-by: slayerjain <shubham@keploy.io> * fix(syncMock): recover from send on closed channel during shutdown AddMock's default-case select did not catch closed-channel panic. HandleOutgoing closes outChan during shutdown while a parser goroutine can still be mid-send. recover() + silent drop is correct — shutdown is racing the parser and the mock reader is already gone. Signed-off-by: slayerjain <shubham@keploy.io> * fix(record): flush async mockDB writer at record shutdown The gob binary mock writer added in KEPLOY_MOCK_FORMAT=gob runs a background goroutine that holds up to 4096 mocks in a channel before encoding. Previously there was no call site for MockYaml.Close(), so at Ctrl-C the queued tail was lost along with the writer goroutine. Wire a deferred io.Closer-type-assertion at the top of Recorder.Start. The default YAML MockDB doesn't implement io.Closer so the assertion is a no-op; only the gob path's Close drains the queue and flushes the file. Added TestMockYamlIsIOCloser as a compile-time + runtime guard so a future refactor that drops the Close method can't silently lose mocks. Signed-off-by: slayerjain <shubham@keploy.io> * feat(record): gob format config field + magic header + overflow logging + Recorder.Cleanups Four debt items from the principal review bundled because they touch the same area and gob-format codepath: 1. record.mockFormat config field (yaml|gob) — promotes KEPLOY_MOCK_FORMAT from an env-var escape hatch to a documented config knob. Env var still takes precedence for ad-hoc runs. Wired from the OSS CLI provider via mockdb.SetConfiguredMockFormat. 2. Magic header on mocks.gob — "keploy-gob-v1\n" is written at the start of every new file; readGobMocks rejects files without it with a clear error. Protects against silent struct-schema drift across keploy versions — future format breaks bump the suffix. 3. gobWriteSync now reuses the async writer's open file + encoder under the mutex. Previously it opened a second fd with a fresh gob.Encoder, concatenating a new type-table into the stream that the reader's single gob.Decoder could not resume across. This was a latent corruption bug that only fired when the async queue filled up (which the fix for item 1 makes observable). 4. MockYaml.Close now logs the queue-overflow count at Info level. Previously the atomic counter was incremented but never surfaced — no way for an operator to notice disk/encoding had become a bottleneck. New test TestGobMagicHeaderRejectsMismatch covers the header guard. 5. Recorder.Cleanups LIFO stack replaces the ad-hoc io.Closer type-assertion added in afcf4a18. mockDB auto-registers if it implements io.Closer; downstream subsystems (telemetry, mapping-DB, etc.) can RegisterCleanup their own flush without editing this file. Signed-off-by: slayerjain <shubham@keploy.io> * chore: regenerate arm64 BPF objects; add docs/env-vars.md Matches the x86 regen landed earlier in this PR (1edb975f). Same underlying ebpf#92 source commit. bpf_arm64_bpfel.go was already identical to the regenerated version; only the .o needed updating. docs/env-vars.md collects KEPLOY_MOCK_FORMAT, PPROF_PORT, and GOMEMLIMIT_MB with notes on what was deliberately NOT added as a default (GOMAXPROCS cap, auto-gob-in-low-latency). Keeps the env-var surface documented and discoverable for reviewers. Signed-off-by: slayerjain <shubham@keploy.io> * style: gofmt alignment fixes Signed-off-by: slayerjain <shubham@keploy.io> * chore(bpf): regen .o/.go from post-#96 ebpf main (struct size=56) The previously bundled .o files were captured from an intermediate state of ebpf#96 where agent_info_t had grown to 64 bytes (had _pad + flags + _pad2). The merged version of #96 keeps the struct at 56 bytes via a size-preserving rename of the existing _pad slot to flags. keploy's Go AgentInfo mirror is 56 bytes, which caused "marshal value: structs.AgentInfo doesn't marshal to 56 bytes" failures at hook load time on every platform that runs the BPF loader (observed on macOS docker CI; same root cause elsewhere). Regenerated from ebpf main at 9e657c6. Signed-off-by: slayerjain <shubham@keploy.io> * fix(hooks): align AgentInfo.Flags with BPF offset 44 (merged ebpf#96 layout) The previous layout placed Flags at offset 56 (after RecordingStartTime) with trailing blank padding, making the struct 64 bytes. The merged ebpf#96 renamed the existing _pad slot to flags in-place, so the BPF reads flags at offset 44 and the struct stays 56 bytes. Moving Flags to the slot the old blank [4]byte occupied restores byte-for-byte parity with the BPF map value: size 56, Flags@44, RecordingStartTime@48. Signed-off-by: slayerjain <shubham@keploy.io> * fix(proxy): gate Postgres SSLRequest interception behind runtime hook spring-petclinic (PG via JDBC) was regressing with EOF during doAuthentication because the proxy was unconditionally intercepting the 8-byte SSLRequest preamble and replying 'S' itself, then terminating TLS. The default keploy build injects a Postgres parser via extraparsers.go which already handles SSLRequest through the existing TLSUpgrader interface; double-handling meant the parser never saw the SSLRequest and couldn't forward post-TLS traffic, leading to the Java client timing out on auth reads. Fix: add agent.InterceptPostgresSSLRequest (default false) and a SetInterceptPostgresSSLRequest setter. The proxy-level handshake now runs only when the flag is enabled, preserving parser-driven behavior on the default build and letting pure-proxy downstream builds (no PG parser registered) opt in when needed. Signed-off-by: slayerjain <shubham@keploy.io> * chore: address Copilot review comments - gob overflow hint now reports the actual queueCapacity and points at ensureGobWriter as the control (no fictitious env var). - MySQL MatchType slices body to exactly the first packet's (buf[4:4+pktLen]) and uses pktLen for length checks, so a buffer carrying subsequent packets cannot spuriously satisfy the 23-byte reserved-zero window. - Postgres SSLRequest error messages now include an actionable next step for the operator: what to check for each failure class (reset between peek/discard, one-byte write failure, missing ClientHello after 'S'). - record.go io.Closer comment corrected: MockYaml implements Close unconditionally and is a no-op in yaml mode (gobStop==nil). Signed-off-by: slayerjain <shubham@keploy.io> * chore: address second round of Copilot comments - cleanup.go: rewrote detachOrphanedByName docstring to be honest about the name-only match (BPF_PROG_GET_INFO_BY_FD exposes no creator PID); safety comes from the global no-other-keploy gate in the caller. Stale 'CleanupStaleAgents' reference removed. - cleanup.go: detach failure log moved from Warn to Debug and made actionable (cgroup path + attach type + next step). - record.go: cleanup failure log moved from Warn to Error and tells the operator what to inspect (mock-file tail, disk space, output-dir permissions) plus the failing cleanup index. - db.go: gobClosed latches on Close(), InsertMock after close returns a clear error instead of silently enqueueing into a channel whose consumer has exited (prevented first-4096-then- hang behavior in hypothetical re-record-reuse flows). Signed-off-by: slayerjain <shubham@keploy.io> * fix(mockdb): gob read path now treats Mongo mocks like YAML does Two off-by-one inclusion bugs in the gob path of GetFilteredMocks and GetUnFilteredMocks meant Mongo mocks were: - silently dropped from GetFilteredMocks (per-test mocks), because the whitelist switch did not list 'Mongo' (or any default case); - accidentally promoted into GetUnFilteredMocks (config mocks), because !isUnfilteredKind was treated as 'include' instead of 'exclude'. Rewrote both paths to mirror the YAML semantics: a fixed set of kinds (Generic / Postgres / Http / Http2 / Redis / MySQL / DNS; plus PostgresV2 on the unfiltered side) are 'unfiltered config' mocks and go to GetUnFilteredMocks; everything else (Mongo, gRPC, etc.) is a per-testcase mock and goes to GetFilteredMocks. Does not fix the pre-existing Mongo Fuzzer record_latest_replay_build flake that lives in the integrations mongo matcher — that test uses YAML anyway — but unblocks Mongo workloads in gob mode. Signed-off-by: slayerjain <shubham@keploy.io> * fix(mockdb): make gob writer lifecycle restartable + DeleteMocksForSet gob Two bugs in the same file addressing Copilot review round 3: 1. Re-record cycle: Recorder.Start auto-registers mockDB.Close via RegisterCleanup. Re-record orchestration invokes Recorder.Start repeatedly on the same mockDB instance — under the previous sync.Once + gobClosed latch, the second and subsequent sessions would lose every mock (writer goroutine exited after first Close; sync.Once prevented re-init). Rewrote the lifecycle with a mutex-guarded gobRunning flag: Close flushes the writer and flips gobRunning=false; the next ensureGobWriter observes !gobRunning and spawns a fresh goroutine with fresh channels. Verified with a local repro (two Insert/Close cycles, both mock files written with non-zero size). 2. DeleteMocksForSet only removed mocks.yaml, leaving a stale mocks.gob on disk that GetFilteredMocks would then prefer on replay. Now deletes both variants (missing-file is not an error) so mocks-only refresh actually refreshes the format currently in use. Also made the async gob writer's per-mock error log actionable: it now names the output directory and tells the operator to check disk/permissions, re-run with --debug, or fall back via KEPLOY_MOCK_FORMAT=yaml while triaging. Signed-off-by: slayerjain <shubham@keploy.io> * chore: address Copilot round-4 comments on mockdb + util + testdb 1. mockdb: extracted isUnfilteredMockKind() as the single source of truth for YAML/gob kind classification. Both GetFilteredMocks and GetUnFilteredMocks call it, removing the inline switch drift that could put PostgresV2 in different buckets depending on on-disk format. 2. mockdb: gobReopenLocked now truncates the target file on every open instead of O_APPEND. A gob stream's type table lives in the encoder; appending a second encoder session to an existing file (from a re-record cycle or a cross-test-set switchback) would embed a duplicate type table mid-file and break the reader's single gob.Decoder. Each session now owns its file exclusively. 3. mockdb: gobFlushAndClose returns the combined flush+close error, stored in gobFlushErr and surfaced by Close() so Recorder.Start's deferred cleanup logs the real disk-full/permission-change shutdown failure instead of silently losing the tail of mocks.gob. 4. mockdb: gobWriterLoop no longer takes InsertMock's ctx. A request- scoped ctx cancelling mid-session would exit the writer early and leave gobRunning=true with a dead consumer; the writer now shuts down only on gobStop (i.e. Close). 5. util: ReadBuffConn carry-over drain now loops until HasPending() reports false. Previously a single 64KiB Read could lose in-order delivery when a parser had prepended multiple buffers (peek + protocol-upgrade prefix). 6. testdb: nextTestIndex first-call now returns seed+1 (the next free index) instead of seed (which collided with the last existing test on disk). Race path still uses Add(1) on the winning counter. 7. tests: corrected gob_roundtrip_test.go comment that claimed the reader re-creates its decoder each iteration — it uses one decoder across the whole stream. Signed-off-by: slayerjain <shubham@keploy.io> * chore: tighten gob writer Close race + gate BPF pin cleanup Two correctness fixes from Copilot round 5: 1. MockYaml.Close() now holds the lifecycle lock for the entire teardown. The earlier version released the lock before waiting for gobDone, letting a concurrent ensureGobWriter reassign gobQueue/gobStop/gobDone while the previous writer goroutine was still selecting on them — a data race that could start two writer goroutines on the same mocks.gob and corrupt it. Now gobRunning stays true while we drain, concurrent Inserts block on the same lifecycle lock, and a second concurrent Close cannot double-close gobStop because it observes gobRunning=false after the first Close completes. Timeout path leaves gobRunning=true so a retry can complete the shutdown rather than racing with a half-drained writer. Passes go test -race. 2. CleanOrphanedBPFPins now internally gates on HasOtherKeployProcesses(os.Getpid()) and returns early when a sibling keploy is alive. Previously the safety precondition was 'the caller checks first', which is fragile — every new call site had to re-enforce it. Moving the check inside matches the cgroup-program cleanup path (DetectAndCleanOrphanedCgroupPrograms already gates the same way). Signed-off-by: slayerjain <shubham@keploy.io> * docs(cleanup): align queryAttachedPrograms comment with implementation The comment claimed 'bpf.ProgQuery via the raw syscall wrapper' but queryAttachedPrograms actually calls cilium/ebpf's link.QueryPrograms (the typed wrapper around BPF_PROG_QUERY). Reworded to match — so a future maintainer tracing 'which API surface does this depend on?' reaches the right place. Signed-off-by: slayerjain <shubham@keploy.io> * fix(testdb): revert nextTestIndex off-by-one FindLastIndex in pkg/platform/yaml/utils.go already does a trailing 'lastIndex++' before returning — it yields the next available index, not the largest existing one. The earlier round of comment-driven review mis-described this and I blindly added another +1, which would have made a fresh test-set start at test-2 and shifted every subsequent ID by one. Reverting to the original contract: seed the counter with the value FindLastIndex returned, return it directly on the first call, and use Add(1) on subsequent calls. Passes go test -race on the package. Signed-off-by: slayerjain <shubham@keploy.io> * chore: address Copilot round-6 comments - db.go Close(): track gobStopClosed so retries after a flush-timeout don't panic on closed channel. Previously gobRunning stayed true on timeout but gobStop was already closed; a second Close would hit close(closedChan) and panic. - db.go GetUnFilteredMocks YAML path: drop the inline kind switch and call isUnfilteredMockKind, same helper the gob path uses. Single source of truth — YAML and gob can no longer drift on which kinds fall into the 'unfiltered config' bucket. - db.go DeleteMocksForSet: error log now tells the operator to check read-only attributes, directory ownership, and open handles — the only failure modes that would surface here after our os.IsNotExist tolerance. - proxy.go: when SkipProxyListener is set, log 'TCP listener intentionally not bound' instead of 'Proxy started at port:N' so operators and downstream automation aren't told a port is open when it intentionally isn't. Signed-off-by: slayerjain <shubham@keploy.io> * chore: address Copilot round-7 comments - proxy.go: after replying 'S' to the Postgres SSLRequest, any non- nil error on Peek(5) is now treated as fatal. The previous branch only returned on non-EOF errors; an EOF with a partial buffer (len>0 but <5) would fall through into the TLS-detection path running against a truncated header and misclassify the stream as plaintext. Log now includes bytesRead for diagnostics. - db.go GetFilteredMocks gob path: special-case PostgresV2 so it reaches tcsMocks alongside the Mongo / gRPC kinds, matching the YAML path. YAML's GetFilteredMocks switch doesn't list PostgresV2 (so it falls through to 'include'); the gob path's !isUnfilteredMockKind was over-excluding it because the helper intentionally lists it in the unfiltered bucket. Replay behavior is now consistent across .gob and .yaml storage. Signed-off-by: slayerjain <shubham@keploy.io> * chore: address Copilot round-8 comments Five substantive fixes: 1. mockdb.insertMockGob now holds the lifecycle lock through the channel send. Previously the "writer is alive" check was done in ensureGobWriter but the actual send happened after the lock was released, so a concurrent Close could transition the writer to stop-closed between the two and the enqueued job would end up in a queue whose consumer had already drained+exited. With the lock held, the invariant 'any job we enqueue is seen by the current writer (or its drainAndClose)' holds atomically. Insert also now rejects enqueue attempts after stop-closed is set. 2. mockdb.drainAndClose records gobWriteOne errors into gobFlushErr (joined via errors.Join) so a disk-full / perm-change during the final drain actually surfaces through Close() to the Recorder.Start deferred cleanup log, instead of being silently discarded. 3. mockdb.DeleteMocksForSet: reject a testSetID that contains '..' or isn't a clean single-segment relative name. Previously, a re-record request with testSetID='../../etc' would let os.Remove delete files outside the mocks directory. 4. Rename gobWriteJob.testSet → testSetPath to match what it actually holds (a full '<MockPath>/<testSetID>' path, not the ID). zap field in the async-writer log renamed consistently. 5. proxy.go: when SkipProxyListener is on, emit the Info-level startup notice only once (in StartProxy near the DNS-control banner). The second log inside start() is now Debug — operators don't see the same message twice, but debug traces still have it. Passes go test -race -count=3 on the mockdb package. Signed-off-by: slayerjain <shubham@keploy.io> * chore: tighten testSetID validation + fix gofmt + test comment - gofmt: struct field alignment regressed after the testSet→testSetPath rename in the previous commit; reformatted via gofmt -w. - mockdb.DeleteMocksForSet: tightened validation — rejects empty IDs, ".", "..", and any '/' or '\\' separator (the previous guard only rejected unclean paths / '..', letting 'a/b' or '.' pass). The test-set layout is one directory level under MockPath, so a single non-empty segment is the only acceptable shape. - gob_roundtrip_test.go: TestMockYamlIsIOCloser comment updated to describe the RegisterCleanup + deferred drain shutdown mechanism (the type-assertion dance comment described the pre-PR pattern). Not changing the Metadata["type"] accesses Copilot flagged as nil-deref risks — Go's nil-map reads return the zero value without panicking, and this pattern is already used at pkg/platform/yaml/mockdb/db.go:339,879 in the pre-existing YAML paths. Adding a defensive nil check would be cargo-cult, not a real fix. Signed-off-by: slayerjain <shubham@keploy.io> * docs(proxy): document InterceptPostgresSSLRequest client-side-only scope Copilot round-9 flagged that the opt-in flag replies 'S' and MITMs TLS with the client, but tls.Dial to the upstream does a direct TLS handshake without forwarding the 8-byte SSLRequest preamble. A real Postgres server typically rejects direct TLS on 5432. Rather than add upstream forwarding in this PR (substantial feature work that needs its own design + test), the docstring now explicitly states the flag's current scope: - Client-side only: read SSLRequest from client, reply 'S', TLS MITM. - NOT a full end-to-end MITM against vanilla Postgres — for that path, use the parser-driven TLSUpgrader in the private parser set. - Usable today in (a) downstream builds that decrypt the client side and forward plaintext elsewhere, or (b) against upstreams that accept direct TLS on the destination port. The flag defaults to false, so this does not change any runtime behavior in the default build. A follow-up PR can add symmetric upstream SSLRequest forwarding if a production consumer needs it. Signed-off-by: slayerjain <shubham@keploy.io> * fix(proxy): revert to Peek(5) for default path; extend to 8 only on SSLRequest prefix The previous Peek(8) was unconditional and deadlocked plaintext protocols whose first client write is <8 bytes and then waits for a server response. Redis 'PING\\r\\n' (6 bytes), memcached 'quit\\r\\n' (6 bytes), and similar greetings would never deliver the 7th and 8th byte, hanging handleConnection until the TCP read deadline. New flow: - Peek 5 bytes (what the TLS record header and most prefix checks need). Same behavior as pre-this-PR main. - Only when InterceptPostgresSSLRequest is on and the 5-byte prefix matches the SSLRequest signature '00 00 00 08 04', do a second Peek(8) to verify the full 8-byte packet before the 'S' reply. The prefix is unique enough not to collide with TLS, CONNECT, or any other protocol greeting, so the 3-byte wait is deterministic — only a real SSLRequest client incurs it. Plus two doc/log fixes: - hooks.go: DetectAndCleanOrphanedCgroupPrograms internally gates on HasOtherKeployProcesses (as of the round-5 fix). The caller's comment still said the check was in the caller, and the caller emitted an Info log that duplicated the helper's own Info log on cleanup. Comment corrected, duplicate log removed. - proxy.go isPostgresSSLRequestPrefix helper added, documented alongside the full-signature helper. Signed-off-by: slayerjain <shubham@keploy.io> * fix: gob flush-err accumulation + DirectChunks ownership copy Three round-10 Copilot comments addressed: 1. mockdb.gobWriterLoop now accumulates per-mock encode failures into gobFlushErr (under gobMu), not just the Error log. Previously a steady-state failure in the hot path would be logged but Close() could still return nil — the operator would see warnings in the log but the cleanup handler would not flag the session as lossy. 2. mockdb.gobFlushAndClose now joins its flush/close error with any pre-accumulated gobFlushErr instead of overwriting it. The old assignment meant that a successful final flush masked earlier encode failures. Close now returns the union of errors from the whole session. 3. util.ReadBuffConn defensively copies each chunk from DirectChunks() before handing it to bufferChannel. Matches the allocation semantics of the regular ReadBytes path and makes producer-side scratch-buffer reuse safe. The DirectChunkReader docstring now states the ownership contract explicitly: chunks may alias a shared backing array, consumers must not rely on their lifetime past the next push. Signed-off-by: slayerjain <shubham@keploy.io> * fix(proxy): remove duplicate readyChan send that leaked skipListener goroutine readyChan is buffered size 1 and start() already sends exactly once on all code paths — skipListener branch, listen failure, and listen success. The goroutine's extra 'readyChan <- err' after start() returned would block forever because the caller only reads readyChan once: before this PR that was a latent main bug, but the new skipListener path made the goroutine leak visible on every shutdown (start returns nil after <-ctx.Done, the second send blocks, the goroutine never exits, errgroup.Wait hangs). Made start() the sole writer. Removed the post-return send. The goroutine still returns the error correctly for errgroup.Go to propagate. Also: reset gobOverflows to 0 on Close via atomic.Swap so re-record cycles on the same MockYaml instance don't accumulate the prior session's overflow count into the next session's Close summary. Signed-off-by: slayerjain <shubham@keploy.io> * fix(docker): apply ProxyPort=0 guard to darwin/colima/windows branches too The proxyPortStr helper (empty when opts.ProxyPort==0, '-p P:P' otherwise) was only being substituted into the linux docker-run alias. The windows, darwin-colima, darwin-docker, and windows-colima branches all kept the inline '-p ' + fmt.Sprintf(...) construction, so ProxyPort=0 on those hosts would still generate '-p 0:0' and fail docker run before the agent ever started. Replaced all four call sites with proxyPortStr. Guard now applies uniformly across every supported host — the skipListener use case (agent running without a bound proxy port) works on macOS and Windows too. Also: declining Copilot's readGobMocks &m 'pointer reuse' flag. Go allocates a fresh heap address for the block-scoped 'var m' each iteration whenever '&m' escapes via append, and the existing TestRoundTrip_MultipleMocksAppend test already validates that multi-mock gob reads return distinct mocks with correct fields. Verified independently with a repro (3 append(&m) gives 3 distinct heap addresses). The suggested 'm := new(...)' rewrite is semantically identical — it would be a no-op refactor, not a real fix. Signed-off-by: slayerjain <shubham@keploy.io> * fix(proxy): skipListener shutdown no longer closes session.MC The non-skip shutdown path closes session.MC only after clientConnErrGrp.Wait() guarantees every per-conn producer has exited. My skipListener branch closed the channel immediately on ctx.Done without that guarantee — DNS mock producers and any external capture-layer senders still running during shutdown could hit 'send on closed channel' and panic mid-teardown. Removed the close. MC's lifetime ends when all references drop; consumers driven by ctx.Done don't need an explicit close to stop ranging. Documented the rationale inline so a future edit doesn't reintroduce the close. Also made the nsswitch-reset error log actionable with a recovery hint. Signed-off-by: slayerjain <shubham@keploy.io> * fix(mockdb): hold lifecycle lock across sync-fallback write insertMockGob's queue-full fallback released gobLifecycleMu before calling gobWriteSync. A concurrent Close() could then flush+close the writer (setting gobFile=nil), and when the in-flight gobWriteSync acquired gobMu and called gobReopenLocked it would hit the O_TRUNC path — destroying every mock written earlier in the session. Keep the lifecycle lock held across the sync write instead. This serializes the degraded sync path against Close(), at the cost of serializing parsers during queue overflow. Since overflow only fires when the 4096-slot async queue is already full (itself a degraded-throughput mode), this is an acceptable tradeoff for the 'no dropped mocks' guarantee the overflow path exists to provide. Passes go test -race. Signed-off-by: slayerjain <shubham@keploy.io> * chore: delete unused ensureGobWriter + log BPF pin cleanup errors - mockdb.ensureGobWriter was dead code after round 8 inlined the writer-init inside insertMockGob (under the lifecycle lock so the sync-fallback race-fix works). Removed. Nothing else called it; doc comments in Close() and the overflow hint now point at the actual init site. - cleanup.CleanOrphanedBPFPins now surfaces os.Remove / os.RemoveAll errors instead of silently discarding them. Previously a failed unpin would go unreported and the subsequent attach would fail with 'program already attached' or 'pin exists' — root cause invisible. Log is Error level with an actionable next step (check bpffs writability, CAP_SYS_ADMIN/CAP_BPF, no other live keploy holding the pin). Signed-off-by: slayerjain <shubham@keploy.io> * chore: warn on gob+UpdateMocks + clarify extraparsers.go ref - mockdb.UpdateMocks: when gob format is active and mocks.gob exists, log a Warn explaining that pruning is YAML-only today and the gob file will grow. Users who run with KEPLOY_MOCK_FORMAT=gob + RemoveUnusedMocks=true previously got a silent no-op because the YAML-only read saw no mocks.yaml. Gob pruning (read stream, filter, rewrite fresh encoder session) is deferred to a follow-up; warning with a concrete next step is the responsible intermediate. - runtime_hooks.go: the InterceptPostgresSSLRequest docstring mentioned 'extraparsers.go' which doesn't exist in the OSS tree — it's generated at CI time by setup-private-parsers. Reworded so a maintainer reading the comment can trace where the Postgres parser is actually wired in (CI action + integrations repo), not hunt for a nonexistent file. Signed-off-by: slayerjain <shubham@keploy.io> * docs(mockdb): gob pruning skip log Warn→Info (per repo logging guidelines) The gob+UpdateMocks skip is a benign outcome with a documented next step, not a fault condition — the repo's logging guidelines discourage new Warn logs for that shape of message. Bumped the existing remediation text down to Info (no text change) so it still shows up for the user but doesn't raise the alert level. Signed-off-by: slayerjain <shubham@keploy.io> * chore: fix agentRegistartionMap typo + clarify mysql EOF log - Rename the misspelled field 'agentRegistartionMap' to 'agentRegistrationMap' so it matches 'clientRegistrationMap'. Both call sites updated (hooks.go + comm.go). No behavior change. - mysql/recorder/record.go: the io.EOF branch on handleInitialHandshake was logged as a TLS handoff 'short-circuit', but it also fires on ordinary client disconnect mid-handshake. Reworded the log to be neutral — 'EOF during MySQL handshake' with a conditional hint about TLS handoff — so ordinary disconnects are not misreported as TLS events in production logs. Also updated the PR body: section 4 is now labeled as opt-in (InterceptPostgresSSLRequest defaults false; default build still uses the parser-driven TLSUpgrader) and documents the scope limitation (client-side-only, no upstream SSLRequest forwarding). Signed-off-by: slayerjain <shubham@keploy.io> * chore: address round-19 Copilot comments - cleanup.go: log queryAttachedPrograms failures at Debug with actionable context instead of silently returning 0. Not all attach types support BPF_PROG_QUERY on every kernel, and insufficient privileges produce the same error shape; both turn orphan cleanup into a no-op, and the silent path left operators with no way to trace why the downstream attach failed with 'program already attached'. - mockdb.DeleteMocksForSet: dropped the overly-strict strings.Contains(testSetID, '..') substring check. It rejected perfectly valid single-segment names like 'v1..v2' or 'team..a'. The remaining checks (no separators, not absolute, not '.' or '..' verbatim, stable under filepath.Clean) already block every directory-escape case — a single-segment name with a '..' substring (but no separators) cannot resolve to a parent directory. Signed-off-by: slayerjain <shubham@keploy.io> * fix(mockdb): deep-copy Mock before async-gob enqueue insertMockGob enqueued the caller's *models.Mock pointer into the async writer and returned. A caller that subsequently mutated the same *Mock — e.g. RecordHooks.AfterMockInsert writing telemetry fields onto the same pointer, or a Mock-pool producer recycling slots — would race the async gob encoder and persist an unintended payload on disk. Fix: call mock.DeepCopy() before enqueue. The helper already exists in pkg/models/mock.go and handles the nested Noise / Metadata / per-protocol-slice fields. Copy cost is bounded by mock size. Not switching to 'encode to bytes synchronously' because each encoder session must share one type table for the reader's single-decoder stream — per-InsertMock byte-buffer encoding would embed a type table per mock and break replay. gobWriteSync (overflow fallback) does not need the copy: it runs synchronously on the caller goroutine while the caller is still blocked in InsertMock, so there's no window for external mutation during the encode. Signed-off-by: slayerjain <shubham@keploy.io> * fix(mockdb): UpdateMocks gob-skip keyed on file presence, not current run format The previous guard checked useGobMockFormat() before looking at mocks.gob. That diverged from the read paths: GetFilteredMocks and GetUnFilteredMocks prefer mocks.gob purely by file presence, regardless of what KEPLOY_MOCK_FORMAT is set to for this run. So a test-set recorded with gob last time but pruned today under a yaml-configured run would silently hit the YAML-only pruning code against a non-existent mocks.yaml — pruning would 'succeed' by doing nothing while replay still read the untouched gob file. Now the skip triggers on os.Stat(mocks.gob)==nil regardless of the current run's configuration, matching the replay side. User message updated to explain the state they're actually in ('gob is present, replay prefers it'). Signed-off-by: slayerjain <shubham@keploy.io> * refactor(mockdb): YAML GetFilteredMocks uses the shared isUnfilteredMockKind helper Before this PR the YAML path had its own Kind switch that excluded (Generic, Postgres, Http, Http2, Redis, MySQL, DNS) — default isFilteredMock=true let PostgresV2 and Mongo fall through to the append. The gob path I added in this PR was rewritten to use isUnfilteredMockKind + a PostgresV2 special case, so the classifier was shared for the unfiltered side but the filtered side still had two implementations — and the comment on the gob path claiming it was 'shared with YAML' was misleading. Replaced the inline switch with the same '!isUnfilteredMockKind(kind) || kind == PostgresV2' call the gob path uses. Behavior is identical: PostgresV2 continues to reach tcsMocks (matches the historical YAML fall-through), every other kind maps through the shared helper. Both filtered-mock paths now reference the same classifier, so a future kind addition only has to land in isUnfilteredMockKind to cover YAML and gob uniformly. Signed-off-by: slayerjain <shubham@keploy.io> * perf(cleanup): open cgroup FD once across keployBPFProgNames loop DetectAndCleanOrphanedCgroupPrograms iterates every entry in keployBPFProgNames and called detachOrphanedByName for each one — and detachOrphanedByName used to open/close the cgroup dir per call. That's 2 extra syscalls times the number of program names on every keploy startup. Moved the os.Open to the caller so the same *os.File is shared across every per-name query/detach in one cleanup pass. Function signature takes *os.File + the path (path kept for log context). No behavior change: same scan, same detach decisions, same safety gate (HasOtherKeployProcesses checked once before we open the FD). Passes go test -race. Signed-off-by: slayerjain <shubham@keploy.io> * ci: cap per-job timeout at 30min on Windows + fuzzer workflows A run of run_golang_native_windows/go-dedup burned 6h5m of windows-latest runner time. Root cause: keploy's Windows native capture layer (FFI) enters a tight 'Packet sent successfully!' loop if the intercepted process exits before the capture driver/DLL unloads. Job log showed 76,977 iterations in the last 4 min of runtime before GitHub Actions' default 6h hard timeout fired. Fixing that FFI shutdown path is a separate Windows-native change outside keploy#4045's scope. Pragmatic CI fix landing here: add 'timeout-minutes: 30' to every job that could potentially sit in a capture-layer shutdown loop: - .github/workflows/golang_native_windows.yml - .github/workflows/golang_docker_windows.yml - .github/workflows/fuzzer_linux.yml (all 5 fuzzer jobs) Healthy runs on these workflows finish in 3-5 min, so 30 min is ~5-10x headroom. 30 matches the existing timeout used on golang_docker_macos.yml, java_linux.yml, prepare_and_run_macos.yml, python_docker_macos.yml. A hung future run fails in 30 min rather than 6h — faster feedback, no stuck runner slot. The underlying Windows FFI shutdown bug is tracked for a separate fix in keploy's windows-native capture implementation. Signed-off-by: slayerjain <shubham@keploy.io> * fix(hooks/windows): cap StopRedirector shutdown at 10s The run_golang_native_windows/go-dedup job hit GitHub Actions' 6h runner timeout because keploy's Windows Hooks.unLoad blocked unconditionally on StopRedirector(). That call crosses into the Rust capture side (libwindows_redirector.a), which owns a WinDivert recv/send loop — if its stop path hangs (deadlock in the Rust side, slow wake of the recv thread), the Go process lingers until the OS tears it down, and the CI log fills with 'Packet sent successfully!' from the still-running loop. Wrap the StopRedirector call in a goroutine with a 10 s deadline: - happy path: Rust returns, we log the error (if any) and proceed - stuck path: we log an actionable 'driver didn't return in time' error naming the symptom and a pointer to src/ffi.rs, then return so the Go side finishes its own shutdown sequence This doesn't fix the underlying Rust shutdown bug (that's a separate change in the Rust redirector), but it prevents a single stuck capture from holding the whole keploy process — and therefore a CI runner — for 6 hours. Paired with the 30-min workflow timeout landed earlier, any future recurrence fails fast and visibly instead of silently burning runner time. Signed-off-by: slayerjain <shubham@keploy.io> * feat(proxy): gob mock pruning + Postgres SSLRequest upstream forwarding Two follow-ups folded in — both were documented as deferred in prior rounds; landing them here closes the scope gaps. 1. mockdb.updateMocksGob: when UpdateMocks hits a test-set backed by mocks.gob, it now reads the full stream via readGobMocks, applies the exact same filter as the yaml path (keep config mocks, consumed names, post-replay writes, pre-first-test startup mocks), and atomically rewrites a fresh single-encoder stream with the magic header. Quiesces any active async writer on this MockYaml via Close() before touching the file so a concurrent InsertMock cannot race the truncate-and-rewrite. Previously the path emitted an Info-log 'pruning is YAML-only' and returned nil; gob-recorded sessions with RemoveUnusedMocks=true would now see their mocks.gob actually shrink. 2. proxy.dialPostgresSSLUpstream: when agent.InterceptPostgresSSLRequest is on and destInfo.Port is 5432, the upstream dial now does SSLRequest + read 'S'/'N' + tls.Client.Handshake instead of tls.Dial. A real Postgres server on 5432 rejects a direct TLS handshake; the old code only handled the client side of the flag, so upstreams would fail. Non-5432 destinations still go through tls.Dial. 'N' from upstream surfaces a clear error: 'the upstream was built without TLS — enable SSL on Postgres or disable InterceptPostgresSSLRequest for this deployment'. Unexpected non-S/N bytes surface a 'not actually Postgres on 5432' error. 3. runtime_hooks.go InterceptPostgresSSLRequest docstring updated: the flag now covers both sides of the handshake for 5432 destinations; no longer labeled 'client-side-only'. Signed-off-by: slayerjain <shubham@keploy.io> * style: gofmt comment alignment in runtime_hooks.go Signed-off-by: slayerjain <shubham@keploy.io> * chore(windows): bump vendored libwindows_redirector.a Built from a companion repo workflow (branch fix/windows-shutdown-log-spam, commit f87c18c) by the new self-hosted Build libwindows_redirector.a workflow. Pulls in the shutdown and logging fixes that addressed the Windows go-dedup CI hang: bounded_join in HandleGuard's Drop with a shared shutdown budget, hot-path log macros instead of println! spam, and trace-level get_destination logging (off by default via fern's LevelFilter::Debug). sha256 3dd7a67e3522970edab08515b08c8d5a210988dcf9eaa019172e0b2a1961896b Signed-off-by: slayerjain <shubham@keploy.io> * fix: narrow syncMock recover() + preserve mocks.gob perms on prune Addresses two Copilot review comments on #4045 (round 21): 1. syncMock.AddMock: the defer func(){_=recover()}() around the closed-channel send swallows every panic, including unrelated bugs unrelated to the shutdown race. Narrowed to only swallow runtime.Error("send on closed channel"); anything else re-panics so real bugs still surface. Message text is the stdlib constant emitted by chansend1 for closed-channel sends, stable across Go versions. 2. updateMocksGob: os.CreateTemp creates the tmp file 0600, so without a chmod before os.Rename pruning silently narrowed mocks.gob from 0644 (writer's typical mode via umask 0022) down to owner-only, breaking replay for any other user/process. Stat the source file first, apply the same Mode to the tmp file before the rename; fall back to 0644 when the source is missing so we don't introduce a new mode-inheritance path. Signed-off-by: slayerjain <shubham@keploy.io> * fix(syncMock): type-assert to error, not runtime.Error, for closed-channel panic Copilot correctly flagged that runtime's plainError (the type used for 'send on closed channel' via chansend1) satisfies the error interface but does NOT implement the exported runtime.Error. The previous r.(runtime.Error) type assertion would fail and re-panic, defeating the shutdown-safety the recover is there to provide. Switched to r.(error) + message match — this is the pattern the Go stdlib tests use for this exact panic. Added two tests that pin down the behavior so a future refactor cannot silently regress: - TestAddMockSwallowsClosedChannelPanic: full path through AddMock with a closed outChan; the goroutine must return without propagating the panic. - TestAddMockRepanicsUnrelatedValues: table-driven, proves the recover is narrow (closed-channel error swallowed; other runtime errors and plain string panics re-panic). Signed-off-by: slayerjain <shubham@keploy.io> * chore(windows): rebump vendored libwindows_redirector.a Built from a companion repo workflow (branch fix/windows-shutdown-log-spam, commit 524c407) by the self-hosted Build libwindows_redirector.a workflow at run 24607337104. Incorporates the follow-up Copilot fixes landed on wr#2 since the previous bump: - panic payload downcast (&'static str / String) so bounded_join logs the actual panic message instead of a generic marker - 'rustup override set stable' instead of mutating the runner-wide default toolchain - explicit contents:write job permission for the release-attach step (GitHub Actions does not evaluate ${{}} in permissions, so kept static) - disk-space cleanup step before checkout (cargo registry/git, stale target/, runner _diag/_temp) to handle the shared self-hosted runner's C: volume filling up sha256 4776e6bbf879a3a74b111e0a4c2c6ed8f9ae67b20934701ddcf8794aa3a81fb7 Signed-off-by: slayerjain <shubham@keploy.io> * ci: bound build-and-upload + pull-docker-image Windows jobs at 30min Run 24608731505's build-and-upload job sat at 'in_progress' for 40+ min on windows-runner-02 with no step progress visible, blocking every other Windows job queued on the same runner (wr#2 stuck in queue behind it). prepare_and_run_windows.yml's build-and-upload and pull-docker-image jobs had no timeout-minutes, so a stuck CGO link or docker-pull hang could hold the runner for 6h. Added 30-min caps consistent with the ones already on golang_native_windows.yml and golang_docker_windows.yml. Healthy runs are ~5 min; 30 gives 6x headroom. Signed-off-by: slayerjain <shubham@keploy.io> * fix(proxy): serialize close(session.MC) with SyncMockManager sends Root-cause of the petclinic record_build_replay_latest hang on #4045. # What happened in CI run_java_linux / spring-petclinic-rest (record_build_replay_latest) failed at the 30-min workflow timeout. Last log line was the 80% progress marker during record; then ~28 minutes of silence before the operation-cancelled marker. Orphan processes at cleanup: keploy, java, sh, bash — all still alive. # Repro and RCA Ran the same record flow locally under -race: WARNING: DATA RACE Write at 0x00c0004b5820 by goroutine 46: runtime.closechan() proxy.go:576 close(p.session.MC) proxy.(*Proxy).start.func2 Previous read at 0x00c0004b5820 by goroutine 69: runtime.chansend() syncMock.(*SyncMockManager).AddMock.func1 generic.createGenericMocksAsync.func1 createGenericMocksAsync runs as a bare 'go' goroutine spawned from Generic.RecordOutgoing — it is NOT tracked by clientConnErrGrp, so the proxy shutdown path's errgroup.Wait() returns while the parser is still producing mocks. The shutdown goroutine then closes session.MC concurrently with AddMock's outChan <- mock. The prior recover() in AddMock swallowed the resulting 'send on closed channel' panic, but did not prevent the race itself. Under -race the runtime reports unsynchronized memory access, and the race correlated with the 28-min hang. # Fix Added sync.RWMutex (outChanMu) + outChanClosed bool to SyncMockManager. AddMock takes RLock across the send; a new CloseOutChan method takes Lock across close(outChan). This serializes every in-flight send against the close without contending parallel sends. The recover() stays as defence in depth for the narrow window between the closed check and the send. proxy.go:576 now calls mgr.CloseOutChan() instead of direct close. SetOutputChannel clears outChanClosed so re-record flows (same manager, new channel) keep working. # Verification $ go test -race ./pkg/agent/proxy/syncMock/ -run TestCloseOutChanRacesAddMock --- PASS: TestCloseOutChanRacesAddMock (0.00s) ok go.keploy.io/server/v3/pkg/agent/proxy/syncMock Local petclinic reproducer (10 rounds × 50 chains × ~5 requests, ~2500 requests total, 2555 mocks captured): zero data-race reports, 4-second clean shutdown on SIGTERM. The 28-min hang does not recur. New test TestCloseOutChanRacesAddMock pins the behavior: 8 goroutines × 500 sends racing against a midstream CloseOutChan stay clean under go test -race. Signed-off-by: slayerjain <shubham@keploy.io> * fix(proxy): address full mock-channel race class + wire DNS through the guard Applies the principal-engineer code-review to the 0985bf0 fix. That commit blocked only the AddMock vs close race; this commit covers the whole race class and drops the scaffolding that was only needed for the partial fix. # Changes 1. Single RWMutex (outChanMu) now guards both outChan and outChanClosed. Read outChan inside the RLock so sender observation and send are consistent under one lock, fixing the TOCTOU where a Close → SetOutputChannel(new) interleaving would have let AddMock send on its stale local copy of the already-closed channel. 2. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go) now routes its send loop through the same sendToOutChan helper. Its previous 'capture outChan under m.mu, release, send outside any lock' pattern was the same race class as AddMock's but on a different caller (DedupQueue.ResolveJob from mongo/dedup parsers). Without this fix, the mongo-dedup fuzzer would reproduce the same -race hit on a different line. 3. dns.go's async DNS mock path (pkg/agent/proxy/dns.go) now goes through SyncMockManager.SendConfigMock instead of a bare 'session.MC <- mock'. SendConfigMock bypasses AddMock's firstReqSeen buffering — DNS is a separate key-value map (one mock per unique (name, qtype) via the p.recordedDNSMocks dedupe), not a streaming parser, so the correct behaviour is always forward immediately. The outChanMu guard applies identically to AddMock and SendConfigMock, so DNS and parser paths close-vs-send-serialize the same way. 4. Dropped the recover() defence-in-depth inside AddMock: once outChan is read under the same RLock that gates outChanClosed, the closed-channel panic cannot happen, and a stray recover would only mask real bugs. 5. Dropped ResetForNewSession — unused; SetOutputChannel already clears outChanClosed, and piling a second way to do the same thing invites drift. 6. AddMock no longer buffers after CloseOutChan. Post-close, the mock is dropped instead of leaking into a buffer that ResolveRange will never drain. (Matches the old recover-based drop semantics; preserves buffering when outChan is simply unbound.) # Tests All under go test -race. Added: - TestCloseOutChanRacesResolveRange: the ResolveRange equivalent of the AddMock race test. 4 goroutines × 200 ResolveRanges concurrent with a midstream CloseOutChan. Pre-fix this would race; post-fix clean. - TestSetOutputChannelAfterCloseReopens: pins the Close → SetOC → AddMock sequence. Regression for the TOCTOU surfaced in code review. - TestSendConfigMockDropsAfterClose: DNS path post-close drops silently. - TestSendConfigMockIgnoresFirstReqSeen: SendConfigMock keeps forwarding after SetFirstRequestSignaled (AddMock would buffer — DNS must not). - TestCloseOutChanRacesSendConfigMock: concurrent DNS-path senders vs close. Kept and updated: - TestAddMockAfterCloseDropsSilently (was TestAddMockSwallows ClosedChannelPanic): same semantic pinned, new mechanism. Removed the TestAddMockRepanicsUnrelatedValues harness — the recover() it was pinning is gone. # Verification $ go test -race ./pkg/agent/proxy/syncMock/... PASS (8 tests) Full petclinic reproducer, 10 stress rounds × 50 chains × ~5 requests/chain, ~2500 requests total: - 2590 mock files captured - Zero 'WARNING: DATA RACE' lines in the keploy log - Clean SIGTERM shutdown in 4 s (down from the 30-min timeout observed on the bad run) Signed-off-by: slayerjain <shubham@keploy.io> * fix(proxy,ci): address round-24 Copilot review on syncMock refactor Four follow-up fixes on ee0332e plus the Windows runner disk-full that failed this PR's own build-and-upload job. 1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the pre-release read of m.outChan was still under m.mu alone, racing SetOutputChannel/CloseOutChan which write it under outChanMu. Switched to outChanStatus() — same RLock path the send uses — so the consistency check is now under the lock that protects the field. 2. SetOutputChannel idempotence: the hot path is buildRecordSession re-binding rule.MC (same pointer) on every accepted connection. Previously it unconditionally cleared outChanClosed, so a post-shutdown re-bind turned a closed channel back 'open' and the next send would panic. Now only clears the flag when the channel POINTER actually changes (new session / re-record). New test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins this. 3. dns.go kept calling SetOutputChannel on every mock. With the idempotence fix in #2 above it's now safe, but the comment block is updated to spell out why, and the call is ordered before AddMock/SendConfigMock so the first DNS query (which may fire before the first TCP connection triggers buildRecordSession) still binds the channel. 4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest is 'client-side only' — it isn't, dialPostgresSSLUpstream re-originates SSLRequest upstream when dest port is 5432. Rewrote the comment to match actual behaviour and point at the runtime_hooks.go docstring. 5. prepare_and_run_windows.yml build-and-upload failed on ee0332e with 'There is not enough space on the disk' writing go-build importcfg under SystemTemp. Ported the tiered disk cleanup from a companion repo workflow: fast path on >=4 GB free, light sweep (SystemTemp\go-build*, sibling runners' _diag / _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows Update cache), last-resort go module cache purge, hard-fail under 2 GB. Verification: go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass petclinic local 5-round stress, 1295 mocks captured, zero WARNING: DATA RACE, 4 s clean SIGTERM shutdown. Signed-off-by: slayerjain <shubham@keploy.io> * fix(proxy,mockdb): round-25 Copilot — ctx in gob prune, ResolveRange drops stale on close Three items from Copilot round 25 on ee0332e/52c9ba5, plus the PR description update (applied separately via gh api PATCH). 1. pkg/platform/yaml/mockdb/db.go — updateMocksGob ignored its ctx (`_ = ctx`). Wired ctx.Done() checkpoints at the entry of the function and every 1024 iterations of the filter and encode loops so a cancelled recorder aborts a large prune instead of finishing it for no one. 1024 is a cheap stride: syscall cost is amortised while cancellation latency stays under a few ms even on ~10^5-mock test-sets. 2. pkg/agent/proxy/syncMock/syncMock.go — ResolveRange was only snapshotting the `bound` boolean from outChanStatus and discarding `closed`. That conflated 'not wired yet' with 'shutdown in progress': both fell through to the retain path, so any re-record session that rebound a fresh channel via SetOutputChannel could inherit stale mocks from the previous session's buffer. Thread `closed` through and drop instead of retaining in both branches (window-match and retention). New test: TestResolveRangeDropsStaleMocksAfterClose pins both branches clean across the close boundary. Verification: - go test -race ./pkg/agent/proxy/syncMock/... ./pkg/platform/yaml/mockdb/... (all pass) PR description also updated to match the actual upstream-side SSLRequest scope (was mistakenly labeled 'client-side-only' — dialPostgresSSLUpstream re-originates SSLRequest on port 5432). Signed-off-by: slayerjain <shubham@keploy.io> * fix(mockdb): reject volume-qualified testSetID in DeleteMocksForSet Copilot round 26 on keploy#4045: the existing path-traversal guard on DeleteMocksForSet covered separators, absolute paths, and '.'/'..' but not Windows volume qualifiers. 'C:' has no separators, isn't absolute on Linux, and Cleans to itself — but on Windows filepath.Join(base, 'C:') can drop the base and resolve at the drive root, turning os.Remove into an arbitrary-file delete on a self-hosted Windows runner. Added two layered rejections: 1. filepath.VolumeName(testSetID) != '' — catches the case at runtime on Windows (returns 'C:' for 'C:', '\\server\share' for UNC, '\\?\C:' for extended-length, all non-empty). 2. strings.ContainsAny(':' ...) — filepath.VolumeName is Windows-specific and returns '' on Linux for 'C:', so the explicit colon check makes the Linux build reject the same strings the Windows runtime would. Legitimate test-set names never carry a colon. strings.HasPrefix for '\\' is covered by the existing ContainsAny('/\\') check. Test TestDeleteMocksForSetRejectsVolumeQualifier covers four shapes: 'C:', 'D:foo', '\\srv', '\\?\C:'. Also replied to Copilot's secondary comment on readGobMocks:line 961 (same-iteration &m reuse claim) explaining why it's cargo-culted — the var is declared inside the for-body, not in the for-init, so each iteration gets a fresh heap-escaping allocation. The existing TestGobRoundtrip* suite would fail if the claim were correct. Signed-off-by: slayerjain <shubham@keploy.io> * ci(windows): use Windows PowerShell 5.1 instead of pwsh on self-hosted runners The new win-runner-[1-4] self-hosted pool that came online after the old windows-runner-0[123] were retired does not ship PowerShe…
slayerjain
added a commit
that referenced
this pull request
Apr 19, 2026
The Windows go-http-no-deps record phase on #4076 run 24629542035/job/72014589521 hit a 0-tests-recorded failure because the readiness probe broke on a single 200 response, and that first 200 came from a transient Docker Desktop port-publish stub rather than the in-container Go binary. Four seconds later the real traffic got "Unable to connect to the remote server" on the first request, and the single try/catch wrapping all six calls swallowed the error with $sent=0. No test-set-0 directory ever formed, and the script exited with "Test directory not found". Three-part fix, applied identically to the docker and native variants under golang/go-dedup: 1. Readiness probe now requires $stabilityCount (=3) consecutive 200 responses with a 1s gap before declaring ready. A single flap from a port-forwarder stub can no longer race real traffic. 2. After the probe succeeds, settle for $settleSec (=5) before firing traffic so keploy's recording-proxy interception layer has time to fully wire up against the container's listener. 3. Replace the all-in-one try/catch around the six HTTP calls with an Invoke-WithRetry helper (5 attempts, 1.5x exponential backoff). A transient connection-refused on call #1 no longer aborts calls #2..#6. If the probe never reaches the stability threshold inside the deadline, the script still proceeds so downstream record validation can surface the failure with its own clearer error message, instead of us silently pretending the app was ready. Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain
added a commit
that referenced
this pull request
Apr 19, 2026
…ilently drop mocks (#4076) * fix(syncMock): bounded-block sendToOutChan so burst capture doesn't silently drop mocks sendToOutChan used a non-blocking send with a default-drop: select { case m.outChan <- mock: default: } Under burst capture on an oversubscribed runner (CI, a customer's box under GC pressure, downstream agent momentarily backpressuring the outChan), this silently dropped pre-first-request mocks. Customers saw "some calls didn't replay" with no actionable log signal — the anti-pattern the Windows redirector shutdown loop hit. Replace with: 1. Non-blocking fast path (unchanged — the common case where the consumer is keeping up has zero latency cost). 2. On fast-path miss, fall through to a bounded block (sendBudget = 200 ms). This absorbs a GC pause or a transient downstream stall without losing data. 3. On bounded-block timeout, increment a process-global dropCount and emit a rate-limited Warn (first drop immediately, then every 1024th). Operators see the drop instead of wondering why replay is missing mocks. The RWMutex contract is preserved: the send still runs under outChanMu.RLock, so CloseOutChan (write-lock holder) can't interleave a close between the not-closed check and the send. The only new cost is CloseOutChan may now wait up to sendBudget for in-flight sends — imperceptible vs. a process shutdown, and every concurrent sender was already racing the same RLock. Fixes a recurring Redis loss-rate / channel-race flake on keploy/enterprise's CI where the tests feed syncMock via SetOutputChannel and never call SetFirstRequestSignaled, so every recorded mock flows through this path. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(syncMock): plumb logger, atomic.Uint64 drop counter, drop-path tests Addresses the Copilot feedback on the bounded-send PR: - Plumb *zap.Logger through SyncMockManager (SetLogger) and log the overflow via that logger instead of zap.L(). syncMock is a package-level singleton that loads before any zap.ReplaceGlobals, so zap.L() silently fell back to Nop and the drop Warn never reached operators. Proxy.New now wires the proxy logger in once. - Promote the sampled drop log from Warn to Error with a shorter, actionable message so overflow is loud and grep-friendly. - Switch dropCount from bare uint64 + atomic.AddUint64 to sync/atomic.Uint64 so the 32-bit alignment requirement becomes a compile-time property instead of a field-ordering footgun. - Export DropCount() for tests and external observability. - Add regression tests covering: no-drop when the consumer drains within sendBudget, drop + Error-log + dropCount increment past budget, sampling cadence (n=1, 1024, 2048), nil-logger Nop fallback, and concurrent-increment atomicity under load. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * ci(windows/go-dedup): robust app-ready probe + per-request retry The Windows go-http-no-deps record phase on #4076 run 24629542035/job/72014589521 hit a 0-tests-recorded failure because the readiness probe broke on a single 200 response, and that first 200 came from a transient Docker Desktop port-publish stub rather than the in-container Go binary. Four seconds later the real traffic got "Unable to connect to the remote server" on the first request, and the single try/catch wrapping all six calls swallowed the error with $sent=0. No test-set-0 directory ever formed, and the script exited with "Test directory not found". Three-part fix, applied identically to the docker and native variants under golang/go-dedup: 1. Readiness probe now requires $stabilityCount (=3) consecutive 200 responses with a 1s gap before declaring ready. A single flap from a port-forwarder stub can no longer race real traffic. 2. After the probe succeeds, settle for $settleSec (=5) before firing traffic so keploy's recording-proxy interception layer has time to fully wire up against the container's listener. 3. Replace the all-in-one try/catch around the six HTTP calls with an Invoke-WithRetry helper (5 attempts, 1.5x exponential backoff). A transient connection-refused on call #1 no longer aborts calls #2..#6. If the probe never reaches the stability threshold inside the deadline, the script still proceeds so downstream record validation can surface the failure with its own clearer error message, instead of us silently pretending the app was ready. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * ci(windows/go-dedup): literal port substitution + patched YAML dump Root cause of the #4076 run 24629876059 failure: the `(?m)("?)8080:8080\1` backref regex intermittently produces a malformed port line on Windows PowerShell 5.1 — the leading quote is stripped while the trailing one is kept, yielding `- 54652:8080"`. docker-compose parses that as `containerPort: 8080"` and rejects it with `invalid containerPort: 8080"`. No container starts, every HTTP call gets ECONNREFUSED, and the record phase records zero tests. My earlier readiness-probe + retry work on this PR correctly surfaced the downstream symptom (consecutive 200s never observed, all six traffic requests exhaust their retry budget), but it was papering over the actual bug: the YAML itself was broken before docker-compose ever ran. No amount of client-side resilience can recover from a container that was never started. This change: 1. Replaces the backref regex with a plain `.Replace("8080:8080", "<port>:8080")` — the samples-go/go-dedup compose file uses the double-quoted short form exclusively, so literal substitution is both simpler and provably correct. 2. Validates that the expected port fragment is present after substitution, throwing loudly with the unpatched YAML printed if the sample ever introduces a form the script doesn't handle. Silent fall-through to an unpatched run was what let this slip undetected for so long. 3. Dumps the patched YAML to the job log so any future containerPort / hostPort diagnostic lands with the exact text docker-compose parsed — no more guessing whether the regex behaved. Native-windows script is left untouched — it runs the Go binary directly, not dockerized, and does not hit this code path. Signed-off-by: slayerjain <shubhamkjain@outlook.com> --------- Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain
added a commit
that referenced
this pull request
Apr 19, 2026
Two orthogonal fixes to stop unrelated Linux CI flakes from blocking the Windows CI work on #4077. gin-mongo: one record iteration on run 24631193918 hung for 140+ minutes and never produced a log we could RCA from. Three things made it possible for a single keploy misbehavior to stall the entire job: 1. The "wait for app to respond" loop had no deadline — any issue on the keploy/mongo/docker side where the app never started would loop on curl + sleep forever. Now bounded to 5 minutes per iteration; fails the iteration with a clear error instead of stalling. 2. SIGINT was best-effort — the script sent SIGINT, waited up to 30s, and then proceeded to `wait`, which blocks forever on the tee'd stdout of a keploy record process that hasn't exited. Now: if keploy is still alive after the 30s grace, send SIGQUIT first (Go runtime dumps all goroutine stacks into the already-tee'd app log — exactly the RCA data we were missing), then SIGKILL. Future hangs leave a post-mortem in ${app_name}.txt. 3. (Consequence of #2) `wait` now resolves promptly because the background pipeline closes. pokeapi: run 24631193918/job/72018929505 failed with a TLS handshake timeout from proxy.golang.org when go build fetched github.com/go-chi/chi@v1.5.5. go build has no built-in retry for module download; one transient handshake drop kills the job. Wrap in a bounded retry loop with exponential backoff (5s, 10s, 20s, ...) and an explicit GOPROXY fallback to direct. Neither script touches keploy Go code. Both fixes are scoped to their respective samples' runners. Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain
added a commit
that referenced
this pull request
Apr 19, 2026
…replay stderr (#4077) * ci(windows/go-dedup): gate probe on keploy PID, neutralize replay stderr Two orthogonal failures surfaced on #4076 run 24630134970 once the port-patching RCA from 2fdd805 was in: 1. The readiness probe exits after ~5s because the loop gates on `$recJob.State -eq 'Running'` and the Start-Job tailing the keploy log files intermittently drops out of 'Running' on Windows when Get-Content -Wait races a log rotation. The tail job is a diagnostic helper, not the signal we want — the real "is keploy still working" fact is whether the keploy process itself (tracked in $REC_PID) is alive. Switch the loop terminator, drop the fragile dependency on the tail job. 2. Replay completes successfully but the script exits 1 because `& $env:REPLAY_BIN @testArgs 2>&1 | Tee-Object` promotes docker's benign stderr lifecycle lines ("Container … Recreate") to a terminating NativeCommandError under ErrorActionPreference=Stop. The pass/fail truth is the report YAML we validate immediately after, not PowerShell's opinion of native-command stderr. Temporarily flip EAP to 'Continue' just for the replay invocation and record $LASTEXITCODE so the downstream report check has more context if it fails. Both fixes applied symmetrically to the docker and native variants; the native script was carrying the same two patterns even though only the docker one has failed so far. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * ci(windows/go-dedup): pass --port to replay so it hits the published host port Record captures the testcase URL as `localhost:8080` — the container-internal port the Gin app actually binds to. The script patches docker-compose to publish on `$appPort`:8080 (random high port) to sidestep vpnkit / wsl's chronic hold on :8080 on Windows Docker Desktop runners. That worked for record (which hits $appPort from the host directly), but broke replay. keploy test re-dials the recorded URL (`localhost:8080`) on the host. On Windows there's no host→container bridge at :8080 with our patched port mapping — only :$appPort. All 9 recorded testcases on run 24630753752/job/72017828188 failed with ACTUAL STATUS 0 and "dial tcp [::1]:8080: connectex: No connection could be made because the target machine actively refused it." keploy's `test --port` flag is exactly for this — it rewrites the testcase destination at replay time, so recorded `localhost:8080` becomes `localhost:$appPort` and lands on docker's actual publisher. The flag is documented as "Custom http port to replace the actual port in the testcases" (cli/provider/cmd.go:345). Docker variant only — native-windows runs the Go binary directly, no port mapping to rewrite. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * ci(linux): bound gin-mongo runtime + retry pokeapi go-build Two orthogonal fixes to stop unrelated Linux CI flakes from blocking the Windows CI work on #4077. gin-mongo: one record iteration on run 24631193918 hung for 140+ minutes and never produced a log we could RCA from. Three things made it possible for a single keploy misbehavior to stall the entire job: 1. The "wait for app to respond" loop had no deadline — any issue on the keploy/mongo/docker side where the app never started would loop on curl + sleep forever. Now bounded to 5 minutes per iteration; fails the iteration with a clear error instead of stalling. 2. SIGINT was best-effort — the script sent SIGINT, waited up to 30s, and then proceeded to `wait`, which blocks forever on the tee'd stdout of a keploy record process that hasn't exited. Now: if keploy is still alive after the 30s grace, send SIGQUIT first (Go runtime dumps all goroutine stacks into the already-tee'd app log — exactly the RCA data we were missing), then SIGKILL. Future hangs leave a post-mortem in ${app_name}.txt. 3. (Consequence of #2) `wait` now resolves promptly because the background pipeline closes. pokeapi: run 24631193918/job/72018929505 failed with a TLS handshake timeout from proxy.golang.org when go build fetched github.com/go-chi/chi@v1.5.5. go build has no built-in retry for module download; one transient handshake drop kills the job. Wrap in a bounded retry loop with exponential backoff (5s, 10s, 20s, ...) and an explicit GOPROXY fallback to direct. Neither script touches keploy Go code. Both fixes are scoped to their respective samples' runners. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * ci: fix pgrep pattern that silently missed keploy PID on Linux runners ## RCA On #4077 run 24631193918 the `gin-mongo (record_latest_replay_build)` job hung for 140+ minutes. The culprit was a pgrep-pattern bug in the CI bash helpers, not a hang in keploy itself. Every affected script does: "$RECORD_BIN" record -c "./someApp" 2>&1 | tee app.txt & ... REC_PID="$(pgrep -n -f 'keploy record' || true)" sudo kill -INT "$REC_PID" ... wait # at the bottom of the outer loop The Linux CI artifact is named `keploy-record` (hyphen). `pgrep -n -f 'keploy record'` — with a literal SPACE, not a hyphen — never matches a command line of the form `keploy-record record -c …`. `pgrep` exits non-zero, `|| true` swallows it, `REC_PID` is empty, and `sudo kill -INT ""` degenerates into a silent no-op ("failed to parse argument"). Keploy therefore never receives the shutdown signal, keeps running indefinitely, its tee'd stdout never closes, and the terminal `wait` at the bottom of the outer loop blocks on that pipe until GitHub Actions' 6-hour hard timeout kills the job. Reproduced the empty-pgrep and silent-no-op locally with the same binary naming convention (full repro stored in /tmp/ginmongo-repro): keploy stays alive, `kill: failed to parse argument: ''` is printed, and subsequent iterations see a stale keploy holding :27017 and 8080. ## Fix Replace the hard-coded literal with the basename of `$RECORD_BIN`, which is always set before these scripts run: REC_PID="$(pgrep -n -f "$(basename "${RECORD_BIN:-keploy}") record" || true)" Matches both `keploy record …` and `keploy-record record …` regardless of which artifact the job picked up (docker builds produce `keploy`; the linux native artifact produces `keploy-record`). Defaults to `keploy` so manual invocations (`bash script.sh` with no env) still work. ## Scope Applied to all 13 bash scripts carrying the buggy pattern: fuzzer/{mongo,postgres,mysql,rerecord}, golang/{gin_mongo/*, risk_profile,dns_mock,echo_sql/docker,proxy_stress_test/docker}, node/express_mongoose, python-docker, python/django_postgres. ## Defense-in-depth (already in 76e9dc6) The gin-mongo helper also gained SIGQUIT→SIGKILL escalation after 30s, so if pgrep ever misses again we get goroutine stacks in the tee log instead of another blank 6-hour timeout. Signed-off-by: slayerjain <shubhamkjain@outlook.com> --------- Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain
added a commit
that referenced
this pull request
Apr 24, 2026
…ext_step, block-comment allowlist, honest chaos skip Responds to Copilot tick #2 on #4113: - fakeconn/fakeconn.go: Read/ReadChunk now re-fetch the deadline channel on every loop iteration. A concurrent SetReadDeadline closes a "deadline changed" broadcast channel that the select observes, the loop reloads both channels, and the new deadline takes effect on in-flight reads. Previously the dlCh was captured once and concurrent SetReadDeadline updates were invisible to blocked readers. - util/util.go: RecoverWithoutClose's Error log now includes the recovered panic value and an actionable next_step (supervisor falls through to passthrough; file the Sentry issue with the parser owner or set KEPLOY_NEW_RELAY=off). No shortcut suppression; the log is genuinely actionable. - tools/lint/no_timestamp_in_parser/analyzer.go: collectSuppressLines now matches the marker in both line comments ("// allow:time.Now") and block comments ("/* allow:time.Now ... */"), matching the documented contract. ast.Comment.Text retains outer delimiters so we strip them explicitly before the prefix check. - tests/e2e/chaos-broken-parser: the in-process V2 proxy wiring is still stubbed (broken_parser.go returns errChaosNotYetWired). Rather than pretend the test can pass, three aligned fixes: * chaos_test.go now passes -tags chaos_broken_parser to `go run` so the tagged path is actually compiled. * chaos_test.go adds a KEPLOY_CHAOS_WIRING gate; without it the test skips with a clear "wiring not yet implemented" message. * harness/main.go propagates errChaosNotYetWired from startBrokenParserProxyIfEnabled and skips the supervisor- fallback invariant when the wiring is absent (I1 and I3 still apply — they are properties of the docker-compose stack itself). errChaosNotYetWired is now declared in broken_parser_stub.go so main.go can reference it regardless of build tag. All tests pass under -race. Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain
added a commit
that referenced
this pull request
Apr 24, 2026
…ve migration (#4113) * feat(proxy): supervisor + relay + fakeconn foundation for parser isolation Introduce the record-mode architecture that isolates integration parsers from real network sockets, so a parser panic, hang, or resource exhaustion bug cannot affect the application's live TCP traffic. New packages (pkg/agent/proxy/): - fakeconn: read-only net.Conn populated with timestamped Chunks by the relay. Write always returns ErrFakeConnNoWrite so parsers cannot accidentally corrupt peer traffic. Deadlines, close, chunk blocking reads all covered by unit tests. - directive: parser -> supervisor control protocol for mid-stream operations (TLS upgrade, abort, finalize, pause, resume). Parsers no longer touch *tls.Conn directly; the relay handles both sides of the handshake. - supervisor: panic firewall + activity-based hang watchdog + memory cap wrapping each parser goroutine. Session type is a superset of the legacy integrations.RecordSession surface plus an incomplete-mock gate (EmitMock / MarkMockIncomplete / MarkMockComplete). - relay: sole owner and writer of real client/dest sockets. Stamps ReadAt/WrittenAt on every chunk at the syscall boundary and tees a copy into the parser's FakeConn via a bounded channel with drop accounting on memoryguard pressure / per-connection cap / channel full. Directive processor handles TLS upgrade by swapping the conn pointer under a pause barrier. - util.RecoverWithoutClose: companion to util.Recover that catches panics without closing sockets, for parsers that do not own the real sockets under the new path. - util.KillSwitch: process-wide parsing disable via env (KEPLOY_DISABLE_PARSING), SIGUSR1, or Trip/Reset. Unix/Windows build-tag split for the signal handler. Integration surface (additive, opt-in): - integrations.RecordSession gains V2 *supervisor.Session; un-migrated parsers ignore it and continue on the legacy path unchanged. - integrations.IntegrationsV2 marker interface (IsV2() bool) lets migrated parsers opt in. - proxy.handleConnection: checks util.DefaultKillSwitch.Enabled() first (routes to globalPassThrough), then matchedParser as IntegrationsV2 with IsV2()==true (routes to recordViaSupervisor), else legacy path untouched. KEPLOY_NEW_RELAY=off forces legacy even for V2 parsers as a rollback knob. - recordViaSupervisor runs the parser under the supervisor; on FallthroughToPassthrough (panic/hang/mem-cap) hands the real sockets to globalPassThrough so user traffic continues regardless of parser state. - newProxyTLSUpgradeFn adapts the existing pTls.HandleTLSConnection helper and crypto/tls.Client into the relay's TLSUpgradeFn shape. End-to-end integration tests (pkg/agent/proxy/v2_integration_test.go) prove invariants I1-I5 / I7: - TestV2_HappyPath_ChunkTimestampsCarried: traffic flows end-to-end, parser reads Chunks from FakeConns, emits mocks anchored to chunk.ReadAt / chunk.WrittenAt. - TestV2_PanicDoesNotBlockTraffic: panic caught, PanicReporter invoked, zero mocks emitted, client STILL receives the server's reply through the relay. - TestV2_HangDetected: activity-based watchdog fires within budget, supervisor returns StatusHung + FallthroughToPassthrough. - TestV2_KillSwitchLifecycle: Trip/Reset/Enabled round-trip. Zero behavior change for current traffic: no parser implements IsV2() yet, so every connection continues on the legacy path. Subsequent PRs migrate parsers one at a time. All new packages have unit tests that pass under -race. Full pkg/agent/proxy/... test suite passes; go vet clean; gofmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: slayerjain <shubhamkjain@outlook.com> * feat(http): native-migrate record path to V2 FakeConn architecture Signed-off-by: slayerjain <shubhamkjain@outlook.com> * feat(mysql): native-migrate record path to V2 FakeConn architecture Signed-off-by: slayerjain <shubhamkjain@outlook.com> * feat(generic): native-migrate record path to V2 FakeConn architecture Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(mysql): annotate record_v2 MITM TLS config for CodeQL The V2 MySQL recorder's buildDestTLSConfigV2 sets InsecureSkipVerify=true because keploy acts as a MITM between the app and the real upstream — the same pattern used elsewhere in keploy's record-mode TLS handling. CodeQL flags the new occurrence as go/disabled-certificate-check; document the MITM rationale on the call site and suppress with both lgtm and gosec directives so the existing keploy pattern keeps passing security scans. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(mysql): use system trust store in V2 dest TLS config CodeQL flagged the previous InsecureSkipVerify=true as a new security alert. Replace it with default verification against the system trust store. When the upstream's cert doesn't verify (self-signed, custom CA), the supervisor's fallback-to-passthrough ensures traffic still flows — the mock is dropped but the app's connection continues. This is strictly safer and aligns with the stability-over-fidelity trade- off documented elsewhere in the V2 architecture. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * feat(lint): forbid time.Now in parser record files (I5 invariant) Signed-off-by: slayerjain <shubhamkjain@outlook.com> * test(e2e): chaos test for broken parser — panic should not break app's DB connection Signed-off-by: slayerjain <shubhamkjain@outlook.com> * docs+shutdown: architecture README, matcher contract, parser migration guide, coordinated shutdown - pkg/agent/proxy/README.md: V2 architecture overview, safety invariants I1-I8 with enforcement mapping, migration pattern skeleton, test coverage summary, rollout knobs. Preserves legacy doc at the bottom. - docs/reference/mock-matcher-contract.md: pins what the replay matcher consumes from every mock (ReqTimestampMock, ResTimestampMock, Kind, ConnectionID, reserved Metadata keys). Documents the I5 timestamp rule and the per-session monotonicity clamp. - docs/contributing/parser-migration-guide.md: step-by-step recipe for porting a legacy parser to V2, including mid-stream TLS handling via the directive channel and the 'things that bite people' list. - proxy.StopProxyServer: coordinated shutdown sequence -- trip kill switch first (in-flight accepts bypass parsers), close listener, drain grace 5s, then force-close tracked client connections. Matches PLAN.md §3.7 for the userspace half of invariant I8. The kernel-side proxy_ready BPF gate requires BPF source changes (source not in this repo) and is documented as deferred. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * docs(chaos-e2e): correct stub header — V2 packages exist on branch Original stub was written in a worktree based off an older snapshot and documented the V2 packages as missing. They exist. Update the header to reflect what's actually left: define the broken parser type implementing IntegrationsV2, register with high priority synthetic IntegrationType, instantiate an in-process proxy with a stub MockManager. A follow-up PR can close the loop. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * style: gofmt fix broken_parser stub header indent Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(proxy): address Copilot review — record-only kill switch, relay-continues-on-fallback, logging guideline, godoc, benign errors Responds to Copilot review on #4113: - proxy.go: kill switch now only gates MODE_RECORD; MODE_TEST replay dispatch is unaffected so mocks keep matching when record-time parsing is disabled. - proxy_v2.go (architectural): on supervisor FallthroughToPassthrough, do NOT cancel the relay before starting passthrough — the relay keeps forwarding client<->dest bytes raw until peer close. The old code stopped forwarding during the fallback gap, defeating the I1 invariant. Removes the redundant globalPassThrough call (the relay IS the passthrough once the tee is cut via SessionOnAbort). Log downgraded to Debug with actionable next_step guidance. - proxy_v2.go: clarified comment on why ErrGroup remains populated on the V2 path (legacy integration helpers access it via ctx-value in shared code, migration tracked). - relay/directive_proc.go: TLS-upgrade failures log at Debug with directive_reason + next_step instead of Warn (expected condition, ack surfaces the error to the parser). - relay/relay.go: isBenignNetErr now matches net.ErrClosed and the "use of closed network connection" text the doc claimed was already handled. - directive/doc.go: godoc link updated to [Ack] (type is Ack, not DirectiveAck). - util/kill_switch.go: comment on envDisableParsing no longer claims it's exported — it's a package-local constant. - util/util.go: RecoverWithoutClose's sentry.Flush now runs only on the actual panic path, not on every deferred call. - supervisor/supervisor.go: hang-abort log downgraded from Warn to Debug with next_step (tune HangBudget, or set KEPLOY_NEW_RELAY=off). - fakeconn/fakeconn.go: synthetic chunks drained from the Read-stash now carry the source chunk's ReadAt/WrittenAt/Dir; previously the doc claimed this but the impl returned zero timestamps. All tests pass under -race (pkg/agent/proxy/...). Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(proxy/tools/chaos): address Copilot review — deadline re-fetch, next_step, block-comment allowlist, honest chaos skip Responds to Copilot tick #2 on #4113: - fakeconn/fakeconn.go: Read/ReadChunk now re-fetch the deadline channel on every loop iteration. A concurrent SetReadDeadline closes a "deadline changed" broadcast channel that the select observes, the loop reloads both channels, and the new deadline takes effect on in-flight reads. Previously the dlCh was captured once and concurrent SetReadDeadline updates were invisible to blocked readers. - util/util.go: RecoverWithoutClose's Error log now includes the recovered panic value and an actionable next_step (supervisor falls through to passthrough; file the Sentry issue with the parser owner or set KEPLOY_NEW_RELAY=off). No shortcut suppression; the log is genuinely actionable. - tools/lint/no_timestamp_in_parser/analyzer.go: collectSuppressLines now matches the marker in both line comments ("// allow:time.Now") and block comments ("/* allow:time.Now ... */"), matching the documented contract. ast.Comment.Text retains outer delimiters so we strip them explicitly before the prefix check. - tests/e2e/chaos-broken-parser: the in-process V2 proxy wiring is still stubbed (broken_parser.go returns errChaosNotYetWired). Rather than pretend the test can pass, three aligned fixes: * chaos_test.go now passes -tags chaos_broken_parser to `go run` so the tagged path is actually compiled. * chaos_test.go adds a KEPLOY_CHAOS_WIRING gate; without it the test skips with a clear "wiring not yet implemented" message. * harness/main.go propagates errChaosNotYetWired from startBrokenParserProxyIfEnabled and skips the supervisor- fallback invariant when the wiring is absent (I1 and I3 still apply — they are properties of the docker-compose stack itself). errChaosNotYetWired is now declared in broken_parser_stub.go so main.go can reference it regardless of build tag. All tests pass under -race. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(supervisor/docs): implement the documented behaviour — monotonic clamp, ConnectionID propagation, relay wiring doc Responds to Copilot tick #3 on #4113: - supervisor/session.go: EmitMock now implements the per-session ReqTimestampMock monotonicity clamp that matcher-contract.md describes — a mock earlier than the most-recent emitted mock on the same session gets its timestamp clamped to lastReq+1ns. SetDebugMonotonic(true) in test binaries turns the clamp into a panic so parser bugs surface immediately. Guarded by a short mutex; lock-free hot paths are preserved for hookMu / mockIncomplete. - supervisor/session.go: EmitMock now auto-propagates ClientConnID to mock.ConnectionID when the parser didn't already populate it, matching the documented Session.ClientConnID contract. Parsers that want to override can still assign before EmitMock. - relay/doc.go: package doc no longer says "not yet wired into handleConnection" — it is, via proxy_v2.go::recordViaSupervisor for V2-capable parsers. Mentions the KEPLOY_NEW_RELAY rollback knob. - docs/reference/mock-matcher-contract.md: rewrote the monotonicity and ConnectionID sections to match the actual implementation — simplified code example removed, behaviour described including the SetDebugMonotonic toggle and the zero-timestamp bypass. All tests pass under -race. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(lint/chaos): narrow lint scope to *_v2.go; reusable tickers; README accuracy Responds to Copilot tick #4 on #4113: - tools/lint/no_timestamp_in_parser/analyzer.go: narrowed inScope to *_v2.go files (and recorder_v2/ directory) instead of matching any encode*.go. Legacy encode.go files in integrations/generic and integrations/http use time.Now() legitimately; the old scope would have flagged them as violations. The V2 contract only applies to the *_v2.go files that adopted the chunk-timestamp rule, so that's the precise scope. Package doc rewritten to document the narrowed scope + rationale. - tests/e2e/chaos-broken-parser/harness/main.go: replaced the two time.After-in-loop sites (driveQueries inter-query spacing and waitForPostgresHealthy retry loop) with reusable *time.Ticker + defer Stop(). Matches the repo's polling-loop convention and avoids per-iteration timer allocation. - tests/e2e/chaos-broken-parser/README.md: "Status on this branch" section now accurately reflects that the V2 packages ARE landed on feat/proxy-v2-foundation (they're introduced by this PR). What remains stubbed is the in-process proxy instantiation inside broken_parser.go — the TODO checklist in its header documents it. All tests pass under -race; `go vet`, `gofmt -l`, `go build ./...` clean. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(proxy,docs): wire incomplete-mock + pending-work hooks; Chunk.IsZero ignores Dir; Debug-level FakeConn.Write; doc drift Responds to Copilot tick #5 on #4113: - proxy_v2.go: relay.Config now wires OnMarkMockIncomplete to the session's MarkMockIncomplete so tee drops (memoryguard, per-conn cap, channel full, write error, short write, KindAbortMock) mark the in-flight mock incomplete. Previously the relay would drop silently and the parser could still emit a partial mock — exact invariant I4 violation Copilot flagged. - relay/config.go, relay/relay.go: new OnClientChunkTeed callback invoked on each successful client→dest tee. proxy_v2.go wires it to sv.MarkPendingWork and sets svSess.OnPendingCleared to sv.ClearPendingWork. EmitMock calls OnPendingCleared after a successful emit. Net effect: the supervisor's activity watchdog can now distinguish an idle connection (no pending requests) from a parser that received bytes but isn't emitting a mock (hang candidate), closing a real gap in watchdog semantics. - supervisor/session.go: OnPendingCleared field + EmitMock wiring. - fakeconn/chunk.go: Chunk.IsZero drops the `Dir == 0` check because Direction(0) is the valid value FromClient; a real empty FromClient chunk with zero timestamps would have been misclassified as zero. Now checks only payload + timestamps + SeqNo, which is the genuine "unset" signal. - fakeconn/fakeconn.go: FakeConn.Write's log call downgraded Warn → Debug. The returned ErrFakeConnNoWrite is the primary signal for the parser-misuse bug; the log is diagnostic. logger interface's method also renamed Warn → Debug for consistency. - tests/e2e/chaos-broken-parser/harness/broken_parser.go: header updated — the supervisor fallback KEEPS the relay running for raw forwarding rather than invoking globalPassThrough. Matches the current recordViaSupervisor behaviour; previous wording was from an earlier design iteration. - tools/lint/no_timestamp_in_parser/README.md: scope description rewritten to match the narrowed *_v2.go / recorder_v2/ scope. Legacy encode.go files explicitly documented as out-of-scope. - docs/reference/mock-matcher-contract.md: enforcement paragraph aligned with the narrowed analyzer scope. All tests pass under -race. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(lint/chaos): TypesInfo-based time-package resolution; single-source suppression marker; doc/filename drift Responds to Copilot tick #6 on #4113: - tools/lint/no_timestamp_in_parser/analyzer.go: time-package resolution now goes through pass.TypesInfo (refersToStdlibTimePackage) so the rule catches aliased imports like `import stdtime "time"` and renamed copies, not just bare `time` identifiers. Falls back to the old name-match when TypesInfo isn't populated (e.g. testdata without full type-check). - tools/lint/no_timestamp_in_parser/analyzer.go: suppressionComment is now the single source of truth for the marker — the matcher derives its prefix from the constant (no more `_ = suppressionComment` workaround) and the diagnostic message interpolates it too. - tools/lint/no_timestamp_in_parser/analyzer.go: inScope docstring rewritten to match the actual "directory-agnostic *_v2.go / *recorder_v2/* suffix match" implementation, with rationale for the cross-repo consistency reason. - tools/lint/no_timestamp_in_parser/analyzer_test.go: top-of-file comment updated — the testdata fixtures exercise the *_v2.go suffix match, not a /recorder/ substring match. - tests/e2e/chaos-broken-parser/chaos_test.go: comment on e2eTagEnabled referenced chaos_test_e2e_tag.go / chaos_test_no_tag.go which don't exist; real sentinel files are e2e_tag_enabled_test.go and e2e_tag_disabled_test.go. Updated so future maintainers find the right files. All tests pass under -race. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * feat(fakeconn): add LastWrittenTime accessor FakeConn already tracks each chunk's WrittenAt timestamp — but only via the chunk returned from ReadChunk. Parsers that consume bytes via Read (and therefore never see Chunk structs) had no way to observe the response-side timestamp without refactoring their read loop. Adds a symmetric LastWrittenTime() accessor alongside LastReadTime() so parsers can source ResTimestampMock from chunk.WrittenAt in the same way ReqTimestampMock is sourced from chunk.ReadAt via LastReadTime. Implementation mirrors LastReadTime: an atomic.Int64 updated by every ReadChunk delivery (including the drain-from-stash path) and returned as time.Unix(0, n). Consumer fix follows in keploy/integrations#133 (Postgres v3 session mock switches from LastReadTime to LastWrittenTime for resTs). Test coverage: TestReadFromChunks now asserts both LastReadTime and LastWrittenTime across a two-chunk Read; new TestLastWrittenTimeZeroBeforeAnyChunk pins the zero-value contract. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix: clear pending-work on dropped mocks; disambiguate V2 success log Four Copilot review follow-ups on 1e8f383: - supervisor/session.go: EmitMock's drop-due-to-mockIncomplete path now calls OnPendingCleared before returning. Without this the supervisor's hang watchdog stays armed after a benign drop (chunk gate, memory pressure, short write) and eventually fires a spurious abort on an idle connection. New TestSessionEmitMockDropPathClearsPending pins the behavior. - integrations/generic/encode_v2.go: flushMock's incomplete-drop early-return path now also calls sess.OnPendingCleared. The early return was an allocation optimisation and previously skipped the pending-clear that EmitMock would have done, for the same watchdog reason. - proxy.go (V2 dispatcher): the post-recordViaSupervisor log used to claim "successfully recorded via V2 path" even when the supervisor had fallen through to passthrough (parser panic / hang / explicit fallback), because recordViaSupervisor returns nil error in both cases. Split the reporting: recordViaSupervisor now logs "V2 parser recorded outgoing message successfully" (with the supervisor result.Status) only on the genuine non-fallthrough success path; the caller log softens to "V2 record path returned". - relay/tee.go: comment on the per-conn byte-cap check said "Reserve capacity optimistically, then undo on channel-full" but the code only increments t.bytes after a successful staging send, so nothing is ever "undone" on drop paths. Rewrote the comment to match actual accounting so a future change doesn't rely on an incorrect description. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix: doc accuracy + stderr fallback + panic next_step Four Copilot review follow-ups on c2e63d2: - directive/doc.go: package doc said directives go "to a channel owned by the supervisor, which forwards ... to the relay". Actually the relay exposes the directive channel directly (session.Directives = r.Directives()) and processes them inside relay.processDirectives; the supervisor never touches them. Rewrote to match the actual ownership/flow. - util/util.go RecoverWithoutClose: the nil-logger fallback printed "Recovered from panic in parser (no logger available)" to stdout. Library code writing to stdout can corrupt CLI/tool-parseable output streams. Switched to fmt.Fprintln(os.Stderr, ...) — same signal, predictable stream. - supervisor/supervisor.go classifyReturn panic path: Error log lacked an actionable next_step. Added guidance pointing at the supervisor's automatic fallthrough-to-passthrough behaviour plus the three escape hatches (KEPLOY_NEW_RELAY=off for this parser, KEPLOY_DISABLE_PARSING=1 / SIGUSR1 to disable parser dispatch entirely) so operators have a clear remediation path. - proxy_v2.go recordViaSupervisor doc: responsibility #4 said FallthroughToPassthrough would "run globalPassThrough on the real sockets", but the implementation intentionally does NOT call globalPassThrough — it keeps the relay forwarding until peer close (invariant I1 in pkg/agent/proxy/README.md; calling globalPassThrough would create a read-loop gap, exactly the stall V2 was designed to eliminate). Aligned the comment. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix: drop unused srcPtr/dstPtr; correct README I3 fallback claim Three Copilot review follow-ups on 0799eb3: - proxy_v2.go newProxyTLSUpgradeFn: the signature took srcPtr and dstPtr *net.Conn parameters that were never used. The function only performs the handshake and returns the upgraded net.Conn; pointer mutation is the relay's responsibility. Dropped the two parameters and updated the single call site. The doc comment already explained this ownership split; now the signature matches. - README.md "Safety invariants" table I3: claimed recordViaSupervisor routes FallthroughToPassthrough to globalPassThrough. Actual behaviour (tick 16's proxy_v2.go doc fix) is the opposite — the relay keeps forwarding until peer close so there is never a read-loop gap. Rewrote the I3 "Enforced by" cell to match. - README.md recordViaSupervisor overview paragraph: same globalPassThrough claim in prose form. Rewrote to describe the relay-continuation fallback + explicit note that calling globalPassThrough would reintroduce the exact stall V2 was designed to eliminate. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix: shutdown drain now actually works; correct Metadata[type] doc Two Copilot review follow-ups on c88f4f5: - proxy.go / proxy_v2.go: waitForConnDrain was ineffective because p.clientConnections was only read and nil-cleared on shutdown, never populated by any call path. len() was always 0, so the 5-second drain grace returned immediately and shutdown proceeded without waiting for in-flight parsers. The force-close loop that followed was also a no-op. Replaced with a sync.WaitGroup approach: - Proxy.activeConns sync.WaitGroup replaces the clientConnections []net.Conn slice + connMutex coordination. - handleConnection calls activeConns.Add(1) / defer Done() at entry so every invocation is tracked, regardless of which return path it takes. - waitForConnDrain spins a sentinel goroutine that closes a done channel when Wait() returns, then selects done vs ctx.Done(). Net effect: a single Wait() drains the whole active set with a deadline. - The dead force-close loop is gone. Connections that don't finish inside the grace window exit via the parent ctx cancellation their parserCtx inherits; their deferred srcConn/dstConn closes fire on return without any double-close race from a shared slice. - docs/reference/mock-matcher-contract.md: Metadata["type"] was documented as "protocol name (e.g. mysql)" but the codebase actually uses it as a lifetime tag (see pkg/models/lifetime.go): "config" for session-lifetime mocks, "connection" for connection-lifetime mocks. Rewrote the reserved-keys table row with the accurate semantic and a pointer to lifetime.go. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * Change default integration ref to 'feat/v2-native-parsers' Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> * fix: align shutdown-step doc; test block-comment timestamp suppression Two Copilot review follow-ups on 5e37514: - proxy.go StopProxyServer shutdown-step comment: said step 4 was "Force-close any still-live client connections at the end", but the implementation no longer holds a slice of live conns (dropped in tick 20's activeConns WaitGroup refactor). Rewrote step 3 to describe the WaitGroup-based drain and added step 4 explaining the deliberate design: stragglers exit asynchronously via parent ctx cancellation, without a shared slice and without a double-close race against their own deferred closes. Shutdown is still bounded by the 5-second grace timeout. - tools/lint/no_timestamp_in_parser: the README documents both line-comment (// allow:time.Now) and block-comment (/* allow:time.Now */) suppression, but the analysistest fixtures only exercised the line form. Added two new exported functions to the "suppressed" fixture: LogUptimeBlockComment — single-line block comment above a time.Since() call, proves the /* ... */ branch of collectSuppressLines fires. LogBootTwoLineBlock — multi-line block comment; verifies that the analyzer's "end line + 1" lookup honours the block's END line, not its start line. With these fixtures the block-comment suppression path gains real test coverage so a future refactor can't silently regress the README-documented behaviour. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix: pause relay tees on abort; flag temporary action-ref default Three Copilot review follow-ups on 80d55b2: - proxy_v2.go SessionOnAbort / relay/relay.go: after parser abort (panic / hang / mem-cap) we already closed the FakeConns so the parser's reads unblock, but the relay's tees stayed active and kept trying to push every subsequent chunk into the now-orphaned FakeConn channel. Every push fell through to the channel-full drop path, which logs at Debug — on a long-lived post-abort connection that's one log line per chunk for the rest of the session. New Relay.PauseTees() flips the atomic-bool pause flag on both tees; push then returns via the cheap DropPaused fast-path without consuming capacity or logging. Raw forwarding on the real sockets is unaffected — every byte still reaches its peer. SessionOnAbort now pauses tees FIRST, then closes the FakeConns. - chaos-broken-parser/README.md: documented failure behaviour was out of date — said the go test wrapper "fails (not skips) once docker + e2e tag are present", but chaos_test.go now also SKIPs when KEPLOY_CHAOS_WIRING is unset/falsy. Rewrote the bullet to spell out the full gate set (KEPLOY_E2E + -tags e2e + docker + KEPLOY_CHAOS_WIRING) and that only beyond that gate is a failure genuine. - .github/actions/setup-private-parsers/action.yaml: the integration-ref default was changed to 'feat/v2-native-parsers' in 5e37514 to unblock cross-repo CI during review. Added a prominent TODO(before-merge) comment above the default so it can't be silently inherited by downstream workflows after merge — the feature branch will eventually be deleted, which would break every caller that doesn't pass integration-ref explicitly. Default itself is left intentionally in place; reverting it would undo the maintainer's CI-wiring. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix: revert action-ref default to main; mask stash-exhaustion EOF Two Copilot review follow-ups on f4c418e: - fakeconn/fakeconn.go Read: when draining the internal byte stash via bytes.Buffer.Read, io.EOF was surfaced to the caller any time the stash went empty on a call that DID return bytes. Go's convention (and what bufio.Reader / io.Copy / encoding/pkg readers assume) is that (n>0, io.EOF) means "here is the last n bytes". Callers would terminate the stream early even though more chunks were still arriving on f.ch. Only the channel-close path is a genuine EOF; mask the stash-exhaustion EOF whenever n > 0. New TestReadStashExhaustionNotEOF pins the contract: a two-chunk sequence with a Read that exactly empties the stash on an intermediate call must surface the stash bytes with err=nil, and a subsequent Read must still see the second chunk. - .github/actions/setup-private-parsers/action.yaml: reverted the integration-ref default from 'feat/v2-native-parsers' back to 'main'. Copilot flagged the feature-branch default as a post-merge footgun — downstream workflows that don't pass integration-ref explicitly would pin to a branch that gets deleted. The review-time cross-repo CI override is still available via the repo-level `vars.INTEGRATION_REF` variable, which the workflow env lines at the top of prepare_and_run*.yml and golangci-lint.yml resolve first: INTEGRATION_REF: ${{ vars.INTEGRATION_REF || 'main' }} So the maintainer can set `INTEGRATION_REF=feat/v2-native-parsers` as a repo variable during the review window and unset it (or delete the var) after merge, with no action.yaml churn. Added a comment block spelling out that wiring so the next reader doesn't have to reverse-engineer it. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(relay): atomic two-sided TLS upgrade Copilot review follow-up on 1d9ae97: pkg/agent/proxy/relay/directive_proc.go handleUpgradeTLS previously performed the dest-side TLS handshake first, published r.dst via atomic.Store, THEN performed the client-side handshake. If the client-side handshake failed after the dest-side store, the relay was left in a mixed state: r.dst points at a *tls.Conn, r.src still points at cleartext. Any forwarding that happened before the outer layer tore the sockets down would move TLS bytes in one direction and plaintext bytes in the other, corrupting any traffic in flight. Restructured the handler to stage both handshakes in local variables FIRST (upgradedDst, upgradedSrc) and ONLY publish via r.{dst,src}.Store after both requested handshakes succeed. On client-side failure, close the unpublished dest-side *tls.Conn so it doesn't leak, and return the error with both r.dst and r.src still pointing at their original cleartext conns — the forwarders never see the mixed state. The forwarders are parked on r.beginPause() for the entire handshake window, so publishing is observed only after r.endPause fires on the success path. The corruption window is therefore zero regardless of handshake ordering. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix: close WaitGroup race; next_step on V2 error logs; test timeouts Four Copilot review follow-ups on 0dcfbe8: - proxy.go activeConns Add/Wait race: activeConns.Add(1) was called inside handleConnection, which means the Add happened AFTER clientConnErrGrp.Go returned to the accept loop. If StopProxyServer kicked off waitForConnDrain -> activeConns.Wait() in that window, Wait could observe count=0 and return before the handleConnection goroutine registered itself. sync.WaitGroup documents this as misuse ("calls with a positive delta that occur when the counter is zero must happen before a Wait") and can produce "sync: WaitGroup misuse" panics in some interleavings. Moved activeConns.Add(1) to the spawn site in the accept loop (before clientConnErrGrp.Go), Done() stays in the goroutine's defer. - proxy.go V2 record path failure log: utils.LogError("V2 record path failed") lacked an actionable next_step. Added parser tag + rollback knobs (KEPLOY_NEW_RELAY=off for this parser, or KEPLOY_DISABLE_PARSING=1 / SIGUSR1 to disable parser dispatch entirely); noted that the supervisor has already fallen through to passthrough so user traffic continues. - integrations/mysql/mysql.go recordV2: same bare utils.LogError without next_step. Added identical remediation guidance. - v2_integration_test.go: two tests (happy-path, panic-parser) blocked on unbounded `<-resCh` receives. If the supervisor/relay regress and Run never returns, the test hangs until the package-wide `go test -timeout`, which makes failures slow to surface and hard to locate. Wrapped both receives in `select <-resCh / <-time.After(3 * time.Second)` — same pattern other receives in the file already use. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(supervisor): next_step on reportPanic + fireOnAbort Error logs Two Copilot review follow-ups on d103f61: - reportPanic: the Error log fired when the configured PanicReporter itself panics lacked the repo-standard next_step field. Added guidance pointing operators at two remediations: fix the reporter implementation (it must be non-blocking and non-panicking, same contract as all supervisor callbacks), or unset supervisor.Config.PanicReporter=nil to fall back to no external reporting — the recovered parser panic is still logged at Error level on the supervisor's own path, so turning the reporter off is a safe degradation. - fireOnAbort: matching Error log on SessionOnAbort callback panic. next_step now explains the callback's contract (non-blocking, non-panicking — see proxy_v2.go for the reference implementation that just closes FakeConns and pauses tees) and offers the escape hatch of constructing the Supervisor without a SessionOnAbort callback if the caller can tolerate parser-side reads not unblocking promptly on abort. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix: unstick outer-ctx-cancel goroutine leak; doc TLS boundary semantics Two Copilot review follow-ups on b18390f: - supervisor/supervisor.go outer ctx-cancel: when ctx.Done() fired, runCancel() cancelled the inner context, then the select waited 50ms for the parser to return. On clean return we skipped SessionOnAbort (correct — nothing to unstick). But on the 50ms-timeout path we ALSO skipped it, which leaked the parser goroutine: if the parser was parked in FakeConn.Read/ReadChunk (which do not observe ctx — they only unblock on Close), the goroutine would stay parked for the life of the process, and the relay tees would stay armed so every subsequent chunk fell through the channel-full drop path logging at Debug. Fire SessionOnAbort on the timeout branch (closes both FakeConns + pauses tees via the dispatcher's reference callback) and mark FallthroughToPassthrough so the outer caller routes raw — consistent with the hang and panic branches above. - directive/directive.go Ack doc: claimed BoundaryReadAt / BoundaryWrittenAt were "the exact wallclock instants at which the real sockets transitioned from cleartext to TLS", but the relay implementation captures BoundaryReadAt BEFORE handshakes begin and BoundaryWrittenAt AFTER both succeed — these are upgrade-boundary brackets, not transition instants (the TCP stack has no hook that would give the relay the exact instant, and the handshake itself is a multi-roundtrip sequence taking real time). Rewrote to describe the actual contract: a Read bound before and a Write bound after, with the post-ack guarantee that everything on ClientStream/DestStream is plaintext from the upgraded session. Signed-off-by: slayerjain <shubhamkjain@outlook.com> --------- Signed-off-by: slayerjain <shubhamkjain@outlook.com> Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Ayush Sharma <kshitij3160@gmail.com>
khareyash05
added a commit
that referenced
this pull request
Apr 30, 2026
asyncMySQLDecode reads client and server bytes from clientBuffChan and
destBuffChan and feeds both into a single FIFO decodeChan via a
non-deterministic select in handleClientQueries. On a fast loopback the
next client command can be enqueued ahead of the trailing server bytes
of the prior result set (more column defs, the rows, or the EOF /
OK_with_EOF terminator). When that happens the recorder force-flushes
the in-progress mock at line 346 with truncated columns or a missing
FinalResponse, and the encoded wire bytes at replay time look like:
column_count_pkt | column_def_1..M | <end> # M < ColumnCount
or
column_count_pkt | column_defs | row_pkt | <end> # missing terminator
Drivers that strictly require the trailing terminator — most notably
Connector/J on Java 8 — block in socketRead0 forever waiting for a
column def or terminator that was lost to the race. Connector/J on
Java 17 happens to tolerate the missing terminator on otherwise-complete
result sets, which is why the bug only surfaced on Java 8 in earlier
testing.
Fix is recorder-side only. flushMock now calls
closeIncompleteResultSetForFlush, which when the decoder state is
mid-result-set (stateExpectColumns, stateExpectEOFAfterColumns, or
stateExpectRows):
1. Truncates rs.ColumnCount to len(rs.Columns) so the encoded
column-count packet matches the column-def packets actually
emitted. Without this, replay clients block waiting for a column
def that will never arrive.
2. Synthesizes the trailing terminator on FinalResponse using the
negotiated capabilities — 0xFE-prefixed OK_with_EOF if
CLIENT_DEPRECATE_EOF is set (with a trailing lenenc info byte if
CLIENT_SESSION_TRACK is also set), legacy 5-byte EOF otherwise.
3. Picks seqID = max(observed seqs in columns / EOFAfterColumns /
rows) + 1, falling back to 2 when the result set has no captured
columns at all.
4. Rewrites bundle.Header.Type to TextResultSet /
BinaryProtocolResultSet so wire.EncodeToBinary's type-switch
routes through the result-set encoder rather than the column-count
head packet that processFirstResponse originally stored.
The encoder (wire/phase/query/ResultSetPacket.go) and the replayer
(replayer/query.go, replayer/conn.go) are byte-identical to before —
the fix produces structurally-complete mocks at record time so replay
just works.
Verified on kind keploy-ds → keploy-replay with Spring Boot
sample-app-java8 (Temurin 8-jre / Connector/J 8.4.0) and
sample-app-java17 (Temurin 17-jre / Connector/J 8.4.0) both connecting
to mysql:8.0 with require_secure_transport=ON and
sslMode=REQUIRED&useSSL=true&trustServerCertificate=true:
Java 8 run #1: 9 testcases, failed_count=0, noisy_count=4, 11.42 s
Java 8 run #2: 4 testcases, failed_count=0, noisy_count=2, 11.40 s
Java 17 regression: 8 testcases, failed_count=0, noisy_count=3, 11.24 s
Zero context-deadline-exceeded events and zero CommunicationsException
across the runs. The "noisy" classifications are auto-increment IDs,
list ordering, and created_at timestamp drift — pure data variance,
same shape as the pre-fix Java 17 baseline.
Signed-off-by: Yash Khare <khareyash05@gmail.com>
3 tasks
khareyash05
added a commit
that referenced
this pull request
May 7, 2026
…st cases (#4157) * feat(replay): expose TLSConfig hook on SimulationConfig + Hooks for HTTPS test cases Adds an opt-in extension point so non-OSS callers (e.g. cluster-mode auto-replay) can inject a *tls.Config into the replay HTTP client without forking pkg/util or shipping a separate transport. Default nil preserves the stdlib system-pool behaviour for every existing OSS user; no behaviour change unless a caller opts in. ## Why Replay HTTPS test cases captured against an in-cluster service keep their original URL (e.g. https://app.svc.cluster.local:8443). At replay time the dial target is rewritten to a local target — typically localhost:<port> through a kubectl port-forward to a short-lived replay pod, or some other dev/CI tunnel — so the cert the server presents cannot match the dial hostname. The replay pod commonly serves a self-signed cert generated at image build (PKCS12 keystore baked into the image, or a cert-manager Issuer signing per pod). That cert is not in any system CA bundle, and modern Go's x509 verifier additionally demands a SAN matching the dial host. Without a hook, every HTTPS test case fails the TLS handshake ("x509: certificate is not valid for any names" or "signed by unknown authority") *before* the request is dispatched, so the response body never reaches the comparison engine. A blanket InsecureSkipVerify in OSS is the wrong fix because it weakens TLS verification for every other replay path (CI lanes replaying against a real staging backend with a properly-issued cert, for example). This PR keeps strict TLS the default and lets specific callers install a curated *tls.Config — for instance one that pins the replay pod's leaf cert via VerifyConnection. ## Files - pkg/util.go: SimulationConfig gains TLSConfig *tls.Config; all 3 http.Transport branches in prepareHTTPRequest pass it through as TLSClientConfig (nil leaves the transport at stdlib defaults). - pkg/service/replay/hooks.go: Hooks.tlsConfig field + (*Hooks).SetReplayTLSConfig setter; SimulateRequest forwards it into pkg.SimulationConfig.TLSConfig for HTTP testcases. Verified end-to-end against a cluster-mode replay consumer that TOFU-pins the replay pod's leaf cert (kept out of OSS). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(mysql-recorder): synthesize result-set terminator on force-flush asyncMySQLDecode reads client and server bytes from clientBuffChan and destBuffChan and feeds both into a single FIFO decodeChan via a non-deterministic select in handleClientQueries. On a fast loopback the next client command can be enqueued ahead of the trailing server bytes of the prior result set (more column defs, the rows, or the EOF / OK_with_EOF terminator). When that happens the recorder force-flushes the in-progress mock at line 346 with truncated columns or a missing FinalResponse, and the encoded wire bytes at replay time look like: column_count_pkt | column_def_1..M | <end> # M < ColumnCount or column_count_pkt | column_defs | row_pkt | <end> # missing terminator Drivers that strictly require the trailing terminator — most notably Connector/J on Java 8 — block in socketRead0 forever waiting for a column def or terminator that was lost to the race. Connector/J on Java 17 happens to tolerate the missing terminator on otherwise-complete result sets, which is why the bug only surfaced on Java 8 in earlier testing. Fix is recorder-side only. flushMock now calls closeIncompleteResultSetForFlush, which when the decoder state is mid-result-set (stateExpectColumns, stateExpectEOFAfterColumns, or stateExpectRows): 1. Truncates rs.ColumnCount to len(rs.Columns) so the encoded column-count packet matches the column-def packets actually emitted. Without this, replay clients block waiting for a column def that will never arrive. 2. Synthesizes the trailing terminator on FinalResponse using the negotiated capabilities — 0xFE-prefixed OK_with_EOF if CLIENT_DEPRECATE_EOF is set (with a trailing lenenc info byte if CLIENT_SESSION_TRACK is also set), legacy 5-byte EOF otherwise. 3. Picks seqID = max(observed seqs in columns / EOFAfterColumns / rows) + 1, falling back to 2 when the result set has no captured columns at all. 4. Rewrites bundle.Header.Type to TextResultSet / BinaryProtocolResultSet so wire.EncodeToBinary's type-switch routes through the result-set encoder rather than the column-count head packet that processFirstResponse originally stored. The encoder (wire/phase/query/ResultSetPacket.go) and the replayer (replayer/query.go, replayer/conn.go) are byte-identical to before — the fix produces structurally-complete mocks at record time so replay just works. Verified on kind keploy-ds → keploy-replay with Spring Boot sample-app-java8 (Temurin 8-jre / Connector/J 8.4.0) and sample-app-java17 (Temurin 17-jre / Connector/J 8.4.0) both connecting to mysql:8.0 with require_secure_transport=ON and sslMode=REQUIRED&useSSL=true&trustServerCertificate=true: Java 8 run #1: 9 testcases, failed_count=0, noisy_count=4, 11.42 s Java 8 run #2: 4 testcases, failed_count=0, noisy_count=2, 11.40 s Java 17 regression: 8 testcases, failed_count=0, noisy_count=3, 11.24 s Zero context-deadline-exceeded events and zero CommunicationsException across the runs. The "noisy" classifications are auto-increment IDs, list ordering, and created_at timestamp drift — pure data variance, same shape as the pre-fix Java 17 baseline. Signed-off-by: Yash Khare <khareyash05@gmail.com> --------- Signed-off-by: Yash Khare <khareyash05@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Asish Kumar <87874775+officialasishkumar@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.