Skip to content

Create LICENSE#2

Merged
slayerjain merged 1 commit into
mainfrom
add-license-1
Jan 19, 2022
Merged

Create LICENSE#2
slayerjain merged 1 commit into
mainfrom
add-license-1

Conversation

@slayerjain
Copy link
Copy Markdown
Member

No description provided.

@slayerjain slayerjain merged commit d0dea01 into main Jan 19, 2022
@nehagup nehagup deleted the add-license-1 branch October 16, 2022 18:13
PranshuSrivastava pushed a commit that referenced this pull request Mar 22, 2024
manasmanohar pushed a commit to manasmanohar/keploy that referenced this pull request Jan 10, 2025
* feat: add wrapers around interfaces of keploy oss

Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com>

* build: add github action for build check golang

Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com>

* refactor: remove replace statement in go mod

Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com>

* fix: upgrade to newer version in go mod

Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com>

* refactor: remove goveralls for test-coverage

Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com>

* refactor: remove blobs from goreleaser

Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com>

---------

Signed-off-by: Sarthak Shyngle <50234097+Sarthak160@users.noreply.github.com>
gnana1905 pushed a commit to gnana1905/keploy that referenced this pull request Sep 17, 2025
ci: add github actions to deploy website
Sarthak160 pushed a commit that referenced this pull request Mar 17, 2026
slayerjain added a commit that referenced this pull request Apr 18, 2026
Four follow-up fixes on ee0332e plus the Windows runner disk-full
that failed this PR's own build-and-upload job.

1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the
   pre-release read of m.outChan was still under m.mu alone,
   racing SetOutputChannel/CloseOutChan which write it under
   outChanMu. Switched to outChanStatus() — same RLock path the
   send uses — so the consistency check is now under the lock
   that protects the field.

2. SetOutputChannel idempotence: the hot path is buildRecordSession
   re-binding rule.MC (same pointer) on every accepted connection.
   Previously it unconditionally cleared outChanClosed, so a
   post-shutdown re-bind turned a closed channel back 'open' and
   the next send would panic. Now only clears the flag when the
   channel POINTER actually changes (new session / re-record). New
   test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins
   this.

3. dns.go kept calling SetOutputChannel on every mock. With the
   idempotence fix in #2 above it's now safe, but the comment
   block is updated to spell out why, and the call is ordered
   before AddMock/SendConfigMock so the first DNS query (which
   may fire before the first TCP connection triggers
   buildRecordSession) still binds the channel.

4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest
   is 'client-side only' — it isn't, dialPostgresSSLUpstream
   re-originates SSLRequest upstream when dest port is 5432.
   Rewrote the comment to match actual behaviour and point at the
   runtime_hooks.go docstring.

5. prepare_and_run_windows.yml build-and-upload failed on ee0332e
   with 'There is not enough space on the disk' writing go-build
   importcfg under SystemTemp. Ported the tiered disk cleanup
   from keploy/windows-redirector#2: fast path on >=4 GB free,
   light sweep (SystemTemp\go-build*, sibling runners' _diag /
   _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows
   Update cache), last-resort go module cache purge, hard-fail
   under 2 GB.

Verification:
  go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass
  petclinic local 5-round stress, 1295 mocks captured, zero
  WARNING: DATA RACE, 4 s clean SIGTERM shutdown.
Signed-off-by: slayerjain <shubham@keploy.io>
slayerjain added a commit that referenced this pull request Apr 19, 2026
Four follow-up fixes on ee0332e plus the Windows runner disk-full
that failed this PR's own build-and-upload job.

1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the
   pre-release read of m.outChan was still under m.mu alone,
   racing SetOutputChannel/CloseOutChan which write it under
   outChanMu. Switched to outChanStatus() — same RLock path the
   send uses — so the consistency check is now under the lock
   that protects the field.

2. SetOutputChannel idempotence: the hot path is buildRecordSession
   re-binding rule.MC (same pointer) on every accepted connection.
   Previously it unconditionally cleared outChanClosed, so a
   post-shutdown re-bind turned a closed channel back 'open' and
   the next send would panic. Now only clears the flag when the
   channel POINTER actually changes (new session / re-record). New
   test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins
   this.

3. dns.go kept calling SetOutputChannel on every mock. With the
   idempotence fix in #2 above it's now safe, but the comment
   block is updated to spell out why, and the call is ordered
   before AddMock/SendConfigMock so the first DNS query (which
   may fire before the first TCP connection triggers
   buildRecordSession) still binds the channel.

4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest
   is 'client-side only' — it isn't, dialPostgresSSLUpstream
   re-originates SSLRequest upstream when dest port is 5432.
   Rewrote the comment to match actual behaviour and point at the
   runtime_hooks.go docstring.

5. prepare_and_run_windows.yml build-and-upload failed on ee0332e
   with 'There is not enough space on the disk' writing go-build
   importcfg under SystemTemp. Ported the tiered disk cleanup
   from keploy/windows-redirector#2: fast path on >=4 GB free,
   light sweep (SystemTemp\go-build*, sibling runners' _diag /
   _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows
   Update cache), last-resort go module cache purge, hard-fail
   under 2 GB.

Verification:
  go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass
  petclinic local 5-round stress, 1295 mocks captured, zero
  WARNING: DATA RACE, 4 s clean SIGTERM shutdown.
Signed-off-by: slayerjain <shubham@keploy.io>
slayerjain added a commit that referenced this pull request Apr 19, 2026
Four follow-up fixes on ee0332e plus the Windows runner disk-full
that failed this PR's own build-and-upload job.

1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the
   pre-release read of m.outChan was still under m.mu alone,
   racing SetOutputChannel/CloseOutChan which write it under
   outChanMu. Switched to outChanStatus() — same RLock path the
   send uses — so the consistency check is now under the lock
   that protects the field.

2. SetOutputChannel idempotence: the hot path is buildRecordSession
   re-binding rule.MC (same pointer) on every accepted connection.
   Previously it unconditionally cleared outChanClosed, so a
   post-shutdown re-bind turned a closed channel back 'open' and
   the next send would panic. Now only clears the flag when the
   channel POINTER actually changes (new session / re-record). New
   test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins
   this.

3. dns.go kept calling SetOutputChannel on every mock. With the
   idempotence fix in #2 above it's now safe, but the comment
   block is updated to spell out why, and the call is ordered
   before AddMock/SendConfigMock so the first DNS query (which
   may fire before the first TCP connection triggers
   buildRecordSession) still binds the channel.

4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest
   is 'client-side only' — it isn't, dialPostgresSSLUpstream
   re-originates SSLRequest upstream when dest port is 5432.
   Rewrote the comment to match actual behaviour and point at the
   runtime_hooks.go docstring.

5. prepare_and_run_windows.yml build-and-upload failed on ee0332e
   with 'There is not enough space on the disk' writing go-build
   importcfg under SystemTemp. Ported the tiered disk cleanup
   from a companion repo workflow: fast path on >=4 GB free,
   light sweep (SystemTemp\go-build*, sibling runners' _diag /
   _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows
   Update cache), last-resort go module cache purge, hard-fail
   under 2 GB.

Verification:
  go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass
  petclinic local 5-round stress, 1295 mocks captured, zero
  WARNING: DATA RACE, 4 s clean SIGTERM shutdown.
Signed-off-by: slayerjain <shubham@keploy.io>
slayerjain added a commit that referenced this pull request Apr 19, 2026
Four follow-up fixes on ee0332e plus the Windows runner disk-full
that failed this PR's own build-and-upload job.

1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the
   pre-release read of m.outChan was still under m.mu alone,
   racing SetOutputChannel/CloseOutChan which write it under
   outChanMu. Switched to outChanStatus() — same RLock path the
   send uses — so the consistency check is now under the lock
   that protects the field.

2. SetOutputChannel idempotence: the hot path is buildRecordSession
   re-binding rule.MC (same pointer) on every accepted connection.
   Previously it unconditionally cleared outChanClosed, so a
   post-shutdown re-bind turned a closed channel back 'open' and
   the next send would panic. Now only clears the flag when the
   channel POINTER actually changes (new session / re-record). New
   test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins
   this.

3. dns.go kept calling SetOutputChannel on every mock. With the
   idempotence fix in #2 above it's now safe, but the comment
   block is updated to spell out why, and the call is ordered
   before AddMock/SendConfigMock so the first DNS query (which
   may fire before the first TCP connection triggers
   buildRecordSession) still binds the channel.

4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest
   is 'client-side only' — it isn't, dialPostgresSSLUpstream
   re-originates SSLRequest upstream when dest port is 5432.
   Rewrote the comment to match actual behaviour and point at the
   runtime_hooks.go docstring.

5. prepare_and_run_windows.yml build-and-upload failed on ee0332e
   with 'There is not enough space on the disk' writing go-build
   importcfg under SystemTemp. Ported the tiered disk cleanup
   from a companion repo workflow: fast path on >=4 GB free,
   light sweep (SystemTemp\go-build*, sibling runners' _diag /
   _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows
   Update cache), last-resort go module cache purge, hard-fail
   under 2 GB.

Verification:
  go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass
  petclinic local 5-round stress, 1295 mocks captured, zero
  WARNING: DATA RACE, 4 s clean SIGTERM shutdown.
Signed-off-by: slayerjain <shubham@keploy.io>
slayerjain added a commit that referenced this pull request Apr 19, 2026
…4045)

* feat(proxy): extensibility hooks + Postgres SSL, ALPN, MySQL fixes

Adds generic extension points in the agent/proxy/hooks packages so
downstream builds can plug in alternative capture layers (observe-only
BPF, sidecar injection, etc.) without touching the OSS core. Also
bundles four orthogonal bug fixes that keploy hits today in the
default proxy path, plus a cgroup orphan-cleanup helper that fixes
"program already attached" errors after a crash.

No feature-specific code — the extension hooks are named after what
they do (SkipProxyListener, AgentInfoCustomizer), not after any
downstream product.

- `pkg/agent/runtime_hooks.go`: add `SkipProxyListener bool` and
  `AgentInfoCustomizer func(*structs.AgentInfo)` hook variables. When
  SkipProxyListener is true, the proxy does not bind a TCP listener
  but DNS, parsers, and session management still run. AgentInfoCustomizer
  is called right before AgentInfo is written to the BPF map so
  downstream code can populate the new Flags field without touching
  the OSS load path.
- `pkg/agent/hooks/structs/structs.go`: add `Flags uint32` field to
  AgentInfo — an extensible flag slot that downstream BPF code can
  consume (e.g. to branch cgroup/bind4 behavior).
- `pkg/service/agent/hooks.go`: export `SetSkipProxyListener` and
  `SetAgentInfoCustomizer` setter helpers.
- `pkg/agent/proxy/proxy.go`: honor SkipProxyListener by skipping
  `net.Listen` and blocking on ctx.Done instead.
- `pkg/agent/hooks/linux/hooks.go`: invoke `AgentInfoCustomizer` on
  the populated `agentInfo` before writing it to the BPF map.

- `pkg/agent/hooks/linux/cleanup.go` (**new**): helpers that find and
  detach BPF cgroup programs left behind by a crashed prior keploy
  run. `DetectAndCleanOrphanedCgroupPrograms` is called from the
  hooks load path BEFORE attaching new programs, fixing the common
  "program already attached" failure after a hard-kill. Scan is
  scoped to programs whose owning PID is dead, so it's safe when
  other live keploy instances are around. Also exposes
  `CleanOrphanedBPFPins` (pin cleanup) and an `ExtraBPFPinPatterns`
  slice so downstream builds can register additional glob patterns
  for their own pinned maps.

**Postgres SSLRequest handshake** (`pkg/agent/proxy/proxy.go`) —
`libpq` / `pgx` / `pq` with `sslmode=prefer|require|verify-*` sends
an 8-byte SSLRequest packet **before** the TLS handshake begins,
expecting the server to reply with a single byte `'S'` (accept) or
`'N'` (refuse). The proxy previously just peeked 5 bytes and tried
to detect a TLS record header, which fails on this packet — the
Postgres client would hang waiting for the server reply, and the
connection would drop after timeout. Now we peek 8 bytes, detect
the SSLRequest by length=8 + code=0x04D2162F, discard it from the
buffered reader, reply `'S'`, and re-peek for the TLS ClientHello
that arrives next. The existing TLS upgrade path then runs
verbatim.

**TLS ALPN negotiation for non-HTTP clients** (`pkg/agent/proxy/tls/tls.go`) —
The TLS handler previously forced `NextProtos = ["h2"]` for any
client that didn't explicitly advertise `http/1.1`, which caused
Go's `tls.Server` to raise
`tls: client requested unsupported application protocols` for
clients advertising things like `postgresql`, `mongodb`, or `redis`.
Now the handler leaves NextProtos empty (letting the client's
protocol selection stand) unless the client advertises `http/1.1`
or `h2` explicitly. Fixes libpq TLS handshakes against the proxy.

**MySQL `MatchType` content detection** (`pkg/agent/proxy/integrations/mysql/mysql.go`) —
`MatchType` used to unconditionally return `false`, relying on the
caller's port-based routing (3306). Capture layers that deliver
bytes without an L4 destination port (TLS uprobe streams, alternative
transports) need a content-based fallback so MySQL plaintext is
routed to the MySQL parser instead of falling through to Generic.
New heuristic matches (a) server HandshakeV10 (first body byte
0x0a) and (b) client HandshakeResponse41 (4-byte caps with
`CLIENT_PROTOCOL_41 | CLIENT_PLUGIN_AUTH` + zero-filled reserved
field), both restricted to small packets (len < 512) at low
sequence numbers. Strict enough to avoid false positives on HTTP
POST bodies.

**MySQL `CLIENT_DEPRECATE_EOF` seeding in post-TLS record path**
(`pkg/agent/proxy/integrations/mysql/recorder/conn.go`) — after a
post-TLS client handshake is decoded, the recorder was not copying
`decodeCtx.ClientCapabilities` into `decodeCtx.ClientCaps`, so
`DeprecateEOF()` returned `false` and the TextResultSet reader
expected an intermediate EOF packet between column definitions
and row data. Modern drivers (mysql-connector-python/j,
go-sql-driver, node mysql2) all advertise
`CLIENT_DEPRECATE_EOF` and send rows immediately, and the recorder
would abort with `expected EOF packet for column definition`.
We now mirror what `handleInitialHandshake` does on the non-TLS
path: seed `ClientCaps` from the decoded capabilities (and on the
seq=0 "existing connection" branch, defensively seed
`CLIENT_DEPRECATE_EOF` since we don't see a HandshakeResponse41
there).

**MySQL recorder: non-fatal EOF on TLS-short-circuited handshake**
(`pkg/agent/proxy/integrations/mysql/recorder/record.go`) — when
a capture layer detects the MySQL `CLIENT_SSL` bit mid-handshake
and closes the SimulatedConn so a TLS-aware consumer can take over,
the recorder was returning the EOF as an error. Now it's logged
at debug level and returned as nil. No observable behavior change
for the default proxy path (EOF is preserved for real TCP
hang-ups).

**MySQL replayer: fall through to plaintext mock candidates when no
SSLRequest-tagged mock exists** (`pkg/agent/proxy/integrations/mysql/replayer/conn.go`) —
previously, if no mock had a recorded SSLRequest entry, the replayer
would abort with `no mysql mocks matched the SSL request`. Now it
falls through to the full mock set and lets the post-TLS
HandshakeResponse41 match against plaintext candidates. The
post-TLS wire format is identical, so a plaintext-recorded MySQL
flow can still serve a TLS-requesting replay client.

**Docker run: defensive guard against `-p 0:0`** (`pkg/platform/docker/util.go`) —
when a caller intentionally sets `SetupOptions.ProxyPort = 0`
(e.g. running the agent without a listening proxy socket), the
alias builder was emitting `-p 0:0` which `docker run` rejects
outright. Now we conditionally build the `-p <proxy>:<proxy>`
fragment and skip it entirely when ProxyPort is zero.

`pkg/models/const.go` now exposes `CapturedReqTimeKey`,
`CapturedRespTimeKey`, `CapturedReqTime(ctx)`, `CapturedRespTime(ctx)`.
These let a capture layer install a per-connection wall-clock source
on the request context so integration parsers stamp
`reqTimestampMock` / `resTimestampMock` with true packet arrival
order instead of parser-goroutine scheduling order (the proxy path
drifts under concurrent-connection load and can invert mock
timestamps across exchanges). Both helpers fall back to
`time.Now()` when no source is installed, so call sites can use
them unconditionally. Applied inside the MySQL query recorder
(`query.go`) as the first consumer.

`github.com/cilium/ebpf` goes from v0.19.0 to v0.21.0. The newer
release contains the CO-RE field-exists fixes that downstream
capture-layer BPF code depends on; the bump is otherwise a
drop-in.

- [x] `go build ./...` — clean
- [x] `go test ./pkg/agent/proxy/integrations/mysql/... ./pkg/models/... ./pkg/agent/... ./pkg/platform/docker/...` — pass
- [x] Full grep for `proxyless` / `LowLatency` / `--low-latency` / `--container-name` across `*.go` — zero hits in new code

- No config surface changes (`config.Record` is untouched)
- No new CLI flags
- No BPF C source changes (`bpf_x86_bpfel.go/.o` left at baseline —
  the obfuscated map-name cleanup can land in a separate PR)

Signed-off-by: slayerjain <shubham@keploy.io>

* perf(record): cache next test index per test-set to avoid O(n²) dir scan

pprof showed FindLastIndex at 16% of the record client's CPU at
~3000 accumulated tests — every new test did a full getdents64 on
the tests/ directory. Cache the counter per-testset via sync.Map +
atomic.Int64; seed from FindLastIndex on first access, atomic
increment thereafter. No external process mints names in the same
directory during a record session.

Signed-off-by: slayerjain <shubham@keploy.io>

* perf(proxy): DirectChunkReader fast path in ReadBuffConn

Skip the ReadBytes goroutine hop when the conn exposes pre-framed
chunks directly (enterprise SimulatedConn backed by eBPF ringbuf).
HasPending+carry-over Read preserves Prepend semantics for protocol-
detection peeks. Non-implementing conns unchanged.

Signed-off-by: slayerjain <shubham@keploy.io>

* perf(record): async batched gob mock writer gated by KEPLOY_MOCK_FORMAT

yaml.v3 encoder was 28-30% of the record client's CPU at high
throughput (pprof c=20 PG bulk-insert). Replace it for low-latency
mode with an async gob writer:

- Single background goroutine per MockYaml instance owns the open
  file + bufio (256 KB) + gob.Encoder. The encoder stays alive
  across mocks so the per-mock cost is one Encode call plus a
  64-byte type-table delta (not a fresh encoder setup).
- InsertMock enqueues to a 4096-buffer chan; parser goroutine
  never blocks on disk or encoding. Queue-full falls back to a
  synchronous encode so mocks are never dropped — the drop count
  is tracked atomically for observability.
- readGobMocks uses one decoder that keeps the type table live
  across Decode calls, matching the writer's single-stream output.
- Close() drains the queue and flushes before returning — call it
  at record shutdown.

Reader auto-detects mocks.gob first, falling back to mocks.yaml,
so existing YAML recordings still replay.

Round-trip tested across HTTP, Redis, Kafka, Generic, Postgres,
MySQL, MongoDB, DNS, and multi-mock append. All 9 pass with
reflect.DeepEqual between input and decoded Mock. gob handles the
interface{} Message fields correctly via the gob.Register calls in
pkg/models/*.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore(bpf): regenerate x86 objects against ebpf#92 verifier fix

Earlier regen tripped the kernel 6.12 verifier in k_connect6
("dereference of modified ctx ptr R2 off=24 disallowed") after the
agent_info flags branch. ebpf#92 has the C-source fix; this regen
ships the Go + object generated from that ebpf revision.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(syncMock): recover from send on closed channel during shutdown

AddMock's default-case select did not catch closed-channel panic.
HandleOutgoing closes outChan during shutdown while a parser goroutine
can still be mid-send. recover() + silent drop is correct — shutdown
is racing the parser and the mock reader is already gone.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(record): flush async mockDB writer at record shutdown

The gob binary mock writer added in KEPLOY_MOCK_FORMAT=gob runs a
background goroutine that holds up to 4096 mocks in a channel before
encoding. Previously there was no call site for MockYaml.Close(), so
at Ctrl-C the queued tail was lost along with the writer goroutine.

Wire a deferred io.Closer-type-assertion at the top of Recorder.Start.
The default YAML MockDB doesn't implement io.Closer so the assertion
is a no-op; only the gob path's Close drains the queue and flushes
the file. Added TestMockYamlIsIOCloser as a compile-time + runtime
guard so a future refactor that drops the Close method can't silently
lose mocks.

Signed-off-by: slayerjain <shubham@keploy.io>

* feat(record): gob format config field + magic header + overflow logging + Recorder.Cleanups

Four debt items from the principal review bundled because they touch
the same area and gob-format codepath:

1. record.mockFormat config field (yaml|gob) — promotes
   KEPLOY_MOCK_FORMAT from an env-var escape hatch to a documented
   config knob. Env var still takes precedence for ad-hoc runs.
   Wired from the OSS CLI provider via mockdb.SetConfiguredMockFormat.

2. Magic header on mocks.gob — "keploy-gob-v1\n" is written at the
   start of every new file; readGobMocks rejects files without it
   with a clear error. Protects against silent struct-schema drift
   across keploy versions — future format breaks bump the suffix.

3. gobWriteSync now reuses the async writer's open file + encoder
   under the mutex. Previously it opened a second fd with a fresh
   gob.Encoder, concatenating a new type-table into the stream that
   the reader's single gob.Decoder could not resume across. This
   was a latent corruption bug that only fired when the async queue
   filled up (which the fix for item 1 makes observable).

4. MockYaml.Close now logs the queue-overflow count at Info level.
   Previously the atomic counter was incremented but never surfaced
   — no way for an operator to notice disk/encoding had become a
   bottleneck. New test TestGobMagicHeaderRejectsMismatch covers
   the header guard.

5. Recorder.Cleanups LIFO stack replaces the ad-hoc io.Closer
   type-assertion added in afcf4a18. mockDB auto-registers if it
   implements io.Closer; downstream subsystems (telemetry,
   mapping-DB, etc.) can RegisterCleanup their own flush without
   editing this file.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: regenerate arm64 BPF objects; add docs/env-vars.md

Matches the x86 regen landed earlier in this PR (1edb975f). Same
underlying ebpf#92 source commit. bpf_arm64_bpfel.go was already
identical to the regenerated version; only the .o needed updating.

docs/env-vars.md collects KEPLOY_MOCK_FORMAT, PPROF_PORT, and
GOMEMLIMIT_MB with notes on what was deliberately NOT added as a
default (GOMAXPROCS cap, auto-gob-in-low-latency). Keeps the env-var
surface documented and discoverable for reviewers.

Signed-off-by: slayerjain <shubham@keploy.io>

* style: gofmt alignment fixes

Signed-off-by: slayerjain <shubham@keploy.io>

* chore(bpf): regen .o/.go from post-#96 ebpf main (struct size=56)

The previously bundled .o files were captured from an intermediate
state of ebpf#96 where agent_info_t had grown to 64 bytes (had _pad
+ flags + _pad2). The merged version of #96 keeps the struct at 56
bytes via a size-preserving rename of the existing _pad slot to
flags. keploy's Go AgentInfo mirror is 56 bytes, which caused
"marshal value: structs.AgentInfo doesn't marshal to 56 bytes"
failures at hook load time on every platform that runs the BPF
loader (observed on macOS docker CI; same root cause elsewhere).
Regenerated from ebpf main at 9e657c6.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(hooks): align AgentInfo.Flags with BPF offset 44 (merged ebpf#96 layout)

The previous layout placed Flags at offset 56 (after RecordingStartTime)
with trailing blank padding, making the struct 64 bytes. The merged
ebpf#96 renamed the existing _pad slot to flags in-place, so the BPF
reads flags at offset 44 and the struct stays 56 bytes.

Moving Flags to the slot the old blank [4]byte occupied restores
byte-for-byte parity with the BPF map value: size 56, Flags@44,
RecordingStartTime@48.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(proxy): gate Postgres SSLRequest interception behind runtime hook

spring-petclinic (PG via JDBC) was regressing with EOF during
doAuthentication because the proxy was unconditionally intercepting
the 8-byte SSLRequest preamble and replying 'S' itself, then
terminating TLS. The default keploy build injects a Postgres parser
via extraparsers.go which already handles SSLRequest through the
existing TLSUpgrader interface; double-handling meant the parser
never saw the SSLRequest and couldn't forward post-TLS traffic,
leading to the Java client timing out on auth reads.

Fix: add agent.InterceptPostgresSSLRequest (default false) and a
SetInterceptPostgresSSLRequest setter. The proxy-level handshake
now runs only when the flag is enabled, preserving parser-driven
behavior on the default build and letting pure-proxy downstream
builds (no PG parser registered) opt in when needed.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: address Copilot review comments

- gob overflow hint now reports the actual queueCapacity and
  points at ensureGobWriter as the control (no fictitious env var).
- MySQL MatchType slices body to exactly the first packet's
  (buf[4:4+pktLen]) and uses pktLen for length checks, so a buffer
  carrying subsequent packets cannot spuriously satisfy the 23-byte
  reserved-zero window.
- Postgres SSLRequest error messages now include an actionable next
  step for the operator: what to check for each failure class
  (reset between peek/discard, one-byte write failure, missing
  ClientHello after 'S').
- record.go io.Closer comment corrected: MockYaml implements Close
  unconditionally and is a no-op in yaml mode (gobStop==nil).

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: address second round of Copilot comments

- cleanup.go: rewrote detachOrphanedByName docstring to be honest
  about the name-only match (BPF_PROG_GET_INFO_BY_FD exposes no
  creator PID); safety comes from the global no-other-keploy gate
  in the caller. Stale 'CleanupStaleAgents' reference removed.
- cleanup.go: detach failure log moved from Warn to Debug and made
  actionable (cgroup path + attach type + next step).
- record.go: cleanup failure log moved from Warn to Error and
  tells the operator what to inspect (mock-file tail, disk space,
  output-dir permissions) plus the failing cleanup index.
- db.go: gobClosed latches on Close(), InsertMock after close
  returns a clear error instead of silently enqueueing into a
  channel whose consumer has exited (prevented first-4096-then-
  hang behavior in hypothetical re-record-reuse flows).

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(mockdb): gob read path now treats Mongo mocks like YAML does

Two off-by-one inclusion bugs in the gob path of GetFilteredMocks
and GetUnFilteredMocks meant Mongo mocks were:
  - silently dropped from GetFilteredMocks (per-test mocks), because
    the whitelist switch did not list 'Mongo' (or any default case);
  - accidentally promoted into GetUnFilteredMocks (config mocks),
    because !isUnfilteredKind was treated as 'include' instead of
    'exclude'.

Rewrote both paths to mirror the YAML semantics: a fixed set of
kinds (Generic / Postgres / Http / Http2 / Redis / MySQL / DNS; plus
PostgresV2 on the unfiltered side) are 'unfiltered config' mocks
and go to GetUnFilteredMocks; everything else (Mongo, gRPC, etc.)
is a per-testcase mock and goes to GetFilteredMocks. Does not fix
the pre-existing Mongo Fuzzer record_latest_replay_build flake
that lives in the integrations mongo matcher — that test uses YAML
anyway — but unblocks Mongo workloads in gob mode.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(mockdb): make gob writer lifecycle restartable + DeleteMocksForSet gob

Two bugs in the same file addressing Copilot review round 3:

1. Re-record cycle: Recorder.Start auto-registers mockDB.Close via
   RegisterCleanup. Re-record orchestration invokes Recorder.Start
   repeatedly on the same mockDB instance — under the previous
   sync.Once + gobClosed latch, the second and subsequent sessions
   would lose every mock (writer goroutine exited after first Close;
   sync.Once prevented re-init). Rewrote the lifecycle with a
   mutex-guarded gobRunning flag: Close flushes the writer and
   flips gobRunning=false; the next ensureGobWriter observes
   !gobRunning and spawns a fresh goroutine with fresh channels.
   Verified with a local repro (two Insert/Close cycles, both mock
   files written with non-zero size).

2. DeleteMocksForSet only removed mocks.yaml, leaving a stale
   mocks.gob on disk that GetFilteredMocks would then prefer on
   replay. Now deletes both variants (missing-file is not an error)
   so mocks-only refresh actually refreshes the format currently
   in use.

Also made the async gob writer's per-mock error log actionable: it
now names the output directory and tells the operator to check
disk/permissions, re-run with --debug, or fall back via
KEPLOY_MOCK_FORMAT=yaml while triaging.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: address Copilot round-4 comments on mockdb + util + testdb

1. mockdb: extracted isUnfilteredMockKind() as the single source of
   truth for YAML/gob kind classification. Both GetFilteredMocks and
   GetUnFilteredMocks call it, removing the inline switch drift that
   could put PostgresV2 in different buckets depending on on-disk
   format.
2. mockdb: gobReopenLocked now truncates the target file on every
   open instead of O_APPEND. A gob stream's type table lives in the
   encoder; appending a second encoder session to an existing file
   (from a re-record cycle or a cross-test-set switchback) would
   embed a duplicate type table mid-file and break the reader's
   single gob.Decoder. Each session now owns its file exclusively.
3. mockdb: gobFlushAndClose returns the combined flush+close error,
   stored in gobFlushErr and surfaced by Close() so Recorder.Start's
   deferred cleanup logs the real disk-full/permission-change shutdown
   failure instead of silently losing the tail of mocks.gob.
4. mockdb: gobWriterLoop no longer takes InsertMock's ctx. A request-
   scoped ctx cancelling mid-session would exit the writer early and
   leave gobRunning=true with a dead consumer; the writer now shuts
   down only on gobStop (i.e. Close).
5. util: ReadBuffConn carry-over drain now loops until HasPending()
   reports false. Previously a single 64KiB Read could lose in-order
   delivery when a parser had prepended multiple buffers (peek +
   protocol-upgrade prefix).
6. testdb: nextTestIndex first-call now returns seed+1 (the next free
   index) instead of seed (which collided with the last existing
   test on disk). Race path still uses Add(1) on the winning counter.
7. tests: corrected gob_roundtrip_test.go comment that claimed the
   reader re-creates its decoder each iteration — it uses one
   decoder across the whole stream.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: tighten gob writer Close race + gate BPF pin cleanup

Two correctness fixes from Copilot round 5:

1. MockYaml.Close() now holds the lifecycle lock for the entire
   teardown. The earlier version released the lock before waiting
   for gobDone, letting a concurrent ensureGobWriter reassign
   gobQueue/gobStop/gobDone while the previous writer goroutine
   was still selecting on them — a data race that could start two
   writer goroutines on the same mocks.gob and corrupt it. Now
   gobRunning stays true while we drain, concurrent Inserts block
   on the same lifecycle lock, and a second concurrent Close cannot
   double-close gobStop because it observes gobRunning=false after
   the first Close completes. Timeout path leaves gobRunning=true
   so a retry can complete the shutdown rather than racing with a
   half-drained writer. Passes go test -race.
2. CleanOrphanedBPFPins now internally gates on
   HasOtherKeployProcesses(os.Getpid()) and returns early when a
   sibling keploy is alive. Previously the safety precondition was
   'the caller checks first', which is fragile — every new call
   site had to re-enforce it. Moving the check inside matches the
   cgroup-program cleanup path (DetectAndCleanOrphanedCgroupPrograms
   already gates the same way).

Signed-off-by: slayerjain <shubham@keploy.io>

* docs(cleanup): align queryAttachedPrograms comment with implementation

The comment claimed 'bpf.ProgQuery via the raw syscall wrapper' but
queryAttachedPrograms actually calls cilium/ebpf's link.QueryPrograms
(the typed wrapper around BPF_PROG_QUERY). Reworded to match — so a
future maintainer tracing 'which API surface does this depend on?'
reaches the right place.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(testdb): revert nextTestIndex off-by-one

FindLastIndex in pkg/platform/yaml/utils.go already does a trailing
'lastIndex++' before returning — it yields the next available index,
not the largest existing one. The earlier round of comment-driven
review mis-described this and I blindly added another +1, which
would have made a fresh test-set start at test-2 and shifted every
subsequent ID by one.

Reverting to the original contract: seed the counter with the value
FindLastIndex returned, return it directly on the first call, and
use Add(1) on subsequent calls. Passes go test -race on the package.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: address Copilot round-6 comments

- db.go Close(): track gobStopClosed so retries after a flush-timeout
  don't panic on closed channel. Previously gobRunning stayed true on
  timeout but gobStop was already closed; a second Close would hit
  close(closedChan) and panic.
- db.go GetUnFilteredMocks YAML path: drop the inline kind switch and
  call isUnfilteredMockKind, same helper the gob path uses. Single
  source of truth — YAML and gob can no longer drift on which kinds
  fall into the 'unfiltered config' bucket.
- db.go DeleteMocksForSet: error log now tells the operator to check
  read-only attributes, directory ownership, and open handles — the
  only failure modes that would surface here after our os.IsNotExist
  tolerance.
- proxy.go: when SkipProxyListener is set, log 'TCP listener
  intentionally not bound' instead of 'Proxy started at port:N' so
  operators and downstream automation aren't told a port is open
  when it intentionally isn't.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: address Copilot round-7 comments

- proxy.go: after replying 'S' to the Postgres SSLRequest, any non-
  nil error on Peek(5) is now treated as fatal. The previous branch
  only returned on non-EOF errors; an EOF with a partial buffer
  (len>0 but <5) would fall through into the TLS-detection path
  running against a truncated header and misclassify the stream as
  plaintext. Log now includes bytesRead for diagnostics.
- db.go GetFilteredMocks gob path: special-case PostgresV2 so it
  reaches tcsMocks alongside the Mongo / gRPC kinds, matching the
  YAML path. YAML's GetFilteredMocks switch doesn't list PostgresV2
  (so it falls through to 'include'); the gob path's
  !isUnfilteredMockKind was over-excluding it because the helper
  intentionally lists it in the unfiltered bucket. Replay behavior
  is now consistent across .gob and .yaml storage.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: address Copilot round-8 comments

Five substantive fixes:

1. mockdb.insertMockGob now holds the lifecycle lock through the
   channel send. Previously the "writer is alive" check was done in
   ensureGobWriter but the actual send happened after the lock was
   released, so a concurrent Close could transition the writer to
   stop-closed between the two and the enqueued job would end up in
   a queue whose consumer had already drained+exited. With the lock
   held, the invariant 'any job we enqueue is seen by the current
   writer (or its drainAndClose)' holds atomically. Insert also now
   rejects enqueue attempts after stop-closed is set.

2. mockdb.drainAndClose records gobWriteOne errors into gobFlushErr
   (joined via errors.Join) so a disk-full / perm-change during the
   final drain actually surfaces through Close() to the Recorder.Start
   deferred cleanup log, instead of being silently discarded.

3. mockdb.DeleteMocksForSet: reject a testSetID that contains '..'
   or isn't a clean single-segment relative name. Previously, a
   re-record request with testSetID='../../etc' would let os.Remove
   delete files outside the mocks directory.

4. Rename gobWriteJob.testSet → testSetPath to match what it actually
   holds (a full '<MockPath>/<testSetID>' path, not the ID). zap
   field in the async-writer log renamed consistently.

5. proxy.go: when SkipProxyListener is on, emit the Info-level
   startup notice only once (in StartProxy near the DNS-control
   banner). The second log inside start() is now Debug — operators
   don't see the same message twice, but debug traces still have it.

Passes go test -race -count=3 on the mockdb package.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: tighten testSetID validation + fix gofmt + test comment

- gofmt: struct field alignment regressed after the testSet→testSetPath
  rename in the previous commit; reformatted via gofmt -w.
- mockdb.DeleteMocksForSet: tightened validation — rejects empty IDs,
  ".", "..", and any '/' or '\\' separator (the previous guard
  only rejected unclean paths / '..', letting 'a/b' or '.' pass).
  The test-set layout is one directory level under MockPath, so a
  single non-empty segment is the only acceptable shape.
- gob_roundtrip_test.go: TestMockYamlIsIOCloser comment updated to
  describe the RegisterCleanup + deferred drain shutdown mechanism
  (the type-assertion dance comment described the pre-PR pattern).

Not changing the Metadata["type"] accesses Copilot flagged as
nil-deref risks — Go's nil-map reads return the zero value without
panicking, and this pattern is already used at pkg/platform/yaml/mockdb/db.go:339,879
in the pre-existing YAML paths. Adding a defensive nil check would
be cargo-cult, not a real fix.

Signed-off-by: slayerjain <shubham@keploy.io>

* docs(proxy): document InterceptPostgresSSLRequest client-side-only scope

Copilot round-9 flagged that the opt-in flag replies 'S' and MITMs
TLS with the client, but tls.Dial to the upstream does a direct TLS
handshake without forwarding the 8-byte SSLRequest preamble. A real
Postgres server typically rejects direct TLS on 5432.

Rather than add upstream forwarding in this PR (substantial feature
work that needs its own design + test), the docstring now explicitly
states the flag's current scope:

- Client-side only: read SSLRequest from client, reply 'S', TLS MITM.
- NOT a full end-to-end MITM against vanilla Postgres — for that
  path, use the parser-driven TLSUpgrader in the private parser set.
- Usable today in (a) downstream builds that decrypt the client side
  and forward plaintext elsewhere, or (b) against upstreams that
  accept direct TLS on the destination port.

The flag defaults to false, so this does not change any runtime
behavior in the default build. A follow-up PR can add symmetric
upstream SSLRequest forwarding if a production consumer needs it.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(proxy): revert to Peek(5) for default path; extend to 8 only on SSLRequest prefix

The previous Peek(8) was unconditional and deadlocked plaintext
protocols whose first client write is <8 bytes and then waits for
a server response. Redis 'PING\\r\\n' (6 bytes), memcached 'quit\\r\\n'
(6 bytes), and similar greetings would never deliver the 7th and
8th byte, hanging handleConnection until the TCP read deadline.

New flow:
- Peek 5 bytes (what the TLS record header and most prefix checks
  need). Same behavior as pre-this-PR main.
- Only when InterceptPostgresSSLRequest is on and the 5-byte prefix
  matches the SSLRequest signature '00 00 00 08 04', do a second
  Peek(8) to verify the full 8-byte packet before the 'S' reply.
  The prefix is unique enough not to collide with TLS, CONNECT, or
  any other protocol greeting, so the 3-byte wait is deterministic
  — only a real SSLRequest client incurs it.

Plus two doc/log fixes:
- hooks.go: DetectAndCleanOrphanedCgroupPrograms internally gates
  on HasOtherKeployProcesses (as of the round-5 fix). The caller's
  comment still said the check was in the caller, and the caller
  emitted an Info log that duplicated the helper's own Info log on
  cleanup. Comment corrected, duplicate log removed.
- proxy.go isPostgresSSLRequestPrefix helper added, documented
  alongside the full-signature helper.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix: gob flush-err accumulation + DirectChunks ownership copy

Three round-10 Copilot comments addressed:

1. mockdb.gobWriterLoop now accumulates per-mock encode failures
   into gobFlushErr (under gobMu), not just the Error log. Previously
   a steady-state failure in the hot path would be logged but
   Close() could still return nil — the operator would see warnings
   in the log but the cleanup handler would not flag the session as
   lossy.

2. mockdb.gobFlushAndClose now joins its flush/close error with any
   pre-accumulated gobFlushErr instead of overwriting it. The old
   assignment meant that a successful final flush masked earlier
   encode failures. Close now returns the union of errors from the
   whole session.

3. util.ReadBuffConn defensively copies each chunk from
   DirectChunks() before handing it to bufferChannel. Matches the
   allocation semantics of the regular ReadBytes path and makes
   producer-side scratch-buffer reuse safe. The DirectChunkReader
   docstring now states the ownership contract explicitly: chunks
   may alias a shared backing array, consumers must not rely on
   their lifetime past the next push.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(proxy): remove duplicate readyChan send that leaked skipListener goroutine

readyChan is buffered size 1 and start() already sends exactly once
on all code paths — skipListener branch, listen failure, and
listen success. The goroutine's extra 'readyChan <- err' after
start() returned would block forever because the caller only
reads readyChan once: before this PR that was a latent main bug,
but the new skipListener path made the goroutine leak visible on
every shutdown (start returns nil after <-ctx.Done, the second
send blocks, the goroutine never exits, errgroup.Wait hangs).

Made start() the sole writer. Removed the post-return send. The
goroutine still returns the error correctly for errgroup.Go to
propagate.

Also: reset gobOverflows to 0 on Close via atomic.Swap so
re-record cycles on the same MockYaml instance don't accumulate
the prior session's overflow count into the next session's
Close summary.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(docker): apply ProxyPort=0 guard to darwin/colima/windows branches too

The proxyPortStr helper (empty when opts.ProxyPort==0, '-p P:P'
otherwise) was only being substituted into the linux docker-run
alias. The windows, darwin-colima, darwin-docker, and
windows-colima branches all kept the inline
'-p ' + fmt.Sprintf(...) construction, so ProxyPort=0 on those
hosts would still generate '-p 0:0' and fail docker run before
the agent ever started.

Replaced all four call sites with proxyPortStr. Guard now applies
uniformly across every supported host — the skipListener use case
(agent running without a bound proxy port) works on macOS and
Windows too.

Also: declining Copilot's readGobMocks &m 'pointer reuse' flag.
Go allocates a fresh heap address for the block-scoped 'var m'
each iteration whenever '&m' escapes via append, and the existing
TestRoundTrip_MultipleMocksAppend test already validates that
multi-mock gob reads return distinct mocks with correct fields.
Verified independently with a repro (3 append(&m) gives 3
distinct heap addresses). The suggested 'm := new(...)' rewrite
is semantically identical — it would be a no-op refactor, not a
real fix.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(proxy): skipListener shutdown no longer closes session.MC

The non-skip shutdown path closes session.MC only after
clientConnErrGrp.Wait() guarantees every per-conn producer has
exited. My skipListener branch closed the channel immediately on
ctx.Done without that guarantee — DNS mock producers and any
external capture-layer senders still running during shutdown
could hit 'send on closed channel' and panic mid-teardown.

Removed the close. MC's lifetime ends when all references drop;
consumers driven by ctx.Done don't need an explicit close to
stop ranging. Documented the rationale inline so a future edit
doesn't reintroduce the close.

Also made the nsswitch-reset error log actionable with a recovery
hint.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(mockdb): hold lifecycle lock across sync-fallback write

insertMockGob's queue-full fallback released gobLifecycleMu
before calling gobWriteSync. A concurrent Close() could then
flush+close the writer (setting gobFile=nil), and when the
in-flight gobWriteSync acquired gobMu and called gobReopenLocked
it would hit the O_TRUNC path — destroying every mock written
earlier in the session.

Keep the lifecycle lock held across the sync write instead. This
serializes the degraded sync path against Close(), at the cost
of serializing parsers during queue overflow. Since overflow
only fires when the 4096-slot async queue is already full (itself
a degraded-throughput mode), this is an acceptable tradeoff for
the 'no dropped mocks' guarantee the overflow path exists to
provide. Passes go test -race.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: delete unused ensureGobWriter + log BPF pin cleanup errors

- mockdb.ensureGobWriter was dead code after round 8 inlined the
  writer-init inside insertMockGob (under the lifecycle lock so the
  sync-fallback race-fix works). Removed. Nothing else called it;
  doc comments in Close() and the overflow hint now point at the
  actual init site.
- cleanup.CleanOrphanedBPFPins now surfaces os.Remove / os.RemoveAll
  errors instead of silently discarding them. Previously a failed
  unpin would go unreported and the subsequent attach would fail
  with 'program already attached' or 'pin exists' — root cause
  invisible. Log is Error level with an actionable next step (check
  bpffs writability, CAP_SYS_ADMIN/CAP_BPF, no other live keploy
  holding the pin).

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: warn on gob+UpdateMocks + clarify extraparsers.go ref

- mockdb.UpdateMocks: when gob format is active and mocks.gob
  exists, log a Warn explaining that pruning is YAML-only today
  and the gob file will grow. Users who run with
  KEPLOY_MOCK_FORMAT=gob + RemoveUnusedMocks=true previously got
  a silent no-op because the YAML-only read saw no mocks.yaml.
  Gob pruning (read stream, filter, rewrite fresh encoder
  session) is deferred to a follow-up; warning with a concrete
  next step is the responsible intermediate.
- runtime_hooks.go: the InterceptPostgresSSLRequest docstring
  mentioned 'extraparsers.go' which doesn't exist in the OSS
  tree — it's generated at CI time by setup-private-parsers.
  Reworded so a maintainer reading the comment can trace where
  the Postgres parser is actually wired in (CI action +
  integrations repo), not hunt for a nonexistent file.

Signed-off-by: slayerjain <shubham@keploy.io>

* docs(mockdb): gob pruning skip log Warn→Info (per repo logging guidelines)

The gob+UpdateMocks skip is a benign outcome with a documented
next step, not a fault condition — the repo's logging guidelines
discourage new Warn logs for that shape of message. Bumped the
existing remediation text down to Info (no text change) so it
still shows up for the user but doesn't raise the alert level.

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: fix agentRegistartionMap typo + clarify mysql EOF log

- Rename the misspelled field 'agentRegistartionMap' to
  'agentRegistrationMap' so it matches 'clientRegistrationMap'.
  Both call sites updated (hooks.go + comm.go). No behavior change.

- mysql/recorder/record.go: the io.EOF branch on
  handleInitialHandshake was logged as a TLS handoff
  'short-circuit', but it also fires on ordinary client
  disconnect mid-handshake. Reworded the log to be neutral —
  'EOF during MySQL handshake' with a conditional hint about TLS
  handoff — so ordinary disconnects are not misreported as TLS
  events in production logs.

Also updated the PR body: section 4 is now labeled as opt-in
(InterceptPostgresSSLRequest defaults false; default build still
uses the parser-driven TLSUpgrader) and documents the scope
limitation (client-side-only, no upstream SSLRequest forwarding).

Signed-off-by: slayerjain <shubham@keploy.io>

* chore: address round-19 Copilot comments

- cleanup.go: log queryAttachedPrograms failures at Debug with
  actionable context instead of silently returning 0. Not all
  attach types support BPF_PROG_QUERY on every kernel, and
  insufficient privileges produce the same error shape; both
  turn orphan cleanup into a no-op, and the silent path left
  operators with no way to trace why the downstream attach
  failed with 'program already attached'.

- mockdb.DeleteMocksForSet: dropped the overly-strict
  strings.Contains(testSetID, '..') substring check. It rejected
  perfectly valid single-segment names like 'v1..v2' or
  'team..a'. The remaining checks (no separators, not absolute,
  not '.' or '..' verbatim, stable under filepath.Clean) already
  block every directory-escape case — a single-segment name
  with a '..' substring (but no separators) cannot resolve to
  a parent directory.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(mockdb): deep-copy Mock before async-gob enqueue

insertMockGob enqueued the caller's *models.Mock pointer into the
async writer and returned. A caller that subsequently mutated the
same *Mock — e.g. RecordHooks.AfterMockInsert writing telemetry
fields onto the same pointer, or a Mock-pool producer recycling
slots — would race the async gob encoder and persist an
unintended payload on disk.

Fix: call mock.DeepCopy() before enqueue. The helper already
exists in pkg/models/mock.go and handles the nested Noise /
Metadata / per-protocol-slice fields. Copy cost is bounded by
mock size. Not switching to 'encode to bytes synchronously'
because each encoder session must share one type table for the
reader's single-decoder stream — per-InsertMock byte-buffer
encoding would embed a type table per mock and break replay.

gobWriteSync (overflow fallback) does not need the copy: it runs
synchronously on the caller goroutine while the caller is still
blocked in InsertMock, so there's no window for external
mutation during the encode.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(mockdb): UpdateMocks gob-skip keyed on file presence, not current run format

The previous guard checked useGobMockFormat() before looking at
mocks.gob. That diverged from the read paths: GetFilteredMocks
and GetUnFilteredMocks prefer mocks.gob purely by file presence,
regardless of what KEPLOY_MOCK_FORMAT is set to for this run. So
a test-set recorded with gob last time but pruned today under a
yaml-configured run would silently hit the YAML-only pruning
code against a non-existent mocks.yaml — pruning would 'succeed'
by doing nothing while replay still read the untouched gob file.

Now the skip triggers on os.Stat(mocks.gob)==nil regardless of
the current run's configuration, matching the replay side. User
message updated to explain the state they're actually in ('gob
is present, replay prefers it').

Signed-off-by: slayerjain <shubham@keploy.io>

* refactor(mockdb): YAML GetFilteredMocks uses the shared isUnfilteredMockKind helper

Before this PR the YAML path had its own Kind switch that excluded
(Generic, Postgres, Http, Http2, Redis, MySQL, DNS) — default
isFilteredMock=true let PostgresV2 and Mongo fall through to the
append. The gob path I added in this PR was rewritten to use
isUnfilteredMockKind + a PostgresV2 special case, so the
classifier was shared for the unfiltered side but the filtered
side still had two implementations — and the comment on the gob
path claiming it was 'shared with YAML' was misleading.

Replaced the inline switch with the same
'!isUnfilteredMockKind(kind) || kind == PostgresV2' call the gob
path uses. Behavior is identical: PostgresV2 continues to reach
tcsMocks (matches the historical YAML fall-through), every other
kind maps through the shared helper. Both filtered-mock paths
now reference the same classifier, so a future kind addition
only has to land in isUnfilteredMockKind to cover YAML and gob
uniformly.

Signed-off-by: slayerjain <shubham@keploy.io>

* perf(cleanup): open cgroup FD once across keployBPFProgNames loop

DetectAndCleanOrphanedCgroupPrograms iterates every entry in
keployBPFProgNames and called detachOrphanedByName for each one —
and detachOrphanedByName used to open/close the cgroup dir per
call. That's 2 extra syscalls times the number of program names
on every keploy startup.

Moved the os.Open to the caller so the same *os.File is shared
across every per-name query/detach in one cleanup pass. Function
signature takes *os.File + the path (path kept for log context).

No behavior change: same scan, same detach decisions, same safety
gate (HasOtherKeployProcesses checked once before we open the
FD). Passes go test -race.

Signed-off-by: slayerjain <shubham@keploy.io>

* ci: cap per-job timeout at 30min on Windows + fuzzer workflows

A run of run_golang_native_windows/go-dedup burned 6h5m of
windows-latest runner time. Root cause: keploy's Windows native
capture layer (FFI) enters a tight 'Packet sent successfully!'
loop if the intercepted process exits before the capture
driver/DLL unloads. Job log showed 76,977 iterations in the last
4 min of runtime before GitHub Actions' default 6h hard timeout
fired. Fixing that FFI shutdown path is a separate Windows-native
change outside keploy#4045's scope.

Pragmatic CI fix landing here: add 'timeout-minutes: 30' to every
job that could potentially sit in a capture-layer shutdown loop:

- .github/workflows/golang_native_windows.yml
- .github/workflows/golang_docker_windows.yml
- .github/workflows/fuzzer_linux.yml (all 5 fuzzer jobs)

Healthy runs on these workflows finish in 3-5 min, so 30 min is
~5-10x headroom. 30 matches the existing timeout used on
golang_docker_macos.yml, java_linux.yml, prepare_and_run_macos.yml,
python_docker_macos.yml. A hung future run fails in 30 min rather
than 6h — faster feedback, no stuck runner slot.

The underlying Windows FFI shutdown bug is tracked for a separate
fix in keploy's windows-native capture implementation.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(hooks/windows): cap StopRedirector shutdown at 10s

The run_golang_native_windows/go-dedup job hit GitHub Actions'
6h runner timeout because keploy's Windows Hooks.unLoad blocked
unconditionally on StopRedirector(). That call crosses into the
Rust capture side (libwindows_redirector.a), which owns a
WinDivert recv/send loop — if its stop path hangs (deadlock in
the Rust side, slow wake of the recv thread), the Go process
lingers until the OS tears it down, and the CI log fills with
'Packet sent successfully!' from the still-running loop.

Wrap the StopRedirector call in a goroutine with a 10 s deadline:
- happy path: Rust returns, we log the error (if any) and proceed
- stuck path: we log an actionable 'driver didn't return in time'
  error naming the symptom and a pointer to src/ffi.rs, then
  return so the Go side finishes its own shutdown sequence

This doesn't fix the underlying Rust shutdown bug (that's a
separate change in the Rust redirector), but it prevents a
single stuck capture from holding the whole keploy process — and
therefore a CI runner — for 6 hours. Paired with the 30-min
workflow timeout landed earlier, any future recurrence fails
fast and visibly instead of silently burning runner time.

Signed-off-by: slayerjain <shubham@keploy.io>

* feat(proxy): gob mock pruning + Postgres SSLRequest upstream forwarding

Two follow-ups folded in — both were documented as deferred in
prior rounds; landing them here closes the scope gaps.

1. mockdb.updateMocksGob: when UpdateMocks hits a test-set backed
   by mocks.gob, it now reads the full stream via readGobMocks,
   applies the exact same filter as the yaml path (keep config
   mocks, consumed names, post-replay writes, pre-first-test
   startup mocks), and atomically rewrites a fresh single-encoder
   stream with the magic header. Quiesces any active async writer
   on this MockYaml via Close() before touching the file so a
   concurrent InsertMock cannot race the truncate-and-rewrite.
   Previously the path emitted an Info-log 'pruning is YAML-only'
   and returned nil; gob-recorded sessions with
   RemoveUnusedMocks=true would now see their mocks.gob actually
   shrink.

2. proxy.dialPostgresSSLUpstream: when
   agent.InterceptPostgresSSLRequest is on and destInfo.Port is
   5432, the upstream dial now does SSLRequest + read 'S'/'N' +
   tls.Client.Handshake instead of tls.Dial. A real Postgres
   server on 5432 rejects a direct TLS handshake; the old code
   only handled the client side of the flag, so upstreams would
   fail. Non-5432 destinations still go through tls.Dial.

   'N' from upstream surfaces a clear error: 'the upstream was
   built without TLS — enable SSL on Postgres or disable
   InterceptPostgresSSLRequest for this deployment'. Unexpected
   non-S/N bytes surface a 'not actually Postgres on 5432' error.

3. runtime_hooks.go InterceptPostgresSSLRequest docstring
   updated: the flag now covers both sides of the handshake for
   5432 destinations; no longer labeled 'client-side-only'.

Signed-off-by: slayerjain <shubham@keploy.io>

* style: gofmt comment alignment in runtime_hooks.go

Signed-off-by: slayerjain <shubham@keploy.io>

* chore(windows): bump vendored libwindows_redirector.a

Built from a companion repo workflow (branch
fix/windows-shutdown-log-spam, commit f87c18c) by the new
self-hosted Build libwindows_redirector.a workflow.

Pulls in the shutdown and logging fixes that addressed the Windows
go-dedup CI hang: bounded_join in HandleGuard's Drop with a shared
shutdown budget, hot-path log macros instead of println! spam, and
trace-level get_destination logging (off by default via fern's
LevelFilter::Debug).

sha256 3dd7a67e3522970edab08515b08c8d5a210988dcf9eaa019172e0b2a1961896b

Signed-off-by: slayerjain <shubham@keploy.io>

* fix: narrow syncMock recover() + preserve mocks.gob perms on prune

Addresses two Copilot review comments on #4045 (round 21):

1. syncMock.AddMock: the defer func(){_=recover()}() around the
   closed-channel send swallows every panic, including unrelated
   bugs unrelated to the shutdown race. Narrowed to only swallow
   runtime.Error("send on closed channel"); anything else re-panics
   so real bugs still surface. Message text is the stdlib constant
   emitted by chansend1 for closed-channel sends, stable across Go
   versions.

2. updateMocksGob: os.CreateTemp creates the tmp file 0600, so
   without a chmod before os.Rename pruning silently narrowed
   mocks.gob from 0644 (writer's typical mode via umask 0022) down
   to owner-only, breaking replay for any other user/process. Stat
   the source file first, apply the same Mode to the tmp file
   before the rename; fall back to 0644 when the source is missing
   so we don't introduce a new mode-inheritance path.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(syncMock): type-assert to error, not runtime.Error, for closed-channel panic

Copilot correctly flagged that runtime's plainError (the type used
for 'send on closed channel' via chansend1) satisfies the error
interface but does NOT implement the exported runtime.Error. The
previous r.(runtime.Error) type assertion would fail and re-panic,
defeating the shutdown-safety the recover is there to provide.

Switched to r.(error) + message match — this is the pattern the Go
stdlib tests use for this exact panic. Added two tests that pin
down the behavior so a future refactor cannot silently regress:

  - TestAddMockSwallowsClosedChannelPanic: full path through
    AddMock with a closed outChan; the goroutine must return
    without propagating the panic.
  - TestAddMockRepanicsUnrelatedValues: table-driven, proves the
    recover is narrow (closed-channel error swallowed; other
    runtime errors and plain string panics re-panic).

Signed-off-by: slayerjain <shubham@keploy.io>

* chore(windows): rebump vendored libwindows_redirector.a

Built from a companion repo workflow (branch
fix/windows-shutdown-log-spam, commit 524c407) by the self-hosted
Build libwindows_redirector.a workflow at run 24607337104.

Incorporates the follow-up Copilot fixes landed on wr#2 since the
previous bump:
  - panic payload downcast (&'static str / String) so bounded_join
    logs the actual panic message instead of a generic marker
  - 'rustup override set stable' instead of mutating the runner-wide
    default toolchain
  - explicit contents:write job permission for the release-attach
    step (GitHub Actions does not evaluate ${{}} in permissions,
    so kept static)
  - disk-space cleanup step before checkout (cargo registry/git,
    stale target/, runner _diag/_temp) to handle the shared
    self-hosted runner's C: volume filling up

sha256 4776e6bbf879a3a74b111e0a4c2c6ed8f9ae67b20934701ddcf8794aa3a81fb7

Signed-off-by: slayerjain <shubham@keploy.io>

* ci: bound build-and-upload + pull-docker-image Windows jobs at 30min

Run 24608731505's build-and-upload job sat at 'in_progress' for 40+
min on windows-runner-02 with no step progress visible, blocking
every other Windows job queued on the same runner (wr#2 stuck in
queue behind it).

prepare_and_run_windows.yml's build-and-upload and pull-docker-image
jobs had no timeout-minutes, so a stuck CGO link or docker-pull
hang could hold the runner for 6h. Added 30-min caps consistent
with the ones already on golang_native_windows.yml and
golang_docker_windows.yml. Healthy runs are ~5 min; 30 gives 6x
headroom.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(proxy): serialize close(session.MC) with SyncMockManager sends

Root-cause of the petclinic record_build_replay_latest hang on #4045.

# What happened in CI

run_java_linux / spring-petclinic-rest (record_build_replay_latest)
failed at the 30-min workflow timeout. Last log line was the 80%
progress marker during record; then ~28 minutes of silence before
the operation-cancelled marker. Orphan processes at cleanup:
keploy, java, sh, bash — all still alive.

# Repro and RCA

Ran the same record flow locally under -race:

    WARNING: DATA RACE
    Write at 0x00c0004b5820 by goroutine 46:
      runtime.closechan()
      proxy.go:576 close(p.session.MC)
      proxy.(*Proxy).start.func2
    Previous read at 0x00c0004b5820 by goroutine 69:
      runtime.chansend()
      syncMock.(*SyncMockManager).AddMock.func1
      generic.createGenericMocksAsync.func1

createGenericMocksAsync runs as a bare 'go' goroutine spawned from
Generic.RecordOutgoing — it is NOT tracked by clientConnErrGrp, so
the proxy shutdown path's errgroup.Wait() returns while the parser
is still producing mocks. The shutdown goroutine then closes
session.MC concurrently with AddMock's outChan <- mock.

The prior recover() in AddMock swallowed the resulting 'send on
closed channel' panic, but did not prevent the race itself. Under
-race the runtime reports unsynchronized memory access, and the
race correlated with the 28-min hang.

# Fix

Added sync.RWMutex (outChanMu) + outChanClosed bool to
SyncMockManager. AddMock takes RLock across the send; a new
CloseOutChan method takes Lock across close(outChan). This
serializes every in-flight send against the close without
contending parallel sends. The recover() stays as defence in depth
for the narrow window between the closed check and the send.

proxy.go:576 now calls mgr.CloseOutChan() instead of direct close.
SetOutputChannel clears outChanClosed so re-record flows (same
manager, new channel) keep working.

# Verification

    $ go test -race ./pkg/agent/proxy/syncMock/ -run TestCloseOutChanRacesAddMock
    --- PASS: TestCloseOutChanRacesAddMock (0.00s)
    ok go.keploy.io/server/v3/pkg/agent/proxy/syncMock

Local petclinic reproducer (10 rounds × 50 chains × ~5 requests,
~2500 requests total, 2555 mocks captured): zero data-race reports,
4-second clean shutdown on SIGTERM. The 28-min hang does not recur.

New test TestCloseOutChanRacesAddMock pins the behavior: 8
goroutines × 500 sends racing against a midstream CloseOutChan
stay clean under go test -race.

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(proxy): address full mock-channel race class + wire DNS through the guard

Applies the principal-engineer code-review to the 0985bf0 fix. That
commit blocked only the AddMock vs close race; this commit covers
the whole race class and drops the scaffolding that was only needed
for the partial fix.

# Changes

1. Single RWMutex (outChanMu) now guards both outChan and
   outChanClosed. Read outChan inside the RLock so sender observation
   and send are consistent under one lock, fixing the TOCTOU where
   a Close → SetOutputChannel(new) interleaving would have let
   AddMock send on its stale local copy of the already-closed
   channel.

2. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go) now routes
   its send loop through the same sendToOutChan helper. Its
   previous 'capture outChan under m.mu, release, send outside any
   lock' pattern was the same race class as AddMock's but on a
   different caller (DedupQueue.ResolveJob from mongo/dedup
   parsers). Without this fix, the mongo-dedup fuzzer would
   reproduce the same -race hit on a different line.

3. dns.go's async DNS mock path (pkg/agent/proxy/dns.go) now goes
   through SyncMockManager.SendConfigMock instead of a bare
   'session.MC <- mock'. SendConfigMock bypasses AddMock's
   firstReqSeen buffering — DNS is a separate key-value map (one
   mock per unique (name, qtype) via the p.recordedDNSMocks
   dedupe), not a streaming parser, so the correct behaviour is
   always forward immediately. The outChanMu guard applies
   identically to AddMock and SendConfigMock, so DNS and parser
   paths close-vs-send-serialize the same way.

4. Dropped the recover() defence-in-depth inside AddMock: once
   outChan is read under the same RLock that gates outChanClosed,
   the closed-channel panic cannot happen, and a stray recover
   would only mask real bugs.

5. Dropped ResetForNewSession — unused; SetOutputChannel already
   clears outChanClosed, and piling a second way to do the same
   thing invites drift.

6. AddMock no longer buffers after CloseOutChan. Post-close, the
   mock is dropped instead of leaking into a buffer that
   ResolveRange will never drain. (Matches the old recover-based
   drop semantics; preserves buffering when outChan is simply
   unbound.)

# Tests

All under go test -race. Added:
  - TestCloseOutChanRacesResolveRange: the ResolveRange equivalent
    of the AddMock race test. 4 goroutines × 200 ResolveRanges
    concurrent with a midstream CloseOutChan. Pre-fix this would
    race; post-fix clean.
  - TestSetOutputChannelAfterCloseReopens: pins the Close → SetOC
    → AddMock sequence. Regression for the TOCTOU surfaced in
    code review.
  - TestSendConfigMockDropsAfterClose: DNS path post-close drops
    silently.
  - TestSendConfigMockIgnoresFirstReqSeen: SendConfigMock keeps
    forwarding after SetFirstRequestSignaled (AddMock would
    buffer — DNS must not).
  - TestCloseOutChanRacesSendConfigMock: concurrent DNS-path
    senders vs close.

Kept and updated:
  - TestAddMockAfterCloseDropsSilently (was TestAddMockSwallows
    ClosedChannelPanic): same semantic pinned, new mechanism.

Removed the TestAddMockRepanicsUnrelatedValues harness — the
recover() it was pinning is gone.

# Verification

  $ go test -race ./pkg/agent/proxy/syncMock/...
  PASS (8 tests)

Full petclinic reproducer, 10 stress rounds × 50 chains × ~5
requests/chain, ~2500 requests total:
  - 2590 mock files captured
  - Zero 'WARNING: DATA RACE' lines in the keploy log
  - Clean SIGTERM shutdown in 4 s (down from the 30-min timeout
    observed on the bad run)

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(proxy,ci): address round-24 Copilot review on syncMock refactor

Four follow-up fixes on ee0332e plus the Windows runner disk-full
that failed this PR's own build-and-upload job.

1. ResolveRange (pkg/agent/proxy/syncMock/syncMock.go): the
   pre-release read of m.outChan was still under m.mu alone,
   racing SetOutputChannel/CloseOutChan which write it under
   outChanMu. Switched to outChanStatus() — same RLock path the
   send uses — so the consistency check is now under the lock
   that protects the field.

2. SetOutputChannel idempotence: the hot path is buildRecordSession
   re-binding rule.MC (same pointer) on every accepted connection.
   Previously it unconditionally cleared outChanClosed, so a
   post-shutdown re-bind turned a closed channel back 'open' and
   the next send would panic. Now only clears the flag when the
   channel POINTER actually changes (new session / re-record). New
   test TestSetOutputChannelSamePointerAfterCloseKeepsClosed pins
   this.

3. dns.go kept calling SetOutputChannel on every mock. With the
   idempotence fix in #2 above it's now safe, but the comment
   block is updated to spell out why, and the call is ordered
   before AddMock/SendConfigMock so the first DNS query (which
   may fire before the first TCP connection triggers
   buildRecordSession) still binds the channel.

4. proxy.go:890 comment block claimed InterceptPostgresSSLRequest
   is 'client-side only' — it isn't, dialPostgresSSLUpstream
   re-originates SSLRequest upstream when dest port is 5432.
   Rewrote the comment to match actual behaviour and point at the
   runtime_hooks.go docstring.

5. prepare_and_run_windows.yml build-and-upload failed on ee0332e
   with 'There is not enough space on the disk' writing go-build
   importcfg under SystemTemp. Ported the tiered disk cleanup
   from a companion repo workflow: fast path on >=4 GB free,
   light sweep (SystemTemp\go-build*, sibling runners' _diag /
   _temp, other-repo _work >7d), heavy sweep (%TEMP%, Windows
   Update cache), last-resort go module cache purge, hard-fail
   under 2 GB.

Verification:
  go test -race ./pkg/agent/proxy/syncMock/... — 9 tests pass
  petclinic local 5-round stress, 1295 mocks captured, zero
  WARNING: DATA RACE, 4 s clean SIGTERM shutdown.
Signed-off-by: slayerjain <shubham@keploy.io>

* fix(proxy,mockdb): round-25 Copilot — ctx in gob prune, ResolveRange drops stale on close

Three items from Copilot round 25 on ee0332e/52c9ba5, plus the PR
description update (applied separately via gh api PATCH).

1. pkg/platform/yaml/mockdb/db.go — updateMocksGob ignored its ctx
   (`_ = ctx`). Wired ctx.Done() checkpoints at the entry of the
   function and every 1024 iterations of the filter and encode
   loops so a cancelled recorder aborts a large prune instead of
   finishing it for no one. 1024 is a cheap stride: syscall cost
   is amortised while cancellation latency stays under a few ms
   even on ~10^5-mock test-sets.

2. pkg/agent/proxy/syncMock/syncMock.go — ResolveRange was only
   snapshotting the `bound` boolean from outChanStatus and
   discarding `closed`. That conflated 'not wired yet' with
   'shutdown in progress': both fell through to the retain path,
   so any re-record session that rebound a fresh channel via
   SetOutputChannel could inherit stale mocks from the previous
   session's buffer. Thread `closed` through and drop instead
   of retaining in both branches (window-match and retention).

   New test: TestResolveRangeDropsStaleMocksAfterClose pins both
   branches clean across the close boundary.

Verification:
  - go test -race ./pkg/agent/proxy/syncMock/... ./pkg/platform/yaml/mockdb/... (all pass)

PR description also updated to match the actual upstream-side
SSLRequest scope (was mistakenly labeled 'client-side-only' —
dialPostgresSSLUpstream re-originates SSLRequest on port 5432).

Signed-off-by: slayerjain <shubham@keploy.io>

* fix(mockdb): reject volume-qualified testSetID in DeleteMocksForSet

Copilot round 26 on keploy#4045: the existing path-traversal guard
on DeleteMocksForSet covered separators, absolute paths, and
'.'/'..' but not Windows volume qualifiers. 'C:' has no separators,
isn't absolute on Linux, and Cleans to itself — but on Windows
filepath.Join(base, 'C:') can drop the base and resolve at the
drive root, turning os.Remove into an arbitrary-file delete on a
self-hosted Windows runner.

Added two layered rejections:

1. filepath.VolumeName(testSetID) != '' — catches the case at
   runtime on Windows (returns 'C:' for 'C:', '\\server\share'
   for UNC, '\\?\C:' for extended-length, all non-empty).

2. strings.ContainsAny(':' ...) — filepath.VolumeName is
   Windows-specific and returns '' on Linux for 'C:', so the
   explicit colon check makes the Linux build reject the same
   strings the Windows runtime would. Legitimate test-set names
   never carry a colon. strings.HasPrefix for '\\' is covered
   by the existing ContainsAny('/\\') check.

Test TestDeleteMocksForSetRejectsVolumeQualifier covers four
shapes: 'C:', 'D:foo', '\\srv', '\\?\C:'.

Also replied to Copilot's secondary comment on readGobMocks:line 961
(same-iteration &m reuse claim) explaining why it's cargo-culted
— the var is declared inside the for-body, not in the for-init,
so each iteration gets a fresh heap-escaping allocation. The
existing TestGobRoundtrip* suite would fail if the claim were
correct.

Signed-off-by: slayerjain <shubham@keploy.io>

* ci(windows): use Windows PowerShell 5.1 instead of pwsh on self-hosted runners

The new win-runner-[1-4] self-hosted pool that came online after
the old windows-runner-0[123] were retired does not ship
PowerShe…
slayerjain added a commit that referenced this pull request Apr 19, 2026
The Windows go-http-no-deps record phase on #4076 run
24629542035/job/72014589521 hit a 0-tests-recorded failure because
the readiness probe broke on a single 200 response, and that first
200 came from a transient Docker Desktop port-publish stub rather
than the in-container Go binary. Four seconds later the real
traffic got "Unable to connect to the remote server" on the first
request, and the single try/catch wrapping all six calls swallowed
the error with $sent=0. No test-set-0 directory ever formed, and
the script exited with "Test directory not found".

Three-part fix, applied identically to the docker and native
variants under golang/go-dedup:

1. Readiness probe now requires $stabilityCount (=3) consecutive
   200 responses with a 1s gap before declaring ready. A single
   flap from a port-forwarder stub can no longer race real traffic.
2. After the probe succeeds, settle for $settleSec (=5) before
   firing traffic so keploy's recording-proxy interception layer
   has time to fully wire up against the container's listener.
3. Replace the all-in-one try/catch around the six HTTP calls with
   an Invoke-WithRetry helper (5 attempts, 1.5x exponential
   backoff). A transient connection-refused on call #1 no longer
   aborts calls #2..#6.

If the probe never reaches the stability threshold inside the
deadline, the script still proceeds so downstream record
validation can surface the failure with its own clearer error
message, instead of us silently pretending the app was ready.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain added a commit that referenced this pull request Apr 19, 2026
…ilently drop mocks (#4076)

* fix(syncMock): bounded-block sendToOutChan so burst capture doesn't silently drop mocks

sendToOutChan used a non-blocking send with a default-drop:

    select {
    case m.outChan <- mock:
    default:
    }

Under burst capture on an oversubscribed runner (CI, a customer's
box under GC pressure, downstream agent momentarily backpressuring
the outChan), this silently dropped pre-first-request mocks.
Customers saw "some calls didn't replay" with no actionable log
signal — the anti-pattern the Windows redirector shutdown loop hit.

Replace with:

 1. Non-blocking fast path (unchanged — the common case where the
    consumer is keeping up has zero latency cost).
 2. On fast-path miss, fall through to a bounded block
    (sendBudget = 200 ms). This absorbs a GC pause or a transient
    downstream stall without losing data.
 3. On bounded-block timeout, increment a process-global dropCount
    and emit a rate-limited Warn (first drop immediately, then
    every 1024th). Operators see the drop instead of wondering why
    replay is missing mocks.

The RWMutex contract is preserved: the send still runs under
outChanMu.RLock, so CloseOutChan (write-lock holder) can't
interleave a close between the not-closed check and the send. The
only new cost is CloseOutChan may now wait up to sendBudget for
in-flight sends — imperceptible vs. a process shutdown, and every
concurrent sender was already racing the same RLock.

Fixes a recurring Redis loss-rate / channel-race flake on
keploy/enterprise's CI where the tests feed syncMock via
SetOutputChannel and never call SetFirstRequestSignaled, so every
recorded mock flows through this path.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(syncMock): plumb logger, atomic.Uint64 drop counter, drop-path tests

Addresses the Copilot feedback on the bounded-send PR:

- Plumb *zap.Logger through SyncMockManager (SetLogger) and log the
  overflow via that logger instead of zap.L(). syncMock is a
  package-level singleton that loads before any zap.ReplaceGlobals,
  so zap.L() silently fell back to Nop and the drop Warn never
  reached operators. Proxy.New now wires the proxy logger in once.
- Promote the sampled drop log from Warn to Error with a shorter,
  actionable message so overflow is loud and grep-friendly.
- Switch dropCount from bare uint64 + atomic.AddUint64 to
  sync/atomic.Uint64 so the 32-bit alignment requirement becomes a
  compile-time property instead of a field-ordering footgun.
- Export DropCount() for tests and external observability.
- Add regression tests covering: no-drop when the consumer drains
  within sendBudget, drop + Error-log + dropCount increment past
  budget, sampling cadence (n=1, 1024, 2048), nil-logger Nop
  fallback, and concurrent-increment atomicity under load.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci(windows/go-dedup): robust app-ready probe + per-request retry

The Windows go-http-no-deps record phase on #4076 run
24629542035/job/72014589521 hit a 0-tests-recorded failure because
the readiness probe broke on a single 200 response, and that first
200 came from a transient Docker Desktop port-publish stub rather
than the in-container Go binary. Four seconds later the real
traffic got "Unable to connect to the remote server" on the first
request, and the single try/catch wrapping all six calls swallowed
the error with $sent=0. No test-set-0 directory ever formed, and
the script exited with "Test directory not found".

Three-part fix, applied identically to the docker and native
variants under golang/go-dedup:

1. Readiness probe now requires $stabilityCount (=3) consecutive
   200 responses with a 1s gap before declaring ready. A single
   flap from a port-forwarder stub can no longer race real traffic.
2. After the probe succeeds, settle for $settleSec (=5) before
   firing traffic so keploy's recording-proxy interception layer
   has time to fully wire up against the container's listener.
3. Replace the all-in-one try/catch around the six HTTP calls with
   an Invoke-WithRetry helper (5 attempts, 1.5x exponential
   backoff). A transient connection-refused on call #1 no longer
   aborts calls #2..#6.

If the probe never reaches the stability threshold inside the
deadline, the script still proceeds so downstream record
validation can surface the failure with its own clearer error
message, instead of us silently pretending the app was ready.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci(windows/go-dedup): literal port substitution + patched YAML dump

Root cause of the #4076 run 24629876059 failure: the
`(?m)("?)8080:8080\1` backref regex intermittently produces a
malformed port line on Windows PowerShell 5.1 — the leading quote
is stripped while the trailing one is kept, yielding
`- 54652:8080"`. docker-compose parses that as `containerPort:
8080"` and rejects it with `invalid containerPort: 8080"`. No
container starts, every HTTP call gets ECONNREFUSED, and the
record phase records zero tests.

My earlier readiness-probe + retry work on this PR correctly
surfaced the downstream symptom (consecutive 200s never observed,
all six traffic requests exhaust their retry budget), but it was
papering over the actual bug: the YAML itself was broken before
docker-compose ever ran. No amount of client-side resilience can
recover from a container that was never started.

This change:

1. Replaces the backref regex with a plain `.Replace("8080:8080",
   "<port>:8080")` — the samples-go/go-dedup compose file uses the
   double-quoted short form exclusively, so literal substitution
   is both simpler and provably correct.
2. Validates that the expected port fragment is present after
   substitution, throwing loudly with the unpatched YAML printed
   if the sample ever introduces a form the script doesn't
   handle. Silent fall-through to an unpatched run was what let
   this slip undetected for so long.
3. Dumps the patched YAML to the job log so any future
   containerPort / hostPort diagnostic lands with the exact text
   docker-compose parsed — no more guessing whether the regex
   behaved.

Native-windows script is left untouched — it runs the Go binary
directly, not dockerized, and does not hit this code path.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain added a commit that referenced this pull request Apr 19, 2026
Two orthogonal fixes to stop unrelated Linux CI flakes from
blocking the Windows CI work on #4077.

gin-mongo: one record iteration on run 24631193918 hung for 140+
minutes and never produced a log we could RCA from. Three things
made it possible for a single keploy misbehavior to stall the
entire job:

1. The "wait for app to respond" loop had no deadline — any
   issue on the keploy/mongo/docker side where the app never
   started would loop on curl + sleep forever. Now bounded to
   5 minutes per iteration; fails the iteration with a clear
   error instead of stalling.
2. SIGINT was best-effort — the script sent SIGINT, waited up
   to 30s, and then proceeded to `wait`, which blocks forever
   on the tee'd stdout of a keploy record process that hasn't
   exited. Now: if keploy is still alive after the 30s grace,
   send SIGQUIT first (Go runtime dumps all goroutine stacks
   into the already-tee'd app log — exactly the RCA data we
   were missing), then SIGKILL. Future hangs leave a
   post-mortem in ${app_name}.txt.
3. (Consequence of #2) `wait` now resolves promptly because
   the background pipeline closes.

pokeapi: run 24631193918/job/72018929505 failed with a TLS
handshake timeout from proxy.golang.org when go build fetched
github.com/go-chi/chi@v1.5.5. go build has no built-in retry for
module download; one transient handshake drop kills the job.
Wrap in a bounded retry loop with exponential backoff (5s, 10s,
20s, ...) and an explicit GOPROXY fallback to direct.

Neither script touches keploy Go code. Both fixes are scoped to
their respective samples' runners.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain added a commit that referenced this pull request Apr 19, 2026
…replay stderr (#4077)

* ci(windows/go-dedup): gate probe on keploy PID, neutralize replay stderr

Two orthogonal failures surfaced on #4076 run 24630134970
once the port-patching RCA from 2fdd805 was in:

1. The readiness probe exits after ~5s because the loop gates on
   `$recJob.State -eq 'Running'` and the Start-Job tailing the
   keploy log files intermittently drops out of 'Running' on
   Windows when Get-Content -Wait races a log rotation. The tail
   job is a diagnostic helper, not the signal we want — the real
   "is keploy still working" fact is whether the keploy process
   itself (tracked in $REC_PID) is alive. Switch the loop
   terminator, drop the fragile dependency on the tail job.

2. Replay completes successfully but the script exits 1 because
   `& $env:REPLAY_BIN @testArgs 2>&1 | Tee-Object` promotes
   docker's benign stderr lifecycle lines ("Container …
   Recreate") to a terminating NativeCommandError under
   ErrorActionPreference=Stop. The pass/fail truth is the report
   YAML we validate immediately after, not PowerShell's opinion
   of native-command stderr. Temporarily flip EAP to 'Continue'
   just for the replay invocation and record $LASTEXITCODE so
   the downstream report check has more context if it fails.

Both fixes applied symmetrically to the docker and native
variants; the native script was carrying the same two patterns
even though only the docker one has failed so far.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci(windows/go-dedup): pass --port to replay so it hits the published host port

Record captures the testcase URL as `localhost:8080` — the
container-internal port the Gin app actually binds to. The script
patches docker-compose to publish on `$appPort`:8080 (random high
port) to sidestep vpnkit / wsl's chronic hold on :8080 on Windows
Docker Desktop runners. That worked for record (which hits
$appPort from the host directly), but broke replay.

keploy test re-dials the recorded URL (`localhost:8080`) on the
host. On Windows there's no host→container bridge at :8080 with
our patched port mapping — only :$appPort. All 9 recorded
testcases on run 24630753752/job/72017828188 failed with ACTUAL
STATUS 0 and "dial tcp [::1]:8080: connectex: No connection could
be made because the target machine actively refused it."

keploy's `test --port` flag is exactly for this — it rewrites the
testcase destination at replay time, so recorded `localhost:8080`
becomes `localhost:$appPort` and lands on docker's actual
publisher. The flag is documented as "Custom http port to replace
the actual port in the testcases" (cli/provider/cmd.go:345).

Docker variant only — native-windows runs the Go binary directly,
no port mapping to rewrite.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci(linux): bound gin-mongo runtime + retry pokeapi go-build

Two orthogonal fixes to stop unrelated Linux CI flakes from
blocking the Windows CI work on #4077.

gin-mongo: one record iteration on run 24631193918 hung for 140+
minutes and never produced a log we could RCA from. Three things
made it possible for a single keploy misbehavior to stall the
entire job:

1. The "wait for app to respond" loop had no deadline — any
   issue on the keploy/mongo/docker side where the app never
   started would loop on curl + sleep forever. Now bounded to
   5 minutes per iteration; fails the iteration with a clear
   error instead of stalling.
2. SIGINT was best-effort — the script sent SIGINT, waited up
   to 30s, and then proceeded to `wait`, which blocks forever
   on the tee'd stdout of a keploy record process that hasn't
   exited. Now: if keploy is still alive after the 30s grace,
   send SIGQUIT first (Go runtime dumps all goroutine stacks
   into the already-tee'd app log — exactly the RCA data we
   were missing), then SIGKILL. Future hangs leave a
   post-mortem in ${app_name}.txt.
3. (Consequence of #2) `wait` now resolves promptly because
   the background pipeline closes.

pokeapi: run 24631193918/job/72018929505 failed with a TLS
handshake timeout from proxy.golang.org when go build fetched
github.com/go-chi/chi@v1.5.5. go build has no built-in retry for
module download; one transient handshake drop kills the job.
Wrap in a bounded retry loop with exponential backoff (5s, 10s,
20s, ...) and an explicit GOPROXY fallback to direct.

Neither script touches keploy Go code. Both fixes are scoped to
their respective samples' runners.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci: fix pgrep pattern that silently missed keploy PID on Linux runners

## RCA

On #4077 run 24631193918 the `gin-mongo
(record_latest_replay_build)` job hung for 140+ minutes. The
culprit was a pgrep-pattern bug in the CI bash helpers, not a
hang in keploy itself.

Every affected script does:

    "$RECORD_BIN" record -c "./someApp" 2>&1 | tee app.txt &
    ...
    REC_PID="$(pgrep -n -f 'keploy record' || true)"
    sudo kill -INT "$REC_PID"
    ...
    wait   # at the bottom of the outer loop

The Linux CI artifact is named `keploy-record` (hyphen).
`pgrep -n -f 'keploy record'` — with a literal SPACE, not a
hyphen — never matches a command line of the form
`keploy-record record -c …`. `pgrep` exits non-zero, `|| true`
swallows it, `REC_PID` is empty, and `sudo kill -INT ""`
degenerates into a silent no-op ("failed to parse argument").

Keploy therefore never receives the shutdown signal, keeps
running indefinitely, its tee'd stdout never closes, and the
terminal `wait` at the bottom of the outer loop blocks on that
pipe until GitHub Actions' 6-hour hard timeout kills the job.

Reproduced the empty-pgrep and silent-no-op locally with the
same binary naming convention (full repro stored in
/tmp/ginmongo-repro): keploy stays alive, `kill: failed to
parse argument: ''` is printed, and subsequent iterations see
a stale keploy holding :27017 and 8080.

## Fix

Replace the hard-coded literal with the basename of
`$RECORD_BIN`, which is always set before these scripts run:

    REC_PID="$(pgrep -n -f "$(basename "${RECORD_BIN:-keploy}") record" || true)"

Matches both `keploy record …` and `keploy-record record …`
regardless of which artifact the job picked up (docker builds
produce `keploy`; the linux native artifact produces
`keploy-record`). Defaults to `keploy` so manual invocations
(`bash script.sh` with no env) still work.

## Scope

Applied to all 13 bash scripts carrying the buggy pattern:
fuzzer/{mongo,postgres,mysql,rerecord}, golang/{gin_mongo/*,
risk_profile,dns_mock,echo_sql/docker,proxy_stress_test/docker},
node/express_mongoose, python-docker, python/django_postgres.

## Defense-in-depth (already in 76e9dc6)

The gin-mongo helper also gained SIGQUIT→SIGKILL escalation
after 30s, so if pgrep ever misses again we get goroutine
stacks in the tee log instead of another blank 6-hour timeout.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain added a commit that referenced this pull request Apr 24, 2026
…ext_step, block-comment allowlist, honest chaos skip

Responds to Copilot tick #2 on #4113:

- fakeconn/fakeconn.go: Read/ReadChunk now re-fetch the deadline
  channel on every loop iteration. A concurrent SetReadDeadline
  closes a "deadline changed" broadcast channel that the select
  observes, the loop reloads both channels, and the new deadline
  takes effect on in-flight reads. Previously the dlCh was captured
  once and concurrent SetReadDeadline updates were invisible to
  blocked readers.
- util/util.go: RecoverWithoutClose's Error log now includes the
  recovered panic value and an actionable next_step (supervisor
  falls through to passthrough; file the Sentry issue with the
  parser owner or set KEPLOY_NEW_RELAY=off). No shortcut suppression;
  the log is genuinely actionable.
- tools/lint/no_timestamp_in_parser/analyzer.go: collectSuppressLines
  now matches the marker in both line comments ("// allow:time.Now")
  and block comments ("/* allow:time.Now ... */"), matching the
  documented contract. ast.Comment.Text retains outer delimiters so
  we strip them explicitly before the prefix check.
- tests/e2e/chaos-broken-parser: the in-process V2 proxy wiring is
  still stubbed (broken_parser.go returns errChaosNotYetWired).
  Rather than pretend the test can pass, three aligned fixes:
    * chaos_test.go now passes -tags chaos_broken_parser to `go run`
      so the tagged path is actually compiled.
    * chaos_test.go adds a KEPLOY_CHAOS_WIRING gate; without it the
      test skips with a clear "wiring not yet implemented" message.
    * harness/main.go propagates errChaosNotYetWired from
      startBrokenParserProxyIfEnabled and skips the supervisor-
      fallback invariant when the wiring is absent (I1 and I3 still
      apply — they are properties of the docker-compose stack itself).
  errChaosNotYetWired is now declared in broken_parser_stub.go so
  main.go can reference it regardless of build tag.

All tests pass under -race.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
slayerjain added a commit that referenced this pull request Apr 24, 2026
…ve migration (#4113)

* feat(proxy): supervisor + relay + fakeconn foundation for parser isolation

Introduce the record-mode architecture that isolates integration parsers
from real network sockets, so a parser panic, hang, or resource
exhaustion bug cannot affect the application's live TCP traffic.

New packages (pkg/agent/proxy/):

- fakeconn: read-only net.Conn populated with timestamped Chunks by
  the relay. Write always returns ErrFakeConnNoWrite so parsers cannot
  accidentally corrupt peer traffic. Deadlines, close, chunk blocking
  reads all covered by unit tests.
- directive: parser -> supervisor control protocol for mid-stream
  operations (TLS upgrade, abort, finalize, pause, resume). Parsers
  no longer touch *tls.Conn directly; the relay handles both sides
  of the handshake.
- supervisor: panic firewall + activity-based hang watchdog + memory
  cap wrapping each parser goroutine. Session type is a superset of
  the legacy integrations.RecordSession surface plus an
  incomplete-mock gate (EmitMock / MarkMockIncomplete / MarkMockComplete).
- relay: sole owner and writer of real client/dest sockets. Stamps
  ReadAt/WrittenAt on every chunk at the syscall boundary and tees a
  copy into the parser's FakeConn via a bounded channel with drop
  accounting on memoryguard pressure / per-connection cap / channel
  full. Directive processor handles TLS upgrade by swapping the conn
  pointer under a pause barrier.
- util.RecoverWithoutClose: companion to util.Recover that catches
  panics without closing sockets, for parsers that do not own the
  real sockets under the new path.
- util.KillSwitch: process-wide parsing disable via env
  (KEPLOY_DISABLE_PARSING), SIGUSR1, or Trip/Reset. Unix/Windows
  build-tag split for the signal handler.

Integration surface (additive, opt-in):

- integrations.RecordSession gains V2 *supervisor.Session; un-migrated
  parsers ignore it and continue on the legacy path unchanged.
- integrations.IntegrationsV2 marker interface (IsV2() bool) lets
  migrated parsers opt in.
- proxy.handleConnection: checks util.DefaultKillSwitch.Enabled()
  first (routes to globalPassThrough), then matchedParser as
  IntegrationsV2 with IsV2()==true (routes to recordViaSupervisor),
  else legacy path untouched. KEPLOY_NEW_RELAY=off forces legacy
  even for V2 parsers as a rollback knob.
- recordViaSupervisor runs the parser under the supervisor; on
  FallthroughToPassthrough (panic/hang/mem-cap) hands the real
  sockets to globalPassThrough so user traffic continues regardless
  of parser state.
- newProxyTLSUpgradeFn adapts the existing pTls.HandleTLSConnection
  helper and crypto/tls.Client into the relay's TLSUpgradeFn shape.

End-to-end integration tests (pkg/agent/proxy/v2_integration_test.go)
prove invariants I1-I5 / I7:

- TestV2_HappyPath_ChunkTimestampsCarried: traffic flows end-to-end,
  parser reads Chunks from FakeConns, emits mocks anchored to
  chunk.ReadAt / chunk.WrittenAt.
- TestV2_PanicDoesNotBlockTraffic: panic caught, PanicReporter
  invoked, zero mocks emitted, client STILL receives the server's
  reply through the relay.
- TestV2_HangDetected: activity-based watchdog fires within budget,
  supervisor returns StatusHung + FallthroughToPassthrough.
- TestV2_KillSwitchLifecycle: Trip/Reset/Enabled round-trip.

Zero behavior change for current traffic: no parser implements IsV2()
yet, so every connection continues on the legacy path. Subsequent PRs
migrate parsers one at a time.

All new packages have unit tests that pass under -race. Full
pkg/agent/proxy/... test suite passes; go vet clean; gofmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* feat(http): native-migrate record path to V2 FakeConn architecture

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* feat(mysql): native-migrate record path to V2 FakeConn architecture

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* feat(generic): native-migrate record path to V2 FakeConn architecture

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(mysql): annotate record_v2 MITM TLS config for CodeQL

The V2 MySQL recorder's buildDestTLSConfigV2 sets InsecureSkipVerify=true
because keploy acts as a MITM between the app and the real upstream —
the same pattern used elsewhere in keploy's record-mode TLS handling.
CodeQL flags the new occurrence as go/disabled-certificate-check;
document the MITM rationale on the call site and suppress with both
lgtm and gosec directives so the existing keploy pattern keeps passing
security scans.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(mysql): use system trust store in V2 dest TLS config

CodeQL flagged the previous InsecureSkipVerify=true as a new security
alert. Replace it with default verification against the system trust
store. When the upstream's cert doesn't verify (self-signed, custom
CA), the supervisor's fallback-to-passthrough ensures traffic still
flows — the mock is dropped but the app's connection continues. This
is strictly safer and aligns with the stability-over-fidelity trade-
off documented elsewhere in the V2 architecture.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* feat(lint): forbid time.Now in parser record files (I5 invariant)

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* test(e2e): chaos test for broken parser — panic should not break app's DB connection

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* docs+shutdown: architecture README, matcher contract, parser migration guide, coordinated shutdown

- pkg/agent/proxy/README.md: V2 architecture overview, safety
  invariants I1-I8 with enforcement mapping, migration pattern
  skeleton, test coverage summary, rollout knobs. Preserves legacy
  doc at the bottom.
- docs/reference/mock-matcher-contract.md: pins what the replay
  matcher consumes from every mock (ReqTimestampMock,
  ResTimestampMock, Kind, ConnectionID, reserved Metadata keys).
  Documents the I5 timestamp rule and the per-session monotonicity
  clamp.
- docs/contributing/parser-migration-guide.md: step-by-step recipe
  for porting a legacy parser to V2, including mid-stream TLS
  handling via the directive channel and the 'things that bite
  people' list.
- proxy.StopProxyServer: coordinated shutdown sequence -- trip
  kill switch first (in-flight accepts bypass parsers), close
  listener, drain grace 5s, then force-close tracked client
  connections. Matches PLAN.md §3.7 for the userspace half of
  invariant I8. The kernel-side proxy_ready BPF gate requires
  BPF source changes (source not in this repo) and is documented
  as deferred.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* docs(chaos-e2e): correct stub header — V2 packages exist on branch

Original stub was written in a worktree based off an older snapshot
and documented the V2 packages as missing. They exist. Update the
header to reflect what's actually left: define the broken parser
type implementing IntegrationsV2, register with high priority
synthetic IntegrationType, instantiate an in-process proxy with a
stub MockManager. A follow-up PR can close the loop.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* style: gofmt fix broken_parser stub header indent

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(proxy): address Copilot review — record-only kill switch, relay-continues-on-fallback, logging guideline, godoc, benign errors

Responds to Copilot review on #4113:

- proxy.go: kill switch now only gates MODE_RECORD; MODE_TEST replay
  dispatch is unaffected so mocks keep matching when record-time
  parsing is disabled.
- proxy_v2.go (architectural): on supervisor FallthroughToPassthrough,
  do NOT cancel the relay before starting passthrough — the relay
  keeps forwarding client<->dest bytes raw until peer close. The old
  code stopped forwarding during the fallback gap, defeating the I1
  invariant. Removes the redundant globalPassThrough call (the relay
  IS the passthrough once the tee is cut via SessionOnAbort). Log
  downgraded to Debug with actionable next_step guidance.
- proxy_v2.go: clarified comment on why ErrGroup remains populated
  on the V2 path (legacy integration helpers access it via ctx-value
  in shared code, migration tracked).
- relay/directive_proc.go: TLS-upgrade failures log at Debug with
  directive_reason + next_step instead of Warn (expected condition,
  ack surfaces the error to the parser).
- relay/relay.go: isBenignNetErr now matches net.ErrClosed and the
  "use of closed network connection" text the doc claimed was
  already handled.
- directive/doc.go: godoc link updated to [Ack] (type is Ack, not
  DirectiveAck).
- util/kill_switch.go: comment on envDisableParsing no longer claims
  it's exported — it's a package-local constant.
- util/util.go: RecoverWithoutClose's sentry.Flush now runs only on
  the actual panic path, not on every deferred call.
- supervisor/supervisor.go: hang-abort log downgraded from Warn to
  Debug with next_step (tune HangBudget, or set KEPLOY_NEW_RELAY=off).
- fakeconn/fakeconn.go: synthetic chunks drained from the Read-stash
  now carry the source chunk's ReadAt/WrittenAt/Dir; previously the
  doc claimed this but the impl returned zero timestamps.

All tests pass under -race (pkg/agent/proxy/...).

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(proxy/tools/chaos): address Copilot review — deadline re-fetch, next_step, block-comment allowlist, honest chaos skip

Responds to Copilot tick #2 on #4113:

- fakeconn/fakeconn.go: Read/ReadChunk now re-fetch the deadline
  channel on every loop iteration. A concurrent SetReadDeadline
  closes a "deadline changed" broadcast channel that the select
  observes, the loop reloads both channels, and the new deadline
  takes effect on in-flight reads. Previously the dlCh was captured
  once and concurrent SetReadDeadline updates were invisible to
  blocked readers.
- util/util.go: RecoverWithoutClose's Error log now includes the
  recovered panic value and an actionable next_step (supervisor
  falls through to passthrough; file the Sentry issue with the
  parser owner or set KEPLOY_NEW_RELAY=off). No shortcut suppression;
  the log is genuinely actionable.
- tools/lint/no_timestamp_in_parser/analyzer.go: collectSuppressLines
  now matches the marker in both line comments ("// allow:time.Now")
  and block comments ("/* allow:time.Now ... */"), matching the
  documented contract. ast.Comment.Text retains outer delimiters so
  we strip them explicitly before the prefix check.
- tests/e2e/chaos-broken-parser: the in-process V2 proxy wiring is
  still stubbed (broken_parser.go returns errChaosNotYetWired).
  Rather than pretend the test can pass, three aligned fixes:
    * chaos_test.go now passes -tags chaos_broken_parser to `go run`
      so the tagged path is actually compiled.
    * chaos_test.go adds a KEPLOY_CHAOS_WIRING gate; without it the
      test skips with a clear "wiring not yet implemented" message.
    * harness/main.go propagates errChaosNotYetWired from
      startBrokenParserProxyIfEnabled and skips the supervisor-
      fallback invariant when the wiring is absent (I1 and I3 still
      apply — they are properties of the docker-compose stack itself).
  errChaosNotYetWired is now declared in broken_parser_stub.go so
  main.go can reference it regardless of build tag.

All tests pass under -race.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(supervisor/docs): implement the documented behaviour — monotonic clamp, ConnectionID propagation, relay wiring doc

Responds to Copilot tick #3 on #4113:

- supervisor/session.go: EmitMock now implements the per-session
  ReqTimestampMock monotonicity clamp that matcher-contract.md
  describes — a mock earlier than the most-recent emitted mock on
  the same session gets its timestamp clamped to lastReq+1ns.
  SetDebugMonotonic(true) in test binaries turns the clamp into a
  panic so parser bugs surface immediately. Guarded by a short
  mutex; lock-free hot paths are preserved for hookMu / mockIncomplete.
- supervisor/session.go: EmitMock now auto-propagates ClientConnID
  to mock.ConnectionID when the parser didn't already populate it,
  matching the documented Session.ClientConnID contract. Parsers
  that want to override can still assign before EmitMock.
- relay/doc.go: package doc no longer says "not yet wired into
  handleConnection" — it is, via proxy_v2.go::recordViaSupervisor
  for V2-capable parsers. Mentions the KEPLOY_NEW_RELAY rollback knob.
- docs/reference/mock-matcher-contract.md: rewrote the monotonicity
  and ConnectionID sections to match the actual implementation —
  simplified code example removed, behaviour described including
  the SetDebugMonotonic toggle and the zero-timestamp bypass.

All tests pass under -race.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(lint/chaos): narrow lint scope to *_v2.go; reusable tickers; README accuracy

Responds to Copilot tick #4 on #4113:

- tools/lint/no_timestamp_in_parser/analyzer.go: narrowed inScope
  to *_v2.go files (and recorder_v2/ directory) instead of matching
  any encode*.go. Legacy encode.go files in integrations/generic
  and integrations/http use time.Now() legitimately; the old scope
  would have flagged them as violations. The V2 contract only
  applies to the *_v2.go files that adopted the chunk-timestamp
  rule, so that's the precise scope. Package doc rewritten to
  document the narrowed scope + rationale.
- tests/e2e/chaos-broken-parser/harness/main.go: replaced the two
  time.After-in-loop sites (driveQueries inter-query spacing and
  waitForPostgresHealthy retry loop) with reusable *time.Ticker
  + defer Stop(). Matches the repo's polling-loop convention and
  avoids per-iteration timer allocation.
- tests/e2e/chaos-broken-parser/README.md: "Status on this branch"
  section now accurately reflects that the V2 packages ARE landed
  on feat/proxy-v2-foundation (they're introduced by this PR). What
  remains stubbed is the in-process proxy instantiation inside
  broken_parser.go — the TODO checklist in its header documents it.

All tests pass under -race; `go vet`, `gofmt -l`, `go build ./...`
clean.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(proxy,docs): wire incomplete-mock + pending-work hooks; Chunk.IsZero ignores Dir; Debug-level FakeConn.Write; doc drift

Responds to Copilot tick #5 on #4113:

- proxy_v2.go: relay.Config now wires OnMarkMockIncomplete to the
  session's MarkMockIncomplete so tee drops (memoryguard, per-conn
  cap, channel full, write error, short write, KindAbortMock) mark
  the in-flight mock incomplete. Previously the relay would drop
  silently and the parser could still emit a partial mock — exact
  invariant I4 violation Copilot flagged.
- relay/config.go, relay/relay.go: new OnClientChunkTeed callback
  invoked on each successful client→dest tee. proxy_v2.go wires it
  to sv.MarkPendingWork and sets svSess.OnPendingCleared to
  sv.ClearPendingWork. EmitMock calls OnPendingCleared after a
  successful emit. Net effect: the supervisor's activity watchdog
  can now distinguish an idle connection (no pending requests)
  from a parser that received bytes but isn't emitting a mock
  (hang candidate), closing a real gap in watchdog semantics.
- supervisor/session.go: OnPendingCleared field + EmitMock wiring.
- fakeconn/chunk.go: Chunk.IsZero drops the `Dir == 0` check
  because Direction(0) is the valid value FromClient; a real empty
  FromClient chunk with zero timestamps would have been misclassified
  as zero. Now checks only payload + timestamps + SeqNo, which is
  the genuine "unset" signal.
- fakeconn/fakeconn.go: FakeConn.Write's log call downgraded Warn
  → Debug. The returned ErrFakeConnNoWrite is the primary signal
  for the parser-misuse bug; the log is diagnostic. logger
  interface's method also renamed Warn → Debug for consistency.
- tests/e2e/chaos-broken-parser/harness/broken_parser.go: header
  updated — the supervisor fallback KEEPS the relay running for
  raw forwarding rather than invoking globalPassThrough. Matches
  the current recordViaSupervisor behaviour; previous wording was
  from an earlier design iteration.
- tools/lint/no_timestamp_in_parser/README.md: scope description
  rewritten to match the narrowed *_v2.go / recorder_v2/ scope.
  Legacy encode.go files explicitly documented as out-of-scope.
- docs/reference/mock-matcher-contract.md: enforcement paragraph
  aligned with the narrowed analyzer scope.

All tests pass under -race.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(lint/chaos): TypesInfo-based time-package resolution; single-source suppression marker; doc/filename drift

Responds to Copilot tick #6 on #4113:

- tools/lint/no_timestamp_in_parser/analyzer.go: time-package
  resolution now goes through pass.TypesInfo (refersToStdlibTimePackage)
  so the rule catches aliased imports like `import stdtime "time"`
  and renamed copies, not just bare `time` identifiers. Falls back
  to the old name-match when TypesInfo isn't populated (e.g. testdata
  without full type-check).
- tools/lint/no_timestamp_in_parser/analyzer.go: suppressionComment
  is now the single source of truth for the marker — the matcher
  derives its prefix from the constant (no more `_ = suppressionComment`
  workaround) and the diagnostic message interpolates it too.
- tools/lint/no_timestamp_in_parser/analyzer.go: inScope docstring
  rewritten to match the actual "directory-agnostic *_v2.go / *recorder_v2/*
  suffix match" implementation, with rationale for the cross-repo
  consistency reason.
- tools/lint/no_timestamp_in_parser/analyzer_test.go: top-of-file
  comment updated — the testdata fixtures exercise the *_v2.go suffix
  match, not a /recorder/ substring match.
- tests/e2e/chaos-broken-parser/chaos_test.go: comment on
  e2eTagEnabled referenced chaos_test_e2e_tag.go / chaos_test_no_tag.go
  which don't exist; real sentinel files are e2e_tag_enabled_test.go
  and e2e_tag_disabled_test.go. Updated so future maintainers find
  the right files.

All tests pass under -race.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* feat(fakeconn): add LastWrittenTime accessor

FakeConn already tracks each chunk's WrittenAt timestamp — but only
via the chunk returned from ReadChunk. Parsers that consume bytes via
Read (and therefore never see Chunk structs) had no way to observe
the response-side timestamp without refactoring their read loop.

Adds a symmetric LastWrittenTime() accessor alongside LastReadTime()
so parsers can source ResTimestampMock from chunk.WrittenAt in the
same way ReqTimestampMock is sourced from chunk.ReadAt via
LastReadTime. Implementation mirrors LastReadTime: an atomic.Int64
updated by every ReadChunk delivery (including the drain-from-stash
path) and returned as time.Unix(0, n).

Consumer fix follows in keploy/integrations#133 (Postgres v3 session
mock switches from LastReadTime to LastWrittenTime for resTs).

Test coverage: TestReadFromChunks now asserts both LastReadTime and
LastWrittenTime across a two-chunk Read; new
TestLastWrittenTimeZeroBeforeAnyChunk pins the zero-value contract.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: clear pending-work on dropped mocks; disambiguate V2 success log

Four Copilot review follow-ups on 1e8f383:

- supervisor/session.go: EmitMock's drop-due-to-mockIncomplete path
  now calls OnPendingCleared before returning. Without this the
  supervisor's hang watchdog stays armed after a benign drop
  (chunk gate, memory pressure, short write) and eventually fires a
  spurious abort on an idle connection. New
  TestSessionEmitMockDropPathClearsPending pins the behavior.

- integrations/generic/encode_v2.go: flushMock's incomplete-drop
  early-return path now also calls sess.OnPendingCleared. The early
  return was an allocation optimisation and previously skipped the
  pending-clear that EmitMock would have done, for the same watchdog
  reason.

- proxy.go (V2 dispatcher): the post-recordViaSupervisor log used to
  claim "successfully recorded via V2 path" even when the supervisor
  had fallen through to passthrough (parser panic / hang / explicit
  fallback), because recordViaSupervisor returns nil error in both
  cases. Split the reporting: recordViaSupervisor now logs
  "V2 parser recorded outgoing message successfully" (with the
  supervisor result.Status) only on the genuine non-fallthrough
  success path; the caller log softens to "V2 record path returned".

- relay/tee.go: comment on the per-conn byte-cap check said
  "Reserve capacity optimistically, then undo on channel-full" but
  the code only increments t.bytes after a successful staging send,
  so nothing is ever "undone" on drop paths. Rewrote the comment to
  match actual accounting so a future change doesn't rely on an
  incorrect description.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: doc accuracy + stderr fallback + panic next_step

Four Copilot review follow-ups on c2e63d2:

- directive/doc.go: package doc said directives go "to a channel
  owned by the supervisor, which forwards ... to the relay".
  Actually the relay exposes the directive channel directly
  (session.Directives = r.Directives()) and processes them inside
  relay.processDirectives; the supervisor never touches them.
  Rewrote to match the actual ownership/flow.

- util/util.go RecoverWithoutClose: the nil-logger fallback printed
  "Recovered from panic in parser (no logger available)" to stdout.
  Library code writing to stdout can corrupt CLI/tool-parseable
  output streams. Switched to fmt.Fprintln(os.Stderr, ...) — same
  signal, predictable stream.

- supervisor/supervisor.go classifyReturn panic path: Error log
  lacked an actionable next_step. Added guidance pointing at the
  supervisor's automatic fallthrough-to-passthrough behaviour plus
  the three escape hatches (KEPLOY_NEW_RELAY=off for this parser,
  KEPLOY_DISABLE_PARSING=1 / SIGUSR1 to disable parser dispatch
  entirely) so operators have a clear remediation path.

- proxy_v2.go recordViaSupervisor doc: responsibility #4 said
  FallthroughToPassthrough would "run globalPassThrough on the
  real sockets", but the implementation intentionally does NOT
  call globalPassThrough — it keeps the relay forwarding until
  peer close (invariant I1 in pkg/agent/proxy/README.md; calling
  globalPassThrough would create a read-loop gap, exactly the
  stall V2 was designed to eliminate). Aligned the comment.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: drop unused srcPtr/dstPtr; correct README I3 fallback claim

Three Copilot review follow-ups on 0799eb3:

- proxy_v2.go newProxyTLSUpgradeFn: the signature took srcPtr and
  dstPtr *net.Conn parameters that were never used. The function
  only performs the handshake and returns the upgraded net.Conn;
  pointer mutation is the relay's responsibility. Dropped the two
  parameters and updated the single call site. The doc comment
  already explained this ownership split; now the signature
  matches.

- README.md "Safety invariants" table I3: claimed recordViaSupervisor
  routes FallthroughToPassthrough to globalPassThrough. Actual
  behaviour (tick 16's proxy_v2.go doc fix) is the opposite — the
  relay keeps forwarding until peer close so there is never a
  read-loop gap. Rewrote the I3 "Enforced by" cell to match.

- README.md recordViaSupervisor overview paragraph: same
  globalPassThrough claim in prose form. Rewrote to describe the
  relay-continuation fallback + explicit note that calling
  globalPassThrough would reintroduce the exact stall V2 was
  designed to eliminate.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: shutdown drain now actually works; correct Metadata[type] doc

Two Copilot review follow-ups on c88f4f5:

- proxy.go / proxy_v2.go: waitForConnDrain was ineffective because
  p.clientConnections was only read and nil-cleared on shutdown,
  never populated by any call path. len() was always 0, so the
  5-second drain grace returned immediately and shutdown proceeded
  without waiting for in-flight parsers. The force-close loop that
  followed was also a no-op.

  Replaced with a sync.WaitGroup approach:

  - Proxy.activeConns sync.WaitGroup replaces the
    clientConnections []net.Conn slice + connMutex coordination.
  - handleConnection calls activeConns.Add(1) / defer Done() at
    entry so every invocation is tracked, regardless of which
    return path it takes.
  - waitForConnDrain spins a sentinel goroutine that closes a done
    channel when Wait() returns, then selects done vs ctx.Done().
    Net effect: a single Wait() drains the whole active set with
    a deadline.
  - The dead force-close loop is gone. Connections that don't
    finish inside the grace window exit via the parent ctx
    cancellation their parserCtx inherits; their deferred
    srcConn/dstConn closes fire on return without any double-close
    race from a shared slice.

- docs/reference/mock-matcher-contract.md: Metadata["type"] was
  documented as "protocol name (e.g. mysql)" but the codebase
  actually uses it as a lifetime tag (see pkg/models/lifetime.go):
  "config" for session-lifetime mocks, "connection" for
  connection-lifetime mocks. Rewrote the reserved-keys table row
  with the accurate semantic and a pointer to lifetime.go.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* Change default integration ref to 'feat/v2-native-parsers'

Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

* fix: align shutdown-step doc; test block-comment timestamp suppression

Two Copilot review follow-ups on 5e37514:

- proxy.go StopProxyServer shutdown-step comment: said step 4 was
  "Force-close any still-live client connections at the end", but
  the implementation no longer holds a slice of live conns (dropped
  in tick 20's activeConns WaitGroup refactor). Rewrote step 3 to
  describe the WaitGroup-based drain and added step 4 explaining
  the deliberate design: stragglers exit asynchronously via parent
  ctx cancellation, without a shared slice and without a
  double-close race against their own deferred closes. Shutdown is
  still bounded by the 5-second grace timeout.

- tools/lint/no_timestamp_in_parser: the README documents both
  line-comment (// allow:time.Now) and block-comment
  (/* allow:time.Now */) suppression, but the analysistest
  fixtures only exercised the line form. Added two new exported
  functions to the "suppressed" fixture:

    LogUptimeBlockComment — single-line block comment above a
      time.Since() call, proves the /* ... */ branch of
      collectSuppressLines fires.
    LogBootTwoLineBlock — multi-line block comment; verifies that
      the analyzer's "end line + 1" lookup honours the block's
      END line, not its start line.

  With these fixtures the block-comment suppression path gains
  real test coverage so a future refactor can't silently regress
  the README-documented behaviour.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: pause relay tees on abort; flag temporary action-ref default

Three Copilot review follow-ups on 80d55b2:

- proxy_v2.go SessionOnAbort / relay/relay.go: after parser abort
  (panic / hang / mem-cap) we already closed the FakeConns so the
  parser's reads unblock, but the relay's tees stayed active and
  kept trying to push every subsequent chunk into the now-orphaned
  FakeConn channel. Every push fell through to the channel-full
  drop path, which logs at Debug — on a long-lived post-abort
  connection that's one log line per chunk for the rest of the
  session. New Relay.PauseTees() flips the atomic-bool pause flag
  on both tees; push then returns via the cheap DropPaused
  fast-path without consuming capacity or logging. Raw forwarding
  on the real sockets is unaffected — every byte still reaches
  its peer. SessionOnAbort now pauses tees FIRST, then closes the
  FakeConns.

- chaos-broken-parser/README.md: documented failure behaviour was
  out of date — said the go test wrapper "fails (not skips) once
  docker + e2e tag are present", but chaos_test.go now also SKIPs
  when KEPLOY_CHAOS_WIRING is unset/falsy. Rewrote the bullet to
  spell out the full gate set (KEPLOY_E2E + -tags e2e + docker +
  KEPLOY_CHAOS_WIRING) and that only beyond that gate is a failure
  genuine.

- .github/actions/setup-private-parsers/action.yaml: the
  integration-ref default was changed to 'feat/v2-native-parsers'
  in 5e37514 to unblock cross-repo CI during review. Added a
  prominent TODO(before-merge) comment above the default so it
  can't be silently inherited by downstream workflows after merge
  — the feature branch will eventually be deleted, which would
  break every caller that doesn't pass integration-ref explicitly.
  Default itself is left intentionally in place; reverting it
  would undo the maintainer's CI-wiring.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: revert action-ref default to main; mask stash-exhaustion EOF

Two Copilot review follow-ups on f4c418e:

- fakeconn/fakeconn.go Read: when draining the internal byte stash
  via bytes.Buffer.Read, io.EOF was surfaced to the caller any
  time the stash went empty on a call that DID return bytes. Go's
  convention (and what bufio.Reader / io.Copy / encoding/pkg
  readers assume) is that (n>0, io.EOF) means "here is the last
  n bytes". Callers would terminate the stream early even though
  more chunks were still arriving on f.ch. Only the channel-close
  path is a genuine EOF; mask the stash-exhaustion EOF whenever
  n > 0.

  New TestReadStashExhaustionNotEOF pins the contract: a two-chunk
  sequence with a Read that exactly empties the stash on an
  intermediate call must surface the stash bytes with err=nil, and
  a subsequent Read must still see the second chunk.

- .github/actions/setup-private-parsers/action.yaml: reverted the
  integration-ref default from 'feat/v2-native-parsers' back to
  'main'. Copilot flagged the feature-branch default as a
  post-merge footgun — downstream workflows that don't pass
  integration-ref explicitly would pin to a branch that gets
  deleted. The review-time cross-repo CI override is still
  available via the repo-level `vars.INTEGRATION_REF` variable,
  which the workflow env lines at the top of
  prepare_and_run*.yml and golangci-lint.yml resolve first:

      INTEGRATION_REF: ${{ vars.INTEGRATION_REF || 'main' }}

  So the maintainer can set `INTEGRATION_REF=feat/v2-native-parsers`
  as a repo variable during the review window and unset it (or
  delete the var) after merge, with no action.yaml churn. Added
  a comment block spelling out that wiring so the next reader
  doesn't have to reverse-engineer it.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(relay): atomic two-sided TLS upgrade

Copilot review follow-up on 1d9ae97:

pkg/agent/proxy/relay/directive_proc.go handleUpgradeTLS previously
performed the dest-side TLS handshake first, published r.dst via
atomic.Store, THEN performed the client-side handshake. If the
client-side handshake failed after the dest-side store, the relay
was left in a mixed state: r.dst points at a *tls.Conn, r.src
still points at cleartext. Any forwarding that happened before the
outer layer tore the sockets down would move TLS bytes in one
direction and plaintext bytes in the other, corrupting any traffic
in flight.

Restructured the handler to stage both handshakes in local
variables FIRST (upgradedDst, upgradedSrc) and ONLY publish via
r.{dst,src}.Store after both requested handshakes succeed. On
client-side failure, close the unpublished dest-side *tls.Conn so
it doesn't leak, and return the error with both r.dst and r.src
still pointing at their original cleartext conns — the forwarders
never see the mixed state.

The forwarders are parked on r.beginPause() for the entire
handshake window, so publishing is observed only after r.endPause
fires on the success path. The corruption window is therefore
zero regardless of handshake ordering.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: close WaitGroup race; next_step on V2 error logs; test timeouts

Four Copilot review follow-ups on 0dcfbe8:

- proxy.go activeConns Add/Wait race: activeConns.Add(1) was called
  inside handleConnection, which means the Add happened AFTER
  clientConnErrGrp.Go returned to the accept loop. If
  StopProxyServer kicked off waitForConnDrain -> activeConns.Wait()
  in that window, Wait could observe count=0 and return before the
  handleConnection goroutine registered itself. sync.WaitGroup
  documents this as misuse ("calls with a positive delta that
  occur when the counter is zero must happen before a Wait") and
  can produce "sync: WaitGroup misuse" panics in some
  interleavings. Moved activeConns.Add(1) to the spawn site in the
  accept loop (before clientConnErrGrp.Go), Done() stays in the
  goroutine's defer.

- proxy.go V2 record path failure log: utils.LogError("V2 record
  path failed") lacked an actionable next_step. Added parser tag +
  rollback knobs (KEPLOY_NEW_RELAY=off for this parser, or
  KEPLOY_DISABLE_PARSING=1 / SIGUSR1 to disable parser dispatch
  entirely); noted that the supervisor has already fallen through
  to passthrough so user traffic continues.

- integrations/mysql/mysql.go recordV2: same bare utils.LogError
  without next_step. Added identical remediation guidance.

- v2_integration_test.go: two tests (happy-path, panic-parser)
  blocked on unbounded `<-resCh` receives. If the supervisor/relay
  regress and Run never returns, the test hangs until the
  package-wide `go test -timeout`, which makes failures slow to
  surface and hard to locate. Wrapped both receives in
  `select <-resCh / <-time.After(3 * time.Second)` — same pattern
  other receives in the file already use.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(supervisor): next_step on reportPanic + fireOnAbort Error logs

Two Copilot review follow-ups on d103f61:

- reportPanic: the Error log fired when the configured
  PanicReporter itself panics lacked the repo-standard next_step
  field. Added guidance pointing operators at two remediations:
  fix the reporter implementation (it must be non-blocking and
  non-panicking, same contract as all supervisor callbacks), or
  unset supervisor.Config.PanicReporter=nil to fall back to no
  external reporting — the recovered parser panic is still logged
  at Error level on the supervisor's own path, so turning the
  reporter off is a safe degradation.

- fireOnAbort: matching Error log on SessionOnAbort callback
  panic. next_step now explains the callback's contract
  (non-blocking, non-panicking — see proxy_v2.go for the reference
  implementation that just closes FakeConns and pauses tees) and
  offers the escape hatch of constructing the Supervisor without
  a SessionOnAbort callback if the caller can tolerate parser-side
  reads not unblocking promptly on abort.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: unstick outer-ctx-cancel goroutine leak; doc TLS boundary semantics

Two Copilot review follow-ups on b18390f:

- supervisor/supervisor.go outer ctx-cancel: when ctx.Done() fired,
  runCancel() cancelled the inner context, then the select waited
  50ms for the parser to return. On clean return we skipped
  SessionOnAbort (correct — nothing to unstick). But on the
  50ms-timeout path we ALSO skipped it, which leaked the parser
  goroutine: if the parser was parked in FakeConn.Read/ReadChunk
  (which do not observe ctx — they only unblock on Close), the
  goroutine would stay parked for the life of the process, and the
  relay tees would stay armed so every subsequent chunk fell
  through the channel-full drop path logging at Debug.

  Fire SessionOnAbort on the timeout branch (closes both
  FakeConns + pauses tees via the dispatcher's reference
  callback) and mark FallthroughToPassthrough so the outer caller
  routes raw — consistent with the hang and panic branches above.

- directive/directive.go Ack doc: claimed BoundaryReadAt /
  BoundaryWrittenAt were "the exact wallclock instants at which
  the real sockets transitioned from cleartext to TLS", but the
  relay implementation captures BoundaryReadAt BEFORE handshakes
  begin and BoundaryWrittenAt AFTER both succeed — these are
  upgrade-boundary brackets, not transition instants (the TCP
  stack has no hook that would give the relay the exact instant,
  and the handshake itself is a multi-roundtrip sequence taking
  real time). Rewrote to describe the actual contract: a Read
  bound before and a Write bound after, with the post-ack
  guarantee that everything on ClientStream/DestStream is
  plaintext from the upgraded session.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Ayush Sharma <kshitij3160@gmail.com>
khareyash05 added a commit that referenced this pull request Apr 30, 2026
asyncMySQLDecode reads client and server bytes from clientBuffChan and
destBuffChan and feeds both into a single FIFO decodeChan via a
non-deterministic select in handleClientQueries. On a fast loopback the
next client command can be enqueued ahead of the trailing server bytes
of the prior result set (more column defs, the rows, or the EOF /
OK_with_EOF terminator). When that happens the recorder force-flushes
the in-progress mock at line 346 with truncated columns or a missing
FinalResponse, and the encoded wire bytes at replay time look like:

    column_count_pkt | column_def_1..M | <end>     # M < ColumnCount
or
    column_count_pkt | column_defs | row_pkt | <end>   # missing terminator

Drivers that strictly require the trailing terminator — most notably
Connector/J on Java 8 — block in socketRead0 forever waiting for a
column def or terminator that was lost to the race. Connector/J on
Java 17 happens to tolerate the missing terminator on otherwise-complete
result sets, which is why the bug only surfaced on Java 8 in earlier
testing.

Fix is recorder-side only. flushMock now calls
closeIncompleteResultSetForFlush, which when the decoder state is
mid-result-set (stateExpectColumns, stateExpectEOFAfterColumns, or
stateExpectRows):

  1. Truncates rs.ColumnCount to len(rs.Columns) so the encoded
     column-count packet matches the column-def packets actually
     emitted. Without this, replay clients block waiting for a column
     def that will never arrive.
  2. Synthesizes the trailing terminator on FinalResponse using the
     negotiated capabilities — 0xFE-prefixed OK_with_EOF if
     CLIENT_DEPRECATE_EOF is set (with a trailing lenenc info byte if
     CLIENT_SESSION_TRACK is also set), legacy 5-byte EOF otherwise.
  3. Picks seqID = max(observed seqs in columns / EOFAfterColumns /
     rows) + 1, falling back to 2 when the result set has no captured
     columns at all.
  4. Rewrites bundle.Header.Type to TextResultSet /
     BinaryProtocolResultSet so wire.EncodeToBinary's type-switch
     routes through the result-set encoder rather than the column-count
     head packet that processFirstResponse originally stored.

The encoder (wire/phase/query/ResultSetPacket.go) and the replayer
(replayer/query.go, replayer/conn.go) are byte-identical to before —
the fix produces structurally-complete mocks at record time so replay
just works.

Verified on kind keploy-ds → keploy-replay with Spring Boot
sample-app-java8 (Temurin 8-jre / Connector/J 8.4.0) and
sample-app-java17 (Temurin 17-jre / Connector/J 8.4.0) both connecting
to mysql:8.0 with require_secure_transport=ON and
sslMode=REQUIRED&useSSL=true&trustServerCertificate=true:

  Java 8  run #1: 9 testcases, failed_count=0, noisy_count=4, 11.42 s
  Java 8  run #2: 4 testcases, failed_count=0, noisy_count=2, 11.40 s
  Java 17 regression: 8 testcases, failed_count=0, noisy_count=3, 11.24 s

Zero context-deadline-exceeded events and zero CommunicationsException
across the runs. The "noisy" classifications are auto-increment IDs,
list ordering, and created_at timestamp drift — pure data variance,
same shape as the pre-fix Java 17 baseline.

Signed-off-by: Yash Khare <khareyash05@gmail.com>
khareyash05 added a commit that referenced this pull request May 7, 2026
…st cases (#4157)

* feat(replay): expose TLSConfig hook on SimulationConfig + Hooks for HTTPS test cases

Adds an opt-in extension point so non-OSS callers (e.g.
cluster-mode auto-replay) can inject a *tls.Config into the replay
HTTP client without forking pkg/util or shipping a separate
transport. Default nil preserves the stdlib system-pool behaviour
for every existing OSS user; no behaviour change unless a caller
opts in.

## Why

Replay HTTPS test cases captured against an in-cluster service
keep their original URL (e.g. https://app.svc.cluster.local:8443).
At replay time the dial target is rewritten to a local target —
typically localhost:<port> through a kubectl port-forward to a
short-lived replay pod, or some other dev/CI tunnel — so the cert
the server presents cannot match the dial hostname.

The replay pod commonly serves a self-signed cert generated at
image build (PKCS12 keystore baked into the image, or a
cert-manager Issuer signing per pod). That cert is not in any
system CA bundle, and modern Go's x509 verifier additionally
demands a SAN matching the dial host.

Without a hook, every HTTPS test case fails the TLS handshake
("x509: certificate is not valid for any names" or "signed by
unknown authority") *before* the request is dispatched, so the
response body never reaches the comparison engine.

A blanket InsecureSkipVerify in OSS is the wrong fix because it
weakens TLS verification for every other replay path (CI lanes
replaying against a real staging backend with a properly-issued
cert, for example). This PR keeps strict TLS the default and lets
specific callers install a curated *tls.Config — for instance one
that pins the replay pod's leaf cert via VerifyConnection.

## Files

- pkg/util.go: SimulationConfig gains TLSConfig *tls.Config; all 3
  http.Transport branches in prepareHTTPRequest pass it through as
  TLSClientConfig (nil leaves the transport at stdlib defaults).
- pkg/service/replay/hooks.go: Hooks.tlsConfig field +
  (*Hooks).SetReplayTLSConfig setter; SimulateRequest forwards it
  into pkg.SimulationConfig.TLSConfig for HTTP testcases.

Verified end-to-end against a cluster-mode replay consumer that
TOFU-pins the replay pod's leaf cert (kept out of OSS).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(mysql-recorder): synthesize result-set terminator on force-flush

asyncMySQLDecode reads client and server bytes from clientBuffChan and
destBuffChan and feeds both into a single FIFO decodeChan via a
non-deterministic select in handleClientQueries. On a fast loopback the
next client command can be enqueued ahead of the trailing server bytes
of the prior result set (more column defs, the rows, or the EOF /
OK_with_EOF terminator). When that happens the recorder force-flushes
the in-progress mock at line 346 with truncated columns or a missing
FinalResponse, and the encoded wire bytes at replay time look like:

    column_count_pkt | column_def_1..M | <end>     # M < ColumnCount
or
    column_count_pkt | column_defs | row_pkt | <end>   # missing terminator

Drivers that strictly require the trailing terminator — most notably
Connector/J on Java 8 — block in socketRead0 forever waiting for a
column def or terminator that was lost to the race. Connector/J on
Java 17 happens to tolerate the missing terminator on otherwise-complete
result sets, which is why the bug only surfaced on Java 8 in earlier
testing.

Fix is recorder-side only. flushMock now calls
closeIncompleteResultSetForFlush, which when the decoder state is
mid-result-set (stateExpectColumns, stateExpectEOFAfterColumns, or
stateExpectRows):

  1. Truncates rs.ColumnCount to len(rs.Columns) so the encoded
     column-count packet matches the column-def packets actually
     emitted. Without this, replay clients block waiting for a column
     def that will never arrive.
  2. Synthesizes the trailing terminator on FinalResponse using the
     negotiated capabilities — 0xFE-prefixed OK_with_EOF if
     CLIENT_DEPRECATE_EOF is set (with a trailing lenenc info byte if
     CLIENT_SESSION_TRACK is also set), legacy 5-byte EOF otherwise.
  3. Picks seqID = max(observed seqs in columns / EOFAfterColumns /
     rows) + 1, falling back to 2 when the result set has no captured
     columns at all.
  4. Rewrites bundle.Header.Type to TextResultSet /
     BinaryProtocolResultSet so wire.EncodeToBinary's type-switch
     routes through the result-set encoder rather than the column-count
     head packet that processFirstResponse originally stored.

The encoder (wire/phase/query/ResultSetPacket.go) and the replayer
(replayer/query.go, replayer/conn.go) are byte-identical to before —
the fix produces structurally-complete mocks at record time so replay
just works.

Verified on kind keploy-ds → keploy-replay with Spring Boot
sample-app-java8 (Temurin 8-jre / Connector/J 8.4.0) and
sample-app-java17 (Temurin 17-jre / Connector/J 8.4.0) both connecting
to mysql:8.0 with require_secure_transport=ON and
sslMode=REQUIRED&useSSL=true&trustServerCertificate=true:

  Java 8  run #1: 9 testcases, failed_count=0, noisy_count=4, 11.42 s
  Java 8  run #2: 4 testcases, failed_count=0, noisy_count=2, 11.40 s
  Java 17 regression: 8 testcases, failed_count=0, noisy_count=3, 11.24 s

Zero context-deadline-exceeded events and zero CommunicationsException
across the runs. The "noisy" classifications are auto-increment IDs,
list ordering, and created_at timestamp drift — pure data variance,
same shape as the pre-fix Java 17 baseline.

Signed-off-by: Yash Khare <khareyash05@gmail.com>

---------

Signed-off-by: Yash Khare <khareyash05@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Asish Kumar <87874775+officialasishkumar@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant