feat: add proxy-stress-test sample app for e2e regression testing (#217)
slayerjain merged 6 commits into main
Conversation
Go app that exercises Keploy proxy under stress conditions:
- 10+ concurrent HTTPS connections through a forward proxy (tinyproxy) via CONNECT tunnels, triggering TLS MITM cert generation
- PostgreSQL queries returning 50+ rows with 100KB+ payloads, testing wire protocol handling for large DataRow responses
- HTTP POST through CONNECT tunnel for MatchType validation

Includes Docker Compose with postgres + tinyproxy + the app. Used by the proxy-stress-test CI pipeline in keploy/keploy.

Signed-off-by: Shubham Jain <shubham@keploy.io>
Signed-off-by: slayerjain <shubhamkjain@outlook.com>
Pull request overview
Adds a new proxy-stress-test sample Go app + compose environment intended to stress Keploy’s proxy recording/replay via concurrent HTTPS CONNECT tunneling, large PostgreSQL DataRow payloads, and POST-through-proxy flows.
Changes:
- Introduces a Go HTTP server with endpoints that generate proxy + DB stress patterns.
- Adds Docker Compose stack (Postgres + tinyproxy + app) and SQL seeding for large/“wide” rows.
- Adds a standalone test.sh script to run record + replay and validate reports.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| proxy-stress-test/main.go | Implements the stress-test HTTP server, proxy/TLS behavior, Postgres queries, and optional OTel/noise traffic. |
| proxy-stress-test/docker-compose.yml | Brings up Postgres, tinyproxy, and the app with stress-test env configuration. |
| proxy-stress-test/init.sql | Seeds large and wide Postgres rows to produce fragmented/large DataRow responses. |
| proxy-stress-test/test.sh | Automates record + replay phases and checks for races/panics/report status. |
| proxy-stress-test/Dockerfile | Builds and packages the sample app container image. |
| proxy-stress-test/go.mod | Defines the module and dependencies for the sample app. |
| proxy-stress-test/go.sum | Locks dependency checksums for reproducible builds. |
source ./../../.github/workflows/test_workflow_scripts/test-iid.sh
docker compose build
sudo rm -rf keploy/
$RECORD_BIN config --generate
`source ./../../.github/workflows/test_workflow_scripts/test-iid.sh` will fail because that path doesn't exist in this repository, and with `set -u` it will also leave `$RECORD_BIN`/`$REPLAY_BIN` undefined. Either add the referenced script in this PR or inline/define the needed variables in this script.
Suggested change:
- source ./../../.github/workflows/test_workflow_scripts/test-iid.sh
- docker compose build
- sudo rm -rf keploy/
- $RECORD_BIN config --generate
+ RECORD_BIN="${RECORD_BIN:-keploy}"
+ REPLAY_BIN="${REPLAY_BIN:-keploy}"
+ require_binary() {
+     local bin_name="$1"
+     if [ -x "$bin_name" ] || command -v "$bin_name" >/dev/null 2>&1; then
+         return 0
+     fi
+     echo "Error: Required binary '$bin_name' is not available. Install Keploy or set RECORD_BIN/REPLAY_BIN to the correct executable path, then rerun this script." >&2
+     exit 1
+ }
+ require_binary "$RECORD_BIN"
+ require_binary "$REPLAY_BIN"
+ docker compose build
+ sudo rm -rf keploy/
+ "$RECORD_BIN" config --generate
docker compose build
sudo rm -rf keploy/
Using sudo rm -rf keploy/ makes the script fail in environments without sudo (and is brittle in CI). Prefer removing the directory without sudo (or conditionally use sudo only when required after checking ownership/permissions).
Suggested change:
- sudo rm -rf keploy/
+ if [ -d keploy/ ]; then
+     rm -rf keploy/ 2>/dev/null || {
+         if command -v sudo >/dev/null 2>&1; then
+             sudo rm -rf keploy/
+         else
+             echo "Unable to remove keploy/. Please adjust permissions or remove the directory manually, then rerun the script." >&2
+             exit 1
+         fi
+     }
+ fi
// Init OTel (Issue 2 — exports to localhost:4318 with no collector)
shutdownOTel, err := initOTel(ctx)
if err != nil {
    log.Printf("WARN: OTel init failed (expected if no collector): %v", err)
log.Printf("WARN: ...") introduces a warning-level log line; please avoid adding warnings and instead log at a normal level with a clear next step (e.g., tell the user to start an OTel collector or set OTEL_ENABLED=false).
Suggested change:
- log.Printf("WARN: OTel init failed (expected if no collector): %v", err)
+ log.Printf("OTel init failed: %v. To enable telemetry, start an OTel collector at %s, or set OTEL_ENABLED=false to run without telemetry.", err, otelEndpoint)
// Init Postgres
db, err := initDB()
if err != nil {
    log.Fatalf("Failed to connect to database: %v", err)
log.Fatalf("Failed to connect to database: %v", err) doesn’t give the user an actionable next step. Consider including which env var/DSN is being used and a suggestion like verifying DATABASE_URL or that docker compose up db is healthy.
Suggested change:
- log.Fatalf("Failed to connect to database: %v", err)
+ log.Fatalf("Failed to connect to database using DSN %q: %v. Next step: verify the configured database DSN, confirm Postgres is running and reachable, and check credentials and network access.", dbDSN, err)
IdleConnTimeout: 1 * time.Nanosecond,
TLSClientConfig: &tls.Config{
    InsecureSkipVerify: true, // accept Keploy's MITM cert
},
InsecureSkipVerify: true disables TLS certificate verification, which is risky if this sample is ever run outside a controlled test environment. Consider making this opt-in via an env var (default false) and/or restrict it to when a Keploy proxy is explicitly enabled.
}
_ = wideRows.Scan(ptrs...)
}
wideRows.Scan(ptrs...) ignores the scan error, which can hide driver/protocol issues (particularly relevant for this stress test). Capture and surface the error (e.g., set result.DBError) and stop iterating on failure.
break
}
count++
}
After iterating rows.Next(), the code doesn’t check rows.Err(), so mid-stream errors can be silently missed. Check rows.Err() before closing and propagate it into result.DBError.
Suggested change:
}
}
if err := rows.Err(); err != nil && result.DBError == "" {
    result.DBError = err.Error()
}
succeeded++
} else {
    failed++
    errors = append(errors, fmt.Sprintf("HTTP %d", status))
On non-2xx HTTP responses, errors is appended to without any cap, which can grow unbounded for large batchSize. Apply the same max-length guard used for the err != nil branch.
Suggested change:
- errors = append(errors, fmt.Sprintf("HTTP %d", status))
+ if len(errors) < 5 {
+     errors = append(errors, fmt.Sprintf("HTTP %d", status))
+ }
proxy:
  image: vimagick/tinyproxy
  ports:
    - "3128:3128"
Exposing the proxy on the host (3128:3128) can create an open proxy on the developer/CI machine. Consider binding to localhost (e.g., 127.0.0.1:3128:3128) unless host access is required.
Suggested change:
- - "3128:3128"
+ - "127.0.0.1:3128:3128"
echo 'Timeout 300' >> /tmp/tp.conf &&
echo 'MaxClients 100' >> /tmp/tp.conf &&
echo 'Allow 0.0.0.0/0' >> /tmp/tp.conf &&
echo 'ConnectPort 443' >> /tmp/tp.conf &&
Allow 0.0.0.0/0 makes tinyproxy accept connections from anywhere that can reach the container. Prefer restricting the allowlist to the Docker network/subnet used by this compose project.
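A sketch of a restricted tinyproxy config. The `172.28.0.0/16` subnet here is an illustrative assumption; the compose network would need to be pinned to that range in docker-compose.yml so the `Allow` line always matches:

```shell
# Generate a tinyproxy config that only allows clients from the compose
# network instead of the world-open "Allow 0.0.0.0/0".
cat > /tmp/tp.conf <<'EOF'
Port 3128
Timeout 300
MaxClients 100
Allow 172.28.0.0/16
ConnectPort 443
EOF
grep '^Allow' /tmp/tp.conf
```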
- Fix module path to match repo layout (samples-go/proxy-stress-test)
- Remove dependency on external test-iid.sh in test.sh
- Remove sudo usage in test.sh for portability
- Pin tinyproxy image tag, bind proxy to localhost, restrict Allow
- Replace WARN log with normal log level
- Add actionable hint to database connection fatal
- Handle http.NewRequest error in background noise goroutine
- Redact database credentials in log output
- Check rows.Err() after iteration for mid-stream error detection
- Handle wideRows.Scan error instead of discarding
- Add nolint:gosec comment for intentional InsecureSkipVerify
- Update go.mod after module path change

Signed-off-by: Shubham Jain <shubham@keploy.io>
Signed-off-by: slayerjain <shubhamkjain@outlook.com>
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
send_request &
keploy record -c "docker compose up" --container-name "$container_name" --generateGithubActions=false |& tee "${container_name}.txt"
The keploy record invocation uses --generateGithubActions=false, but the replay command below uses --generate-github-actions=false. Mixing flag names/casing is likely to break on some Keploy CLI versions; please use the same (documented) flag name consistently for both record and test runs.
sleep 5
docker compose down
keploy test -c 'docker compose up' --containerName "$container_name" --apiTimeout 60 --delay 20 --generate-github-actions=false &> "${container_name}-replay.txt"
keploy test is using --containerName, while other samples in this repo use --container-name (e.g. http-postgres/README.md and fasthttp-postgres/README.md). If --containerName isn’t a valid alias in the Keploy CLI, the replay step will fail—please switch to the same --container-name flag used elsewhere for consistency and compatibility.
Suggested change:
- keploy test -c 'docker compose up' --containerName "$container_name" --apiTimeout 60 --delay 20 --generate-github-actions=false &> "${container_name}-replay.txt"
+ keploy test -c 'docker compose up' --container-name "$container_name" --apiTimeout 60 --delay 20 --generate-github-actions=false &> "${container_name}-replay.txt"
send_request(){
    sleep 15
    app_started=false
    while [ "$app_started" = false ]; do
        if curl -sf http://localhost:8080/health > /dev/null 2>&1; then
            app_started=true
        fi
        sleep 3
    done
The /health wait loop has no overall timeout/attempt limit. If the app never becomes healthy (build failure, port conflict, etc.), this script can hang indefinitely in CI. Consider adding a max wait duration/attempt count and exiting with a helpful error message when exceeded.
break
}
}
This block isn’t gofmt-formatted (indentation/bracing around the scanErr check is off). Since the repo enables the gofmt linter (see .golangci.yml:16-18), this will fail linting—please run gofmt on the file (or at least fix this block’s formatting).
Suggested change (whitespace-only; indentation is lost in this view):
break
}
}
- Use --generate-github-actions=false and --container-name consistently (matching other samples in the repo).
- Add max_attempts=30 timeout to health wait loop to fail fast on broken builds instead of looping indefinitely.
- Run gofmt on main.go to fix indentation around scan error check.

Signed-off-by: Shubham Jain <shubham@keploy.io>
Signed-off-by: slayerjain <shubhamkjain@outlook.com>
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 11 comments.
results := make([]httpsResult, concurrentConns)
var wg sync.WaitGroup
for i := 0; i < concurrentConns; i++ {
    wg.Add(1)
concurrentConns is used to size a slice and control a loop; if the env var is set to 0 or a negative value, this will panic at runtime (make([]httpsResult, concurrentConns)). Consider validating/clamping concurrentConns (and similarly BATCH_SIZE) to a sane minimum (>=1) during startup and returning a clear error if invalid.
return
}
log.Printf("Starting background noise: %d outgoing connections/sec", bgNoiseConns)
ticker := time.NewTicker(time.Second / time.Duration(bgNoiseConns))
time.Second / time.Duration(bgNoiseConns) can evaluate to 0 when BG_NOISE_CONNS is very large, which will cause time.NewTicker(0) to panic. Add a guard to ensure the computed interval is >0 (e.g., cap the max rate or fall back to 1ns) before constructing the ticker.
Suggested change:
- ticker := time.NewTicker(time.Second / time.Duration(bgNoiseConns))
+ interval := time.Second / time.Duration(bgNoiseConns)
+ if interval <= 0 {
+     interval = time.Nanosecond
+ }
+ ticker := time.NewTicker(interval)
func healthHandler(db *sql.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        status := "ok"
        if err := db.PingContext(r.Context()); err != nil {
/health calls db.PingContext(r.Context()), but the request context typically has no deadline, so a stalled DB connection can hang the health endpoint indefinitely. Consider wrapping the ping in a short timeout context (e.g., 1–2s) so health checks reliably return.
Suggested change:
- if err := db.PingContext(r.Context()); err != nil {
+ ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
+ defer cancel()
+ if err := db.PingContext(ctx); err != nil {
for i := 0; i < 3; i++ {
    rows, err := db.QueryContext(ctx,
        `SELECT id, name, description, large_payload FROM large_records ORDER BY id LIMIT 100`)
    if err != nil {
        continue
    }
    for rows.Next() {
        var id int
        var name, desc, payload string
        _ = rows.Scan(&id, &name, &desc, &payload)
        dbRows++
    }
    rows.Close()
}
The comment says the Postgres result sets are queried "concurrently", but the implementation runs the three queries sequentially in a for-loop. Either update the comment to match the current behavior or actually execute these queries concurrently if that concurrency is important for the intended stress profile.
Suggested change:
- for i := 0; i < 3; i++ {
-     rows, err := db.QueryContext(ctx,
-         `SELECT id, name, description, large_payload FROM large_records ORDER BY id LIMIT 100`)
-     if err != nil {
-         continue
-     }
-     for rows.Next() {
-         var id int
-         var name, desc, payload string
-         _ = rows.Scan(&id, &name, &desc, &payload)
-         dbRows++
-     }
-     rows.Close()
- }
+ var dbWG sync.WaitGroup
+ dbRowsCh := make(chan int, 3)
+ for i := 0; i < 3; i++ {
+     dbWG.Add(1)
+     go func() {
+         defer dbWG.Done()
+         rows, err := db.QueryContext(ctx,
+             `SELECT id, name, description, large_payload FROM large_records ORDER BY id LIMIT 100`)
+         if err != nil {
+             return
+         }
+         defer rows.Close()
+         localRows := 0
+         for rows.Next() {
+             var id int
+             var name, desc, payload string
+             _ = rows.Scan(&id, &name, &desc, &payload)
+             localRows++
+         }
+         dbRowsCh <- localRows
+     }()
+ }
+ dbWG.Wait()
+ close(dbRowsCh)
+ for rowCount := range dbRowsCh {
+     dbRows += rowCount
+ }
res, err := resource.New(ctx,
    resource.WithAttributes(
        semconv.ServiceName("repro-app"),
        semconv.ServiceVersion("0.1.0"),
    ),
The service name/version are hard-coded to repro-app/0.1.0, which doesn’t match the sample directory/name (proxy-stress-test) and can make CI traces/logs harder to attribute. Consider renaming these attributes to align with the sample (or sourcing from env/build metadata).
retries: 10

proxy:
  image: vimagick/tinyproxy:latest
vimagick/tinyproxy:latest is not pinned, which can make this e2e regression environment non-reproducible (a new image can change behavior unexpectedly). Consider pinning to a specific version tag or immutable digest.
Suggested change:
- image: vimagick/tinyproxy:latest
+ image: vimagick/tinyproxy@sha256:<replace-with-validated-image-digest>
docker compose build
rm -rf keploy/
The script brings up containers but doesn’t ensure cleanup on early exit (because set -euo pipefail will exit immediately on failure). Add a trap to always run docker compose down (and optionally remove-orphans/volumes) so CI and local runs don’t leave stray containers/networks behind.
Suggested change:
cleanup() {
    exit_code="$1"
    trap - EXIT
    set +e
    if ! docker compose down --remove-orphans; then
        echo "Docker Compose cleanup did not complete successfully. Please run 'docker compose down --remove-orphans' manually and inspect 'docker compose ps' for any remaining resources."
    fi
    exit "$exit_code"
}
trap 'cleanup $?' EXIT
container_kill() {
    REC_PID="$(pgrep -n -f 'keploy record' || true)"
    echo "Keploy record PID: $REC_PID"
    kill -INT "$REC_PID" 2>/dev/null || true
}
container_kill uses pgrep -n -f 'keploy record' which can match and interrupt an unrelated keploy record process on the same machine. It would be safer to capture the PID of the keploy record invocation you start (or match on a more specific pattern including --container-name $container_name).
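A sketch of the PID-capture approach: record the PID of the process you start via `$!` and signal only that process. Here `sleep 300` is a stand-in for the real `keploy record ...` invocation:

```shell
# Start the long-running recorder (stand-in) in the background and capture
# its PID directly, instead of pgrep'ing for any 'keploy record' on the host.
sleep 300 &
rec_pid=$!
echo "record pid captured: $rec_pid"
# Later, interrupt exactly the process we started.
kill -INT "$rec_pid" 2>/dev/null || true
wait "$rec_pid" 2>/dev/null || true
echo "record stopped"
```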
while [ "$app_started" = false ]; do
    if curl -sf http://localhost:8080/health > /dev/null 2>&1; then
        app_started=true
    fi
    sleep 3
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
        echo "App failed to start after $max_attempts attempts"
        exit 1
    fi
done
echo "App started"
curl -sf http://localhost:8080/api/transfer
curl -sf http://localhost:8080/api/batch-transfer
curl -sf http://localhost:8080/api/post-transfer
curl -sf http://localhost:8080/health
The curl calls don’t set any request timeout, so a hung HTTP call can stall the entire e2e run. Consider adding --max-time (and optionally a small --retry/--retry-connrefused) to the health polling and the API calls to keep CI runs bounded.
Suggested change:
- while [ "$app_started" = false ]; do
-     if curl -sf http://localhost:8080/health > /dev/null 2>&1; then
-         app_started=true
-     fi
-     sleep 3
-     attempt=$((attempt + 1))
-     if [ "$attempt" -ge "$max_attempts" ]; then
-         echo "App failed to start after $max_attempts attempts"
-         exit 1
-     fi
- done
- echo "App started"
- curl -sf http://localhost:8080/api/transfer
- curl -sf http://localhost:8080/api/batch-transfer
- curl -sf http://localhost:8080/api/post-transfer
- curl -sf http://localhost:8080/health
+ health_check_curl_opts=(--connect-timeout 2 --max-time 5 --retry 2 --retry-delay 1 --retry-connrefused -sf)
+ api_curl_opts=(--connect-timeout 2 --max-time 10 -sf)
+ while [ "$app_started" = false ]; do
+     if curl "${health_check_curl_opts[@]}" http://localhost:8080/health > /dev/null 2>&1; then
+         app_started=true
+     fi
+     sleep 3
+     attempt=$((attempt + 1))
+     if [ "$attempt" -ge "$max_attempts" ]; then
+         echo "App failed to start after $max_attempts attempts. Check docker compose logs and confirm the service is listening on http://localhost:8080/health."
+         exit 1
+     fi
+ done
+ echo "App started"
+ curl "${api_curl_opts[@]}" http://localhost:8080/api/transfer
+ curl "${api_curl_opts[@]}" http://localhost:8080/api/batch-transfer
+ curl "${api_curl_opts[@]}" http://localhost:8080/api/post-transfer
+ curl "${health_check_curl_opts[@]}" http://localhost:8080/health
echo "Recording completed"
sleep 5
docker compose down
keploy test -c 'docker compose up' --container-name "$container_name" --apiTimeout 60 --delay 20 --generate-github-actions=false &> "${container_name}-replay.txt"
After replay (keploy test -c 'docker compose up' ...), the script doesn’t call docker compose down, which can leave the stack running after the test completes (or fails). Ensure the compose stack is torn down after replay as well (or via a top-level trap).
- Rename binary from repro-app to proxy-stress-test in Dockerfile.
- Update OTel service name to match sample directory name.
- Add cleanup trap (docker compose down) on script exit.
- Add curl --max-time to all requests to prevent CI hangs.
- Add docker compose down after replay phase.
- Add 5s timeout to /health db ping.
- Fix misleading "concurrently" comment (queries are sequential).

Signed-off-by: Shubham Jain <shubham@keploy.io>
Signed-off-by: slayerjain <shubhamkjain@outlook.com>
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.
docker compose build
rm -rf keploy/
docker compose build (and subsequent compose commands) will fail if this script is invoked from outside the proxy-stress-test/ directory (e.g. bash proxy-stress-test/test.sh from repo root). Consider cd-ing to the script directory at the start (and using paths relative to it) so the script is location-independent.
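A sketch of the location-independence fix: resolve the script's own directory and `cd` into it before any `docker compose` command. `BASH_SOURCE` is the standard bash way to find the script path:

```shell
# Make the script runnable from any working directory by cd-ing to the
# directory the script itself lives in (falls back to $0 for non-bash shells).
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
cd "$script_dir"
echo "running from: $(pwd)"
```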
send_request &
keploy record -c "docker compose up" --container-name "$container_name" --generate-github-actions=false |& tee "${container_name}.txt"
send_request runs in the background but its exit status is never checked, and the wait inside send_request doesn’t wait for the keploy record process (it only waits for jobs spawned by that background function). If any curl fails, send_request can exit early without interrupting recording, leaving keploy record running/hanging. Capture the background PID (req_pid=$!) and wait for it in the main flow (and/or ensure failures still trigger container_kill).
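A minimal sketch of propagating the background job's status: capture `$!` right after launching and `wait` on that PID in the main flow. The stand-in `send_request` here simulates a failing curl:

```shell
# Launch the request driver in the background, remember its PID, and
# surface its failure in the main script instead of silently ignoring it.
send_request() { sleep 1; return 1; }   # stand-in that simulates a curl failure
send_request &
req_pid=$!
if ! wait "$req_pid"; then
    echo "send_request failed"
    # In the real script this is where container_kill / exit 1 would go.
fi
```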
succeeded++
} else {
    failed++
    errors = append(errors, fmt.Sprintf("HTTP %d", status))
errors is capped for network errors (len(errors) < 5), but non-2xx HTTP statuses append without any limit. With a large BATCH_SIZE, this can create a very large response payload and unnecessary memory growth. Consider applying the same cap for non-2xx statuses too.
Suggested change:
- errors = append(errors, fmt.Sprintf("HTTP %d", status))
+ if len(errors) < 5 {
+     errors = append(errors, fmt.Sprintf("HTTP %d", status))
+ }
Ensures the script works when invoked from any working directory.

Signed-off-by: Shubham Jain <shubham@keploy.io>
Signed-off-by: slayerjain <shubhamkjain@outlook.com>
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.
send_request(){
    sleep 15
    max_attempts=30
    attempt=0
    while ! curl -sf --max-time 5 http://localhost:8080/health > /dev/null 2>&1; do
        attempt=$((attempt + 1))
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "App failed to start after $max_attempts attempts"
            exit 1
        fi
        sleep 3
send_request runs in the background, so exit 1 inside it will not fail the main script; if the app never becomes healthy, keploy record can keep running indefinitely. Consider propagating failures back to the main process (e.g., have the failure path also stop recording and have the main script wait on the request job PID and exit non-zero if it failed).
container_kill() {
    REC_PID="$(pgrep -n -f 'keploy record' || true)"
    if [ -n "$REC_PID" ]; then
        kill -INT "$REC_PID" 2>/dev/null || true
    fi
}
container_kill finds a PID via pgrep -f 'keploy record', which can match/kill the wrong process if multiple Keploy runs exist on the host. It will be more reliable to capture the keploy record PID explicitly (or otherwise scope the match) and signal that specific process.
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := db.PingContext(ctx); err != nil {
If PingContext fails, initDB returns without closing the opened *sql.DB, leaking resources on the failure path. Close the DB before returning the ping error.
Suggested change:
- if err := db.PingContext(ctx); err != nil {
+ if err := db.PingContext(ctx); err != nil {
+     if closeErr := db.Close(); closeErr != nil {
+         return nil, fmt.Errorf("ping failed: %w; closing DB after failed ping also failed: %v", err, closeErr)
+     }
if err == nil {
    t.Proxy = http.ProxyURL(proxyURL)
}
When HTTP_PROXY_URL is set but cannot be parsed, the proxy configuration is silently ignored and requests go direct, which can make the stress test produce false positives. Consider surfacing this as an error (or at least logging and failing fast) when a non-empty proxy URL is invalid.
Suggested change:
- if err == nil {
-     t.Proxy = http.ProxyURL(proxyURL)
- }
+ if err != nil || proxyURL.Scheme == "" || proxyURL.Host == "" {
+     log.Fatalf("Invalid HTTP_PROXY_URL %q. Update HTTP_PROXY_URL to a valid proxy URL such as http://proxy.example:8080 and rerun the stress test.", httpProxyURL)
+ }
+ t.Proxy = http.ProxyURL(proxyURL)
log.Printf("Starting background noise: %d outgoing connections/sec", bgNoiseConns)
ticker := time.NewTicker(time.Second / time.Duration(bgNoiseConns))
defer ticker.Stop()
for {
    select {
    case <-ctx.Done():
        return
    case <-ticker.C:
        go func() {
            // Each of these during replay = "no matching mock found" = error to errChannel
            client := &http.Client{Transport: newTransport(), Timeout: 5 * time.Second}
            req, err := http.NewRequest("GET", httpsTarget, nil)
            if err != nil {
                return
            }
            resp, err := client.Do(req)
            if err == nil {
                resp.Body.Close()
            }
        }()
backgroundNoise spawns a new goroutine on every tick with no upper bound; with higher BG_NOISE_CONNS this can accumulate and exhaust resources if requests take longer than the tick interval. Consider reusing a fixed worker pool / rate-limited loop (or a semaphore) and reusing the http.Client/Transport.
Suggested change:
- log.Printf("Starting background noise: %d outgoing connections/sec", bgNoiseConns)
- ticker := time.NewTicker(time.Second / time.Duration(bgNoiseConns))
- defer ticker.Stop()
- for {
-     select {
-     case <-ctx.Done():
-         return
-     case <-ticker.C:
-         go func() {
-             // Each of these during replay = "no matching mock found" = error to errChannel
-             client := &http.Client{Transport: newTransport(), Timeout: 5 * time.Second}
-             req, err := http.NewRequest("GET", httpsTarget, nil)
-             if err != nil {
-                 return
-             }
-             resp, err := client.Do(req)
-             if err == nil {
-                 resp.Body.Close()
-             }
-         }()
+ log.Printf("Starting background noise: %d outgoing connections/sec", bgNoiseConns)
+ client := &http.Client{
+     Transport: newTransport(),
+     Timeout:   5 * time.Second,
+ }
+ ticker := time.NewTicker(time.Second / time.Duration(bgNoiseConns))
+ defer ticker.Stop()
+ // Bound the number of in-flight requests so slower responses cannot cause
+ // unbounded goroutine growth. Matching the configured request rate keeps the
+ // background load steady without exhausting resources.
+ inFlight := make(chan struct{}, bgNoiseConns)
+ for {
+     select {
+     case <-ctx.Done():
+         return
+     case <-ticker.C:
+         select {
+         case inFlight <- struct{}{}:
+             go func() {
+                 defer func() { <-inFlight }()
+                 // Each of these during replay = "no matching mock found" = error to errChannel
+                 req, err := http.NewRequestWithContext(ctx, "GET", httpsTarget, nil)
+                 if err != nil {
+                     return
+                 }
+                 resp, err := client.Do(req)
+                 if err == nil {
+                     resp.Body.Close()
+                 }
+             }()
+         default:
+             // Skip this tick when all worker slots are busy.
+         }
// Init OTel (Issue 2 — exports to localhost:4318 with no collector)
shutdownOTel, err := initOTel(ctx)
if err != nil {
    log.Printf("OTel init returned error (expected if no collector): %v", err)
This log line reports an OTel init error but doesn't provide a clear next step (e.g., run a collector at the configured endpoint or disable OTel via OTEL_ENABLED=false). Adding an actionable hint would make troubleshooting easier.
Suggested change:
- log.Printf("OTel init returned error (expected if no collector): %v", err)
+ log.Printf("OTel init failed: %v. To enable telemetry, start an OTLP collector at %s. If telemetry is not needed, disable it with OTEL_ENABLED=false.", err, otelEndpoint)
postTarget := env("HTTPS_POST_TARGET", "https://httpbin.org/post")
req, err := http.NewRequestWithContext(ctx, "POST", postTarget, bytes.NewReader([]byte(`{"action":"transfer","amount":100}`)))
if err != nil {
    http.Error(w, err.Error(), 500)
Use http.StatusInternalServerError instead of the literal 500 for clarity and to avoid magic numbers.
Suggested change:
- http.Error(w, err.Error(), 500)
+ http.Error(w, err.Error(), http.StatusInternalServerError)
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 9 comments.
pg_size_pretty(avg(length(large_payload)::bigint)) AS avg_payload,
pg_size_pretty(sum(length(large_payload)::bigint)) AS total_payload
FROM large_records
UNION ALL
SELECT 'wide_records',
    count(*),
    pg_size_pretty(avg(octet_length(col_01)::bigint)),
The pg_size_pretty(avg(...::bigint)) calls will fail because avg(bigint) returns numeric, and Postgres doesn't have pg_size_pretty(numeric). Cast the result of avg(...) to bigint (or compute the average via sum()/count() and cast) so the init script doesn't error during container startup.
Suggested change:
- pg_size_pretty(avg(length(large_payload)::bigint)) AS avg_payload,
- pg_size_pretty(sum(length(large_payload)::bigint)) AS total_payload
- FROM large_records
- UNION ALL
- SELECT 'wide_records',
-     count(*),
-     pg_size_pretty(avg(octet_length(col_01)::bigint)),
+ pg_size_pretty(avg(length(large_payload)::bigint)::bigint) AS avg_payload,
+ pg_size_pretty(sum(length(large_payload)::bigint)) AS total_payload
+ FROM large_records
+ UNION ALL
+ SELECT 'wide_records',
+     count(*),
+     pg_size_pretty(avg(octet_length(col_01)::bigint)::bigint),
ports:
  - "5432:5432"
Binding Postgres to host port 5432:5432 is likely to conflict with a local Postgres (and differs from other samples that use non-default host ports). Consider removing the host port mapping (the app can reach db on the compose network) or mapping to a non-default host port to make the test more reliable.
| ports: | |
| - "5432:5432" |
```go
func newTransport() *http.Transport {
	t := &http.Transport{
		// Force new connections — no pooling, no keep-alive.
		// Each request = new TCP conn = new CONNECT tunnel = new cert.
		DisableKeepAlives: true,
		MaxIdleConns:      0,
		IdleConnTimeout:   1 * time.Nanosecond,
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: true, //nolint:gosec // intentional: accept Keploy MITM proxy cert
		},
		// Custom dialer with short timeouts
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: -1, // disable TCP keep-alive
		}).DialContext,
	}
```

`newTransport()` may negotiate HTTP/2 with the external HTTPS target. If the goal is to reproduce the proxy bug that keys off an HTTP/1.1 payload prefix (e.g., starting with `POST `), HTTP/2 will prevent that reproduction. Consider explicitly disabling HTTP/2 on the transport to keep the traffic HTTP/1.1 and make the repro deterministic.
```go
		return
	}
	log.Printf("Starting background noise: %d outgoing connections/sec", bgNoiseConns)
	ticker := time.NewTicker(time.Second / time.Duration(bgNoiseConns))
```

`time.NewTicker(time.Second / time.Duration(bgNoiseConns))` will panic if `bgNoiseConns` is set high enough that the computed interval becomes 0. Consider clamping `bgNoiseConns` to a safe maximum or computing the interval with a minimum of 1ns/1ms to avoid runtime panics from misconfiguration.

Suggested change:

```go
	interval := time.Second / time.Duration(bgNoiseConns)
	if interval <= 0 {
		interval = time.Nanosecond
	}
	ticker := time.NewTicker(interval)
```
```go
		} else if status >= 200 && status < 300 {
			succeeded++
		} else {
			failed++
			errors = append(errors, fmt.Sprintf("HTTP %d", status))
		}
```

In `batchTransferHandler`, non-2xx HTTP responses always append to `errors` without any cap, which can grow unbounded if `BATCH_SIZE` is increased. Consider limiting the number of stored error strings (similar to the `len(errors) < 5` check used for `err != nil`) to keep response size predictable.
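One hedged way to implement such a cap (the helper `recordError` and constant `maxStoredErrors` are hypothetical names, not from the PR):

```go
package main

import "fmt"

// maxStoredErrors bounds the error list so the JSON response stays a
// predictable size no matter how large BATCH_SIZE is configured.
const maxStoredErrors = 5

// recordError appends msg only while the cap has not been reached.
func recordError(errs []string, msg string) []string {
	if len(errs) < maxStoredErrors {
		errs = append(errs, msg)
	}
	return errs
}

func main() {
	var errs []string
	// Simulate 100 failing responses; only the first 5 are retained.
	for i := 0; i < 100; i++ {
		errs = recordError(errs, fmt.Sprintf("HTTP %d", 500))
	}
	fmt.Println("stored:", len(errs))
}
```

A "first N plus a total count" scheme would preserve more signal; the fixed cap is just the simplest bounded variant.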
```sh
send_request &
keploy record -c "docker compose up" --container-name "$container_name" --generate-github-actions=false |& tee "${container_name}.txt"

if grep -q "panic:" "${container_name}.txt"; then
    echo "Panic detected during recording"
    cat "${container_name}.txt"
    exit 1
fi

echo "Recording completed"
sleep 5
docker compose down

keploy test -c 'docker compose up' --container-name "$container_name" --apiTimeout 60 --delay 20 --generate-github-actions=false &> "${container_name}-replay.txt"

docker compose down

if grep -q "panic:" "${container_name}-replay.txt"; then
    echo "Panic detected during replay"
    cat "${container_name}-replay.txt"
    exit 1
```

Because the script runs with `set -e`, a non-zero exit from `keploy test` (or `keploy record`) will abort the script before the later diagnostics (panic grep, report parsing, log printing) run. If those diagnostics are important for CI debugging, capture the exit code and continue to the checks (then fail at the end).
```go
// Also query wide_records (20 columns × 3.2KB each = 64KB per DataRow)
wideRows, err := db.QueryContext(ctx,
	`SELECT * FROM wide_records LIMIT 20`)
if err == nil {
	for wideRows.Next() {
		var id int
		cols := make([]string, 20)
		ptrs := make([]interface{}, 21)
		ptrs[0] = &id
		for i := range cols {
			ptrs[i+1] = &cols[i]
		}
		if scanErr := wideRows.Scan(ptrs...); scanErr != nil {
			break
		}
	}
	wideRows.Close()
}
```

The `wide_records` query ignores scan/iteration errors: on a `Scan` error it just breaks without surfacing anything in the response, and `wideRows.Err()` is never checked. If this endpoint is meant to detect Postgres/proxy decoding failures, capture these errors into `result.DBError` (and ensure `wideRows` is closed via `defer` after a successful query).
```go
// Also query multiple large Postgres result sets
dbRows := 0
for i := 0; i < 3; i++ {
	rows, err := db.QueryContext(ctx,
		`SELECT id, name, description, large_payload FROM large_records ORDER BY id LIMIT 100`)
	if err != nil {
		continue
	}
	for rows.Next() {
		var id int
		var name, desc, payload string
		_ = rows.Scan(&id, &name, &desc, &payload)
		dbRows++
	}
	rows.Close()
}
```

The batch DB query loop discards `rows.Scan` errors and never checks `rows.Err()`. That can silently undercount `dbRows` and hide the very Postgres/proxy decode failures this stress test is supposed to surface. Consider handling `Scan`/`Err` and returning (or recording) a DB error when they occur.
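The `Scan`/`Err` discipline being asked for can be sketched without a live database by coding against the same method set `*sql.Rows` exposes (`rowIter` and `fakeRows` below are hypothetical stand-ins for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// rowIter mirrors the *sql.Rows methods the loop needs.
type rowIter interface {
	Next() bool
	Scan(dest ...any) error
	Err() error
	Close() error
}

// countRows drains the iterator and surfaces the first Scan or
// iteration error instead of silently undercounting.
func countRows(rows rowIter) (int, error) {
	defer rows.Close()
	n := 0
	for rows.Next() {
		var id int
		if err := rows.Scan(&id); err != nil {
			return n, fmt.Errorf("scan row %d: %w", n, err)
		}
		n++
	}
	if err := rows.Err(); err != nil { // e.g. a mid-stream decode failure
		return n, fmt.Errorf("iterate: %w", err)
	}
	return n, nil
}

// fakeRows yields 3 rows, then fails during iteration.
type fakeRows struct{ i int }

func (f *fakeRows) Next() bool             { f.i++; return f.i <= 3 }
func (f *fakeRows) Scan(dest ...any) error { return nil }
func (f *fakeRows) Err() error {
	if f.i > 3 {
		return errors.New("connection reset mid-stream")
	}
	return nil
}
func (f *fakeRows) Close() error { return nil }

func main() {
	n, err := countRows(&fakeRows{})
	fmt.Println(n, err != nil) // the error is reported, not swallowed
}
```

In the real handler, the returned error would be stored in `result.DBError` so a proxy decode failure shows up in the response instead of as a quietly smaller row count.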
```go
var (
	listenAddr      = env("LISTEN_ADDR", ":8080")
	dbDSN           = env("DATABASE_URL", "postgres://repro:repro@localhost:5432/reprodb?sslmode=disable")
	httpsTarget     = env("HTTPS_TARGET", "https://httpbin.org/get")
	httpProxyURL    = env("HTTP_PROXY_URL", "")        // e.g. http://forward-proxy:3128
	concurrentConns = envInt("CONCURRENT_CONNS", 20)   // number of parallel HTTPS connections per request
	otelEndpoint    = env("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4318")
	otelEnabled     = env("OTEL_ENABLED", "true") == "true"
	otelInterval    = envDuration("OTEL_EXPORT_INTERVAL", 5*time.Second) // configurable OTel export interval
	bgNoiseConns    = envInt("BG_NOISE_CONNS", 0)      // background HTTP noise connections per second
```

`OTEL_EXPORTER_OTLP_ENDPOINT` is a standard OpenTelemetry env var that is commonly set as a URL (e.g. `http://localhost:4318`). The OTLP HTTP exporters here use `WithEndpoint(otelEndpoint)`, which expects a `host:port` (no scheme). Consider either accepting both formats (strip `http(s)://` when present) or using a custom env var name to avoid surprising misconfiguration.
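Accepting both formats can be as small as stripping the scheme before handing the value to the exporter (`normalizeOTLPEndpoint` is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeOTLPEndpoint converts both "host:port" and URL-style values of
// OTEL_EXPORTER_OTLP_ENDPOINT into the bare host:port that a
// WithEndpoint-style option expects.
func normalizeOTLPEndpoint(ep string) string {
	ep = strings.TrimPrefix(ep, "https://")
	ep = strings.TrimPrefix(ep, "http://")
	return strings.TrimSuffix(ep, "/")
}

func main() {
	fmt.Println(normalizeOTLPEndpoint("http://localhost:4318")) // URL form
	fmt.Println(normalizeOTLPEndpoint("localhost:4318"))        // bare form
}
```

A fuller version would also remember whether the scheme was `http://` so it could toggle the exporter's insecure option, but the prefix strip alone removes the silent-misconfiguration trap.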
- Set CONCURRENT_CONNS=42 and BATCH_SIZE=42, matching the production traffic pattern from the Agoda travel-card-api.
- Enable OTel with no collector to generate mock-not-found errors that stress the error channel during replay.
- Add BG_NOISE_CONNS=2 for background connection noise.
- Set deploy.resources.limits: cpus=0.50, memory=512M to simulate resource-constrained K8s pods where the bugs are observable.

Signed-off-by: Shubham Jain <shubham@keploy.io>
Signed-off-by: slayerjain <shubhamkjain@outlook.com>
Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
```go
func initDB() (*sql.DB, error) {
	db, err := sql.Open("postgres", dbDSN)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(5)
	db.SetConnMaxLifetime(5 * time.Minute)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := db.PingContext(ctx); err != nil {
```

`initDB()` returns early on `PingContext` error without closing the `*sql.DB`, which leaks resources (and can keep background goroutines running) on startup failures. Close `db` before returning the ping error (or defer a close until success is confirmed).

Suggested change:

```go
func initDB() (db *sql.DB, err error) {
	db, err = sql.Open("postgres", dbDSN)
	if err != nil {
		return nil, err
	}
	defer func() {
		if err != nil {
			_ = db.Close()
		}
	}()
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(5)
	db.SetConnMaxLifetime(5 * time.Minute)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err = db.PingContext(ctx); err != nil {
```
```go
// If a forward proxy is configured, route through it (HTTP CONNECT tunnel)
if httpProxyURL != "" {
	proxyURL, err := url.Parse(httpProxyURL)
	if err == nil {
		t.Proxy = http.ProxyURL(proxyURL)
	}
}
```

`newTransport()` silently ignores invalid `HTTP_PROXY_URL` values (parse errors), which can cause the test to run without a proxy and invalidate the intended stress scenario. Consider validating/parsing the proxy URL once at startup and failing fast (or at least logging a clear error) when it's invalid.
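A fail-fast version of that validation, done once at startup, might look like this (`parseProxy` is an illustrative helper, not code from the PR):

```go
package main

import (
	"fmt"
	"net/url"
)

// parseProxy validates HTTP_PROXY_URL once at startup. An empty value means
// the proxy is intentionally disabled; anything else must parse to a URL
// with both a scheme and a host, otherwise the caller should abort.
func parseProxy(raw string) (*url.URL, error) {
	if raw == "" {
		return nil, nil // proxy intentionally disabled
	}
	u, err := url.Parse(raw)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return nil, fmt.Errorf("invalid HTTP_PROXY_URL %q: %v", raw, err)
	}
	return u, nil
}

func main() {
	u, err := parseProxy("http://forward-proxy:3128")
	fmt.Println(u.Host, err == nil)

	// A malformed value now produces a hard error instead of a silent
	// direct connection that invalidates the stress scenario.
	_, err = parseProxy("://bad")
	fmt.Println(err != nil)
}
```

In `main()` the error would be passed to `log.Fatalf`, so a typo in the env var stops the container instead of quietly skipping the CONNECT-tunnel path.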
```go
for _, r := range results {
	if r.err != nil {
		result.HTTPSError = r.err.Error()
	} else {
		result.HTTPSStatus = r.status
```

The loop `for _, r := range results` shadows the `r *http.Request` parameter from the handler, which is easy to misread and can lead to mistakes when editing this function. Rename the loop variable to avoid shadowing.

Suggested change:

```go
for _, res := range results {
	if res.err != nil {
		result.HTTPSError = res.err.Error()
	} else {
		result.HTTPSStatus = res.status
```
```sh
send_request &
keploy record -c "docker compose up" --container-name "$container_name" --generate-github-actions=false |& tee "${container_name}.txt"
```

`docker compose up` is invoked without `--compatibility`, but the resource constraints are defined under `deploy.resources.limits`, which Docker Compose typically ignores unless compatibility mode (or Swarm) is used. If the CPU/memory limit is required for this regression, update the command to run Compose in compatibility mode or switch to Compose-supported limit fields so the stress conditions are actually enforced in CI.
```go
status := "ok"
ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()
if err := db.PingContext(ctx); err != nil {
	status = "db_down: " + err.Error()
}
w.Header().Set("Content-Type", "application/json")
```

`/health` always returns HTTP 200 even when the DB ping fails (`status = "db_down: ..."`). Since the script uses `curl -sf` on `/health` to gate test execution, DB failures won't fail the readiness check. Consider returning a non-2xx status (e.g., 503) when `PingContext` fails.

Suggested change:

```go
status := "ok"
httpStatus := http.StatusOK
ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()
if err := db.PingContext(ctx); err != nil {
	status = "db_down: " + err.Error()
	httpStatus = http.StatusServiceUnavailable
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(httpStatus)
```
Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.
```go
// /api/post-transfer — POST endpoint to trigger SQS parser misclassification (Issue 4)
// The SQS parser matches any buffer starting with "POST " — this HTTP POST through
// the CONNECT tunnel will be first evaluated by the SQS parser before falling back.
func postTransferHandler(db *sql.DB) http.HandlerFunc {
```

`postTransferHandler` takes `db *sql.DB` but does not use it. Either remove the parameter, rename it to `_` to indicate it's intentionally unused, or use it (e.g., for a lightweight DB query) so the handler signature doesn't suggest a dependency it doesn't have.

Suggested change:

```go
func postTransferHandler(_ *sql.DB) http.HandlerFunc {
```
The proxy-stress-test sample app is in keploy/samples-go#217, which hasn't been merged to main yet. The CI checks out samples-go from main, so the directory doesn't exist and the test fails. The test script remains in the repo and will be re-added to the matrix once the samples-go PR is merged. Signed-off-by: Shubham Jain <shubham@keploy.io> Signed-off-by: slayerjain <shubhamkjain@outlook.com>
* fix: add TLS cert cache and continuous error channel drain

  1. TLS cert cache (Issue 1): CertForClient now caches generated certificates by hostname using an expirable LRU (size=1024, TTL=24h). Previously, every CONNECT tunnel generated a fresh X.509 cert even for the same hostname — 42 concurrent tunnels meant 42 parallel cert generations, saturating CPU in K8s pods.
  2. Continuous error drain (Issue 2): Added StartErrorDrain(), which runs a background goroutine that continuously reads from the error channel. Errors are routed to a test-scoped accumulator (BeginTestErrorCapture/EndTestErrorCapture) during active tests, and discarded between tests. This prevents background services (OTel, health checks) from filling the 100-slot error channel and starving test coordination.

  Closes: travel-card-api recording 322s p99 latency and replay hang.

* fix: address Copilot review comments on TLS cert cache and error drain

  - Fix goroutine leak in helperClientHello: replace the self-referencing done channel with a Read-based block; cleanup waits for the dial goroutine to exit.
  - Use t.Errorf+return instead of t.Fatal inside goroutines in TestCertCacheHit to avoid panics from t.Fatal called outside the test goroutine.
  - Check the x509.ParseCertificate error in both TestCertCacheHit and TestCertCacheDistinctHostnames to prevent nil-pointer panics.
  - Use unique hostnames in BenchmarkCertForClient so it measures actual cert generation instead of cache hits.
  - Fix busy loop in GetMockErrors when errChannel is closed: use a labeled break to exit the for loop instead of only breaking out of the select.
  - Guard StartErrorDrain with sync.Once to prevent multiple drain goroutines.
  - Reduce noisy debug logging for discarded errors: log only the first and every 100th discard instead of every single one.

* fix: pre-populate tls.Certificate.Leaf to prevent data race on cache hit

  crypto/tls lazily parses Certificate.Leaf on first use, which is not goroutine-safe. When a cached certificate is reused across concurrent TLS handshakes, this lazy initialization could cause a data race. Parse Leaf eagerly before caching to ensure all consumers see a fully-initialized certificate. Also document that the cache key (hostname only) is sufficient because caPrivKey, caCertParsed, and backdate are constant per proxy/test-run.

* ci: add proxy-stress-test e2e pipeline for regression testing

  Adds a new Docker-based e2e test that exercises:
  - Concurrent HTTPS through CONNECT tunnel (TLS cert caching)
  - Large PostgreSQL DataRow responses (wire protocol reassembly)
  - HTTP POST through forward proxy (MatchType validation)

  The test records traffic from a Go app that makes 10+ concurrent HTTPS calls through tinyproxy and queries Postgres with 100KB+ rows, then replays and verifies all test sets pass. The sample app lives in keploy/samples-go/proxy-stress-test.

* fix: address all remaining Copilot comments on cert cache and tests

  - helperClientHello returns an error instead of calling t.Fatal, making it safe to call from goroutines in concurrent tests.
  - Benchmark uses unique hostnames and proper error handling; no leaked goroutines.
  - Handle the ParseCertificate error when pre-populating Leaf: log and skip caching instead of silently discarding the error.
  - Add a nolint:govet annotation for the intentional sync.Once reset in tests.

* fix: wire up StartErrorDrain in StartProxy initialization

  StartErrorDrain was defined but never called. It is now invoked in StartProxy before any connections are handled, ensuring the error channel is continuously drained from proxy start.

* fix: add panic recovery to error drain goroutine

  Added defer utils.Recover to the StartErrorDrain goroutine, matching the pattern used by all other goroutines in the proxy.

* ci: add CPU throttling to proxy stress test for reliable reproduction

  Apply a 0.5 CPU limit to the Keploy container after startup via docker update. This simulates the resource-constrained K8s pod where the cert storm causes measurable latency and error channel saturation becomes visible. Without throttling, the bugs exist but are masked by excess CPU on CI runners. Also added curl timeouts, a health check retry limit, and docker compose down after replay.

* fix: simplify proxy stress test CI script for reliability

  - Remove CPU throttling via docker update (caused timing issues in CI)
  - Add missing report file check to avoid grep failure on missing files
  - Add docker compose down status logging
  - Increase health check attempts for slower CI runners
  - Reduce delay for faster test completion

* ci: remove proxy-stress-test from matrix until samples-go PR is merged

  The proxy-stress-test sample app is in keploy/samples-go#217, which hasn't been merged to main yet. The CI checks out samples-go from main, so the directory doesn't exist and the test fails. The test script remains in the repo and will be re-added to the matrix once the samples-go PR is merged.

* ci: re-add proxy-stress-test to matrix (samples-go#217 merged)

  The proxy-stress-test sample app is now on the samples-go main branch. Re-adding it to the CI matrix for e2e regression testing.

* fix: prevent early exit on replay failure in proxy-stress-test CI

  GitHub Actions bash steps run with set -eo pipefail by default. The replay command can return non-zero on test failures, which causes the script to exit before checking the reports. Added || true so the script continues to the report-checking logic.

* ci: e2e validate all 4 fixes with OTel enabled

  Validates all fixes e2e: recording+replay complete, no panics, no data races, no error channel overflow, no PG decode failures. OTel enabled (no collector) to exercise error drain under load.

Signed-off-by: Shubham Jain <shubham@keploy.io>
Signed-off-by: slayerjain <shubhamkjain@outlook.com>
Signed-off-by: Priyanshu Singh <s.priyanshu.coder@gmail.com>
Summary
Adds a Go sample app that exercises Keploy proxy under stress conditions for e2e regression testing.
What the app does
Resource constraints
Why it's needed
This sample app is used by the `proxy-stress-test` CI pipeline in keploy/keploy to verify that recording and replay work correctly under conditions that previously caused high latency and replay hangs.
Files