-
Notifications
You must be signed in to change notification settings - Fork 12
HYPERFLEET-625 - fix: Add timeout to testcontainer teardown to prevent Prow hang #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughTestcontainer.Close now creates a 30-second timeout context for teardown. Helper adds a new stopHealthServer() method, starts the Health server during initialization, removes CleanDB from the teardown sequence, and reorders teardown to stop API, Metrics, and Health servers before terminating the test container and environment. Integration test main adds a background watchdog goroutine after m.Run() that waits 45 seconds and, if teardown hangs, logs an error and forcibly exits the process with the test exit code. Sequence Diagram(s)sequenceDiagram
participant TestRunner
participant Helper
participant APIServer
participant MetricsServer
participant HealthServer
participant Testcontainer
participant DB
TestRunner->>Helper: trigger teardown
Helper->>APIServer: stopAPIServer()
Note right of APIServer: shutdown
Helper->>MetricsServer: stopMetricsServer()
Note right of MetricsServer: shutdown
Helper->>HealthServer: stopHealthServer()
Note right of HealthServer: shutdown
Helper->>Testcontainer: Close(ctx with 30s timeout)
Note right of Testcontainer: container.Terminate(ctx)
Testcontainer->>DB: close SQL connection
Testcontainer-->>Helper: container terminated
Helper-->>TestRunner: teardown complete
sequenceDiagram
participant TestRunner
participant TestProcess
participant Watchdog
participant Logger
TestRunner->>TestProcess: run tests (m.Run())
TestProcess->>Watchdog: start 45s timer (goroutine)
alt Teardown completes < 45s
Watchdog-->>TestProcess: cancel timer
else 45s elapsed and teardown hanging
Watchdog->>Logger: log error "teardown timeout"
Watchdog->>TestProcess: os.Exit(exitCode)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
ea61c8a to
660a8ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@test/integration/integration_test.go`:
- Around line 61-73: The timeout goroutine mutates and reads the shared variable
exitCode causing a potential data race; fix by avoiding shared mutable access —
capture a local copy of exitCode before starting the goroutine (e.g., localExit
:= exitCode) and use that local variable inside the goroutine when setting
os.Exit, or alternatively use an atomic/chan to communicate the exit code (refer
to exitCode, the anonymous goroutine, logger.Error and os.Exit in the snippet
and update the goroutine to use the safely captured value or atomic.Load/Store).
b280dd4 to
44c2fb2
Compare
…t Prow hang When integration tests fail with a panic, the process continues to the teardown phase where container.Terminate() is called with no timeout. If the Docker container termination hangs, the process never exits, causing the Prow CI job to stay stuck in "pending" state indefinitely. Add a 30-second timeout context to Testcontainer.Close() so the teardown always completes, allowing the process to exit and Prow to report the test failure status back to GitHub.
44c2fb2 to
b7c4270
Compare
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: rh-amarin The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: rh-amarin The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
Summary
Prevents Prow CI integration test jobs from getting stuck in "pending" state when a test panics, by ensuring all teardown resources are cleaned up with timeouts.
Root Cause
When an integration test panics, Go's test runner recovers the panic and marks the test as failed. The process then enters
TestMain's teardown phase. Several teardown steps could hang indefinitely:Testcontainer.Close()calledcontainer.Terminate()withcontext.Background()(no timeout)apiServer.Stop()andmetricsServer.Stop()calledhttpServer.Shutdown(context.Background())(no timeout)stopMetricsServerandstopHealthServerwere never called during teardownIf any of these hung, the process never exited, the Prow sidecar never collected results, and the job stayed in "pending" state on GitHub indefinitely.
A complementary fix to the CI job script is in openshift/release#74635, which adds a
trap EXITto ensure thepodman system servicebackground process is always killed.Changes
1. Timeout on
Testcontainer.Close()(pkg/db/db_session/testcontainer.go)Replace
context.Background()with a 30-second timeout context socontainer.Terminate()doesn't hang indefinitely.2. Timeout on
apiServer.Stop()andmetricsServer.Stop()(cmd/hyperfleet-api/server/)Add 10-second timeout to
httpServer.Shutdown()for consistency withhealthServer.Stop()which already uses a timeout.3. Watchdog goroutine in
TestMain(test/integration/integration_test.go)Force
os.Exitafter 45s if teardown hangs, ensuring the process always terminates and Prow can report the failure.4. Teardown ordering and completeness (
test/helper.go)teardownEnvruns first to destroy the container before HTTP server shutdownsstopMetricsServerandstopHealthServerto teardown list (previously started but never stopped)CleanDB(container is destroyed anyway, and SQL against a broken connection can hang)Test plan
make test)make lint)Jira: https://issues.redhat.com/browse/HYPERFLEET-625