HYPERFLEET-625 - fix: Add timeout to testcontainer teardown to prevent Prow hang #51

rafabene · 2026-02-09T13:45:57Z

Summary

Prevents Prow CI integration test jobs from getting stuck in "pending" state when a test panics, by ensuring all teardown resources are cleaned up with timeouts.

Root Cause

When an integration test panics, Go's test runner recovers the panic and marks the test as failed. The process then enters TestMain's teardown phase. Several teardown steps could hang indefinitely:

- Testcontainer.Close() called container.Terminate() with context.Background() (no timeout)
- apiServer.Stop() and metricsServer.Stop() called httpServer.Shutdown(context.Background()) (no timeout)
- stopMetricsServer and stopHealthServer were never called during teardown
- No force-exit mechanism if teardown hung overall

If any of these hung, the process never exited, the Prow sidecar never collected results, and the job stayed in "pending" state on GitHub indefinitely.

A complementary fix to the CI job script is in openshift/release#74635, which adds an EXIT trap so that the background `podman system service` process is always killed.

Changes

1. Timeout on `Testcontainer.Close()` (`pkg/db/db_session/testcontainer.go`)

Replace context.Background() with a 30-second timeout context so container.Terminate() doesn't hang indefinitely.

2. Timeout on `apiServer.Stop()` and `metricsServer.Stop()` (`cmd/hyperfleet-api/server/`)

Add a 10-second timeout to httpServer.Shutdown() for consistency with healthServer.Stop(), which already uses a timeout.
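As an illustration, a bounded shutdown with the standard library might look like the sketch below. The helper names `stopWithTimeout` and `stopIdleServer` are assumptions for this example, not functions from the repository.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// stopWithTimeout gives open connections up to limit to drain. If the
// deadline passes first, Shutdown returns the context's error.
func stopWithTimeout(srv *http.Server, limit time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	return srv.Shutdown(ctx)
}

// stopIdleServer shuts down a server that has no listeners or open
// connections, so Shutdown has nothing to wait for.
func stopIdleServer() error {
	return stopWithTimeout(&http.Server{}, 10*time.Second)
}

func main() {
	fmt.Println("shutdown error:", stopIdleServer())
}
```

With context.Background() the same call would wait for as long as any connection stays active, which is the hang described in the root cause.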

3. Watchdog goroutine in `TestMain` (`test/integration/integration_test.go`)

Force os.Exit after 45s if teardown hangs, ensuring the process always terminates and Prow can report the failure.

4. Teardown ordering and completeness (`test/helper.go`)

- Reorder teardowns: teardownEnv runs first to destroy the container before HTTP server shutdowns
- Add stopMetricsServer and stopHealthServer to teardown list (previously started but never stopped)
- Remove CleanDB (container is destroyed anyway, and SQL against a broken connection can hang)
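One way to picture the teardown list is an ordered slice of named steps that all run even if an earlier one fails. The sketch below follows the order given in this description; the `step` type and `runTeardowns` helper are hypothetical, and the step bodies are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// step is one named teardown action. Names mirror the PR description; the
// function bodies in this sketch are placeholders.
type step struct {
	name string
	run  func() error
}

// runTeardowns executes every step in slice order and keeps going after a
// failure, so one broken step cannot prevent the remaining cleanup.
func runTeardowns(steps []step) (order []string, errs []error) {
	for _, s := range steps {
		order = append(order, s.name)
		if err := s.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
		}
	}
	return order, errs
}

// demoOrder returns the execution order for the list described above.
// There is deliberately no CleanDB step.
func demoOrder() string {
	noop := func() error { return nil }
	steps := []step{
		{"teardownEnv", noop},
		{"stopAPIServer", noop},
		{"stopMetricsServer", noop},
		{"stopHealthServer", noop},
	}
	order, _ := runTeardowns(steps)
	return strings.Join(order, ",")
}

func main() {
	fmt.Println(demoOrder())
}
```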

Test plan

- Unit tests pass (make test)
- Lint passes (make lint)
- Integration tests pass in CI (all checks green)
- Verify Prow correctly reports failure status when a test panics (requires openshift/release#74635)

Jira: https://issues.redhat.com/browse/HYPERFLEET-625

coderabbitai · 2026-02-09T13:46:18Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Testcontainer.Close now creates a 30-second timeout context for teardown. Helper adds a new stopHealthServer() method, starts the Health server during initialization, removes CleanDB from the teardown sequence, and reorders teardown to stop API, Metrics, and Health servers before terminating the test container and environment. Integration test main adds a background watchdog goroutine after m.Run() that waits 45 seconds and, if teardown hangs, logs an error and forcibly exits the process with the test exit code.

Sequence Diagram(s)

sequenceDiagram
  participant TestRunner
  participant Helper
  participant APIServer
  participant MetricsServer
  participant HealthServer
  participant Testcontainer
  participant DB

  TestRunner->>Helper: trigger teardown
  Helper->>APIServer: stopAPIServer()
  Note right of APIServer: shutdown
  Helper->>MetricsServer: stopMetricsServer()
  Note right of MetricsServer: shutdown
  Helper->>HealthServer: stopHealthServer()
  Note right of HealthServer: shutdown
  Helper->>Testcontainer: Close(ctx with 30s timeout)
  Note right of Testcontainer: container.Terminate(ctx)
  Testcontainer->>DB: close SQL connection
  Testcontainer-->>Helper: container terminated
  Helper-->>TestRunner: teardown complete

sequenceDiagram
  participant TestRunner
  participant TestProcess
  participant Watchdog
  participant Logger

  TestRunner->>TestProcess: run tests (m.Run())
  TestProcess->>Watchdog: start 45s timer (goroutine)
  alt Teardown completes < 45s
    Watchdog-->>TestProcess: cancel timer
  else 45s elapsed and teardown hanging
    Watchdog->>Logger: log error "teardown timeout"
    Watchdog->>TestProcess: os.Exit(exitCode)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly reflects the main change: adding a timeout to testcontainer teardown to prevent Prow hangs, which matches the core purpose of the PR. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@test/integration/integration_test.go`:
- Around line 61-73: The timeout goroutine mutates and reads the shared variable
exitCode causing a potential data race; fix by avoiding shared mutable access —
capture a local copy of exitCode before starting the goroutine (e.g., localExit
:= exitCode) and use that local variable inside the goroutine when setting
os.Exit, or alternatively use an atomic/chan to communicate the exit code (refer
to exitCode, the anonymous goroutine, logger.Error and os.Exit in the snippet
and update the goroutine to use the safely captured value or atomic.Load/Store).
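The first suggestion, handing the goroutine its own copy of the exit code, could look like the sketch below. `watch` and `capturedCode` are hypothetical names, and the exit function is injected so the example stays runnable; a real watchdog would call os.Exit.

```go
package main

import (
	"fmt"
	"time"
)

// watch starts a goroutine that calls exit(code) once limit elapses.
// code is passed by value, so the goroutine works on its own copy and never
// reads the caller's variable.
func watch(limit time.Duration, code int, exit func(int)) {
	go func() {
		time.Sleep(limit)
		exit(code)
	}()
}

// capturedCode shows that a later write to exitCode in the caller does not
// reach the watchdog goroutine.
func capturedCode() int {
	exitCode := 1
	got := make(chan int, 1)
	watch(20*time.Millisecond, exitCode, func(c int) { got <- c })
	exitCode = 0 // later write by the caller; the watchdog keeps its copy
	return <-got
}

func main() {
	fmt.Println("watchdog saw exit code:", capturedCode())
}
```

Because the goroutine and the caller no longer share a variable, this version needs no atomics or channel for the exit code itself.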

test/integration/integration_test.go

…t Prow hang

When integration tests fail with a panic, the process continues to the teardown phase where container.Terminate() is called with no timeout. If the Docker container termination hangs, the process never exits, causing the Prow CI job to stay stuck in "pending" state indefinitely.

Add a 30-second timeout context to Testcontainer.Close() so the teardown always completes, allowing the process to exit and Prow to report the test failure status back to GitHub.

rh-amarin · 2026-02-09T18:45:41Z

/lgtm

openshift-ci · 2026-02-09T18:45:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rh-amarin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [rh-amarin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rafabene · 2026-02-09T19:18:43Z

/retest

openshift-ci bot requested review from jsell-rh and tirthct February 9, 2026 13:46

rafabene force-pushed the HYPERFLEET-625 branch 5 times, most recently from ea61c8a to 660a8ff Compare February 9, 2026 17:03

coderabbitai bot reviewed Feb 9, 2026

rafabene force-pushed the HYPERFLEET-625 branch 5 times, most recently from b280dd4 to 44c2fb2 Compare February 9, 2026 18:15

rafabene force-pushed the HYPERFLEET-625 branch from 44c2fb2 to b7c4270 Compare February 9, 2026 18:45

openshift-ci bot assigned rh-amarin Feb 9, 2026

openshift-ci bot added the lgtm label Feb 9, 2026

openshift-ci bot added the approved label Feb 9, 2026

rafabene merged commit fd4fcc0 into openshift-hyperfleet:main Feb 9, 2026
8 checks passed
