@rafabene (Contributor) commented Feb 9, 2026

Summary

Prevents Prow CI integration test jobs from getting stuck in "pending" state when a test panics, by ensuring all teardown resources are cleaned up with timeouts.

Root Cause

When an integration test panics, Go's test runner recovers the panic and marks the test as failed. The process then enters TestMain's teardown phase. Several teardown steps could hang indefinitely:

  • Testcontainer.Close() called container.Terminate() with context.Background() (no timeout)
  • apiServer.Stop() and metricsServer.Stop() called httpServer.Shutdown(context.Background()) (no timeout)
  • stopMetricsServer and stopHealthServer were never called during teardown
  • No force-exit mechanism if teardown hung overall

If any of these hung, the process never exited, the Prow sidecar never collected results, and the job stayed in "pending" state on GitHub indefinitely.

A complementary fix to the CI job script is in openshift/release#74635, which adds a trap EXIT to ensure the podman system service background process is always killed.

Changes

1. Timeout on Testcontainer.Close() (pkg/db/db_session/testcontainer.go)

Replace context.Background() with a 30-second timeout context so container.Terminate() doesn't hang indefinitely.

2. Timeout on apiServer.Stop() and metricsServer.Stop() (cmd/hyperfleet-api/server/)

Add a 10-second timeout to httpServer.Shutdown() for consistency with healthServer.Stop(), which already uses a timeout.

3. Watchdog goroutine in TestMain (test/integration/integration_test.go)

Force os.Exit after 45s if teardown hangs, ensuring the process always terminates and Prow can report the failure.

4. Teardown ordering and completeness (test/helper.go)

  • Reorder teardowns: teardownEnv runs first to destroy the container before HTTP server shutdowns
  • Add stopMetricsServer and stopHealthServer to teardown list (previously started but never stopped)
  • Remove CleanDB (container is destroyed anyway, and SQL against a broken connection can hang)

Test plan

  • Unit tests pass (make test)
  • Lint passes (make lint)
  • Integration tests pass in CI (all checks green)
  • Verify Prow correctly reports failure status when a test panics (requires openshift/release#74635)

Jira: https://issues.redhat.com/browse/HYPERFLEET-625

@openshift-ci openshift-ci bot requested review from jsell-rh and tirthct February 9, 2026 13:46
coderabbitai bot commented Feb 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Testcontainer.Close now creates a 30-second timeout context for teardown. Helper adds a new stopHealthServer() method, starts the Health server during initialization, removes CleanDB from the teardown sequence, and reorders teardown to stop API, Metrics, and Health servers before terminating the test container and environment. Integration test main adds a background watchdog goroutine after m.Run() that waits 45 seconds and, if teardown hangs, logs an error and forcibly exits the process with the test exit code.

Sequence Diagram(s)

sequenceDiagram
  participant TestRunner
  participant Helper
  participant APIServer
  participant MetricsServer
  participant HealthServer
  participant Testcontainer
  participant DB

  TestRunner->>Helper: trigger teardown
  Helper->>APIServer: stopAPIServer()
  Note right of APIServer: shutdown
  Helper->>MetricsServer: stopMetricsServer()
  Note right of MetricsServer: shutdown
  Helper->>HealthServer: stopHealthServer()
  Note right of HealthServer: shutdown
  Helper->>Testcontainer: Close(ctx with 30s timeout)
  Note right of Testcontainer: container.Terminate(ctx)
  Testcontainer->>DB: close SQL connection
  Testcontainer-->>Helper: container terminated
  Helper-->>TestRunner: teardown complete
sequenceDiagram
  participant TestRunner
  participant TestProcess
  participant Watchdog
  participant Logger

  TestRunner->>TestProcess: run tests (m.Run())
  TestProcess->>Watchdog: start 45s timer (goroutine)
  alt Teardown completes < 45s
    Watchdog-->>TestProcess: cancel timer
  else 45s elapsed and teardown hanging
    Watchdog->>Logger: log error "teardown timeout"
    Watchdog->>TestProcess: os.Exit(exitCode)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly reflects the main change: adding a timeout to testcontainer teardown to prevent Prow hangs, which matches the core purpose of the PR. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

@rafabene rafabene force-pushed the HYPERFLEET-625 branch 5 times, most recently from ea61c8a to 660a8ff Compare February 9, 2026 17:03
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@test/integration/integration_test.go`:
- Around lines 61-73: The timeout goroutine reads the shared variable exitCode
while the main goroutine may still write it, which is a potential data race.
Fix this by avoiding shared mutable access: capture a local copy of exitCode
before starting the goroutine (e.g., localExit := exitCode) and pass that local
value to os.Exit inside the goroutine, or alternatively use an atomic or a
channel to communicate the exit code (refer to exitCode, the anonymous
goroutine, logger.Error and os.Exit in the snippet, and update the goroutine to
use the safely captured value or atomic.Load/Store).

@rafabene rafabene force-pushed the HYPERFLEET-625 branch 5 times, most recently from b280dd4 to 44c2fb2 Compare February 9, 2026 18:15
…t Prow hang

When integration tests fail with a panic, the process continues to the
teardown phase where container.Terminate() is called with no timeout.
If the Docker container termination hangs, the process never exits,
causing the Prow CI job to stay stuck in "pending" state indefinitely.

Add a 30-second timeout context to Testcontainer.Close() so the teardown
always completes, allowing the process to exit and Prow to report the
test failure status back to GitHub.
@rh-amarin
Contributor

/lgtm

openshift-ci bot commented Feb 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rh-amarin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Feb 9, 2026

@rafabene (Contributor, Author) commented Feb 9, 2026

/retest

@rafabene rafabene merged commit fd4fcc0 into openshift-hyperfleet:main Feb 9, 2026
8 checks passed