Skip to content

NO-JIRA: Fix data race in TestControllerScheduled#2201

Open
tchap wants to merge 1 commit into
openshift:masterfrom
tchap:fix-flaky-test-controller-scheduled
Open

NO-JIRA: Fix data race in TestControllerScheduled#2201
tchap wants to merge 1 commit into
openshift:masterfrom
tchap:fix-flaky-test-controller-scheduled

Conversation

@tchap
Copy link
Copy Markdown
Contributor

@tchap tchap commented May 6, 2026

The observer goroutine could call t.Logf after the test returned,
racing with testing cleanup. Fix by using sync.WaitGroup to ensure
all goroutines finish before the test exits, making the sync callback
cancel-aware to prevent deadlock, using atomic.Int32 for the sync
counter, and replacing deprecated wait.PollImmediateUntil with
wait.PollUntilContextCancel.

=================
WARNING: DATA RACE
Read at 0x00c00056a293 by goroutine 133:
  testing.(*common).destination()
      /usr/local/go/src/testing/testing.go:1058 +0x96
  testing.(*common).log()
      /usr/local/go/src/testing/testing.go:1036 +0xbe
  testing.(*common).Logf()
      /usr/local/go/src/testing/testing.go:1200 +0x8e
  github.com/openshift/library-go/pkg/controller/factory.TestControllerScheduled.func2()
      /home/okupka/Projects/tchap/library-go/pkg/controller/factory/factory_test.go:248 +0x1d5

Previous write at 0x00c00056a293 by goroutine 131:
  testing.tRunner.func1()
      /usr/local/go/src/testing/testing.go:2023 +0x904
  runtime.deferreturn()
      /usr/local/go/src/runtime/panic.go:668 +0x5d
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:2101 +0x38

Goroutine 133 (running) created at:
  github.com/openshift/library-go/pkg/controller/factory.TestControllerScheduled()
      /home/okupka/Projects/tchap/library-go/pkg/controller/factory/factory_test.go:239 +0x713
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:2036 +0x21c
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:2101 +0x38

Goroutine 131 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:2101 +0xb12
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:2585 +0x85
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:2036 +0x21c
  testing.runTests()
      /usr/local/go/src/testing/testing.go:2583 +0x9e9
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:2443 +0xf4b
  main.main()
      _testmain.go:72 +0x164
==================

Summary by CodeRabbit

  • Tests
    • Improved controller test reliability by adding coordinated observation of repeated runs, better cancellation handling, and orderly shutdown to reduce flakiness.
    • Strengthened timing/wait logic to more robustly verify multiple execution cycles.

Note: These are internal testing improvements with no user-visible changes.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 6, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@tchap: This pull request explicitly references no jira issue.

Details

In response to this:

The observer goroutine could call t.Logf after the test function returned, racing with testing cleanup. Use sync.WaitGroup to ensure the goroutine finishes before the test exits. Also replace deprecated wait.PollImmediateUntil with wait.PollUntilContextCancel.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e06c318-74bf-40b0-b70b-6140a9c75714

📥 Commits

Reviewing files that changed from the base of the PR and between 4a9a875 and 10bd453.

📒 Files selected for processing (1)
  • pkg/controller/factory/factory_test.go

Walkthrough

Refactors a unit test in pkg/controller/factory/factory_test.go: reorders an import, makes the WithSync callback context-aware, replaces ad-hoc goroutine sync with an sync.WaitGroup and atomic.Int32 counter, waits (up to 10s) for ≥3 sync calls using wait.PollUntilContextCancel, then cancels and wg.Wait()s.

Changes

Test Concurrency Refactoring

Layer / File(s) Summary
Import Organization
pkg/controller/factory/factory_test.go
Moved clocktesting import position within the file's import block.
WithSync Behavior
pkg/controller/factory/factory_test.go
WithSync callback changed from an unconditional send to a non-blocking select that also listens for ctx.Done().
Synchronization Primitives
pkg/controller/factory/factory_test.go
Introduced var syncCounter atomic.Int32 and var wg sync.WaitGroup to coordinate observer and controller goroutines.
Goroutine Coordination
pkg/controller/factory/factory_test.go
Replaced the previous ad-hoc observer goroutine with two wg-managed goroutines: one to observe syncCalled and increment syncCounter, another to run controller.Run.
Waiting & Shutdown
pkg/controller/factory/factory_test.go
Replaced simple polling with wait.PollUntilContextCancel (10s timeout, 500ms interval) to wait until syncCounter >= 3; then cancels context and calls wg.Wait() for graceful shutdown.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely identifies the main fix: resolving a data race in the TestControllerScheduled test, which matches the primary change in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR modifies factory_test.go, which uses standard Go testing framework (func TestXxx), not Ginkgo. The custom check for Ginkgo test name stability is not applicable.
Test Structure And Quality ✅ Passed Test meets all quality requirements: single responsibility, proper cleanup with WaitGroup, 10s timeout with 500ms polling, meaningful error messages, and follows codebase patterns consistently.
Microshift Test Compatibility ✅ Passed Check not applicable. PR modifies an existing Go unit test (TestControllerScheduled), not new Ginkgo e2e tests. This check only applies when new Ginkgo tests are added.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR modifies a standard Go unit test, not a Ginkgo e2e test. The custom check applies only to new Ginkgo e2e tests. No Ginkgo tests are added; the changes fix a data race in an existing test.
Topology-Aware Scheduling Compatibility ✅ Passed Only test file modified (factory_test.go). No deployment manifests, operator code, or scheduling constraints introduced. Check not applicable here.
Ote Binary Stdout Contract ✅ Passed All changes are within test function body. No process-level stdout writes introduced. Import reorganization and test refactoring do not violate OTE Binary Stdout Contract.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR modifies an existing standard Go unit test (TestControllerScheduled) in library-go, not Ginkgo e2e tests. The custom check only applies to new Ginkgo e2e tests, so it's not applicable here.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from deads2k and p0lyn0mial May 6, 2026 08:56
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tchap
Once this PR has been reviewed and has the lgtm label, please assign bertinatto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/controller/factory/factory_test.go (1)

241-251: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't let shutdown strand a blocked syncCalled sender.

Once cancel() closes ctx.Done(), this observer can exit even if a sync callback is already blocked on the unbuffered syncCalled <- struct{}{} send. In that case controller.Run never returns and wg.Wait() can hang the test.

Suggested fix
 controller := New().ResyncSchedule("@every 1s").WithSync(func(ctx context.Context, controllerContext SyncContext) error {
-	syncCalled <- struct{}{}
+	select {
+	case syncCalled <- struct{}{}:
+	case <-ctx.Done():
+	}
 	return nil
 }).ToController("test", events.NewInMemoryRecorder("fake-controller", clocktesting.NewFakePassiveClock(time.Now())))

Also applies to: 270-271

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/factory/factory_test.go` around lines 241 - 251, The test's
goroutine reading from syncCalled can exit on ctx.Done() while a blocked sender
(the sync callback that does syncCalled <- struct{}{}) remains blocked, causing
controller.Run and wg.Wait() to hang; fix by making the send non-blocking or
cancel-aware: update the sync callback (the place that sends to the syncCalled
channel) to use a select that attempts to send to syncCalled but also selects on
ctx.Done() (or use a buffered syncCalled) so the send never blocks indefinitely;
adjust both occurrences referenced (the send near lines 241-251 and the similar
send at 270-271) to use this cancel-aware non-blocking send.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/factory/factory_test.go`:
- Around line 262-265: The wait condition in the PollUntilContextCancel call is
off-by-one: change the callback used in wait.PollUntilContextCancel (the closure
checking syncCounter) to return syncCounter >= 3 instead of syncCounter > 3 so
the wait completes after observing three syncs as the test and error message
expect; update the closure that references syncCounter and ensure the
surrounding call to wait.PollUntilContextCancel remains unchanged.

---

Outside diff comments:
In `@pkg/controller/factory/factory_test.go`:
- Around line 241-251: The test's goroutine reading from syncCalled can exit on
ctx.Done() while a blocked sender (the sync callback that does syncCalled <-
struct{}{}) remains blocked, causing controller.Run and wg.Wait() to hang; fix
by making the send non-blocking or cancel-aware: update the sync callback (the
place that sends to the syncCalled channel) to use a select that attempts to
send to syncCalled but also selects on ctx.Done() (or use a buffered syncCalled)
so the send never blocks indefinitely; adjust both occurrences referenced (the
send near lines 241-251 and the similar send at 270-271) to use this
cancel-aware non-blocking send.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f50f69de-4948-4af2-9dfb-a72550290c8a

📥 Commits

Reviewing files that changed from the base of the PR and between da43577 and 193e638.

📒 Files selected for processing (1)
  • pkg/controller/factory/factory_test.go

Comment thread pkg/controller/factory/factory_test.go Outdated
@tchap tchap force-pushed the fix-flaky-test-controller-scheduled branch from 193e638 to 4a9a875 Compare May 6, 2026 09:09
The observer goroutine could call t.Logf after the test returned,
racing with testing cleanup. Fix by using sync.WaitGroup to ensure
all goroutines finish before the test exits, making the sync callback
cancel-aware to prevent deadlock, using atomic.Int32 for the sync
counter, and replacing deprecated wait.PollImmediateUntil with
wait.PollUntilContextCancel.
@tchap tchap force-pushed the fix-flaky-test-controller-scheduled branch from 4a9a875 to 10bd453 Compare May 6, 2026 09:19
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 6, 2026

@tchap: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants