Skip to content

OCPBUGS-86511: Fix flaky TestAsyncCache backend test#16495

Open
jhadvig wants to merge 1 commit into
openshift:mainfrom
jhadvig:OCPBUGS-86511
Open

OCPBUGS-86511: Fix flaky TestAsyncCache backend test#16495
jhadvig wants to merge 1 commit into
openshift:mainfrom
jhadvig:OCPBUGS-86511

Conversation

@jhadvig
Copy link
Copy Markdown
Member

@jhadvig jhadvig commented May 26, 2026

Analysis / Root cause:
The TestAsyncCache test fails intermittently in CI (example failure) with Expected 0 to be greater than 0. Two issues:

  1. wait.UntilWithContext fires runCache immediately when Run() is called, replacing the cached pointer before the test loop's first GetItem() — so matches is always 0 on fast machines.
  2. The fixed sleep-based loop (2s reload period, 1s sleeps, 3 iterations) has no margin for goroutine scheduling delays on loaded CI machines.

Additionally, initializationRetryInterval and initializationTimeout were set after the first NewAsyncCache call, having no effect on initialization.

Solution description:

  • Move retry/timeout overrides before NewAsyncCache so they take effect for the error-case test
  • Re-grab the reference item after Run() fires its immediate reload
  • Replace the fixed sleep loop with require.Eventually to tolerate slow CI
  • Verify cache stability between reloads with require.Equal

Verified with 50 consecutive runs (go test -count=50) — all passing.

Screenshots / screen recording:
N/A — backend test only.

Test setup:
No special setup required.

Test cases:

  • go test -v -run TestAsyncCache ./pkg/serverutils/asynccache/ passes consistently
  • go test -count=50 -run TestAsyncCache ./pkg/serverutils/asynccache/ — 50/50 passes

Browser conformance:
N/A — backend test only.

Additional info:
Jira: https://redhat.atlassian.net/browse/OCPBUGS-86511

Summary by CodeRabbit

  • Tests
    • Enhanced async cache test suite with improved assertion patterns and stricter validation logic. Adjusted timing parameters and added verification steps to better validate cache initialization, item persistence, and refresh behavior across reload cycles.

Replace timing-dependent sleep loop with require.Eventually and
re-grab the reference item after Run() fires its immediate reload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 26, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@jhadvig: This pull request references Jira Issue OCPBUGS-86511, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Analysis / Root cause:
The TestAsyncCache test fails intermittently in CI (example failure) with Expected 0 to be greater than 0. Two issues:

  1. wait.UntilWithContext fires runCache immediately when Run() is called, replacing the cached pointer before the test loop's first GetItem() — so matches is always 0 on fast machines.
  2. The fixed sleep-based loop (2s reload period, 1s sleeps, 3 iterations) has no margin for goroutine scheduling delays on loaded CI machines.

Additionally, initializationRetryInterval and initializationTimeout were set after the first NewAsyncCache call, having no effect on initialization.

Solution description:

  • Move retry/timeout overrides before NewAsyncCache so they take effect for the error-case test
  • Re-grab the reference item after Run() fires its immediate reload
  • Replace the fixed sleep loop with require.Eventually to tolerate slow CI
  • Verify cache stability between reloads with require.Equal

Verified with 50 consecutive runs (go test -count=50) — all passing.

Screenshots / screen recording:
N/A — backend test only.

Test setup:
No special setup required.

Test cases:

  • go test -v -run TestAsyncCache ./pkg/serverutils/asynccache/ passes consistently
  • go test -count=50 -run TestAsyncCache ./pkg/serverutils/asynccache/ — 50/50 passes

Browser conformance:
N/A — backend test only.

Additional info:
Jira: https://redhat.atlassian.net/browse/OCPBUGS-86511

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jhadvig
Copy link
Copy Markdown
Member Author

jhadvig commented May 26, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 26, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@jhadvig: This pull request references Jira Issue OCPBUGS-86511, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from Leo6Leo and TheRealJon May 26, 2026 14:25
@openshift-ci openshift-ci Bot added the component/backend Related to backend label May 26, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhadvig

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 26, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@jhadvig: This pull request references Jira Issue OCPBUGS-86511, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Analysis / Root cause:
The TestAsyncCache test fails intermittently in CI (example failure) with Expected 0 to be greater than 0. Two issues:

  1. wait.UntilWithContext fires runCache immediately when Run() is called, replacing the cached pointer before the test loop's first GetItem() — so matches is always 0 on fast machines.
  2. The fixed sleep-based loop (2s reload period, 1s sleeps, 3 iterations) has no margin for goroutine scheduling delays on loaded CI machines.

Additionally, initializationRetryInterval and initializationTimeout were set after the first NewAsyncCache call, having no effect on initialization.

Solution description:

  • Move retry/timeout overrides before NewAsyncCache so they take effect for the error-case test
  • Re-grab the reference item after Run() fires its immediate reload
  • Replace the fixed sleep loop with require.Eventually to tolerate slow CI
  • Verify cache stability between reloads with require.Equal

Verified with 50 consecutive runs (go test -count=50) — all passing.

Screenshots / screen recording:
N/A — backend test only.

Test setup:
No special setup required.

Test cases:

  • go test -v -run TestAsyncCache ./pkg/serverutils/asynccache/ passes consistently
  • go test -count=50 -run TestAsyncCache ./pkg/serverutils/asynccache/ — 50/50 passes

Browser conformance:
N/A — backend test only.

Additional info:
Jira: https://redhat.atlassian.net/browse/OCPBUGS-86511

Summary by CodeRabbit

  • Tests
  • Enhanced async cache test suite with improved assertion patterns and stricter validation logic. Adjusted timing parameters and added verification steps to better validate cache initialization, item persistence, and refresh behavior across reload cycles.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

📝 Walkthrough

Walkthrough

The TestAsyncCache test in pkg/serverutils/asynccache/asyncccache_test.go was refactored to improve timing reliability and assertion clarity. The test now defines explicit initialization constants, replaces manual error checks with require assertions, increases the context timeout from 5s to 10s, and introduces a 100ms sleep after starting Run() to allow initial cache execution to settle. Cache behavior is validated by confirming the cached item's timestamp is non-zero, verifying item stability between reloads, and using require.Eventually to assert cache refresh occurs within a bounded window. The error-handling test case remains functionally unchanged but is repositioned within the refactored structure.

Suggested reviewers

  • rhamilto
  • logonoff
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Test mutates global timing variables without restoration via t.Cleanup, violating test isolation and risking order-dependent failures in concurrent test runs. Add t.Cleanup to restore initializationRetryInterval and initializationTimeout to their original values after the test completes.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly references the Jira issue and accurately describes the primary change: fixing a flaky backend test.
Description check ✅ Passed The description is comprehensive and well-structured, covering root cause analysis, solution details, testing methodology, and verification results.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Test file uses standard Go testing syntax (func TestAsyncCache), not Ginkgo. Custom check for Ginkgo test names is not applicable to this PR.
Microshift Test Compatibility ✅ Passed This PR modifies a standard Go unit test (TestAsyncCache using testing.T), not a Ginkgo e2e test. The custom check applies only to new Ginkgo e2e tests, making it inapplicable here.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added. PR modifies only a standard Go unit test (TestAsyncCache using testing.T), which falls outside SNO compatibility check scope.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only test infrastructure in asyncccache_test.go with no deployment manifests, operator code, controllers, or scheduling constraints introduced.
Ote Binary Stdout Contract ✅ Passed No process-level stdout writes found. All klog/fmt calls are inside functions (not init/TestMain/BeforeSuite/top-level initializers). klog outputs to stderr by default.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR modifies a standard Go unit test (not Ginkgo e2e), with no IPv4 assumptions, IP parsing, or external connectivity. Custom check applies only to Ginkgo e2e tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/serverutils/asynccache/asyncccache_test.go`:
- Around line 31-33: The test mutates package-level timing variables
initializationRetryInterval and initializationTimeout and does not restore them;
save their current values at the start of the test and register a t.Cleanup that
restores them (use the same variable names) so other tests aren’t affected;
update the test in asyncccache_test.go to capture prevRetry :=
initializationRetryInterval and prevTimeout := initializationTimeout and call
t.Cleanup(func() { initializationRetryInterval = prevRetry;
initializationTimeout = prevTimeout }) immediately after changing them.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 15df2c83-3a18-48e9-ab0c-0b09cd5754cb

📥 Commits

Reviewing files that changed from the base of the PR and between f489afe and 977d899.

📒 Files selected for processing (1)
  • pkg/serverutils/asynccache/asyncccache_test.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (4)
**/*_test.go

📄 CodeRabbit inference engine (Custom checks)

**/*_test.go: When new Ginkgo e2e tests are added, check for IPv4 assumptions: hardcoded IPv4 addresses (10.0.0.1, 192.168.1.1, 172.16.0.0/12), IPv4 localhost (127.0.0.1 without ::1), IPv4-only IP parsing, hardcoded IPv4 CIDRs in Service/Endpoint objects, hardcoded net.ParseIP/ParseCIDR values, IPv4-only pod/node IP assumptions, IPv4-only network policies, and URLs built without brackets for IPv6 (use net.JoinHostPort instead)
When new Ginkgo e2e tests are added, check for external connectivity requirements: connections to public internet hosts, pulling images from public registries without mirrors, downloading from external URLs, DNS resolution of public hostnames, and connections to external APIs/services outside the cluster. If requiring external connectivity that cannot be adapted, add [Skipped:Disconnected] to test name
When new Ginkgo e2e tests are added, check for usage of unavailable MicroShift APIs: Project/ProjectRequest, Build/BuildConfig, DeploymentConfig, ClusterOperator, ClusterVersion, Etcd operator, CSV/OLM resources, MachineSet/Machine/MachineHealthCheck, ClusterAutoscaler, Console, Monitoring stack, ImageRegistry operator, Samples operator, OperatorHub/CatalogSource, CloudCredential, Storage operator, Network operator CRDs, and any OpenShift API except Route and SecurityContextConstraints. Also flag references to namespaces openshift-kube-apiserver, openshift-kube-controller-manager, openshift-kube-scheduler. If flagged and test is intentional, use [Skipped:MicroShift] label or [apigroup:...] tag or guard with exutil.IsMicroShiftCluster() check
OpenShift Tests Extension (OTE) binaries must output valid JSON to stdout at process level. Flag any non-JSON stdout writes in main(), init(), TestMain(), BeforeSuite(), AfterSuite(), SynchronizedBeforeSuite(), RunSpecs() setup, or top-level var/const initializers. Common violations: klog writing to stdout (use klog.SetOutput(os.Stderr) or klog.LogToStderr(true)), fmt.Print/Println/Printf to stdout in main...

Files:

  • pkg/serverutils/asynccache/asyncccache_test.go
**/*.{yaml,yml,go}

📄 CodeRabbit inference engine (Custom checks)

When deployment manifests, operator code, or controllers are added/modified, check for scheduling constraints assuming standard HA topology (3+ CP nodes, dedicated workers). Flag: required pod anti-affinity with hostname topology key, pod topology spread with DoNotSchedule and hostname key, replica counts derived from node count, nodeSelector/affinity targeting control-plane nodes, broad tolerations that could schedule to resource-constrained arbiter nodes, assuming dedicated worker nodes exist, or PodDisruptionBudgets designed for 3+ nodes. Consider SNO (1 node), DualReplica (2 CP nodes), HighlyAvailableArbiter (2 CP + 1 arbiter), and External (HyperShift) topologies

Files:

  • pkg/serverutils/asynccache/asyncccache_test.go
**/*.go

📄 CodeRabbit inference engine (STYLEGUIDE.md)

**/*.go: All Go code should be formatted by gofmt
Import statement packages in Go should be separated into 3 groups: stdlib, external dependency, current project
Use typed errors with NewInvalidFlagError, FatalIfFailed patterns in Go error handling
Use middleware composition, method-based routing, and consistent JSON responses via serverutils.SendResponse for HTTP handlers in Go
Apply security headers, CSRF protection, and proper token validation in Go code
Use klog with appropriate levels (V(4) for debug, Error, Fatal) for logging in Go
Use YAML-based configuration with comprehensive flag validation in Go
Separate K8s client config from client creation, use both typed and dynamic clients in Go
Define clear interfaces in Go for testability and dependency injection

Use Go 1.25 or higher for backend development

Files:

  • pkg/serverutils/asynccache/asyncccache_test.go
pkg/**/*.go

📄 CodeRabbit inference engine (STYLEGUIDE.md)

Follow domain-based package structure in Go /pkg/ directory (auth, server, proxy, etc.)

Files:

  • pkg/serverutils/asynccache/asyncccache_test.go
🔇 Additional comments (1)
pkg/serverutils/asynccache/asyncccache_test.go (1)

38-39: LGTM!

Also applies to: 41-57, 61-67

Comment on lines +31 to +33
initializationRetryInterval = 5 * time.Millisecond
initializationTimeout = 10 * time.Millisecond

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore overridden package-level timing vars with t.Cleanup.

This test mutates global knobs and leaves them changed, which can make other tests order-dependent/flaky. Save prior values and restore them in cleanup.

Suggested fix
+	oldInitializationRetryInterval := initializationRetryInterval
+	oldInitializationTimeout := initializationTimeout
 	initializationRetryInterval = 5 * time.Millisecond
 	initializationTimeout = 10 * time.Millisecond
+	t.Cleanup(func() {
+		initializationRetryInterval = oldInitializationRetryInterval
+		initializationTimeout = oldInitializationTimeout
+	})
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
initializationRetryInterval = 5 * time.Millisecond
initializationTimeout = 10 * time.Millisecond
oldInitializationRetryInterval := initializationRetryInterval
oldInitializationTimeout := initializationTimeout
initializationRetryInterval = 5 * time.Millisecond
initializationTimeout = 10 * time.Millisecond
t.Cleanup(func() {
initializationRetryInterval = oldInitializationRetryInterval
initializationTimeout = oldInitializationTimeout
})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/serverutils/asynccache/asyncccache_test.go` around lines 31 - 33, The
test mutates package-level timing variables initializationRetryInterval and
initializationTimeout and does not restore them; save their current values at
the start of the test and register a t.Cleanup that restores them (use the same
variable names) so other tests aren’t affected; update the test in
asyncccache_test.go to capture prevRetry := initializationRetryInterval and
prevTimeout := initializationTimeout and call t.Cleanup(func() {
initializationRetryInterval = prevRetry; initializationTimeout = prevTimeout })
immediately after changing them.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 26, 2026

@jhadvig: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-playwright 977d899 link false /test e2e-playwright

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Leo6Leo
Copy link
Copy Markdown
Contributor

Leo6Leo commented May 27, 2026

/retest-required

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. component/backend Related to backend jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants