NO-JIRA: Add e2e test for authorization cache race condition#31247

Open
sanchezl wants to merge 1 commit into
openshift:mainfrom
sanchezl:auth-cache-race-e2e


@sanchezl
Contributor

@sanchezl sanchezl commented Jun 1, 2026

NO-JIRA:

Summary

  • Adds TestProjectAuthCacheRaceCondition e2e test that exercises concurrent Projects().List() calls under RoleBinding churn to detect crashes, API errors, and latency regressions caused by unsafe map access in AuthorizationCache

Note: This test will not pass without openshift/openshift-apiserver#642

Test plan

  • Run against an unfixed openshift-apiserver to confirm the test catches pod restarts or 500 errors
  • Run against openshift-apiserver with the copy-on-write fix to confirm it passes
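The summary and test plan refer to unsafe map access and a copy-on-write fix. The sketch below illustrates that general technique only; it is not the change in openshift/openshift-apiserver#642, which this conversation does not show. The `cowCache` type, its methods, and the subject and namespace names are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cowCache is a hypothetical cache mapping a subject to the namespaces it can see.
// Readers load an immutable snapshot; writers copy it, modify the copy, and swap it in.
type cowCache struct {
	mu       sync.Mutex                          // serializes writers only
	snapshot atomic.Pointer[map[string][]string] // current read-only view
}

func newCowCache() *cowCache {
	c := &cowCache{}
	empty := map[string][]string{}
	c.snapshot.Store(&empty)
	return c
}

// namespacesFor takes no lock and never reads a map that is being mutated.
func (c *cowCache) namespacesFor(subject string) []string {
	return (*c.snapshot.Load())[subject]
}

// grant publishes a new snapshot in which subject can also see namespace.
func (c *cowCache) grant(subject, namespace string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := *c.snapshot.Load()
	next := make(map[string][]string, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	// Copy the slice as well, so a published slice is never appended to in place.
	next[subject] = append(append([]string(nil), old[subject]...), namespace)
	c.snapshot.Store(&next)
}

func main() {
	c := newCowCache()
	var wg sync.WaitGroup
	for r := 0; r < 8; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				_ = c.namespacesFor("alice") // concurrent reads during writes
			}
		}()
	}
	for i := 0; i < 100; i++ {
		c.grant("alice", fmt.Sprintf("ns-%d", i))
	}
	wg.Wait()
	fmt.Println(len(c.namespacesFor("alice")))
}
```

Running a sketch like this with `go run -race` should report no data race, whereas the same readers and writer sharing one plain map would. The trade-off is that every write copies the whole map, which suits read-heavy caches.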

Summary by CodeRabbit

  • Tests
    • Added test suite validating project authorization cache performance and stability under concurrent access patterns.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a9e7ab1-422e-468d-b363-40740c6f4272

📥 Commits

Reviewing files that changed from the base of the PR and between 1ccd213 and 0785aeb.

📒 Files selected for processing (1)
  • test/extended/project/project.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/project/project.go

Walkthrough

Adds a Ginkgo end-to-end concurrency test that runs many concurrent Projects().List() readers while RoleBindings are rapidly created/deleted, and includes helpers to detect apiserver restarts and emit crash-log evidence.

Changes

Auth Cache Race Condition Test

| Layer / File(s) | Summary |
| --- | --- |
| Imports and test configuration (test/extended/project/project.go) | Adds bufio, strings, sync, and sync/atomic to support log scanning and atomic concurrency coordination; defines the test configuration and types used by the race harness. |
| Apiserver restart and crash-evidence helpers (test/extended/project/project.go) | Adds getPodRestartCounts to aggregate container restart totals by pod name and dumpApiserverCrashEvidence to scan recent openshift-apiserver logs for crash signatures when restarts are detected. |
| Concurrent readers and RoleBinding churn harness (test/extended/project/project.go) | Implements TestProjectAuthCacheRaceCondition: creates many namespaces and users, starts concurrent goroutines repeatedly calling Projects().List() and watchers (tracking errors, max latency, and regressions), runs repeated RoleBinding create/delete rounds to churn auth cache, stops readers, and asserts no list errors, bounded latency regressions, no project-count regressions during add-only phase, no apiserver pod restarts during churn, and final project visibility correctness. |
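The reader side of the harness described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `readerStats`, `runReaders`, and the fake `list` function are hypothetical stand-ins, and a short sleep loop takes the place of the RoleBinding create/delete rounds.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// readerStats collects results from all reader goroutines without a mutex.
type readerStats struct {
	calls      atomic.Int64
	errors     atomic.Int64
	maxLatency atomic.Int64 // nanoseconds
}

// observe records one call; maxLatency is raised with a compare-and-swap loop.
func (s *readerStats) observe(d time.Duration, failed bool) {
	s.calls.Add(1)
	if failed {
		s.errors.Add(1)
	}
	for {
		cur := s.maxLatency.Load()
		if int64(d) <= cur || s.maxLatency.CompareAndSwap(cur, int64(d)) {
			return
		}
	}
}

// runReaders starts n goroutines that call list until ctx is cancelled.
func runReaders(ctx context.Context, n int, list func(context.Context) error, stats *readerStats) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				start := time.Now()
				err := list(ctx)
				stats.observe(time.Since(start), err != nil)
			}
		}()
	}
	return wg
}

func main() {
	// A bounded context keeps the whole run from waiting indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	stats := &readerStats{}
	// Hypothetical stand-in for a Projects().List() call.
	fakeList := func(context.Context) error { time.Sleep(time.Millisecond); return nil }
	wg := runReaders(ctx, 4, fakeList, stats)

	// Stand-in for the churn rounds that would create and delete RoleBindings.
	for round := 0; round < 10; round++ {
		time.Sleep(5 * time.Millisecond)
	}

	cancel() // stop readers, then wait for them to drain
	wg.Wait()
	fmt.Printf("calls=%d errors=%d max=%s\n",
		stats.calls.Load(), stats.errors.Load(), time.Duration(stats.maxLatency.Load()))
}
```

With a real client, a call interrupted by the final cancellation would return an error, so a harness along these lines would probably want to ignore errors observed after `ctx` is done.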

Sequence Diagram(s)

sequenceDiagram
  participant Test
  participant ProjectsAPI
  participant Authorization
  participant OpenShiftAPIServer

  Test->>ProjectsAPI: Create namespaces and grant user view access
  Test->>OpenShiftAPIServer: Record baseline pod restart counts
  par Concurrent Readers
    ProjectsAPI->>OpenShiftAPIServer: Repeated Projects().List() queries
  and Concurrent Writer
    Authorization->>OpenShiftAPIServer: Create/Delete RoleBindings churn
  end
  Test->>OpenShiftAPIServer: Verify pod restart counts unchanged
  Test->>ProjectsAPI: Verify final Projects().List() contains expected namespaces

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| No-Sensitive-Data-In-Logs | ❌ Error | The dumpApiserverCrashEvidence function logs raw pod log lines containing crash signatures without sanitization, risking exposure of passwords, tokens, or other sensitive data in error messages. | Filter or redact sensitive patterns (credentials, tokens, keys) from pod logs before logging, or sanitize error output to exclude data that could contain sensitive information. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Test uses context.Background() without timeout (indefinite waits), violating timeout requirement. Also tests 7 distinct behaviors in single It block, violating single responsibility principle. | Use context.WithTimeout() for test execution. Split test into focused It blocks testing individual behaviors, or use harness pattern with separate sub-tests. |
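One way to act on the No-Sensitive-Data-In-Logs resolution is to redact each pod log line before it is logged. The sketch below is a hypothetical helper, not code from the PR; `redactLine` and the pattern list are assumptions, and the patterns are illustrative rather than exhaustive.

```go
package main

import (
	"fmt"
	"regexp"
)

// sensitivePatterns is an illustrative, non-exhaustive list. Each pattern
// captures the key (group 1) and separator (group 2) so they can be kept.
var sensitivePatterns = []*regexp.Regexp{
	// key=value or key: value pairs for common credential names
	regexp.MustCompile(`(?i)\b(password|passwd|token|secret|api[_-]?key)(\s*[=:]\s*)\S+`),
	// HTTP bearer tokens
	regexp.MustCompile(`(?i)\b(bearer)(\s+)[A-Za-z0-9._~+/=-]+`),
}

// redactLine replaces the value part of every match with a fixed marker.
func redactLine(line string) string {
	for _, re := range sensitivePatterns {
		line = re.ReplaceAllString(line, "${1}${2}[REDACTED]")
	}
	return line
}

func main() {
	fmt.Println(redactLine("login failed password=hunter2 user=bob"))
	fmt.Println(redactLine("Authorization: Bearer abc.def-123"))
	fmt.Println(redactLine("fatal error: concurrent map read and map write"))
}
```

A helper like this could be applied to each matching line before it is passed to framework.Logf, so crash signatures still appear in the output while obvious credentials do not.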
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding an end-to-end test for authorization cache race condition detection. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Both test titles use only static descriptive strings with no dynamic values, timestamps, UUIDs, pod names, or identifiers. Dynamic values correctly appear only in test body statements. |
| Microshift Test Compatibility | ✅ Passed | Test is protected with [apigroup:project.openshift.io] tag; MicroShift CI automatically skips tests with API groups not served by MicroShift. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Test creates namespaces and RoleBindings, monitors openshift-apiserver pod restarts, and runs concurrent API calls. No multi-node assumptions; compatible with Single Node OpenShift. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds only a test file (test/extended/project/project.go) with test harness code, not deployment manifests, operator code, or controllers with scheduling constraints. |
| Ote Binary Stdout Contract | ✅ Passed | No process-level stdout writes detected. All logging uses framework.Logf (safe), code is in g.It/g.Describe blocks or helper functions, only top-level var is simple struct initialization. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The new test contains no IPv4 assumptions or external connectivity requirements. All API calls are cluster-internal; no hardcoded IPs, CIDRs, or external service dependencies. |
| No-Weak-Crypto | ✅ Passed | PR adds test code with no weak crypto algorithms, custom crypto, or unsafe token comparisons detected in the new test functions. |
| Container-Privileges | ✅ Passed | PR adds only Go test code with no Kubernetes manifests, container configs, or privilege escalation settings (privileged, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation). |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from deads2k and sjenning June 1, 2026 17:02
@openshift-ci
Contributor

openshift-ci Bot commented Jun 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sanchezl
Once this PR has been reviewed and has the lgtm label, please assign jogeo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sanchezl sanchezl changed the title Add e2e test for authorization cache race condition NO-JIRA: Add e2e test for authorization cache race condition Jun 1, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 1, 2026
@openshift-ci-robot

@sanchezl: This pull request explicitly references no jira issue.

Details

In response to this:

NO-JIRA:

Summary

  • Adds TestProjectAuthCacheRaceCondition e2e test that exercises concurrent Projects().List() calls under RoleBinding churn to detect crashes, API errors, and latency regressions caused by unsafe map access in AuthorizationCache

Test plan

  • Run against an unfixed openshift-apiserver to confirm the test catches pod restarts or 500 errors
  • Run against openshift-apiserver with the copy-on-write fix to confirm it passes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
test/extended/project/project.go (1)

709-716: 💤 Low value

Consider strengthening the pod disappearance check.

When a baseline pod no longer exists, the code logs and continues. However, if a pod crashes and is replaced during the 30-second window, this would manifest as the old pod disappearing and a new pod appearing (with a fresh restart count of 0). The current logic would miss this scenario.

For more robust crash detection, consider also checking if currentRestarts contains pods that weren't in baselineRestarts, which could indicate a replacement due to crash.

🛡️ Suggested enhancement
 g.By("checking openshift-apiserver pods did not restart")
 currentRestarts := getPodRestartCounts(ctx, oc, "openshift-apiserver")
 for podName, baseline := range baselineRestarts {
     current, exists := currentRestarts[podName]
     if !exists {
-        framework.Logf("pod %s no longer exists (may have been rescheduled)", podName)
-        continue
+        // Pod disappeared - could indicate crash with pod replacement
+        g.Fail(fmt.Sprintf("pod %s no longer exists (possible crash and replacement)", podName))
     }
     o.Expect(current).To(o.Equal(baseline), fmt.Sprintf("pod %s restarted during test (before=%d, after=%d)", podName, baseline, current))
 }
+// Check for new pods that weren't in baseline (indicates replacement)
+for podName := range currentRestarts {
+    if _, existed := baselineRestarts[podName]; !existed {
+        framework.Logf("warning: new pod %s appeared during test (possible replacement after crash)", podName)
+    }
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/project/project.go` around lines 709 - 716, The existing loop
over baselineRestarts misses cases where a pod was replaced (old pod gone, new
pod present with a fresh restart count); after the current loop, add a
complementary check that iterates currentRestarts and for each podName not
present in baselineRestarts flag/fail the test (use the same o.Expect or
framework.Logf pattern) with a clear message like "pod %s appeared during test
(possible replacement/crash)"; reference the maps baselineRestarts and
currentRestarts and place this new iteration after the existing for podName,
baseline := range baselineRestarts block so replacements are detected.
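The comparison this nitpick asks for can be sketched as a pure function over the two restart-count maps, which makes the restarted, missing, and newly appeared cases easy to test separately. The names `restartDiff` and `diffRestarts` are hypothetical, and the `int32` count type is an assumption (it matches the Kubernetes `ContainerStatus.RestartCount` field, but the PR's map type is not shown here).

```go
package main

import (
	"fmt"
	"sort"
)

// restartDiff summarizes how two pod-name -> restart-count snapshots differ.
type restartDiff struct {
	restarted []string // present in both snapshots, count changed
	missing   []string // in the baseline only (pod went away)
	added     []string // in the current snapshot only (possible replacement)
}

// clean reports whether nothing changed between the snapshots.
func (d restartDiff) clean() bool {
	return len(d.restarted) == 0 && len(d.missing) == 0 && len(d.added) == 0
}

// diffRestarts compares baseline and current restart counts by pod name.
func diffRestarts(baseline, current map[string]int32) restartDiff {
	var d restartDiff
	for name, before := range baseline {
		after, ok := current[name]
		switch {
		case !ok:
			d.missing = append(d.missing, name)
		case after != before:
			d.restarted = append(d.restarted, name)
		}
	}
	for name := range current {
		if _, ok := baseline[name]; !ok {
			d.added = append(d.added, name)
		}
	}
	// Map iteration order is random; sort so messages are stable.
	sort.Strings(d.restarted)
	sort.Strings(d.missing)
	sort.Strings(d.added)
	return d
}

func main() {
	baseline := map[string]int32{"apiserver-a": 0, "apiserver-b": 1, "apiserver-c": 0}
	current := map[string]int32{"apiserver-a": 0, "apiserver-b": 2, "apiserver-d": 0}
	d := diffRestarts(baseline, current)
	fmt.Printf("restarted=%v missing=%v added=%v clean=%v\n", d.restarted, d.missing, d.added, d.clean())
}
```

Whether a missing or added pod should fail the test or only be logged is a policy choice; a rollout during the test window would also replace pods without any crash.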

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b19ec8e3-a5d2-4190-9ef0-71f686996029

📥 Commits

Reviewing files that changed from the base of the PR and between 76ed5a8 and 56c8f1f.

📒 Files selected for processing (1)
  • test/extended/project/project.go

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@sanchezl
Contributor Author

sanchezl commented Jun 1, 2026

/pipeline required

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@sanchezl
Contributor Author

sanchezl commented Jun 1, 2026

/retest required

@sanchezl sanchezl force-pushed the auth-cache-race-e2e branch from 56c8f1f to e8261c7 Compare June 3, 2026 17:51
@sanchezl
Contributor Author

sanchezl commented Jun 3, 2026

/test all

@sanchezl sanchezl force-pushed the auth-cache-race-e2e branch from e8261c7 to 1ccd213 Compare June 3, 2026 17:59
@sanchezl
Contributor Author

sanchezl commented Jun 3, 2026

/test all

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Stress test concurrent project List and Watch requests during
sustained RBAC churn to surface races in the openshift-apiserver
authorization cache.
@sanchezl sanchezl force-pushed the auth-cache-race-e2e branch from 1ccd213 to 0785aeb Compare June 4, 2026 00:45
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci
Contributor

openshift-ci Bot commented Jun 4, 2026

@sanchezl: The following tests failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-ovn-upi | 0785aeb | link | true | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-vsphere-ovn | 0785aeb | link | true | /test e2e-vsphere-ovn |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 0785aeb | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-aws-ovn-fips | 0785aeb | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-gcp-ovn | 0785aeb | link | true | /test e2e-gcp-ovn |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt

openshift-trt Bot commented Jun 4, 2026

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 0785aeb

| Job Name | New Test Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-ovn-fips | High - "[sig-auth][Feature:ProjectAPI] TestProjectAuthCacheRaceCondition should not crash or block when listing projects under concurrent cache churn [apigroup:project.openshift.io][apigroup:authorization.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit |
| pull-ci-openshift-origin-main-e2e-gcp-ovn | High - "[sig-auth][Feature:ProjectAPI] TestProjectAuthCacheRaceCondition should not crash or block when listing projects under concurrent cache churn [apigroup:project.openshift.io][apigroup:authorization.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit |
| pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6 | High - "[sig-auth][Feature:ProjectAPI] TestProjectAuthCacheRaceCondition should not crash or block when listing projects under concurrent cache churn [apigroup:project.openshift.io][apigroup:authorization.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit |
| pull-ci-openshift-origin-main-e2e-vsphere-ovn | High - "[sig-auth][Feature:ProjectAPI] TestProjectAuthCacheRaceCondition should not crash or block when listing projects under concurrent cache churn [apigroup:project.openshift.io][apigroup:authorization.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit |
| pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi | High - "[sig-auth][Feature:ProjectAPI] TestProjectAuthCacheRaceCondition should not crash or block when listing projects under concurrent cache churn [apigroup:project.openshift.io][apigroup:authorization.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/parallel]" is a new test that failed 1 time(s) against the current commit |

New tests seen in this PR at sha: 0785aeb

  • "[sig-auth][Feature:ProjectAPI] TestProjectAuthCacheRaceCondition should not crash or block when listing projects under concurrent cache churn [apigroup:project.openshift.io][apigroup:authorization.openshift.io][apigroup:user.openshift.io] [Suite:openshift/conformance/parallel]" [Total: 5, Pass: 0, Fail: 5, Flake: 0]

@sanchezl
Contributor Author

sanchezl commented Jun 5, 2026

As expected, TestProjectAuthCacheRaceCondition is failing in all five e2e jobs that ran it: it detects the concurrent map panic in the openshift-apiserver authorization cache (OCPBUGS-84534).

Once openshift/openshift-apiserver#642 is merged and lands in the payload, the test should pass and this PR can be merged.
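For context on what detecting the concurrent map panic can look like in logs, here is a minimal sketch of scanning log text for the Go runtime's fatal-error messages for unsynchronized map access. It is an assumption about the approach, not the PR's dumpApiserverCrashEvidence; `crashSignatures`, `findCrashLines`, and `scanLogText` are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// crashSignatures lists the messages the Go runtime prints when it detects
// unsynchronized map access. The list is illustrative, not exhaustive.
var crashSignatures = []string{
	"fatal error: concurrent map read and map write",
	"fatal error: concurrent map writes",
	"fatal error: concurrent map iteration and map write",
}

// findCrashLines returns every line from r that contains a known signature.
func findCrashLines(r io.Reader) ([]string, error) {
	var hits []string
	sc := bufio.NewScanner(r)
	// Raise the per-line limit; apiserver log lines can exceed the 64 KiB default.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		line := sc.Text()
		for _, sig := range crashSignatures {
			if strings.Contains(line, sig) {
				hits = append(hits, line)
				break
			}
		}
	}
	return hits, sc.Err()
}

// scanLogText is a convenience wrapper for logs already held in memory.
func scanLogText(text string) ([]string, error) {
	return findCrashLines(strings.NewReader(text))
}

func main() {
	logs := "I0601 starting\nfatal error: concurrent map read and map write\ngoroutine 42 [running]:\n"
	hits, err := scanLogText(logs)
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	for _, h := range hits {
		fmt.Println("crash evidence:", h)
	}
}
```

Because these are runtime fatal errors rather than recoverable panics, the process exits, which is consistent with the test also watching for pod restarts.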


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

2 participants