OCPBUGS-78940: Treat groups as existent if they were found but discovery is stale by jacobsee · Pull Request #30923 · openshift/origin

jacobsee · 2026-03-23T21:55:31Z

The kube-apiserver still declares itself ready even with stale discovery entries. The stale entries would be refreshed by a background worker, but there's a race window where clients can hit a newly-ready kube-apiserver and get stale discovery data. For the purpose of answering whether an API group exists, return true as long as the group was found, even if its discovery data is currently stale.
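To make the intended behavior concrete, here is a minimal, self-contained sketch. It is not the code in `test/extended/util/framework.go`: the types below are local stand-ins loosely modelled on `discovery.StaleGroupVersionError` and `discovery.ErrGroupDiscoveryFailed`, and `groupExists` and its parameters are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// GroupVersion is a local stand-in for a group/version pair (assumption).
type GroupVersion struct {
	Group   string
	Version string
}

// StaleGroupVersionError is a local stand-in for the stale-discovery error (assumption).
type StaleGroupVersionError struct {
	GV GroupVersion
}

func (e StaleGroupVersionError) Error() string {
	return fmt.Sprintf("stale GroupVersion discovery: %s/%s", e.GV.Group, e.GV.Version)
}

// GroupDiscoveryFailedError is a local stand-in for a partial discovery failure (assumption).
type GroupDiscoveryFailedError struct {
	Groups map[GroupVersion]error
}

func (e *GroupDiscoveryFailedError) Error() string {
	return fmt.Sprintf("discovery failed for %d group version(s)", len(e.Groups))
}

// groupExists (hypothetical helper) reports whether group is known to the server.
// A group whose discovery failed only because it is stale counts as existing.
func groupExists(group string, discovered []string, discoveryErr error) (bool, error) {
	var groupFailed *GroupDiscoveryFailedError
	switch {
	case errors.As(discoveryErr, &groupFailed):
		for gv, groupErr := range groupFailed.Groups {
			if gv.Group != group {
				continue
			}
			var staleErr StaleGroupVersionError
			if errors.As(groupErr, &staleErr) {
				// Registered, but discovery is transiently stale: treat as "exists".
				return true, nil
			}
			return false, groupErr
		}
	case discoveryErr != nil:
		// Any other discovery error is returned unchanged.
		return false, discoveryErr
	}
	// The group did not fail discovery; look it up in the successful results.
	for _, g := range discovered {
		if g == group {
			return true, nil
		}
	}
	return false, nil
}
```

With a failure map whose only entry for `route.openshift.io` is a stale error, `groupExists("route.openshift.io", nil, failure)` returns `true, nil`, while a group that appears neither in the failure map nor in the discovered list still returns `false, nil`.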

openshift-ci-robot · 2026-03-23T21:55:34Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-03-23T21:55:39Z

@jacobsee: This pull request references Jira Issue OCPBUGS-78940, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


openshift-ci · 2026-03-23T21:55:59Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jacobsee
Once this PR has been reviewed and has the lgtm label, please assign neisw for approval. For more information see the Code Review Process.


Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-03-23T21:56:03Z

Walkthrough

Modified the DoesApiResourceExist function to handle discovery.StaleGroupVersionError differently when scanning discovery.ErrGroupDiscoveryFailed.Groups for a requested group. The function now treats such errors as a positive existence case rather than returning false with the error.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Error handling in API resource existence check `test/extended/util/framework.go` | Updated `DoesApiResourceExist` to treat `discovery.StaleGroupVersionError` as an existence indicator when encountered while iterating through failed discovery groups. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

Comment `@coderabbitai help` to get the list of available commands and usage tips.

coderabbitai

🧹 Nitpick comments (1)

test/extended/util/framework.go (1)

2236-2241: Use a typed variable for clearer error matching

At line 2236, prefer the idiomatic pattern var staleErr discovery.StaleGroupVersionError; errors.As(err, &staleErr) over the empty struct literal &discovery.StaleGroupVersionError{}. Also rename the loop variable to avoid shadowing the outer err:

Suggested refactor

-		for gv, err := range groupFailed.Groups {
+		for gv, groupErr := range groupFailed.Groups {
 			if gv.Group == group {
-				if errors.As(err, &discovery.StaleGroupVersionError{}) {
+				var staleErr discovery.StaleGroupVersionError
+				if errors.As(groupErr, &staleErr) {
 					// Group is registered but discovery is transiently stale.
 					// This can happen immediately after a restart and should resolve itself.
 					// For now, treat as "exists" since the APIService is known to the aggregator.
 					return true, nil
 				}
-				return false, err
+				return false, groupErr
 			}
 		}
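The typed-variable pattern recommended here can be tried in isolation. The sketch below uses a local stand-in error type (`staleGroupVersionError`, with hypothetical fields) rather than the real `discovery.StaleGroupVersionError`. It shows one practical benefit of declaring the target first: after a successful `errors.As`, the variable holds the matched error, so its fields can be read.

```go
package main

import (
	"errors"
	"fmt"
)

// staleGroupVersionError is a local stand-in with a value receiver (assumption).
type staleGroupVersionError struct {
	group   string
	version string
}

func (e staleGroupVersionError) Error() string {
	return fmt.Sprintf("stale GroupVersion discovery: %s/%s", e.group, e.version)
}

// staleGroupVersion (hypothetical helper) returns "group/version" if a
// staleGroupVersionError appears anywhere in err's chain.
func staleGroupVersion(err error) (string, bool) {
	// Declare a typed target, then pass its address to errors.As.
	var staleErr staleGroupVersionError
	if errors.As(err, &staleErr) {
		// staleErr now holds the matched error value.
		return staleErr.group + "/" + staleErr.version, true
	}
	return "", false
}
```

Because `errors.As` walks the whole chain, the helper also matches an error wrapped with `fmt.Errorf("...: %w", err)`. For a nil error it returns `"", false`.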

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/extended/util/framework.go` around lines 2236 - 2241, replace the
current errors.As call that passes an inline &discovery.StaleGroupVersionError{}
with the idiomatic typed variable pattern: declare a variable (e.g. var staleErr
discovery.StaleGroupVersionError) and call errors.As(groupErr, &staleErr). Also
avoid shadowing the outer err by renaming the loop variable (e.g. to groupErr)
in the loop where discovery.StaleGroupVersionError is checked.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dd7ad772-7cb1-4e1c-9be0-fac3e36005f1

📥 Commits

Reviewing files that changed from the base of the PR and between 785f8b0 and 7ceb279.

📒 Files selected for processing (1)

test/extended/util/framework.go

openshift-ci-robot · 2026-03-23T22:29:24Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

openshift-ci · 2026-03-24T02:18:32Z

@jacobsee: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-ovn | `7ceb279` | link | true | `/test e2e-gcp-ovn` |


jacobsee · 2026-03-24T05:57:42Z

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-vsphere-ovn-serial-runc 10

openshift-ci · 2026-03-24T05:57:56Z

@jacobsee: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-e2e-vsphere-ovn-serial-runc

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6010be20-2746-11f1-9564-e2f809730b1e-0

jacobsee · 2026-03-24T05:58:43Z

/test e2e-gcp-ovn

benluddy · 2026-03-24T17:20:17Z

> clients can hit a newly-ready kube-apiserver and get stale discovery data

Is this what is happening in the recent samples? Did you see any errors in kube-apiserver logs showing problems fetching fresh discovery info for this GV? I don't know the answer -- just asking to make sure we don't miss any regression that might be causing an increased failure rate for that traffic to openshift-apiserver.

jacobsee · 2026-03-24T18:47:00Z

@benluddy to dump some thoughts on order of operations on startup, first we're seeing:

2026-03-19T06:04:36.028520265Z E0319 06:04:36.028493      17 handler_proxy.go:152] error resolving openshift-apiserver/api: service "api" not found

^that. But we're seeing it before this:

2026-03-19T06:04:36.034320279Z I0319 06:04:36.034279      17 shared_informer.go:377] "Caches are synced"

at which point resolution issues seem to end.

One minute later, we see

2026-03-19T06:05:36.026134783Z I0319 06:05:36.025953      17 handler.go:304] Adding GroupVersion packages.operators.coreos.com v1 to ResourceManager
2026-03-19T06:05:36.026941136Z I0319 06:05:36.026808      17 handler.go:304] Adding GroupVersion security.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.027043253Z I0319 06:05:36.027020      17 handler.go:304] Adding GroupVersion route.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.028413809Z I0319 06:05:36.027630      17 handler.go:304] Adding GroupVersion quota.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.028413809Z I0319 06:05:36.027657      17 handler.go:304] Adding GroupVersion image.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.028413809Z I0319 06:05:36.027665      17 handler.go:304] Adding GroupVersion authorization.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.028795008Z I0319 06:05:36.028743      17 handler.go:304] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
2026-03-19T06:05:36.028795008Z I0319 06:05:36.028766      17 handler.go:304] Adding GroupVersion build.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.031172380Z I0319 06:05:36.031093      17 handler.go:304] Adding GroupVersion oauth.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.031172380Z I0319 06:05:36.031122      17 handler.go:304] Adding GroupVersion project.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.031172380Z I0319 06:05:36.031131      17 handler.go:304] Adding GroupVersion apps.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.031323855Z I0319 06:05:36.031137      17 handler.go:304] Adding GroupVersion template.openshift.io v1 to ResourceManager
2026-03-19T06:05:36.031413448Z I0319 06:05:36.031367      17 handler.go:304] Adding GroupVersion user.openshift.io v1 to ResourceManager

and everything is golden. So I do think this is a "bad order of operations in the first minute - caught later by a resync" issue.

For what it's worth, it looks like this has just been fixed upstream, but this PR is to test the theory (and because I don't think we need to fail tests on this... might be a good idea to have this just to be a little more resilient anyway)
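The "one minute later" observation can be checked directly from the leading timestamps of the quoted log lines. This is a small sketch using only the Go standard library; `gapBetween` is a hypothetical helper name, and it assumes the first field of each log line is an RFC 3339 timestamp, as in the excerpts above.

```go
package main

import (
	"fmt"
	"time"
)

// gapBetween (hypothetical helper) parses two RFC 3339 timestamps and
// returns how much later the second one is than the first.
func gapBetween(first, second string) (time.Duration, error) {
	t1, err := time.Parse(time.RFC3339Nano, first)
	if err != nil {
		return 0, fmt.Errorf("parsing first timestamp: %w", err)
	}
	t2, err := time.Parse(time.RFC3339Nano, second)
	if err != nil {
		return 0, fmt.Errorf("parsing second timestamp: %w", err)
	}
	return t2.Sub(t1), nil
}
```

Calling `gapBetween("2026-03-19T06:04:36.028520265Z", "2026-03-19T06:05:36.026134783Z")` with the timestamps of the `handler_proxy.go` error and the first `Adding GroupVersion` line gives a gap a little under 60 seconds, which fits the resync theory.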

openshift-ci-robot added the jira/valid-reference and jira/valid-bug labels Mar 23, 2026

openshift-ci bot requested review from deads2k and sjenning March 23, 2026 21:55

coderabbitai bot reviewed Mar 23, 2026

jacobsee wants to merge 1 commit into openshift:main from jacobsee:treat-stale-discovery-as-exists
