OCPBUGS-44693: Default-disable ResilientWatchCacheInitialization #61

benluddy · 2024-11-27T16:17:31Z

There's a separate pre-existing issue causing storage layer errors and watch cache re-initialization during cluster bootstrap. With ResilientWatchCacheInitialization enabled, clients (reflectors in particular) are turned away with 429 responses while the watch cache is being repopulated, and they retry repeatedly. Without this feature, requests hang (consuming priority and fairness "seats") until the watch cache is initialized or they time out. We fail tests when the total number of watch requests during a job exceeds a threshold based on recent historical totals. The systemic 429/retry behavior causes this threshold to be breached. We are temporarily disabling the feature to reduce noise, since the retries are a symptom of the underlying storage errors and not their cause.

openshift-ci-robot · 2024-11-27T16:17:36Z

@benluddy: This pull request references Jira Issue OCPBUGS-44693, which is invalid:

release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
expected Jira Issue OCPBUGS-44693 to depend on a bug in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

benluddy · 2024-11-27T16:17:46Z

/cc @p0lyn0mial

openshift-ci-robot · 2024-11-27T16:21:31Z

@benluddy: Jira Issue OCPBUGS-44693: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

openshift/cluster-samples-operator#587 is open

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-44693 has not been moved to the MODIFIED state.

p0lyn0mial · 2024-11-28T12:15:27Z

This PR needs to be reverted; openshift-apiserver (oas) is crashing:

panic: feature gate "ResilientWatchCacheInitialization" with different spec already exists: [{false false BETA 0.0}]

goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.Must(...)
	k8s.io/apimachinery@v0.31.1/pkg/util/runtime/runtime.go:258
k8s.io/kubernetes/pkg/features.init.0()
	k8s.io/kubernetes@v1.31.1/pkg/features/kube_features.go:993 +0x159
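The panic message is consistent with the same gate name being registered twice with different specs. Below is a minimal sketch of that kind of check. It uses a simplified, hypothetical `Registry` and `FeatureSpec` rather than the real k8s.io/component-base/featuregate types, and the order of the two registrations is an assumption based on the default-disabled spec shown in the message.

```go
package main

import "fmt"

// FeatureSpec is a simplified, hypothetical stand-in for a feature gate spec.
type FeatureSpec struct {
	Default    bool
	PreRelease string
}

// Registry is a hypothetical registry of known feature gates.
type Registry struct {
	known map[string]FeatureSpec
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{known: map[string]FeatureSpec{}}
}

// Add registers the given gates. Re-adding a gate with an identical spec is a
// no-op; re-adding it with a different spec is rejected with an error.
func (r *Registry) Add(features map[string]FeatureSpec) error {
	for name, spec := range features {
		if existing, ok := r.known[name]; ok {
			if existing == spec {
				continue
			}
			return fmt.Errorf("feature gate %q with different spec already exists: %v", name, existing)
		}
		r.known[name] = spec
	}
	return nil
}

func main() {
	const gate = "ResilientWatchCacheInitialization"
	r := NewRegistry()

	// Assumption: one package registers the gate default-disabled first.
	fmt.Println(r.Add(map[string]FeatureSpec{gate: {Default: false, PreRelease: "BETA"}}))

	// A second package then registers the same gate default-enabled.
	// The specs differ, so this call returns an error.
	fmt.Println(r.Add(map[string]FeatureSpec{gate: {Default: true, PreRelease: "BETA"}}))
}
```

In the trace above the registration result is wrapped in runtime.Must inside an init function, so an error of this kind surfaces as a panic at process start instead of a returned error.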

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Nov 27, 2024

openshift-ci-robot added the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) Nov 27, 2024

openshift-ci bot requested review from deads2k, p0lyn0mial and tkashem November 27, 2024 16:17

p0lyn0mial merged commit cd4433e into openshift:openshift-apiserver-4.18-kubernetes-1.31.1 Nov 27, 2024

p0lyn0mial mentioned this pull request Nov 27, 2024

OCPBUGS-44693: Bump k8s.io/apiserver fork. openshift/openshift-apiserver#462

Closed

p0lyn0mial mentioned this pull request Nov 28, 2024

OCPBUGS-44693: Revert "UPSTREAM: <carry>: Default-disable ResilientWatchCacheInitialization" #62

Merged
