benluddy

There's a separate, pre-existing issue causing storage layer errors and watch cache re-initialization during cluster bootstrap. With ResilientWatchCacheInitialization enabled, clients (reflectors in particular) are turned away with 429 responses while the watch cache is being repopulated, and they retry repeatedly. Without this feature, requests hang (consuming priority and fairness "seats") until the watch cache is initialized or they time out. We fail tests when the total number of watch requests during a job exceeds a threshold based on recent historical totals, and the systemic 429/retry behavior causes this threshold to be breached. We are temporarily disabling the feature to reduce noise, since the extra requests are a symptom and not the cause of the underlying storage errors.
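To illustrate why the 429/retry behavior inflates the watch request total, here is a toy model (not the actual apiserver or reflector code). The function name `countWatchRequests`, the tick-based timing, and the fixed retry interval are all assumptions made for illustration; real reflectors use backoff, and real requests can time out.

```go
package main

import "fmt"

// countWatchRequests returns how many watch requests a single client issues
// while the watch cache needs initTicks time units to become ready.
//
// reject=true models the 429 path: every attempt is answered immediately,
// and the client retries every retryEvery ticks until the cache is ready.
// reject=false models the hanging path: one request blocks until ready.
// Timeouts and backoff are deliberately ignored in this toy model.
func countWatchRequests(initTicks, retryEvery int, reject bool) int {
	if !reject || retryEvery <= 0 {
		return 1
	}
	requests := 0
	for tick := 0; ; tick += retryEvery {
		requests++ // one watch request per attempt
		if tick >= initTicks {
			return requests // cache is ready; this attempt succeeds
		}
	}
}

func main() {
	fmt.Println("hang:", countWatchRequests(10, 2, false))
	fmt.Println("429 + retry:", countWatchRequests(10, 2, true))
}
```

With these made-up numbers, the hanging path issues one request and the 429 path issues six, which is the kind of multiplication that could push a job over a threshold derived from historical totals.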

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 27, 2024
@openshift-ci-robot

@benluddy: This pull request references Jira Issue OCPBUGS-44693, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-44693 to depend on a bug in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 27, 2024
@benluddy
Author

/cc @p0lyn0mial

@p0lyn0mial p0lyn0mial merged commit cd4433e into openshift:openshift-apiserver-4.18-kubernetes-1.31.1 Nov 27, 2024
@openshift-ci-robot

@benluddy: Jira Issue OCPBUGS-44693: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-44693 has not been moved to the MODIFIED state.

@p0lyn0mial

this pr needs to be reverted, oas is crashing:

panic: feature gate "ResilientWatchCacheInitialization" with different spec already exists: [{false false BETA 0.0}]

goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.Must(...)
	k8s.io/apimachinery@v0.31.1/pkg/util/runtime/runtime.go:258
k8s.io/kubernetes/pkg/features.init.0()
	k8s.io/kubernetes@v1.31.1/pkg/features/kube_features.go:993 +0x159
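
The panic above reads like a feature gate being registered a second time with a spec that differs from the one already known. A simplified, self-contained model of that kind of check is sketched below. The types `featureSpec` and `registry` and the helper `must` are hypothetical stand-ins, not the real Kubernetes feature gate API; only the wording of the error message is taken from the trace.

```go
package main

import "fmt"

// featureSpec is a hypothetical, reduced stand-in for a feature gate spec.
type featureSpec struct {
	Default       bool
	LockToDefault bool
	PreRelease    string
}

// registry is a hypothetical stand-in for a feature gate registry.
type registry struct {
	known map[string]featureSpec
}

func newRegistry() *registry {
	return &registry{known: map[string]featureSpec{}}
}

// add registers the features. Re-registering an identical spec is a no-op;
// re-registering the same name with a different spec is an error.
func (r *registry) add(features map[string]featureSpec) error {
	for name, spec := range features {
		if existing, ok := r.known[name]; ok {
			if existing == spec {
				continue
			}
			return fmt.Errorf("feature gate %q with different spec already exists: %v", name, existing)
		}
		r.known[name] = spec
	}
	return nil
}

// must panics on a non-nil error, mirroring the Must(...) frame in the trace.
func must(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	r := newRegistry()
	name := "ResilientWatchCacheInitialization"

	// First registration, e.g. a component declaring the gate disabled.
	must(r.add(map[string]featureSpec{name: {Default: false, PreRelease: "BETA"}}))

	// A second registration with a different default is rejected.
	err := r.add(map[string]featureSpec{name: {Default: true, PreRelease: "BETA"}})
	fmt.Println(err)
}
```

If this model matches what happened, changing the default in only one of two places that declare the same gate would be enough to trigger the crash at init time. That is an inference from the panic message, not something the trace confirms.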
