OCPBUGS-29230: [release-4.15] separate the handler sync from the informer sync & remove the full service resync during node tracker startup #2060
The informer sync consists of completing the LIST operation that populates the informer cache for a given object type, while the handler sync consists of processing all the initial ADD events delivered at startup to a given event handler. Let's have WaitForHandlerSyncWithTimeout for the handler sync and WaitForInformerCacheSyncWithTimeout for the informer sync.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
(cherry picked from commit 369895c)
When we execute Run() in the service controller code, we deliberately wait for all node tracker add operations to complete before we set up and start the service and endpointslice handlers. Every node tracker add operation runs RequestFullSync, which (1) updates the OVN LB templates in NBDB and (2) adds the key of every service to the service handler queue so that each service gets processed again. Step (2) is necessary to recompute services when a node is added, updated or deleted, but at startup the service handler hasn't been started yet, so we are just adding the same service keys to the same queue once per node in the cluster, with no worker consuming them.

This proved to take a significant amount of time in large clusters: if adding a key to the queue takes 8 * 10^(-6) seconds, then with 11500 services and 250 nodes it takes 8 * 10^(-6) * 11500 * 250 = 23 seconds in total. Soon afterwards, Run() sets up the service and endpointslice handlers and waits for them to process all add operations: at that point the service handler queue is filled with all existing services anyway, and workers are started so that each service in the queue can finally be processed.

Let's optimize this and use a simple flag to prevent the node tracker from adding all service keys at startup.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
(cherry picked from commit f7b7583)
@ricky-rav: This pull request references Jira Issue OCPBUGS-29230, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test unit
@ricky-rav: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/jira refresh
@ricky-rav: This pull request references Jira Issue OCPBUGS-29230, which is valid. The bug has been moved to the POST state. 6 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
/lgtm commit 1: clean pick of 369895c
/lgtm commit 2: clean pick of f7b7583
/label backport-risk-assessed
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ricky-rav, tssurya The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/label cherry-pick-approved
Merged 519efdb into openshift:release-4.15
@ricky-rav: Jira Issue OCPBUGS-29230: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-29230 has been moved to the MODIFIED state. In response to this:
[ART PR BUILD NOTIFIER] This PR has been included in build ose-ovn-kubernetes-base-container-v4.15.0-202402082307.p0.g519efdb.assembly.stream.el9 for distgit ovn-kubernetes-base.
Clean cherry pick of:
Closes #OCPBUGS-29230