Separate timeout for handler sync from informer sync and do not resync services during node tracker startup #4100
Conversation
go-controller/pkg/types/const.go (Outdated)
// It allows ~4 list() retries with the default reflector exponential backoff config
InformerSyncTimeout = 20 * time.Second

// GRMACBindingAgeThreshold is the lifetime in seconds of each MAC binding
// entry for the gateway routers. After this time, the entry is removed and
// may be refreshed with a new ARP request.
GRMACBindingAgeThreshold = "300"

// HandlerSyncTimeout is used when waiting for initial object handler sync.
// (i.e. all the ADD events should be processed for the existing objects by the event handler)
how long did it take on the scale cluster? Could you provide some more data around the size of the cluster (nodes, services)? What if 2 mins is also not enough, or do we think 2 mins has to be enough?
You're right, I'll add that to the description.
Force-pushed from 7e6ee25 to e096eb0
@ricky-rav can you also update handlers to informers here please (I am sure in the first case I intended to sync informer, not sure which sync we needed in the second case)
was it 2mins or 20sec? based on your presentation seems 20sec?
20 seconds, my bad :)
Force-pushed from 40207bf to 10af9a1
I have two commits now:
I have less evidence now for increasing the handler sync timeout to 2 minutes, since I've optimized the node tracker startup code in the second commit and that part should now take ~0.03s in a 250-node cluster. Do we want to still do it? I think it still makes sense to have a separate timeout for the handler sync. Thanks!
Force-pushed from d03233d to dc1f345
Force-pushed from aff9cd3 to 1941751
Force-pushed from 1941751 to 42535c5
Should be good to go now
Maybe we should make it explicit?
Force-pushed from 42535c5 to 95005b5
Yeah, that's not a bad idea to make it explicit. I've set all three timeouts to 20 seconds, which seems to have been a good upper bound so far.
Force-pushed from 95005b5 to 03849ba
@@ -80,7 +81,7 @@ func (c *controller[T]) Start(threadiness int) error {
		return fmt.Errorf("failed to add event handler: %w", err)
	}

	if !util.WaitForNamedCacheSyncWithTimeout(c.name, c.stopChan, c.eventHandler.HasSynced) {
	if !util.WaitForHandlerSyncWithTimeout(c.name, c.stopChan, 20*time.Second, c.eventHandler.HasSynced) {
oh I meant to change this to informer sync, so use `c.config.Informer.HasSynced` instead of `c.eventHandler.HasSynced`
Oh I see, I read your other message too quickly.
- in services_controller.go (https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go#L198) I definitely need `WaitForHandlerSyncWithTimeout`, because I want all service keys to be added to the queue at that step, since I removed that part from the node tracker startup.
- here in controller.go the existing code passes the handler `HasSynced`, so I simply figured I'd keep the existing logic and use `WaitForHandlerSyncWithTimeout`. If that's not necessary, I can definitely change that.
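For readers less familiar with the two signals being discussed: the sketch below is plain client-go (not ovn-kubernetes code) and all names in it are illustrative. It contrasts the informer's `HasSynced` (the initial LIST has populated the cache) with the `HasSynced` of the registration returned by `AddEventHandler` (the initial ADD events have been delivered to that particular handler).

```go
package main

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load whatever kubeconfig is in the default location; purely for the demo.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 0)
	svcInformer := factory.Core().V1().Services().Informer()

	// The registration returned by AddEventHandler has its own HasSynced,
	// which reports whether the initial ADD events have been delivered to
	// *this* handler (the "handler sync").
	reg, err := svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { /* a real controller would enqueue the service key here */ },
	})
	if err != nil {
		panic(err)
	}

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Informer sync: the initial LIST has populated the informer cache.
	if !cache.WaitForNamedCacheSync("services-informer", stopCh, svcInformer.HasSynced) {
		fmt.Println("timed out waiting for informer cache to sync")
		return
	}
	// Handler sync: the initial ADDs have been processed by our handler.
	if !cache.WaitForNamedCacheSync("services-handler", stopCh, reg.HasSynced) {
		fmt.Println("timed out waiting for handler to sync")
		return
	}
	fmt.Println("informer cache and event handler are both synced")
}
```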
Force-pushed from 03849ba to 66f32e8
if !util.WaitForNamedCacheSyncWithTimeout(c.name, c.stopChan, c.eventHandler.HasSynced) {
	return fmt.Errorf("timed out waiting for egressservice caches to sync")
if !util.WaitForInformerCacheSyncWithTimeout(c.name, c.stopChan, c.config.Informer.HasSynced) {
	return fmt.Errorf("timed out waiting for informer cache to sync")
lol, nice catch!
lgtm for the sync wait logic, thanks @ricky-rav!
return fmt.Errorf("timed out waiting for egressservice caches to sync") | ||
} | ||
|
||
if !util.WaitForNamedCacheSyncWithTimeout("egressservices_services", c.stopCh, c.servicesSynced) { | ||
if !util.WaitForInformerCacheSyncWithTimeout("egressservices_services", c.stopCh, c.servicesSynced) { | ||
return fmt.Errorf("timed out waiting for caches to sync") |
nit: should we make the error more explicit here so that when it fails we know which one? => this same nit applies to all the spots across the code.. "services, nad, eip..." something like:
fmt.Errorf("timed out waiting for %q caches to sync", resourceName)
ah oops, there were a few of them :-D
@@ -168,10 +172,17 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGroups, useTemplates bool) error
		return err
	}
	// We need node cache to be synced first, as we rely on it to properly reprogram initial per node load balancers
	// No need to sync services here, since workers aren't up yet.
qq: we were not syncing services here anyways right? you have just added this comment to explicitly call that out, that's all right?
we were doing it before, but now that I removed it, the comment alone doesn't tell you the whole story. Let me remove it from here and add it to `RequestFullSync`, where it makes more sense.
if c.nodeTrackerInitialSyncDone {
	services, err := c.serviceLister.List(labels.Everything())
	if err != nil {
		klog.Errorf("Cached lister failed!? %v", err)
shouldn't this be .... return the error and fail hard/retry? or do we want to keep it as is..
You're right, I simply kept the existing logic. If I look at how node tracker sync works, it has no retry mechanism at all o_O
It might be something to explore for our tech debt, maybe?
yeah sounds good, deff don't want you to go out of scope on this PR, perhaps open a JIRA tracker for this please! thanks
Here it is: https://issues.redhat.com/browse/SDN-4483
klog.Errorf("Cached lister failed!? %v", err) | ||
return | ||
} | ||
// Resync all services unless we're processing the initial node tracker sync (in which case |
so.. here the ordering is:

initialize: `svccontroller.NewController` -> `NewNodeTracker` -> `RequestFullSync` -> we check if initialNTSync is done, which won't be done since that part of code is not touched yet, right?

Then we have the Run code => after the initialize is done and we have the DNC, we do `DNC.Run` -> `startSvcController` -> `Run` -> do the initial node tracker sync.

I am unsure how and when we can hit the `RequestFullSync` NTInitialSync=true condition? It seems like the initial sync is always going to happen afterwards, right? Perhaps I am missing something. I will huddle with you.
uhm what do you mean by DNC?
The execution at startup is:
`func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGroups, useTemplates bool) error`
- sets up the node tracker handler
- waits for a full sync of the node tracker handler
  - each add event for the node tracker executes `updateNode`, at the end of which we call `updateNodeInfo`, which eventually calls `nt.resyncFn(nt.getZoneNodes())`, corresponding to `RequestFullSync` in services_controller.go
    - in `RequestFullSync`, when we're here, `c.nodeTrackerInitialSyncDone` is false, so `c.onServiceAdd` is not called
- sets `c.nodeTrackerInitialSyncDone = true`
- sets up the service and endpoint slice handlers
- waits for a full sync of the service and endpoint handlers
  - the code here just adds a key for each existing service and, for each endpoint slice, the corresponding service key
- runs `initTopLevelCache()`
- starts the workers to consume service keys from the queue

All subsequent node tracker add/update/delete events will call `RequestFullSync` and will trigger a full service resync, since `c.nodeTrackerInitialSyncDone=true` now.

Now that I've outlined the main steps, I see that I can actually set `c.nodeTrackerInitialSyncDone=true` later on, right before starting the workers...
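To make the gating concrete, here is a minimal, self-contained sketch of the pattern outlined above, with assumed names (`startupDone`, `requestFullSync`) and a plain client-go workqueue standing in for the service key queue; it is not the actual services controller code.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// controller is a stand-in for the services controller: a work queue of
// service keys, a lister substitute, and the startup flag discussed above.
type controller struct {
	queue       workqueue.Interface
	startupDone bool
	services    []string // stand-in for c.serviceLister.List(...)
}

// requestFullSync mimics the gating: node-tracker-triggered full resyncs only
// re-enqueue every service once startup is done, because during startup the
// service handler's own initial ADDs will fill the queue anyway.
func (c *controller) requestFullSync() {
	// (the per-node LB template update would happen here unconditionally)
	if !c.startupDone {
		return
	}
	for _, key := range c.services {
		c.queue.Add(key)
	}
}

func main() {
	c := &controller{
		queue:    workqueue.New(),
		services: []string{"ns1/svc-a", "ns2/svc-b"},
	}

	c.requestFullSync() // node tracker initial ADDs: nothing gets enqueued
	fmt.Println("queue length during startup:", c.queue.Len()) // 0

	c.startupDone = true
	c.requestFullSync() // a later node event: services are re-queued
	fmt.Println("queue length after startup:", c.queue.Len()) // 2
}
```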
heh yeah thanks ric, let's talk tomorrow easier that way. will ping you
I've moved the line where I set the flag to true further below and renamed it more simply to `startupDone`, since it actually refers to whether the initial sync for the three handlers we have (node tracker, service, endpointslice) has been completed. I didn't move `c.startupDone=true` to after we start the workers because I'm a bit paranoid about the fact that we might miss node tracker events in the teeny tiny time frame of the for loop itself:
klog.Info("Starting workers")
for i := 0; i < workers; i++ {
go wait.Until(c.worker, c.workerLoopPeriod, stopCh)
}
> I didn't move c.startupDone=true after we start the workers because I'm a bit paranoid about the fact that we might miss node tracker events in the teeny tiny time frame of the for loop itself
ack that makes sense, but then there is also the reverse scenario where for some nodes we get the add/update in this tiny window where done = true and the workers are not ready yet because the loop hasn't finished, but I think that should be negligible.
Force-pushed from f4a8090 to 8dc6cd7
the two flakes in CI are known issues I linked above
both commits lgtm except for the nits I outlined
return fmt.Errorf("error syncing cache") | ||
// We need the node tracker to be synced first, as we rely on it to properly reprogram initial per node load balancers | ||
klog.Info("Waiting for node tracker handler to sync") | ||
if !util.WaitForHandlerSyncWithTimeout(nodeControllerName, stopCh, 20*time.Second, nodeHandler.HasSynced) { |
nit: don't you want to use HandlerSyncTimeout here instead of doing 20*time.Second?
oops, I defined it and I wasn't using it (:
if !util.WaitForNamedCacheSyncWithTimeout(controllerName, stopCh, svcHandler.HasSynced, endpointHandler.HasSynced) {
	return fmt.Errorf("error syncing cache")
klog.Info("Waiting for service and endpoint handlers to sync")
if !util.WaitForHandlerSyncWithTimeout(controllerName, stopCh, 20*time.Second, svcHandler.HasSynced, endpointHandler.HasSynced) {
nit: don't you want to use HandlerSyncTimeout here instead of doing 20*time.Second?
The informer sync consists of completing a LIST operation for the informer cache for a given object type, while the handler sync consists of processing all the initial ADD operations we get at startup for a given event handler. Let's have WaitForHandlerSyncWithTimeout for the handler sync and WaitForInformerCacheSyncWithTimeout for the informer sync.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
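As a rough illustration of what such timeout-bounded waits can look like (a sketch, not the actual util helpers added by this PR), one way to bound client-go's `cache.WaitForNamedCacheSync` with a timeout is to close a derived stop channel when either the caller's stop channel closes or the timeout fires; the two wrappers named in the commit message would presumably differ only in which HasSynced functions and which timeout they are given.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForSyncWithTimeout is a hypothetical helper: it waits for the given
// HasSynced functions, giving up when the caller's stop channel closes or
// when the timeout expires, whichever comes first.
func waitForSyncWithTimeout(name string, stopCh <-chan struct{}, timeout time.Duration, synced ...cache.InformerSynced) bool {
	// combined closes on whichever comes first: caller stop or timeout.
	combined := make(chan struct{})
	go func() {
		defer close(combined)
		select {
		case <-stopCh:
		case <-time.After(timeout):
		}
	}()
	return cache.WaitForNamedCacheSync(name, combined, synced...)
}

func main() {
	stopCh := make(chan struct{})
	start := time.Now()

	// A fake "handler" that reports synced after 200ms, well within the timeout.
	ok := waitForSyncWithTimeout("demo-handler", stopCh, 20*time.Second, func() bool {
		return time.Since(start) > 200*time.Millisecond
	})
	fmt.Println("synced within timeout:", ok)

	// A sync source that never becomes ready trips the (short) timeout instead.
	ok = waitForSyncWithTimeout("never-ready", stopCh, 500*time.Millisecond, func() bool { return false })
	fmt.Println("synced within timeout:", ok)
}
```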
When we execute Run() in the service controller code, we wait on purpose for all node tracker add operations to be done before we set up and start the service and endpoint handlers. Now, every node tracker add operation runs RequestFullSync, which (1) updates OVN LB templates in NBDB; (2) for each service, adds the service key to the service handler queue so that the service can be processed again.

This last part is necessary to recompute services when a node is added, updated or deleted, but at startup the service handler hasn't been started yet, so we're just adding the same service keys to the same queue for each node in the cluster, with no worker consuming the keys. This proved to take a significantly long time in large clusters: if adding a key to the queue takes 8 * 10^(-6) seconds, then with 11500 services and 250 nodes this takes in total 8 * 10^(-6) * 11500 * 250 = 23 seconds.

Soon afterwards, in Run() we set up the service and endpointslice handlers and wait for them to process all add operations: the service handler queue is filled with all existing services at this point anyway, and workers are started so that we can finally process each service in the queue.

Let's optimize this and use a simple flag to prevent the node tracker from adding all service keys at startup.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
Force-pushed from 8dc6cd7 to f7b7583
latest changes LGTM.
d/s scale test has been done (openshift/ovn-kubernetes#2014); no disruptions or anything significant found yet.
once the CI runs here on the latest will merge this.
thanks @ricky-rav for the detailed commit message which made it easier for me to understand and your awesome troubleshooting here
Fix included in accepted release 4.16.0-0.nightly-2024-02-17-013806