
Separate timeout for handler sync from informer sync and do not resync services during node tracker startup #4100

Merged

Conversation

@ricky-rav (Contributor) commented Jan 26, 2024

[edited]

@coveralls commented Jan 26, 2024

Coverage Status

coverage: 51.973% (-0.03%) from 52.002%
when pulling 8dc6cd7 on ricky-rav:OCPBUGS-20336_increaseTimeout_upstream
into cbff639 on ovn-org:master.

// It allows ~4 list() retries with the default reflector exponential backoff config
InformerSyncTimeout = 20 * time.Second

// GRMACBindingAgeThreshold is the lifetime in seconds of each MAC binding
// entry for the gateway routers. After this time, the entry is removed and
// may be refreshed with a new ARP request.
GRMACBindingAgeThreshold = "300"

// HandlerSyncTimeout is used when waiting for initial object handler sync.
// (i.e. all the ADD events should be processed for the existing objects by the event handler)
Member:

How long did it take on the scale cluster? Could you provide some more data about the size of the cluster (nodes, services)? What if 2 minutes is also not enough, or do we think 2 minutes has to be enough?

@ricky-rav (Contributor, author):

You're right, I'll add that to the description.

go-controller/pkg/util/sync.go (resolved review thread)
@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch 2 times, most recently from 7e6ee25 to e096eb0 Compare January 26, 2024 17:27
@npinaeva (Member):

@ricky-rav can you also update handlers to informers here, please? (I am sure in the first case I intended to sync the informer; not sure which sync we needed in the second case)
https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/controller/controller.go#L83
https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go#L198

@tssurya (Member) commented Jan 29, 2024

> In short, we saw that with a 250-node cluster it takes node tracker an average of 0.1 seconds to process one node, so in the upgrade runs that we looked at, the time out of 2 minutes occurred when ~200 nodes had been processed.

was it 2mins or 20sec? based on your presentation seems 20sec?

@ricky-rav (Contributor, author):

> In short, we saw that with a 250-node cluster it takes node tracker an average of 0.1 seconds to process one node, so in the upgrade runs that we looked at, the time out of 2 minutes occurred when ~200 nodes had been processed.

> was it 2mins or 20sec? based on your presentation seems 20sec?

20 seconds, my bad :)

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch 3 times, most recently from 40207bf to 10af9a1 Compare January 30, 2024 14:35
@ricky-rav (Contributor, author) commented Jan 30, 2024

I have two commits now:

I have less evidence now to increase the handler sync timeout to 2 minutes, since I've optimized the node tracker startup code in the second commit and that part should now take ~0.03s in a 250-node cluster. Do we still want to do it? The handlers for which we call WaitForHandlerSyncWithTimeout are all level-driven controllers, so WaitForHandlerSyncWithTimeout will return when all object keys have been added to the queue, without waiting for the workers to process the keys. That shouldn't need a higher timeout than 20s.

I think it still makes sense to have WaitForHandlerSyncWithTimeout and WaitForInformerSyncWithTimeout, because they wait for two different things, but the reasons for a higher timeout are thinner now. @npinaeva @tssurya @trozet what do you all think?

Thanks!
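For reference, here is a minimal sketch of how the two wait helpers under discussion could be split. The function names follow the PR, but the bodies below are illustrative only, built on client-go's cache.WaitForCacheSync, and are not the project's actual implementation:

```go
package util

import (
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// Illustrative only: fixed timeout for informer (LIST) syncs, matching the
// value discussed in this PR.
const InformerSyncTimeout = 20 * time.Second

// WaitForHandlerSyncWithTimeout waits until every HasSynced function returns
// true, i.e. until all initial ADD events for existing objects have been
// queued by the event handler. The timeout is supplied explicitly by the caller.
func WaitForHandlerSyncWithTimeout(name string, stopCh <-chan struct{}, timeout time.Duration, syncs ...cache.InformerSynced) bool {
	// Close a combined channel when either the controller stops or the
	// timeout expires, so cache.WaitForCacheSync gives up on whichever
	// comes first.
	combined := make(chan struct{})
	go func() {
		defer close(combined)
		select {
		case <-stopCh:
		case <-time.After(timeout):
		}
	}()
	if !cache.WaitForCacheSync(combined, syncs...) {
		klog.Errorf("%s: timed out (%v) waiting for initial handler sync", name, timeout)
		return false
	}
	return true
}

// WaitForInformerCacheSyncWithTimeout waits for the informer cache LIST to
// complete, always using the fixed InformerSyncTimeout.
func WaitForInformerCacheSyncWithTimeout(name string, stopCh <-chan struct{}, syncs ...cache.InformerSynced) bool {
	return WaitForHandlerSyncWithTimeout(name, stopCh, InformerSyncTimeout, syncs...)
}
```

With the timeout as an explicit argument, each call site can pick a value appropriate for its handler.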

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch 2 times, most recently from d03233d to dc1f345 Compare January 30, 2024 15:31
@ricky-rav ricky-rav changed the title Separate timeout for handler sync from informer sync Separate timeout for handler sync from informer sync and do not resync services during node tracker startup Jan 30, 2024
@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch 2 times, most recently from aff9cd3 to 1941751 Compare January 30, 2024 19:07
@ricky-rav (Contributor, author) commented Jan 30, 2024

/hold
There's a nice race condition in the variable I added...

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 1941751 to 42535c5 Compare January 31, 2024 09:57
@ricky-rav (Contributor, author):

Should be good to go now

@npinaeva (Member):

Maybe we should make WaitForHandlerSyncWithTimeout always accept the timeout explicitly from the caller, considering there is no such thing as a "generic handler timeout", since it depends on the handler implementation? Then you could tweak it per handler as needed, and also make sure you do some scale testing before setting it.

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 42535c5 to 95005b5 Compare January 31, 2024 18:35
@ricky-rav (Contributor, author):

> Maybe we should make WaitForHandlerSyncWithTimeout always accept the timeout explicitly from the caller, considering there is no such thing as a "generic handler timeout", since it depends on the handler implementation? Then you could tweak it per handler as needed, and also make sure you do some scale testing before setting it.

Yeah, that's not a bad idea to make it explicit. I've set all three timeouts to 20 seconds, which seems to have been a good upper bound so far.

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 95005b5 to 03849ba Compare February 1, 2024 13:57
@@ -80,7 +81,7 @@ func (c *controller[T]) Start(threadiness int) error {
 return fmt.Errorf("failed to add event handler: %w", err)
 }
 
-if !util.WaitForNamedCacheSyncWithTimeout(c.name, c.stopChan, c.eventHandler.HasSynced) {
+if !util.WaitForHandlerSyncWithTimeout(c.name, c.stopChan, 20*time.Second, c.eventHandler.HasSynced) {
Member:

oh I meant to change this to informer sync, so use c.config.Informer.HasSynced instead of c.eventHandler.HasSynced

@ricky-rav (Contributor, author):

Oh I see, I read your other message too quickly.

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 03849ba to 66f32e8 Compare February 1, 2024 17:27
-if !util.WaitForNamedCacheSyncWithTimeout(c.name, c.stopChan, c.eventHandler.HasSynced) {
-return fmt.Errorf("timed out waiting for egressservice caches to sync")
+if !util.WaitForInformerCacheSyncWithTimeout(c.name, c.stopChan, c.config.Informer.HasSynced) {
+return fmt.Errorf("timed out waiting for informer cache to sync")
Member:

lol, nice catch!

@npinaeva (Member) commented Feb 2, 2024

lgtm for the sync wait logic, thanks @ricky-rav!

 return fmt.Errorf("timed out waiting for egressservice caches to sync")
 }
 
-if !util.WaitForNamedCacheSyncWithTimeout("egressservices_services", c.stopCh, c.servicesSynced) {
+if !util.WaitForInformerCacheSyncWithTimeout("egressservices_services", c.stopCh, c.servicesSynced) {
 return fmt.Errorf("timed out waiting for caches to sync")
Member:

nit: should we make the error more explicit here, so that when it fails we know which one? The same nit applies to all the spots across the code ("services", "nad", "eip", ...), something like:
fmt.Errorf("timed out waiting for %q caches to sync", resourceName)

@ricky-rav (Contributor, author):

ah oops, there were a few of them :-D

@@ -168,10 +172,17 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGr
 return err
 }
 // We need node cache to be synced first, as we rely on it to properly reprogram initial per node load balancers
+// No need to sync services here, since workers aren't up yet.
Member:

qq: we were not syncing services here anyway, right? You have just added this comment to explicitly call that out, is that right?

@ricky-rav (Contributor, author):

we were doing it before, but now that I removed it, the comment alone doesn't tell you the whole story. Let me remove it from here and add it to RequestFullSync, where it makes more sense.

if c.nodeTrackerInitialSyncDone {
services, err := c.serviceLister.List(labels.Everything())
if err != nil {
klog.Errorf("Cached lister failed!? %v", err)
Member:

shouldn't this return the error and fail hard / retry? Or do we want to keep it as is?

@ricky-rav (Contributor, author):

You're right, I simply kept the existing logic. If I look at how node tracker sync works, it has no retry mechanism at all o_O
It might be something to explore for our tech debt, maybe?

@tssurya (Member) commented Feb 5, 2024:

yeah, sounds good. I definitely don't want you to go out of scope on this PR; perhaps open a JIRA tracker for this, please! Thanks.

klog.Errorf("Cached lister failed!? %v", err)
return
}
// Resync all services unless we're processing the initial node tracker sync (in which case
Member:

so... here the ordering is:
initialize: svccontroller.NewController -> NewNodeTracker -> RequestFullSync -> we check if initialNTSync is done, which won't be done since that part of the code is not touched yet, right?

Then we have the:
Run code => after the initialize is done and we have the DNC, we do DNC.Run -> startSvcController -> Run -> do the initial node tracker sync

I am unsure how and when we can hit the RequestFullSync NTInitialSync=true condition. It seems like the initial sync is always going to happen afterwards, right? Perhaps I am missing something. I will huddle with you.

@ricky-rav (Contributor, author):

uhm what do you mean by DNC?

The execution at startup is:

  • func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGroups, useTemplates bool) error
    • sets up the node tracker handler
    • waits for a full sync of the node track handler
      • each add event for node tracker executes updateNode, at the end of which we call updateNodeInfo, which eventually calls nt.resyncFn(nt.getZoneNodes()), corresponding to RequestFullSync in services_controller.go
        • in RequestFullSync, when we're here, c.nodeTrackerInitialSyncDone is false, so c.onServiceAdd is not called
    • c.nodeTrackerInitialSyncDone = true
    • sets up the service and endpoint slice handlers
    • waits for a full sync of the service and endpoint handlers
      • the code here just adds a key for each existing service and, for each endpoint slice, the corresponding service key
    • runs initTopLevelCache()
    • starts the workers to consume service keys from the queue

All subsequent node tracker add/update/delete events will call RequestFullSync and will trigger a full service resync, since c.nodeTrackerInitialSyncDone=true now.

Now that I've outlined the main steps, I see that I can actually set c.nodeTrackerInitialSyncDone=true later on right before starting the workers...
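For readers following along, here is a compact, self-contained sketch of that ordering, with the flag flipped right before the workers start, as proposed above. The names and stub types are hypothetical; the real Run() in services_controller.go takes more parameters and uses a rate-limited workqueue rather than a channel:

```go
package services

import (
	"fmt"
	"sync"
)

// Stand-ins for the real controller wiring; only the startup ordering matters here.
type sketchController struct {
	mu          sync.Mutex
	startupDone bool        // called nodeTrackerInitialSyncDone in earlier revisions of this PR
	queue       chan string // stand-in for the service-key workqueue
}

// requestFullSync is what every node tracker event ends up calling. During
// startup the service re-queue is skipped: no worker is running yet, and the
// service handler sync will enqueue every service key anyway.
func (c *sketchController) requestFullSync(serviceKeys []string) {
	c.mu.Lock()
	done := c.startupDone
	c.mu.Unlock()
	if !done {
		return
	}
	for _, key := range serviceKeys {
		c.queue <- key
	}
}

// runSketch mirrors the startup steps listed above.
func (c *sketchController) runSketch(syncNodeTracker, syncServiceAndEndpointHandlers func() error, workers int, stopCh <-chan struct{}) error {
	// 1. Node tracker handler sync: each initial node ADD calls requestFullSync,
	//    which is a no-op while startupDone is false.
	if err := syncNodeTracker(); err != nil {
		return fmt.Errorf("node tracker handler sync failed: %w", err)
	}
	// 2. Service and endpointslice handler sync: every existing service key is
	//    enqueued exactly once.
	if err := syncServiceAndEndpointHandlers(); err != nil {
		return fmt.Errorf("service/endpoint handler sync failed: %w", err)
	}
	// 3. Flip the flag right before starting the workers, so any later node
	//    tracker event triggers a full service resync.
	c.mu.Lock()
	c.startupDone = true
	c.mu.Unlock()
	// 4. Start the workers that drain the queue.
	for i := 0; i < workers; i++ {
		go func() {
			for {
				select {
				case key := <-c.queue:
					_ = key // process the service key
				case <-stopCh:
					return
				}
			}
		}()
	}
	return nil
}
```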

Member:

heh yeah, thanks Ric, let's talk tomorrow, easier that way. Will ping you.

@ricky-rav (Contributor, author):

I've moved the line where I set the flag to true further down and renamed it, more simply, startupDone, since it actually refers to whether the initial sync for the three handlers we have (node tracker, service, endpointslice) has been completed. I didn't move c.startupDone=true to after we start the workers, because I'm a bit paranoid that we might miss node tracker events in the teeny tiny time frame of the for loop itself:

	klog.Info("Starting workers")
	for i := 0; i < workers; i++ {
		go wait.Until(c.worker, c.workerLoopPeriod, stopCh)
	}

Member:

> I didn't move c.startupDone=true to after we start the workers, because I'm a bit paranoid that we might miss node tracker events in the teeny tiny time frame of the for loop itself

ack, that makes sense, but there is also the reverse scenario where for some nodes we get the add/update in the tiny window where done = true but the workers are not ready yet because the loop hasn't finished; I think that should be negligible, though.

go-controller/pkg/util/sync.go (resolved review thread)
@tssurya (Member) commented Feb 7, 2024

the two flakes in CI are known issues I linked above

@tssurya (Member) left a comment:

both commits lgtm except for the nits I outlined

return fmt.Errorf("error syncing cache")
// We need the node tracker to be synced first, as we rely on it to properly reprogram initial per node load balancers
klog.Info("Waiting for node tracker handler to sync")
if !util.WaitForHandlerSyncWithTimeout(nodeControllerName, stopCh, 20*time.Second, nodeHandler.HasSynced) {
Member:

nit: don't you want to use HandlerSyncTimeout here instead of doing 20*time.Second?

@ricky-rav (Contributor, author):

oops, I defined it and I wasn't using it (:

if !util.WaitForNamedCacheSyncWithTimeout(controllerName, stopCh, svcHandler.HasSynced, endpointHandler.HasSynced) {
return fmt.Errorf("error syncing cache")
klog.Info("Waiting for service and endpoint handlers to sync")
if !util.WaitForHandlerSyncWithTimeout(controllerName, stopCh, 20*time.Second, svcHandler.HasSynced, endpointHandler.HasSynced) {
Member:

nit: don't you want to use HandlerSyncTimeout here instead of doing 20*time.Second?

The informer sync consists of completing a LIST operation for the informer cache for a given object type, while the handler sync consists of processing all the initial ADD operations we get at startup for a given event handler. Let's have WaitForHandlerSyncWithTimeout for the handler sync and WaitForInformerCacheSyncWithTimeout for the informer sync.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
When we execute Run() in the service controller code, we wait on purpose for all node tracker add operations to be done, before we set up and start the service and endpoint handlers. Now, every node tracker add operation runs RequestFullSync, which (1) updates OVN LB templates on NBDB; (2) for each service, adds the service key to the service handler queue so that the service can be processed again.
This last part is necessary to recompute services when a node is added, updated or deleted, but at startup the service handler hasn't been started yet, so we're just adding the same service keys to the same queue for each node in the cluster, with no worker consuming the keys.

This proved to take a significantly long time in large clusters: if adding a key to the queue takes 8 * 10^(-6) seconds, then with 11500 services and 250 nodes this takes in total 8 * 10^(-6) * 11500 * 250 = 23 seconds.

Soon afterwards in Run() we set up the service and endpointslice handlers and wait for them to process all add operations: the service handler queue is filled with all existing services at this point anyway, and the workers are started so that we can finally process each service in the queue.

Let's optimize this and use a simple flag to prevent the node tracker from adding all service keys at startup.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
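As a quick sanity check of the estimate in the commit message above (illustrative only; the per-enqueue cost is the rough figure quoted there, not a measured constant):

```go
package main

import "fmt"

func main() {
	const (
		perEnqueueSeconds = 8e-6  // rough cost of adding one key to the queue, as quoted above
		services          = 11500 // services in the scale cluster
		nodes             = 250   // nodes in the scale cluster
	)
	// Before this change, every node's initial RequestFullSync re-enqueued every
	// service key even though no worker was consuming the queue yet.
	wasted := perEnqueueSeconds * services * nodes
	fmt.Printf("wasted startup time: ~%.0f seconds\n", wasted) // ~23 seconds
}
```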
@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 8dc6cd7 to f7b7583 Compare February 7, 2024 10:36
@tssurya (Member) left a comment:

latest changes LGTM.
d/s scale test has been done (openshift/ovn-kubernetes#2014); no disruptions or anything significant found yet.
Once the CI runs here on the latest, I will merge this.
Thanks @ricky-rav for the detailed commit message, which made it easier for me to understand, and for your awesome troubleshooting here.

@tssurya (Member) commented Feb 7, 2024

New flake in the external-gateway lane, not related to this PR: #4139
Known flake in the kv live migration lane; the rest are green: #3986

@tssurya tssurya merged commit 6e1ca6f into ovn-org:master Feb 7, 2024
28 of 30 checks passed
@openshift-merge-robot

Fix included in accepted release 4.16.0-0.nightly-2024-02-17-013806
