
Separate timeout for handler sync from informer sync and do not resync services during node tracker startup #4100

Merged

Conversation

@ricky-rav (Contributor) commented Jan 26, 2024

[edited]

@coveralls commented Jan 26, 2024

Coverage Status

coverage: 51.973% (-0.03%) from 52.002%
when pulling 8dc6cd7 on ricky-rav:OCPBUGS-20336_increaseTimeout_upstream
into cbff639 on ovn-org:master.

// It allows ~4 list() retries with the default reflector exponential backoff config
InformerSyncTimeout = 20 * time.Second

// GRMACBindingAgeThreshold is the lifetime in seconds of each MAC binding
// entry for the gateway routers. After this time, the entry is removed and
// may be refreshed with a new ARP request.
GRMACBindingAgeThreshold = "300"

// HandlerSyncTimeout is used when waiting for initial object handler sync.
// (i.e. all the ADD events should be processed for the existing objects by the event handler)
Member:

How long did it take on the scale cluster? Could you provide some more data about the size of the cluster (nodes, services)? What if 2 minutes is also not enough, or do we think 2 minutes has to be enough?

@ricky-rav (Contributor, author):

You're right, I'll add that to the description.

go-controller/pkg/util/sync.go (resolved review thread)
@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch 2 times, most recently from 7e6ee25 to e096eb0 Compare January 26, 2024 17:27
@npinaeva (Member):

@ricky-rav can you also update handlers to informers here, please? (I am sure in the first case I intended to sync the informer; not sure which sync we needed in the second case)
https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/controller/controller.go#L83
https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go#L198

@tssurya (Member) commented Jan 29, 2024

> In short, we saw that with a 250-node cluster it takes node tracker an average of 0.1 seconds to process one node, so in the upgrade runs that we looked at, the time out of 2 minutes occurred when ~200 nodes had been processed.

was it 2mins or 20sec? based on your presentation seems 20sec?

@ricky-rav (Contributor, author):

> In short, we saw that with a 250-node cluster it takes node tracker an average of 0.1 seconds to process one node, so in the upgrade runs that we looked at, the time out of 2 minutes occurred when ~200 nodes had been processed.

> was it 2mins or 20sec? based on your presentation seems 20sec?

20 seconds, my bad :)

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch 3 times, most recently from 40207bf to 10af9a1 Compare January 30, 2024 14:35
@ricky-rav (Contributor, author) commented Jan 30, 2024

I have two commits now:

I have less evidence now to increase the handler sync timeout to 2 minutes, since I've optimized the node tracker startup code in the second commit and that part should now take ~0.03s in a 250-node cluster. Do we still want to do it? The handlers for which we call WaitForHandlerSyncWithTimeout are all level-driven controllers, so WaitForHandlerSyncWithTimeout will return when all object keys have been added to the queue, without waiting for the workers to process the keys. That shouldn't need a higher timeout than 20s.

I think it still makes sense to have WaitForHandlerSyncWithTimeout and WaitForInformerSyncWithTimeout, because they wait for two different things, but the reasons for a higher timeout are thinner now. @npinaeva @tssurya @trozet what do you all think?

Thanks!
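For reference, here is a minimal sketch of how the two wait helpers under discussion could be split. The function names follow the PR, but the bodies below are illustrative only, built on client-go's cache.WaitForCacheSync, and are not the project's actual implementation:

```go
package util

import (
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// Illustrative only: fixed timeout for informer (LIST) syncs, matching the
// value discussed in this PR.
const InformerSyncTimeout = 20 * time.Second

// WaitForHandlerSyncWithTimeout waits until every HasSynced function returns
// true, i.e. until all initial ADD events for existing objects have been
// queued by the event handler. The timeout is supplied explicitly by the caller.
func WaitForHandlerSyncWithTimeout(name string, stopCh <-chan struct{}, timeout time.Duration, syncs ...cache.InformerSynced) bool {
	// Close a combined channel when either the controller stops or the
	// timeout expires, so cache.WaitForCacheSync gives up on whichever
	// comes first.
	combined := make(chan struct{})
	go func() {
		defer close(combined)
		select {
		case <-stopCh:
		case <-time.After(timeout):
		}
	}()
	if !cache.WaitForCacheSync(combined, syncs...) {
		klog.Errorf("%s: timed out (%v) waiting for initial handler sync", name, timeout)
		return false
	}
	return true
}

// WaitForInformerCacheSyncWithTimeout waits for the informer cache LIST to
// complete, always using the fixed InformerSyncTimeout.
func WaitForInformerCacheSyncWithTimeout(name string, stopCh <-chan struct{}, syncs ...cache.InformerSynced) bool {
	return WaitForHandlerSyncWithTimeout(name, stopCh, InformerSyncTimeout, syncs...)
}
```

With the timeout as an explicit argument, each call site can pick a value appropriate for its handler.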

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch 2 times, most recently from d03233d to dc1f345 Compare January 30, 2024 15:31
@ricky-rav ricky-rav changed the title Separate timeout for handler sync from informer sync Separate timeout for handler sync from informer sync and do not resync services during node tracker startup Jan 30, 2024
@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch 2 times, most recently from aff9cd3 to 1941751 Compare January 30, 2024 19:07
@ricky-rav (Contributor, author) commented Jan 30, 2024

/hold
There's a nice race condition in the variable I added...

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 1941751 to 42535c5 Compare January 31, 2024 09:57
@ricky-rav (Contributor, author):

Should be good to go now

@npinaeva (Member):

Maybe we should make WaitForHandlerSyncWithTimeout always accept the timeout explicitly from the caller, considering there is no such thing as a "generic handler timeout", since it depends on the handler implementation? Then you could tweak it per handler as needed, and also make sure you do some scale testing before setting it.

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 42535c5 to 95005b5 Compare January 31, 2024 18:35
@ricky-rav (Contributor, author):

> Maybe we should make WaitForHandlerSyncWithTimeout always accept the timeout explicitly from the caller, considering there is no such thing as a "generic handler timeout", since it depends on the handler implementation? Then you could tweak it per handler as needed, and also make sure you do some scale testing before setting it.

Yeah, that's not a bad idea to make it explicit. I've set all three timeouts to 20 seconds, which seems to have been a good upper bound so far.

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 95005b5 to 03849ba Compare February 1, 2024 13:57
@@ -80,7 +81,7 @@ func (c *controller[T]) Start(threadiness int) error {
 return fmt.Errorf("failed to add event handler: %w", err)
 }
 
-if !util.WaitForNamedCacheSyncWithTimeout(c.name, c.stopChan, c.eventHandler.HasSynced) {
+if !util.WaitForHandlerSyncWithTimeout(c.name, c.stopChan, 20*time.Second, c.eventHandler.HasSynced) {
Member:

oh I meant to change this to informer sync, so use c.config.Informer.HasSynced instead of c.eventHandler.HasSynced

@ricky-rav (Contributor, author):

Oh I see, I read your other message too quickly.

@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 03849ba to 66f32e8 Compare February 1, 2024 17:27
-if !util.WaitForNamedCacheSyncWithTimeout(c.name, c.stopChan, c.eventHandler.HasSynced) {
-return fmt.Errorf("timed out waiting for egressservice caches to sync")
+if !util.WaitForInformerCacheSyncWithTimeout(c.name, c.stopChan, c.config.Informer.HasSynced) {
+return fmt.Errorf("timed out waiting for informer cache to sync")
Member:

lol, nice catch!

@npinaeva (Member) commented Feb 2, 2024

lgtm for the sync wait logic, thanks @ricky-rav!

 return fmt.Errorf("timed out waiting for egressservice caches to sync")
 }
 
-if !util.WaitForNamedCacheSyncWithTimeout("egressservices_services", c.stopCh, c.servicesSynced) {
+if !util.WaitForInformerCacheSyncWithTimeout("egressservices_services", c.stopCh, c.servicesSynced) {
 return fmt.Errorf("timed out waiting for caches to sync")
Member:

nit: should we make the error more explicit here, so that when it fails we know which one? The same nit applies to all the spots across the code ("services", "nad", "eip", ...), something like:
fmt.Errorf("timed out waiting for %q caches to sync", resourceName)

@ricky-rav (Contributor, author):

ah oops, there were a few of them :-D

@@ -168,10 +172,17 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGr
 return err
 }
 // We need node cache to be synced first, as we rely on it to properly reprogram initial per node load balancers
+// No need to sync services here, since workers aren't up yet.
Member:

qq: we were not syncing services here anyway, right? You have just added this comment to explicitly call that out, is that right?

@ricky-rav (Contributor, author):

we were doing it before, but now that I removed it, the comment alone doesn't tell you the whole story. Let me remove it from here and add it to RequestFullSync, where it makes more sense.

if c.nodeTrackerInitialSyncDone {
services, err := c.serviceLister.List(labels.Everything())
if err != nil {
klog.Errorf("Cached lister failed!? %v", err)
Member:

shouldn't this return the error and fail hard / retry? Or do we want to keep it as is?

@ricky-rav (Contributor, author):

You're right, I simply kept the existing logic. If I look at how node tracker sync works, it has no retry mechanism at all o_O
It might be something to explore for our tech debt, maybe?

@tssurya (Member) commented Feb 5, 2024:

yeah, sounds good. I definitely don't want you to go out of scope on this PR; perhaps open a JIRA tracker for this, please! Thanks.

klog.Errorf("Cached lister failed!? %v", err)
return
}
// Resync all services unless we're processing the initial node tracker sync (in which case
Member:

so... here the ordering is:
initialize: svccontroller.NewController -> NewNodeTracker -> RequestFullSync -> we check if initialNTSync is done, which won't be done since that part of the code is not touched yet, right?

Then we have the:
Run code => after the initialize is done and we have the DNC, we do DNC.Run -> startSvcController -> Run -> do the initial node tracker sync

I am unsure how and when we can hit the RequestFullSync NTInitialSync=true condition. It seems like the initial sync is always going to happen afterwards, right? Perhaps I am missing something. I will huddle with you.

@ricky-rav (Contributor, author):

uhm what do you mean by DNC?

The execution at startup is:

  • func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGroups, useTemplates bool) error
    • sets up the node tracker handler
    • waits for a full sync of the node track handler
      • each add event for node tracker executes updateNode, at the end of which we call updateNodeInfo, which eventually calls nt.resyncFn(nt.getZoneNodes()), corresponding to RequestFullSync in services_controller.go
        • in RequestFullSync, when we're here, c.nodeTrackerInitialSyncDone is false, so c.onServiceAdd is not called
    • c.nodeTrackerInitialSyncDone = true
    • sets up the service and endpoint slice handlers
    • waits for a full sync of the service and endpoint handlers
      • the code here just adds a key for each existing service and, for each endpoint slice, the corresponding service key
    • runs initTopLevelCache()
    • starts the workers to consume service keys from the queue

All subsequent node tracker add/update/delete events will call RequestFullSync and will trigger a full service resync, since c.nodeTrackerInitialSyncDone=true now.

Now that I've outlined the main steps, I see that I can actually set c.nodeTrackerInitialSyncDone=true later on right before starting the workers...
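For readers following along, here is a compact, self-contained sketch of that ordering, with the flag flipped right before the workers start, as proposed above. The names and stub types are hypothetical; the real Run() in services_controller.go takes more parameters and uses a rate-limited workqueue rather than a channel:

```go
package services

import (
	"fmt"
	"sync"
)

// Stand-ins for the real controller wiring; only the startup ordering matters here.
type sketchController struct {
	mu          sync.Mutex
	startupDone bool        // called nodeTrackerInitialSyncDone in earlier revisions of this PR
	queue       chan string // stand-in for the service-key workqueue
}

// requestFullSync is what every node tracker event ends up calling. During
// startup the service re-queue is skipped: no worker is running yet, and the
// service handler sync will enqueue every service key anyway.
func (c *sketchController) requestFullSync(serviceKeys []string) {
	c.mu.Lock()
	done := c.startupDone
	c.mu.Unlock()
	if !done {
		return
	}
	for _, key := range serviceKeys {
		c.queue <- key
	}
}

// runSketch mirrors the startup steps listed above.
func (c *sketchController) runSketch(syncNodeTracker, syncServiceAndEndpointHandlers func() error, workers int, stopCh <-chan struct{}) error {
	// 1. Node tracker handler sync: each initial node ADD calls requestFullSync,
	//    which is a no-op while startupDone is false.
	if err := syncNodeTracker(); err != nil {
		return fmt.Errorf("node tracker handler sync failed: %w", err)
	}
	// 2. Service and endpointslice handler sync: every existing service key is
	//    enqueued exactly once.
	if err := syncServiceAndEndpointHandlers(); err != nil {
		return fmt.Errorf("service/endpoint handler sync failed: %w", err)
	}
	// 3. Flip the flag right before starting the workers, so any later node
	//    tracker event triggers a full service resync.
	c.mu.Lock()
	c.startupDone = true
	c.mu.Unlock()
	// 4. Start the workers that drain the queue.
	for i := 0; i < workers; i++ {
		go func() {
			for {
				select {
				case key := <-c.queue:
					_ = key // process the service key
				case <-stopCh:
					return
				}
			}
		}()
	}
	return nil
}
```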

Member:

heh yeah, thanks Ric, let's talk tomorrow, easier that way. Will ping you.

@ricky-rav (Contributor, author):

I've moved the line where I set the flag to true further down and renamed it, more simply, startupDone, since it actually refers to whether the initial sync for the three handlers we have (node tracker, service, endpointslice) has been completed. I didn't move c.startupDone=true to after we start the workers, because I'm a bit paranoid that we might miss node tracker events in the teeny tiny time frame of the for loop itself:

	klog.Info("Starting workers")
	for i := 0; i < workers; i++ {
		go wait.Until(c.worker, c.workerLoopPeriod, stopCh)
	}

Member:

> I didn't move c.startupDone=true to after we start the workers, because I'm a bit paranoid that we might miss node tracker events in the teeny tiny time frame of the for loop itself

ack, that makes sense, but there is also the reverse scenario where for some nodes we get the add/update in the tiny window where done = true but the workers are not ready yet because the loop hasn't finished; I think that should be negligible, though.

go-controller/pkg/util/sync.go (resolved review thread)
@tssurya (Member) commented Feb 7, 2024

the two flakes in CI are known issues I linked above

@tssurya (Member) left a comment:

both commits lgtm except for the nits I outlined

return fmt.Errorf("error syncing cache")
// We need the node tracker to be synced first, as we rely on it to properly reprogram initial per node load balancers
klog.Info("Waiting for node tracker handler to sync")
if !util.WaitForHandlerSyncWithTimeout(nodeControllerName, stopCh, 20*time.Second, nodeHandler.HasSynced) {
Member:

nit: don't you want to use HandlerSyncTimeout here instead of doing 20*time.Second?

@ricky-rav (Contributor, author):

oops, I defined it and I wasn't using it (:

if !util.WaitForNamedCacheSyncWithTimeout(controllerName, stopCh, svcHandler.HasSynced, endpointHandler.HasSynced) {
return fmt.Errorf("error syncing cache")
klog.Info("Waiting for service and endpoint handlers to sync")
if !util.WaitForHandlerSyncWithTimeout(controllerName, stopCh, 20*time.Second, svcHandler.HasSynced, endpointHandler.HasSynced) {
Member:

nit: don't you want to use HandlerSyncTimeout here instead of doing 20*time.Second?

The informer sync consists of completing a LIST operation for the informer cache for a given object type, while the handler sync consists of processing all the initial ADD operations we get at startup for a given event handler. Let's have WaitForHandlerSyncWithTimeout for the handler sync and WaitForInformerCacheSyncWithTimeout for the informer sync.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
When we execute Run() in the service controller code, we wait on purpose for all node tracker add operations to be done, before we set up and start the service and endpoint handlers. Now, every node tracker add operation runs RequestFullSync, which (1) updates OVN LB templates on NBDB; (2) for each service, adds the service key to the service handler queue so that the service can be processed again.
This last part is necessary to recompute services when a node is added, updated or deleted, but at startup the service handler hasn't been started yet, so we're just adding the same service keys to the same queue for each node in the cluster, with no worker consuming the keys.

This proved to take a significantly long time in large clusters: if adding a key to the queue takes 8 * 10^(-6) seconds, then with 11500 services and 250 nodes this takes in total 8 * 10^(-6) * 11500 * 250 = 23 seconds.

Soon afterwards in Run() we set up the service and endpointslice handlers and wait for them to process all add operations: the service handler queue is filled with all existing services at this point anyway, and the workers are started so that we can finally process each service in the queue.

Let's optimize this and use a simple flag to prevent the node tracker from adding all service keys at startup.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
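As a quick sanity check of the estimate in the commit message above (illustrative only; the per-enqueue cost is the rough figure quoted there, not a measured constant):

```go
package main

import "fmt"

func main() {
	const (
		perEnqueueSeconds = 8e-6  // rough cost of adding one key to the queue, as quoted above
		services          = 11500 // services in the scale cluster
		nodes             = 250   // nodes in the scale cluster
	)
	// Before this change, every node's initial RequestFullSync re-enqueued every
	// service key even though no worker was consuming the queue yet.
	wasted := perEnqueueSeconds * services * nodes
	fmt.Printf("wasted startup time: ~%.0f seconds\n", wasted) // ~23 seconds
}
```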
@ricky-rav ricky-rav force-pushed the OCPBUGS-20336_increaseTimeout_upstream branch from 8dc6cd7 to f7b7583 Compare February 7, 2024 10:36
@tssurya (Member) left a comment:

latest changes LGTM.
d/s scale test has been done (openshift/ovn-kubernetes#2014); no disruptions or anything significant found yet.
Once the CI runs here on the latest, I will merge this.
Thanks @ricky-rav for the detailed commit message, which made it easier for me to understand, and for your awesome troubleshooting here.

@tssurya (Member) commented Feb 7, 2024

New flake in the external-gateway lane, not related to this PR: #4139
Known flake in the kv live migration lane; the rest are green: #3986

@tssurya tssurya merged commit 6e1ca6f into ovn-org:master Feb 7, 2024
28 of 30 checks passed
@openshift-merge-robot

Fix included in accepted release 4.16.0-0.nightly-2024-02-17-013806
