Skip to content

Commit

Permalink
Do not resync services during node tracker startup
Browse files Browse the repository at this point in the history
When we execute Run() in the service controller code, we wait on purpose for all node tracker add operations to be done, before we set up and start the service and endpoint handlers. Now, every node tracker add operation runs RequestFullSync, which (1) updates OVN LB templates on NBDB; (2) for each service, adds the service key to the service handler queue so that the service can be processed again.
This last part is necessary to recompute services when a node is added, updated or deleted, but at startup the service handler hasn't been started yet, so we're just adding the same service keys to the same queue for each node in the cluster, with no worker consuming the keys.

This proved to take a significantly long time in large clusters: if adding a key to the queue takes 8 * 10**(-6) seconds, when we have 11500 services and 250 nodes, this takes in total  8 * 10**(-6) * 11500 * 250 = 23 seconds.

Soon afterwards, in Run() we setup service and endpointslice handlers and we wait for them to process all add operations: the service handler queue is filled with all existing services anyway and workers are started so that we can finally process each service in the queue.

Let's optimize this and use a simple flag to avoid adding all service keys during node tracker startup.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
  • Loading branch information
ricky-rav committed Jan 30, 2024
1 parent ca8b2d7 commit d03233d
Show file tree
Hide file tree
Showing 2 changed files with 15 additions and 6 deletions.
3 changes: 2 additions & 1 deletion go-controller/pkg/ovn/controller/services/node_tracker.go
Original file line number Diff line number Diff line change
Expand Up @@ -196,7 +196,8 @@ func (nt *nodeTracker) updateNodeInfo(nodeName, switchName, routerName, chassisI

nt.nodes[nodeName] = ni

// Resync all services
// Resync all services unless we're processing the initial sync
// (in which case, the service add will happen at the next step in controller services Run())
nt.resyncFn(nt.getZoneNodes())
}

Expand Down
18 changes: 13 additions & 5 deletions go-controller/pkg/ovn/controller/services/services_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -126,6 +126,9 @@ type Controller struct {
// nodeTracker
nodeTracker *nodeTracker

// nodeTrackerInitialSyncDone is false up until the initial node tracker sync at startup is done
nodeTrackerInitialSyncDone bool

// Per node information and template variables. The latter expand to each
// chassis' node IP (v4 and v6).
// Must be accessed only with the nodeInfo mutex taken.
Expand Down Expand Up @@ -168,11 +171,15 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGr
return err
}
// We need node cache to be synced first, as we rely on it to properly reprogram initial per node load balancers
// No need to sync services here, since workers aren't up yet.
klog.Info("Waiting for node tracker caches to sync")
c.nodeTrackerInitialSyncDone = false
if !util.WaitForHandlerSyncWithTimeout(nodeControllerName, stopCh, nodeHandler.HasSynced) {
return fmt.Errorf("error syncing cache")
return fmt.Errorf("error syncing node tracker handler")
}

c.nodeTrackerInitialSyncDone = true

klog.Info("Setting up event handlers for services")
svcHandler, err := c.serviceInformer.Informer().AddEventHandler(factory.WithUpdateHandlingForObjReplace(cache.ResourceEventHandlerFuncs{
AddFunc: c.onServiceAdd,
Expand All @@ -196,7 +203,7 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGr
// Wait for the caches to be synced
klog.Info("Waiting for service and endpoint caches to sync")
if !util.WaitForHandlerSyncWithTimeout(controllerName, stopCh, svcHandler.HasSynced, endpointHandler.HasSynced) {
return fmt.Errorf("error syncing cache")
return fmt.Errorf("error syncing service and endpoint handlers")
}

if runRepair {
Expand Down Expand Up @@ -518,9 +525,10 @@ func (c *Controller) RequestFullSync(nodeInfos []nodeInfo) {
klog.Errorf("Cached lister failed!? %v", err)
return
}

for _, service := range services {
c.onServiceAdd(service)
if c.nodeTrackerInitialSyncDone {
for _, service := range services {
c.onServiceAdd(service)
}
}
}

Expand Down

0 comments on commit d03233d

Please sign in to comment.