Do not resync services during node tracker startup
When we execute Run() in the service controller code, we deliberately wait for all node tracker add operations to complete before setting up and starting the service and endpoint handlers. Every node tracker add operation runs RequestFullSync, which (1) updates the OVN load balancer templates in NBDB, and (2) adds each service key to the service handler queue so that the service can be processed again.
This second step is necessary to recompute services whenever a node is added, updated or deleted, but at startup the service handler hasn't been started yet, so we're just adding the same service keys to the same queue once per node in the cluster, with no worker consuming the keys.

This proved to take a significant amount of time in large clusters: if adding a key to the queue takes 8 * 10**(-6) seconds, then with 11500 services and 250 nodes it takes 8 * 10**(-6) * 11500 * 250 = 23 seconds in total.
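As a quick sanity check of that estimate (the 8 µs cost per queue add is the figure assumed above, not a measured constant):

```go
package main

import "fmt"

func main() {
	const perEnqueueSeconds = 8e-6 // assumed cost of a single queue add
	const services = 11500
	const nodes = 250
	// Every service key is enqueued once per node during startup.
	total := perEnqueueSeconds * services * nodes
	fmt.Printf("%g seconds\n", total) // prints "23 seconds"
}
```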

Soon afterwards in Run() we set up the service and endpointslice handlers and wait for them to process all add operations: the service handler queue gets filled with all existing services anyway, and workers are started so that each service in the queue can finally be processed.

Let's optimize this and use a simple flag to avoid adding all service keys during node tracker startup.
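The flag pattern boils down to an RWMutex-guarded boolean that RequestFullSync checks before re-enqueueing services. A minimal standalone sketch of the idea (simplified and with illustrative names; not the actual controller code, which lives in the diff below):

```go
package main

import (
	"fmt"
	"sync"
)

// controller sketches the flag pattern: skip the per-service resync
// until the initial node tracker sync is done, since no workers are
// consuming the queue yet at that point.
type controller struct {
	initialSyncDone bool
	initialSyncLock sync.RWMutex
	queue           []string // stands in for the service handler workqueue
}

func (c *controller) markInitialSyncDone() {
	c.initialSyncLock.Lock()
	defer c.initialSyncLock.Unlock()
	c.initialSyncDone = true
}

// requestFullSync mirrors RequestFullSync: node info would always be
// resynced here; only the per-service re-enqueue is guarded by the flag.
// It returns how many service keys were enqueued.
func (c *controller) requestFullSync(services []string) int {
	c.initialSyncLock.RLock()
	defer c.initialSyncLock.RUnlock()
	if !c.initialSyncDone {
		return 0 // startup: Run() will enqueue all services later anyway
	}
	c.queue = append(c.queue, services...)
	return len(services)
}

func main() {
	c := &controller{}
	svcs := []string{"default/kubernetes", "ns/app"}
	fmt.Println(c.requestFullSync(svcs)) // 0: initial sync not done yet
	c.markInitialSyncDone()
	fmt.Println(c.requestFullSync(svcs)) // 2: services enqueued
}
```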

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
ricky-rav committed Feb 1, 2024
1 parent 5a93686 commit 66f32e8
Showing 1 changed file with 26 additions and 10 deletions.
go-controller/pkg/ovn/controller/services/services_controller.go
@@ -126,6 +126,10 @@ type Controller struct {
 	// nodeTracker
 	nodeTracker *nodeTracker
 
+	// nodeTrackerInitialSyncDone is false up until the initial node tracker sync at startup is done
+	nodeTrackerInitialSyncDone     bool
+	nodeTrackerInitialSyncDoneLock sync.RWMutex
+
 	// Per node information and template variables. The latter expand to each
 	// chassis' node IP (v4 and v6).
 	// Must be accessed only with the nodeInfo mutex taken.
@@ -168,10 +172,17 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGr
 		return err
 	}
 	// We need node cache to be synced first, as we rely on it to properly reprogram initial per node load balancers
+	// No need to sync services here, since workers aren't up yet.
 	klog.Info("Waiting for node tracker caches to sync")
+	c.nodeTrackerInitialSyncDoneLock.Lock()
+	c.nodeTrackerInitialSyncDone = false
+	c.nodeTrackerInitialSyncDoneLock.Unlock()
 	if !util.WaitForHandlerSyncWithTimeout(nodeControllerName, stopCh, 20*time.Second, nodeHandler.HasSynced) {
-		return fmt.Errorf("error syncing cache")
+		return fmt.Errorf("error syncing node tracker handler")
 	}
+	c.nodeTrackerInitialSyncDoneLock.Lock()
+	c.nodeTrackerInitialSyncDone = true
+	c.nodeTrackerInitialSyncDoneLock.Unlock()
 
 	klog.Info("Setting up event handlers for services")
 	svcHandler, err := c.serviceInformer.Informer().AddEventHandler(factory.WithUpdateHandlingForObjReplace(cache.ResourceEventHandlerFuncs{
@@ -195,7 +206,7 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGr
 
 	klog.Info("Waiting for service and endpoint handlers to sync")
 	if !util.WaitForHandlerSyncWithTimeout(controllerName, stopCh, 20*time.Second, svcHandler.HasSynced, endpointHandler.HasSynced) {
-		return fmt.Errorf("error syncing cache")
+		return fmt.Errorf("error syncing service and endpoint handlers")
 	}
 
 	if runRepair {
@@ -511,15 +522,20 @@ func (c *Controller) RequestFullSync(nodeInfos []nodeInfo) {
 	// Resync node infos and node IP templates.
 	c.syncNodeInfos(nodeInfos)
 
-	// Resync services.
-	services, err := c.serviceLister.List(labels.Everything())
-	if err != nil {
-		klog.Errorf("Cached lister failed!? %v", err)
-		return
-	}
-
-	for _, service := range services {
-		c.onServiceAdd(service)
+	// Resync all services unless we're processing the initial node tracker sync (in which case
+	// the service add will happen at the next step in the controller services Run())
+	c.nodeTrackerInitialSyncDoneLock.RLock()
+	defer c.nodeTrackerInitialSyncDoneLock.RUnlock()
+	if c.nodeTrackerInitialSyncDone {
+		services, err := c.serviceLister.List(labels.Everything())
+		if err != nil {
+			klog.Errorf("Cached lister failed!? %v", err)
+			return
+		}
+
+		for _, service := range services {
+			c.onServiceAdd(service)
+		}
 	}
 }
 
