Do not resync services during node tracker startup
When we execute Run() in the services controller, we deliberately wait for all node tracker add operations to complete before we set up and start the service and endpointslice handlers. Every node tracker add operation runs RequestFullSync, which (1) updates the OVN LB templates in NBDB and (2) adds each service key to the service handler queue so that the service can be processed again.
This second step is necessary to recompute services when a node is added, updated or deleted, but at startup the service handler hasn't been started yet, so we end up adding the same service keys to the same queue once per node in the cluster, with no worker consuming them.

This proved to take a significant amount of time in large clusters: if adding a key to the queue takes 8 * 10^(-6) seconds, then with 11500 services and 250 nodes these resyncs take 8 * 10^(-6) * 11500 * 250 = 23 seconds in total.
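
To make the arithmetic concrete, here is a rough, self-contained sketch. The node and service counts and the 8 * 10^(-6) s per-add figure are the ones quoted above; the client-go work queue only stands in for the service handler queue and is an assumption about its behavior, not the controller's actual type.

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	const (
		nodes      = 250
		services   = 11500
		perAddCost = 8e-6 // seconds per queue.Add, the figure quoted above
	)

	// A deduplicating work queue stands in for the service handler queue.
	queue := workqueue.New()

	for n := 0; n < nodes; n++ { // one RequestFullSync per node tracker add at startup
		for s := 0; s < services; s++ {
			queue.Add(fmt.Sprintf("svc-%d", s)) // the same keys get re-added for every node
		}
	}

	// No worker consumes the queue yet, so all this work bought us is the cost of the Add calls:
	// the queue itself deduplicates and only ever holds each key once.
	fmt.Printf("Add calls: %d (~%.0f s at %g s each); unique keys queued: %d\n",
		nodes*services, float64(nodes*services)*perAddCost, perAddCost, queue.Len())
}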

Soon afterwards in Run() we set up the service and endpointslice handlers and wait for them to process all add operations: at that point the service handler queue is filled with all existing services anyway, and the workers are started so that each service in the queue is finally processed.

Let's optimize this with a simple flag that prevents the node tracker from re-queueing all service keys at startup.
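
For reference, here is a minimal standalone sketch of that flag-plus-RWMutex pattern; the type, field and method names are hypothetical stand-ins, not the actual ovn-kubernetes controller (the real change is in the diff further down). Callbacks only take a read lock on the flag, while Run() is the single writer.

package main

import (
	"fmt"
	"sync"
)

// controller is a hypothetical stand-in for the services controller.
type controller struct {
	startupDone     bool
	startupDoneLock sync.RWMutex
	enqueued        int
}

// requestFullSync mimics the node tracker callback: until startup is done it
// skips the per-service re-queue, because Run() will enqueue everything later
// and no worker is running yet anyway.
func (c *controller) requestFullSync(services []string) {
	c.startupDoneLock.RLock()
	defer c.startupDoneLock.RUnlock()
	if !c.startupDone {
		return
	}
	c.enqueued += len(services)
}

func main() {
	c := &controller{}
	services := []string{"default/kubernetes", "openshift-dns/dns-default"}

	c.requestFullSync(services) // during startup: a no-op
	fmt.Println("enqueued during startup:", c.enqueued)

	c.startupDoneLock.Lock()
	c.startupDone = true // Run() flips the flag once the initial sync is complete
	c.startupDoneLock.Unlock()

	c.requestFullSync(services) // after startup: keys are re-queued as before
	fmt.Println("enqueued after startup:", c.enqueued)
}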

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
ricky-rav committed Feb 5, 2024 · 1 parent 8e885e0 · commit 8dc6cd7
1 changed file: go-controller/pkg/ovn/controller/services/services_controller.go (26 additions, 8 deletions)
@@ -126,6 +126,11 @@ type Controller struct {
 	// nodeTracker
 	nodeTracker *nodeTracker
 
+	// startupDone is false up until the node tracker, service and endpointslice initial sync
+	// in Run() is completed
+	startupDone     bool
+	startupDoneLock sync.RWMutex
+
 	// Per node information and template variables. The latter expand to each
 	// chassis' node IP (v4 and v6).
 	// Must be accessed only with the nodeInfo mutex taken.
@@ -169,6 +174,9 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGr
 	}
 	// We need the node tracker to be synced first, as we rely on it to properly reprogram initial per node load balancers
 	klog.Info("Waiting for node tracker handler to sync")
+	c.startupDoneLock.Lock()
+	c.startupDone = false
+	c.startupDoneLock.Unlock()
 	if !util.WaitForHandlerSyncWithTimeout(nodeControllerName, stopCh, 20*time.Second, nodeHandler.HasSynced) {
 		return fmt.Errorf("error syncing node tracker handler")
 	}
@@ -209,6 +217,10 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}, runRepair, useLBGr
 		return fmt.Errorf("error initializing alreadyApplied cache: %w", err)
 	}
 
+	c.startupDoneLock.Lock()
+	c.startupDone = true
+	c.startupDoneLock.Unlock()
+
 	// Start the workers after the repair loop to avoid races
 	klog.Info("Starting workers")
 	for i := 0; i < workers; i++ {
@@ -511,15 +523,21 @@ func (c *Controller) RequestFullSync(nodeInfos []nodeInfo) {
 	// Resync node infos and node IP templates.
 	c.syncNodeInfos(nodeInfos)
 
-	// Resync services.
-	services, err := c.serviceLister.List(labels.Everything())
-	if err != nil {
-		klog.Errorf("Cached lister failed!? %v", err)
-		return
-	}
+	// Resync all services unless we're processing the initial node tracker sync (in which case
+	// the service add will happen at the next step in the services controller Run() and workers
+	// aren't up yet anyway: no need to do it during node tracker startup then)
+	c.startupDoneLock.RLock()
+	defer c.startupDoneLock.RUnlock()
+	if c.startupDone {
+		services, err := c.serviceLister.List(labels.Everything())
+		if err != nil {
+			klog.Errorf("Cached lister failed!? %v", err)
+			return
+		}
 
-	for _, service := range services {
-		c.onServiceAdd(service)
+		for _, service := range services {
+			c.onServiceAdd(service)
+		}
 	}
 }
