Ensure ovnkube-controller does not go remote->local

At ovnkube-controller startup, it may attempt to start adding resources
for a node that does not yet have its zone annotation but should be
considered local. This is especially the case in OpenShift when going
from a phase 1 -> phase 2 upgrade, where ovnkube-node on the node has
not yet annotated its zone ID. In this case ovnkube-controller starts
too early and programs NBDB resources for the node as remote. Then,
once the node is annotated, it reprograms those NBDB resources as
local.
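
For illustration, here is a minimal sketch of how a node's zone might be
derived from its annotation. The annotation key, default zone value, and
helper name are assumptions made for this example, not taken from this
commit:

package example

import (
	corev1 "k8s.io/api/core/v1"
)

const (
	// Assumed annotation key; ovnkube-node records the node's zone in a
	// node annotation, but the exact key is not shown in this commit.
	zoneAnnotation = "k8s.ovn.org/zone-name"
	// Assumed fallback zone for a node that has not been annotated yet.
	defaultZone = "global"
)

// nodeZone returns the zone recorded on the node, or the assumed default
// when the annotation has not been written yet, which is the window this
// commit guards against: a node that should be local still looks remote.
func nodeZone(node *corev1.Node) string {
	if zone, ok := node.Annotations[zoneAnnotation]; ok {
		return zone
	}
	return defaultZone
}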

Since we cannot be sure that all remote feature configuration in NBDB is
removed when going from remote -> local, we need to avoid this case. It
also makes no sense to program NBDB when there are no nodes in the zone.

This commit adds a wait to make sure at least one node in the cluster
is in the ovnkube-controller's managed zone.
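
The wait relies on wait.PollUntilContextTimeout from k8s.io/apimachinery,
as the diff below shows. A minimal standalone sketch of that pattern, with
a hypothetical zoneHasNode condition standing in for the informer lookup
used in the actual change:

package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForZoneNode polls until zoneHasNode reports true or the timeout
// expires. Returning (false, nil) from the condition keeps the poll
// retrying; a non-nil error would abort it early, which is why transient
// lookup failures are swallowed and retried.
func waitForZoneNode(ctx context.Context, timeout time.Duration, zoneHasNode func() (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 250*time.Millisecond, timeout, true,
		func(ctx context.Context) (bool, error) {
			found, err := zoneHasNode()
			if err != nil {
				// Transient error from the lookup: keep polling instead of failing.
				return false, nil
			}
			return found, nil
		})
}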

Thanks @numansiddique for the idea.

Signed-off-by: Tim Rozet <trozet@redhat.com>
trozet authored and jcaamano committed Jul 6, 2023
1 parent 926a1dc commit a3dd7de
Showing 1 changed file with 30 additions and 6 deletions.
@@ -312,7 +312,6 @@ func (cm *networkControllerManager) Start(ctx context.Context) error {
 		if err1 != nil {
 			return false, nil
 		}
-
 		if config.Default.Zone != zone {
 			err1 = fmt.Errorf("config zone %s different from NBDB zone %s", config.Default.Zone, zone)
 			return false, nil
@@ -326,6 +325,36 @@ func (cm *networkControllerManager) Start(ctx context.Context) error {
 	}
 	klog.Infof("NBDB zone sync took: %s", time.Since(start))
 
+	err = cm.watchFactory.Start()
+	if err != nil {
+		return err
+	}
+
+	// Wait for one node to have the zone we want to manage, otherwise there is no point in configuring NBDB.
+	// Really this covers a use case where a node is going from local -> remote, but has not yet annotated itself.
+	// In this case ovnkube-controller on this remote node will treat the node as remote, and then once the annotation
+	// appears will convert it to local, which may or may not clean up DB resources correctly.
+	klog.Infof("Waiting up to %s for a node to have %q zone", maxTimeout, config.Default.Zone)
+	start = time.Now()
+	err = wait.PollUntilContextTimeout(ctx, 250*time.Millisecond, maxTimeout, true, func(ctx context.Context) (bool, error) {
+		nodes, err := cm.watchFactory.GetNodes()
+		if err != nil {
+			klog.Errorf("Unable to get nodes from informer while waiting for node zone sync")
+			return false, nil
+		}
+		for _, node := range nodes {
+			if util.GetNodeZone(node) == config.Default.Zone {
+				return true, nil
+			}
+		}
+		return false, nil
+	})
+	if err != nil {
+		return fmt.Errorf("failed to start default network controller - while waiting for any node to have zone: %q, error: %v",
+			config.Default.Zone, err)
+	}
+	klog.Infof("Waiting for node in zone sync took: %s", time.Since(start))
+
 	cm.configureMetrics(cm.stopChan)
 
 	err = cm.configureSCTPSupport()
@@ -348,11 +377,6 @@ func (cm *networkControllerManager) Start(ctx context.Context) error {
 	}
 	cm.podRecorder.Run(cm.sbClient, cm.stopChan)
 
-	err = cm.watchFactory.Start()
-	if err != nil {
-		return err
-	}
-
 	err = cm.initDefaultNetworkController()
 	if err != nil {
 		return fmt.Errorf("failed to init default network controller: %v", err)
