Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release-4.14] OCPBUGS-26573: Improve troubleshooting IC upgrades #2076

Merged
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
51 changes: 25 additions & 26 deletions pkg/network/ovn_kubernetes.go
Original file line number Diff line number Diff line change
Expand Up @@ -425,10 +425,10 @@ func renderOVNKubernetes(conf *operv1.NetworkSpec, bootstrapResult *bootstrap.Bo
// non-default, internal use only; this is selected in the first phase of an upgrade from a
// non-interconnect version (< 4.14) to an interconnect version (>= 4.14)
manifestSubDir = filepath.Join(manifestSubDirBasePath, "single-zone-interconnect")
} else if isMigrationToMultiZoneAboutToStartOrOngoing(bootstrapResult.OVN, &targetZoneMode) {
} else if !targetZoneMode.fastForwardToMultiZone && isMigrationToMultiZoneAboutToStartOrOngoing(bootstrapResult.OVN, &targetZoneMode) {
// intermediate step when converting from single zone to multizone; this is selected
// in the second phase of an upgrade from a non-interconnect version (< 4.14)
// to an interconnect version (>= 4.14)
// to an interconnect version (>= 4.14). Skipped when fastForwardToMultiZone is set.
manifestSubDir = filepath.Join(manifestSubDirBasePath, "multi-zone-interconnect-tmp")
}
klog.Infof("render YAMLs from %s folder", manifestSubDir)
Expand Down Expand Up @@ -1258,16 +1258,19 @@ const (
type targetZoneModeType struct {
// zoneMode indicates the target zone mode that CNO is supposed to converge to.
zoneMode InterConnectZoneMode
// "temporary", if true, indicates that the target zone mode is only temporary;
// it is used along with zoneMode=singlezone in order to temporarily switch to single-zone mode when upgrading
// from versions with no interconnect support (<=4.13). It has no effect if zoneMode=multizone.
temporary bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this variable is now removed in 4.14.7 or whatever z stream this lands in right?
how will the new upgrades from nonIC to 4.14.z where this change lands look like? I see
remove the temporary field from the configmap, since the ongoing-upgrade field already tracks that CNO is going through an upgrade in the PR message so that means we no longer need to know where the phase is temporary or not, just need to know if upgrade is ongoing or not? why did we need the temporary flag originally? seems like it was redundant with ongoingUpgrade?

I guess my TL;DR question is is there a difference between going from 4.13.0 to 4.14.0 and 4.13.0 to 4.14.z where this lands functionally for the end user if this field is going away? Do we need to make updates to the original enhancement?

Copy link
Contributor Author

@ricky-rav ricky-rav Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could have simplified the logic earlier on, but I wanted to leave a window open in case we were determined to allow single zone...

Here's how it went.

The temporary field was how I envisaged to distinguish between being permanently in single zone (temporary=false) or just momentarily during the upgrade to 4.14 (temporary=true). The idea was that a configmap would track an upgrade to IC or a voluntary switch to single zone.

Then I needed another flag in the configmap, ongoing-upgrade, so that the CNO status manager could know whether an upgrade to IC was ongoing: it would only report version=4.14 when the upgrade was done, which was signaled by ongoingUpgrade getting removed from the configmap.

After 4.14 GA, we decided not to enable switching back to single zone , so now we can finally get rid of temporary.

Well, I told a nice story, but with hindsight temporary was indeed redundant also before :-D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To finish answering your questions, the phases in the upgrade are exactly the same as before, there's no change in behaviour. I can take care of updating the enhancement, once this PR merges :)

// "configMapFound" indicates whether the interconnect configmap was found; when not found,
// the zone mode defaults to multizone.
configMapFound bool
// ongoingUpgrade is true when the configmap was pushed by CNO itself; it is used by status manager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to make updates to what ongoingUpgrade means? It seems like its also taking over the role that "temporary" was doing...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since we've decided that we're not allowing a cluster to be permanently in single zone, I might as well remove "ongoingUpgrade" and just refer to configMapFound, since only CNO is supposed to push the configmap now... let me try it out!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack did you try it? all ok on this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I retested everything after pushing the current changes: #2076 (comment)

// to know it has to keep reporting the old version when this parameter is set.
ongoingUpgrade bool
// fastForwardToMultiZone can be manually set by the cluster admin through the interconnect configmap
// in order to force CNO to deploy multizone OVNK regardless of the current status of OVNK components
// throughout an upgrade to interconnect. This would effectively get the cluster out of phase 1 and phase 2
// and make OVNK jump to the YAMLs in the multi-zone-interconnect folder.
// This can be useful in case of problems during upgrades, if ever the logic in prepareUpgradeToInterconnect
// is not moving forward (e.g. one or more nodes are down and ovnk node is considered as "progressing")
fastForwardToMultiZone bool
}

// getTargetInterConnectZoneMode determines the desired interconnect zone mode for the cluster.
Expand Down Expand Up @@ -1302,15 +1305,19 @@ func getTargetInterConnectZoneMode(kubeClient cnoclient.Client) (targetZoneModeT
targetZoneMode.zoneMode = zoneModeMultiZone
}

if temporaryFromConfigMap, ok := interConnectConfigMap.Data["temporary"]; ok {
targetZoneMode.temporary = strings.ToLower(temporaryFromConfigMap) == "true"
}

if _, ok := interConnectConfigMap.Data["ongoing-upgrade"]; ok {
targetZoneMode.ongoingUpgrade = true
}

klog.Infof("target zone mode: %+v", targetZoneMode)
if _, ok := interConnectConfigMap.Data["fast-forward-to-multizone"]; ok {
targetZoneMode.fastForwardToMultiZone = true
if targetZoneMode.zoneMode != zoneModeMultiZone {
klog.Warningf("Forcing interconnect zone mode to multizone due to 'fast-forward-to-multizone' being set")
targetZoneMode.zoneMode = zoneModeMultiZone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is sort of a disruptive change in case ongoingUpgrade=true and someone sets this right? example, single zone is rolling out and stuck and now we set fastForwardToMultiZone to true to unblock and move forward with multizone.. will the single zone pods be left behind or will cleanup happen?

Also I guess we can add something to the PR description or enhancement update around this option to say "at your own risk - to be used only with guidance from eng or something"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, using the new flag is disruptive: CNO will push the final multizone YAMLs regardless of where the cluster is during the 4.13->4.14 upgrade. After this PR merges, my idea was to write a document about this for support. Either that or I update the enhancement...

As for the single-zone pods, they will be removed as a consequence of the new ones being pushed.

}
}

klog.Infof("interconnect target zone mode: %+v", targetZoneMode)

return targetZoneMode, nil
}
Expand Down Expand Up @@ -2389,7 +2396,7 @@ func doesVersionEnableInterConnect(string) bool {
// If we're upgrading from a 4.13 cluster, which has no OVN InterConnect support, three phases are necessary.
// Phase 1:
//
// a) prepareUpgradeToInterConnect pushes a configMap with zone-mode=singlezone, temporary=true;
// a) prepareUpgradeToInterConnect pushes a configMap with zone-mode=singlezone, ongoing-upgrade='';
// b) renderOVNKubernetes selects the YAMLs from the single-zone folder
// (node DaemonSet, master DaemonSet [StatefulSet for hypershift], ovnkube-sbdb route [hypershift]);
// c) shouldUpdateOVNKonUpgrade rolls out first node DaemonSet then master DaemonSet/StatefulSet in single-zone mode
Expand All @@ -2398,7 +2405,7 @@ func doesVersionEnableInterConnect(string) bool {
// Phase 2tmp:
//
// a) Master and Node DaemonSet/StatefulSet are now single zone, so prepareUpgradeToInterConnect overrides the configMap
// with zone-mode=multizone, temporary=false;
// with zone-mode=multizone;
// b) renderOVNKubernetes selects the YAMLs from the multi-zone-tmp folder (node DaemonSet, master DaemonSet/StatefulSet,
// control plane deployment, ovnkube-sbdb route [hypershift]);
// c) shouldUpdateOVNKonInterConnectZoneModeChange applies a zone mode change (single->multi) by rolling out
Expand Down Expand Up @@ -2436,7 +2443,6 @@ func prepareUpgradeToInterConnect(ovn bootstrap.OVNBootstrapResult, client cnocl
},
Data: map[string]string{
"zone-mode": fmt.Sprint(zoneModeSingleZone),
"temporary": "true",
// we lookup ongoing-upgrade in SetFromPods (pod_status.go) to distinguish a
// configmap pushed during a non-IC-> IC upgrade from one that is added
// outside of an upgrade by a cluster admin to change the zone mode.
Expand All @@ -2448,12 +2454,11 @@ func prepareUpgradeToInterConnect(ovn bootstrap.OVNBootstrapResult, client cnocl
}
targetZoneMode.configMapFound = true
targetZoneMode.zoneMode = zoneModeSingleZone
targetZoneMode.temporary = true

} else if isMigrationToMultiZoneAboutToStart(ovn, targetZoneMode, true) && targetZoneMode.temporary && targetZoneMode.ongoingUpgrade {
} else if isMigrationToMultiZoneAboutToStart(ovn, targetZoneMode, true) && targetZoneMode.ongoingUpgrade && !targetZoneMode.fastForwardToMultiZone {
// [start of phase 2]
// if node and master DaemonSets have already upgraded to >= 4.14 single zone and
// we previously pushed a configmap for temporary single zone,
// we previously pushed a configmap for single zone,
// patch the configmap and proceed with the migration to multizone.
klog.Infof("Upgrade to interconnect, start of phase2: patching IC configmap for multizone")

Expand All @@ -2463,11 +2468,6 @@ func prepareUpgradeToInterConnect(ovn bootstrap.OVNBootstrapResult, client cnocl
"path": "/data/zone-mode",
"value": fmt.Sprint(zoneModeMultiZone),
},
{
"op": "replace",
"path": "/data/temporary",
"value": "false",
},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm maybe stupid question, but why isn't ongoing-upgrade also patched?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's set already at the beginning of the upgrade, in the first "if" block inside prepareUpgradeToInterConnect:


patchBytes, err := json.Marshal(patch)
Expand All @@ -2481,9 +2481,8 @@ func prepareUpgradeToInterConnect(ovn bootstrap.OVNBootstrapResult, client cnocl
}

targetZoneMode.zoneMode = zoneModeMultiZone
targetZoneMode.temporary = false

} else if isMigrationToMultiZoneComplete(ovn, targetZoneMode) && !targetZoneMode.temporary && targetZoneMode.ongoingUpgrade {
} else if isMigrationToMultiZoneComplete(ovn, targetZoneMode) && targetZoneMode.ongoingUpgrade {
// [after completion of phase 2]
// daemonsets have rolled out in multizone mode
// Remove the configmap: this won't trigger any further roll out, but along with the annotation
Expand All @@ -2507,12 +2506,12 @@ func prepareUpgradeToInterConnect(ovn bootstrap.OVNBootstrapResult, client cnocl

// Print IC upgrade status when phase 1 or phase 2 are ongoing
if targetZoneMode.ongoingUpgrade {
if targetZoneMode.zoneMode == zoneModeSingleZone && targetZoneMode.temporary &&
if targetZoneMode.zoneMode == zoneModeSingleZone &&
ovn.NodeUpdateStatus != nil && ovn.MasterUpdateStatus != nil && ovn.ControlPlaneUpdateStatus == nil &&
(ovn.NodeUpdateStatus.Progressing || ovn.MasterUpdateStatus.Progressing) {
klog.Warningf("Upgrade to interconnect, phase1 is ongoing")

} else if isMigrationToMultiZoneOngoing(ovn, targetZoneMode) && !targetZoneMode.temporary {
} else if isMigrationToMultiZoneOngoing(ovn, targetZoneMode) {
klog.Warningf("Upgrade to interconnect, phase2 is ongoing")
}

Expand Down