New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release-4.14] OCPBUGS-26573: Improve troubleshooting IC upgrades #2076
[release-4.14] OCPBUGS-26573: Improve troubleshooting IC upgrades #2076
Conversation
Skipping CI for Draft Pull Request. |
/retest |
/retest |
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com> (cherry picked from commit 77bff544f331e14db3b7a4b5245af335c5d16a19)
The "temporary" field became redundant when "ongoing-upgrade" was added to the IC upgrade logic. Let's only keep the "ongoing-upgrade" to keep track of a configmap pushed by CNO for IC upgrade. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com> (cherry picked from commit f3a4bff911baa5a0d33780e2cf4347712c3e6026)
a51f388
to
ab2547c
Compare
/retest |
@ricky-rav: This pull request references SDN-4154 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest |
1 similar comment
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall logic is clear where you are trying to add a fast-forward knob. Left some inline questions...
Do we need some unit tests?
Did you do live testing and results were ok? (asking because we are kinda going in blind here...)
// "temporary", if true, indicates that the target zone mode is only temporary; | ||
// it is used along with zoneMode=singlezone in order to temporarily switch to single-zone mode when upgrading | ||
// from versions with no interconnect support (<=4.13). It has no effect if zoneMode=multizone. | ||
temporary bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this variable is now removed in 4.14.7 or whatever z stream this lands in right?
how will the new upgrades from nonIC to 4.14.z where this change lands look like? I see
remove the temporary field from the configmap, since the ongoing-upgrade field already tracks that CNO is going through an upgrade
in the PR message so that means we no longer need to know where the phase is temporary or not, just need to know if upgrade is ongoing or not? why did we need the temporary
flag originally? seems like it was redundant with ongoingUpgrade?
I guess my TL;DR question is is there a difference between going from 4.13.0 to 4.14.0 and 4.13.0 to 4.14.z where this lands functionally for the end user if this field is going away? Do we need to make updates to the original enhancement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could have simplified the logic earlier on, but I wanted to leave a window open in case we were determined to allow single zone...
Here's how it went.
The temporary
field was how I envisaged to distinguish between being permanently in single zone (temporary=false) or just momentarily during the upgrade to 4.14 (temporary=true). The idea was that a configmap would track an upgrade to IC or a voluntary switch to single zone.
Then I needed another flag in the configmap, ongoing-upgrade
, so that the CNO status manager could know whether an upgrade to IC was ongoing: it would only report version=4.14 when the upgrade was done, which was signaled by ongoingUpgrade
getting removed from the configmap.
After 4.14 GA, we decided not to enable switching back to single zone , so now we can finally get rid of temporary
.
Well, I told a nice story, but with hindsight temporary
was indeed redundant also before :-D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To finish answering your questions, the phases in the upgrade are exactly the same as before, there's no change in behaviour. I can take care of updating the enhancement, once this PR merges :)
targetZoneMode.fastForwardToMultiZone = true | ||
if targetZoneMode.zoneMode != zoneModeMultiZone { | ||
klog.Warningf("Forcing interconnect zone mode to multizone due to 'fast-forward-to-multizone' being set") | ||
targetZoneMode.zoneMode = zoneModeMultiZone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is sort of a disruptive change in case ongoingUpgrade=true and someone sets this right? example, single zone is rolling out and stuck and now we set fastForwardToMultiZone
to true to unblock and move forward with multizone.. will the single zone pods be left behind or will cleanup happen?
Also I guess we can add something to the PR description or enhancement update around this option to say "at your own risk - to be used only with guidance from eng or something"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, using the new flag is disruptive: CNO will push the final multizone YAMLs regardless of where the cluster is during the 4.13->4.14 upgrade. After this PR merges, my idea was to write a document about this for support. Either that or I update the enhancement...
As for the single-zone pods, they will be removed as a consequence of the new ones being pushed.
pkg/network/ovn_kubernetes.go
Outdated
// "temporary", if true, indicates that the target zone mode is only temporary; | ||
// it is used along with zoneMode=singlezone in order to temporarily switch to single-zone mode when upgrading | ||
// from versions with no interconnect support (<=4.13). It has no effect if zoneMode=multizone. | ||
temporary bool | ||
// "configMapFound" indicates whether the interconnect configmap was found; when not found, | ||
// the zone mode defaults to multizone. | ||
configMapFound bool | ||
// ongoingUpgrade is true when the configmap was pushed by CNO itself; it is used by status manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to make updates to what ongoingUpgrade means? It seems like its also taking over the role that "temporary" was doing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, since we've decided that we're not allowing a cluster to be permanently in single zone, I might as well remove "ongoingUpgrade" and just refer to configMapFound
, since only CNO is supposed to push the configmap now... let me try it out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack did you try it? all ok on this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I retested everything after pushing the current changes: #2076 (comment)
"op": "replace", | ||
"path": "/data/temporary", | ||
"value": "false", | ||
}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm maybe stupid question, but why isn't ongoing-upgrade also patched?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's set already at the beginning of the upgrade, in the first "if" block inside prepareUpgradeToInterConnect
:
"ongoing-upgrade": "", |
Single-zone mode is only needed during a 4.13->4.14 upgrade. Switching back to single zone is not supported. As a consequence, we can further simplify the code and remove the "ongoing-upgrade" flag, whose purpose was to distinguish between a configmap pushed by a cluster admin to switch to single / multi zone and the configmap pushed by CNO to track the upgrade progress. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
0364ebe
to
5983b44
Compare
I'll test the new changes tomorrow morning... |
/retest |
2 similar comments
/retest |
/retest |
I've manually tested the new changes to fast-forward to the multizone yamls when the cluster is in phase 1 and when it's in phase 2. No surprises, the cluster converges to the new YAMLs as expected. |
/retest |
1 similar comment
/retest |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ricky-rav, trozet, tssurya The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/label cherry-pick-approved |
/test e2e-hypershift-ovn |
/retest |
@ricky-rav: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/test e2e-hypershift-ovn |
8a28156
into
openshift:release-4.14
@ricky-rav: Jira Issue OCPBUGS-26573: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-26573 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-network-operator-container-v4.14.0-202402191039.p0.g8a28156.assembly.stream.el8 for distgit cluster-network-operator. |
With openshift/cluster-network-operator#2076 we added the possibility to skip the 4.13->4.14 two-phase OVNK upgrade for toubleshooting purposes and fast forward to final YAMLs for 4.14 OVNK. Also, the ConfigMap pushed by CNO got simplified. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
With openshift/cluster-network-operator#2076 we added the possibility to skip the 4.13->4.14 two-phase OVNK upgrade for toubleshooting purposes and fast forward to final YAMLs for 4.14 OVNK. Also, the ConfigMap pushed by CNO got simplified. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
With openshift/cluster-network-operator#2076 we added the possibility to skip the 4.13->4.14 two-phase OVNK upgrade for toubleshooting purposes and fast forward to final YAMLs for 4.14 OVNK. Also, the ConfigMap pushed by CNO got simplified. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
Fix included in accepted release 4.14.0-0.nightly-2024-02-19-170131 |
With openshift/cluster-network-operator#2076 we added the possibility to skip the 4.13->4.14 two-phase OVNK upgrade for toubleshooting purposes and fast forward to final YAMLs for 4.14 OVNK. Also, the ConfigMap pushed by CNO got simplified. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
@anuragthehatter for the final diagrams & configmap please refer to openshift/enhancements#1567 |
With openshift/cluster-network-operator#2076 we added the possibility to skip the 4.13->4.14 two-phase OVNK upgrade for toubleshooting purposes and fast forward to final YAMLs for 4.14 OVNK. Also, the ConfigMap pushed by CNO got simplified. Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
@ricky-rav: Jira Issue OCPBUGS-26573 is in an unrecognized state (Closed) and will not be moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
If ever a cluster is not making progress during phase 1 or phase 2 of the upgrade to OVN IC, a cluster admin can now edit the interconnect configmap and add a new key (
fast-forward-to-multizone
) to bypass the two-phase upgrade and let CNO apply directly the multizone YAMLs.The current code only allows the cluster to move forward when it's in phase 1 by setting
zone-mode=multizone
andtemporary=false
:Let's improve that by allowing to jump to multizone in either phase of the upgrade. At the same time, remove the
temporary
field from the configmap, since theongoing-upgrade
field already tracks that CNO is going through an upgrade: