[release-4.14] OCPBUGS-26573: Improve troubleshooting IC upgrades #2076

Merged

@ricky-rav ricky-rav commented Oct 20, 2023

If a cluster ever stops making progress during phase 1 or phase 2 of the upgrade to OVN IC, a cluster admin can now edit the interconnect configmap and add a new key (fast-forward-to-multizone) to bypass the two-phase upgrade and let CNO apply the multizone YAMLs directly.

The current code only allows the cluster to move forward when it is in phase 1, by setting zone-mode=multizone and temporary=false:
[Diagram: ICupgrades_from_413z_to_41413]

Let's improve that by allowing a jump to multizone in either phase of the upgrade. At the same time, remove the temporary field from the configmap, since the ongoing-upgrade field already tracks that CNO is going through an upgrade:
[Diagram: ICupgrades_from_413z_to_41414plus]
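
To make the intended behaviour concrete, here is a minimal Go sketch of how the interconnect configmap data could be turned into a target zone mode. It is not the CNO implementation: `parseInterconnectData` is a hypothetical helper, the standard `log` package stands in for klog, and treating the mere presence of the `fast-forward-to-multizone` key as the trigger is an assumption.

```go
package main

import (
	"fmt"
	"log"
)

const (
	zoneModeSingleZone = "singlezone"
	zoneModeMultiZone  = "multizone"
)

// targetZoneMode mirrors the fields discussed in this PR; the real struct in CNO may differ.
type targetZoneMode struct {
	zoneMode               string
	configMapFound         bool
	fastForwardToMultiZone bool
}

// parseInterconnectData is a hypothetical helper. data is the configmap's
// data section, or nil when the interconnect configmap does not exist.
func parseInterconnectData(data map[string]string) targetZoneMode {
	// When the configmap is not found, the zone mode defaults to multizone.
	t := targetZoneMode{zoneMode: zoneModeMultiZone}
	if data == nil {
		return t
	}
	t.configMapFound = true

	// Only accept the two known zone modes; anything else keeps the default.
	if mode := data["zone-mode"]; mode == zoneModeSingleZone || mode == zoneModeMultiZone {
		t.zoneMode = mode
	}

	// Assumption: the presence of the key is enough, whatever its value.
	if _, ok := data["fast-forward-to-multizone"]; ok {
		t.fastForwardToMultiZone = true
		if t.zoneMode != zoneModeMultiZone {
			log.Printf("Forcing interconnect zone mode to multizone due to 'fast-forward-to-multizone' being set")
			t.zoneMode = zoneModeMultiZone
		}
	}
	return t
}

func main() {
	// A cluster stuck in the single-zone phase, with the new key added by an admin.
	stuck := map[string]string{
		"zone-mode":                 zoneModeSingleZone,
		"fast-forward-to-multizone": "true",
	}
	fmt.Printf("%+v\n", parseInterconnectData(stuck))
}
```

In this sketch the fast-forward key wins over whatever zone-mode says, which matches the idea of jumping to multizone from either phase.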

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 20, 2023
@openshift-ci
Contributor

openshift-ci bot commented Oct 20, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ricky-rav ricky-rav marked this pull request as ready for review October 20, 2023 09:19
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 20, 2023
@ricky-rav
Contributor Author

/retest

@ricky-rav ricky-rav changed the title [release-4.14][WIP] Improve troubleshooting IC upgrades [release-4.14] Improve troubleshooting IC upgrades Oct 23, 2023
@ricky-rav
Contributor Author

/retest

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
(cherry picked from commit 77bff544f331e14db3b7a4b5245af335c5d16a19)
The "temporary" field became redundant when "ongoing-upgrade" was added to the IC upgrade logic. Let's only keep the "ongoing-upgrade" to keep track of a configmap pushed by CNO for IC upgrade.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
(cherry picked from commit f3a4bff911baa5a0d33780e2cf4347712c3e6026)
@ricky-rav
Contributor Author

/retest

@ricky-rav ricky-rav changed the title [release-4.14] Improve troubleshooting IC upgrades [release-4.14] SDN-4154: Improve troubleshooting IC upgrades Oct 25, 2023
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 25, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 25, 2023

@ricky-rav: This pull request references SDN-4154 which is a valid jira issue.

@ricky-rav
Contributor Author

/retest

Contributor

@tssurya tssurya left a comment

Overall logic is clear where you are trying to add a fast-forward knob. Left some inline questions...
Do we need some unit tests?
Did you do live testing and results were ok? (asking because we are kinda going in blind here...)

// "temporary", if true, indicates that the target zone mode is only temporary;
// it is used along with zoneMode=singlezone in order to temporarily switch to single-zone mode when upgrading
// from versions with no interconnect support (<=4.13). It has no effect if zoneMode=multizone.
temporary bool
Contributor

So this variable is removed in 4.14.7, or whichever z-stream this lands in, right?
What will new upgrades from non-IC to the 4.14.z where this change lands look like? The PR message says "remove the temporary field from the configmap, since the ongoing-upgrade field already tracks that CNO is going through an upgrade", so that means we no longer need to know whether the phase is temporary, only whether an upgrade is ongoing? Why did we need the temporary flag originally? It seems like it was redundant with ongoingUpgrade.

TL;DR: is there a functional difference for the end user between going from 4.13.0 to 4.14.0 and going from 4.13.0 to the 4.14.z where this lands, given that this field is going away? Do we need to update the original enhancement?

Contributor Author

@ricky-rav ricky-rav Dec 5, 2023

I could have simplified the logic earlier on, but I wanted to leave a window open in case we were determined to allow single zone...

Here's how it went.

The temporary field was how I envisaged distinguishing between being permanently in single zone (temporary=false) and being there only momentarily during the upgrade to 4.14 (temporary=true). The idea was that a configmap would track either an upgrade to IC or a voluntary switch to single zone.

Then I needed another flag in the configmap, ongoing-upgrade, so that the CNO status manager could know whether an upgrade to IC was ongoing: it would only report version=4.14 when the upgrade was done, which was signaled by ongoingUpgrade getting removed from the configmap.

After 4.14 GA, we decided not to enable switching back to single zone, so now we can finally get rid of temporary.

Well, I told a nice story, but with hindsight temporary was indeed redundant even before :-D

Contributor Author

To finish answering your questions: the phases in the upgrade are exactly the same as before, so there's no change in behaviour. I can take care of updating the enhancement once this PR merges :)

targetZoneMode.fastForwardToMultiZone = true
if targetZoneMode.zoneMode != zoneModeMultiZone {
	klog.Warningf("Forcing interconnect zone mode to multizone due to 'fast-forward-to-multizone' being set")
	targetZoneMode.zoneMode = zoneModeMultiZone
Contributor

So this is a somewhat disruptive change if ongoingUpgrade=true and someone sets this, right? For example, single zone is rolling out and gets stuck, and we set fastForwardToMultiZone to true to unblock it and move forward with multizone. Will the single-zone pods be left behind, or will cleanup happen?

Also, I guess we can add something to the PR description or the enhancement update about this option, along the lines of "use at your own risk, only with guidance from engineering".

Contributor Author

Yes, using the new flag is disruptive: CNO will push the final multizone YAMLs regardless of where the cluster is during the 4.13->4.14 upgrade. After this PR merges, my idea was to write a document about this for support. Either that or I update the enhancement...

As for the single-zone pods, they will be removed as a consequence of the new ones being pushed.
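
For a future support document, the edit an admin would make could be expressed as a JSON Patch (RFC 6902), the same shape as the patch quoted from this PR's diff. The Go sketch below only builds and prints such a patch. The value "true" is an assumption (the PR only says that the key is added), `fastForwardPatch` is a hypothetical helper, and the configmap's name and namespace are deliberately left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonPatchOp is a single RFC 6902 JSON Patch operation.
type jsonPatchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// fastForwardPatch builds a patch that adds the fast-forward key to the
// configmap's data section. The value "true" is an assumption.
func fastForwardPatch() ([]byte, error) {
	ops := []jsonPatchOp{{
		Op:    "add",
		Path:  "/data/fast-forward-to-multizone",
		Value: "true",
	}}
	return json.Marshal(ops)
}

func main() {
	patch, err := fastForwardPatch()
	if err != nil {
		panic(err)
	}
	// Prints: [{"op":"add","path":"/data/fast-forward-to-multizone","value":"true"}]
	fmt.Println(string(patch))
}
```

The "add" operation is used rather than "replace" because the key is not expected to exist in the configmap before the admin steps in.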

// "temporary", if true, indicates that the target zone mode is only temporary;
// it is used along with zoneMode=singlezone in order to temporarily switch to single-zone mode when upgrading
// from versions with no interconnect support (<=4.13). It has no effect if zoneMode=multizone.
temporary bool
// "configMapFound" indicates whether the interconnect configmap was found; when not found,
// the zone mode defaults to multizone.
configMapFound bool
// ongoingUpgrade is true when the configmap was pushed by CNO itself; it is used by status manager
Contributor

Do we need to update what ongoingUpgrade means? It seems like it's also taking over the role that "temporary" was playing...

Contributor Author

Actually, since we've decided that we're not allowing a cluster to be permanently in single zone, I might as well remove "ongoingUpgrade" and just refer to configMapFound, since only CNO is supposed to push the configmap now... let me try it out!
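
A rough Go sketch of what that simplification could look like for the status manager. This is only an illustration: `upgradeOngoing` and `versionToReport` are hypothetical helpers, and the sketch assumes that CNO removes the configmap once the upgrade completes and that the previous version keeps being reported until then.

```go
package main

import "fmt"

// targetZoneMode keeps only the fields relevant to this sketch.
type targetZoneMode struct {
	zoneMode       string
	configMapFound bool
}

// upgradeOngoing assumes that only CNO pushes the interconnect configmap,
// so finding it means an upgrade to IC is still in progress.
func upgradeOngoing(t targetZoneMode) bool {
	return t.configMapFound
}

// versionToReport is a hypothetical helper: the new version is reported
// only once the upgrade is done, i.e. once the configmap is gone.
func versionToReport(t targetZoneMode, previousVersion, targetVersion string) string {
	if upgradeOngoing(t) {
		return previousVersion
	}
	return targetVersion
}

func main() {
	during := targetZoneMode{zoneMode: "singlezone", configMapFound: true}
	after := targetZoneMode{zoneMode: "multizone", configMapFound: false}

	fmt.Println(versionToReport(during, "4.13", "4.14")) // prints 4.13
	fmt.Println(versionToReport(after, "4.13", "4.14"))  // prints 4.14
}
```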

Contributor

ack did you try it? all ok on this PR?

Contributor Author

yes, I retested everything after pushing the current changes: #2076 (comment)

		"op":    "replace",
		"path":  "/data/temporary",
		"value": "false",
	},
}
Contributor

hmm maybe stupid question, but why isn't ongoing-upgrade also patched?

Contributor Author

It's set already at the beginning of the upgrade, in the first "if" block inside prepareUpgradeToInterConnect:

Single-zone mode is only needed during a 4.13->4.14 upgrade. Switching back to single zone is not supported. As a consequence, we can further simplify the code and remove the "ongoing-upgrade" flag, whose purpose was to distinguish between a configmap pushed by a cluster admin to switch to single / multi zone and the configmap pushed by CNO to track the upgrade progress.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
@ricky-rav
Contributor Author

I'll test the new changes tomorrow morning...

@ricky-rav
Contributor Author

/retest

@ricky-rav
Contributor Author

I've manually tested the new changes to fast-forward to the multizone yamls when the cluster is in phase 1 and when it's in phase 2. No surprises, the cluster converges to the new YAMLs as expected.

@ricky-rav
Contributor Author

/retest

@ricky-rav ricky-rav changed the title [release-4.14] SDN-4154: Improve troubleshooting IC upgrades [release-4.14] OCPBUGS-26573: Improve troubleshooting IC upgrades Jan 10, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jan 10, 2024
@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Feb 14, 2024
Contributor

openshift-ci bot commented Feb 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ricky-rav, trozet, tssurya

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2024
@asood-rh

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Feb 16, 2024
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD acd8754 and 2 for PR HEAD 5983b44 in total

@ricky-rav
Contributor Author

/test e2e-hypershift-ovn

@ricky-rav
Contributor Author

/retest

Contributor

openshift-ci bot commented Feb 16, 2024

@ricky-rav: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure-ovn-dualstack | 5983b44 | link | false | /test e2e-azure-ovn-dualstack |
| ci/prow/e2e-openstack-kuryr | 5983b44 | link | false | /test e2e-openstack-kuryr |
| ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 | 5983b44 | link | false | /test e2e-vsphere-ovn-dualstack-primaryv6 |
| ci/prow/security | 5983b44 | link | false | /test security |

@ricky-rav
Contributor Author

/test e2e-hypershift-ovn

@openshift-merge-bot openshift-merge-bot bot merged commit 8a28156 into openshift:release-4.14 Feb 19, 2024
40 of 43 checks passed
@openshift-ci-robot
Contributor

@ricky-rav: Jira Issue OCPBUGS-26573: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-26573 has been moved to the MODIFIED state.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-network-operator-container-v4.14.0-202402191039.p0.g8a28156.assembly.stream.el8 for distgit cluster-network-operator.
All builds following this will include this PR.

ricky-rav added a commit to ricky-rav/enhancements that referenced this pull request Feb 19, 2024
With openshift/cluster-network-operator#2076 we added the possibility to skip the 4.13->4.14 two-phase OVNK upgrade for troubleshooting purposes and fast forward to the final YAMLs for 4.14 OVNK. Also, the ConfigMap pushed by CNO got simplified.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
@openshift-merge-robot
Contributor

Fix included in accepted release 4.14.0-0.nightly-2024-02-19-170131

ricky-rav added a commit to ricky-rav/enhancements that referenced this pull request Feb 20, 2024
With openshift/cluster-network-operator#2076 we added the possibility to skip the 4.13->4.14 two-phase OVNK upgrade for troubleshooting purposes and fast forward to the final YAMLs for 4.14 OVNK. Also, the ConfigMap pushed by CNO got simplified.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
@ricky-rav
Contributor Author

@anuragthehatter for the final diagrams & configmap please refer to openshift/enhancements#1567

ricky-rav added a commit to ricky-rav/enhancements that referenced this pull request Mar 5, 2024
With openshift/cluster-network-operator#2076 we added the possibility to skip the 4.13->4.14 two-phase OVNK upgrade for troubleshooting purposes and fast forward to the final YAMLs for 4.14 OVNK. Also, the ConfigMap pushed by CNO got simplified.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
@openshift-ci-robot
Contributor

@ricky-rav: Jira Issue OCPBUGS-26573 is in an unrecognized state (Closed) and will not be moved to the MODIFIED state.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.