
pkg/cli/admin/upgrade: Clarify client-side vs. server-side docs #1181

Conversation

@wking (Member) commented Jun 25, 2022


The outgoing docs were frequently conflated. The incoming docs hopefully make the distinction clearer, and they list the client-side checkForUpgrade conditions. I haven't mentioned the Invalid check, because that trips so rarely, but I have used the open-ended "include checks for...", so folks aren't too surprised if they hit a client-side guard that's not listed in the docs. The "failing clusters" bit is a lie, but that's already tracked in [1], and eventually I'll be able to talk someone into reviewing the fix.

The cluster-side guards are also open-ended, although I do call out both release verification (signature checks and similar) and upgradeable conditions, since those are declared in the force godocs [2]. I do not mention additional guards like the etcd backups [3], because the version of oc requesting the update may diverge from the version of the cluster being asked to update, so we don't really know which checks that cluster will run cluster-side.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1992680
[2]: https://github.com/openshift/api/blob/22eb4f6f4385a0183a5eee4c8ca6d49eecda8120/config/v1/types_cluster_version.go#L378-L386
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1997347
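
For readers newer to this distinction: the cluster-side escape hatch referenced above is the Force field on the desired update declared in [2]; when set, the cluster-version operator skips its own guards, such as release signature verification and Upgradeable conditions. Below is a minimal sketch, not code from this PR, assuming the github.com/openshift/api/config/v1 types; the release image pullspec is illustrative.

// Sketch only: how a forced update is expressed on the ClusterVersion spec.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

func main() {
	cv := &configv1.ClusterVersion{}
	cv.Spec.DesiredUpdate = &configv1.Update{
		// Illustrative pullspec, not a real digest.
		Image: "quay.io/openshift-release-dev/ocp-release@sha256:...",
		// Force asks the cluster-version operator to skip its cluster-side
		// guards, e.g. release verification and Upgradeable conditions.
		Force: true,
	}
	fmt.Printf("desired update: %+v\n", *cv.Spec.DesiredUpdate)
}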
@openshift-ci openshift-ci bot requested review from deads2k and soltysh June 25, 2022 03:34
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 25, 2022
  flags.BoolVar(&o.AllowExplicitUpgrade, "allow-explicit-upgrade", o.AllowExplicitUpgrade, "Upgrade even if the upgrade target is not listed in the available versions list.")
- flags.BoolVar(&o.AllowUpgradeWithWarnings, "allow-upgrade-with-warnings", o.AllowUpgradeWithWarnings, "Upgrade even if an upgrade is in process or a cluster error is blocking the update.")
+ flags.BoolVar(&o.AllowUpgradeWithWarnings, "allow-upgrade-with-warnings", o.AllowUpgradeWithWarnings, "Upgrade regardless of client-side guard failures, such as upgrades in progress or failing clusters.")
Contributor

I'm surprised that "failing clusters" is a client-side check (I assume "failing clusters" means the ClusterVersion status condition Failing=True) and not a server-side check. Does that mean that if someone edits the ClusterVersion directly to request an upgrade, the Failing condition will not block the upgrade?

Member

That is right. If someone directly modifies the ClusterVersion resource, they can bypass the client-side checks. Also, I think the web console does not have these checks. That's why I am not supportive of client-side checks: they are not consistent across different interfaces.

Contributor

That's why I am not supportive of client-side checks: they are not consistent across different interfaces.

+1
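
To make the bypass described above concrete: anything that writes spec.desiredUpdate on the ClusterVersion object skips oc's client-side checkForUpgrade guards entirely; only the guards the cluster-version operator applies itself still stand in the way. A hedged sketch, assuming the openshift/client-go typed config client; the helper name and the target version are illustrative, not from oc:

// Sketch: requesting an update by editing the ClusterVersion spec directly.
package main

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// requestUpdateDirectly is a hypothetical helper, not part of oc.
func requestUpdateDirectly(ctx context.Context, kubeconfig, version string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		return err
	}
	cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
	if err != nil {
		return err
	}
	// No client-side guard runs on this path; only cluster-side guards apply.
	cv.Spec.DesiredUpdate = &configv1.Update{Version: version}
	_, err = client.ConfigV1().ClusterVersions().Update(ctx, cv, metav1.UpdateOptions{})
	return err
}

func main() {
	// The target version is illustrative.
	if err := requestUpdateDirectly(context.Background(), clientcmd.RecommendedHomeFile, "4.13.0"); err != nil {
		panic(err)
	}
}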

Member

(I assume "failing clusters" means the ClusterVersion status condition Failing=True)

Failing here means that cluster operators are degraded.

func checkForUpgrade(cv *configv1.ClusterVersion) error {
	results := []string{}
	if c := findClusterOperatorStatusCondition(cv.Status.Conditions, "Invalid"); c != nil && c.Status == configv1.ConditionTrue {
		results = append(results, fmt.Sprintf("the cluster version object is invalid, you must correct the invalid state first:\n\n Reason: %s\n Message: %s\n\n", c.Reason, strings.ReplaceAll(c.Message, "\n", "\n ")))
	}
	if c := findClusterOperatorStatusCondition(cv.Status.Conditions, configv1.OperatorDegraded); c != nil && c.Status == configv1.ConditionTrue {
		results = append(results, fmt.Sprintf("the cluster is experiencing an upgrade-blocking error:\n\n Reason: %s\n Message: %s\n\n", c.Reason, strings.ReplaceAll(c.Message, "\n", "\n ")))
	}
	if c := findClusterOperatorStatusCondition(cv.Status.Conditions, configv1.OperatorProgressing); c != nil && c.Status == configv1.ConditionTrue {
		results = append(results, fmt.Sprintf("the cluster is already upgrading:\n\n Reason: %s\n Message: %s\n\n", c.Reason, strings.ReplaceAll(c.Message, "\n", "\n ")))
	}
	if len(results) == 0 {
		return nil
	}
	return errors.New(strings.Join(results, ""))
}

such as upgrades in progress or failing clusters.

@wking we should change this to something like "such as upgrades in progress or situations where the upgrade is stuck because of an error condition".
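
The excerpt above calls findClusterOperatorStatusCondition without showing it. A minimal sketch of what such a lookup presumably does (the real helper in oc may differ), assuming the github.com/openshift/api/config/v1 types:

package example

import (
	configv1 "github.com/openshift/api/config/v1"
)

// findClusterOperatorStatusCondition returns the condition with the given
// type from a list of status conditions, or nil if it is not present.
// Hedged sketch, not necessarily the exact implementation in oc.
func findClusterOperatorStatusCondition(conditions []configv1.ClusterOperatorStatusCondition, name configv1.ClusterStatusConditionType) *configv1.ClusterOperatorStatusCondition {
	for i := range conditions {
		if conditions[i].Type == name {
			return &conditions[i]
		}
	}
	return nil
}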


@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2023
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 7, 2023
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Apr 6, 2023
@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2023

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close


@bparees
Contributor

bparees commented Apr 6, 2023

@wking @LalatenduMohanty did this get improved via another PR? If not, can we reopen this and land it?

@LalatenduMohanty
Member

/reopen

@LalatenduMohanty
Member

/lifecycle frozen

@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2023

@LalatenduMohanty: The lifecycle/frozen label cannot be applied to Pull Requests.

In response to this:

/lifecycle frozen


@openshift-ci openshift-ci bot reopened this Apr 6, 2023
@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2023

@LalatenduMohanty: Reopened this PR.

In response to this:

/reopen


@LalatenduMohanty (Member) left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 6, 2023
@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking


@LalatenduMohanty
Member

/remove-lifecycle stale

@petr-muller
Member

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 6, 2023
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 8228cb2 and 2 for PR HEAD 10bd469 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 539d06f and 1 for PR HEAD 10bd469 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 8964f36 and 0 for PR HEAD 10bd469 in total

@openshift-ci
Contributor

openshift-ci bot commented Apr 10, 2023

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name:     ci/prow/e2e-aws
Commit:        10bd469
Required:      true
Rerun command: /test e2e-aws


@openshift-ci-robot

/hold

Revision 10bd469 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2023
@LalatenduMohanty
Member

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2023
@LalatenduMohanty
Member

/retest

@wking
Member Author

wking commented Apr 15, 2023

/test e2e-aws

@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2023

@wking: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test build-rpms-from-tar
  • /test e2e-agnostic-ovn-cmd
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-builds
  • /test e2e-aws-ovn-serial
  • /test e2e-aws-ovn-upgrade
  • /test images
  • /test rpm-build
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test e2e-metal-ipi-ovn-ipv6

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-oc-master-build-rpms-from-tar
  • pull-ci-openshift-oc-master-e2e-agnostic-ovn-cmd
  • pull-ci-openshift-oc-master-e2e-aws-ovn
  • pull-ci-openshift-oc-master-e2e-aws-ovn-serial
  • pull-ci-openshift-oc-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-oc-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-oc-master-images
  • pull-ci-openshift-oc-master-rpm-build
  • pull-ci-openshift-oc-master-unit
  • pull-ci-openshift-oc-master-verify
  • pull-ci-openshift-oc-master-verify-deps

In response to this:

/test e2e-aws


@openshift-merge-robot openshift-merge-robot merged commit 5f8c36d into openshift:master Apr 17, 2023
13 checks passed
@wking wking deleted the doc-distinguish-upgrade-client-vs-server-guards branch April 17, 2023 22:20