
OCPBUGS-9133: pkg/cvo/metrics: Connect ClusterVersion to ClusterOperatorDown and ClusterOperatorDegraded #746

Merged

Conversation

@wking (Member) commented Feb 24, 2022

By adding cluster_operator_up handling for ClusterVersion, with version as the component name, the same way we handle cluster_operator_conditions. This plugs us into ClusterOperatorDown (based on cluster_operator_up) and ClusterOperatorDegraded (based on both cluster_operator_conditions and cluster_operator_up).
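As a rough sketch of the mechanism (a minimal client_golang program, not the CVO's actual collector; whether the gauge is driven by Failing or some other condition is an assumption here), the idea is to publish the same cluster_operator_up family for ClusterVersion that is already published per ClusterOperator:

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // The same metric family the CVO already serves per ClusterOperator;
    // this change adds a series with the component name "version".
    up := prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "cluster_operator_up",
        Help: "1 if the component is considered up, 0 otherwise.",
    }, []string{"name"})
    prometheus.MustRegister(up)

    // Hypothetical input: in the real CVO this would be derived from
    // ClusterVersion status conditions.
    failing := false
    if failing {
        up.WithLabelValues("version").Set(0)
    } else {
        up.WithLabelValues("version").Set(1)
    }
    fmt.Println(`serving cluster_operator_up{name="version"}`)
}

A real program would expose the registered metrics over HTTP via promhttp; that plumbing is omitted here.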

I've adjusted the ClusterOperatorDegraded rule so that it fires on ClusterVersion Failing=True and does not fire on Failing=False. Thinking through an update from before:

  1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
  2. User requests an update to a release with this change.
  3. New CVO comes in, starts serving cluster_operator_up{name="version"}.
  4. Old ClusterOperatorDegraded sees no matching cluster_operator_conditions{name="version",condition="Degraded"}, falls through to cluster_operator_up{name="version"}, and starts the for: 30m clock.
  5. If we go more than 30m before updating the ClusterOperatorDegraded rule to understand Failing, ClusterOperatorDegraded would fire.

We'll need to backport the ClusterOperatorDegraded expr change at least one 4.y release before the CVO-metrics change lands, to get:

  1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
  2. User requests an update to a release with the expr change.
  3. Incoming ClusterOperatorDegraded sees no cluster_operator_conditions{name="version",condition="Degraded"}, cluster_operator_conditions{name="version",condition="Failing"} (we hope), or cluster_operator_up{name="version"}, so it doesn't fire. Unless we are Failing=True, in which case, hooray, we'll start alerting about it.
  4. User requests an update to a release with the CVO-metrics change.
  5. New CVO starts serving cluster_operator_up{name="version"}, just like the fresh-modern-install situation, and everything is great.

The missing-ClusterVersion metrics don't matter all that much today, because the CVO has been creating a replacement ClusterVersion since at least 90e9881 (#45). But they will become more important with #741, which plans to remove that default creation. When there is no ClusterVersion, we expect ClusterOperatorDown to fire.
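A hedged sketch of that no-ClusterVersion case (function and parameter names here are hypothetical, not the CVO's): the collector publishes the series as 0 when the object is missing, which is what lets the cluster_operator_up-based ClusterOperatorDown alert fire after its 'for' window.

package metricsketch

import (
    configv1 "github.com/openshift/api/config/v1"
    "github.com/prometheus/client_golang/prometheus"
)

// reportClusterVersionUp is a hypothetical helper; cv == nil models
// "no ClusterVersion object exists".
func reportClusterVersionUp(up *prometheus.GaugeVec, cv *configv1.ClusterVersion) {
    if cv == nil {
        // No ClusterVersion at all: publish 0 so ClusterOperatorDown,
        // which watches cluster_operator_up, can fire.
        up.WithLabelValues("version").Set(0)
        return
    }
    up.WithLabelValues("version").Set(1)
}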

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2022
@wking wking force-pushed the metrics-for-no-cluster-version branch from 30aa6e6 to 43f13d7 on February 24, 2022 21:17
@wking wking changed the title 2058416: pkg/cvo/metrics: Connect ClusterVersion to ClusterOperatorDown and ClusterOperatorDegraded Bug 2058416: pkg/cvo/metrics: Connect ClusterVersion to ClusterOperatorDown and ClusterOperatorDegraded Apr 12, 2022
@openshift-ci openshift-ci bot added bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 12, 2022
openshift-ci bot (Contributor) commented Apr 12, 2022

@wking: This pull request references Bugzilla bug 2058416, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

Requesting review from QA contact:
/cc @shellyyang1989


@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2022
@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now, please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2022
@openshift-bot (Contributor)

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now, please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 22, 2022
@openshift-bot (Contributor)

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jan 21, 2023
openshift-ci bot (Contributor) commented Jan 21, 2023

@openshift-bot: Closed this PR.


openshift-ci bot (Contributor) commented Jan 21, 2023

@wking: This pull request references Bugzilla bug 2058416. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
Warning: Failed to comment on Bugzilla bug with reason for changed state.


@wking wking reopened this Jan 25, 2023
@openshift-ci openshift-ci bot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jan 25, 2023
openshift-ci bot (Contributor) commented Jan 25, 2023

@wking: This pull request references Bugzilla bug 2058416, which is invalid:

  • expected the bug to target the "4.13.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


@openshift-bot (Contributor)

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Feb 25, 2023
openshift-ci bot (Contributor) commented Feb 25, 2023

@openshift-bot: Closed this PR.


openshift-ci bot (Contributor) commented Feb 25, 2023

@wking: An error was encountered removing this pull request from the external tracker bugs for bug 2058416 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected; please see the full error message for details.

Full error message: response code 400, not 200.

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.


@wking wking reopened this Mar 18, 2024
@wking (Member, author) commented Mar 18, 2024

/retitle OCPBUGS-9133: pkg/cvo/metrics: Connect ClusterVersion to ClusterOperatorDown and ClusterOperatorDegraded

@wking (Member, author) commented Apr 9, 2024

/retest-required

@wking (Member, author) commented Apr 10, 2024

Testing with Cluster Bot and launch 4.16,openshift/cluster-version-operator#746 aws, I started by making auth mad:

$ oc adm cordon -l node-role.kubernetes.io/control-plane=
node/ip-10-0-120-3.us-west-1.compute.internal cordoned
node/ip-10-0-31-66.us-west-1.compute.internal cordoned
node/ip-10-0-71-7.us-west-1.compute.internal cordoned
$ oc -n openshift-authentication delete "$(oc -n openshift-authentication get -o name pods | head -n1)"
pod "oauth-openshift-6cbcb6579f-2fg22" deleted

Do the Machine-approver too, for good measure, as I pick on things that should spook cluster-operators without actually hurting cluster performance (I'm not trying to scale up new Machines/Nodes):

$ oc -n openshift-cluster-machine-approver delete "$(oc -n openshift-cluster-machine-approver get -o name pods | head -n1)"
pod "machine-approver-74b7866855-z2flp" deleted

With the operand pods removed, and the cordon blocking their replacements from scheduling, the operators should be grumbling. But even after 15m, they're all still happy:

$ oc get -o json clusteroperator | jq -c '.items[].status.conditions[] | select(.type == "Available" or .type == "Degraded") | {type, status}' | sort | uniq -c
     33 {"type":"Available","status":"True"}
     33 {"type":"Degraded","status":"False"}

Maybe I should go after the registry:

$ oc adm cordon -l node-role.kubernetes.io/worker=
node/ip-10-0-108-53.us-west-1.compute.internal cordoned
node/ip-10-0-33-86.us-west-1.compute.internal cordoned
node/ip-10-0-87-58.us-west-1.compute.internal cordoned
$ oc delete namespace openshift-image-registry
namespace "openshift-image-registry" deleted

Hey, now an operator is mad:

$ oc get -o json clusteroperator | jq -c '.items[] | .metadata.name as $n | .status.conditions[] | select((.type == "Available" and .status == "False") or (.type == "Degraded" and .status == "True")) | .name = $n' | sort
{"lastTransitionTime":"2024-04-10T06:48:00Z","message":"1 of 6 credentials requests are failing to sync.","reason":"CredentialsFailing","status":"True","type":"Degraded","name":"cloud-credential"}

But that's not one I'd been trying to poke, and it soon got happy again; cloud-credential was probably just trying to create the registry's CredentialsRequest Secret, and struggling until the CVO had recreated that namespace. Eventually, machine-config complains about the cordoned control plane:

$ oc get -o json clusteroperator | jq -c '.items[] | .metadata.name as $n | .status.conditions[] | select((.type == "Available" and .status == "False") or (.type == "Degraded" and .status == "True")) | .name = $n' | sort
{"lastTransitionTime":"2024-04-10T06:50:28Z","message":"Failed to resync 4.16.0-0.test-2024-04-10-052103-ci-ln-xtnbb9k-latest because: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 3, ready 0, updated: 3, unavailable: 3, degraded: 0)]]","reason":"RequiredPoolsFailed","status":"True","type":"Degraded","name":"machine-config"}

and ~5m later, authentication starts complaining too:

$ oc get -o json clusteroperator | jq -c '.items[] | .metadata.name as $n | .status.conditions[] | select((.type == "Available" and .status == "False") or (.type == "Degraded" and .status == "True")) | .name = $n' | sort
{"lastTransitionTime":"2024-04-10T06:50:28Z","message":"Failed to resync 4.16.0-0.test-2024-04-10-052103-ci-ln-xtnbb9k-latest because: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 3, ready 0, updated: 3, unavailable: 3, degraded: 0)]]","reason":"RequiredPoolsFailed","status":"True","type":"Degraded","name":"machine-config"}
{"lastTransitionTime":"2024-04-10T06:55:30Z","message":"OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()","reason":"OAuthServerDeployment_UnavailablePod","status":"True","type":"Degraded","name":"authentication"}

And the CVO passes these along:

$ oc adm upgrade
Failing=True:

  Reason: ClusterOperatorsDegraded
  Message: Cluster operators authentication, machine-config are degraded

Error while reconciling 4.16.0-0.test-2024-04-10-052103-ci-ln-xtnbb9k-latest: authentication, machine-config has an unknown error: ClusterOperatorsDegraded
...

The version alert renders well:

(screenshot)

The operator alert renders with an extra }}:

(screenshot)

pkg/cvo/metrics: Connect ClusterVersion to ClusterOperatorDown and ClusterOperatorDegraded

By adding cluster_operator_up handling for ClusterVersion, with
'version' as the component name, the same way we handle
cluster_operator_conditions.  This plugs us into ClusterOperatorDown
(based on cluster_operator_up) and ClusterOperatorDegraded (based on
both cluster_operator_conditions and cluster_operator_up).

I've adjusted the ClusterOperatorDegraded rule so that it fires on
ClusterVersion Failing=True and does not fire on Failing=False.
Thinking through an update from before:

1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with this change.
3. New CVO comes in, starts serving
   cluster_operator_up{name="version"}.
4. Old ClusterOperatorDegraded sees no matching
   cluster_operator_conditions{name="version",condition="Degraded"},
   falls through to cluster_operator_up{name="version"}, and starts
   the 'for: 30m' clock.
5. If we go more than 30m before updating the ClusterOperatorDegraded
   rule to understand Failing, ClusterOperatorDegraded would fire.

We'll need to backport the ClusterOperatorDegraded expr change at
least one 4.y release before the CVO-metrics change lands, to get:

1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with the expr change.
3. Incoming ClusterOperatorDegraded sees no
   cluster_operator_conditions{name="version",condition="Degraded"},
   cluster_operator_conditions{name="version",condition="Failing"} (we
   hope), or cluster_operator_up{name="version"}, so it doesn't fire.
   Unless we are Failing=True, in which case, hooray, we'll start
   alerting about it.
4. User requests an update to a release with the CVO-metrics change.
5. New CVO starts serving cluster_operator_up, just like the
   fresh-modern-install situation, and everything is great.

The missing-ClusterVersion metrics don't matter all that much today,
because the CVO has been creating a replacement ClusterVersion since
at least 90e9881 (cvo: Change the core CVO loops to report status to
ClusterVersion, 2018-11-02, openshift#45).  But they will become more
important with [1], which plans to remove that default creation.
When there is no ClusterVersion, we expect ClusterOperatorDown to
fire.

The awkward:

  {{ "{{ ... \"version\" }} ... {{ end }}" }}

business is because this content is unpacked in two rounds of
templating:

1. The cluster-version operator's getPayloadTasks' renderManifest
   preprocessing for the CVO directory, which is based on Go
   templates.
2. Prometheus alerting-rule templates, which use console templates
   [2], which are also based on Go templates [3].

The '{{ "..." }}' wrapping is consumed by the CVO's templating, and
the remaining:

  {{ ... "version" }} ... {{ end }}

is left for Prometheus' templating (a standalone demonstration
follows the footnotes below).

[1]: openshift#741
[2]: https://prometheus.io/docs/prometheus/2.51/configuration/alerting_rules/#templating
[3]: https://prometheus.io/docs/visualization/consoles/
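To see the two rounds concretely, here is a standalone text/template demonstration (plain Go, not the CVO's renderManifest; the inner alert-message action is a made-up stand-in):

package main

import (
    "os"
    "text/template"
)

func main() {
    // Round 1 stands in for the CVO's manifest preprocessing: the
    // outer {{ "..." }} action just evaluates a string literal, so
    // rendering emits the inner {{ ... }} actions verbatim.
    const manifest = `message: {{ "{{ with \"version\" }}operator {{ . }} is degraded{{ end }}" }}`
    round1 := template.Must(template.New("cvo").Parse(manifest))
    if err := round1.Execute(os.Stdout, nil); err != nil {
        panic(err)
    }
    // Prints:
    //   message: {{ with "version" }}operator {{ . }} is degraded{{ end }}
    // which is left for round 2, Prometheus' console templating, to
    // evaluate when the alert fires.
}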
@wking wking force-pushed the metrics-for-no-cluster-version branch from 74312d1 to 10849d7 on April 10, 2024 07:04
@wking (Member, author) commented Apr 10, 2024

From my earlier testing:

> The operator alert renders with an extra }}:

I've pushed 74312d1 -> 10849d7 to address this.

@petr-muller (Member)
/retest

@@ -247,6 +247,7 @@ func (o *Options) run(ctx context.Context, controllerCtx *Context, lock resource
}
klog.Infof("Failed to initialize from payload; shutting down: %v", err)
resultChannel <- asyncResult{name: "payload initialization", error: firstError}
return
petr-muller (Member), commenting on the diff above:

This was the panic?

wking (Member, author):

yup, essay in 2952a2f ;)
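For context, a hedged reduction of the hazard the added return avoids (hypothetical names; the real analysis is in 2952a2f):

package sketch

// runSketch mimics the shape of the diff above, not the CVO's actual
// run loop.
func runSketch(resultChannel chan<- error, initErr error, controllers *controllerSet) {
    if initErr != nil {
        resultChannel <- initErr
        // Without this return, execution would fall through to code
        // that assumes initialization succeeded, e.g. touching
        // controllers that were never created: a nil-pointer panic.
        return
    }
    controllers.start()
}

type controllerSet struct{ started bool }

func (c *controllerSet) start() { c.started = true }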

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2024
openshift-ci bot (Contributor) commented Apr 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking


openshift-ci bot (Contributor) commented Apr 17, 2024

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                   Commit   Required  Rerun command
ci/prow/e2e-agnostic-upgrade                43f13d7  true      /test e2e-agnostic-upgrade
ci/prow/e2e-agnostic-upgrade-into-change    43f13d7  true      /test e2e-agnostic-upgrade-into-change
ci/prow/e2e-agnostic-upgrade-out-of-change  43f13d7  true      /test e2e-agnostic-upgrade-out-of-change
ci/prow/e2e-agnostic                        43f13d7  true      /test e2e-agnostic


@petr-muller (Member)

> [sig-api-machinery][Feature:APIServer][Late] API LBs follow /readyz of kube-apiserver and stop sending requests before server shutdowns for external clients [Suite:openshift/conformance/parallel]

Single failure from the hypershift conformance job, does not seem to be related
/override ci/prow/e2e-hypershift-conformance

openshift-ci bot (Contributor) commented Apr 17, 2024

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-hypershift-conformance


@dis016 commented Apr 17, 2024

/qe-approved

@jiajliu commented Apr 18, 2024

/label qe-approved cc @dis016

@wking (Member, author) commented Apr 18, 2024

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Apr 18, 2024
@openshift-ci-robot (Contributor)

@wking: This pull request references Jira Issue OCPBUGS-9133, which is valid.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016


@openshift-ci openshift-ci bot requested a review from dis016 April 18, 2024 03:28
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD debaaf6 and 2 for PR HEAD 10849d7 in total

@openshift-merge-bot openshift-merge-bot bot merged commit 5e73deb into openshift:master Apr 18, 2024
11 checks passed
@openshift-ci-robot (Contributor)

@wking: Jira Issue OCPBUGS-9133: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9133 has been moved to the MODIFIED state.


@wking wking deleted the metrics-for-no-cluster-version branch April 18, 2024 15:10
@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-version-operator-container-v4.16.0-202404181209.p0.g5e73deb.assembly.stream.el9 for distgit cluster-version-operator.
All builds following this will include this PR.

@openshift-merge-robot (Contributor)

Fix included in accepted release 4.16.0-0.nightly-2024-04-21-123502

wking added a commit to wking/oc that referenced this pull request Apr 24, 2024 (amended versions of the same commit were referenced again on Apr 24 and Apr 29)
…sight

Structure this output, so it gets all the usual pretty-printing,
detail, etc. that the sad-ClusterOperator conditions are getting.
Sometimes Failing will complain about sad ClusterOperators, and in
that case we'll double up on that messaging.  But we're punting on
"consolidate when multiple updateInsights complain about the same root
cause" for now.  And sometimes Failing will complain about other
resources, such as ProgressDeadlineExceeded operator Deployments [1],
and in that case the information is only flowing out through
ClusterVersion, and not via the other resources we check when
rendering status.

The links to ClusterOperatorDegraded are because [2] folded
Failing=True into ClusterOperatorDegraded alerting, although we still
need to update the runbook to address that change.

The *output updates are via:

  $ go build ./cmd/oc
  $ export OC_ENABLE_CMD_UPGRADE_STATUS=true
  $ for X in pkg/cli/admin/upgrade/status/examples/*-cv.yaml; do ./oc adm upgrade status --mock-clusterversion "${X}" > "${X/-cv.yaml/.output}"; ./oc adm upgrade status --detailed=all --mock-clusterversion "${X}" > "${X/-cv.yaml/.detailed-output}"; done

[1]: https://github.com/openshift/cluster-version-operator/blob/1acac06742fb0e3e49ffe2294864007f26a7799d/lib/resourcebuilder/apps.go#L122C124-L122C148
[2]: openshift/cluster-version-operator#746