OCPBUGS-9133: pkg/cvo/metrics: Connect ClusterVersion to ClusterOperatorDown and ClusterOperatorDegraded #746
Conversation
Force-pushed from 30aa6e6 to 43f13d7
@wking: This pull request references Bugzilla bug 2058416, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR.
@wking: This pull request references Bugzilla bug 2058416. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
@wking: This pull request references Bugzilla bug 2058416, which is invalid:
Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR.
@wking: An error was encountered removing this pull request from the external tracker bugs for bug 2058416 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.
Full error message: response code 400 not 200
Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.
/retitle OCPBUGS-9133: pkg/cvo/metrics: Connect ClusterVersion to ClusterOperatorDown and ClusterOperatorDegraded
/retest-required
Testing with Cluster Bot:
$ oc adm cordon -l node-role.kubernetes.io/control-plane=
node/ip-10-0-120-3.us-west-1.compute.internal cordoned
node/ip-10-0-31-66.us-west-1.compute.internal cordoned
node/ip-10-0-71-7.us-west-1.compute.internal cordoned
$ oc -n openshift-authentication delete "$(oc -n openshift-authentication get -o name pods | head -n1)"
pod "oauth-openshift-6cbcb6579f-2fg22" deleted Do the Machine-approver too, for good measure, as I pick on things that should spook cluster-operators without actually hurting cluster performance (I'm not trying to scale up new Machines/Nodes): $ oc -n openshift-cluster-machine-approver delete "$(oc -n openshift-cluster-machine-approver get -o name pods | head -n1)"
pod "machine-approver-74b7866855-z2flp" deleted With operand pods removed, and the cordon blocking its replacement from scheduling, the operators should be grumbling, but even after 15m, they're still all happy: $ oc get -o json clusteroperator | jq -c '.items[].status.conditions[] | select(.type == "Available" or .type == "Degraded") | {type, status}' | sort | uniq -c
33 {"type":"Available","status":"True"}
33 {"type":"Degraded","status":"False"} Maybe I should go after the registry: $ oc adm cordon -l node-role.kubernetes.io/worker=
node/ip-10-0-108-53.us-west-1.compute.internal cordoned
node/ip-10-0-33-86.us-west-1.compute.internal cordoned
node/ip-10-0-87-58.us-west-1.compute.internal cordoned
$ oc delete namespace openshift-image-registry
namespace "openshift-image-registry" deleted Hey, now an operator is mad: $ oc get -o json clusteroperator | jq -c '.items[] | .metadata.name as $n | .status.conditions[] | select((.type == "Available" and .status == "False") or (.type == "Degraded" and .status == "True")) | .name = $n' | sort
{"lastTransitionTime":"2024-04-10T06:48:00Z","message":"1 of 6 credentials requests are failing to sync.","reason":"CredentialsFailing","status":"True","type":"Degraded","name":"cloud-credential"} But that's not one I'd been trying to poke, and then it got happy again. Probably just trying to create the registry's CredentialsRequest Secret, and struggling until the CVO had recreated that namespace. Ah, eventually $ oc get -o json clusteroperator | jq -c '.items[] | .metadata.name as $n | .status.conditions[] | select((.type == "Available" and .status == "False") or (.type == "Degraded" and .status == "True")) | .name = $n' | sort
{"lastTransitionTime":"2024-04-10T06:50:28Z","message":"Failed to resync 4.16.0-0.test-2024-04-10-052103-ci-ln-xtnbb9k-latest because: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 3, ready 0, updated: 3, unavailable: 3, degraded: 0)]]","reason":"RequiredPoolsFailed","status":"True","type":"Degraded","name":"machine-config"} and ~5m later, $ oc get -o json clusteroperator | jq -c '.items[] | .metadata.name as $n | .status.conditions[] | select((.type == "Available" and .status == "False") or (.type == "Degraded" and .status == "True")) | .name = $n' | sort
{"lastTransitionTime":"2024-04-10T06:50:28Z","message":"Failed to resync 4.16.0-0.test-2024-04-10-052103-ci-ln-xtnbb9k-latest because: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 3, ready 0, updated: 3, unavailable: 3, degraded: 0)]]","reason":"RequiredPoolsFailed","status":"True","type":"Degraded","name":"machine-config"}
{"lastTransitionTime":"2024-04-10T06:55:30Z","message":"OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()","reason":"OAuthServerDeployment_UnavailablePod","status":"True","type":"Degraded","name":"authentication"} And the CVO passes these along: $ oc adm upgrade
Failing=True:
Reason: ClusterOperatorsDegraded
Message: Cluster operators authentication, machine-config are degraded
Error while reconciling 4.16.0-0.test-2024-04-10-052103-ci-ln-xtnbb9k-latest: authentication, machine-config has an unknown error: ClusterOperatorsDegraded
...
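The conditions driving that output are also exported through the CVO's cluster_operator_conditions metric, which is what the alert changes in this PR build on. As a hedged sketch (not a query quoted from this PR), something along these lines in the monitoring console would surface the same sad operators as the jq filters above:

# Sketch: list ClusterOperators reporting Available=False or Degraded=True,
# mirroring the jq filters in the transcript above.
cluster_operator_conditions{condition="Available"} == 0
  or
cluster_operator_conditions{condition="Degraded"} == 1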
operator alert renders with an extra |
…usterOperatorDegraded

By adding cluster_operator_up handling for ClusterVersion, with 'version' as the component name, the same way we handle cluster_operator_conditions. This plugs us into ClusterOperatorDown (based on cluster_operator_up) and ClusterOperatorDegraded (based on both cluster_operator_conditions and cluster_operator_up).

I've adjusted the ClusterOperatorDegraded rule so that it fires on ClusterVersion Failing=True and does not fire on Failing=False. Thinking through an update from a release from before this change:

1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with this change.
3. New CVO comes in, starts serving cluster_operator_up{name="version"}.
4. The old ClusterOperatorDegraded rule sees no matching cluster_operator_conditions{name="version",condition="Degraded"}, falls through to cluster_operator_up{name="version"}, and starts cooking the 'for: 30m'.
5. If we go more than 30m before updating the ClusterOperatorDegraded rule to understand Failing, ClusterOperatorDegraded would fire.

We'll need to backport the ClusterOperatorDegraded expr change to one 4.y release before the CVO-metrics change lands to get:

1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with the expr change.
3. Incoming ClusterOperatorDegraded sees no cluster_operator_conditions{name="version",condition="Degraded"}, cluster_operator_conditions{name="version",condition="Failing"} (we hope), or cluster_operator_up{name="version"}, so it doesn't fire. Unless we are Failing=True, in which case, hooray, we'll start alerting about it.
4. User requests an update to a release with the CVO-metrics change.
5. New CVO starts serving cluster_operator_up, just like the fresh-modern-install situation, and everything is great.

The missing-ClusterVersion metrics don't matter all that much today, because the CVO has been creating a replacement ClusterVersion since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45). But it will become more important with [1], which is planning on removing that default creation. When there is no ClusterVersion, we expect ClusterOperatorDown to fire.

The awkward:

{{ "{{ ... \"version\" }} ... {{ end }}" }}

business is because this content is unpacked in two rounds of templating:

1. The cluster-version operator's getPayloadTasks' renderManifest preprocessing for the CVO directory, which is based on Go templates.
2. Prometheus alerting-rule templates, which use console templates [2], which are also based on Go templates [3].

The '{{ "..." }}' wrapping is consumed by the CVO's templating, and the remaining:

{{ ... "version" }} ... {{ end }}

is left for Prometheus' templating.

[1]: openshift#741
[2]: https://prometheus.io/docs/prometheus/2.51/configuration/alerting_rules/#templating
[3]: https://prometheus.io/docs/visualization/consoles/
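For orientation, here is a rough PromQL sketch of the two rules described above. It illustrates the wiring the commit message names (cluster_operator_up and cluster_operator_conditions), not the exact expressions from the shipped manifests:

# ClusterOperatorDown sketch: fires when a component's cluster_operator_up
# drops to 0; per the note above, a missing ClusterVersion after [1] should
# also land here via cluster_operator_up{name="version"}.
max by (namespace, name, reason) (cluster_operator_up == 0)

# ClusterOperatorDegraded sketch: fires on Degraded=True, on ClusterVersion
# Failing=True, or falls back to cluster_operator_up for components exposing
# no matching condition; the real rule also holds 'for: 30m' before firing.
max by (namespace, name, reason) (
  (
    cluster_operator_conditions{condition="Degraded"}
      or
    cluster_operator_conditions{name="version",condition="Failing"}
      or on (namespace, name)
    group by (namespace, name) (cluster_operator_up)
  ) == 1
)

Folding ClusterVersion in under name="version" is what lets a single pair of rules cover both ClusterOperators and the CVO itself.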
Force-pushed from 74312d1 to 10849d7
/retest
@@ -247,6 +247,7 @@ func (o *Options) run(ctx context.Context, controllerCtx *Context, lock resource
 	}
 	klog.Infof("Failed to initialize from payload; shutting down: %v", err)
 	resultChannel <- asyncResult{name: "payload initialization", error: firstError}
+	return
This was the panic?
yup, essay in 2952a2f ;)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: petr-muller, wking
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Single failure from the hypershift conformance job; does not seem to be related.
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-hypershift-conformance
/qe-approved
/label qe-approved cc @dis016
/label qe-approved
@wking: This pull request references Jira Issue OCPBUGS-9133, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact.
@wking: Jira Issue OCPBUGS-9133: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-9133 has been moved to the MODIFIED state.
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-version-operator-container-v4.16.0-202404181209.p0.g5e73deb.assembly.stream.el9 for distgit cluster-version-operator.
Fix included in accepted release 4.16.0-0.nightly-2024-04-21-123502
…sight

Structure this output, so it gets all the usual pretty-printing, detail, etc. that the sad-ClusterOperator conditions are getting.

Sometimes Failing will complain about sad ClusterOperators, and in that case we'll double up on that messaging. But we're punting on "consolidate when multiple updateInsights complain about the same root cause" for now. And sometimes Failing will complain about other resources, such as ProgressDeadlineExceeded operator Deployments [1], and in that case the information is only flowing out through ClusterVersion, and not via the other resources we check when rendering status.

The links to ClusterOperatorDegraded are because [2] folded Failing=True into ClusterOperatorDegraded alerting, although we still need to update the runbook to address that change.

The *output updates are via:

$ go build ./cmd/oc
$ export OC_ENABLE_CMD_UPGRADE_STATUS=true
$ for X in pkg/cli/admin/upgrade/status/examples/*-cv.yaml; do ./oc adm upgrade status --mock-clusterversion "${X}" > "${X/-cv.yaml/.output}"; ./oc adm upgrade status --detailed=all --mock-clusterversion "${X}" > "${X/-cv.yaml/.detailed-output}"; done

[1]: https://github.com/openshift/cluster-version-operator/blob/1acac06742fb0e3e49ffe2294864007f26a7799d/lib/resourcebuilder/apps.go#L122C124-L122C148
[2]: openshift/cluster-version-operator#746