pkg/cvo/metrics: Connect ClusterVersion to ClusterOperatorDown and ClusterOperatorDegraded

By adding cluster_operator_up handling for ClusterVersion, with
'version' as the component name, the same way we handle
cluster_operator_conditions.  This plugs us into ClusterOperatorDown
(based on cluster_operator_up) and ClusterOperatorDegraded (based on
both cluster_operator_conditions and cluster_operator_up).

I've adjusted the ClusterOperatorDegraded rule so that it fires on
ClusterVersion Failing=True and does not fire on Failing=False.
Thinking through an update from a release that predates this change:

1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with this change.
3. New CVO comes in, starts serving
   cluster_operator_up{name="version"}.
4. Old ClusterOperatorDegraded sees no matching
   cluster_operator_conditions{name="version",condition="Degraded"},
   falls through to cluster_operator_up{name="version"}, and starts
   cooking the 'for: 30m'.
5. If we go more than 30m before updating the ClusterOperatorDegraded
   rule to understand Failing, ClusterOperatorDegraded would fire.

We'll need to backport the ClusterOperatorDegraded expr change to one
4.y release before the CVO-metrics change lands to get:

1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with the expr change.
3. Incoming ClusterOperatorDegraded sees no
   cluster_operator_conditions{name="version",condition="Degraded"},
   cluster_operator_conditions{name="version",condition="Failing"} (we
   hope), or cluster_operator_up{name="version"}, so it doesn't fire.
   Unless we are Failing=True, in which case, hooray, we'll start
   alerting about it.
4. User requests an update to a release with the CVO-metrics change.
5. New CVO starts serving cluster_operator_up, just like the
   fresh-modern-install situation, and everything is great.
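
The two update sequences above can be sketched as a toy evaluation of the alert expression for a single component.  This is a simplified model, not the PromQL engine: it ignores the 'reason' label, the 'max by' aggregation, and the 'for: 30m' delay.  The names Series and pending are hypothetical, and the sample data assumes the outgoing CVO already serves Failing for name="version".

```go
package main

import "fmt"

// Series is a hypothetical, simplified view of what one component exposes.
type Series struct {
	// Conditions holds cluster_operator_conditions values by condition
	// type: 1 for True, 0 for False. A missing key means no series.
	Conditions map[string]float64
	// UpServed reports whether a cluster_operator_up series exists.
	UpServed bool
}

// pending mimics "(A or B or group(up)) == 1" for one component.
// PromQL's "or" keeps the left-hand sample when one exists, so the first
// condition type that has a series decides the result. The final
// "group by (...) (cluster_operator_up)" operand yields 1 whenever a
// cluster_operator_up series exists, regardless of its value.
func pending(s Series, conditionTypes ...string) bool {
	for _, t := range conditionTypes {
		if v, ok := s.Conditions[t]; ok {
			return v == 1
		}
	}
	return s.UpServed
}

func main() {
	// Assumption for illustration: Failing=False on a healthy cluster.
	oldCVO := Series{Conditions: map[string]float64{"Failing": 0}}
	newCVO := Series{Conditions: map[string]float64{"Failing": 0}, UpServed: true}
	failing := Series{Conditions: map[string]float64{"Failing": 1}, UpServed: true}

	// The old rule only looks at Degraded; for name="version" the new
	// rule looks at Failing instead.
	fmt.Println("old rule, old CVO:", pending(oldCVO, "Degraded"))
	fmt.Println("old rule, new CVO:", pending(newCVO, "Degraded"))
	fmt.Println("new rule, old CVO:", pending(oldCVO, "Failing"))
	fmt.Println("new rule, new CVO:", pending(newCVO, "Failing"))
	fmt.Println("new rule, Failing=True:", pending(failing, "Failing"))
}
```

Only the "old rule, new CVO" pairing goes pending on a healthy cluster in this model, which is the window the backport is meant to close.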

The missing-ClusterVersion metrics don't matter all that much today,
because the CVO has been creating a replacement ClusterVersion since at
least 90e9881 (cvo: Change the core CVO loops to report status to
ClusterVersion, 2018-11-02, #45).  But they will become more important
with [1], which plans to remove that default creation.  When
there is no ClusterVersion, we expect ClusterOperatorDown to fire.
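
As a rough illustration of the value the collector reports for cluster_operator_up{name="version"}, here is a minimal sketch.  The clusterVersion type and upValue function are hypothetical stand-ins, not the CVO's real types, and it assumes ClusterOperatorDown fires once cluster_operator_up stays at 0 for long enough, which this message implies but does not spell out.

```go
package main

import "fmt"

// clusterVersion is a hypothetical stand-in for the ClusterVersion object.
type clusterVersion struct {
	// Conditions maps a condition type to its status string.
	Conditions map[string]string
}

// upValue returns 1 only when the ClusterVersion exists and reports
// Available=True. A missing object (nil) or any other Available status
// yields 0.
func upValue(cv *clusterVersion) float64 {
	if cv == nil {
		return 0
	}
	if cv.Conditions["Available"] == "True" {
		return 1
	}
	return 0
}

func main() {
	available := &clusterVersion{Conditions: map[string]string{"Available": "True"}}
	unavailable := &clusterVersion{Conditions: map[string]string{"Available": "False"}}

	fmt.Println("missing ClusterVersion:", upValue(nil))
	fmt.Println("Available=True:", upValue(available))
	fmt.Println("Available=False:", upValue(unavailable))
}
```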

[1]: #741
wking committed Mar 18, 2024
1 parent 8f4a120 commit af22e20
Showing 2 changed files with 23 additions and 3 deletions.
@@ -112,7 +112,9 @@ spec:
 max by (namespace, name, reason)
 (
   (
-    cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
+    cluster_operator_conditions{job="cluster-version-operator", name!="version", condition="Degraded"}
+    or on (namespace, name)
+    cluster_operator_conditions{job="cluster-version-operator", name="version", condition="Failing"}
     or on (namespace, name)
     group by (namespace, name) (cluster_operator_up{job="cluster-version-operator"})
   ) == 1
22 changes: 20 additions & 2 deletions pkg/cvo/metrics.go
@@ -374,7 +374,16 @@ func (m *operatorMetrics) Collect(ch chan<- prometheus.Metric) {
 	current := m.optr.currentVersion()
 	var completed configv1.UpdateHistory
 
-	if cv, err := m.optr.cvLister.Get(m.optr.name); err == nil {
+	if cv, err := m.optr.cvLister.Get(m.optr.name); apierrors.IsNotFound(err) {
+		g := m.clusterOperatorUp.WithLabelValues("version", "")
+		g.Set(0)
+		ch <- g
+
+		g = m.clusterOperatorConditions.WithLabelValues("version", string(configv1.OperatorAvailable), "ClusterVersionNotFound")
+		g.Set(0)
+		ch <- g
+	} else if err == nil {
+
 		// output cluster version
 
 		var initial configv1.UpdateHistory
@@ -482,7 +491,16 @@ func (m *operatorMetrics) Collect(ch chan<- prometheus.Metric) {
 				klog.V(2).Infof("skipping metrics for ClusterVersion condition %s=%s (neither True nor False)", condition.Type, condition.Status)
 				continue
 			}
-			g := m.clusterOperatorConditions.WithLabelValues("version", string(condition.Type), string(condition.Reason))
+
+			g := m.clusterOperatorUp.WithLabelValues("version", completed.Version)
+			if resourcemerge.IsOperatorStatusConditionTrue(cv.Status.Conditions, configv1.OperatorAvailable) {
+				g.Set(1)
+			} else {
+				g.Set(0)
+			}
+			ch <- g
+
+			g = m.clusterOperatorConditions.WithLabelValues("version", string(condition.Type), string(condition.Reason))
 			if condition.Status == configv1.ConditionTrue {
 				g.Set(1)
 			} else {
