From 75e8efa19d3b8fb5c7233e3af45f2d0289a3740c Mon Sep 17 00:00:00 2001
From: Hongkai Liu
Date: Fri, 5 Sep 2025 15:39:20 -0400
Subject: [PATCH 1/3] NO-JIRA: New rules about CO's Degraded and Available conditions

The essence of the new rules is that operators MUST NOT go
Available=False or Degraded=True in an HA cluster during an
uneventful CI upgrade.

Those rules have applied in CI for a while [1, 2] and OCPBugs have
been filed in this area. To keep CI from failing, many exceptions
have been added to the tests [3, 4] because many of those bugs are
still open. Effort is expected to be invested in delivering fixes
for those bugs.

[1]. https://issues.redhat.com/browse/OTA-700
[2]. https://issues.redhat.com/browse/TRT-1578
[3]. https://github.com/openshift/origin/blob/2af38a7807699b3046a73f931884152a11271d21/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L102
[4]. https://github.com/openshift/origin/pull/27231
---
 config/v1/types_cluster_operator.go | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/config/v1/types_cluster_operator.go b/config/v1/types_cluster_operator.go
index a447adb9f4a..6ad5f576b97 100644
--- a/config/v1/types_cluster_operator.go
+++ b/config/v1/types_cluster_operator.go
@@ -154,6 +154,7 @@ const (
 	// is functional and available in the cluster. Available=False means at least
 	// part of the component is non-functional, and that the condition requires
 	// immediate administrator intervention.
+	// A component must not report Available=False during the course of a normal upgrade.
 	OperatorAvailable ClusterStatusConditionType = "Available"
 
 	// Progressing indicates that the component (operator and all configured operands)
@@ -175,7 +176,7 @@ const (
 	// Degraded because it may have a lower quality of service. A component may be
 	// Progressing but not Degraded because the transition from one state to
 	// another does not persist over a long enough period to report Degraded. A
-	// component should not report Degraded during the course of a normal upgrade.
+	// component must not report Degraded during the course of a normal upgrade.
 	// A component may report Degraded in response to a persistent infrastructure
 	// failure that requires eventual administrator intervention. For example, if
 	// a control plane host is unhealthy and must be replaced. A component should

From 5d8d667e7eb21f85ab1e24c01e9c224c07f4cc7a Mon Sep 17 00:00:00 2001
From: Hongkai Liu
Date: Fri, 5 Sep 2025 16:05:17 -0400
Subject: [PATCH 2/3] A New rule about CO's update duration

Operators MUST complete their upgrade within 20 minutes in a cluster
of up to 250 nodes, except for the Machine Config Operator, which has
90 minutes.

This formalizes the changes introduced in
cluster-version-operator#1165 [1], where the CVO begins complaining
(Failing=Unknown) whenever a cluster operator takes longer to upgrade
than the given time.

[1]. https://github.com/openshift/cluster-version-operator/pull/1165
---
 config/v1/types_cluster_operator.go | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/config/v1/types_cluster_operator.go b/config/v1/types_cluster_operator.go
index 6ad5f576b97..d8334e415e8 100644
--- a/config/v1/types_cluster_operator.go
+++ b/config/v1/types_cluster_operator.go
@@ -164,6 +164,9 @@ const (
 	// state. If the observed cluster state has changed and the component is
 	// reacting to it (scaling up for instance), Progressing should become true
 	// since it is moving from one steady state to another.
+	// A component in a cluster with less than 250 nodes must complete a version
+	// change within a limited period of time: 90 minutes for Machine Config Operator and 20 minutes for others.
+	// Machine Config Operator is given more time as it needs to restart control plane nodes.
 	OperatorProgressing ClusterStatusConditionType = "Progressing"
 
 	// Degraded indicates that the component (operator and all configured operands)

From 0aed54e62fa01ab6d69c7428d378b4dc1c98ae83 Mon Sep 17 00:00:00 2001
From: Hongkai Liu
Date: Mon, 8 Sep 2025 15:29:07 -0400
Subject: [PATCH 3/3] Add more details about the Progressing condition

A "version change" counts as a config change, so cluster operators
must report Progressing for it.

Conversely, cluster operators should not report Progressing ONLY
because they own a DaemonSet that is acting on a new node or a
rebooting node. Such events may happen too often in a heavily scaled
cluster and would make Progressing less useful for clients that hope
to use it to track cluster version updates.
---
 config/v1/types_cluster_operator.go | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/config/v1/types_cluster_operator.go b/config/v1/types_cluster_operator.go
index d8334e415e8..1152182f862 100644
--- a/config/v1/types_cluster_operator.go
+++ b/config/v1/types_cluster_operator.go
@@ -158,11 +158,13 @@ const (
 	OperatorAvailable ClusterStatusConditionType = "Available"
 
 	// Progressing indicates that the component (operator and all configured operands)
-	// is actively rolling out new code, propagating config changes, or otherwise
+	// is actively rolling out new code, propagating config changes (e.g., a version change), or otherwise
 	// moving from one steady state to another. Operators should not report
-	// progressing when they are reconciling (without action) a previously known
-	// state. If the observed cluster state has changed and the component is
-	// reacting to it (scaling up for instance), Progressing should become true
+	// Progressing when they are reconciling (without action) a previously known
+	// state. Operators should not report Progressing only because DaemonSets owned by them
+	// are adjusting to a new node from cluster scaleup or a node rebooting from cluster upgrade.
+	// If the observed cluster state has changed and the component is
+	// reacting to it (updated proxy configuration for instance), Progressing should become true
 	// since it is moving from one steady state to another.
 	// A component in a cluster with less than 250 nodes must complete a version
 	// change within a limited period of time: 90 minutes for Machine Config Operator and 20 minutes for others.
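
The time limits stated in the Progressing comment can be sketched as a small check. This is only an illustration of the rule as worded in the comment (clusters with less than 250 nodes, 90 minutes for the Machine Config Operator, 20 minutes for everything else). It is not the cluster-version-operator's implementation. The function names `updateWindow` and `overdue`, the constants, and the use of "machine-config" as the operator name are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Limits copied from the comment text in the patch; the constant names are hypothetical.
const (
	maxNodesCovered     = 250 // the rule covers clusters with fewer nodes than this
	defaultUpdateWindow = 20 * time.Minute
	machineConfigWindow = 90 * time.Minute
)

// machineConfigOperator is the assumed ClusterOperator name of the Machine Config Operator.
const machineConfigOperator = "machine-config"

// updateWindow returns the time allowed for a version change and whether
// the rule applies at all for a cluster of the given size.
func updateWindow(operator string, nodes int) (time.Duration, bool) {
	if nodes >= maxNodesCovered {
		return 0, false // the comment makes no promise for larger clusters
	}
	if operator == machineConfigOperator {
		return machineConfigWindow, true
	}
	return defaultUpdateWindow, true
}

// overdue reports whether an operator has spent longer on a version change
// than the rule allows. It is always false when the rule does not apply.
func overdue(operator string, nodes int, elapsed time.Duration) bool {
	limit, ok := updateWindow(operator, nodes)
	return ok && elapsed > limit
}

func main() {
	fmt.Println(overdue("etcd", 6, 25*time.Minute))           // true: past 20 minutes
	fmt.Println(overdue("machine-config", 6, 25*time.Minute)) // false: within 90 minutes
	fmt.Println(overdue("etcd", 300, 25*time.Minute))         // false: cluster too large for the rule
}
```

Returning a separate boolean for "rule applies" keeps the larger-cluster case distinct from a zero-length window, so a caller cannot mistake "no limit defined" for "already overdue".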