OTA-1159: [2/3] Maintain ReconciliationIssues condition #1032

@petr-muller petr-muller commented Jan 31, 2024

cvo: set ResourceReconciliationIssues condition

When an appropriate feature flag is set, maintain a ResourceReconciliationIssues
condition on the CV status. This condition is False when no issues were
encountered (signalled by the Failure field on the SyncWorkerStatus
parameter) and True otherwise.


cvo: read enabled feature flags from cluster

Read CVO-related Feature Flags from the cluster resource and propagate
them to CVO controllers. Multiplex the coarse cluster feature flag into
smaller, CVO-specific flags for easier maintenance in the future.

Builds on top of #1031


Because of OCPBUGS-30080, CVO cannot easily determine its running version with a single os.Getenv() call, as other operators can. CVO can determine its version from the initial payload it loads from disk, but this happens a bit later in the code flow, after the leadership lease is acquired and all informers are started. At that point we can provide the feature gate / feature set knowledge to the structures that need it: the actual CVO controller and the feature ChangeStopper. These structures also need to be initialized earlier, though (they require the informers, which by then are already started).

This leads to a slightly awkward delayed initialization scheme: the controller structures are initialized early and populated with early content such as informers. Later, when the informers are started and CVO loads its initial payload, we can extract the version from it and use it to populate the feature gates in the controller structures. Because the enabled feature gates become available only later in the flow, parts of the CVO code (such as controller initialization or initial payload loading) cannot be gated by a feature gate. We do not need that now, but it may cause issues later.

The high-level sequence after this commit looks like this:

  1. Initialize CVO and ChangeStopper controller structures with the informers they need, and populate CVO's enabledFeatureGate checker with one that panics when used (no code can check for gates before we know them)
  2. Acquire lease and start the informers
  3. Fetch a FeatureGate resource from the cluster (using an informer) and determine the FeatureSet from it (needed to load the payload)
  4. Load the initial payload from disk and extract the version from it
  5. Use the version to determine the enabled feature gates from the FeatureGate resource
  6. Populate the CVO and ChangeStopper controller structures with the newly discovered feature gates
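
The delayed initialization in the steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the `FeatureGates` interface, `panicOnUse`, `knownGates`, `NewOperator`, and `SetGates` are hypothetical names chosen for the example.

```go
package main

import "fmt"

// FeatureGates is a hypothetical interface for querying CVO-specific gates.
type FeatureGates interface {
	ResourceReconciliationIssues() bool
}

// panicOnUse is the placeholder checker installed before gates are known (step 1).
type panicOnUse struct{}

func (panicOnUse) ResourceReconciliationIssues() bool {
	panic("feature gates were queried before they were loaded from the cluster")
}

// knownGates holds the gates resolved for the running version (step 5).
type knownGates struct{ rri bool }

func (g knownGates) ResourceReconciliationIssues() bool { return g.rri }

// Operator is a stand-in for the CVO controller structure.
type Operator struct {
	enabledFeatureGates FeatureGates
}

// NewOperator performs the early initialization (step 1).
func NewOperator() *Operator {
	return &Operator{enabledFeatureGates: panicOnUse{}}
}

// SetGates performs the delayed population (step 6).
func (o *Operator) SetGates(g FeatureGates) { o.enabledFeatureGates = g }

// queriedTooEarly reports whether asking the operator for a gate panics.
func queriedTooEarly(o *Operator) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	o.enabledFeatureGates.ResourceReconciliationIssues()
	return false
}

func main() {
	op := NewOperator()
	fmt.Println("early query panics:", queriedTooEarly(op))
	op.SetGates(knownGates{rri: true})
	fmt.Println("late query panics:", queriedTooEarly(op))
}
```

The panicking placeholder turns an accidental early gate check into a loud failure instead of a silent "disabled" answer.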

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 31, 2024
openshift-ci-robot commented Jan 31, 2024

@petr-muller: This pull request references OTA-1159 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

cvo: set ResourceReconciliationIssues condition

When an appropriate feature flag is set, maintain a ResourceReconciliationIssues
condition on the CV status. This condition is False when no issues were
encountered (signalled by the Failure field on the SyncWorkerStatus
parameter) and True otherwise.


cvo: read enabled feature flags from cluster

Read CVO-related Feature Flags from the cluster resource and propagate
them to CVO controllers. Multiplex the coarse cluster feature flag into
smaller, CVO-specific flags for easier maintenance in the future.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@petr-muller petr-muller force-pushed the ota-1159-result-of-work-rri-condition branch from a96e7bd to 3fd6099 Compare January 31, 2024 14:41
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2024
@petr-muller petr-muller force-pushed the ota-1159-result-of-work-rri-condition branch 3 times, most recently from 4e58482 to 33e06ec Compare January 31, 2024 16:34
@petr-muller
Member Author

/retest

pkg/cvo/cvo.go (outdated)
var enabledGates cvo.FeatureGates
for _, g := range gate.Status.FeatureGates {
	if g.Version != version.Version.String() {
		continue

Member

I see this approach discussed briefly in openshift/api#1405, but it feels a bit weird. If a gate claims enabled in 4.y.z and we happen to be running a different version (e.g. because we're updating into a new CVO) it looks like we'll disable all the gates until whatever manages FeatureGate catches up with us. But we don't want to blip these off early in the update, do we? We might be unique vs. other existing consumers in going first in the update, while most other consumers trail behind the Kube API server operator (except etcd, which is in parallel). Anyhow, I'm not entirely sure how this plays out, and it would be nice to have the commit message talk me through the behavior as we understand it :)

Member Author

Cool, I think I misunderstood how exactly the FeatureGate works during upgrades. For some reason I believed the resource is managed by the config-operator and that CVO is upgraded after config-operator, but that does not seem to be true (0000_00_cluster-version-operator_03_deployment goes before 0000_10_config-operator_08_clusteroperator). I'm clearly the ideal person to write these PRs :-P

I started following o/enhancements dev-guide/featuresets.md, which led me to FeatureGateAccess, where the controller errors and requeues until the gates are set, and the dev-guide setup code makes the controller startup block until that happens. That felt scary for CVO, so I thought running briefly without gates enabled (which is fine for the current state of gates in CVO, where there is one which is off by default) is acceptable (aggravated by my false belief that featuregatestopper would kill us once status finally updates).

I guess our only two real options are to assume some sane default or to block? And I guess we cannot really block because of our role in upgrades: we upgrade before config-operator, so we would never see the new version's gates?

You are totally correct that whatever corners we cut, we need to document them better. I'll spend more time looking into this and at the very least write much better commit messages.

Member Author

Addressed by adding the UnknownVersion meta-flag that informs flag consumer code in CVO that we were not able to read enabled/disabled features for our version, so we should follow some sane default behavior, but can tolerate evidence of previously enabled features if we find them in the cluster.

It brings some complexity but I think it is the only solution for CVO to tolerate FeatureGate status not having a matching version while not blocking startup on that.
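
A rough sketch of how such an `UnknownVersion` meta-flag could be derived, assuming simplified stand-ins for the FeatureGate status types. The `CvoGates` struct and the version-taking signature of `cvoGatesFrom` are hypothetical here, and the mapping from the cluster-level `UpgradeStatus` gate to the CVO-specific flag is an assumption made for illustration.

```go
package main

import "fmt"

// FeatureGateAttributes and FeatureGateDetails are simplified stand-ins
// for the entries found under FeatureGate .status.featureGates.
type FeatureGateAttributes struct{ Name string }

type FeatureGateDetails struct {
	Version  string
	Enabled  []FeatureGateAttributes
	Disabled []FeatureGateAttributes
}

// CvoGates is a hypothetical CVO-specific view of the cluster feature gates.
type CvoGates struct {
	// UnknownVersion is true when no status entry matched the CVO version.
	UnknownVersion bool
	// ResourceReconciliationIssues gates the condition maintained by this PR.
	ResourceReconciliationIssues bool
}

// cvoGatesFrom multiplexes the coarse cluster gates into CVO-specific flags.
// When no entry matches cvoVersion it reports UnknownVersion instead of
// pretending that every gate is disabled.
func cvoGatesFrom(status []FeatureGateDetails, cvoVersion string) CvoGates {
	gates := CvoGates{UnknownVersion: true}
	for _, g := range status {
		if g.Version != cvoVersion {
			continue
		}
		gates.UnknownVersion = false
		for _, enabled := range g.Enabled {
			// Assumption: UpgradeStatus is the cluster gate behind this flag.
			if enabled.Name == "UpgradeStatus" {
				gates.ResourceReconciliationIssues = true
			}
		}
	}
	return gates
}

func main() {
	status := []FeatureGateDetails{{
		Version: "4.15.0",
		Enabled: []FeatureGateAttributes{{Name: "UpgradeStatus"}},
	}}
	fmt.Printf("%+v\n", cvoGatesFrom(status, "4.15.0")) // matching version
	fmt.Printf("%+v\n", cvoGatesFrom(status, "4.16.0")) // CVO runs ahead of the status
}
```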

@@ -180,6 +201,7 @@ func (o *Options) Run(ctx context.Context) error {
	return false, nil
default:
	startingFeatureSet = string(gate.Spec.FeatureSet)
	enabledGates = cvoGatesFrom(gate)

Member

Now that we're going to start depending on cluster FeatureGate status here, we'll need something more granular than our current feature-set watcher to notice those changes and keep us up to date. Maybe pkg/featurechangestopper (already watching/polling) should grow an API to call a callback on changes, and we have a lock on Operator and a goroutine that updates Operator.enabledGates on changes? Or rolls the CVO container? Or...?

Member Author

I mistakenly believed that featurechangestopper would kill us on any change but that's really not what happens. library-go's FeatureGateAccess terminates when the set of enabled/disabled gates changes. Our featurechangestopper terminates when featureset changes.

I think for our case it could be appropriate to do something like what FeatureGateAccess does and simply terminate on observed gate changes, probably check just the ones that CVO cares about.

Member Author

Addressed in c1b2767 by making featurechangestopper terminate CVO when CVO-relevant flags change.

		expectedRriCondition: nil,
	},
	{
		name: "RRI disabled, version unknown, failure => condition present",

Member

nit: maybe add "existing condition" to the name? And a test case for RRI disabled, version unknown, no existing condition, failure => condition not present?

Member Author

👍 added in 3da7615
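
The cases discussed in this thread can be summarized by a small decision function. This is a sketch under assumptions, not the test code added in 3da7615: `gates` and `shouldMaintainRRI` are hypothetical names, and the behavior for "RRI disabled, version known" (condition not maintained) is an assumption.

```go
package main

import "fmt"

// gates is a hypothetical, reduced view of the CVO feature gates.
type gates struct {
	rriEnabled     bool
	unknownVersion bool
}

// shouldMaintainRRI decides whether the ResourceReconciliationIssues
// condition should be present on the ClusterVersion status.
func shouldMaintainRRI(g gates, existingCondition bool) bool {
	if g.rriEnabled {
		return true
	}
	// With an unknown version, an existing condition is treated as evidence
	// that the feature was enabled earlier, so it keeps being maintained.
	return g.unknownVersion && existingCondition
}

func main() {
	cases := []struct {
		name     string
		g        gates
		existing bool
		want     bool
	}{
		{"RRI enabled, failure => condition present", gates{rriEnabled: true}, false, true},
		{"RRI disabled, version unknown, existing condition, failure => condition present", gates{unknownVersion: true}, true, true},
		{"RRI disabled, version unknown, no existing condition, failure => condition not present", gates{unknownVersion: true}, false, false},
	}
	for _, tc := range cases {
		got := shouldMaintainRRI(tc.g, tc.existing)
		fmt.Printf("%s: got=%v want=%v\n", tc.name, got, tc.want)
	}
}
```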

@petr-muller
Member Author

/test all

@petr-muller
Member Author

/test e2e-agnostic-operator

@petr-muller
Member Author

/test lint
/test unit

@petr-muller
Member Author

Huh for some reason this PR is not picking up recent commits like petr-muller@24b845e

@petr-muller
Member Author

/close

@openshift-ci openshift-ci bot closed this Mar 1, 2024
openshift-ci bot commented Mar 1, 2024

@petr-muller: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@petr-muller petr-muller force-pushed the ota-1159-result-of-work-rri-condition branch from 9929ed6 to 24b845e Compare March 1, 2024 15:43
@petr-muller
Member Author

/reopen

@openshift-ci openshift-ci bot reopened this Mar 1, 2024
openshift-ci-robot commented Mar 1, 2024

@petr-muller: This pull request references OTA-1159 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

@JianLi-RH

@petr-muller Please take a look at the above test. In step 6, ResourceReconciliationIssues is True, but the message is not JSON:

  {
    "lastTransitionTime": "2024-03-27T09:55:25Z",
    "message": "Issues found during resource reconciliation: Cluster operator control-plane-machine-set is not available",
    "reason": "IssuesFound",
    "status": "True",
    "type": "ResourceReconciliationIssues"
  }

JianLi-RH commented Mar 27, 2024

Upgrade cluster from normal 4.15 to latest with this PR:

  1. setup a cluster
    https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/273129/
[jianl@localhost 415]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-03-26-221137   True        False         10m     Cluster version is 4.15.0-0.nightly-2024-03-26-221137
  2. Build an image:
    build 4.16,openshift/cluster-version-operator#1032
  3. Upgrade to the above image:
[jianl@localhost 415]$ oc adm upgrade --to-image registry.build03.ci.openshift.org/ci-ln-xt6n3jt/release:latest --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build03.ci.openshift.org/ci-ln-xt6n3jt/release:latest
[jianl@localhost 415]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-03-26-221137   True        True          3m32s   Working towards 4.16.0-0.test-2024-03-27-061445-ci-ln-xt6n3jt-latest: 108 of 886 done (12% complete), waiting on etcd, kube-apiserver
[jianl@localhost 415]$ 
  4. Check the upgrade:
[jianl@localhost 415]$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.test-2024-03-27-061445-ci-ln-xt6n3jt-latest   True        False         7m38s   Cluster version is 4.16.0-0.test-2024-03-27-061445-ci-ln-xt6n3jt-latest
[jianl@localhost 415]$ 
  5. Check featuregate:
[jianl@localhost 415]$ oc get featuregate cluster -ojson
{
    "apiVersion": "config.openshift.io/v1",
    "kind": "FeatureGate",
    "metadata": {
        "creationTimestamp": "2024-03-27T09:36:04Z",
        "generation": 1,
        "name": "cluster",
        "resourceVersion": "36681",
        "uid": "6aa98c82-46f1-4a7d-b58a-1bf0470f0b47"
    },
    "spec": {},
    "status": {
        "featureGates": [
            {
                "disabled": [
                    {
                        "name": "AdminNetworkPolicy"
                    },
                    {
                        "name": "AlertingRules"
                    },
                    {
                        "name": "AutomatedEtcdBackup"
                    },
                    {
                        "name": "CSIDriverSharedResource"
                    },
                    {
                        "name": "ClusterAPIInstall"
                    },
                    {
                        "name": "DNSNameResolver"
                    },
                    {
                        "name": "DisableKubeletCloudCredentialProviders"
                    },
                    {
                        "name": "DynamicResourceAllocation"
                    },
                    {
                        "name": "EventedPLEG"
                    },
                    {
                        "name": "Example"
                    },
                    {
                        "name": "ExternalOIDC"
                    },
                    {
                        "name": "ExternalRouteCertificate"
                    },
                    {
                        "name": "GCPClusterHostedDNS"
                    },
                    {
                        "name": "GCPLabelsTags"
                    },
                    {
                        "name": "GatewayAPI"
                    },
                    {
                        "name": "HardwareSpeed"
                    },
                    {
                        "name": "ImagePolicy"
                    },
                    {
                        "name": "InsightsConfig"
                    },
                    {
                        "name": "InsightsConfigAPI"
                    },
                    {
                        "name": "InsightsOnDemandDataGather"
                    },
                    {
                        "name": "InstallAlternateInfrastructureAWS"
                    },
                    {
                        "name": "MachineAPIOperatorDisableMachineHealthCheckController"
                    },
                    {
                        "name": "MachineAPIProviderOpenStack"
                    },
                    {
                        "name": "MachineConfigNodes"
                    },
                    {
                        "name": "ManagedBootImages"
                    },
                    {
                        "name": "MaxUnavailableStatefulSet"
                    },
                    {
                        "name": "MetricsServer"
                    },
                    {
                        "name": "MixedCPUsAllocation"
                    },
                    {
                        "name": "NewOLM"
                    },
                    {
                        "name": "NodeDisruptionPolicy"
                    },
                    {
                        "name": "NodeSwap"
                    },
                    {
                        "name": "OnClusterBuild"
                    },
                    {
                        "name": "PinnedImages"
                    },
                    {
                        "name": "PlatformOperators"
                    },
                    {
                        "name": "RouteExternalCertificate"
                    },
                    {
                        "name": "SignatureStores"
                    },
                    {
                        "name": "SigstoreImageVerification"
                    },
                    {
                        "name": "TranslateStreamCloseWebsocketRequests"
                    },
                    {
                        "name": "UpgradeStatus"
                    },
                    {
                        "name": "ValidatingAdmissionPolicy"
                    },
                    {
                        "name": "VolumeGroupSnapshot"
                    }
                ],
                "enabled": [
                    {
                        "name": "AlibabaPlatform"
                    },
                    {
                        "name": "AzureWorkloadIdentity"
                    },
                    {
                        "name": "BareMetalLoadBalancer"
                    },
                    {
                        "name": "BuildCSIVolumes"
                    },
                    {
                        "name": "CloudDualStackNodeIPs"
                    },
                    {
                        "name": "ExternalCloudProvider"
                    },
                    {
                        "name": "ExternalCloudProviderAzure"
                    },
                    {
                        "name": "ExternalCloudProviderExternal"
                    },
                    {
                        "name": "ExternalCloudProviderGCP"
                    },
                    {
                        "name": "KMSv1"
                    },
                    {
                        "name": "NetworkLiveMigration"
                    },
                    {
                        "name": "OpenShiftPodSecurityAdmission"
                    },
                    {
                        "name": "PrivateHostedZoneAWS"
                    },
                    {
                        "name": "VSphereControlPlaneMachineSet"
                    },
                    {
                        "name": "VSphereStaticIPs"
                    }
                ],
                "version": "4.16.0-0.test-2024-03-27-061445-ci-ln-xt6n3jt-latest"
            },
            {
                "disabled": [
                    {
                        "name": "AdminNetworkPolicy"
                    },
                    {
                        "name": "AutomatedEtcdBackup"
                    },
                    {
                        "name": "CSIDriverSharedResource"
                    },
                    {
                        "name": "ClusterAPIInstall"
                    },
                    {
                        "name": "DNSNameResolver"
                    },
                    {
                        "name": "DisableKubeletCloudCredentialProviders"
                    },
                    {
                        "name": "DynamicResourceAllocation"
                    },
                    {
                        "name": "EventedPLEG"
                    },
                    {
                        "name": "GCPClusterHostedDNS"
                    },
                    {
                        "name": "GCPLabelsTags"
                    },
                    {
                        "name": "GatewayAPI"
                    },
                    {
                        "name": "InsightsConfigAPI"
                    },
                    {
                        "name": "InstallAlternateInfrastructureAWS"
                    },
                    {
                        "name": "MachineAPIOperatorDisableMachineHealthCheckController"
                    },
                    {
                        "name": "MachineAPIProviderOpenStack"
                    },
                    {
                        "name": "MachineConfigNodes"
                    },
                    {
                        "name": "ManagedBootImages"
                    },
                    {
                        "name": "MaxUnavailableStatefulSet"
                    },
                    {
                        "name": "MetricsServer"
                    },
                    {
                        "name": "MixedCPUsAllocation"
                    },
                    {
                        "name": "NodeSwap"
                    },
                    {
                        "name": "OnClusterBuild"
                    },
                    {
                        "name": "OpenShiftPodSecurityAdmission"
                    },
                    {
                        "name": "RouteExternalCertificate"
                    },
                    {
                        "name": "SignatureStores"
                    },
                    {
                        "name": "SigstoreImageVerification"
                    },
                    {
                        "name": "VSphereControlPlaneMachineSet"
                    },
                    {
                        "name": "VSphereStaticIPs"
                    },
                    {
                        "name": "ValidatingAdmissionPolicy"
                    }
                ],
                "enabled": [
                    {
                        "name": "AlibabaPlatform"
                    },
                    {
                        "name": "AzureWorkloadIdentity"
                    },
                    {
                        "name": "BuildCSIVolumes"
                    },
                    {
                        "name": "CloudDualStackNodeIPs"
                    },
                    {
                        "name": "ExternalCloudProvider"
                    },
                    {
                        "name": "ExternalCloudProviderAzure"
                    },
                    {
                        "name": "ExternalCloudProviderExternal"
                    },
                    {
                        "name": "ExternalCloudProviderGCP"
                    },
                    {
                        "name": "NetworkLiveMigration"
                    },
                    {
                        "name": "PrivateHostedZoneAWS"
                    }
                ],
                "version": "4.15.0-0.nightly-2024-03-26-221137"
            }
        ]
    }
}
[jianl@localhost 415]$ 
  6. Enable TP:
[jianl@localhost 415]$ oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}'
featuregate.config.openshift.io/cluster patched
[jianl@localhost 415]$ 

  7. Check featuregate again:

[jianl@localhost 415]$ oc get featuregate cluster -ojson
{
    "apiVersion": "config.openshift.io/v1",
    "kind": "FeatureGate",
    "metadata": {
        "creationTimestamp": "2024-03-27T09:36:04Z",
        "generation": 2,
        "name": "cluster",
        "resourceVersion": "85800",
        "uid": "6aa98c82-46f1-4a7d-b58a-1bf0470f0b47"
    },
    "spec": {
        "featureSet": "TechPreviewNoUpgrade"
    },
    "status": {
        "featureGates": [
            {
                "disabled": [
                    {
                        "name": "ClusterAPIInstall"
                    },
                    {
                        "name": "DisableKubeletCloudCredentialProviders"
                    },
                    {
                        "name": "EventedPLEG"
                    },
                    {
                        "name": "MachineAPIOperatorDisableMachineHealthCheckController"
                    }
                ],
                "enabled": [
                    {
                        "name": "AdminNetworkPolicy"
                    },
                    {
                        "name": "AlertingRules"
                    },
                    {
                        "name": "AlibabaPlatform"
                    },
                    {
                        "name": "AutomatedEtcdBackup"
                    },
                    {
                        "name": "AzureWorkloadIdentity"
                    },
                    {
                        "name": "BareMetalLoadBalancer"
                    },
                    {
                        "name": "BuildCSIVolumes"
                    },
                    {
                        "name": "CSIDriverSharedResource"
                    },
                    {
                        "name": "CloudDualStackNodeIPs"
                    },
                    {
                        "name": "DNSNameResolver"
                    },
                    {
                        "name": "DynamicResourceAllocation"
                    },
                    {
                        "name": "Example"
                    },
                    {
                        "name": "ExternalCloudProvider"
                    },
                    {
                        "name": "ExternalCloudProviderAzure"
                    },
                    {
                        "name": "ExternalCloudProviderExternal"
                    },
                    {
                        "name": "ExternalCloudProviderGCP"
                    },
                    {
                        "name": "ExternalOIDC"
                    },
                    {
                        "name": "ExternalRouteCertificate"
                    },
                    {
                        "name": "GCPClusterHostedDNS"
                    },
                    {
                        "name": "GCPLabelsTags"
                    },
                    {
                        "name": "GatewayAPI"
                    },
                    {
                        "name": "HardwareSpeed"
                    },
                    {
                        "name": "ImagePolicy"
                    },
                    {
                        "name": "InsightsConfig"
                    },
                    {
                        "name": "InsightsConfigAPI"
                    },
                    {
                        "name": "InsightsOnDemandDataGather"
                    },
                    {
                        "name": "InstallAlternateInfrastructureAWS"
                    },
                    {
                        "name": "KMSv1"
                    },
                    {
                        "name": "MachineAPIProviderOpenStack"
                    },
                    {
                        "name": "MachineConfigNodes"
                    },
                    {
                        "name": "ManagedBootImages"
                    },
                    {
                        "name": "MaxUnavailableStatefulSet"
                    },
                    {
                        "name": "MetricsServer"
                    },
                    {
                        "name": "MixedCPUsAllocation"
                    },
                    {
                        "name": "NetworkLiveMigration"
                    },
                    {
                        "name": "NewOLM"
                    },
                    {
                        "name": "NodeDisruptionPolicy"
                    },
                    {
                        "name": "NodeSwap"
                    },
                    {
                        "name": "OnClusterBuild"
                    },
                    {
                        "name": "OpenShiftPodSecurityAdmission"
                    },
                    {
                        "name": "PinnedImages"
                    },
                    {
                        "name": "PlatformOperators"
                    },
                    {
                        "name": "PrivateHostedZoneAWS"
                    },
                    {
                        "name": "RouteExternalCertificate"
                    },
                    {
                        "name": "SignatureStores"
                    },
                    {
                        "name": "SigstoreImageVerification"
                    },
                    {
                        "name": "TranslateStreamCloseWebsocketRequests"
                    },
                    {
                        "name": "UpgradeStatus"
                    },
                    {
                        "name": "VSphereControlPlaneMachineSet"
                    },
                    {
                        "name": "VSphereStaticIPs"
                    },
                    {
                        "name": "ValidatingAdmissionPolicy"
                    },
                    {
                        "name": "VolumeGroupSnapshot"
                    }
                ],
                "version": "4.16.0-0.test-2024-03-27-061445-ci-ln-xt6n3jt-latest"
            },
            {
                "disabled": [
                    {
                        "name": "AdminNetworkPolicy"
                    },
                    {
                        "name": "AutomatedEtcdBackup"
                    },
                    {
                        "name": "CSIDriverSharedResource"
                    },
                    {
                        "name": "ClusterAPIInstall"
                    },
                    {
                        "name": "DNSNameResolver"
                    },
                    {
                        "name": "DisableKubeletCloudCredentialProviders"
                    },
                    {
                        "name": "DynamicResourceAllocation"
                    },
                    {
                        "name": "EventedPLEG"
                    },
                    {
                        "name": "GCPClusterHostedDNS"
                    },
                    {
                        "name": "GCPLabelsTags"
                    },
                    {
                        "name": "GatewayAPI"
                    },
                    {
                        "name": "InsightsConfigAPI"
                    },
                    {
                        "name": "InstallAlternateInfrastructureAWS"
                    },
                    {
                        "name": "MachineAPIOperatorDisableMachineHealthCheckController"
                    },
                    {
                        "name": "MachineAPIProviderOpenStack"
                    },
                    {
                        "name": "MachineConfigNodes"
                    },
                    {
                        "name": "ManagedBootImages"
                    },
                    {
                        "name": "MaxUnavailableStatefulSet"
                    },
                    {
                        "name": "MetricsServer"
                    },
                    {
                        "name": "MixedCPUsAllocation"
                    },
                    {
                        "name": "NodeSwap"
                    },
                    {
                        "name": "OnClusterBuild"
                    },
                    {
                        "name": "OpenShiftPodSecurityAdmission"
                    },
                    {
                        "name": "RouteExternalCertificate"
                    },
                    {
                        "name": "SignatureStores"
                    },
                    {
                        "name": "SigstoreImageVerification"
                    },
                    {
                        "name": "VSphereControlPlaneMachineSet"
                    },
                    {
                        "name": "VSphereStaticIPs"
                    },
                    {
                        "name": "ValidatingAdmissionPolicy"
                    }
                ],
                "enabled": [
                    {
                        "name": "AlibabaPlatform"
                    },
                    {
                        "name": "AzureWorkloadIdentity"
                    },
                    {
                        "name": "BuildCSIVolumes"
                    },
                    {
                        "name": "CloudDualStackNodeIPs"
                    },
                    {
                        "name": "ExternalCloudProvider"
                    },
                    {
                        "name": "ExternalCloudProviderAzure"
                    },
                    {
                        "name": "ExternalCloudProviderExternal"
                    },
                    {
                        "name": "ExternalCloudProviderGCP"
                    },
                    {
                        "name": "NetworkLiveMigration"
                    },
                    {
                        "name": "PrivateHostedZoneAWS"
                    }
                ],
                "version": "4.15.0-0.nightly-2024-03-26-221137"
            }
        ]
    }
}
[jianl@localhost 415]$ 

@petr-muller
Member Author

@JianLi-RH

Please take a look at the test above: in step 6, ResourceReconciliationIssues is True but the message is not JSON.

Thanks for testing. I made a mistake in my notes: the JSON part is actually done in a follow-up PR, #1033, not this one. This PR only needs to have a correct True/False value and a natural-language message. Sorry, I mixed it up in my head.

@JianLi-RH

@petr-muller OK, so I think the change does not break install or upgrade, and it sets ResourceReconciliationIssues correctly.
As for the TechPreview part, I am not familiar with it, so I may need one more day to test it.

@petr-muller
Member Author

@JianLi-RH no worries about time, it's not extra urgent. If we can get it tested this week then we're fine 👍

pkg/cvo/cvo.go Outdated
// This is analogical to VersionForOperatorFromEnv() from o/library-go but the import is pretty
// heavy for a single, simple os.Getenv wrapper, so we just inline the logic here.
operatorVersion := os.Getenv("OPERATOR_IMAGE_VERSION")
klog.Infof("Looking up feature gates for version %s", operatorVersion)
Member

IMO we should not always print this log once we move this out of the feature gate.

Member Author

A later commit removed this specific line because logging it would be excessive (we'd do it in a loop):

// CvoGatesFromFeatureGate finds feature gates for a given version in a FeatureGate resource and returns
// CvoGates that reflects them, or the default gates if given version was not found in the FeatureGate
func CvoGatesFromFeatureGate(gate *configv1.FeatureGate, version string) CvoGates {
	enabledGates := DefaultCvoGates(version)
	for _, g := range gate.Status.FeatureGates {

But a one-time log of how feature gates were detected (which version we see, whether we found a matching item in the FeatureGate .status, etc.) is definitely useful and we should have it. Right now we have a one-time log like this:

$ /usr/bin/grep -e start.go -e featuregates.go -e stopper.go openshift-cluster-version_cluster-version-operator-77b9674d7-58k2s_cluster-version-operator.log
I0325 15:16:00.615836       1 start.go:23] ClusterVersionOperator v1.0.0-1183-gb6dbd748-dirty
I0325 15:16:00.747858       1 start.go:281] Waiting on 1 outstanding goroutines.
I0325 15:20:52.964166       1 start.go:549] FeatureGate found in cluster, using its feature set "" at startup
I0325 15:20:54.090127       1 start.go:574] CVO features for version 4.16.0-0.ci.test-2024-03-25-143851-ci-op-drnlyhbh-latest enabled at startup: {desiredVersion:4.16.0-0.ci.test-2024-03-25-143851-ci-op-drnlyhbh-latest unknownVersion:false resourceReconciliationIssuesCondition:false}
I0325 15:20:54.090240       1 featurechangestopper.go:123] Starting stop-on-features-change controller with startingRequiredFeatureSet="" startingCvoGates={desiredVersion:4.16.0-0.ci.test-2024-03-25-143851-ci-op-drnlyhbh-latest unknownVersion:false resourceReconciliationIssuesCondition:false}
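
The version-matching lookup discussed in this thread can be sketched roughly as follows. This is a minimal, self-contained illustration and not the merged code: the `featureGateDetails` types are simplified stand-ins for the FeatureGate `.status` entries (a `version` plus `enabled`/`disabled` name lists), the `cvoGates` fields mirror the ones printed in the startup log above, the short version strings are made up, and both the default behavior for unknown versions and the choice of `UpgradeStatus` as the controlling cluster gate are assumptions made for illustration.

```go
package main

import "fmt"

// Simplified stand-ins for the FeatureGate .status entries; the real types
// live in the OpenShift API module and are not reproduced here.
type featureGateName struct{ Name string }

type featureGateDetails struct {
	Version  string
	Enabled  []featureGateName
	Disabled []featureGateName
}

// cvoGates mirrors the fields visible in the startup log.
type cvoGates struct {
	desiredVersion                        string
	unknownVersion                        bool
	resourceReconciliationIssuesCondition bool
}

// defaultCvoGates is the fallback when the version is not listed.
// Assumption: an unknown version keeps every CVO-specific gate disabled.
func defaultCvoGates(version string) cvoGates {
	return cvoGates{desiredVersion: version, unknownVersion: true}
}

// cvoGatesFromStatus scans the per-version entries and translates the coarse
// cluster gate into the smaller CVO-specific gate.
func cvoGatesFromStatus(status []featureGateDetails, version string) cvoGates {
	gates := defaultCvoGates(version)
	for _, details := range status {
		if details.Version != version {
			continue
		}
		// A matching entry exists, so the version is known.
		gates.unknownVersion = false
		for _, enabled := range details.Enabled {
			// Assumption: UpgradeStatus is the cluster gate behind the condition.
			if enabled.Name == "UpgradeStatus" {
				gates.resourceReconciliationIssuesCondition = true
			}
		}
	}
	return gates
}

func main() {
	status := []featureGateDetails{
		{Version: "4.16.0", Enabled: []featureGateName{{Name: "UpgradeStatus"}}},
		{Version: "4.15.0", Enabled: []featureGateName{{Name: "NetworkLiveMigration"}}},
	}
	for _, v := range []string{"4.16.0", "4.15.0", "4.14.0"} {
		fmt.Printf("%+v\n", cvoGatesFromStatus(status, v))
	}
}
```

Because the function is pure and cheap, it can be called repeatedly (for example on every FeatureGate change) without logging, while the one-time startup log reports the result once.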

resourceReconciliationIssuesConditionType v1.ClusterStatusConditionType = "ResourceReconciliationIssues"

noResourceReconciliationIssuesReason string = "NoIssues"
noResourceReconciliationIssuesMessage string = "No issues found during resource reconciliation"
Member

Aren't "resource reconciliation" and "reconciliation" the same thing in this case? If so, we should drop the word "resource".

Member Author

Good point. Let me clean up the code 👍

Member Author

Renamed all instances of "resource reconciliation" to just "reconciliation" in cc78232

@LalatenduMohanty
Member

The PR overall looks good to me. I have some small nitpicks which I mentioned in my review.

@petr-muller petr-muller changed the title from "OTA-1159: [2/3] Maintain ResourceReconciliationIssues condition" to "OTA-1159: [2/3] Maintain ReconciliationIssues condition" Mar 27, 2024
Member

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2024
@petr-muller petr-muller force-pushed the ota-1159-result-of-work-rri-condition branch from cc78232 to 1604448 Compare March 27, 2024 18:01
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2024
Member

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2024
Contributor

openshift-ci bot commented Mar 27, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,petr-muller]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Mar 27, 2024

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@JianLi-RH

JianLi-RH commented Mar 28, 2024

Finished the regression tests below:

  1. TechPreview operator should not be installed when a non-TechPreviewNoUpgrade featureset is enabled
    Pass
[jianl@localhost 416]$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.test-2024-03-28-061910-ci-ln-4nt94vb-latest   True        False         67m     Cluster version is 4.16.0-0.test-2024-03-28-061910-ci-ln-4nt94vb-latest
[jianl@localhost 416]$ oc get ns openshift-cluster-olm-operator
Error from server (NotFound): namespaces "openshift-cluster-olm-operator" not found
[jianl@localhost 416]$ oc get co olm
Error from server (NotFound): clusteroperators.config.openshift.io "olm" not found
[jianl@localhost 416]$ 

Enable a non-TechPreviewNoUpgrade featureset in the featuregate:

[jianl@localhost 416]$ oc patch featuregate cluster -p '{"spec": {"featureSet": "CustomNoUpgrade"}}' --type merge
featuregate.config.openshift.io/cluster patched
[jianl@localhost 416]$ 

Wait a few minutes, then check the co again:

[jianl@localhost 416]$ oc get ns openshift-cluster-olm-operator
Error from server (NotFound): namespaces "openshift-cluster-olm-operator" not found
[jianl@localhost 416]$ oc get co olm
Error from server (NotFound): clusteroperators.config.openshift.io "olm" not found
[jianl@localhost 416]$ 

@JianLi-RH

JianLi-RH commented Mar 28, 2024

  1. Techpreview operator will not be installed during upgrade
    Pass

Set up a 4.15 cluster without TP:

[jianl@localhost 415]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-03-27-192252   True        False         49m     Cluster version is 4.15.0-0.nightly-2024-03-27-192252
[jianl@localhost 415]$ oc get ns openshift-cluster-api
Error from server (NotFound): namespaces "openshift-cluster-api" not found
[jianl@localhost 415]$ oc get co cluster-api
Error from server (NotFound): clusteroperators.config.openshift.io "cluster-api" not found
[jianl@localhost 415]$ oc get co openshift-cluster-olm-operator
Error from server (NotFound): clusteroperators.config.openshift.io "openshift-cluster-olm-operator" not found
[jianl@localhost 415]$ 

Create an image:
build 4.16,openshift/cluster-version-operator#1032

[jianl@localhost 416]$ grep -rl 'release.openshift.io/feature-set: .*TechPreviewNoUpgrade.*' manifests/|grep "clusteroperator.yaml"
manifests/0000_30_cluster-api_12_clusteroperator.yaml
manifests/0000_50_cluster-platform-operators-manager_07-aggregated-clusteroperator.yaml
[jianl@localhost 416]$ 

Upgrade to the above payload:

[jianl@localhost 415]$ oc adm upgrade --to-image registry.build03.ci.openshift.org/ci-ln-ynszrqt/release:latest --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build03.ci.openshift.org/ci-ln-ynszrqt/release:latest
[jianl@localhost 415]$ 

When the upgrade has finished, check the above ns and co again:

[jianl@localhost 415]$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.test-2024-03-28-041439-ci-ln-ynszrqt-latest   True        False         69s     Cluster version is 4.16.0-0.test-2024-03-28-041439-ci-ln-ynszrqt-latest
[jianl@localhost 415]$ oc get ns openshift-cluster-api
Error from server (NotFound): namespaces "openshift-cluster-api" not found
[jianl@localhost 415]$ oc get co cluster-api
Error from server (NotFound): clusteroperators.config.openshift.io "cluster-api" not found
[jianl@localhost 415]$ 

@JianLi-RH

  1. TechPreview operator should be installed when the TechPreviewNoUpgrade featureset is enabled
    Pass

Continue the test with the cluster in case 3.
Enable TP:

[jianl@localhost 415]$ oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type merge
featuregate.config.openshift.io/cluster patched
[jianl@localhost 415]$ oc get featuregate cluster -ojson|jq .spec
{
  "featureSet": "TechPreviewNoUpgrade"
}
[jianl@localhost 415]$ 

Check co:

[jianl@localhost 415]$ oc get ns openshift-cluster-api
NAME                    STATUS   AGE
openshift-cluster-api   Active   6m29s
[jianl@localhost 415]$ oc get co cluster-api
NAME          VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
cluster-api   4.16.0-0.test-2024-03-28-041439-ci-ln-ynszrqt-latest   True        False         False      4m51s   
[jianl@localhost 415]$

Check if upgradable:

[jianl@localhost 415]$ oc adm upgrade
Cluster version is 4.16.0-0.test-2024-03-28-041439-ci-ln-ynszrqt-latest

Upgradeable=False

  Reason: ClusterOperatorsNotUpgradeable
  Message: Multiple cluster operators should not be upgraded between minor versions:
  * Cluster operator config-operator should not be upgraded between minor versions: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates
  * Cluster operator machine-config should not be upgraded between minor versions: PoolUpdating: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.15
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.16.0-0.test-2024-03-28-041439-ci-ln-ynszrqt-latest not found in the "stable-4.15" channel

[jianl@localhost 415]$ 

Check manifest:

[jianl@localhost 415]$ grep -rh "release.openshift.io/feature-set:" post-manifest/
        release.openshift.io/feature-set: CustomNoUpgrade,TechPreviewNoUpgrade
        release.openshift.io/feature-set: CustomNoUpgrade,TechPreviewNoUpgrade
        release.openshift.io/feature-set: CustomNoUpgrade,TechPreviewNoUpgrade
        release.openshift.io/feature-set: CustomNoUpgrade,TechPreviewNoUpgrade

Unset the featureset to disable TechPreviewNoUpgrade:

[jianl@localhost 415]$ oc patch featuregate cluster -p '{"spec": {"featureSet": ""}}' --type merge
The FeatureGate "cluster" is invalid: spec.featureSet: Forbidden: once enabled, tech preview features may not be disabled
[jianl@localhost 415]$ oc patch featuregate cluster --type=json -p '[{"op":"remove", "path":"/spec/featureSet"}]'
The FeatureGate "cluster" is invalid: spec.featureSet: Forbidden: once enabled, tech preview features may not be disabled
[jianl@localhost 415]$ 

@JianLi-RH

Hi @petr-muller, I have finished 4 feature-gate-related regression cases; they all work normally.

@JianLi-RH

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Mar 28, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 28, 2024

@petr-muller: This pull request references OTA-1159 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

cvo: set ResourceReconciliationIssues condition

When an appropriate feature flag is set, maintain a ResourceReconciliationIssues
condition on the CV status. This condition is False when no issues were
encountered (signalled by the Failure field on the SyncWorkerStatus
parameter) and True otherwise.


cvo: read enabled feature flags from cluster

Read CVO-related Feature Flags from the cluster resource and propagate
them to CVO controllers. Multiplex the coarse cluster feature flag into
smaller, CVO-specific flags for easier maintenance in the future.

Builds on top of #1031


Because of OCPBUGS-30080, we cannot easily determine the running CVO version with a single os.Getenv(), like other operators can. CVO can determine its version from the initial payload it loads from disk, but this happens a bit later in the code flow, after the leadership lease is acquired and all informers are started. At that point we can provide the feature gate / featureset knowledge to the structures that need it: the actual CVO controller and the feature changestopper. However, these structures also need to be initialized earlier (they require informers, which are already started at that point). This leads to a slightly awkward delayed initialization scheme, where the controller structures are initialized early and populated with early content such as informers. Later, when informers are started and CVO loads its initial payload, we can extract the version from it and use it to populate the feature gate in the controller structures. Because enabled feature gates only become available later in the flow, part of the CVO code cannot be gated by a feature gate (such as controller initialization or initial payload loading). We do not need that now, but it may cause issues later.

The high-level sequence after this commit looks like this:

  1. Initialize CVO and ChangeStopper controller structures with the informers they need, and populate CVO's enabledFeatureGate checker with one that panics when used (no code can check for gates before we know them)
  2. Acquire lease and start the informers
  3. Fetch a FeatureGate resource from the cluster (using an informer) and determine the FeatureSet from it (needed to load the payload)
  4. Load the initial payload from disk and extract the version from it
  5. Use the version to determine the enabled feature gates from the FeatureGate resource
  6. Populate the CVO and ChangeStopper controller structures with the newly discovered feature gates
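
The delayed initialization in the sequence above can be sketched as follows. This is a minimal illustration under stated assumptions, not the merged implementation: `gateChecker`, `panicOnUseBeforeInit`, `knownGates`, `controller`, and `initializeGates` are all hypothetical names, and steps 2 to 5 (lease, informers, FeatureGate fetch, payload load) are elided.

```go
package main

import "fmt"

// gateChecker answers questions about CVO-specific feature gates.
type gateChecker interface {
	ReconciliationIssuesCondition() bool
}

// panicOnUseBeforeInit is the placeholder installed in step 1: any gate check
// made before the gates are known is a programming error.
type panicOnUseBeforeInit struct{}

func (panicOnUseBeforeInit) ReconciliationIssuesCondition() bool {
	panic("feature gates were queried before they were known")
}

// knownGates is installed in step 6, once the version has been extracted from
// the initial payload and matched against the FeatureGate resource.
type knownGates struct{ reconciliationIssuesCondition bool }

func (g knownGates) ReconciliationIssuesCondition() bool {
	return g.reconciliationIssuesCondition
}

// controller stands in for the CVO and ChangeStopper controller structures.
type controller struct {
	gates gateChecker
}

// newController performs the early part of initialization (step 1).
func newController() *controller {
	return &controller{gates: panicOnUseBeforeInit{}}
}

// initializeGates performs the late part of initialization (step 6).
func (c *controller) initializeGates(g gateChecker) {
	c.gates = g
}

// queriedTooEarly reports whether checking a gate on c panics.
func queriedTooEarly(c *controller) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	c.gates.ReconciliationIssuesCondition()
	return false
}

func main() {
	c := newController()
	fmt.Println(queriedTooEarly(c)) // true: the placeholder panics

	// Steps 2-5 would run here.
	c.initializeGates(knownGates{reconciliationIssuesCondition: true})
	fmt.Println(queriedTooEarly(c))                      // false
	fmt.Println(c.gates.ReconciliationIssuesCondition()) // true
}
```

Panicking in the placeholder turns an accidental early gate check into an immediate, obvious failure rather than a silently wrong default. A real controller would likely also need to consider concurrent access when swapping the checker, which this sketch ignores.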

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot openshift-merge-bot bot merged commit e3677e0 into openshift:master Mar 28, 2024
11 checks passed
@petr-muller
Member Author

@JianLi-RH Thank you, great work! 👍

@petr-muller petr-muller deleted the ota-1159-result-of-work-rri-condition branch March 28, 2024 11:47
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-version-operator-container-v4.16.0-202403281444.p0.ge3677e0.assembly.stream.el9 for distgit cluster-version-operator.
All builds following this will include this PR.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
qe-approved: Signifies that QE has signed off on this PR

7 participants