New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OTA-1159: [2/3] Maintain ReconciliationIssues
condition
#1032
OTA-1159: [2/3] Maintain ReconciliationIssues
condition
#1032
Conversation
@petr-muller: This pull request references OTA-1159 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@petr-muller: This pull request references OTA-1159 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
a96e7bd
to
3fd6099
Compare
4e58482
to
33e06ec
Compare
/retest |
33e06ec
to
fc5033e
Compare
pkg/start/start.go
Outdated
var enabledGates cvo.FeatureGates | ||
for _, g := range gate.Status.FeatureGates { | ||
if g.Version != version.Version.String() { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this approach discussed briefly in openshift/api#1405, but it feels a bit weird. If a gate claims enabled in 4.y.z and we happen to be running a different version (e.g. because we're updating into a new CVO) it looks like we'll disble all the gates until whatever manages FeatureGate
catches up with us. But we don't want to blip these off early in the update, do we? We might be unique vs. other existing consumers in going first in the update, while most other consumers trail behind the Kube API server operator (except etcd, which is in parallel). Anyhow, I'm not entirely sure how this plays out, and it would be nice to have the commit message talk me through the behavior as we understand it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I think I misunderstood how exactly the FeatureGate
works during upgrades - I for some reason believed the resource is managed by the config-operator for some reason I lived in the world where CVO is upgraded after config-operator
, but that does not seem to be true (0000_00_cluster-version-operator_03_deployment
goes before 0000_10_config-operator_08_clusteroperator
. I'm clearly the ideal person to write these PRs :-P
I started following o/enhancements dev-guide/featuresets.md that lead me to FeatureGateAccess
where the controller errors and requeues until the gates are set, and the dev-guide setup code makes the controller startup block until that happens. That felt scary for CVO so I thought running briefly without gates enalbed (which is fine for the current state of gates in CVO where there is one which is off by default) is acceptable (aggravated by my false belief that featuregatestopper
would kill us once status
finally updates).
I guess our only two real options is to assume some sane default or block? And I guess we cannot really block because of our role in the upgrades because we upgrade before config-operator so we would never see the new version gates?
You are totally correct that whatever corners we cut, we need to document them better. I'll spend more time looking into this and at the very least write much better commit messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by adding the UnknownVersion
meta-flag that informs flag consumer code in CVO that we were not able to read enabled/disabled features for our version, so we should follow some sane default behavior, but can tolerate evidence of previously enabled features if we find them in the cluster.
It brings some complexity but I think it is the only solution for CVO to tolerate FeatureGate
status not having a matching version while not blocking startup on that.
pkg/start/start.go
Outdated
@@ -180,6 +201,7 @@ func (o *Options) Run(ctx context.Context) error { | |||
return false, nil | |||
default: | |||
startingFeatureSet = string(gate.Spec.FeatureSet) | |||
enabledGates = cvoGatesFrom(gate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that we're going to start depending on cluster
FeatureGate status
here, we'll need something more granular than our current feature-set watcher to notice those changes and keep us up to date. Maybe pkg/featurechangestopper
(already watching/polling) should grow an API to call a callback on changes, and we have a lock on Operator
and a goroutine that updates Operator.enabledGates
on changes? Or rolls the CVO container? Or...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mistakenly believed that featurechangestopper
would kill us on any change but that's really not what happens. library-go's FeatureGateAccess terminates when the set of enabled/disabled gates changes. Our featurechangestopper
terminates when featureset changes.
I think for our case it could be appropriate to do something like what FeatureGateAccess
does and simply terminate on observed gate changes, probably check just the ones that CVO cares about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in c1b2767 by making featurechangestopper
terminate CVO when CVO-relevant flags change.
pkg/cvo/status_test.go
Outdated
expectedRriCondition: nil, | ||
}, | ||
{ | ||
name: "RRI disabled, version unknown, failure => condition present", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe add "existing condition" to the name? And a test case for RRI disabled, version unknown, no existing condition, failure => condition not present
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 added in 3da7615
/test all |
/test e2e-agnostic-operator |
/test lint |
Huh for some reason this PR is not picking up recent commits like petr-muller@24b845e |
/close |
@petr-muller: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
9929ed6
to
24b845e
Compare
/reopen |
@petr-muller: This pull request references OTA-1159 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@petr-muller Please take a look above test, in step 6, when ResourceReconciliationIssues is true but the message is not a json:
|
Upgrade cluster from normal 4.15 to latest with this PR:
7 Check featuregate again:
|
Thanks for testing, I made a mistake in my notes - the JSON part is actually done in a followup PR #1033, not this one. This PR only needs to have a correct True/False value and natural language message. Sorry, it mixed up in my head. |
@petr-muller ok, so I think the change is not break install and upgrade and also can set ResourceReconciliationIssues correctly. |
@JianLi-RH no worries about time, it's not extra urgent. If we can get it tested this week then we're fine 👍 |
pkg/cvo/cvo.go
Outdated
// This is analogical to VersionForOperatorFromEnv() from o/library-go but the import is pretty | ||
// heavy for a single, simple os.Getenv wrapper, so we just inline the logic here. | ||
operatorVersion := os.Getenv("OPERATOR_IMAGE_VERSION") | ||
klog.Infof("Looking up feature gates for version %s", operatorVersion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should not be printing this log always after we move this out of feature-gate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later commit removed this specific line because logging this would be excessive (we'd do it in a loop):
cluster-version-operator/pkg/featuregates/featuregates.go
Lines 79 to 84 in 7bd3096
// CvoGatesFromFeatureGate finds feature gates for a given version in a FeatureGate resource and returns | |
// CvoGates that reflects them, or the default gates if given version was not found in the FeatureGate | |
func CvoGatesFromFeatureGate(gate *configv1.FeatureGate, version string) CvoGates { | |
enabledGates := DefaultCvoGates(version) | |
for _, g := range gate.Status.FeatureGates { |
But one-time log of how feature gates were detected (what version do we see, if we found matching item in FeatureGate .status, etc) is definitely useful and we should have it. Right now we have a one-time log like this:
$ /usr/bin/grep -e start.go -e featuregates.go -e stopper.go openshift-cluster-version_cluster-version-operator-77b9674d7-58k2s_cluster-version-operator.log
I0325 15:16:00.615836 1 start.go:23] ClusterVersionOperator v1.0.0-1183-gb6dbd748-dirty
I0325 15:16:00.747858 1 start.go:281] Waiting on 1 outstanding goroutines.
I0325 15:20:52.964166 1 start.go:549] FeatureGate found in cluster, using its feature set "" at startup
I0325 15:20:54.090127 1 start.go:574] CVO features for version 4.16.0-0.ci.test-2024-03-25-143851-ci-op-drnlyhbh-latest enabled at startup: {desiredVersion:4.16.0-0.ci.test-2024-03-25-143851-ci-op-drnlyhbh-latest unknownVersion:false resourceReconciliationIssuesCondition:false}
I0325 15:20:54.090240 1 featurechangestopper.go:123] Starting stop-on-features-change controller with startingRequiredFeatureSet="" startingCvoGates={desiredVersion:4.16.0-0.ci.test-2024-03-25-143851-ci-op-drnlyhbh-latest unknownVersion:false resourceReconciliationIssuesCondition:false}
resourceReconciliationIssuesConditionType v1.ClusterStatusConditionType = "ResourceReconciliationIssues" | ||
|
||
noResourceReconciliationIssuesReason string = "NoIssues" | ||
noResourceReconciliationIssuesMessage string = "No issues found during resource reconciliation" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't resource reconciliation
and reconciliation
is same thing in this case? If yes, we should drop resource
word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Let me clean up the code 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed all instances of "resource reconciliation" to just "reconciliation" in cc78232
The PR overall looks good to me. I have some small nitpicks which I mentioned in my review. |
ReconciliationIssues
condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
cc78232
to
1604448
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LalatenduMohanty, petr-muller The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@petr-muller: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Finished below regression:
enable non-TechPreviewNoUpgrade featureset in featuregate
Wait some minutes, check co again:
|
Setup a 4.15 cluster without TP:
Create an image:
Upgrade to above payload:
When the upgrading finished, check above ns and co again:
|
Continue the test with the cluster in case 3.
Check co:
Check if upgradable:
Check manifest:
Unset featureset to disable TechPreviewNoUpgrade
|
hi @petr-muller I have finished 4 feature gate related regression cases, they all work normal. |
/label qe-approved |
@petr-muller: This pull request references OTA-1159 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@JianLi-RH Thank you, great work! 👍 |
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-version-operator-container-v4.16.0-202403281444.p0.ge3677e0.assembly.stream.el9 for distgit cluster-version-operator. |
cvo: set ResourceReconciliationIssues condition
When an appropriate feature flag is set, maintain a
ResourceReconciliationIssues
condition on the CV status. This condition is False when no issues were
encountered (signalled by the
Failure
field on theSyncWorkerStatus
parameter) and True otherwise.
cvo: read enabled feature flags from cluster
Read CVO-related Feature Flags from the cluster resource and propagate
them to CVO controllers. Multiplex the coarse cluster feature flag into
smaller, CVO-specific flags for easier maintenance in the future.
Builds on top of #1031
Because of OCPBUGS-30080, we cannot easily determine running CVO version by a single
os.Getenv()
, like other operators can. CVO can determine its version from the initial payload it loads from disk though, but this happens a bit later in the code flow, after leadership lease is acquired and all informers are started. At that point we can provide the feature gate / featureset knowledge to the structures that need it: actual CVO controller and the feature changestopper, but these structures also need to be initialized earlier (they require informers which are already started). This leads to a slightly awkard delayed initialization scheme, where the controller structures are initialized early and populated with early content like informers etc. Later, when informers are started and CVO loads its initial payload, we can extract the version from it and use it to populate the feature gate in the controller structures. Because enabled feature gates are avaiable later in the flow, it also means part of the CVO code cannot be gated by a feature gate (like controller initialization, or initial payload loading). We do not need that now but it may cause issues later.The high-level sequence after this commit looks like this:
enabledFeatureGate
checker with one panics when used (no code can check for gates before we know them)