Conversation

@timflannagan
Contributor

Builds on top of #61 and introduces a "core" ClusterOperator resource and a controller that manages that resource. This controller was described in the phase 0 EP.

@timflannagan timflannagan changed the title feat: Introduce the "core" ClusterOperator controller OLM-2760: Introduce the "core" ClusterOperator controller Sep 19, 2022
@openshift-ci openshift-ci bot requested review from exdx and tylerslaton September 19, 2022 18:19
@openshift-ci
Contributor

openshift-ci bot commented Sep 19, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: timflannagan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2022
@timflannagan
Contributor Author

/jira refresh

@timflannagan timflannagan changed the title OLM-2760: Introduce the "core" ClusterOperator controller feat: Introduce the "core" ClusterOperator controller Sep 19, 2022
@timflannagan timflannagan changed the title feat: Introduce the "core" ClusterOperator controller OLM-2760: Introduce the "core" ClusterOperator controller Sep 19, 2022
@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch from 1079e2e to a1854ea Compare September 19, 2022 18:48
Comment on lines 85 to 86
coBuilder.WithDegraded(openshiftconfigv1.ConditionFalse)
coBuilder.WithAvailable(openshiftconfigv1.ConditionTrue, fmt.Sprintf("The platform operator manager is available at %s", r.ReleaseVersion), clusteroperator.ReasonAsExpected)
Member

Should one or both of Degraded/Available rely on any smoke tests? (e.g. can I list POs without error?)

Contributor Author

It depends on what smoke tests we think are reasonable. Reading through https://coreos.slack.com/archives/C030W4V9BHN/p1663354445071699 it seems like setting A=T is sufficient during phase 0 but I'm open to ideas.

Member

At a minimum, I think we should attempt a list of POs before we say A=T? I'm trying to think through various POM failures. Here's my stab at filling in a table.

| Failure | platform-operators-core | platform-operators-aggregated |
| --- | --- | --- |
| Can't communicate with apiserver | Available=F | Available=F |
| Don't have permission to list POs | Available=F | Available=F |
| Failed to list bundles from catalog source(s) | Available=F | Available=F |
| Failed to list BDs | Available=F | Available=F |
| Failed to locate PO packageName in listed bundles | Available=T | Available=F |
| Don't have permission to create/update BDs for POs | Available=F | Available=F |
| BD for a PO did not progress to InstallationSucceeded | Available=T | Available=F |

Failed to list bundles from catalog source(s)

This one seems iffy since I recall that catsrc gRPC connections can be flaky. Seems like we'd want to retry for a little while, then go Degraded=true, then eventually go Available=false.

Also from the CVO docs:

> An operator should not report the Available status condition the first time until they are completely rolled out (or within some reasonable percentage if the component must be installed to all nodes)

This seems super important to make sure we get right when managing PO installations for POs that are declared at cluster install time. I put #40 on hold while we think through this.

Contributor Author

And the value of listing the PlatformOperators in the cluster is to ensure that the CRD has been registered? We basically only care about whether an error was returned, not how many items the list query returned. My main concern is whether that opens the door to transient events, e.g. the API server being temporarily down, and whether toggling A=[T,F] is desirable in that situation.

I need to really dive more into the ClusterOperator documentation to get a better feel for what's applicable here given we don't explicitly manage any operands with this CO controller. Thoughts @bparees @wking?

Contributor

Checking that you can list the API you care about is definitely more "aggressive" than I think most COs bother with. That doesn't make it wrong, per se.

It accomplishes two things:

  1. ensures the api is defined/available
  2. ensures the operator itself has RBAC permissions to interact with the api

That said, I don't know that you need to explicitly poll it; it's more that if the control loop gets an error interacting with the API, then you might go degraded or unavailable.

Contributor Author

I can take a stab at this 5m threshold approach to determine A=[T,F] when I'm back on normal dev duties next week.

I'm mainly concerned with the initial cluster rollout edge case, but having a controller reconciliation loop that blindly sets A=T could be sufficient given we can assume the controller pod is running. Obviously it's not the most robust solution, but it helps sidestep any forward-looking QE bug where we roll out a bad component container image during phase 0.

Contributor

> having a controller reconciliation loop that blindly sets A=T could be sufficient given we can assume the controller pod is running.

I think that is the right starting point.

Contributor Author

> as for whether we need a threshold before going degraded... i could see us just keeping it simple and not going degraded

I was thinking about this some more today: what happens if we're currently expressing A=T and we fail a single list query? Ideally we wouldn't set A=F, given it could be a transient issue. Would it make sense instead to set D=T here, which would update the LastTransitionTime, and keep the A=T and P=T conditions? Any subsequent reconciliation that continues to fail the list query would check whether D=T is already set, and if so, compare the LastTransitionTime against the failure threshold to determine whether we should set A=F.

@bparees @joelanford Thoughts? It was unclear to me reading through the states listed in https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/clusteroperator.md#conditions-and-installupgrade whether this was a supported configuration.

Contributor Author

Took a rough stab at this implementation in cdda9b5.

Contributor

@timflannagan not sure what you mean by "supported configuration", but the algorithm/determination seems reasonable to me. Of course, you also need to clear D=T once you have a successful list, if and only if the reason for the D=T is a failure to list (i.e. if you were degraded for some other reason, then successfully listing doesn't clear the degraded condition).

@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch from 8285e7d to 6292aab Compare September 19, 2022 20:02
@timflannagan
Contributor Author

/retest

Contributor

@tylerslaton tylerslaton left a comment

Looks like Joe handled a lot of the questions here already, so I left some small comments.

@timflannagan
Contributor Author

/retest

@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch 6 times, most recently from 5950acb to 888ad66 Compare September 28, 2022 20:08
Comment on lines 39 to 41
// FIXME(tflannag): I'm seeing unit test flakes where we're bumping
// the lastTransitionTime value despite being in the same state as
// before, which is a bug.
Contributor Author

Going to see whether #40 helps improve this behavior, given it refactors the clusteroperator.Builder library that's being leveraged during reconciliation.

@timflannagan
Contributor Author

Moved some of the envtest fixes to #72, which should fix any of the recent unit test case failures where we fail to find the underlying etcd executable. When diving into the failures in CI, it looks like we need to update where the executable directory is stored, and ensure we're writing those files to a directory path that's writable in CI (e.g. /tmp).

@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch 2 times, most recently from 46128c8 to 4497f39 Compare September 29, 2022 18:33
Signed-off-by: timflannagan <timflannagan@gmail.com>
…ed library

Signed-off-by: timflannagan <timflannagan@gmail.com>
…shared library

Signed-off-by: timflannagan <timflannagan@gmail.com>
@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch from 4497f39 to 5e33517 Compare September 29, 2022 19:06
Signed-off-by: timflannagan <timflannagan@gmail.com>
@openshift-ci
Contributor

openshift-ci bot commented Sep 29, 2022

@timflannagan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-techpreview | ff48f5a | link | false | /test e2e-techpreview |
| ci/prow/unit | 5e33517 | link | true | /test unit |
| ci/prow/e2e-aws-techpreview | 5e33517 | link | false | /test e2e-aws-techpreview |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2022
@openshift-merge-robot
Contributor

@timflannagan: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tylerslaton
Contributor

@timflannagan Anything I can do to help move this forward? Looks like we're blocked by a rebase being needed right now.
