Conversation

@timflannagan
Contributor

Builds on top of #61 and introduces a "core" ClusterOperator resource and a controller that manages that resource. This controller was described in the phase 0 EP.

@timflannagan timflannagan changed the title feat: Introduce the "core" ClusterOperator controller OLM-2760: Introduce the "core" ClusterOperator controller Sep 19, 2022
@openshift-ci openshift-ci bot requested review from exdx and tylerslaton September 19, 2022 18:19
@openshift-ci
Contributor

openshift-ci bot commented Sep 19, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: timflannagan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2022
@timflannagan
Contributor Author

/jira refresh

@timflannagan timflannagan changed the title OLM-2760: Introduce the "core" ClusterOperator controller feat: Introduce the "core" ClusterOperator controller Sep 19, 2022
@timflannagan timflannagan changed the title feat: Introduce the "core" ClusterOperator controller OLM-2760: Introduce the "core" ClusterOperator controller Sep 19, 2022
@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch from 1079e2e to a1854ea Compare September 19, 2022 18:48
Comment on lines 85 to 86
coBuilder.WithDegraded(openshiftconfigv1.ConditionFalse)
coBuilder.WithAvailable(openshiftconfigv1.ConditionTrue, fmt.Sprintf("The platform operator manager is available at %s", r.ReleaseVersion), clusteroperator.ReasonAsExpected)
Member

Should one or both of Degraded/Available rely on any smoke tests? (e.g. can I list POs without error?)

Contributor Author

It depends on what smoke tests we think are reasonable. Reading through https://coreos.slack.com/archives/C030W4V9BHN/p1663354445071699 it seems like setting A=T is sufficient during phase 0 but I'm open to ideas.

Member

At a minimum, I think we should attempt a list of POs before we say A=T? I'm trying to think through various POM failures. Here's my stab at filling in a table.

| Failure | platform-operators-core | platform-operators-aggregated |
| --- | --- | --- |
| Can't communicate with apiserver | Available=F | Available=F |
| Don't have permission to list POs | Available=F | Available=F |
| Failed to list bundles from catalog source(s) | Available=F | Available=F |
| Failed to list BDs | Available=F | Available=F |
| Failed to locate PO packageName in listed bundles | Available=T | Available=F |
| Don't have permission to create/update BDs for POs | Available=F | Available=F |
| BD for a PO did not progress to InstallationSucceeded | Available=T | Available=F |

Failed to list bundles from catalog source(s)

This one seems iffy since I recall that catsrc gRPC connections can be flaky. Seems like we'd want to retry for a little while, then go Degraded=true, then eventually go Available=false.

Also from the CVO docs:

> An operator should not report the Available status condition the first time until they are completely rolled out (or within some reasonable percentage if the component must be installed to all nodes)

This seems super important to make sure we get right when managing PO installations for POs that are declared at cluster install time. I put #40 on hold while we think through this.

Contributor Author

And the value of listing the PlatformOperators in the cluster is to ensure that the CRD has been registered? We basically only care about whether an error was returned, not how many items the list query returned. My main concern is whether that opens the door to transient events, e.g. the API server being temporarily down, and whether toggling A=[T,F] is desirable in that situation.

I need to really dive more into the ClusterOperator documentation to get a better feel for what's applicable here given we don't explicitly manage any operands with this CO controller. Thoughts @bparees @wking?

Contributor

Checking that you can list the API you care about is definitely more "aggressive" than I think most COs bother with. That doesn't make it wrong, per se.

It accomplishes two things:

  1. ensures the api is defined/available
  2. ensures the operator itself has RBAC permissions to interact with the api

That said, I don't know that you need to explicitly poll it; it's more that if the control loop gets an error interacting with the API, then you might go degraded or unavailable.

Contributor Author

I can take a stab at this 5m threshold approach to determine A=[T,F] when I'm back on normal dev duties next week.

I'm mainly concerned with the initial cluster rollout edge case, but having a controller reconciliation loop that blindly sets A=T could be sufficient given we can assume the controller pod is running. Obviously it's not the most robust solution, but it helps sidestep any forward-looking QE bug where we roll out a bad component container image during phase 0.

Contributor

> having a controller reconciliation loop that blindly sets A=T could be sufficient given we can assume the controller pod is running.

I think that is the right starting point.

Contributor Author

> as for whether we need a threshold before going degraded... i could see us just keeping it simple and not going degraded

I was thinking about this some more today: what happens if we're currently expressing A=T and we fail a single list query? Ideally we wouldn't set A=F, given it could be a transient issue. Would it make sense instead to set D=T here, which would update the LastTransitionTime, and keep the A=T and P=T conditions? Any subsequent reconciliation that continues to fail the list query would check whether D=T is already set, and if so, compare the LastTransitionTime against the failure threshold to determine whether we should set A=F.

@bparees @joelanford Thoughts? It was unclear to me reading through the states listed in https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/clusteroperator.md#conditions-and-installupgrade whether this was a supported configuration.

Contributor Author

Took a rough stab at this implementation in cdda9b5.

Contributor

@timflannagan not sure what you mean by "supported configuration", but the algorithm/determination seems reasonable to me. Of course, you also need to clear D=T once you have a successful list, if and only if the reason for the D=T is a failure to list (i.e. if you were degraded for some other reason, then successfully listing doesn't clear the degraded condition).

@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch from 8285e7d to 6292aab Compare September 19, 2022 20:02
@timflannagan
Contributor Author

/retest

Contributor

@tylerslaton tylerslaton left a comment

Looks like Joe handled a lot of the questions here already, so I left some small comments.

@timflannagan
Contributor Author

/retest

@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch 6 times, most recently from 5950acb to 888ad66 Compare September 28, 2022 20:08
Comment on lines 39 to 41
// FIXME(tflannag): I'm seeing unit test flakes where we're bumping
// the lastTransitionTime value despite being in the same state as
// before, which is a bug.
Contributor Author

Going to see whether #40 helps improve this behavior, given it refactors the clusteroperator.Builder library that's being leveraged during reconciliation.

@timflannagan
Contributor Author

Moved some of the envtest fixes to #72, which should fix any of the recent unit test case failures where we fail to find the underlying etcd executable. When diving into the failures in CI, it looks like we need to update where the executable directory is stored, and ensure we're writing those files to a directory path that's writable in CI (e.g. /tmp).

@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch 2 times, most recently from 46128c8 to 4497f39 Compare September 29, 2022 18:33
Signed-off-by: timflannagan <timflannagan@gmail.com>
…ed library

Signed-off-by: timflannagan <timflannagan@gmail.com>
…shared library

Signed-off-by: timflannagan <timflannagan@gmail.com>
@timflannagan timflannagan force-pushed the feat/introduce-core-clusteroperator branch from 4497f39 to 5e33517 Compare September 29, 2022 19:06
Signed-off-by: timflannagan <timflannagan@gmail.com>
@openshift-ci
Contributor

openshift-ci bot commented Sep 29, 2022

@timflannagan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-techpreview | ff48f5a | link | false | /test e2e-techpreview |
| ci/prow/unit | 5e33517 | link | true | /test unit |
| ci/prow/e2e-aws-techpreview | 5e33517 | link | false | /test e2e-aws-techpreview |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2022
@openshift-merge-robot
Contributor

@timflannagan: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tylerslaton
Contributor

@timflannagan Anything I can do to help move this forward? Looks like we're blocked by a rebase being needed right now.
