OLM-2760: Introduce the "core" ClusterOperator controller #62
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: timflannagan. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/jira refresh
Force-pushed from 1079e2e to a1854ea
```go
coBuilder.WithDegraded(openshiftconfigv1.ConditionFalse)
coBuilder.WithAvailable(openshiftconfigv1.ConditionTrue, fmt.Sprintf("The platform operator manager is available at %s", r.ReleaseVersion), clusteroperator.ReasonAsExpected)
```
Should one or both of Degraded/Available rely on any smoke tests? (e.g. can I list POs without error?)
It depends on what smoke tests we think are reasonable. Reading through https://coreos.slack.com/archives/C030W4V9BHN/p1663354445071699 it seems like setting A=T is sufficient during phase 0 but I'm open to ideas.
At a minimum, I think we should attempt a list of POs before we say A=T? I'm trying to think through various POM failures. Here's my stab at filling in a table.
| Failure | platform-operators-core | platform-operators-aggregated |
|---|---|---|
| Can't communicate with apiserver | Available=F | Available=F |
| Don't have permission to list POs | Available=F | Available=F |
| Failed to list bundles from catalog source(s) | Available=F | Available=F |
| Failed to list BDs | Available=F | Available=F |
| Failed to locate PO packageName in listed bundles | Available=T | Available=F |
| Don't have permission to create/update BDs for POs | Available=F | Available=F |
| BD for a PO did not progress to InstallationSucceeded | Available=T | Available=F |
> Failed to list bundles from catalog source(s)

This one seems iffy since I recall that catsrc gRPC connections can be flaky. Seems like we'd want to retry for a little while, then go Degraded=True, then eventually go Available=False.
Also from the CVO docs:
> An operator should not report the Available status condition the first time until they are completely rolled out (or within some reasonable percentage if the component must be installed to all nodes)
This seems super important to make sure we get right when managing PO installations for POs that are declared at cluster install time. I put #40 on hold while we think through this.
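For illustration, here's a minimal sketch of the list-before-Available smoke test floated above, assuming a controller-runtime client; `checkAvailable` and the message strings are hypothetical stand-ins, not the PR's actual code:

```go
// Sketch only: gate Available=True on whether POs can be listed at all.
package smoketest

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// checkAvailable is a hypothetical helper: only the list error matters;
// an empty result is still healthy, since we care about API reachability
// and RBAC, not about how many POs exist.
func checkAvailable(ctx context.Context, c client.Client, poList client.ObjectList) (bool, string) {
	if err := c.List(ctx, poList); err != nil {
		return false, fmt.Sprintf("failed to list PlatformOperators: %v", err)
	}
	return true, "The platform operator manager is available"
}
```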
And the value of listing the PlatformOperators in the cluster is to ensure that the CRD has been registered? We basically only care about whether an error was returned vs. length-checking the number of items the list query returned. I think my main concern is whether that opens the door for transient events, e.g. the API server is temporarily down, and whether toggling A=[T,F] is desirable in such a configuration.
I need to dive deeper into the ClusterOperator documentation to get a better feel for what's applicable here, given we don't explicitly manage any operands with this CO controller. Thoughts @bparees @wking?
Checking that you can list the API you care about is definitely more "aggressive" than I think most COs bother with. That doesn't make it wrong, per se.
It accomplishes two things:
- ensures the api is defined/available
- ensures the operator itself has RBAC permissions to interact with the api
That said, I don't know that you need to explicitly poll it; more like, if the control loop gets an error interacting with the API, then you might go Degraded or unavailable.
I can take a stab at this 5m threshold approach to determine A=[T,F] when I'm back on normal dev duties next week.
I'm mainly concerned with the initial cluster rollout edge case, and having a controller reconciliation loop that blindly sets A=T could be sufficient given we can assume the controller pod is running. Obviously it's not the most robust solution, but it helps sidestep any forward-looking QE bug where we roll out a bad component container image during phase 0.
> having a controller reconciliation loop that blindly sets A=T could be sufficient given we can assume the controller pod is running.

I think that is the right starting point.
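As a concrete starting point, here's a rough sketch of what "blindly set A=T" could look like; the condition helper is hand-rolled for self-containment rather than taken from the PR's clusteroperator.Builder:

```go
// Sketch only: unconditionally report Available=True, treating the fact
// that the controller pod is running as the phase-0 health signal.
package core

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func setAvailableTrue(co *configv1.ClusterOperator, releaseVersion string) {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionTrue,
		Reason:             "AsExpected",
		Message:            "The platform operator manager is available at " + releaseVersion,
		LastTransitionTime: metav1.Now(),
	}
	for i := range co.Status.Conditions {
		if co.Status.Conditions[i].Type == cond.Type {
			// Only bump LastTransitionTime when the status actually
			// changes; see the lastTransitionTime FIXME later in this PR.
			if co.Status.Conditions[i].Status == cond.Status {
				cond.LastTransitionTime = co.Status.Conditions[i].LastTransitionTime
			}
			co.Status.Conditions[i] = cond
			return
		}
	}
	co.Status.Conditions = append(co.Status.Conditions, cond)
}
```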
As for whether we need a threshold before going degraded... I could see us just keeping it simple and not going degraded.
I was thinking about this some more today: what happens if we're currently expressing A=T, and we fail a single list query? Ideally, we wouldn't set A=F given it could be a transient issue. Would it make sense instead to set D=T here, which would update the LastTransitionTime, and keep the A=T and P=T conditions? Any subsequent reconciliations that continue to fail a list query would check whether D=T is already set, and if so, compare the LastTransitionTime to the failure threshold to determine whether we should be setting A=F.
@bparees @joelanford Thoughts? It was unclear to me reading through the states listed in https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/clusteroperator.md#conditions-and-installupgrade whether this was a supported configuration.
Took a rough stab at this implementation in cdda9b5.
@timflannagan not sure what you mean by "supported configuration" but the algorithm/determination seems reasonable to me. Of course you also need to clear D=T once you have a successful list, if and only if the reason for the D=T is failure to list. (i.e. if you were degraded for some other reason, then successfully listing doesn't clear the degraded condition)
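Pulling the thread together, here's a hedged sketch of that algorithm: a first failure goes Degraded=True while Available stays True; once Degraded=True has been held past the threshold, Available flips to False; a success clears Degraded=True only when the degraded reason is the list failure itself. The 5m value and the reason string are assumptions, and the actual implementation lives in cdda9b5:

```go
// Sketch only: threshold-based Degraded/Available handling for list failures.
package core

import (
	"time"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	failureThreshold = 5 * time.Minute
	// reasonListFailed is a hypothetical reason string for this failure mode.
	reasonListFailed = "PlatformOperatorListFailed"
)

func findCondition(conds []configv1.ClusterOperatorStatusCondition, t configv1.ClusterStatusConditionType) *configv1.ClusterOperatorStatusCondition {
	for i := range conds {
		if conds[i].Type == t {
			return &conds[i]
		}
	}
	return nil
}

func setCondition(co *configv1.ClusterOperator, t configv1.ClusterStatusConditionType, s configv1.ConditionStatus, reason, msg string) {
	cond := configv1.ClusterOperatorStatusCondition{Type: t, Status: s, Reason: reason, Message: msg, LastTransitionTime: metav1.Now()}
	if existing := findCondition(co.Status.Conditions, t); existing != nil {
		// Preserve LastTransitionTime when the status is unchanged.
		if existing.Status == s {
			cond.LastTransitionTime = existing.LastTransitionTime
		}
		*existing = cond
		return
	}
	co.Status.Conditions = append(co.Status.Conditions, cond)
}

func handleListResult(co *configv1.ClusterOperator, listErr error) {
	degraded := findCondition(co.Status.Conditions, configv1.OperatorDegraded)
	if listErr == nil {
		// Clear D=T if and only if we were degraded because of list failures.
		if degraded != nil && degraded.Status == configv1.ConditionTrue && degraded.Reason == reasonListFailed {
			setCondition(co, configv1.OperatorDegraded, configv1.ConditionFalse, "AsExpected", "")
		}
		setCondition(co, configv1.OperatorAvailable, configv1.ConditionTrue, "AsExpected", "")
		return
	}
	if degraded == nil || degraded.Status != configv1.ConditionTrue {
		// First failure: go D=T but keep A=T, since this may be transient.
		setCondition(co, configv1.OperatorDegraded, configv1.ConditionTrue, reasonListFailed, listErr.Error())
		return
	}
	// Still failing: flip A=F only after the threshold has elapsed.
	if time.Since(degraded.LastTransitionTime.Time) > failureThreshold {
		setCondition(co, configv1.OperatorAvailable, configv1.ConditionFalse, reasonListFailed, listErr.Error())
	}
}
```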
manifests/0000_50_cluster-platform-operator-manager_07-core-clusteroperator.yaml
Force-pushed from 8285e7d to 6292aab
/retest
tylerslaton left a comment
Looks like Joe handled a lot of the questions here already, so I left some small comments.
manifests/0000_50_cluster-platform-operator-manager_07-core-clusteroperator.yaml
/retest
Force-pushed from 5950acb to 888ad66
```go
// FIXME(tflannag): I'm seeing unit test flakes where we're bumping
// the lastTransitionTime value despite being in the same state as
// before, which is a bug.
```
Going to see whether #40 helps improve this behavior, given it's refactoring the clusteroperator.Builder library that's being leveraged during reconciliation.
Moved some of the envtest fixes to #72, which should fix any of the recent unit test case failures where we fail to find the underlying etcd executable. When diving into the failures in CI, it looks like we need to update where the executable directory is stored, and ensure we're writing those files to a directory path that's writable in CI (e.g. /tmp).
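As a sketch of that direction, assuming controller-runtime's envtest package, the test environment can be pointed at a writable assets path; the exact fix in #72 may differ:

```go
// Sketch only: point envtest at a binary-assets directory that CI can
// write to, preferring the conventional KUBEBUILDER_ASSETS override.
package e2e

import (
	"os"
	"path/filepath"

	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func newTestEnv() *envtest.Environment {
	assets := os.Getenv("KUBEBUILDER_ASSETS")
	if assets == "" {
		// Fall back to a path under /tmp, which is writable in CI.
		assets = filepath.Join(os.TempDir(), "kubebuilder", "bin")
	}
	return &envtest.Environment{
		BinaryAssetsDirectory: assets,
	}
}
```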
Force-pushed from 46128c8 to 4497f39
Signed-off-by: timflannagan <timflannagan@gmail.com>
…ed library Signed-off-by: timflannagan <timflannagan@gmail.com>
…shared library Signed-off-by: timflannagan <timflannagan@gmail.com>
Force-pushed from 4497f39 to 5e33517
Signed-off-by: timflannagan <timflannagan@gmail.com>
@timflannagan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@timflannagan: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@timflannagan Anything I can do to help move this forward? Looks like we're blocked on a rebase right now.
Builds on top of #61 and introduces a "core" ClusterOperator resource and a controller that manages that resource. This controller was described in the phase 0 EP.