minor cleanup and refactoring for consistency and simplicity #40
|
Skipping CI for Draft Pull Request. |
|
/test all |
|
Looks reasonable to me at a glance. This will fail the linting check due to the package import grouping, and I added some unit tests for when the TypeApplied condition is nil, so you'd have to update those as well. |
ea53d9a to
25ab990
Compare
|
/test all |
25ab990 to
b11e258
Compare
|
/test all |
internal/util/util.go
Outdated
| // In the case that the PO resource is expressing failing states, then an error | ||
| // will be returned to reflect that. | ||
| func inspectPlatformOperator(po platformv1alpha1.PlatformOperator) error { | ||
| if equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) { |
@joelanford Hi, when equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) returns true, does it mean the PO status part is empty?
If that understanding is correct, I don't understand why we return nil here. I think return nil means the PO is healthy as far as InspectPlatformOperators is concerned, so no error is reported to the aggregated CO. But we do not know the PO's final status yet (the PO may ultimately fail to install). So maybe this should return buildPOFailureMessage(po.GetName(), "PO has no status field") instead.
(I'm just worried that before the PO installation is done (no final status yet), we report wrong information to the aggregated CO. Thanks for checking my concern.)
> does it mean the PO status part is empty?
Yes
> I can not understand why return nil here
@timflannagan and I were chatting about this yesterday. In general, a PO with no status is a PO that has yet to be reconciled by the PO controller and that typically happens only as the PO is just created. When the PO controller reconciles this PO, it'll attach a status and then this controller will reconcile again.
I wanted to avoid the aggregated CO going into a failing state temporarily every time a new PO shows up.
One thought I had to guard against the possibility of the PO controller never actually reconciling the PO: we could check that the status is empty and that creationTimestamp is within the last minute.
Which would result in:
- Empty status within a minute of being created: no problem, return nil.
- Empty status more than a minute after being created: something's wrong, return an error.
WDYT?
@joelanford Should we add a comment that basically explains what you had written down here? I should've done that in the PR that refactored these checks.
In general, I think that option you outlined seems reasonable at a glance but introduces potential complexity.
Thinking about this more, in the current implementation I think we'll see flapping in the aggregate ClusterOperator resource we're managing. I'm working on the POM component's implementation, which is responsible for bubbling up the underlying BundleDeployment status. In the case where the BD resource is waiting to be unpacked, we'll effectively see there's no TypeApplied status condition type present yet, and propagate the ApplyPending status condition reason in the PO resource. The aggregate CO controller would see that, and mark the aggregate CO resource with the following:
$ k get co -w
...
platform-operators-aggregated False True False 0s encountered the failing cincinnati-operatorrwxr4 platform operator with reason "ApplyPending"
...
platform-operators-aggregated False True False 0s encountered the failing cincinnati-operatorrwxr4 platform operator with reason "ApplyPending"
platform-operators-aggregated True True False 0s
Which feels like the wrong behavior, UX, etc.
@joelanford Thanks for checking my comment.
Your idea is OK, but it adds potential complexity (I agree with timflannagan on that).
Maybe we just change return nil to return buildPOFailureMessage(po.GetName(), platformtypes.ReasonApplyPending), because we could treat "not reconciled yet" as ApplyPending.
Anyway, I am also OK if you want to go with your one-minute idea.
My main concern is that it seems misleading and alarmist to say a PO is failing when in fact it has just been created and is waiting for the PO controller to do something with it.
In the most recent commit, I've updated to include the "time since creationTimestamp" check. But if there's another approach that lets us avoid that complexity AND avoid saying the PO is failing, I'm totally willing to get rid of that.
tylerslaton
left a comment
Nice work!
/lgtm
| coBuilder.WithProgressing(openshiftconfigv1.ConditionTrue, "No POs are present in the cluster") | ||
|
|
||
| // TODO: cleanup condition reasons. | ||
| // 1. use constants for condition reasons. |
We have some condition constants already defined but it doesn't seem like we are using them here for some reason. Maybe we just need to add onto that list?
platform-operators/api/v1alpha1/types.go
Lines 22 to 25 in 03149b2
| ReasonSourceFailed = "SourceFailed" | |
| ReasonApplyFailed = "ApplyFailed" | |
| ReasonApplySuccessful = "ApplySuccessful" | |
| ReasonApplyPending = "ApplyPending" |
Those are reasons for the PO Applied condition type. These are reasons for the ClusterOperator condition types.
| // to reflect those failing PO resources. | ||
| if statusErrorCheck := util.InspectPlatformOperators(poList); statusErrorCheck != nil { | ||
| coBuilder.WithAvailable(openshiftconfigv1.ConditionFalse, statusErrorCheck.Error(), "POError") | ||
| coBuilder.WithAvailable(metav1.ConditionFalse, "POError", statusErrorCheck.Error()) |
In general, should we be using the apimachinery APIs for this instead of the OpenShift ones? Mainly asking to see how you decided to change this.
They're both definitions of basically the exact same thing. I changed this mainly because apimachinery has some nice helpers for interacting with conditions that help us avoid maintaining complexity around the "rules" of conditions, like knowing when to bump last transition time.
|
lgtm |
b11e258 to
0d50bcf
Compare
|
/retest |
|
/lgtm |
internal/util/util.go
Outdated
| // will be returned to reflect that. | ||
| func inspectPlatformOperator(po platformv1alpha1.PlatformOperator) error { | ||
| if equality.Semantic.DeepEqual(po.Status, platformv1alpha1.PlatformOperatorStatus{}) && | ||
| (po.CreationTimestamp.IsZero() || time.Since(po.CreationTimestamp.Time) < time.Minute) { |
Just want to highlight that my rebase included this addition based on the earlier discussion here: #40 (comment)
0d50bcf to
c7a6fcc
Compare
|
The e2e results are a no-op until #58 merges. |
Based on what Trevor said yesterday, I kinda think we leave it up to CVO to recreate the CO. |
Overall this still LGTM; I just had some very minor nits and a comment.
| defer func() { | ||
| if err := coWriter.UpdateStatus(ctx, aggregatedCO, coBuilder.GetStatus()); err != nil { | ||
| log.Error(err, "error updating CO status") | ||
| log.Error(err, "error updating cluster operator status") |
nit:
| log.Error(err, "error updating cluster operator status") | |
| log.Error(err, "error updating ClusterOperator status") |
All other instances apply if we want to change this.
| } | ||
| coBuilder.WithAvailable(configv1.ConditionTrue, "All POs in a successful state", "POsHealthy") | ||
| coBuilder.WithProgressing(configv1.ConditionFalse, "All POs in a successful state") | ||
| coBuilder.WithAvailable(metav1.ConditionTrue, clusteroperator.ReasonAsExpected, "All platform operators are in a successful state") |
nit:
| coBuilder.WithAvailable(metav1.ConditionTrue, clusteroperator.ReasonAsExpected, "All platform operators are in a successful state") | |
| coBuilder.WithAvailable(metav1.ConditionTrue, clusteroperator.ReasonAsExpected, "All PlatformOperators are in a successful state") |
Applies to all other instances if we want to fix this.
| // TODO: consider something more fine-grained than a catch-all "PlatformOperatorError" reason. | ||
| // There's a non-negligible difference between "PO is explicitly failing installation" and "PO is not yet installed" |
Maybe we could define these two conditions as reasons and then bubble them up via the util.InspectPlatformOperators() return? That way we can set that reason inside this WithAvailable() call and have a less generic "PlatformOperatorError" reason.
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: joelanford, timflannagan, tylerslaton The full list of commands accepted by this bot can be found here. The pull request process is described here
|
Looks like some of the e2e tests need to be updated too. /lgtm cancel |
7357c98 to
fad426b
Compare
Looks like we're missing a couple more test cases 🙃. |
|
#68 should also help with increasing visibility into those test case failures given our current preference towards gomega pattern matcher implementations. |
|
🤮 this is what I get for trying to fix these without actually looking too closely at the failures. |
Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
fad426b to
e080b5e
Compare
|
/lgtm |
|
Meta-question: Is there a process I'm not aware of to merge non-bug/non-feature PRs? The label requirements seem to imply one or the other. |
|
@joelanford: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
We're doing the TE slides so manually adding the px-approved label after a Slack conversation in the #tmp-platform-operators channel. /label px-approved Holding, as I think we need QE approval given we're modifying the expected status condition reason/messages/etc. and it's unclear whether that will break any automated testing. /hold |
|
This should also be a no-op for docs. Manually adding that label. /label docs-approved |
|
/label qe-approved |
|
@joelanford: PR needs rebase. |
|
Superseding this in #75 given the merge conflicts. /close |
|
@timflannagan: Closed this PR. |
Somewhat scatter-brained refactor/fixup PR for random things I found. I'll clean this up more if desired.
Signed-off-by: Joe Lanford joe.lanford@gmail.com