Bug 1885376: Remove condition around marketplace OperatorAvailable status update #347
Conversation
@ankitathomas: This pull request references Bugzilla bug 1885376, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
statusErr = r.setStatus(statusConditions)
break
}
reason := "OperatorAvailable"
why not just create it in the succeeded state to start with? All this really confirms is that our timer works, right? (Which seems like a proxy for "is the pod alive", which we'll already have a signal for.)
+1
It feels like this whole reporting mechanism should be ripped out and we can just have a function that does this.
@ecordell @kevinrizza if we rip out the whole thing, we will essentially rip out the ability to report status based on the availability of the default CatalogSources (the way we used to).
Are we saying that we don't foresee a requirement ever coming in that'll ask that of marketplace?
Also, we still want a thread to keep checking if the operator is up, and if not report OperatorExited, right?
all this really confirms is that our timer works, right?
This thread running an infinite loop actually does three things atm:
- On startup, creates the clusteroperator and sets the message to `Determining Status` and the conditions to `Available: False`, `Progressing: True`, `Degraded: False`.
- Once the startup is done, it sets the conditions to `Available: True`, `Progressing: False`, `Degraded: False`.
- If the deployment is deleted, it sets the clusteroperator condition to `Available: False` with `Reason: OperatorExited`.

I'm guessing we want the deployment to be monitored and the clusteroperator status to be set as `Available: False` if the deployment is deleted.
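The three phases described above can be modeled with a small sketch. This is illustrative only: the structs and the `setConditions` helper are hypothetical simplifications, standing in for the real OpenShift `config/v1` ClusterOperator API that marketplace actually writes to.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for an OpenShift ClusterOperator
// status condition (the real type lives in the config/v1 API).
type Condition struct {
	Type               string // "Available", "Progressing", "Degraded"
	Status             string // "True" or "False"
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setConditions is a hypothetical helper that overwrites the full
// condition set, stamping each entry with the current time.
func setConditions(types, statuses []string, reason, message string) []Condition {
	now := time.Now()
	conds := make([]Condition, len(types))
	for i := range types {
		conds[i] = Condition{
			Type:               types[i],
			Status:             statuses[i],
			Reason:             reason,
			Message:            message,
			LastTransitionTime: now,
		}
	}
	return conds
}

func main() {
	types := []string{"Available", "Progressing", "Degraded"}

	// 1. On startup: Available=False, Progressing=True, Degraded=False.
	conds := setConditions(types, []string{"False", "True", "False"}, "DeterminingStatus", "Determining Status")
	fmt.Println(conds[0].Status) // False

	// 2. Once startup is done: Available=True, Progressing=False, Degraded=False.
	conds = setConditions(types, []string{"True", "False", "False"}, "OperatorAvailable", "")
	fmt.Println(conds[0].Status) // True

	// 3. If the deployment is deleted: Available=False with Reason=OperatorExited.
	conds = setConditions(types, []string{"False", "False", "False"}, "OperatorExited", "")
	fmt.Println(conds[0].Reason) // OperatorExited
}
```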
which seems like a proxy for "is the pod alive" which we'll already have a signal for
@ecordell where do we have the signal for that?
If you're saying that we don't need marketplace to explicitly set the clusteroperator status to false because something else will do that for us, then we can just move this over to a function that sets the condition to true and be done with it
The current loop is really:
- On startup, creates the clusteroperator and sets the message to `Determining Status` and the conditions to `Available: False`, `Progressing: True`, `Degraded: False`.
- 20 seconds later, regardless of what else has happened, sets `Available: True`, `Progressing: False`, `Degraded: False`.
- If the deployment is deleted, sets the clusteroperator condition to `Available: False` with `Reason: OperatorExited`.

I'm just saying that if there's nothing (aside from the stop channel) that really prevents us from going available anymore, let's just make this:
- On startup, creates the clusteroperator and sets the conditions to `Available: True`, `Progressing: False`, `Degraded: False`.
- Every 20s, sets `Available: True`, `Progressing: False`, `Degraded: False` (heartbeat) with a new timestamp.
- If the deployment is deleted, sets the clusteroperator condition to `Available: False` with `Reason: OperatorExited`.
where do we have the signal for that?
I just mean that the pod will be failing and there are already cluster alerts for failing pods.
At the time that this was created, there was a non-marketplace e2e test which ensured that operators don't immediately report available. This was over a year ago and might have changed.
I reached out to Trevor King and this test does not seem to exist anymore.
@ankitathomas after the PR is finalized, it's probably a good idea to retest a few times (5-10?) and report the status of those tests in the comments (noting if any are failing with the clusteroperator stuck in `Determining Status`), to make sure we've solved the problem before we merge this.
Test run 1:
/test all
Duplicate of operator-framework#347 for running parallel tests. Will be closed in favour of operator-framework#347. /hold
Test run 2: test e2e-aws failed due to error:
Rest of the tests are green. /retest all
/test all
Test run 3: `curl -q -s --connect-timeout 1 http://localhost:10249/proxyMode`
command terminated with exit code 7
/test all
/lgtm I still think this can be improved, but it's fine to merge as-is. I don't see this adding any new failures to the system, so we can gather data faster about this fixing the issue by merging and seeing it across all platform CI.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ankitathomas, ecordell. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/bugzilla refresh
@ecordell: This pull request references Bugzilla bug 1885376, which is invalid.
/bugzilla refresh
@ecordell: This pull request references Bugzilla bug 1885376, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
/cherry-pick release-4.6
@ecordell: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
/retest
/test e2e-aws-upgrade
/test e2e-aws-serial
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
/test e2e-aws
@ankitathomas: All pull requests linked via external trackers have merged: Bugzilla bug 1885376 has been moved to the MODIFIED state.
@ecordell: new pull request created: #351
marketplace-operator install fails intermittently, with marketplace never setting its status as available. Marketplace checks for a preexisting `Available: False` condition before making this update, which can incorrectly leave the install stuck. Remove this wrapping condition so that marketplace can set its availability to true without hitting this issue.
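A minimal reconstruction of the guard being removed (hypothetical names and structs, not the actual marketplace code): the old path only flipped `Available` to true when a preexisting `Available: False` condition was found, so a missing condition left the operator stuck as unavailable; the fix updates the condition unconditionally.

```go
package main

import "fmt"

// condition is a hypothetical, minimal clusteroperator condition.
type condition struct {
	Type   string
	Status string
}

// find returns a pointer to the condition of the given type, or nil.
func find(conds []condition, t string) *condition {
	for i := range conds {
		if conds[i].Type == t {
			return &conds[i]
		}
	}
	return nil
}

// guardedSetAvailable models the old behavior: only update if a
// preexisting Available=False condition exists. If the initial status
// write never landed, this is a no-op forever -- the bug.
func guardedSetAvailable(conds []condition) []condition {
	if c := find(conds, "Available"); c != nil && c.Status == "False" {
		c.Status = "True"
	}
	return conds
}

// setAvailable models the fix: set Available=True unconditionally,
// creating the condition if it is missing.
func setAvailable(conds []condition) []condition {
	if c := find(conds, "Available"); c != nil {
		c.Status = "True"
		return conds
	}
	return append(conds, condition{Type: "Available", Status: "True"})
}

func main() {
	// A status where the initial Available=False write never landed.
	empty := []condition{}

	fmt.Println(len(guardedSetAvailable(empty))) // 0: stays unavailable forever
	fmt.Println(setAvailable(empty)[0].Status)   // True: the fix reports available
}
```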