
Bug 1885376: Remove condition around marketplace OperatorAvailable status update #347

Conversation

@ankitathomas (Contributor) commented Oct 6, 2020

The marketplace-operator install fails intermittently, with marketplace never reporting its status as available.
Marketplace only makes this update when it finds a preexisting Available=False condition, which can
incorrectly leave the install stuck. This PR removes that wrapping condition so marketplace can
set its availability to true without hitting this issue.
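
For illustration, here is a minimal sketch of the shape of the change. The helper names (isConditionFalse, withAvailableTrue) and the setStatus callback wiring are assumptions made for this sketch, not the actual operator-marketplace code; only the guard-removal idea comes from the PR description.

    // Sketch only: removing the "report Available=True only if a prior
    // Available=False condition exists" guard. Helper names are hypothetical.
    package sketch

    import (
        configv1 "github.com/openshift/api/config/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // isConditionFalse is a hypothetical stand-in for the operator's check on
    // its cached status conditions.
    func isConditionFalse(conds []configv1.ClusterOperatorStatusCondition, t configv1.ClusterStatusConditionType) bool {
        for _, c := range conds {
            if c.Type == t {
                return c.Status == configv1.ConditionFalse
            }
        }
        return false
    }

    // withAvailableTrue replaces any existing Available condition with Available=True.
    func withAvailableTrue(conds []configv1.ClusterOperatorStatusCondition) []configv1.ClusterOperatorStatusCondition {
        out := make([]configv1.ClusterOperatorStatusCondition, 0, len(conds)+1)
        for _, c := range conds {
            if c.Type != configv1.OperatorAvailable {
                out = append(out, c)
            }
        }
        return append(out, configv1.ClusterOperatorStatusCondition{
            Type:               configv1.OperatorAvailable,
            Status:             configv1.ConditionTrue,
            Reason:             "OperatorAvailable",
            LastTransitionTime: metav1.Now(),
        })
    }

    // reportAvailableOld mirrors the pre-PR behaviour: the update is wrapped in
    // a guard, so if Available=False was never recorded the operator never
    // reports itself available and the install appears stuck.
    func reportAvailableOld(conds []configv1.ClusterOperatorStatusCondition, setStatus func([]configv1.ClusterOperatorStatusCondition) error) error {
        if !isConditionFalse(conds, configv1.OperatorAvailable) {
            return nil
        }
        return setStatus(withAvailableTrue(conds))
    }

    // reportAvailableNew mirrors the post-PR behaviour: always report Available=True.
    func reportAvailableNew(conds []configv1.ClusterOperatorStatusCondition, setStatus func([]configv1.ClusterOperatorStatusCondition) error) error {
        return setStatus(withAvailableTrue(conds))
    }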

@openshift-ci-robot (Contributor)

@ankitathomas: This pull request references Bugzilla bug 1885376, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.6.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1885376: Remove condition around marketplace OperatorAvailable status update

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 6, 2020
@ankitathomas ankitathomas requested review from anik120, ecordell and awgreene and removed request for hasbro17 and dinhxuanvu October 6, 2020 19:48
@ankitathomas ankitathomas reopened this Oct 6, 2020

Review thread on the following diff context:

    statusErr = r.setStatus(statusConditions)
    break
    }
    reason := "OperatorAvailable"
Member:

why not just create it in the succeeded state to start with? all this really confirms is that our timer works, right? (which seems like a proxy for "is the pod alive", which we'll already have a signal for)

Member:

+1

It feels like this whole reporting mechanism should be ripped out and we can just have a function that does this.

Contributor:

@ecordell @kevinrizza if we rip out the whole thing, we will essentially rip out the ability to report status based on the availability of the default CatalogSources (the way we used to).

Are we saying that we don't foresee a requirement ever coming in that'll ask that of marketplace?

Contributor:

Also, we still want a thread to keep checking if the operator is up and, if not, report OperatorExited, right?

@anik120 (Contributor) Oct 6, 2020

all this really confirms is that our timer works, right?

This thread running an infinite loop actually does three things at the moment:

  1. On startup, creates the clusteroperator and sets the message to Determining Status and the conditions to Available: False, Progressing: True, Degraded: False.
  2. Once startup is done, it sets the conditions to Available: True, Progressing: False, Degraded: False.
  3. If the deployment is deleted, it sets the clusteroperator condition to Available: False with Reason: OperatorExited.

I'm guessing we want the deployment to be monitored and the clusteroperator status to be set to Available: False if the deployment is deleted.

which seems like a proxy for "is the pod alive" which we'll already have a signal for

@ecordell where do we have the signal for that?
If you're saying that we don't need marketplace to explicitly set the clusteroperator status to false because something else will do that for us, then we can just move this over to a function that sets the condition to true and be done with it.

@ecordell (Member) Oct 6, 2020

The current loop is really:

  1. On startup, creates the clusteroperator and sets the message to Determining Status and the conditions to Available: False, Progressing: True, Degraded: False.
  2. 20 seconds later, regardless of what else has happened, sets Available: True, Progressing: False, Degraded: False.
  3. If the deployment is deleted, it sets the clusteroperator condition to Available: False with Reason: OperatorExited.

I'm just saying that if there's nothing (aside from the stop channel) that really prevents us from going available anymore, let's just make this:

  1. On startup, creates the clusteroperator with Available: True, Progressing: False, Degraded: False.
  2. Every 20s, sets Available: True, Progressing: False, Degraded: False (heartbeat) with a new timestamp.
  3. If the deployment is deleted, it sets the clusteroperator condition to Available: False with Reason: OperatorExited.

where do we have the signal for that?

I just mean that the pod will be failing and there are already cluster alerts for failing pods.
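
A minimal sketch of what the proposed loop could look like, assuming a 20-second ticker and a stop channel tied to the deployment's lifecycle; the function name runStatusLoop and the setAvailable callback are hypothetical, not the merged implementation:

    // Sketch only: the simplified reporting loop proposed above. The names
    // runStatusLoop and setAvailable are hypothetical.
    package sketch

    import "time"

    func runStatusLoop(stopCh <-chan struct{}, setAvailable func(available bool, reason string) error) {
        // 1. On startup, report Available: True, Progressing: False, Degraded: False.
        _ = setAvailable(true, "OperatorAvailable")

        ticker := time.NewTicker(20 * time.Second)
        defer ticker.Stop()

        for {
            select {
            case <-ticker.C:
                // 2. Heartbeat: re-assert the same conditions with a fresh timestamp.
                _ = setAvailable(true, "OperatorAvailable")
            case <-stopCh:
                // 3. The operator is going away (e.g. the deployment was deleted).
                _ = setAvailable(false, "OperatorExited")
                return
            }
        }
    }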

@awgreene (Member) Oct 7, 2020

At the time that this was created, there was a non-marketplace e2e test which ensured that operators don't immediately report available. This was over a year ago and might have changed.

Member:

I reached out to Trevor King and this test does not seem to exist anymore.

@anik120 (Contributor) left a comment

@ankitathomas after the PR is finalized, it's probably a good idea to retest a few times (5-10?) and report the status of those tests in the comments (noting whether any fail with the clusteroperator stuck in Determining Status), to make sure we've solved the problem before we merge this.

@ankitathomas (Contributor, Author)

Test run 1:
test okd-e2e-aws failed due to an Amazon connectivity error:

    Post "https://iam.amazonaws.com/": read tcp 10.128.4.236:43018->52.94.225.3:443: read: connection reset by peer

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/347/pull-ci-operator-framework-operator-marketplace-master-okd-e2e-aws/1313567790354927616

@ankitathomas (Contributor, Author)

/test all

anik120 added a commit to anik120/operator-marketplace that referenced this pull request Oct 7, 2020
Duplicate of operator-framework#347 for running parallel tests. Will be closed in favour of operator-framework#347.
/hold
@anik120 mentioned this pull request Oct 7, 2020

@anik120 (Contributor) commented Oct 7, 2020

Test run 2:

test e2e-aws failed with this error:

    <exec.CodeExitError>: {
        Err: {
            s: "error running /usr/bin/kubectl --server=https://api.ci-op-wn6s5c4z-8d2e4.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/var/run/secrets/ci.openshift.io/multi-stage/kubeconfig --namespace=e2e-kubectl-8324 run -i --image=docker.io/library/busybox:1.29 --restart=Never success -- /bin/sh -c exit 0:\nCommand stdout:\n\nstderr:\nerror: timed out waiting for the condition\n\nerror:\nexit status 1",
        },
        Code: 1,
    }
    error running /usr/bin/kubectl --server=https://api.ci-op-wn6s5c4z-8d2e4.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/var/run/secrets/ci.openshift.io/multi-stage/kubeconfig --namespace=e2e-kubectl-8324 run -i --image=docker.io/library/busybox:1.29 --restart=Never success -- /bin/sh -c exit 0:
    Command stdout:
    
    stderr:
    error: timed out waiting for the condition
    
    error:
    exit status 1

The rest of the tests are green.

/retest all

@openshift-ci-robot (Contributor)

@anik120: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

  • /test e2e-aws
  • /test e2e-aws-console-olm
  • /test e2e-aws-operator
  • /test e2e-aws-serial
  • /test e2e-aws-upgrade
  • /test images
  • /test okd-e2e-aws
  • /test okd-e2e-aws-console-olm
  • /test okd-e2e-aws-operator
  • /test okd-images
  • /test okd-unit
  • /test unit

Use /test all to run all jobs.

In response to this:

/retest all


@anik120 (Contributor) commented Oct 7, 2020

/test all

@ankitathomas (Contributor, Author)

Test run 3:
Test e2e-aws-serial failed due to errors related to kube-proxy:

    curl -q -s --connect-timeout 1 http://localhost:10249/proxyMode
    command terminated with exit code 7

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/347/pull-ci-operator-framework-operator-marketplace-master-e2e-aws/1313625080806248448

@ankitathomas (Contributor, Author)

/test all

@ecordell (Member) commented Oct 7, 2020

/lgtm
/approve

I still think this can be improved, but it's fine to merge as-is. I don't see this adding any new failures to the system, so by merging and watching it across all-platform CI we can gather data faster on whether this fixes the issue.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 7, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankitathomas, ecordell

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2020
@ecordell (Member) commented Oct 7, 2020

/bugzilla refresh

@openshift-ci-robot (Contributor)

@ecordell: This pull request references Bugzilla bug 1885376, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.6.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh


@ecordell (Member) commented Oct 7, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 7, 2020
@openshift-ci-robot (Contributor)

@ecordell: This pull request references Bugzilla bug 1885376, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh


@ecordell (Member) commented Oct 7, 2020

/cherry-pick release-4.6

@openshift-cherrypick-robot

@ecordell: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.6


@ecordell (Member) commented Oct 7, 2020

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@anik120 (Contributor) commented Oct 7, 2020

/retest

1 similar comment
@ankitathomas (Contributor, Author)

/retest

@cuppett commented Oct 7, 2020

/test e2e-aws-upgrade

@cuppett commented Oct 7, 2020

/test e2e-aws-serial

1 similar comment
@ankitathomas (Contributor, Author)

/test e2e-aws-serial

@anik120 (Contributor) commented Oct 7, 2020

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@kevinrizza (Member)

/test e2e-aws

@openshift-merge-robot openshift-merge-robot merged commit b2c7a7a into operator-framework:master Oct 8, 2020
@openshift-ci-robot (Contributor)

@ankitathomas: All pull requests linked via external trackers have merged:

Bugzilla bug 1885376 has been moved to the MODIFIED state.

In response to this:

Bug 1885376: Remove condition around marketplace OperatorAvailable status update


@openshift-cherrypick-robot

@ecordell: new pull request created: #351

In response to this:

/cherry-pick release-4.6

