
Bug 1885376: Remove condition around marketplace OperatorAvailable status update #347

Conversation

@ankitathomas (Contributor) commented Oct 6, 2020

The marketplace-operator install fails intermittently, with marketplace never reporting its status as available.
Marketplace only makes this update when it finds a preexisting Available=False condition, which can
incorrectly leave the install stuck. This PR removes that wrapping condition so marketplace can
set its availability to true without hitting this issue.
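
For illustration, here is a minimal sketch of the shape of the change. The helper names (isConditionFalse, withAvailableTrue) and the setStatus callback wiring are assumptions made for this sketch, not the actual operator-marketplace code; only the guard-removal idea comes from the PR description.

    // Sketch only: removing the "report Available=True only if a prior
    // Available=False condition exists" guard. Helper names are hypothetical.
    package sketch

    import (
        configv1 "github.com/openshift/api/config/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // isConditionFalse is a hypothetical stand-in for the operator's check on
    // its cached status conditions.
    func isConditionFalse(conds []configv1.ClusterOperatorStatusCondition, t configv1.ClusterStatusConditionType) bool {
        for _, c := range conds {
            if c.Type == t {
                return c.Status == configv1.ConditionFalse
            }
        }
        return false
    }

    // withAvailableTrue replaces any existing Available condition with Available=True.
    func withAvailableTrue(conds []configv1.ClusterOperatorStatusCondition) []configv1.ClusterOperatorStatusCondition {
        out := make([]configv1.ClusterOperatorStatusCondition, 0, len(conds)+1)
        for _, c := range conds {
            if c.Type != configv1.OperatorAvailable {
                out = append(out, c)
            }
        }
        return append(out, configv1.ClusterOperatorStatusCondition{
            Type:               configv1.OperatorAvailable,
            Status:             configv1.ConditionTrue,
            Reason:             "OperatorAvailable",
            LastTransitionTime: metav1.Now(),
        })
    }

    // reportAvailableOld mirrors the pre-PR behaviour: the update is wrapped in
    // a guard, so if Available=False was never recorded the operator never
    // reports itself available and the install appears stuck.
    func reportAvailableOld(conds []configv1.ClusterOperatorStatusCondition, setStatus func([]configv1.ClusterOperatorStatusCondition) error) error {
        if !isConditionFalse(conds, configv1.OperatorAvailable) {
            return nil
        }
        return setStatus(withAvailableTrue(conds))
    }

    // reportAvailableNew mirrors the post-PR behaviour: always report Available=True.
    func reportAvailableNew(conds []configv1.ClusterOperatorStatusCondition, setStatus func([]configv1.ClusterOperatorStatusCondition) error) error {
        return setStatus(withAvailableTrue(conds))
    }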

@openshift-ci-robot (Contributor)

@ankitathomas: This pull request references Bugzilla bug 1885376, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.6.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1885376: Remove condition around marketplace OperatorAvailable status update

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 6, 2020
@ankitathomas ankitathomas requested review from anik120, ecordell and awgreene and removed request for hasbro17 and dinhxuanvu October 6, 2020 19:48
@ankitathomas ankitathomas reopened this Oct 6, 2020

Review thread on the following diff context:

    statusErr = r.setStatus(statusConditions)
    break
    }
    reason := "OperatorAvailable"
Member:

why not just create it in the succeeded state to start with? all this really confirms is that our timer works, right? (which seems like a proxy for "is the pod alive", which we'll already have a signal for)

Member:

+1

It feels like this whole reporting mechanism should be ripped out and we can just have a function that does this.

Contributor:

@ecordell @kevinrizza if we rip out the whole thing, we will essentially rip out the ability to report status based on the availability of the default CatalogSources (the way we used to).

Are we saying that we don't foresee a requirement ever coming in that'll ask that of marketplace?

Contributor:

Also, we still want a thread to keep checking if the operator is up and, if not, report OperatorExited, right?

@anik120 (Contributor) Oct 6, 2020

all this really confirms is that our timer works, right?

This thread running an infinite loop actually does three things at the moment:

  1. On startup, creates the clusteroperator and sets the message to Determining Status and the conditions to Available: False, Progressing: True, Degraded: False.
  2. Once startup is done, it sets the conditions to Available: True, Progressing: False, Degraded: False.
  3. If the deployment is deleted, it sets the clusteroperator condition to Available: False with Reason: OperatorExited.

I'm guessing we want the deployment to be monitored and the clusteroperator status to be set to Available: False if the deployment is deleted.

which seems like a proxy for "is the pod alive" which we'll already have a signal for

@ecordell where do we have the signal for that?
If you're saying that we don't need marketplace to explicitly set the clusteroperator status to false because something else will do that for us, then we can just move this over to a function that sets the condition to true and be done with it.

@ecordell (Member) Oct 6, 2020

The current loop is really:

  1. On startup, creates the clusteroperator and sets the message to Determining Status and the conditions to Available: False, Progressing: True, Degraded: False.
  2. 20 seconds later, regardless of what else has happened, sets Available: True, Progressing: False, Degraded: False.
  3. If the deployment is deleted, it sets the clusteroperator condition to Available: False with Reason: OperatorExited.

I'm just saying that if there's nothing (aside from the stop channel) that really prevents us from going available anymore, let's just make this:

  1. On startup, creates the clusteroperator with Available: True, Progressing: False, Degraded: False.
  2. Every 20s, sets Available: True, Progressing: False, Degraded: False (heartbeat) with a new timestamp.
  3. If the deployment is deleted, it sets the clusteroperator condition to Available: False with Reason: OperatorExited.

where do we have the signal for that?

I just mean that the pod will be failing and there are already cluster alerts for failing pods.
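
A minimal sketch of what the proposed loop could look like, assuming a 20-second ticker and a stop channel tied to the deployment's lifecycle; the function name runStatusLoop and the setAvailable callback are hypothetical, not the merged implementation:

    // Sketch only: the simplified reporting loop proposed above. The names
    // runStatusLoop and setAvailable are hypothetical.
    package sketch

    import "time"

    func runStatusLoop(stopCh <-chan struct{}, setAvailable func(available bool, reason string) error) {
        // 1. On startup, report Available: True, Progressing: False, Degraded: False.
        _ = setAvailable(true, "OperatorAvailable")

        ticker := time.NewTicker(20 * time.Second)
        defer ticker.Stop()

        for {
            select {
            case <-ticker.C:
                // 2. Heartbeat: re-assert the same conditions with a fresh timestamp.
                _ = setAvailable(true, "OperatorAvailable")
            case <-stopCh:
                // 3. The operator is going away (e.g. the deployment was deleted).
                _ = setAvailable(false, "OperatorExited")
                return
            }
        }
    }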

@awgreene (Member) Oct 7, 2020

At the time that this was created, there was a non-marketplace e2e test which ensured that operators don't immediately report available. This was over a year ago and might have changed.

Member:

I reached out to Trevor King and this test does not seem to exist anymore.

@anik120 (Contributor) left a comment

@ankitathomas after the PR is finalized, it's probably a good idea to retest a few times (5-10?) and report the status of those tests in the comments (noting whether any fail with the clusteroperator stuck in Determining Status), to make sure we've solved the problem before we merge this.

@ankitathomas (Contributor, Author)

Test run 1:
test okd-e2e-aws failed due to an Amazon connectivity error:

    Post "https://iam.amazonaws.com/": read tcp 10.128.4.236:43018->52.94.225.3:443: read: connection reset by peer

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/347/pull-ci-operator-framework-operator-marketplace-master-okd-e2e-aws/1313567790354927616

@ankitathomas (Contributor, Author)

/test all

anik120 added a commit to anik120/operator-marketplace that referenced this pull request Oct 7, 2020
Duplicate of operator-framework#347 for running parallel tests. Will be closed in favour of operator-framework#347.
/hold
@anik120 mentioned this pull request Oct 7, 2020

@anik120 (Contributor) commented Oct 7, 2020

Test run 2:

test e2e-aws failed with this error:

    <exec.CodeExitError>: {
        Err: {
            s: "error running /usr/bin/kubectl --server=https://api.ci-op-wn6s5c4z-8d2e4.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/var/run/secrets/ci.openshift.io/multi-stage/kubeconfig --namespace=e2e-kubectl-8324 run -i --image=docker.io/library/busybox:1.29 --restart=Never success -- /bin/sh -c exit 0:\nCommand stdout:\n\nstderr:\nerror: timed out waiting for the condition\n\nerror:\nexit status 1",
        },
        Code: 1,
    }
    error running /usr/bin/kubectl --server=https://api.ci-op-wn6s5c4z-8d2e4.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/var/run/secrets/ci.openshift.io/multi-stage/kubeconfig --namespace=e2e-kubectl-8324 run -i --image=docker.io/library/busybox:1.29 --restart=Never success -- /bin/sh -c exit 0:
    Command stdout:
    
    stderr:
    error: timed out waiting for the condition
    
    error:
    exit status 1

The rest of the tests are green.

/retest all

@openshift-ci-robot (Contributor)

@anik120: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

  • /test e2e-aws
  • /test e2e-aws-console-olm
  • /test e2e-aws-operator
  • /test e2e-aws-serial
  • /test e2e-aws-upgrade
  • /test images
  • /test okd-e2e-aws
  • /test okd-e2e-aws-console-olm
  • /test okd-e2e-aws-operator
  • /test okd-images
  • /test okd-unit
  • /test unit

Use /test all to run all jobs.

In response to this:

/retest all


@anik120 (Contributor) commented Oct 7, 2020

/test all

@ankitathomas (Contributor, Author)

Test run 3:
Test e2e-aws-serial failed due to errors related to kube-proxy:

    curl -q -s --connect-timeout 1 http://localhost:10249/proxyMode
    command terminated with exit code 7

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/347/pull-ci-operator-framework-operator-marketplace-master-e2e-aws/1313625080806248448

@ankitathomas (Contributor, Author)

/test all

@ecordell (Member) commented Oct 7, 2020

/lgtm
/approve

I still think this can be improved, but it's fine to merge as-is. I don't see this adding any new failures to the system, so by merging and watching it across all-platform CI we can gather data faster on whether this fixes the issue.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 7, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankitathomas, ecordell

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2020
@ecordell (Member) commented Oct 7, 2020

/bugzilla refresh

@openshift-ci-robot (Contributor)

@ecordell: This pull request references Bugzilla bug 1885376, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.6.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh


@ecordell (Member) commented Oct 7, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 7, 2020
@openshift-ci-robot (Contributor)

@ecordell: This pull request references Bugzilla bug 1885376, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh


@ecordell (Member) commented Oct 7, 2020

/cherry-pick release-4.6

@openshift-cherrypick-robot

@ecordell: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.6


@ecordell (Member) commented Oct 7, 2020

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@anik120 (Contributor) commented Oct 7, 2020

/retest

1 similar comment
@ankitathomas (Contributor, Author)

/retest

@cuppett commented Oct 7, 2020

/test e2e-aws-upgrade

@cuppett commented Oct 7, 2020

/test e2e-aws-serial

1 similar comment
@ankitathomas (Contributor, Author)

/test e2e-aws-serial

@anik120 (Contributor) commented Oct 7, 2020

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@kevinrizza (Member)

/test e2e-aws

@openshift-merge-robot openshift-merge-robot merged commit b2c7a7a into operator-framework:master Oct 8, 2020
@openshift-ci-robot (Contributor)

@ankitathomas: All pull requests linked via external trackers have merged:

Bugzilla bug 1885376 has been moved to the MODIFIED state.

In response to this:

Bug 1885376: Remove condition around marketplace OperatorAvailable status update


@openshift-cherrypick-robot

@ecordell: new pull request created: #351

In response to this:

/cherry-pick release-4.6

