Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/test/client: retry cleanup function if cluster is temporarily unavailable #2277

Merged
merged 1 commit into from
Jan 20, 2020

Conversation

JAORMX
Copy link
Contributor

@JAORMX JAORMX commented Nov 27, 2019

Description

There are instances where the cluster could be temporarily unavailable.
e.g. when etcd is doing leader re-election.

https://search.svc.ci.openshift.org/?search=etcdserver%3A+leader+changed&maxAge=336h&context=2&type=all

While this is not a very normal scenario, it would be good to be lenient
on the cleanup side of things and retry if such a case happens.

This will be reflected as Timeout or Unavailable errors coming from etcd.

Motivation

When using the e2e test framework to test an operator, if there are network issues, the cleanup functions might fail. This would present itself with the following logs:

client.go:75: resource type Deployment with namespace/name (openshift-compliance/compliance-operator) successfully deleted
client.go:75: resource type ClusterRoleBinding with namespace/name (openshift-compliance/compliance-operator) successfully deleted
context.go:76: A cleanup function failed with error: (rpc error: code = Unavailable desc = etcdserver: leader changed)
client.go:75: resource type RoleBinding with namespace/name (openshift-compliance/compliance-operator) successfully deleted
client.go:75: resource type ClusterRole with namespace/name (openshift-compliance/compliance-operator) successfully deleted

Not that this would be a fairly random and transcient error. This is why I chose to add it to the Cleanup functions, since there are several of them (one per object tracked by the framework) and there is a higher probability of hitting this.

Solution

This patch proposes to retry with backoff if an error happens. After a set number of retries, it'll return the appropriate error. This wouldn't retry forever, only until the timeout hits.

@JAORMX
Copy link
Contributor Author

JAORMX commented Nov 27, 2019

Closing, the error comes directly from etcd and not the apiserver. So it won't be caught like this.

Copy link
Contributor

@camilamacedo86 camilamacedo86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @JAORMX,

Thank you for your contribution.

@camilamacedo86
Copy link
Contributor

@AlexNPavel wdyt?

@JAORMX
Copy link
Contributor Author

JAORMX commented Dec 4, 2019

/retest

@JAORMX JAORMX closed this Dec 4, 2019
@JAORMX JAORMX reopened this Dec 4, 2019
@openshift-ci-robot
Copy link

@JAORMX: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jmccormick2001
Copy link
Contributor

/ok-to-test

@openshift-ci-robot openshift-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Dec 10, 2019
Copy link
Member

@estroz estroz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fine workaround for now. The cleanup function logic needs to be cleaned up and edge cases like this codified in a follow-up.

/lgtm

PTAL @jmccormick2001 @hasbro17

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 11, 2019
Copy link
Member

@joelanford joelanford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question about what errors to consider to retry. Seems like a good change otherwise! Thanks!

pkg/test/client.go Outdated Show resolved Hide resolved
Copy link
Contributor

@camilamacedo86 camilamacedo86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just address @joelanford suggestion; https://github.com/operator-framework/operator-sdk/pull/2277/files#r365028682 and then, it shows fine for me.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jan 17, 2020
@openshift-ci-robot
Copy link

New changes are detected. LGTM label has been removed.

@JAORMX
Copy link
Contributor Author

JAORMX commented Jan 17, 2020

Thanks for the reviews everyone!

@@ -21,6 +21,7 @@ import (
apierrors "k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/util/wait"
"k8s.io/client-go/util/retry"
dynclient "sigs.k8s.io/controller-runtime/pkg/client"
Copy link
Contributor

@camilamacedo86 camilamacedo86 Jan 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JAORMX the entry in CHANGELOG is no longer here :-(
Could you please add it again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which entry?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An entry describing the change/fix/add to let the users know what was changed. Was not it added before?
See here
Following my suggestion.

...
## Changed
- Added retry logic to the cleanup function from the e2e test framework in order to allow it to be achieved in the scenarios where temporary network issues are faced. ([#2277](https://github.com/operator-framework/operator-sdk/pull/2277))

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sonuds good to me! thanks!

…vailable

There are instances where the cluster could be temporarily unavailable.
e.g. when etcd is doing leader re-election.

    https://search.svc.ci.openshift.org/?search=etcdserver%3A+leader+changed&maxAge=336h&context=2&type=all

While this is not a very normal scenario, it would be good to be lenient
on the cleanup side of things and retry if such a case happens.

This will be reflected as Timeout or Unavailable errors coming from etcd.
This patch proposes to retry with backoff in an error happens. After a
set number of retries, it'll return the appropriate error.
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 17, 2020
@JAORMX
Copy link
Contributor Author

JAORMX commented Jan 20, 2020

/retest

@JAORMX JAORMX closed this Jan 20, 2020
@JAORMX JAORMX reopened this Jan 20, 2020
@jmccormick2001 jmccormick2001 merged commit 6cf4306 into operator-framework:master Jan 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants