pkg/test/client: retry cleanup function if cluster is temporarily unavailable #2277
Conversation
Closing, the error comes directly from etcd and not the apiserver. So it won't be caught like this.
ca469be to ca1b8a4
Hi @JAORMX,
Thank you for your contribution.
@AlexNPavel wdyt?
/retest
@JAORMX: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
This is a fine workaround for now. The cleanup function logic needs to be cleaned up and edge cases like this codified in a follow-up.
/lgtm
PTAL @jmccormick2001 @hasbro17
Just one question about which errors should be retried. Seems like a good change otherwise, thanks!
Just address @joelanford's suggestion (https://github.com/operator-framework/operator-sdk/pull/2277/files#r365028682) and then it looks fine to me.
New changes are detected. LGTM label has been removed.
Thanks for the reviews everyone!
@@ -21,6 +21,7 @@ import (
 	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	"k8s.io/apimachinery/pkg/runtime"
 	"k8s.io/apimachinery/pkg/util/wait"
+	"k8s.io/client-go/util/retry"
 	dynclient "sigs.k8s.io/controller-runtime/pkg/client"
@JAORMX the entry in the CHANGELOG is no longer here :-(
Could you please add it again?
which entry?
An entry describing the change/fix/addition, to let users know what was changed. Wasn't it added before?
See here.
Here is my suggestion:
...
## Changed
- Added retry logic to the cleanup function of the e2e test framework so that it can succeed when the cluster is temporarily unavailable (e.g. transient network issues). ([#2277](https://github.com/operator-framework/operator-sdk/pull/2277))
WDYT?
Sounds good to me! Thanks!
pkg/test/client: retry cleanup function if cluster is temporarily unavailable — There are instances where the cluster could be temporarily unavailable, e.g. when etcd is doing leader re-election. https://search.svc.ci.openshift.org/?search=etcdserver%3A+leader+changed&maxAge=336h&context=2&type=all While this is not a common scenario, it would be good to be lenient on the cleanup side of things and retry if such a case happens. This will be reflected as Timeout or Unavailable errors coming from etcd. This patch proposes to retry with backoff if an error happens. After a set number of retries, it will return the appropriate error.
/retest
Description
There are instances where the cluster could be temporarily unavailable.
e.g. when etcd is doing leader re-election.
https://search.svc.ci.openshift.org/?search=etcdserver%3A+leader+changed&maxAge=336h&context=2&type=all
While this is not a common scenario, it would be good to be lenient
on the cleanup side of things and retry if such a case happens.
This will be reflected as Timeout or Unavailable errors coming from etcd.
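For illustration only, a minimal sketch of how such Timeout/Unavailable errors could be classified using the apierrors helpers from k8s.io/apimachinery; the package and function names are hypothetical, the PR's actual predicate may differ, and, as the closing comment above points out, errors surfaced directly by etcd may not match these apiserver-oriented checks at all:

```go
package e2eutil // hypothetical package name, for illustration only

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isTemporaryOutage is an illustrative predicate (not the PR's actual code).
// It treats apiserver Timeout, ServerTimeout, and ServiceUnavailable status
// errors as signs of a temporary outage that is worth retrying.
func isTemporaryOutage(err error) bool {
	return apierrors.IsTimeout(err) ||
		apierrors.IsServerTimeout(err) ||
		apierrors.IsServiceUnavailable(err)
}
```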
Motivation
When using the e2e test framework to test an operator, if there are network issues, the cleanup functions might fail. This would present itself with the following logs:
Note that this would be a fairly random and transient error. That is why I chose to add the retry to the Cleanup functions: there are several of them (one per object tracked by the framework), so there is a higher probability of hitting this.
Solution
This patch proposes to retry with backoff if an error happens. After a set number of retries, it'll return the appropriate error. This wouldn't retry forever, only until the timeout hits.
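The final diff is not reproduced here, but a minimal sketch of the retry-with-backoff approach, using the newly imported k8s.io/client-go/util/retry package, might look like the following. The cleanupWithRetry wrapper and the backoff values are illustrative assumptions, not the PR's actual implementation:

```go
package e2eutil // hypothetical package name, for illustration only

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// cleanupWithRetry wraps a cleanup callback in a bounded exponential
// backoff so a transient cluster outage does not immediately fail the
// test run. Once the steps are exhausted, retry.OnError returns the
// last error unchanged.
func cleanupWithRetry(cleanup func() error, isRetriable func(error) bool) error {
	backoff := wait.Backoff{
		Steps:    5,                      // bounded: never retries forever
		Duration: 500 * time.Millisecond, // initial delay between attempts
		Factor:   2.0,                    // exponential growth of the delay
		Jitter:   0.1,                    // small randomization between attempts
	}
	return retry.OnError(backoff, isRetriable, cleanup)
}
```

A predicate like the isTemporaryOutage sketch under Description could be passed as isRetriable, with the framework's per-object cleanup callbacks passed as cleanup.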