pkg/test/client: retry cleanup function if cluster is temporarily unavailable #2277

JAORMX · 2019-11-27T18:45:38Z

Description

There are instances where the cluster could be temporarily unavailable,
e.g. when etcd is re-electing its leader.

https://search.svc.ci.openshift.org/?search=etcdserver%3A+leader+changed&maxAge=336h&context=2&type=all

While this is not a very normal scenario, it would be good to be lenient
on the cleanup side of things and retry if such a case happens.

This will be reflected as Timeout or Unavailable errors coming from etcd.

Motivation

When using the e2e test framework to test an operator, if there are network issues, the cleanup functions might fail. This would present itself with the following logs:

client.go:75: resource type Deployment with namespace/name (openshift-compliance/compliance-operator) successfully deleted
client.go:75: resource type ClusterRoleBinding with namespace/name (openshift-compliance/compliance-operator) successfully deleted
context.go:76: A cleanup function failed with error: (rpc error: code = Unavailable desc = etcdserver: leader changed)
client.go:75: resource type RoleBinding with namespace/name (openshift-compliance/compliance-operator) successfully deleted
client.go:75: resource type ClusterRole with namespace/name (openshift-compliance/compliance-operator) successfully deleted

Note that this would be a fairly random and transient error. This is why I chose to add the retry to the cleanup functions: there are several of them (one per object tracked by the framework), so there is a higher probability of hitting this error there.

Solution

This patch proposes to retry with backoff if an error happens. After a set number of retries, it'll return the appropriate error. This wouldn't retry forever, only until the timeout hits.

JAORMX · 2019-11-27T23:59:07Z

Closing, the error comes directly from etcd and not the apiserver. So it won't be caught like this.

camilamacedo86

Hi @JAORMX,

Thank you for your contribution.

camilamacedo86 · 2019-12-04T12:23:13Z

@AlexNPavel wdyt?

JAORMX · 2019-12-04T16:14:55Z

/retest

openshift-ci-robot · 2019-12-04T16:15:23Z

@JAORMX: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jmccormick2001 · 2019-12-10T16:23:53Z

/ok-to-test

estroz

This is a fine workaround for now. The cleanup function logic needs to be cleaned up and edge cases like this codified in a follow-up.

/lgtm

PTAL @jmccormick2001 @hasbro17

joelanford

Just one question about what errors to consider to retry. Seems like a good change otherwise! Thanks!

pkg/test/client.go

camilamacedo86

Just address @joelanford's suggestion (https://github.com/operator-framework/operator-sdk/pull/2277/files#r365028682) and then it looks fine to me.

openshift-ci-robot · 2020-01-17T06:27:21Z

New changes are detected. LGTM label has been removed.

JAORMX · 2020-01-17T06:27:40Z

Thanks for the reviews everyone!

camilamacedo86 · 2020-01-17T09:01:36Z

pkg/test/client.go

@@ -21,6 +21,7 @@ import (
 	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	"k8s.io/apimachinery/pkg/runtime"
 	"k8s.io/apimachinery/pkg/util/wait"
+	"k8s.io/client-go/util/retry"
 	dynclient "sigs.k8s.io/controller-runtime/pkg/client"


@JAORMX the entry in CHANGELOG is no longer here :-(
Could you please add it again?

which entry?

An entry describing the change/fix/addition, to let users know what was changed. Wasn't it added before?
See here
Following my suggestion.

...
## Changed
- Added retry logic to the cleanup function from the e2e test framework in order to allow it to be achieved in the scenarios where temporary network issues are faced. ([#2277](https://github.com/operator-framework/operator-sdk/pull/2277))

WDYT?

Sounds good to me! Thanks!

JAORMX · 2020-01-20T07:41:26Z

/retest

openshift-ci-robot requested review from camilamacedo86 and shawn-hurley November 27, 2019 18:45

openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 27, 2019

JAORMX mentioned this pull request Nov 27, 2019

e2e tests: Be resilient to temporary unavailability of k8s openshift/compliance-operator#28

Merged

JAORMX closed this Nov 27, 2019

JAORMX reopened this Nov 28, 2019

JAORMX force-pushed the retry-cleanup branch from ca469be to ca1b8a4 Compare November 28, 2019 08:59

camilamacedo86 suggested changes Dec 4, 2019

camilamacedo86 closed this Dec 4, 2019

camilamacedo86 reopened this Dec 4, 2019

camilamacedo86 requested review from camilamacedo86, hasbro17, AlexNPavel, joelanford and estroz December 4, 2019 12:10

camilamacedo86 added the test-framework label Dec 4, 2019

JAORMX closed this Dec 4, 2019

JAORMX reopened this Dec 4, 2019

openshift-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Dec 10, 2019

estroz approved these changes Dec 11, 2019

openshift-ci-robot assigned estroz Dec 11, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 11, 2019

joelanford reviewed Jan 10, 2020

camilamacedo86 approved these changes Jan 16, 2020

openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jan 17, 2020

camilamacedo86 reviewed Jan 17, 2020

JAORMX force-pushed the retry-cleanup branch from 360b367 to 0020ed0 Compare January 17, 2020 09:43

JAORMX force-pushed the retry-cleanup branch from 0020ed0 to 4299539 Compare January 17, 2020 10:53

openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 17, 2020

JAORMX closed this Jan 20, 2020

JAORMX reopened this Jan 20, 2020

JAORMX requested a review from joelanford January 20, 2020 09:59

jmccormick2001 merged commit 6cf4306 into operator-framework:master Jan 20, 2020
