Azure cloudprovider retry using flowcontrol #46660

jackfrancis · 2017-05-31T00:13:28Z

An initial attempt at engaging exponential backoff for API error responses.

Addresses #47048

Uses k8s.io/client-go/util/flowcontrol; implementation inspired by GCE
cloudprovider backoff.

What this PR does / why we need it:

The existing Azure cloudprovider implementation has no guard rails to adapt to unexpected operational conditions (e.g., clogs in the resource plumbing between the k8s runtime and the cloud API). These changes add exponential backoff around API calls and targeted rate limiting. Both options are configurable via --cloud-config.

The implementation is inspired by GCE's use of k8s.io/client-go/util/flowcontrol and k8s.io/apimachinery/pkg/util/wait: this PR likewise uses flowcontrol for rate limiting, and wait to thinly wrap backoff retry attempts to the API.

Special notes for your reviewer:

Pay special attention to the declaration of retry-able conditions for an unsuccessful HTTP request:

all 4xx and 5xx HTTP responses
non-nil error responses

And the declaration of retry success conditions:

2xx HTTP responses

Tests updated to include additions to Config.

Those may be incomplete, or in other ways non-representative.

Release note:

Added exponential backoff to Azure cloudprovider

k8s-ci-robot · 2017-05-31T00:13:36Z

Hi @jackfrancis. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

jsafrane · 2017-05-31T08:24:32Z

I know very little about azure API behavior.

/unassign
/assign @colemickens

jsafrane · 2017-05-31T08:24:55Z

@k8s-bot ok to test

jdumars · 2017-05-31T13:58:29Z

/cc @brendandburns

brendandburns · 2017-05-31T17:03:49Z

pkg/cloudprovider/providers/azure/azure_backoff.go

@@ -0,0 +1,149 @@
+/*
+Copyright 2016 The Kubernetes Authors.


brendandburns · 2017-05-31T17:05:19Z

pkg/cloudprovider/providers/azure/azure_backoff.go

+	if err != nil {
+		return true
+	}
+	if resp.StatusCode == 429 || resp.StatusCode == 500 {


Should this be 5xx instead of just 500?

Also, please use http.StatusInternalServerError etc. rather than hard-coded constants.

brendandburns · 2017-05-31T17:05:34Z

pkg/cloudprovider/providers/azure/azure_backoff.go

+// isSuccessHTTPResponse determines if the response from an HTTP request suggests success
+func isSuccessHTTPResponse(resp autorest.Response) bool {
+	// TODO determine the complete set of success conditions
+	if resp.StatusCode == 200 || resp.StatusCode == 201 || resp.StatusCode == 202 {


use http.StatusOK etc instead of constants.

brendandburns · 2017-05-31T17:06:58Z

pkg/cloudprovider/providers/azure/azure.go

@@ -177,6 +179,9 @@ func NewCloud(configReader io.Reader) (cloudprovider.Interface, error) {
 	az.StorageAccountClient = storage.NewAccountsClientWithBaseURI(az.Environment.ResourceManagerEndpoint, az.SubscriptionID)
 	az.StorageAccountClient.Authorizer = servicePrincipalToken

+	// 1 qps, up to 5 burst when in flowcontrol; i.e., aggressive backoff enforcement
+	az.operationPollRateLimiter = flowcontrol.NewTokenBucketRateLimiter(1, 5)


You don't ever use this as far as I can tell.

I think we really want this to be used.

Let's make this configurable (and disable-able) via config file.

@brendandburns @colemickens would the --cloud-config --> Config struct be the best configuration vector for this?

brendandburns · 2017-05-31T17:09:07Z

pkg/cloudprovider/providers/azure/azure_backoff.go

+
+// CreateOrUpdateSGWithRetry invokes az.SecurityGroupsClient.CreateOrUpdate with exponential backoff retry
+func (az *Cloud) CreateOrUpdateSGWithRetry(sg network.SecurityGroup) error {
+	return wait.Poll(operationPollInterval, operationPollTimeoutDuration, func() (bool, error) {


How is this exponential backoff? Poll always just waits for Interval

jackfrancis · 2017-05-31T18:54:23Z

@brendandburns thanks for keeping me honest, review notes incorporated

brendandburns · 2017-06-01T20:23:43Z

pkg/cloudprovider/providers/azure/azure_backoff.go

+		return true
+	}
+	// HTTP 5xx suggests we should retry
+	r, err := regexp.Compile(`^5\d\d$`)


Use numeric comparisons

eg

code >= http.StatusInternalServerError

Do this for the 2xx below too.

brendandburns · 2017-06-01T20:23:54Z

pkg/cloudprovider/providers/azure/azure_backoff.go

+// isSuccessHTTPResponse determines if the response from an HTTP request suggests success
+func isSuccessHTTPResponse(resp autorest.Response) bool {
+	// HTTP 2xx suggests a successful response
+	r, err := regexp.Compile(`^2\d\d$`)


brendandburns · 2017-06-01T20:25:43Z

pkg/cloudprovider/providers/azure/azure_loadbalancer.go

+		if shouldRetryAPIRequest(resp, err) {
+			retryErr := az.CreateOrUpdateSGWithRetry(sg)
+			if retryErr != nil {
+				return nil, retryErr


Do we want to return here? Or just set err = retryErr and let the code below handle both err cases?

I think I prefer that approach.

Yeah, your approach is better. The purpose of the novel retryErr reference (as opposed to reusing the existing err reference prior to checking for retry-ability) is to facilitate logging/debugging in the retry execution branch. We can still do that by reusing the err reference, and eliminate an unnecessary return.

brendandburns · 2017-06-01T20:25:53Z

pkg/cloudprovider/providers/azure/azure_loadbalancer.go

+			resp, err := az.SecurityGroupsClient.CreateOrUpdate(az.ResourceGroup, *reconciledSg.Name, reconciledSg, nil)
+			if shouldRetryAPIRequest(resp, err) {
+				retryErr := az.CreateOrUpdateSGWithRetry(reconciledSg)
+				if retryErr != nil {


brendandburns · 2017-06-01T20:26:01Z

pkg/cloudprovider/providers/azure/azure_loadbalancer.go

-				_, err = az.LoadBalancerClient.CreateOrUpdate(az.ResourceGroup, *lb.Name, lb, nil)
+				resp, err := az.LoadBalancerClient.CreateOrUpdate(az.ResourceGroup, *lb.Name, lb, nil)
+				if shouldRetryAPIRequest(resp, err) {
+					retryErr := az.CreateOrUpdateLBWithRetry(lb)


brendandburns · 2017-06-01T20:26:10Z

pkg/cloudprovider/providers/azure/azure_loadbalancer.go

+				resp, err := az.LoadBalancerClient.Delete(az.ResourceGroup, lbName, nil)
+				if shouldRetryAPIRequest(resp, err) {
+					retryErr := az.DeleteLBWithRetry(lbName)
+					if retryErr != nil {


brendandburns · 2017-06-01T20:26:17Z

pkg/cloudprovider/providers/azure/azure_loadbalancer.go

+	resp, err := az.PublicIPAddressesClient.CreateOrUpdate(az.ResourceGroup, *pip.Name, pip, nil)
+	if shouldRetryAPIRequest(resp, err) {
+		retryErr := az.CreateOrUpdatePIPWithRetry(pip)
+		if retryErr != nil {


brendandburns · 2017-06-01T20:26:41Z

pkg/cloudprovider/providers/azure/azure_loadbalancer.go

+		resp, err := az.InterfacesClient.CreateOrUpdate(az.ResourceGroup, *nic.Name, nic, nil)
+		if shouldRetryAPIRequest(resp, err) {
+			retryErr := az.CreateOrUpdateInterfaceWithRetry(nic)
+			if retryErr != nil {


You get the point, I'm going to stop commenting on all of them

I was enjoying how many variations you could come up with. :)

fejta · 2017-06-01T21:30:53Z

@k8s-bot pull-kubernetes-e2e-kops-aws test this
ref: kubernetes/test-infra#2932

brendandburns · 2017-06-02T01:45:48Z

Looks like you need to run ./hack/update-bazel.sh; otherwise, LGTM.

/lgtm
/approve

Thanks!

brendandburns · 2017-06-07T04:48:26Z

@jackfrancis this looks like it has some go vet errors.

You can run go vet k8s.io/kubernetes/pkg/cloudprovider/providers/azure to generate the errors.

jackfrancis · 2017-06-07T05:13:40Z

@brendandburns thx, addressed

jdumars · 2017-06-07T13:34:39Z

/retest

jdumars · 2017-06-07T13:49:56Z

/sig azure

jdumars · 2017-06-07T13:59:25Z

@grodrigues3 @wojtek-t this should have the 1.7 milestone attached

jdumars · 2017-06-07T15:17:12Z

@brendandburns LGTM needed

brendandburns · 2017-06-07T15:21:51Z

Will look soon.

…

--brendan

brendandburns · 2017-06-07T17:18:20Z

/lgtm
/approve

brendandburns · 2017-06-07T17:20:10Z

/approve #47048

brendandburns · 2017-06-07T17:21:18Z

/approve

k8s-github-robot · 2017-06-07T17:21:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brendandburns, jackfrancis

Associated issue: 47048

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~pkg/cloudprovider/providers/azure/OWNERS~~ [brendandburns]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

k8s-github-robot · 2017-06-07T18:05:06Z

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

yangl900

Looks good, just a few comments.

yangl900 · 2017-06-07T19:14:38Z

pkg/cloudprovider/providers/azure/azure_backoff.go

+		return true
+	}
+	// HTTP 4xx or 5xx suggests we should retry
+	if 399 < resp.StatusCode && resp.StatusCode < 600 {


400 is not a retry-able error in Azure. I think we should only retry when status code == 429 or status code > 500.

@brendandburns I'm happy to incorporate this feedback if you're willing to go through the lgtm approval obstacle course all over again. @yangl900 the other two remarks I believe we can justify tackling later on as part of a more holistic effort to pair k8s cloudprovider code with specific API idioms
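
For comparison, the narrower rule suggested here could look like the sketch below. It follows the comment literally, so 500 itself is not retried; whether that was intended is not stated in the thread. Keeping the non-nil error case is an assumption carried over from the PR description, and the name shouldRetryNarrow is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetryNarrow retries on any non-nil error, on 429, and on status
// codes strictly greater than 500, as the review comment is worded.
func shouldRetryNarrow(statusCode int, err error) bool {
	if err != nil {
		return true
	}
	return statusCode == http.StatusTooManyRequests || statusCode > http.StatusInternalServerError
}

func main() {
	for _, code := range []int{400, 429, 500, 503} {
		fmt.Println(code, shouldRetryNarrow(code, nil))
	}
}
```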

yangl900 · 2017-06-07T19:24:04Z

pkg/cloudprovider/providers/azure/azure.go

+const (
+	// CloudProviderName is the value used for the --cloud-provider flag
+	CloudProviderName      = "azure"
+	rateLimitQPSDefault    = 1.0


The read quota is much higher than the write quota. I don't know if you want to differentiate between them in this change; doing so would potentially give you a higher quota. We can improve this later too.

yangl900 · 2017-06-07T19:25:06Z

pkg/cloudprovider/providers/azure/azure_backoff.go

+		var retryErr error
+		machine, exists, retryErr = az.getVirtualMachine(name)
+		if retryErr != nil {
+			glog.Errorf("backoff: failure, will retry,err=%v", retryErr)


there is a Retry-After header returned from 429 requests, probably helpful if we trace that.

k8s-ci-robot · 2017-06-07T19:43:42Z

@jackfrancis: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-gce-etcd3 | `acb6517` | link | `@k8s-bot pull-kubernetes-e2e-gce-etcd3 test this` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-github-robot · 2017-06-07T20:30:56Z

Automatic merge from submit-queue (batch tested with PRs 43005, 46660, 46385, 46991, 47103)

brendandburns · 2017-06-08T03:47:08Z

@jackfrancis can you address @yangl900's comments in a follow on PR?

Thanks!

…60-upstream-release-1.6 Automatic merge from submit-queue Automated cherry pick of #46660 Cherry pick of #46660 on release-1.6. #46660: Azure cloudprovider retry using flowcontrol

Azure cloudprovider retry using flowcontrol

f200f9a

An initial attempt at engaging exponential backoff for API error responses. Uses k8s.io/client-go/util/flowcontrol; implementation inspired by GCE cloudprovider backoff.

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 31, 2017

k8s-github-robot assigned jsafrane and vmarmol May 31, 2017

k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels May 31, 2017

k8s-ci-robot assigned colemickens and unassigned jsafrane May 31, 2017

k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 31, 2017

k8s-github-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels May 31, 2017

k8s-ci-robot requested a review from brendandburns May 31, 2017 13:58

brendandburns suggested changes May 31, 2017

errata, wait.ExponentialBackoff, regex HTTP codes

c6c6cc7

- corrected Copyright copy/paste
- now actually implementing exponential backoff instead of regular interval retries
- using more general HTTP response code success/failure determination (e.g., 5xx for retry)
- net/http constants ftw

errata

c95af06

arg cruft in CreateOrUpdateSGWithRetry function declaration

brendandburns suggested changes Jun 1, 2017

two optimizations

17f8dc5

- removed unnecessary return statements
- optimized HTTP response code evaluations as numeric comparisons

k8s-ci-robot assigned brendandburns Jun 2, 2017

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 2, 2017

k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2017

calebamiles modified the milestone: v1.7 Jun 2, 2017

k8s-github-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 6, 2017

jackfrancis added 2 commits June 6, 2017 11:19

rate limiting on all azure sdk GET requests

ac931aa

az.getVirtualMachine already rate-limited

148e923

we don’t need to rate limit the calls _to_ it

jackfrancis added 2 commits June 6, 2017 22:09

rate limiting everywhere

6d73a09

not waiting to rate limit until we get an error response from the API, doing so on initial request for all API requests

go vet errata

2accbbd

preferring float32 for rate limit QPS param

acb6517

k8s-ci-robot added the sig/azure label Jun 7, 2017

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 7, 2017

k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2017

yangl900 reviewed Jun 7, 2017

k8s-github-robot merged commit 3adb9b4 into kubernetes:master Jun 7, 2017

seanknox mentioned this pull request Jun 9, 2017

Automated cherry pick of #46660 #47278

Merged

jackfrancis mentioned this pull request Jul 6, 2017

Enable cloudprovider rate limit / backoff features Azure/acs-engine#892

Merged

jackfrancis mentioned this pull request Apr 12, 2021

REQUEST: New membership for jackfrancis kubernetes/org#2632

Closed

6 tasks
