Reduce Azure API calls by replacing the current backoff retry with SDK's backoff #70866

Merged

merged 3 commits into kubernetes:master from feiskyer:backoff on Dec 21, 2018

Conversation

@feiskyer
Member

feiskyer commented Nov 9, 2018

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

When something unexpected happens (e.g. hitting Azure API throttling), the Azure cloud provider retries with a configurable exponential backoff. While this seems to work for handling various unexpected errors, it actually sends too many requests to the Azure API.

The reason is that the Azure Go SDK also includes an internal backoff, e.g. here and here. The SDK retries 3 attempts for compute APIs and 5 attempts for storage APIs when any of the following HTTP responses is received:

  • http.StatusRequestTimeout, // 408

  • http.StatusTooManyRequests, // 429

  • http.StatusInternalServerError, // 500

  • http.StatusBadGateway, // 502

  • http.StatusServiceUnavailable, // 503

  • http.StatusGatewayTimeout, // 504

That means every failed AttachDisk operation would invoke (1+6)*3 = 21 API calls (assuming cloudProviderBackoffRetries is set to 6 in the cloud config).
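The multiplication above can be sketched in Go. This is an illustrative calculation only, not code from the provider; the function name is hypothetical:

```go
package main

import "fmt"

// totalAPICalls estimates the worst-case number of HTTP requests when both
// backoff layers are active: the cloud provider issues (1 + providerRetries)
// attempts, and the SDK expands each attempt into sdkAttempts requests on
// retriable status codes.
func totalAPICalls(providerRetries, sdkAttempts int) int {
	return (1 + providerRetries) * sdkAttempts
}

func main() {
	// cloudProviderBackoffRetries = 6, SDK attempts for compute APIs = 3
	fmt.Println(totalAPICalls(6, 3)) // 21
}
```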

We should avoid these two-layer backoff retries by either:

  • (1) disabling the SDK's backoff and keeping only the cloud provider's backoff retry, or
  • (2) removing the cloud provider's backoff and using only the SDK's backoff instead

According to Throttling Resource Manager requests, when the request limits are reached, clients should honor the Retry-After value in the response header:

When you reach the request limit, Resource Manager returns the 429 HTTP status code and a Retry-After value in the header. The Retry-After value specifies the number of seconds your application should wait (or sleep) before sending the next request. If you send a request before the retry value has elapsed, your request isn't processed and a new retry value is returned.

Since this header is already handled within the SDK's backoff retries, (2) is our preferred way to reduce the API calls.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #70865

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

New Azure cloud provider option 'cloudProviderBackoffMode' has been added to reduce Azure API retries. Candidate values are:

* default (or empty string): keeps the same behavior as before.
* v2: only the Azure SDK's backoff retry is used, with a fixed exponent of 2.
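For illustration, the new mode would be enabled in the Azure cloud provider config file alongside the existing backoff settings; the exact surrounding fields shown here are examples, only cloudProviderBackoffMode is the option added by this PR:

```json
{
  "cloudProviderBackoff": true,
  "cloudProviderBackoffMode": "v2",
  "cloudProviderBackoffRetries": 6
}
```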

/sig azure
/cc @khenidak @andyzhangx

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 9, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feiskyer

Member

feiskyer commented Nov 9, 2018

Since this change would take effect on all Azure API calls, we need to run enough tests before merging it. I'm setting up those e2e test flows in the cloud-provider-azure repo.

Meanwhile, any comments or suggestions are welcome.

/cc @brendandburns @khenidak @andyzhangx

/hold

@feiskyer

Member

feiskyer commented Nov 12, 2018

/retest

@brendandburns

Contributor

brendandburns commented Nov 13, 2018

I'm a little bit worried about taking out the manual configuration knobs and deferring entirely to the SDK.

I think that we should both maintain the existing behavior, and implement the new one, and switch between them based on a flag.

Since this is a big change, we need to maintain backward/rollback as an option for clusters without reverting the Kubernetes version.

@feiskyer

Member

feiskyer commented Nov 14, 2018

I think that we should both maintain the existing behavior, and implement the new one, and switch between them based on a flag.
Since this is a big change, we need to maintain backward/rollback as an option for clusters without reverting the Kubernetes version.

Reasonable. Since there are two layers of retrying today, we should provide a way to disable either SDK's retry or cloud provider's retry.

@feiskyer feiskyer force-pushed the feiskyer:backoff branch from 98e9279 to b7cc902 Nov 19, 2018

@feiskyer

Member

feiskyer commented Nov 19, 2018

New Azure cloud provider option 'cloudProviderBackoffMode' has been added to reduce Azure API retries. Candidate values are:

  • default (or empty string): keeps the same behavior as before.
  • minimum: only the Azure SDK's backoff retry is used, with a fixed exponent of 2.

@brendandburns @khenidak @andyzhangx PTAL

@@ -59,6 +59,9 @@ const (
vmTypeVMSS = "vmss"
vmTypeStandard = "standard"

backoffModeDefault = "default"
backoffModeMinimum = "minimum"


@andyzhangx

andyzhangx Nov 19, 2018

Member

maybe v2 is better.


@feiskyer

feiskyer Nov 19, 2018

Member

@brendandburns @khenidak What do you think about the new backoff mode name? I don't think "minimum" is good enough. What's your suggestion?

// * default means two-layer backoff retrying, one in the cloud provider and the other in the Azure SDK.
// * minimum means only backoff in the Azure SDK is used. In such mode, CloudProviderBackoffDuration and
// CloudProviderBackoffJitter are omitted.
// "default" will be used if not specified.


@andyzhangx

andyzhangx Nov 19, 2018

Member

I would suggest using the SDK retry flavor by default. We have 3 months to test this retry change, and if we don't change the original retry mechanism, I doubt whether AKS or acs-engine will ever use the SDK retry flavor; the new retry mechanism will never be used.


@feiskyer

feiskyer Nov 19, 2018

Member

@brendandburns @khenidak What do you think about the default backoff mode?


@khenidak

khenidak Dec 17, 2018

Contributor

I don't think we should maintain two modes. Do you see a usecase for maintaining the old mode?


@feiskyer

feiskyer Dec 18, 2018

Member

per @brendandburns suggestion here.

I'm a little bit worried about taking out the manual configuration knobs and deferring entirely to the SDK.

I think that we should both maintain the existing behavior, and implement the new one, and switch between them based on a flag.

Since this is a big change, we need to maintain backward/rollback as an option for clusters without reverting the Kubernetes version.

Resolved review threads (outdated) on pkg/cloudprovider/providers/azure/azure_backoff.go

@feiskyer feiskyer force-pushed the feiskyer:backoff branch from b7cc902 to 3ef7ef8 Dec 4, 2018

@feiskyer

Member

feiskyer commented Dec 4, 2018

/hold cancel

@feiskyer

Member

feiskyer commented Dec 4, 2018

/retest

@khenidak
Copy link
Contributor

khenidak left a comment

Just one comment really. I think we are taking on a lot of tech debt by maintaining two modes of retry behavior. From the user's standpoint the cluster will behave the same. We should switch to the new mode, remove all the retry wrappers around the SDK, and let the SDK do the work.

We can also emit a warning message if we find that the user is still supplying values for the old backoff style.

// * default means two-layer backoff retrying, one in the cloud provider and the other in the Azure SDK.
// * minimum means only backoff in the Azure SDK is used. In such mode, CloudProviderBackoffDuration and
// CloudProviderBackoffJitter are omitted.
// "default" will be used if not specified.


@khenidak

khenidak Dec 17, 2018

Contributor

I don't think we should maintain two modes. Do you see a usecase for maintaining the old mode?

@feiskyer

Member

feiskyer commented Dec 20, 2018

/retest

@feiskyer

Member

feiskyer commented Dec 20, 2018

@andyzhangx

Member

andyzhangx commented Dec 20, 2018

LGTM
@khenidak @brendandburns any comments on this?
The plan is to keep the old retry method, and use the SDK retry method (v2) in aks-engine by default.

@khenidak

Contributor

khenidak commented Dec 20, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Dec 20, 2018

@feiskyer

Member

feiskyer commented Dec 21, 2018

/retest

@k8s-ci-robot k8s-ci-robot merged commit e1552d8 into kubernetes:master Dec 21, 2018

19 checks passed

  • cla/linuxfoundation: feiskyer authorized
  • pull-kubernetes-bazel-build: Job succeeded
  • pull-kubernetes-bazel-test: Job succeeded
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded
  • pull-kubernetes-e2e-gce-100-performance: Job succeeded
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded
  • pull-kubernetes-e2e-kubeadm-gce: Skipped
  • pull-kubernetes-godeps: Skipped
  • pull-kubernetes-integration: Job succeeded
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-local-e2e-containerized: Skipped
  • pull-kubernetes-node-e2e: Job succeeded
  • pull-kubernetes-typecheck: Job succeeded
  • pull-kubernetes-verify: Job succeeded
  • tide: In merge pool

@feiskyer feiskyer deleted the feiskyer:backoff branch Dec 21, 2018
