New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure cloudprovider retry using flowcontrol #46660

Merged
merged 13 commits into from Jun 7, 2017

Conversation

@jackfrancis
Contributor

jackfrancis commented May 31, 2017

An initial attempt at engaging exponential backoff for API error responses.

Addresses #47048

Uses k8s.io/client-go/util/flowcontrol; implementation inspired by GCE
cloudprovider backoff.

What this PR does / why we need it:

The existing azure cloudprovider implementation has no guard rails in place to adapt to unexpected underlying operational conditions (i.e., clogs in resource plumbing between k8s runtime and the cloud API). The purpose of these changes is to support exponential backoff wrapping around API calls; and to support targeted rate limiting. Both of these options are configurable via --cloud-config.

Implementation inspired by the GCE's use of k8s.io/client-go/util/flowcontrol and k8s.io/apimachinery/pkg/util/wait, this PR likewise uses flowcontrol for rate limiting; and wait to thinly wrap backoff retry attempts to the API.

Special notes for your reviewer:

Pay especial note to the declaration of retry-able conditions from an unsuccessful HTTP request:

  • all 4xx and 5xx HTTP responses
  • non-nil error responses

And the declaration of retry success conditions:

  • 2xx HTTP responses

Tests updated to include additions to Config.

Those may be incomplete, or in other ways non-representative.

Release note:

Added exponential backoff to Azure cloudprovider
Azure cloudprovider retry using flowcontrol
An initial attempt at engaging exponential backoff for API error responses.

Uses k8s.io/client-go/util/flowcontrol; implementation inspired by GCE
cloudprovider backoff.
@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented May 31, 2017

Hi @jackfrancis. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jsafrane

This comment has been minimized.

Member

jsafrane commented May 31, 2017

I know very little about azure API behavior.

/unassign
/assign @colemickens

@k8s-ci-robot k8s-ci-robot assigned colemickens and unassigned jsafrane May 31, 2017

@jsafrane

This comment has been minimized.

Member

jsafrane commented May 31, 2017

@k8s-bot ok to test

@jdumars

This comment has been minimized.

Member

jdumars commented May 31, 2017

@k8s-ci-robot k8s-ci-robot requested a review from brendandburns May 31, 2017

@@ -0,0 +1,149 @@
/*
Copyright 2016 The Kubernetes Authors.

This comment has been minimized.

@brendandburns

brendandburns May 31, 2017

Contributor

nit: 2017

if err != nil {
return true
}
if resp.StatusCode == 429 || resp.StatusCode == 500 {

This comment has been minimized.

@brendandburns

brendandburns May 31, 2017

Contributor

Should this be 5xx instead of just 500?

Also, please use http.StatusServerError etc. rather than hard constants.

// isSuccessHTTPResponse determines if the response from an HTTP request suggests success
func isSuccessHTTPResponse(resp autorest.Response) bool {
// TODO determine the complete set of success conditions
if resp.StatusCode == 200 || resp.StatusCode == 201 || resp.StatusCode == 202 {

This comment has been minimized.

@brendandburns

brendandburns May 31, 2017

Contributor

use http.StatusOK etc instead of constants.

@@ -177,6 +179,9 @@ func NewCloud(configReader io.Reader) (cloudprovider.Interface, error) {
az.StorageAccountClient = storage.NewAccountsClientWithBaseURI(az.Environment.ResourceManagerEndpoint, az.SubscriptionID)
az.StorageAccountClient.Authorizer = servicePrincipalToken
// 1 qps, up to 5 burst when in flowcontrol; i.e., aggressive backoff enforcement
az.operationPollRateLimiter = flowcontrol.NewTokenBucketRateLimiter(1, 5)

This comment has been minimized.

@brendandburns

brendandburns May 31, 2017

Contributor

You don't ever use this as far as I can tell.

I think we really want this to be used.

This comment has been minimized.

@brendandburns

brendandburns Jun 1, 2017

Contributor

Let's make this configurable (and disable-able) via config file.

This comment has been minimized.

@jackfrancis

jackfrancis Jun 1, 2017

Contributor

@brendandburns @colemickens would the --cloud-config --> Config struct be the best configuration vector for this?

This comment has been minimized.

@brendandburns

brendandburns Jun 5, 2017

Contributor

yes.

// CreateOrUpdateSGWithRetry invokes az.SecurityGroupsClient.CreateOrUpdate with exponential backoff retry
func (az *Cloud) CreateOrUpdateSGWithRetry(sg network.SecurityGroup) error {
return wait.Poll(operationPollInterval, operationPollTimeoutDuration, func() (bool, error) {

This comment has been minimized.

@brendandburns

brendandburns May 31, 2017

Contributor

How is this exponential backoff? Poll always just waits for Interval

errata, wait.ExponentialBackoff, regex HTTP codes
- corrected Copyright copy/paste
- now actually implementing exponential backoff instead of regular interval retries
- using more general HTTP response code success/failure determination (e.g., 5xx for retry)
- net/http constants ftw
@jackfrancis

This comment has been minimized.

Contributor

jackfrancis commented May 31, 2017

@brendandburns thanks for keeping me honest, review notes incorporated

errata
arg cruft in CreateOrUpdateSGWithRetry function declaration
return true
}
// HTTP 5xx suggests we should retry
r, err := regexp.Compile(`^5\d\d$`)

This comment has been minimized.

@brendandburns

brendandburns Jun 1, 2017

Contributor

Use numeric comparisons

eg

code >= http.StatusInternalError

Do this for the 2xx below too.

// isSuccessHTTPResponse determines if the response from an HTTP request suggests success
func isSuccessHTTPResponse(resp autorest.Response) bool {
// HTTP 2xx suggests a successful response
r, err := regexp.Compile(`^2\d\d$`)

This comment has been minimized.

@brendandburns

brendandburns Jun 1, 2017

Contributor

As above

if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdateSGWithRetry(sg)
if retryErr != nil {
return nil, retryErr

This comment has been minimized.

@brendandburns

brendandburns Jun 1, 2017

Contributor

Do we want to return here? Or just set err = retryErr and let the code below handle both err cases?

I think I prefer that approach.

This comment has been minimized.

@jackfrancis

jackfrancis Jun 1, 2017

Contributor

Yeah, your approach is better. The purpose of the novel retryErr reference (as opposed to reusing the existing err reference prior to checking for retry-ability is to be able to facilitate logging/debug in the retry execution branch. We can still do that by reusing the err reference and eliminate an unnecessary return.

resp, err := az.SecurityGroupsClient.CreateOrUpdate(az.ResourceGroup, *reconciledSg.Name, reconciledSg, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdateSGWithRetry(reconciledSg)
if retryErr != nil {

This comment has been minimized.

@brendandburns

brendandburns Jun 1, 2017

Contributor

As above

_, err = az.LoadBalancerClient.CreateOrUpdate(az.ResourceGroup, *lb.Name, lb, nil)
resp, err := az.LoadBalancerClient.CreateOrUpdate(az.ResourceGroup, *lb.Name, lb, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdateLBWithRetry(lb)

This comment has been minimized.

@brendandburns

brendandburns Jun 1, 2017

Contributor

As above

resp, err := az.LoadBalancerClient.Delete(az.ResourceGroup, lbName, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.DeleteLBWithRetry(lbName)
if retryErr != nil {

This comment has been minimized.

@brendandburns

brendandburns Jun 1, 2017

Contributor

Here too

resp, err := az.PublicIPAddressesClient.CreateOrUpdate(az.ResourceGroup, *pip.Name, pip, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdatePIPWithRetry(pip)
if retryErr != nil {

This comment has been minimized.

@brendandburns

brendandburns Jun 1, 2017

Contributor

And here

resp, err := az.InterfacesClient.CreateOrUpdate(az.ResourceGroup, *nic.Name, nic, nil)
if shouldRetryAPIRequest(resp, err) {
retryErr := az.CreateOrUpdateInterfaceWithRetry(nic)
if retryErr != nil {

This comment has been minimized.

@brendandburns

brendandburns Jun 1, 2017

Contributor

You get the point, I'm going to stop commenting on all of them

This comment has been minimized.

@jackfrancis

jackfrancis Jun 1, 2017

Contributor

I was enjoying how many variations you could come up with. :)

two optimizations
- removed unnecessary return statements
- optimized HTTP response code evaluations as numeric comparisons
@fejta

This comment has been minimized.

Contributor

fejta commented Jun 1, 2017

@k8s-bot pull-kubernetes-e2e-kops-aws test this
ref: kubernetes/test-infra#2932

@brendandburns

This comment has been minimized.

Contributor

brendandburns commented Jun 2, 2017

Looks like you need to run ./hack/update-bazel.sh otherwise, LGTM.

/lgtm
/approve

Thanks!

jackfrancis added some commits Jun 6, 2017

az.getVirtualMachine already rate-limited
we don’t need to rate limit the calls _to_ it
@brendandburns

This comment has been minimized.

Contributor

brendandburns commented Jun 7, 2017

@jackfrancis this looks like it has some go vet errors.

You can run go vet k8s.io/kubernetes/pkg/cloudprovider/providers/azure to generate the errors.

jackfrancis added some commits Jun 7, 2017

rate limiting everywhere
not waiting to rate limit until we get an error response from the API, doing so on initial request for all API requests
@jackfrancis

This comment has been minimized.

Contributor

jackfrancis commented Jun 7, 2017

@brendandburns thx, addressed

@jdumars

This comment has been minimized.

Member

jdumars commented Jun 7, 2017

/retest

@jdumars

This comment has been minimized.

Member

jdumars commented Jun 7, 2017

/sig azure

@jdumars

This comment has been minimized.

Member

jdumars commented Jun 7, 2017

@grodrigues3 @wojtek-t this should have the 1.7 milestone attached

@jdumars

This comment has been minimized.

Member

jdumars commented Jun 7, 2017

@brendandburns LGTM needed

@brendandburns

This comment has been minimized.

Contributor

brendandburns commented Jun 7, 2017

@brendandburns

This comment has been minimized.

Contributor

brendandburns commented Jun 7, 2017

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 7, 2017

@brendandburns

This comment has been minimized.

Contributor

brendandburns commented Jun 7, 2017

/approve #47048

@brendandburns

This comment has been minimized.

Contributor

brendandburns commented Jun 7, 2017

/approve

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Jun 7, 2017

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brendandburns, jackfrancis

Associated issue: 47048

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Jun 7, 2017

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@yangl900

Looks good, just a few comments.

return true
}
// HTTP 4xx or 5xx suggests we should retry
if 399 < resp.StatusCode && resp.StatusCode < 600 {

This comment has been minimized.

@yangl900

yangl900 Jun 7, 2017

400 is not a retry-able error in Azure, I think should only retry status code == 429 or status code > 500

This comment has been minimized.

@jackfrancis

jackfrancis Jun 7, 2017

Contributor

@brendandburns I'm happy to incorporate this feedback if you're willing to go through the lgtm approval obstacle course all over agin. @yangl900 the other two remarks I believe we can justify tackling later on as part of a more holistic effort to pair k8s cloudprovider code with specific API idioms

const (
// CloudProviderName is the value used for the --cloud-provider flag
CloudProviderName = "azure"
rateLimitQPSDefault = 1.0

This comment has been minimized.

@yangl900

yangl900 Jun 7, 2017

read quota is much higher than write, I don't know if you want to differentiate in this change. that potentially give you higher quota. we can improve this later too.

var retryErr error
machine, exists, retryErr = az.getVirtualMachine(name)
if retryErr != nil {
glog.Errorf("backoff: failure, will retry,err=%v", retryErr)

This comment has been minimized.

@yangl900

yangl900 Jun 7, 2017

there is a Retry-After header returned from 429 requests, probably helpful if we trace that.

@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented Jun 7, 2017

@jackfrancis: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-etcd3 acb6517 link @k8s-bot pull-kubernetes-e2e-gce-etcd3 test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Jun 7, 2017

Automatic merge from submit-queue (batch tested with PRs 43005, 46660, 46385, 46991, 47103)

@k8s-merge-robot k8s-merge-robot merged commit 3adb9b4 into kubernetes:master Jun 7, 2017

7 of 9 checks passed

pull-kubernetes-e2e-gce-etcd3 Jenkins job failed.
Details
Submit Queue Required Github CI test is not green: pull-kubernetes-e2e-gce-etcd3
Details
cla/linuxfoundation jackfrancis authorized
Details
pull-kubernetes-bazel Job succeeded.
Details
pull-kubernetes-e2e-kops-aws Jenkins job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce Jenkins job succeeded.
Details
pull-kubernetes-node-e2e Jenkins job succeeded.
Details
pull-kubernetes-unit Jenkins job succeeded.
Details
pull-kubernetes-verify Jenkins job succeeded.
Details
@brendandburns

This comment has been minimized.

Contributor

brendandburns commented Jun 8, 2017

@jackfrancis can you address @yangl900's comments in a follow on PR?

Thanks!

k8s-merge-robot added a commit that referenced this pull request Jun 12, 2017

Merge pull request #47278 from seanknox/automated-cherry-pick-of-#46660
…-upstream-release-1.6

Automatic merge from submit-queue

Automated cherry pick of #46660

Cherry pick of #46660 on release-1.6.

#46660: Azure cloudprovider retry using flowcontrol
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment