Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix azure retry issue when return 2XX with error #78298

Merged
merged 1 commit into from May 28, 2019

Conversation

@andyzhangx
Copy link
Member

commented May 24, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
With this PR, when azure API call return <200, error>, it would regard as error and continue retry.
As we could find in error case, and also talked with VMSS team, return 2XX does not mean the operation is successful(doc: https://github.com/Azure/azure-resource-manager-rpc/blob/master/v1.0/Addendum.md#creating-or-updating-resources), there could also be error condition, e.g.

httpStatusCode 200
resultCode NetworkingInternalOperationError

Which issue(s) this PR fixes:

Fixes #78172

Special notes for your reviewer:
Note: this code change will affect all error retry including following places:

./azure_backoff.go:             done, retryError := az.processHTTPRetryResponse(service, "CreateOrUpdateSecurityGroup", resp, err)
./azure_backoff.go:             done, retryError := az.processHTTPRetryResponse(service, "CreateOrUpdateLoadBalancer", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(service, "CreateOrUpdatePublicIPAddress", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(service, "CreateOrUpdateInterface", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(service, "DeletePublicIPAddress", resp, err)
./azure_backoff.go:             done, err := az.processHTTPRetryResponse(service, "DeleteLoadBalancer", resp, err)
./azure_backoff.go:             done, retryError := az.processHTTPRetryResponse(nil, "", resp, err)
./azure_backoff.go:             done, retryError := az.processHTTPRetryResponse(nil, "", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(nil, "", resp, err)
./azure_backoff.go:             return az.processHTTPRetryResponse(nil, "", resp, err)
./azure_backoff.go:// processHTTPRetryResponse : return true means stop retry, false means continue retry
./azure_backoff.go:func (az *Cloud) processHTTPRetryResponse(service *v1.Service, reason string, resp *http.Response, err error) (bool, error) {
./azure_backoff.go:                     klog.Errorf("processHTTPRetryResponse: backoff failure, will retry, err=%v", err)
./azure_backoff.go:                     klog.Errorf("processHTTPRetryResponse: backoff failure, will retry, HTTP response=%d", resp.StatusCode)
./azure_backoff.go:             klog.Errorf("processHTTPRetryResponse failure with err: %v", err)
./azure_backoff.go:             klog.Errorf("processHTTPRetryResponse failure with HTTP response %q", resp.Status)
./azure_backoff_test.go:                res, err := az.processHTTPRetryResponse(nil, "", resp, test.err)
./azure_controller_common.go:                   return c.cloud.processHTTPRetryResponse(nil, "", resp, err)

Does this PR introduce a user-facing change?:

fix azure retry issue when return 2XX with error

/hold
Let's hold on for VMSS team do the final confirmation and also code review.

/kind bug
/assign @feiskyer
/priority important-soon
/sig azure

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented May 24, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andyzhangx andyzhangx force-pushed the andyzhangx:azure-retry-issue branch from 95497e0 to 8a45ba1 May 24, 2019

@k8s-ci-robot k8s-ci-robot added size/S and removed size/XS labels May 24, 2019

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented May 24, 2019

BTW, this is a fundamental fix for all error retry issues on all ARM calls by azure cloud provider, though I found this issue on VMSS. We should not regard 200 return code as successful operation since it also may return an error code per https://github.com/Azure/azure-resource-manager-rpc/blob/master/v1.0/Addendum.md#creating-or-updating-resources and also from ARM team.

@khenidak @brendandburns

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented May 24, 2019

/test pull-kubernetes-e2e-gce

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented May 25, 2019

/hold cancel
Got confirmation from azure ARM team

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented May 25, 2019

/test pull-kubernetes-e2e-aks-engine-azure

@feiskyer

This comment has been minimized.

Copy link
Member

commented May 27, 2019

pull-kubernetes-e2e-aks-engine-azure is failing because of Azure/aks-engine#1368.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label May 27, 2019

@feiskyer feiskyer added this to In progress in Provider Azure via automation May 27, 2019

@justaugustus

This comment has been minimized.

Copy link
Member

commented May 27, 2019

/skip pull-kubernetes-e2e-aks-engine-azure

@fejta-bot

This comment has been minimized.

Copy link

commented May 27, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

This comment has been minimized.

Copy link

commented May 27, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot

This comment has been minimized.

Copy link

commented May 28, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

This comment has been minimized.

Copy link

commented May 28, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented May 28, 2019

@andyzhangx: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-aks-engine-azure 8a45ba1 link /test pull-kubernetes-e2e-aks-engine-azure

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot merged commit fcc9f16 into kubernetes:master May 28, 2019

20 of 21 checks passed

pull-kubernetes-e2e-aks-engine-azure Job failed.
Details
cla/linuxfoundation andyzhangx authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

Provider Azure automation moved this from In progress to Done May 28, 2019

k8s-ci-robot added a commit that referenced this pull request May 31, 2019
Merge pull request #78423 from andyzhangx/automated-cherry-pick-of-#7…
…8298-upstream-release-1.13

Automated cherry pick of #78298: fix azure retry issue when return 2XX with error
k8s-ci-robot added a commit that referenced this pull request May 31, 2019
Merge pull request #78422 from andyzhangx/automated-cherry-pick-of-#7…
…8298-upstream-release-1.14

Automated cherry pick of #78298: fix azure retry issue when return 2XX with error
k8s-ci-robot added a commit that referenced this pull request May 31, 2019
Merge pull request #78424 from andyzhangx/automated-cherry-pick-of-#7…
…8298-upstream-release-1.12

Automated cherry pick of #78298: fix azure retry issue when return 2XX with error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.