Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race conditions for Azure loadbalancer and route updates #77490

Merged
merged 2 commits into from May 9, 2019

Conversation

@feiskyer
Copy link
Member

commented May 6, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

/kind bug
/sig azure
/priority important-soon

What this PR does / why we need it:

Azure cloud provider configures routeTable and loadBalancer by invoking ARM APIs. But on cluster initialization stage, it may update them even when the create operation from provisioning tools doesn't finish. Because the later PUT would cancel a previous ongoing one, this would introduce troubles for provisioning tools.

This PR fixes race conditions for them by adding ETAG in PUT requests, hence cloud provider won't override the concurrent updates from other components.

Which issue(s) this PR fixes:

Fixes #75582

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix race conditions for Azure loadbalancer and route updates.
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented May 6, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feiskyer

This comment has been minimized.

Copy link
Member Author

commented May 6, 2019

Doing more validations for race conditions.

/hold
/test pull-kubernetes-e2e-aks-engine-azure

@feiskyer feiskyer force-pushed the feiskyer:azure-lb-route-race branch from 9ce780d to 7ca1c83 May 7, 2019

@feiskyer

This comment has been minimized.

Copy link
Member Author

commented May 8, 2019

/hold cancel

Ready for review now. @andyzhangx PTAL

@feiskyer feiskyer added this to In progress in Provider Azure via automation May 8, 2019

@feiskyer

This comment has been minimized.

Copy link
Member Author

commented May 8, 2019

/test pull-kubernetes-e2e-aks-engine-azure

@feiskyer

This comment has been minimized.

Copy link
Member Author

commented May 8, 2019

@feiskyer

This comment has been minimized.

Copy link
Member Author

commented May 8, 2019

/test pull-kubernetes-e2e-aks-engine-azure

az.rtCache.Delete(*routeTable.Name)
return true, err
}
return done, retryError

This comment has been minimized.

Copy link
@andyzhangx

andyzhangx May 9, 2019

Member

if it's <false, error> , then why keep the cache? what about always clean the cache?

This comment has been minimized.

Copy link
@feiskyer

feiskyer May 9, 2019

Author Member

for those cases, cloud provider should backoff retry with exactly same data. re-get the data from ARM API is not needed, as the ETAG is not conflicted.

az.rtCache.Delete(az.RouteTableName)
return true, err
}
return done, retryError

This comment has been minimized.

Copy link
@andyzhangx

andyzhangx May 9, 2019

Member

what about always clean the cache for error condition?

This comment has been minimized.

Copy link
@feiskyer

feiskyer May 9, 2019

Author Member

same with above, re-get the data is not needed because etag is still same.

@andyzhangx
Copy link
Member

left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label May 9, 2019

@k8s-ci-robot k8s-ci-robot merged commit b9bde60 into kubernetes:master May 9, 2019

21 checks passed

cla/linuxfoundation feiskyer authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-aks-engine-azure Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

Provider Azure automation moved this from In progress to Done May 9, 2019

@feiskyer feiskyer deleted the feiskyer:azure-lb-route-race branch May 9, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.