Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure Azure load balancer cleaned up on 404 or 403 #75256

Merged
merged 1 commit into from
Mar 12, 2019

Conversation

feiskyer
Copy link
Member

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

When deleting LoadBalancer services, Azure may return 404 or 403. This is usually caused by wrong annotations configured on the service spec.

Currently, an error is reported in EnsureLoadBalancerDeleted(), so that the service controller to retry deleting it again. However, then 404 or 403 is reported, this retry won't succeed. And if you create a new service with the same name, it will always fail.

This PR fixes the issue by checking the response codes, and reports nil on 404 and 403.

Which issue(s) this PR fixes:

Fixes #75198

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Ensure Azure load balancer cleaned up on 404 or 403 when deleting LoadBalancer services.

/sig azure
/kind bug
/priority critical-urgent
/milestone v1.14
/cc @andyzhangx @khenidak @weinong

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 11, 2019
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 11, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.14 milestone Mar 11, 2019
@k8s-ci-robot k8s-ci-robot added sig/azure kind/bug Categorizes issue or PR as related to a bug. labels Mar 11, 2019
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 11, 2019
@k8s-ci-robot
Copy link
Contributor

@feiskyer: GitHub didn't allow me to request PR reviews from the following users: weinong.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

When deleting LoadBalancer services, Azure may return 404 or 403. This is usually caused by wrong annotations configured on the service spec.

Currently, an error is reported in EnsureLoadBalancerDeleted(), so that the service controller to retry deleting it again. However, then 404 or 403 is reported, this retry won't succeed. And if you create a new service with the same name, it will always fail.

This PR fixes the issue by checking the response codes, and reports nil on 404 and 403.

Which issue(s) this PR fixes:

Fixes #75198

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Ensure Azure load balancer cleaned up on 404 or 403 when deleting LoadBalancer services.

/sig azure
/kind bug
/priority critical-urgent
/milestone v1.14
/cc @andyzhangx @khenidak @weinong

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Mar 11, 2019
@feiskyer
Copy link
Member Author

/test pull-kubernetes-e2e-aks-engine-azure

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 11, 2019
@feiskyer feiskyer added this to In progress in Provider Azure via automation Mar 11, 2019
@feiskyer feiskyer moved this from In progress to Needs Review in Provider Azure Mar 11, 2019
@feiskyer
Copy link
Member Author

@andyzhangx Could you help to take a look?

Copy link
Member

@andyzhangx andyzhangx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2019
@feiskyer feiskyer added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Mar 12, 2019
@k8s-ci-robot k8s-ci-robot merged commit 9cbccd3 into kubernetes:master Mar 12, 2019
Provider Azure automation moved this from Needs Review to Done Mar 12, 2019
@djsly
Copy link
Contributor

djsly commented Mar 13, 2019

thanks @feiskyer, we just discovered that issue on 1.13.4. Some users set the wrong idle timeout (3 which isn't a valid value) and another user set a wrong resource group name. Both case, the service creation failed and the service got stuck in Deleting state with the same root cause. Users tried to delete recreate the same service (name + resourcegroup name == same) and the controller manager was ignoring the creation request since it was stuck trying to delete.

After wiping the kube-controller-manager's memory (restart service) all was good.

I'm guessing those PR/Cherry Pick are there to fix that exact issue ?

@feiskyer
Copy link
Member Author

@djsly Yep, cherry picking the fixes to all stable releases.

@feiskyer feiskyer deleted the fix-75198 branch March 14, 2019 00:04
k8s-ci-robot added a commit that referenced this pull request Mar 15, 2019
…56-upstream-release-1.12

Automated cherry pick of #75256: Ensure Azure load balancer cleaned up on 404 or 403
k8s-ci-robot added a commit that referenced this pull request Mar 19, 2019
…56-upstream-release-1.13

Automated cherry pick of #75256: Ensure Azure load balancer cleaned up on 404 or 403
k8s-ci-robot added a commit that referenced this pull request Mar 21, 2019
…56-upstream-release-1.11

Automated cherry pick of #75256: Ensure Azure load balancer cleaned up on 404 or 403
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
4 participants