[AWS] Failed ELB creation retried indefinitely after LB service deletion #17790

Closed · antoineco opened this issue Nov 25, 2015 · 11 comments

@antoineco (Contributor)

(k8s 1.1.2)

I created a service of type LoadBalancer on AWS, which failed because of #12381.

After deleting the service, kube-controller-manager keeps trying to create the ELB indefinitely:

kube-controller-manager[1228]: I1125 18:26:53.922152    1228 servicecontroller.go:222] Got new Sync delta for service: &{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:api-lb GenerateName: Namespace:user-api SelfLink:/api/v1/namespaces/user-api/services/api-lb UID:a0827a39-939f-11e5-8b77-0a4c87eff515 ResourceVersion:24376398 Generation:0 CreationTimestamp:2015-11-25 18:09:10 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[app:core-user-rails component:api] Annotations:map[]} Spec:{Type:LoadBalancer Ports:[{Name:http Protocol:TCP Port:9080 TargetPort:{Kind:1 IntVal:0 StrVal:http} NodePort:30131}] Selector:map[app:core-user-rails component:api] ClusterIP:10.10.203.255 ExternalIPs:[] LoadBalancerIP: SessionAffinity:None} Status:{LoadBalancer:{Ingress:[]}}}
kube-controller-manager[1228]: I1125 18:26:53.922308    1228 servicecontroller.go:317] Ensuring LB for service user-api/api-lb
kube-controller-manager[1228]: I1125 18:26:53.922346    1228 aws.go:1582] EnsureTCPLoadBalancer(aa0827a39939f11e58b770a4c87eff51, eu-west-1, <nil>, [0xc208a27bd0], [ip-10-0-12-111.eu-west-1.compute.internal ip-10-0-12-147.eu-west-1.compute.internal ip-10-0-12-110.eu-west-1.compute.internal])
kube-controller-manager[1228]: I1125 18:26:54.470111    1228 aws_loadbalancer.go:50] Creating load balancer with name: aa0827a39939f11e58b770a4c87eff51
kube-controller-manager[1228]: E1125 18:26:54.818167    1228 servicecontroller.go:187] Failed to process service delta. Retrying: Failed to create load balancer for service user-api/api-lb: InvalidConfigurationRequest: ELB cannot be attached to multiple subnets in the same AZ.
kube-controller-manager[1228]: status code: 409, request id: 1a9adde7-93a2-11e5-b3ea-5bdb992c6083
...

This eventually leads to throttling (see #12121):

kube-controller-manager[1228]: E1125 18:35:11.765525    1228 servicecontroller.go:187] Failed to process service delta. Retrying: Failed to create load balancer for service user-api/api-lb: Throttling: Rate exceeded

The only way I found to stop the hemorrhage was to restart kube-controller-manager.

@davidopp (Member)

@bprashanth

@kelcecil (Contributor) commented Dec 4, 2015

We experienced this with Kubernetes 1.1.1 this morning, and the infinite retry, combined with how aggressive the retries are, caused difficulties for other parts of our system. Limiting the number of retries and exponentially backing off on subsequent retries would be fantastic.

I'm willing to submit a PR to fix this if provided some guidance on whether my suggestions would be acceptable.
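
For context, the bounded exponential backoff proposed above amounts to something like the sketch below. It is a minimal, self-contained Go illustration; `retryWithBackoff`, `createELB`, and the particular delays and retry limit are hypothetical placeholders, not the service controller's actual code.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff tries op once and retries up to maxRetries more times,
// doubling the delay after each failure (plus a little jitter) and capping it
// at maxDelay, then gives up instead of retrying forever.
func retryWithBackoff(op func() error, maxRetries int, baseDelay, maxDelay time.Duration) error {
	delay := baseDelay
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt == maxRetries {
			break
		}
		// Up to 20% jitter so many controllers don't retry in lockstep.
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay)/5+1)))
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	// Hypothetical stand-in for the ELB creation call that keeps failing here.
	createELB := func() error {
		return fmt.Errorf("InvalidConfigurationRequest: ELB cannot be attached to multiple subnets in the same AZ")
	}
	// Short delays for the demo; a real controller would use seconds to minutes.
	if err := retryWithBackoff(createELB, 5, 100*time.Millisecond, 2*time.Second); err != nil {
		fmt.Println(err)
	}
}
```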

@harsha-y

+1

Experienced this on k8s 1.1.1 - a previous occurrence caused AWS to disable ELB creation on a particular account.

An exponential cool-down timer seems to be what AWS recommends, but at the very least this should have a retry limit and a default delay between retries.

@gopinatht

+1 on this. @kelcecil Has the PR for this been submitted already?

@bprashanth (Contributor)

@kelcecil Sure, a backoff PR sounds like a good fix if you're still interested. Sorry for the delay. @kubernetes/goog-cluster FYI.

@kelcecil (Contributor) commented Feb 3, 2016

@gopinatht @bprashanth The PR isn't submitted yet, but I can possibly get to it in the next week. If someone else wants to take it, then please feel free.

@bprashanth (Contributor)

Next week sounds good. I'll wait for it unless @gopinatht wants to jump in.

@gopinatht

@bprashanth @kelcecil I have absolutely no experience with this code base, so if I were to do it, it would take a lot more than a week. I can offer to review the PR if that helps.

@kelcecil (Contributor) commented Mar 1, 2016

I talked to @justinsb about this on the #sig-aws Slack channel a little while ago. I'm going to look at generalizing this backoff so it isn't AWS-specific, which should make things easier. I'll start hacking on it this week.

@justinsb (Member) commented Mar 2, 2016

@kelcecil I've got a PR pending for backoff in the servicecontroller: #21982. It isn't great (it uses a goroutine to defer the retry), but it should be doable for 1.2, whereas a better implementation would probably be too invasive.
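
The "goroutine to defer" approach described above is roughly the following pattern: when a sync fails, grow that service's delay and re-queue the key from a goroutine after sleeping, so the main loop never blocks on a broken service. This is a toy sketch, not the code from #21982; the `controller` type, `workQueue` channel, and delay values are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// controller is a toy stand-in for the service controller: a queue of service
// keys plus a per-service retry delay that doubles on each failure, capped.
type controller struct {
	workQueue chan string
	delays    map[string]time.Duration
}

// retryLater defers the retry in a goroutine instead of re-queueing the key
// immediately, so the sync loop itself never sleeps or spins.
func (c *controller) retryLater(key string, cause error) {
	delay := c.delays[key] * 2
	if delay == 0 {
		delay = time.Second
	}
	if delay > 5*time.Minute {
		delay = 5 * time.Minute
	}
	c.delays[key] = delay

	fmt.Printf("sync of %s failed (%v); retrying in %s\n", key, cause, delay)
	go func() {
		time.Sleep(delay)
		c.workQueue <- key // the main sync loop will pick it up again
	}()
}

func main() {
	c := &controller{workQueue: make(chan string, 16), delays: map[string]time.Duration{}}
	c.retryLater("user-api/api-lb", fmt.Errorf("InvalidConfigurationRequest: status code 409"))

	key := <-c.workQueue // blocks until the deferred retry fires (~1s here)
	fmt.Println("re-syncing", key)
}
```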

@justinsb (Member) commented Jun 4, 2016

This should now be fixed in 1.2: we have a backoff. I believe we still retry indefinitely, but we should not hit rate limits. Please reopen if it continues in 1.2 or later.
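
For readers hitting this on later releases: the general shape of "retry indefinitely, but with per-item backoff so cloud API rate limits aren't hit" is what client-go's rate-limited workqueue provides, and is how Kubernetes controllers commonly express it today. The sketch below shows that pattern; it is not necessarily the exact servicecontroller code in 1.2, and `syncService` plus the queue wiring are illustrative.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// syncService is a hypothetical stand-in for the controller's per-service
// sync (e.g. the EnsureTCPLoadBalancer call failing in this issue).
func syncService(key string) error {
	return fmt.Errorf("InvalidConfigurationRequest: status code 409")
}

func main() {
	// The default controller rate limiter combines per-item exponential
	// backoff with an overall token bucket, so one broken service can be
	// retried indefinitely without hammering the cloud provider API.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer queue.ShutDown()

	queue.Add("user-api/api-lb")

	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := syncService(key); err != nil {
			// Re-queue with backoff; the delay grows with NumRequeues(key).
			fmt.Printf("sync of %s failed (%v); requeues so far: %d\n", key, err, queue.NumRequeues(key))
			queue.AddRateLimited(key)
		} else {
			queue.Forget(key) // success resets the per-item backoff
		}
		queue.Done(item)
	}
}
```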

@justinsb closed this as completed Jun 4, 2016