Leaking GCE load balancer target pools detected in services e2e test #8377
There were 29 GCE forwarding rules, but 50 GCE target pools, so target pools seem to be leaking. 50 is also the quota limit for target pools on our project, as far as I know, so that's probably the problem. I've deleted all of the above - let's see whether that sorts out the problem.
To be clear, the 29 forwarding rules are probably from some running e2e tests - there are many, and I didn't check. But that number should be the same as the number of target pools. The discrepancy indicates a probable leak.
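A quick way to spot this kind of leak is to compare the two counts directly. A minimal sketch of that check (the gcloud commands are shown in comments for context; the counts used here are the ones reported above):

```shell
#!/bin/sh
# Minimal leak check: more target pools than forwarding rules suggests
# orphaned pools. In practice the counts would come from, e.g.:
#   gcloud compute forwarding-rules list --format='value(name)' | wc -l
#   gcloud compute target-pools list --format='value(name)' | wc -l
count_leaks() {
  # $1 = forwarding-rule count, $2 = target-pool count
  if [ "$2" -gt "$1" ]; then echo $(( $2 - $1 )); else echo 0; fi
}
count_leaks 29 50   # the counts observed above -> 21 suspect pools
```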
Yes, that fixed it. e2e's are all green again. Dropping priority, but keeping open to fix the source of the target pool leak.
Yup, we're leaking target pools again. All of the leaked pools are named k8s-jenkins-gke-e2e-*, so the problem appears to be specific to GKE.
Correction - the instances behind the leaked pools are named k8s-jenkins-gke-e2e-*
cc: @roberthbailey @brendandburns FYI
This is a dupe of #7753, which #7852 should at least help with. Closing this in favor of that.

It's very odd that it's affecting GKE so disproportionately, though. After cleaning up the target pools whose forwarding rules were gone, all the non-soak-test target pools belonged to GKE, which indicates that GKE isn't just leaking target pools, it's also leaking forwarding rules.

The first thing that strikes me about that is that it's very possibly happening on the occasions when the services tests are the last ones run, and the cluster gets torn down before the service controller has had time to clean up. This is happening in GKE but not in GCE because GKE tears down all resources in parallel, while the GCE script synchronously waits for all nodes to be deleted before deleting the master.
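The manual cleanup described above (removing target pools whose forwarding rules were gone) can be sketched as a small script. This is a hedged illustration, assuming each forwarding rule and its target pool share a name, as the GCE cloud provider created them; the `orphaned_pools` helper and the sample names are hypothetical:

```shell
#!/bin/sh
# Sketch of the manual cleanup described above. Assumes each service's
# forwarding rule and target pool share a name; names are illustrative.

# Print pools that have no forwarding rule of the same name.
orphaned_pools() {
  rules="$1"   # whitespace-separated forwarding-rule names
  pools="$2"   # whitespace-separated target-pool names
  for p in $pools; do
    printf '%s\n' $rules | grep -qx "$p" || echo "$p"
  done
}

# In the real project the name lists would come from, e.g.:
#   rules=$(gcloud compute forwarding-rules list --format='value(name)')
#   pools=$(gcloud compute target-pools list --format='value(name)')
# and each orphan would then be deleted with something like:
#   gcloud compute target-pools delete "$pool" --region "$REGION" --quiet
orphaned_pools "rule-a rule-b" "rule-a rule-b leaked-pool"   # -> leaked-pool
```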
This started failing consistently on our continuous integration system (kubernetes-e2e-gce) at Build #352 (May 16, 2015 1:08:38 AM). No obvious culprit PRs in the vicinity. Perhaps an underlying GCE issue?
I noticed a bunch of GCE load balancers seemingly left lying around in the relevant GCE project. Perhaps the system or test is leaking LBs and reaching its GCE quota. I'll look into that...
cc: @a-robinson
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/service.go:348
Expected error:
<*errors.errorString | 0xc208e160c0>: {
s: "service external-lb-test in namespace e2e-tests-service-0-5460a388-2c1d-4be9-a258-a19ee3ad8ea0 doesn't have a public IP after 240.00 seconds",
}
service external-lb-test in namespace e2e-tests-service-0-5460a388-2c1d-4be9-a258-a19ee3ad8ea0 doesn't have a public IP after 240.00 seconds
not to have occurred