
Leaking GCE load balancer target pools detected in services e2e test #8377

Closed
ghost opened this issue May 16, 2015 · 7 comments
Labels
area/test-infra · priority/important-soon

Comments


ghost commented May 16, 2015

This started failing consistently on our continuous integration system (kubernetes-e2e-gce) at Build #352 (May 16, 2015 1:08:38 AM). There are no obvious culprit PRs in the vicinity. Perhaps an underlying GCE issue?
I noticed a bunch of GCE load balancers seemingly left lying around in the relevant GCE project. Perhaps the system or test is leaking LBs and hitting its GCE quota. I'll look into that...

cc: @a-robinson

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/service.go:348

```
Expected error:
    <*errors.errorString | 0xc208e160c0>: {
        s: "service external-lb-test in namespace e2e-tests-service-0-5460a388-2c1d-4be9-a258-a19ee3ad8ea0 doesn't have a public IP after 240.00 seconds",
    }
    service external-lb-test in namespace e2e-tests-service-0-5460a388-2c1d-4be9-a258-a19ee3ad8ea0 doesn't have a public IP after 240.00 seconds
not to have occurred
```
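A quick way to confirm whether we're up against the quota (a sketch; target pools are a regional resource, and the region below is a placeholder for whichever one the CI project uses):

```sh
# Show the region's TARGET_POOLS quota usage against its limit.
gcloud compute regions describe us-central1 | grep -B1 -A1 'metric: TARGET_POOLS'
```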

ghost added the priority/critical-urgent and area/test-infra labels on May 16, 2015
ghost added this to the v1.0 milestone on May 16, 2015

ghost commented May 16, 2015

There were 29 GCE forwarding rules but 50 GCE target pools, so target pools seem to be leaking. As far as I know, 50 is also our project's quota limit for target pools, so that's probably the problem. I've deleted all of the above - let's see whether that sorts out the problem.
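For the record, roughly how to spot the orphans (a sketch, not the exact commands I ran; `<pool-name>` and `<region>` are placeholders, and it relies on Kubernetes giving a service's forwarding rule and target pool the same name, plus a gcloud new enough to support `--format='value(name)'`):

```sh
# List both resource types by name; a target pool with no matching
# forwarding rule is a leak candidate.
gcloud compute forwarding-rules list --format='value(name)' | sort > rules.txt
gcloud compute target-pools list --format='value(name)' | sort > pools.txt
comm -13 rules.txt pools.txt    # names only in pools.txt, i.e. orphaned pools

# Delete a leaked pool (the region must match where it was created).
gcloud compute target-pools delete <pool-name> --region <region>
```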


ghost commented May 16, 2015

To be clear, the 29 forwarding rules are probably from e2e tests that were running - there are many, and I didn't check. But that number should match the number of target pools, and the discrepancy indicates a probable leak.


ghost commented May 16, 2015

Yes, that fixed it. The e2e tests are all green again. Dropping the priority, but keeping this open to fix the source of the target pool leak.

ghost added the priority/important-soon label and removed the priority/critical-urgent label on May 16, 2015
ghost modified the milestones: v1.0-candidate, v1.0 on May 16, 2015
ghost added the team/cluster label on May 16, 2015
ghost changed the title from "e2e regression: Service does not get a public IP" to "Leaking GCE load balancer target pools detected in services e2e test" on May 16, 2015

ghost commented May 16, 2015

Yup, we're leaking target pools again. All of the leaked pools are named k8s-jenkins-gke-e2e-*, so the problem appears to be specific to GKE.


ghost commented May 16, 2015

Correction - the instances behind the leaked pools are named k8s-jenkins-gke-e2e-*.


ghost commented May 16, 2015

cc: @roberthbailey @brendandburns FYI

a-robinson (Contributor) commented

This is a dupe of #7753, which #7852 should at least help with. Closing this in favor of that.

It's very odd that it's affecting GKE so disproportionately, though. After cleaning up the target pools whose forwarding rules were gone, all the non-soak-test target pools belonged to GKE, which indicates that GKE isn't just leaking target pools, it's also leaking forwarding rules.

My first guess is that this is happening when the services tests are the last ones run and the cluster gets torn down before the service controller has had time to clean up. It affects GKE but not GCE because GKE tears down all resources in parallel, while the GCE script synchronously waits for all nodes to be deleted before deleting the master.
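To illustrate the ordering difference (a sketch with hypothetical instance names; the real teardown goes through kube-down/the GKE API rather than raw instance deletes):

```sh
# GCE-style teardown: delete the nodes and wait for completion, then delete
# the master, so the service controller running there still has time to
# remove the service's forwarding rules and target pools.
gcloud compute instances delete <node-1> <node-2> --zone <zone> --quiet
gcloud compute instances delete <master> --zone <zone> --quiet

# GKE-style teardown removes everything concurrently, so the master (and the
# service controller with it) can disappear before LB cleanup finishes.
gcloud compute instances delete <node-1> <node-2> --zone <zone> --quiet &
gcloud compute instances delete <master> --zone <zone> --quiet &
wait
```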
