
Test "Services should correctly serve identically named services in different namespaces on different external IP addresses" failing on Jenkins #5722

Closed
zmerlynn opened this issue Mar 20, 2015 · 16 comments · Fixed by #5732
Assignees
Labels
area/test-infra priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@zmerlynn
Member

This test hasn't succeeded in the last 30 runs on GCE or GKE. Trying to figure out what's going on.

@zmerlynn zmerlynn added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 20, 2015
@zmerlynn
Member Author

Oh, this test is just horrible. I manually cleaned up the ELB on the gke-test job, and the next time around the failure changed from that error to a timeout.

@zmerlynn zmerlynn self-assigned this Mar 20, 2015
zmerlynn added a commit to zmerlynn/kubernetes that referenced this issue Mar 20, 2015
Prior to attempting to create new ones, cleanup from previous runs.
Timeouts, 500s, etc. are possible here, and if they happen, you don't
want to die forever.

Along the way: Remove the timeout, it was clearly copied from the
previous function and is actually an anti-pattern that needs to be
fixed after discovering it doesn't play well with defers.

Fixes kubernetes#5722
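
A minimal sketch of the cleanup-before-create pattern the commit describes, assuming a hypothetical serviceClient interface; the helper names below are illustrative, not the real e2e test code:

```go
package e2e

import (
	"fmt"
	"log"
)

// serviceClient is a hypothetical stand-in for the subset of the Kubernetes
// client the test would use; it is not the real client API.
type serviceClient interface {
	DeleteService(namespace, name string) error
	CreateService(namespace, name string) error
	IsNotFound(err error) bool
}

// setupTestService deletes any leftover service from a previous run before
// creating a fresh one. Cleanup failures (timeouts, 500s, etc.) are logged
// rather than fatal, so a stale object can't wedge the test forever.
func setupTestService(c serviceClient, namespace, name string) error {
	if err := c.DeleteService(namespace, name); err != nil && !c.IsNotFound(err) {
		log.Printf("cleanup of previous service %s/%s failed: %v", namespace, name, err)
	}
	if err := c.CreateService(namespace, name); err != nil {
		return fmt.Errorf("creating service %s/%s: %v", namespace, name, err)
	}
	return nil
}
```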
@zmerlynn
Member Author

Amazingly, #5732 wasn't enough to close this.

STEP: cleanup previous service services-namespace-test0 in namespace namespace0
STEP: cleanup previous service services-namespace-test0 in namespace namespace1
STEP: creating service services-namespace-test0 in namespace namespace0
STEP: creating service services-namespace-test0 in namespace namespace1
STEP: deleting service services-namespace-test0 in namespace namespace0

• Failure [76.741 seconds]
Services
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/service.go:292
  should correctly serve identically named services in different namespaces on different external IP addresses [It]
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/service.go:291

  Expected error:
      <*errors.StatusError | 0xc208190c00>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {SelfLink: "", ResourceVersion: ""},
              Status: "Failure",
              Message: "The resource 'projects/kubernetes-jenkins/regions/us-central1/targetPools/e2e-test-jenkins-namespace1-services-namespace-test0' already exists",
              Reason: "",
              Details: nil,
              Code: 500,
          },
      }
      The resource 'projects/kubernetes-jenkins/regions/us-central1/targetPools/e2e-test-jenkins-namespace1-services-namespace-test0' already exists
  not to have occurred

Bleh. Maybe the namespace approach is right, just to avoid any possible GCE name collision issues. I suspect we may be running into an issue where delete / re-add is just too fast, and I don't want to stick in a sleep. :/

cc @quinton-hoole
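
One way to avoid a hard-coded sleep is to poll until the underlying cloud resource is actually gone before re-creating it. A rough sketch under that assumption (the helper is hypothetical, not code from the test):

```go
package e2e

import (
	"fmt"
	"time"
)

// waitForDeletion polls stillExists until the resource disappears or the
// timeout elapses, instead of sleeping a fixed interval after the delete.
// Illustrative helper only.
func waitForDeletion(stillExists func() (bool, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		exists, err := stillExists()
		if err != nil {
			return err
		}
		if !exists {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("resource still exists after %v", timeout)
		}
		time.Sleep(interval)
	}
}
```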

@zmerlynn zmerlynn reopened this Mar 21, 2015
@zmerlynn
Member Author

So this is interesting. That exact output above, but the GCE cloud console shows an e2e-test-jenkins-namespace0-services-namespace-test0 remaining. What's interesting here is that the test actually does log errors from the deletion case, and it clearly says it deleted the service ("STEP: deleting service services-namespace-test0 in namespace namespace0"). It sounds a lot like there's a bug here in the cloud provider layer during ELB deletion.

@ghost

ghost commented Mar 21, 2015

This is actually a bug in our code (rather than the test), and a consequence of the synchronous creation and deletion of GCE ELBs, which is being fixed elsewhere. When we start creating the ELBs asynchronously, these Kubernetes API calls will start succeeding, and anti-entropy mechanisms in our backend will make sure that the ELB creation/deletion eventually succeeds.
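
To make the anti-entropy idea concrete, a background reconciler would repeatedly compare desired state against what the cloud actually has and issue creates/deletes until the two converge, so a single failed or slow operation gets retried rather than surfaced to the caller. A toy sketch, not the actual backend code:

```go
package e2e

import (
	"log"
	"time"
)

// reconcileLoop is a toy anti-entropy loop: every period it compares the
// desired and actual sets of load balancers and issues creates/deletes to
// converge them. Purely illustrative; not Kubernetes or GCE backend code.
func reconcileLoop(desired, actual func() map[string]bool, create, remove func(name string) error, period time.Duration) {
	for {
		want, have := desired(), actual()
		for name := range want {
			if !have[name] {
				if err := create(name); err != nil {
					log.Printf("create %s failed, will retry: %v", name, err)
				}
			}
		}
		for name := range have {
			if !want[name] {
				if err := remove(name); err != nil {
					log.Printf("delete %s failed, will retry: %v", name, err)
				}
			}
		}
		time.Sleep(period)
	}
}
```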

@ghost

ghost commented Mar 21, 2015

Issue #5180 relates to the above.

@ghost

ghost commented Mar 21, 2015

See also discussion in PR #5732

@zmerlynn
Member Author

Is there an approach to get this test to pass prior to that getting fixed, while still sticking to API primitives?

@ghost

ghost commented Mar 21, 2015

Yes, just increase the test timeout to 240 seconds or beyond.
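
That amounts to raising the polling deadline the test uses while waiting on the load balancer. A hedged illustration (the constant name and value are assumptions, not the actual test/e2e/service.go code):

```go
package e2e

import "time"

// Hypothetical constant; the actual name and value in test/e2e/service.go
// may differ. A larger deadline gives GCE's eventually-consistent load
// balancer operations time to settle before the test gives up.
const lbPollTimeout = 240 * time.Second
```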


@zmerlynn
Member Author

It's actually not a timeout issue (we're not seeing timeouts, just complaints about duplicate resources). I just noticed that I wasn't paying close attention to the complaint, which is that the target pool was duplicated. I had cleaned up one resource and not the other on the Jenkins project explicitly. :/

Separately, I'm an idiot: #5732 isn't nearly enough, because it's fine for developer flows (the ^C case), but those services don't actually exist in the Jenkins case because the cluster is newly created. The problem is that at some point, we lost all ability to "re-claim" ELBs by name. This actually used to just happen, and was a source of user complaints. If that were still working, a test flake from a 500-backend-error and then a subsequent duplicate wouldn't be an issue.
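
For illustration, "re-claiming" by name would look roughly like a get-or-create on the target pool. This sketch uses a hypothetical lbProvider interface rather than the real GCE cloud provider code:

```go
package e2e

import "errors"

// errNotFound is a sentinel used by this sketch for "resource does not exist".
var errNotFound = errors.New("not found")

// lbProvider is a hypothetical stand-in for the cloud provider's load
// balancer operations; it is not the real GCE cloud provider interface.
type lbProvider interface {
	GetTargetPool(name string) error // returns errNotFound if the pool is absent
	CreateTargetPool(name string) error
}

// ensureTargetPool illustrates re-claiming a load balancer by name: if a
// pool with the expected name already exists, adopt it instead of failing
// with an "already exists" error.
func ensureTargetPool(p lbProvider, name string) error {
	err := p.GetTargetPool(name)
	if err == nil {
		return nil // already there; reuse it
	}
	if err != errNotFound {
		return err
	}
	return p.CreateTargetPool(name)
}
```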

@roberthbailey
Contributor

I just looked at Jenkins and this test now says that it has failed 2 times in the last 30 runs. If this isn't resolved, it's probably not a P0 any longer.

@ghost

ghost commented Mar 24, 2015

Ack.

@zmerlynn zmerlynn added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 25, 2015
@zmerlynn
Member Author

It's not resolved, if only because I occasionally still have to go clean out ELBs when one of the runs fails. However, it's not as critical as it was, true.

@a-robinson
Contributor

It's not resolved, but it should be considerably less flaky with the increased timeout that got added this afternoon.

@ghost

ghost commented Mar 25, 2015

I also plan to make each invocation run in a new namespace, which will make it still more robust (but leakage will still eventually consume GCE quota). And derekwaynecar@ nearly has namespace deletion working (so we'll be able to delete whole k8s namespaces, which will help me to cover the case where the cleanup code in the test fails). And I think we almost have asynchronous ELB creation/deletion in (right alex?). So in summary, Zach, I hope not to have anyone doing manual ELB cleanups "real soon now" :-)
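
A per-invocation namespace can be as simple as salting the name with something unique per run; a tiny sketch (the naming scheme is an assumption, not the framework's actual generator):

```go
package e2e

import (
	"fmt"
	"time"
)

// uniqueNamespace returns a fresh namespace name for each test invocation so
// that cloud resources derived from it (target pools, forwarding rules, ...)
// can't collide with a previous run's leftovers. The naming scheme is
// illustrative, not what the e2e framework actually uses.
func uniqueNamespace(prefix string) string {
	// e.g. uniqueNamespace("services-test") -> "services-test-1427300000000000000"
	return fmt.Sprintf("%s-%d", prefix, time.Now().UnixNano())
}
```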


@derekwaynecarr
Member

@markturansky has a TODO to clean up persistentvolumeclaims as part of namespace termination.


@ghost

ghost commented Mar 30, 2015

PR #6125 fixes the original problem reported in this issue.

@ghost ghost closed this as completed Mar 30, 2015
akram pushed a commit to akram/kubernetes that referenced this issue Apr 7, 2015