
Test "Services should correctly serve identically named services in different namespaces on different external IP addresses" failing on Jenkins #5722

Closed
zmerlynn opened this issue Mar 20, 2015 · 16 comments · Fixed by #5732
Assignees
Labels
area/test-infra priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@zmerlynn
Member

This test hasn't succeeded in the last 30 runs on GCE or GKE. Trying to figure out what's going on.

@zmerlynn zmerlynn added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 20, 2015
@zmerlynn
Member Author

Oh, this test is just horrible. I manually cleaned up the ELB on the gke-test job, and the next time around the failure changed from that error to a timeout.

@zmerlynn zmerlynn self-assigned this Mar 20, 2015
zmerlynn added a commit to zmerlynn/kubernetes that referenced this issue Mar 20, 2015
Prior to attempting to create new ones, cleanup from previous runs.
Timeouts, 500s, etc. are possible here, and if they happen, you don't
want to die forever.

Along the way: Remove the timeout, it was clearly copied from the
previous function and is actually an anti-pattern that needs to be
fixed after discovering it doesn't play well with defers.

Fixes kubernetes#5722
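
A minimal sketch of the cleanup-before-create pattern the commit describes, assuming a hypothetical serviceClient interface; the helper names below are illustrative, not the real e2e test code:

```go
package e2e

import (
	"fmt"
	"log"
)

// serviceClient is a hypothetical stand-in for the subset of the Kubernetes
// client the test would use; it is not the real client API.
type serviceClient interface {
	DeleteService(namespace, name string) error
	CreateService(namespace, name string) error
	IsNotFound(err error) bool
}

// setupTestService deletes any leftover service from a previous run before
// creating a fresh one. Cleanup failures (timeouts, 500s, etc.) are logged
// rather than fatal, so a stale object can't wedge the test forever.
func setupTestService(c serviceClient, namespace, name string) error {
	if err := c.DeleteService(namespace, name); err != nil && !c.IsNotFound(err) {
		log.Printf("cleanup of previous service %s/%s failed: %v", namespace, name, err)
	}
	if err := c.CreateService(namespace, name); err != nil {
		return fmt.Errorf("creating service %s/%s: %v", namespace, name, err)
	}
	return nil
}
```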
@zmerlynn
Member Author

Amazingly, #5732 wasn't enough to close this.

STEP: cleanup previous service services-namespace-test0 in namespace namespace0
STEP: cleanup previous service services-namespace-test0 in namespace namespace1
STEP: creating service services-namespace-test0 in namespace namespace0
STEP: creating service services-namespace-test0 in namespace namespace1
STEP: deleting service services-namespace-test0 in namespace namespace0

• Failure [76.741 seconds]
Services
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/service.go:292
  should correctly serve identically named services in different namespaces on different external IP addresses [It]
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/service.go:291

  Expected error:
      <*errors.StatusError | 0xc208190c00>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {SelfLink: "", ResourceVersion: ""},
              Status: "Failure",
              Message: "The resource 'projects/kubernetes-jenkins/regions/us-central1/targetPools/e2e-test-jenkins-namespace1-services-namespace-test0' already exists",
              Reason: "",
              Details: nil,
              Code: 500,
          },
      }
      The resource 'projects/kubernetes-jenkins/regions/us-central1/targetPools/e2e-test-jenkins-namespace1-services-namespace-test0' already exists
  not to have occurred

Bleh. Maybe the namespace approach is right, just to avoid any possible GCE name collision issues. I suspect we may be running into an issue where delete / re-add is just too fast, and I don't want to stick in a sleep. :/

cc @quinton-hoole
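
One way to avoid a hard-coded sleep is to poll until the underlying cloud resource is actually gone before re-creating it. A rough sketch under that assumption (the helper is hypothetical, not code from the test):

```go
package e2e

import (
	"fmt"
	"time"
)

// waitForDeletion polls stillExists until the resource disappears or the
// timeout elapses, instead of sleeping a fixed interval after the delete.
// Illustrative helper only.
func waitForDeletion(stillExists func() (bool, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		exists, err := stillExists()
		if err != nil {
			return err
		}
		if !exists {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("resource still exists after %v", timeout)
		}
		time.Sleep(interval)
	}
}
```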

@zmerlynn zmerlynn reopened this Mar 21, 2015
@zmerlynn
Member Author

So this is interesting. That exact output above, but the GCE cloud console shows an e2e-test-jenkins-namespace0-services-namespace-test0 remaining. What's interesting here is that the test actually does log errors from the deletion case, and it clearly says it deleted the service ("STEP: deleting service services-namespace-test0 in namespace namespace0"). It sounds a lot like there's a bug here in the cloud provider layer during ELB deletion.

@ghost

ghost commented Mar 21, 2015

This is actually a bug in our code (rather than the test), and a consequence of the synchronous creation and deletion of GCE ELBs, which is being fixed elsewhere. When we start creating the ELBs asynchronously, these Kubernetes API calls will start succeeding, and anti-entropy mechanisms in our backend will make sure that the ELB creation/deletion eventually succeeds.
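
To make the anti-entropy idea concrete, a background reconciler would repeatedly compare desired state against what the cloud actually has and issue creates/deletes until the two converge, so a single failed or slow operation gets retried rather than surfaced to the caller. A toy sketch, not the actual backend code:

```go
package e2e

import (
	"log"
	"time"
)

// reconcileLoop is a toy anti-entropy loop: every period it compares the
// desired and actual sets of load balancers and issues creates/deletes to
// converge them. Purely illustrative; not Kubernetes or GCE backend code.
func reconcileLoop(desired, actual func() map[string]bool, create, remove func(name string) error, period time.Duration) {
	for {
		want, have := desired(), actual()
		for name := range want {
			if !have[name] {
				if err := create(name); err != nil {
					log.Printf("create %s failed, will retry: %v", name, err)
				}
			}
		}
		for name := range have {
			if !want[name] {
				if err := remove(name); err != nil {
					log.Printf("delete %s failed, will retry: %v", name, err)
				}
			}
		}
		time.Sleep(period)
	}
}
```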

@ghost

ghost commented Mar 21, 2015

Issue #5180 relates to the above.

@ghost

ghost commented Mar 21, 2015

See also discussion in PR #5732

@zmerlynn
Member Author

Is there an approach to get this test to pass prior to that getting fixed, while still sticking to API primitives?

@ghost

ghost commented Mar 21, 2015

Yes, just increase the test timeout to 240 seconds or beyond.
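
That amounts to raising the polling deadline the test uses while waiting on the load balancer. A hedged illustration (the constant name and value are assumptions, not the actual test/e2e/service.go code):

```go
package e2e

import "time"

// Hypothetical constant; the actual name and value in test/e2e/service.go
// may differ. A larger deadline gives GCE's eventually-consistent load
// balancer operations time to settle before the test gives up.
const lbPollTimeout = 240 * time.Second
```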


@zmerlynn
Member Author

It's actually not a timeout issue (we're not seeing timeouts, just complaints about duplicate resources). I just noticed that I wasn't paying close attention to the complaint, which is that the target pool was duplicated. I had cleaned up one resource and not the other on the Jenkins project explicitly. :/

Separately, I'm an idiot: #5732 isn't nearly enough, because it's fine for developer flows (the ^C case), but those services don't actually exist in the Jenkins case because the cluster is newly created. The problem is that at some point, we lost all ability to "re-claim" ELBs by name. This actually used to just happen, and was a source of user complaints. If that were still working, a test flake from a 500-backend-error and then a subsequent duplicate wouldn't be an issue.
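
For illustration, "re-claiming" by name would look roughly like a get-or-create on the target pool. This sketch uses a hypothetical lbProvider interface rather than the real GCE cloud provider code:

```go
package e2e

import "errors"

// errNotFound is a sentinel used by this sketch for "resource does not exist".
var errNotFound = errors.New("not found")

// lbProvider is a hypothetical stand-in for the cloud provider's load
// balancer operations; it is not the real GCE cloud provider interface.
type lbProvider interface {
	GetTargetPool(name string) error // returns errNotFound if the pool is absent
	CreateTargetPool(name string) error
}

// ensureTargetPool illustrates re-claiming a load balancer by name: if a
// pool with the expected name already exists, adopt it instead of failing
// with an "already exists" error.
func ensureTargetPool(p lbProvider, name string) error {
	err := p.GetTargetPool(name)
	if err == nil {
		return nil // already there; reuse it
	}
	if err != errNotFound {
		return err
	}
	return p.CreateTargetPool(name)
}
```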

@roberthbailey
Contributor

I just looked at Jenkins and this test now says that it has failed 2 times in the last 30 runs. If this isn't resolved, it's probably not a P0 any longer.

@ghost

ghost commented Mar 24, 2015

Ack.

@zmerlynn zmerlynn added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 25, 2015
@zmerlynn
Member Author

It's not resolved, if only because I occasionally still have to go clean out ELBs when one of the runs fails. However, it's not as critical as it was, true.

@a-robinson
Contributor

It's not resolved, but it should be considerably less flaky with the increased timeout that got added this afternoon.

@ghost

ghost commented Mar 25, 2015

I also plan to make each invocation run in a new namespace, which will make it still more robust (but leakage will still eventually consume GCE quota). And derekwaynecar@ nearly has namespace deletion working (so we'll be able to delete whole k8s namespaces, which will help me to cover the case where the cleanup code in the test fails). And I think we almost have asynchronous ELB creation/deletion in (right alex?). So in summary, Zach, I hope not to have anyone doing manual ELB cleanups "real soon now" :-)
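
A per-invocation namespace can be as simple as salting the name with something unique per run; a tiny sketch (the naming scheme is an assumption, not the framework's actual generator):

```go
package e2e

import (
	"fmt"
	"time"
)

// uniqueNamespace returns a fresh namespace name for each test invocation so
// that cloud resources derived from it (target pools, forwarding rules, ...)
// can't collide with a previous run's leftovers. The naming scheme is
// illustrative, not what the e2e framework actually uses.
func uniqueNamespace(prefix string) string {
	// e.g. uniqueNamespace("services-test") -> "services-test-1427300000000000000"
	return fmt.Sprintf("%s-%d", prefix, time.Now().UnixNano())
}
```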


@derekwaynecarr
Member

@markturansky has a TODO to clean up persistentvolumeclaims as part of namespace termination.


@ghost

ghost commented Mar 30, 2015

PR #6125 fixes the original problem reported in this issue.

@ghost ghost closed this as completed Mar 30, 2015
akram pushed a commit to akram/kubernetes that referenced this issue Apr 7, 2015