e2e flaky: Connection refused errors [Errno 111] on jenkins #23545
820:
822:
Don't see anything like that for 3599, but I bet it was a similar issue.
Connection refused is almost certainly a symptom of the Jenkins CPU being overloaded.
The CI build of gcloud has retries to prevent crashing on connection refused from the metadata server. That was rolled back due to errors that look unrelated. I've re-installed the CI build, so these crashes should at least be less frequent. See #23544 for more context.
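(Not gcloud's actual code -- just a minimal sketch of what retry-on-connection-refused looks like; the token URL, attempt count, and back-off policy below are illustrative assumptions.)

```python
import errno
import time
import urllib.error
import urllib.request

# Illustrative metadata URL; gcloud's real retry logic lives elsewhere.
METADATA_TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                      "instance/service-accounts/default/token")

def fetch_metadata_with_retry(url=METADATA_TOKEN_URL, attempts=5):
    """Fetch a metadata value, retrying only transient 'connection refused' errors."""
    for attempt in range(attempts):
        try:
            req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.read()
        except urllib.error.URLError as e:
            refused = getattr(e.reason, "errno", None) == errno.ECONNREFUSED
            if not refused or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying errno 111
```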
Ping - this is still happening quite frequently, e.g.:
@jlowdermilk - do you know if rolling back to the 'old' gcloud version would solve the problem?
@gmarek @jlowdermilk @ixdy the core problem is that we are overloading the Jenkins master. We need to start running jobs on slaves.
I think it's a combination of Jenkins master overloading and GCE changes that made the metadata server a bit more flaky under load. Next week's release should see some improvement, but we still need to run jobs on slaves.
It's nearly constantly failing for Kubemarks:
@jlowdermilk and I have been talking to the GCE team about this. In parallel I have been working on a hacky server to cache requests to the metadata server on our Jenkins VMs, but it looks like this will not work because of https://github.com/GoogleCloudPlatform/gcloud-golang/blob/master/compute/metadata/metadata.go#L208
Looks like I can work around this by adding multiple entries for metadata.google.internal to /etc/hosts. PR sent out for review, and this is enabled on Jenkins for the time being.
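(For illustration only -- the addresses and ordering below are assumptions about what the workaround might look like, with the local cache listed first and the real metadata address kept as a fallback; see the actual PR for the real entries.)

```python
# Illustrative only -- the actual change may use different addresses/ordering.
# Idea: metadata.google.internal resolves to the local cache first, with the
# real GCE metadata address (169.254.169.254) kept as a fallback entry.
HOSTS_ENTRIES = [
    "127.0.0.1\tmetadata.google.internal",        # assumed local caching proxy
    "169.254.169.254\tmetadata.google.internal",  # real metadata server
]

with open("/etc/hosts", "a") as hosts:  # needs root
    hosts.write("\n".join(HOSTS_ENTRIES) + "\n")
```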
@fejta - awesome!!! [even if this just reduces flakiness rather than eliminating it - what is happening currently is just a disaster]
BOOM! We are back up and merging: http://submit-queue.k8s.io/#/e2e The cache reduces load on the metadata server from ~20 qps to ~0 qps (1 query per ten minutes or so), and it also retries intermittent flakes when refreshing the access token via the metadata server (since we know we're on GCE it is safe to retry errno 111, which metadata.go and gcloud cannot do for perf reasons for regular users). So I expect this will eliminate the errno 111 errors (no impact on other flakes).
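(A minimal sketch of the idea, not the actual Jenkins cache -- the port, TTL, and retry policy here are assumptions: serve repeated metadata requests from a local in-memory cache, so the real server sees almost no traffic, and retry the upstream fetch only on errno 111.)

```python
import errno
import http.server
import time
import urllib.error
import urllib.request

UPSTREAM = "http://169.254.169.254"  # real GCE metadata server
TTL = 600                            # assumed: refresh roughly every ten minutes
RETRIES = 4
_cache = {}                          # path -> (expires_at, body)

class CacheHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        entry = _cache.get(self.path)
        if entry and entry[0] > time.time():
            return self._reply(200, entry[1])          # cache hit: no upstream call
        for attempt in range(RETRIES):
            try:
                req = urllib.request.Request(UPSTREAM + self.path,
                                             headers={"Metadata-Flavor": "Google"})
                with urllib.request.urlopen(req, timeout=5) as resp:
                    body = resp.read()
                _cache[self.path] = (time.time() + TTL, body)
                return self._reply(200, body)
            except urllib.error.URLError as e:
                if (getattr(e.reason, "errno", None) == errno.ECONNREFUSED
                        and attempt < RETRIES - 1):
                    time.sleep(2 ** attempt)           # retry the errno 111 flake
                    continue
                return self._reply(502, str(e).encode())

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("127.0.0.1", 8080), CacheHandler).serve_forever()
```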
So this was fixed all day until we migrated stuff over to Docker, which caused problems because DNS works differently. I'm rolling that back in #24176
@fejta - it seems that the rollback didn't help; it is still failing a lot.
Yes, that's a good idea.
@ixdy @fejta @spxtr @dchen1107 Drive-by comment. Should we perhaps disable the metadata cache until it's more stable? How much would that slow things down?
@quinton-hoole It's not that the metadata cache isn't stable, it's that we don't check to make sure that it's up before starting a job, and when Jenkins VMs restart we don't automatically start it. Without it, every job fails deterministically.
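(Illustrative pre-job check, assuming the cache is reached via metadata.google.internal through the /etc/hosts entries above -- a job wrapper could fail fast when the cache isn't answering instead of letting the run fail deterministically.)

```python
# Illustrative pre-job check (URL and timeout are assumptions): fail fast if
# the local metadata cache is not answering before kicking off an e2e run.
import sys
import urllib.request

CACHE_HEALTH_URL = "http://metadata.google.internal/computeMetadata/v1/"

def cache_is_up(url=CACHE_HEALTH_URL, timeout=3):
    try:
        req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except OSError:  # URLError is a subclass of OSError
        return False

if __name__ == "__main__":
    if not cache_is_up():
        sys.exit("metadata cache is not responding; start it before running jobs")
```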
@quinton-hoole I tried turning off the metadata cache a few weeks ago (using the metadata server directly), but the metadata server itself is still not stable enough for our use.
Happened again for me: https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/26914/kubernetes-pull-build-test-e2e-gce/43517/
Also happened to me - reopening.
This is blocking a lot of PRs and the merge queue is completely blocked. What are the instructions for restarting this cache?
I have manually started the metadata server cache.
Can we document it somewhere?
Ideally we should fix the metadata server cache so it reliably restarts when the VM restarts.
Documenting it before we're there sounds like the way to go.
What remains before we can close this P0? Is it just documenting how to restart the cache?
Documenting and ensuring that the cache is running at all times. I don't think it's a P0 right now.
I'll update our internal documentation for this.
We ran into a lot of resource leakage issues lately, which caused the tests to fail at the end:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-release-1.2/822/console
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-release-1.2/820/console
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-slow/3599/console
...