
e2e flaky: Connection refused errors [Errno 111] on jenkins #23545

Closed
dchen1107 opened this issue Mar 28, 2016 · 46 comments

Labels
area/test-infra kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@dchen1107 dchen1107 added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/test-infra kind/flake Categorizes issue or PR as related to a flaky test. labels Mar 28, 2016
@ixdy
Member

ixdy commented Mar 30, 2016

820:

ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno 111] Connection refused>

822:

ERROR: gcloud crashed (error): [Errno 111] Connection refused

Don't see anything like that for 3599, but I bet it was a similar issue.

@fejta fejta assigned fejta and ixdy and unassigned fejta Mar 30, 2016
@fejta
Contributor

fejta commented Mar 30, 2016

Connection refused is almost certainly a symptom of the Jenkins CPU being overloaded.

@fejta fejta changed the title e2e flaky: Google Cloud Platform resources leaked while running tests e2e flaky: Connection refused errors [Errno 111] on jenkins Mar 30, 2016
@gmarek
Contributor

gmarek commented Mar 31, 2016

cc @bprashanth @jlowdermilk

@j3ffml
Contributor

j3ffml commented Mar 31, 2016

The CI build of gcloud has retries to prevent crashing on connection refused from the metadata server. That build was rolled back due to errors that looked unrelated. I've re-installed the CI build, so these crashes should at least be less frequent. See #23544 for more context.

@wojtek-t
Member

wojtek-t commented Apr 5, 2016

ping - this is still happening quite frequently, e.g.:
http://kubekins.dls.corp.google.com/view/Scalability/job/kubernetes-kubemark-5-gce/926/console

@gmarek
Contributor

gmarek commented Apr 5, 2016

@jlowdermilk - do you know if rolling back to the 'old' gcloud version would solve the problem?

@fejta
Contributor

fejta commented Apr 7, 2016

@gmarek @jlowdermilk @ixdy the core problem is that we are overloading the jenkins master. We need to start running jobs on slaves.

@j3ffml
Contributor

j3ffml commented Apr 7, 2016

I think it's a combination of the Jenkins master being overloaded and GCE changes that made the metadata server a bit more flaky under load. Next week's release should see some improvement, but we still need to run jobs on slaves.

@gmarek
Contributor

gmarek commented Apr 9, 2016

It's nearly constantly failing for Kubemarks:

06:35:28 gcloud docker push gcr.io/k8s-jenkins-kubemark/kubemark
06:35:28 ERROR: gcloud crashed (error): [Errno 111] Connection refused

@fejta fejta assigned fejta and unassigned ixdy Apr 12, 2016
@fejta
Contributor

fejta commented Apr 12, 2016

@jlowdermilk and I have been talking to the GCE team about this. In parallel I have been working on a hacky server to cache requests to the metadata server on our jenkins VMs, but it looks like this will not work because of https://github.com/GoogleCloudPlatform/gcloud-golang/blob/master/compute/metadata/metadata.go#L208
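
For illustration, a minimal sketch of the kind of caching server described above: a local HTTP proxy that forwards metadata requests to the real metadata server, remembers the responses, and serves them from memory afterwards. The listen address, port, and TTL below are assumptions for this sketch, not the actual jenkins/metadata-cache implementation, and a real cache would have to respect access-token expiry rather than a fixed TTL.

import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

METADATA_IP = "169.254.169.254"  # the real GCE metadata server
CACHE_TTL = 600                  # seconds between upstream refreshes (assumed)
_cache = {}                      # request path -> (fetch time, response body)

class CachingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        entry = _cache.get(self.path)
        if entry is None or time.time() - entry[0] > CACHE_TTL:
            # Cache miss or stale: fetch once from the real metadata server.
            req = urllib.request.Request(
                "http://" + METADATA_IP + self.path,
                headers={"Metadata-Flavor": "Google"})
            body = urllib.request.urlopen(req, timeout=5).read()
            _cache[self.path] = (time.time(), body)
        else:
            body = entry[1]
        self.send_response(200)
        # Metadata clients expect this header on every response.
        self.send_header("Metadata-Flavor", "Google")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8111), CachingHandler).serve_forever()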

@fejta
Contributor

fejta commented Apr 12, 2016

Looks like I can work around this by adding multiple entries for metadata.google.internal to /etc/hosts. A PR is out for review, and this is enabled on Jenkins for the time being.

@wojtek-t
Member

@fejta - awesome!!!
Thanks a lot for working on it!

[even if this just reduces flakiness rather than eliminating it - what is happening currently is just a disaster]

@fejta
Contributor

fejta commented Apr 12, 2016

BOOM! We are back up and merging: http://submit-queue.k8s.io/#/e2e

The cache reduces load on the metadata server from ~20 qps to ~0 qps (about one query per ten minutes), and it also retries intermittent flakes when refreshing the access token via the metadata server (since we know we're on GCE it is safe to retry errno 111, which metadata.go and gcloud cannot do for perf reasons for regular users). So I expect this will eliminate the errno 111 errors (no impact on other flakes).
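
As an illustration of the retry behaviour described here, a hedged sketch; the function name, retry count, and delay are assumptions, not the actual cache code:

import http.client
import time

def fetch_metadata(path, retries=5, delay=1.0):
    # Retry only "connection refused" ([Errno 111]), which is safe to retry
    # once we already know we are running on GCE.
    for attempt in range(retries):
        try:
            conn = http.client.HTTPConnection("metadata.google.internal", timeout=5)
            conn.request("GET", path, headers={"Metadata-Flavor": "Google"})
            return conn.getresponse().read()
        except ConnectionRefusedError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# e.g. refreshing the service-account access token, the request that was
# hitting errno 111 in these jobs:
# fetch_metadata("/computeMetadata/v1/instance/service-accounts/default/token")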

@fejta
Contributor

fejta commented Apr 13, 2016

So this was fixed all day until we migrated stuff over to Docker, which caused problems because DNS works differently there. I'm rolling that back in #24176.

@wojtek-t
Member

@fejta - it seems that the rollback didn't help - it is still failing a lot

@spxtr
Contributor

spxtr commented Jun 3, 2016

Yes, that's a good idea.

@ghost

ghost commented Jun 4, 2016

@ixdy @fejta @spxtr @dchen1107 Drive-by comment. Should we perhaps disable the metadata cache until it's more stable? How much would that slow things down?

@spxtr
Contributor

spxtr commented Jun 4, 2016

@quinton-hoole It's not that the metadata cache isn't stable; it's that we don't check that it's up before starting a job, and we don't automatically restart it when Jenkins VMs restart. Without it, every job fails deterministically.
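
A minimal sketch of the kind of pre-job check being described, assuming the job wrapper can simply probe the metadata endpoint and abort before the job starts if nothing is answering (the probe path and timeout are assumptions):

import http.client
import sys

def metadata_is_serving(host="metadata.google.internal", timeout=3):
    try:
        conn = http.client.HTTPConnection(host, timeout=timeout)
        conn.request("GET", "/computeMetadata/v1/", headers={"Metadata-Flavor": "Google"})
        return conn.getresponse().status == 200
    except OSError:
        return False

if __name__ == "__main__":
    # Exit non-zero so a job wrapper can abort instead of failing later.
    sys.exit(0 if metadata_is_serving() else 1)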

@ixdy
Member

ixdy commented Jun 4, 2016

@quinton-hoole I tried turning off the metadata cache a few weeks ago (using the metadata server directly), but the metadata server itself is still not stable enough for our use.

@nikhiljindal
Contributor

nikhiljindal commented Jun 7, 2016

Happened again for me: https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/26914/kubernetes-pull-build-test-e2e-gce/43517/

ERROR: gcloud crashed (error): [Errno 111] Connection refused

@wojtek-t
Member

wojtek-t commented Jun 7, 2016

Also happened to me - reopening.

@fgrzadkowski
Contributor

This is blocking a lot of PRs, and the merge queue is completely stuck. What are the instructions for restarting this cache?

@fgrzadkowski
Contributor

I have manually started the metadata server for pull-kubernetes-master. We'll see if this helps. For future reference, I used the following command:

k8s.io/test-infra/jenkins/metadata-cache$ ./metadata-cache-control.sh remote_update pull-jenkins-master

@wojtek-t
Member

wojtek-t commented Jun 7, 2016

Can we document it somewhere?

@ixdy
Member

ixdy commented Jun 7, 2016

ideally we should fix the metadata server cache so it reliably restarts when the VM restarts.

@gmarek
Contributor

gmarek commented Jun 7, 2016

Documenting it before we get there sounds like the way to go.

@goltermann
Contributor

What remains before we can close this P0? Is it just documenting how to restart the cache?

@spxtr spxtr added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Jun 8, 2016
@spxtr
Contributor

spxtr commented Jun 8, 2016

Documenting and ensuring that the cache is running at all times. I don't think it's a P0 right now.

@fejta
Contributor

fejta commented Jun 9, 2016

I'll update our internal documentation for this.
