e2e flaky: Connection refused errors [Errno 111] on jenkins #23545
820:
822:
Don't see anything like that for 3599, but I bet it was a similar issue.
Connection refused is almost certainly a symptom of the Jenkins CPU being overloaded.
The CI build of gcloud has retries to prevent crashing on connection refused from the metadata server. That was rolled back due to errors that look unrelated. I've re-installed the CI build, so these crashes should at least be less frequent. See #23544 for more context.
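(Not gcloud's actual code -- just a minimal sketch of what retry-on-connection-refused looks like; the token URL, attempt count, and back-off policy below are illustrative assumptions.)

```python
import errno
import time
import urllib.error
import urllib.request

# Illustrative metadata URL; gcloud's real retry logic lives elsewhere.
METADATA_TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                      "instance/service-accounts/default/token")

def fetch_metadata_with_retry(url=METADATA_TOKEN_URL, attempts=5):
    """Fetch a metadata value, retrying only transient 'connection refused' errors."""
    for attempt in range(attempts):
        try:
            req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.read()
        except urllib.error.URLError as e:
            refused = getattr(e.reason, "errno", None) == errno.ECONNREFUSED
            if not refused or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying errno 111
```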
Ping - this is still happening quite frequently, e.g.:
@jlowdermilk - do you know if rolling back to the 'old' gcloud version would solve the problem?
@gmarek @jlowdermilk @ixdy the core problem is that we are overloading the Jenkins master. We need to start running jobs on slaves.
I think it's a combination of Jenkins master overloading and GCE changes that made the metadata server a bit more flaky under load. Next week's release should see some improvement, but we still need to run jobs on slaves.
It's nearly constantly failing for Kubemarks:
@jlowdermilk and I have been talking to the GCE team about this. In parallel I have been working on a hacky server to cache requests to the metadata server on our Jenkins VMs, but it looks like this will not work because of https://github.com/GoogleCloudPlatform/gcloud-golang/blob/master/compute/metadata/metadata.go#L208
Looks like I can work around this by adding multiple entries for metadata.google.internal to /etc/hosts. PR sent out for review, and this is enabled on Jenkins for the time being.
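(For illustration only -- the addresses and ordering below are assumptions about what the workaround might look like, with the local cache listed first and the real metadata address kept as a fallback; see the actual PR for the real entries.)

```python
# Illustrative only -- the actual change may use different addresses/ordering.
# Idea: metadata.google.internal resolves to the local cache first, with the
# real GCE metadata address (169.254.169.254) kept as a fallback entry.
HOSTS_ENTRIES = [
    "127.0.0.1\tmetadata.google.internal",        # assumed local caching proxy
    "169.254.169.254\tmetadata.google.internal",  # real metadata server
]

with open("/etc/hosts", "a") as hosts:  # needs root
    hosts.write("\n".join(HOSTS_ENTRIES) + "\n")
```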
@fejta - awesome!!! [even if this just reduces flakiness rather than eliminating it - what is happening currently is just a disaster]
BOOM! We are back up and merging: http://submit-queue.k8s.io/#/e2e The cache reduces load on the metadata server from ~20 qps to ~0 qps (1 query per ten minutes or so), and it also retries intermittent flakes when refreshing the access token via the metadata server (since we know we're on GCE it is safe to retry errno 111, which metadata.go and gcloud cannot do for perf reasons for regular users). So I expect this will eliminate the errno 111 errors (no impact on other flakes).
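(A minimal sketch of the idea, not the actual Jenkins cache -- the port, TTL, and retry policy here are assumptions: serve repeated metadata requests from a local in-memory cache, so the real server sees almost no traffic, and retry the upstream fetch only on errno 111.)

```python
import errno
import http.server
import time
import urllib.error
import urllib.request

UPSTREAM = "http://169.254.169.254"  # real GCE metadata server
TTL = 600                            # assumed: refresh roughly every ten minutes
RETRIES = 4
_cache = {}                          # path -> (expires_at, body)

class CacheHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        entry = _cache.get(self.path)
        if entry and entry[0] > time.time():
            return self._reply(200, entry[1])          # cache hit: no upstream call
        for attempt in range(RETRIES):
            try:
                req = urllib.request.Request(UPSTREAM + self.path,
                                             headers={"Metadata-Flavor": "Google"})
                with urllib.request.urlopen(req, timeout=5) as resp:
                    body = resp.read()
                _cache[self.path] = (time.time() + TTL, body)
                return self._reply(200, body)
            except urllib.error.URLError as e:
                if (getattr(e.reason, "errno", None) == errno.ECONNREFUSED
                        and attempt < RETRIES - 1):
                    time.sleep(2 ** attempt)           # retry the errno 111 flake
                    continue
                return self._reply(502, str(e).encode())

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("127.0.0.1", 8080), CacheHandler).serve_forever()
```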
So this was fixed all day until we migrated stuff over to Docker, which caused problems because DNS works differently. I'm rolling that back in #24176
@fejta - it seems that the rollback didn't help; it is still failing a lot.
Yes, that's a good idea.
@ixdy @fejta @spxtr @dchen1107 Drive-by comment. Should we perhaps disable the metadata cache until it's more stable? How much would that slow things down?
@quinton-hoole It's not that the metadata cache isn't stable, it's that we don't check to make sure that it's up before starting a job, and when Jenkins VMs restart we don't automatically start it. Without it, every job fails deterministically.
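(Illustrative pre-job check, assuming the cache is reached via metadata.google.internal through the /etc/hosts entries above -- a job wrapper could fail fast when the cache isn't answering instead of letting the run fail deterministically.)

```python
# Illustrative pre-job check (URL and timeout are assumptions): fail fast if
# the local metadata cache is not answering before kicking off an e2e run.
import sys
import urllib.request

CACHE_HEALTH_URL = "http://metadata.google.internal/computeMetadata/v1/"

def cache_is_up(url=CACHE_HEALTH_URL, timeout=3):
    try:
        req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except OSError:  # URLError is a subclass of OSError
        return False

if __name__ == "__main__":
    if not cache_is_up():
        sys.exit("metadata cache is not responding; start it before running jobs")
```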
@quinton-hoole I tried turning off the metadata cache a few weeks ago (using the metadata server directly), but the metadata server itself is still not stable enough for our use.
Happened again for me: https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/26914/kubernetes-pull-build-test-e2e-gce/43517/
Also happened to me - reopening.
This is blocking a lot of PRs and the merge queue is completely blocked. What are the instructions for restarting this cache?
I have manually started the metadata server cache.
Can we document it somewhere?
Ideally we should fix the metadata server cache so it reliably restarts when the VM restarts.
Documenting it before we're there sounds like the way to go.
What remains before we can close this P0? Is it just documenting how to restart the cache?
Documenting and ensuring that the cache is running at all times. I don't think it's a P0 right now.
I'll update our internal documentation for this.
We ran into a lot of resource leakage issues lately, which caused the tests to fail at the end:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-release-1.2/822/console
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-release-1.2/820/console
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-slow/3599/console
...