
Build error: dial tcp: i/o timeout #5796

Closed
xelfe opened this issue Nov 9, 2015 · 27 comments

@xelfe commented Nov 9, 2015

I've been getting this on any build/deploy since my fresh install this weekend.

Build log:
cleanup.go:23] Removing temporary directory /tmp/s2i-build539997888
fs.go:99] Removing directory '/tmp/s2i-build539997888'
builder.go:55] Build error: dial tcp: i/o timeout

Event:
Error syncing pod, skipping: failed to delete containers ([exit status 1])

I tried with CentOS 7.1 and Fedora 21 as the host and I always get the same result. I used the Ansible deployment and followed each step of the "advanced installation"; I never had this problem before last week.

@bparees (Contributor) commented Nov 10, 2015

Can you please provide the full build logs? Even better would be verbose build logs:
https://docs.openshift.org/latest/dev_guide/builds.html#accessing-build-logs
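For example (BUILD_LOGLEVEL is the build verbosity variable described on that page; the config name below comes from the logs in this thread, the build name is illustrative, and the exact oc subcommand depends on your version):

oc env bc/nodejs-example BUILD_LOGLEVEL=5   # bump build verbosity on the BuildConfig
oc start-build nodejs-example
oc logs -f build/nodejs-example-3           # or 'oc build-logs <name>' on older releases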

If you docker run an image on your node, does it have network connectivity?
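A quick sanity check, e.g. (any image with curl will do):

docker run --rm centos:7 curl -sI https://github.com   # should print HTTP response headers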

Can you also provide the logs from your failed deploy?

Is this a multinode or single node deployment?

@xelfe (Author) commented Nov 10, 2015

This is what I get:

I1110 17:38:25.267262 1 sti.go:163] The value of ALLOWED_UIDS is [1-]
I1110 17:38:25.291097 1 docker.go:213] Image registry.access.redhat.com/openshift3/nodejs-010-rhel7:latest available locally
I1110 17:38:25.291182 1 sti.go:185] Creating a new S2I builder with build config: "Builder Name:\t\tNode.js 0.10\nBuilder Image:\t\tregistry.access.redhat.com/openshift3/nodejs-010-rhel7:latest\nSource:\t\t\tfile:///tmp/s2i-build526253021/upload/src\nOutput Image Tag:\t172.30.58.78:5000/test/nodejs-example:latest\nEnvironment:\t\tOPENSHIFT_BUILD_SOURCE=https://github.com/openshift/nodejs-ex.git,OPENSHIFT_BUILD_NAME=nodejs-example-2,OPENSHIFT_BUILD_NAMESPACE=test\nIncremental Build:\tdisabled\nRemove Old Build:\tdisabled\nForce Pull:\t\tdisabled\nQuiet:\t\t\tdisabled\nLayered Build:\t\tdisabled\nWorkdir:\t\t/tmp/s2i-build526253021\nDocker NetworkMode:\tcontainer:9c1987ed78b70824e219a1b3bc2862c4c883c9e4a042fbe84a85007498e1612b\nDocker Endpoint:\tunix:///var/run/docker.sock\n"
I1110 17:38:25.312374 1 docker.go:213] Image registry.access.redhat.com/openshift3/nodejs-010-rhel7:latest available locally
I1110 17:38:25.362679 1 sti.go:136] Preparing to build 172.30.58.78:5000/test/nodejs-example:latest
I1110 17:38:41.366156 1 cleanup.go:23] Removing temporary directory /tmp/s2i-build526253021
I1110 17:38:41.366186 1 fs.go:99] Removing directory '/tmp/s2i-build526253021'
F1110 17:38:41.366493 1 builder.go:59] Build error: dial tcp: i/o timeout

I use 1 master and 1 node. The registry and the router were fine for the deployment.

@xelfe (Author) commented Nov 10, 2015

So, I've tried a minimal install with CentOS 7 and Fedora 21 server images using the "advanced installation" method, and both fail. This morning I gave the fedora-cloud-21 image a try and everything works fine.

@bparees (Contributor) commented Nov 10, 2015

Thanks for the update; sounds like an issue (or a config/usage problem) in the installer, then.

@benbarclay:

Also seeing this behaviour when following the Advanced Install method to spin up a dev cluster.

@detiber commented Nov 30, 2015

@bparees Any idea what the builder is trying to communicate with at the moment it gets the TCP timeout error? Without knowing that, tracking down the installation/configuration issue may be difficult.

@bparees (Contributor) commented Nov 30, 2015

It's generally trying to do the git clone from the source repo, so probably an HTTPS call to GitHub.

@bparees (Contributor) commented Nov 30, 2015

This is often indicative of a networking issue with the SDN that is preventing external access.

@sbadakhc commented Dec 3, 2015

Is there any more information about this? I'm seeing it for the first time on a vanilla installation on RHEL 7.

@shawndwells:

Piling on here. Receiving this error when following the OpenShift training, specifically on the Sinatra lab:
https://github.com/openshift/training/blob/master/08-S2I-Introduction.md

@bit4man commented Dec 13, 2015

Same error when trying to create a CakePHP app.

@MacThrawn:

Since we migrated our test cluster (3 nodes, 2 etcd) from 1.0.7 to 1.1.0.1 we cannot build because of this issue:
F1214 08:15:59.809202 1 builder.go:59] Build error: dial tcp: i/o timeout

@bparees (Contributor) commented Dec 14, 2015

@eparis can someone from the networking team help out here?

@talset commented Dec 14, 2015

I often have the same issue, both from a fresh install and on a running platform hosted on AWS.
I tried to restart openshift-node and the SDN:

systemctl stop atomic-openshift-node
rm -rf /run/openshift-sdn
systemctl stop docker
systemctl restart iptables
systemctl restart openvswitch
systemctl start atomic-openshift-node

But it's not working. I thought it might be related to the MTU, so I tried decreasing the MTU, but the issue appeared again.
I also saw a lot of incomplete ARP entries.
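One quick comparison to try (the interface name and config path are assumptions for an Ansible-installed node):

ip link show tun0 | grep -o 'mtu [0-9]*'        # effective SDN MTU
grep -i mtu /etc/origin/node/node-config.yaml   # mtu configured under networkConfig
ip neigh show | grep -i incomplete              # lists the incomplete ARP entries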

The only fix is to reboot my instances each time it appears.

If someone has an idea, please share ;)

@eparis (Member) commented Dec 16, 2015

@danwinship Can you take a look here?

eparis assigned danwinship and unassigned detiber, Dec 16, 2015
@danwinship (Contributor):

Can someone seeing this bug try running https://raw.githubusercontent.com/openshift/openshift-sdn/master/hack/debug.sh and then upload the output somewhere? You need to run it from a host that has a valid KUBECONFIG and that can ssh as root to the master and each node.
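Roughly (the KUBECONFIG path is an assumption; check the script header for any arguments it expects):

export KUBECONFIG=/etc/origin/master/admin.kubeconfig
curl -sO https://raw.githubusercontent.com/openshift/openshift-sdn/master/hack/debug.sh
bash debug.sh > sdn-debug.out 2>&1   # run from a host with root ssh to the master and each node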

@abhat (Contributor) commented Dec 18, 2015

@danwinship I'm seeing the same error with OSE 3.1. I made sure pods have external connectivity by pinging GitHub and such.

@abhat (Contributor) commented Dec 21, 2015

@liggitt @danwinship do you think #6418 is related to this problem?

@liggitt (Contributor) commented Dec 21, 2015

No, #6418 was shortening an excessively long dial timeout solely for the image import controller (used to import image stream tags from a docker registry). It has no bearing on builds or any other dial timeouts.

@PiotrKlimczak:

Same problem here, using the latest build: Origin deployed with Ansible on CentOS 7.1.
My pods do have access to the internet; I tested with curl using both IP addresses and domain names.
I was also able to git clone a GitHub repo from a pod.
Tested on the docker registry pod.
So it doesn't seem to be SDN related, from my understanding.

Anyway, the git clone command ended in a somewhat strange way:

bash-4.2$ git clone https://github.com/openshift/training.git
Cloning into 'training'...
remote: Counting objects: 3234, done.
remote: Total 3234 (delta 0), reused 0 (delta 0), pack-reused 3234
Receiving objects: 100% (3234/3234), 1.61 MiB | 2.22 MiB/s, done.
Resolving deltas: 100% (2059/2059), done.
fatal: unable to look up current user in the passwd file: no such user
Unexpected end of command stream

I also noticed that I have no SkyDNS pod installed by default, which I was expecting to be present... should I? Name resolution takes about 10 s. Could that be a problem?

Also, what is strange: the log says it's a timeout, but there are just a few milliseconds since the previous log lines, whereas in case of a timeout I would expect the log entry to appear after some pause.
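A quick way to measure that delay from inside a pod (command choice is just a suggestion):

time getent hosts github.com   # ~10 s here usually means the first nameserver is not answering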

@danwinship (Contributor):

> I also noticed that I have no SkyDNS pod installed by default, which I was expecting to be present... should I? Name resolution takes about 10 s. Could that be a problem?
>
> Also, what is strange: the log says it's a timeout, but there are just a few milliseconds since the previous log lines, whereas in case of a timeout I would expect the log entry to appear after some pause.

Ignore the two "removing directory" lines; they presumably happen after the timeout occurs but before it gets logged. So:

I1110 17:38:25.362679 1 sti.go:136] Preparing to build 172.30.58.78:5000/test/nodejs-example:latest
F1110 17:38:41.366493 1 builder.go:59] Build error: dial tcp: i/o timeout

So the timeout is 16 seconds... if it needed to do two DNS lookups in there, and DNS lookups are taking 10 seconds each, then that might be the problem.
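For context (an assumption about the pods' resolver, but consistent with the ~10 s lookups reported above): glibc waits timeout:5 seconds before falling through to the next nameserver, and a lookup often issues more than one query, so a dead first nameserver easily costs around 10 s per lookup. A debugging-only mitigation in the pod's /etc/resolv.conf:

options timeout:1 attempts:2   # illustrative values; shortens the dead-nameserver penalty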

@danwinship (Contributor):

> My pods do have access to the internet; I tested with curl using both IP addresses and domain names.
> I was also able to git clone a GitHub repo from a pod.
> Tested on the docker registry pod.
> So it doesn't seem to be SDN related, from my understanding.

Can the master reach the pods by both name and IP? Can pods reach other pods by name and IP?

If so, then yeah, definitely seems non-SDN-related.
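A minimal sketch of those checks (the pod IP is a placeholder; the registry service name assumes the default cluster.local DNS suffix):

oc get pods -o wide                    # note a pod IP, e.g. 10.1.0.5
ping -c1 10.1.0.5                      # from the master: pod by IP (placeholder)
curl -sI http://docker-registry.default.svc.cluster.local:5000/   # service by name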

@akram (Contributor) commented Jan 5, 2016

I have the same issue here, and I noticed that the DNS configuration of the container is weird:

$ more /etc/resolv.conf
nameserver 172.30.0.1
nameserver 10.38.5.26
nameserver 10.11.5.19
....

If I quickly jump into the container and try a curl, DNS resolution does not work.
But if I delete the first entry, DNS resolution seems OK and my curl from the build container works.
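Roughly what that test looks like (the container name is a placeholder, and the sed line is a destructive, debugging-only edit):

docker exec -it <build-container> bash   # <build-container> is a placeholder
curl -sI https://github.com              # fails while 172.30.0.1 is listed first
sed -i '1d' /etc/resolv.conf             # drop the first (service IP) nameserver
curl -sI https://github.com              # now resolves and succeeds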

@danwinship (Contributor):

So service IP addresses are failing; this is probably the same bug as openshift/openshift-sdn#231

@akram (Contributor) commented Jan 5, 2016

@xelfe can you check your openshift-master logs:

journalctl -u atomic-openshift-master -f

I realised that I was facing the following error:

Jan 05 15:32:16 localhost.localdomain atomic-openshift-master[45927]: F0105 15:32:16.837384   45927 flatsdn.go:27] SDN initialization failed: Failed to obtain IP address from node name: localhost.localdomain
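A hedged fix for that particular error is to give the host a resolvable hostname before restarting the master (the FQDN below is illustrative):

hostnamectl set-hostname master1.example.com   # illustrative FQDN
# make sure it resolves (via DNS or /etc/hosts), then:
systemctl restart atomic-openshift-master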

@abhat (Contributor) commented Jan 6, 2016

In my setup the issue was related to DNS. Kube's DNS was not starting correctly because of another dnsmasq instance on the master that was running to serve the nodes of the cluster. After moving that DNS server to another (non-cluster) node, things started working well.
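A quick way to spot that kind of conflict (run on the master; requires root):

ss -lnup | grep ':53 '   # shows which process is bound to the DNS port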

@danwinship (Contributor):

> So service IP addresses are failing; this is probably the same bug as openshift/openshift-sdn#231

Actually it looks like that is something different; this bug was openshift/openshift-sdn#236, which is now fixed in Origin via #6532, so this can be closed.
