
Build error: dial tcp: i/o timeout #5796

Closed
xelfe opened this issue Nov 9, 2015 · 27 comments

@xelfe commented Nov 9, 2015

I've been getting this on any build/deploy since my fresh install this weekend.

Build log:
cleanup.go:23] Removing temporary directory /tmp/s2i-build539997888
fs.go:99] Removing directory '/tmp/s2i-build539997888'
builder.go:55] Build error: dial tcp: i/o timeout

Event:
Error syncing pod, skipping: failed to delete containers ([exit status 1])

I tried with CentOS 7.1 and Fedora 21 as the host and I always get the same result. I used the Ansible deployment and followed each step of the "advanced installation"; I never had this problem before last week.

@bparees (Contributor) commented Nov 10, 2015

Can you please provide the full build logs? Even better would be verbose build logs:
https://docs.openshift.org/latest/dev_guide/builds.html#accessing-build-logs
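For example (BUILD_LOGLEVEL is the build verbosity variable described on that page; the config name below comes from the logs in this thread, the build name is illustrative, and the exact oc subcommand depends on your version):

oc env bc/nodejs-example BUILD_LOGLEVEL=5   # bump build verbosity on the BuildConfig
oc start-build nodejs-example
oc logs -f build/nodejs-example-3           # or 'oc build-logs <name>' on older releases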

If you docker run an image on your node, does it have network connectivity?
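A quick sanity check, e.g. (any image with curl will do):

docker run --rm centos:7 curl -sI https://github.com   # should print HTTP response headers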

Can you also provide the logs from your failed deploy?

Is this a multinode or single node deployment?

@xelfe (Author) commented Nov 10, 2015

This is what I get:

I1110 17:38:25.267262 1 sti.go:163] The value of ALLOWED_UIDS is [1-]
I1110 17:38:25.291097 1 docker.go:213] Image registry.access.redhat.com/openshift3/nodejs-010-rhel7:latest available locally
I1110 17:38:25.291182 1 sti.go:185] Creating a new S2I builder with build config: "Builder Name:\t\tNode.js 0.10\nBuilder Image:\t\tregistry.access.redhat.com/openshift3/nodejs-010-rhel7:latest\nSource:\t\t\tfile:///tmp/s2i-build526253021/upload/src\nOutput Image Tag:\t172.30.58.78:5000/test/nodejs-example:latest\nEnvironment:\t\tOPENSHIFT_BUILD_SOURCE=https://github.com/openshift/nodejs-ex.git,OPENSHIFT_BUILD_NAME=nodejs-example-2,OPENSHIFT_BUILD_NAMESPACE=test\nIncremental Build:\tdisabled\nRemove Old Build:\tdisabled\nForce Pull:\t\tdisabled\nQuiet:\t\t\tdisabled\nLayered Build:\t\tdisabled\nWorkdir:\t\t/tmp/s2i-build526253021\nDocker NetworkMode:\tcontainer:9c1987ed78b70824e219a1b3bc2862c4c883c9e4a042fbe84a85007498e1612b\nDocker Endpoint:\tunix:///var/run/docker.sock\n"
I1110 17:38:25.312374 1 docker.go:213] Image registry.access.redhat.com/openshift3/nodejs-010-rhel7:latest available locally
I1110 17:38:25.362679 1 sti.go:136] Preparing to build 172.30.58.78:5000/test/nodejs-example:latest
I1110 17:38:41.366156 1 cleanup.go:23] Removing temporary directory /tmp/s2i-build526253021
I1110 17:38:41.366186 1 fs.go:99] Removing directory '/tmp/s2i-build526253021'
F1110 17:38:41.366493 1 builder.go:59] Build error: dial tcp: i/o timeout

I use 1 master and 1 node. The registry and the router were fine for the deployment.

@xelfe (Author) commented Nov 10, 2015

So, I've tried a minimal install with CentOS 7 and Fedora 21 server images using the "advanced installation" method, and both fail. This morning I gave the fedora-cloud-21 image a try and everything works fine.

@bparees (Contributor) commented Nov 10, 2015

Thanks for the update; sounds like an issue (or a config/usage problem) in the installer, then.

@benbarclay:

Also seeing this behaviour when following the Advanced Install method to spin up a dev cluster.

@detiber commented Nov 30, 2015

@bparees Any idea what the builder is trying to communicate with at the moment it gets the TCP timeout error? Without knowing that, tracking down the installation/configuration issue may be difficult.

@bparees (Contributor) commented Nov 30, 2015

It's generally trying to do the git clone from the source repo, so probably an HTTPS call to GitHub.

@bparees (Contributor) commented Nov 30, 2015

This is often indicative of a networking issue with the SDN that is preventing external access.

@sbadakhc commented Dec 3, 2015

Is there any more information about this? I'm seeing it for the first time on a vanilla installation on RHEL 7.

@shawndwells:

Piling on here. Receiving this error when following the OpenShift training, specifically on the Sinatra lab:
https://github.com/openshift/training/blob/master/08-S2I-Introduction.md

@bit4man commented Dec 13, 2015

Same error when trying to create a CakePHP app.

@MacThrawn:

Since we migrated our test cluster (3 nodes, 2 etcd) from 1.0.7 to 1.1.0.1 we cannot build because of this issue:
F1214 08:15:59.809202 1 builder.go:59] Build error: dial tcp: i/o timeout

@bparees (Contributor) commented Dec 14, 2015

@eparis can someone from the networking team help out here?

@talset commented Dec 14, 2015

I often have the same issue, both from a fresh install and on a running platform hosted on AWS.
I tried to restart openshift-node and the SDN:

systemctl stop atomic-openshift-node
rm -rf /run/openshift-sdn
systemctl stop docker
systemctl restart iptables
systemctl restart openvswitch
systemctl start atomic-openshift-node

But it's not working. I thought it might be related to the MTU, so I tried decreasing the MTU, but the issue appeared again.
I also saw a lot of incomplete ARP entries.
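One quick comparison to try (the interface name and config path are assumptions for an Ansible-installed node):

ip link show tun0 | grep -o 'mtu [0-9]*'        # effective SDN MTU
grep -i mtu /etc/origin/node/node-config.yaml   # mtu configured under networkConfig
ip neigh show | grep -i incomplete              # lists the incomplete ARP entries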

The only fix is to reboot my instances each time it appears.

If someone has an idea, please share ;)

@eparis (Member) commented Dec 16, 2015

@danwinship Can you take a look here?

eparis assigned danwinship and unassigned detiber, Dec 16, 2015
@danwinship (Contributor):

Can someone seeing this bug try running https://raw.githubusercontent.com/openshift/openshift-sdn/master/hack/debug.sh and then upload the output somewhere? You need to run it from a host that has a valid KUBECONFIG and that can ssh as root to the master and each node.
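Roughly (the KUBECONFIG path is an assumption; check the script header for any arguments it expects):

export KUBECONFIG=/etc/origin/master/admin.kubeconfig
curl -sO https://raw.githubusercontent.com/openshift/openshift-sdn/master/hack/debug.sh
bash debug.sh > sdn-debug.out 2>&1   # run from a host with root ssh to the master and each node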

@abhat (Contributor) commented Dec 18, 2015

@danwinship I'm seeing the same error with OSE 3.1. I made sure pods have external connectivity by pinging GitHub and such.

@abhat (Contributor) commented Dec 21, 2015

@liggitt @danwinship do you think #6418 is related to this problem?

@liggitt (Contributor) commented Dec 21, 2015

No, #6418 was shortening an excessively long dial timeout solely for the image import controller (used to import image stream tags from a docker registry). It has no bearing on builds or any other dial timeouts.

@PiotrKlimczak:

Same problem here, using the latest build: Origin deployed with Ansible on CentOS 7.1.
My pods do have access to the internet; I tested with curl using both IP addresses and domain names.
I was also able to git clone a GitHub repo from a pod.
Tested on the docker registry pod.
So it doesn't seem to be SDN related, from my understanding.

Anyway, the git clone command ended in a somewhat strange way:

bash-4.2$ git clone https://github.com/openshift/training.git
Cloning into 'training'...
remote: Counting objects: 3234, done.
remote: Total 3234 (delta 0), reused 0 (delta 0), pack-reused 3234
Receiving objects: 100% (3234/3234), 1.61 MiB | 2.22 MiB/s, done.
Resolving deltas: 100% (2059/2059), done.
fatal: unable to look up current user in the passwd file: no such user
Unexpected end of command stream

I also noticed that I have no SkyDNS pod installed by default, which I was expecting to be present... should I? Name resolution takes about 10 s. Could that be a problem?

Also, what is strange: the log says it's a timeout, but there are just a few milliseconds since the previous log lines, whereas in case of a timeout I would expect the log entry to appear after some pause.
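A quick way to measure that delay from inside a pod (command choice is just a suggestion):

time getent hosts github.com   # ~10 s here usually means the first nameserver is not answering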

@danwinship (Contributor):

> I also noticed that I have no SkyDNS pod installed by default, which I was expecting to be present... should I? Name resolution takes about 10 s. Could that be a problem?
>
> Also, what is strange: the log says it's a timeout, but there are just a few milliseconds since the previous log lines, whereas in case of a timeout I would expect the log entry to appear after some pause.

Ignore the two "removing directory" lines; they presumably happen after the timeout occurs but before it gets logged. So:

I1110 17:38:25.362679 1 sti.go:136] Preparing to build 172.30.58.78:5000/test/nodejs-example:latest
F1110 17:38:41.366493 1 builder.go:59] Build error: dial tcp: i/o timeout

So the timeout is 16 seconds... if it needed to do two DNS lookups in there, and DNS lookups are taking 10 seconds each, then that might be the problem.
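For context (an assumption about the pods' resolver, but consistent with the ~10 s lookups reported above): glibc waits timeout:5 seconds before falling through to the next nameserver, and a lookup often issues more than one query, so a dead first nameserver easily costs around 10 s per lookup. A debugging-only mitigation in the pod's /etc/resolv.conf:

options timeout:1 attempts:2   # illustrative values; shortens the dead-nameserver penalty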

@danwinship (Contributor):

> My pods do have access to the internet; I tested with curl using both IP addresses and domain names.
> I was also able to git clone a GitHub repo from a pod.
> Tested on the docker registry pod.
> So it doesn't seem to be SDN related, from my understanding.

Can the master reach the pods by both name and IP? Can pods reach other pods by name and IP?

If so, then yeah, definitely seems non-SDN-related.
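A minimal sketch of those checks (the pod IP is a placeholder; the registry service name assumes the default cluster.local DNS suffix):

oc get pods -o wide                    # note a pod IP, e.g. 10.1.0.5
ping -c1 10.1.0.5                      # from the master: pod by IP (placeholder)
curl -sI http://docker-registry.default.svc.cluster.local:5000/   # service by name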

@akram (Contributor) commented Jan 5, 2016

I have the same issue here, and I noticed that the DNS configuration of the container is weird:

$ more /etc/resolv.conf
nameserver 172.30.0.1
nameserver 10.38.5.26
nameserver 10.11.5.19
....

If I quickly jump into the container and try a curl, DNS resolution does not work.
But if I delete the first entry, DNS resolution seems OK and my curl from the build container works.
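Roughly what that test looks like (the container name is a placeholder, and the sed line is a destructive, debugging-only edit):

docker exec -it <build-container> bash   # <build-container> is a placeholder
curl -sI https://github.com              # fails while 172.30.0.1 is listed first
sed -i '1d' /etc/resolv.conf             # drop the first (service IP) nameserver
curl -sI https://github.com              # now resolves and succeeds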

@danwinship (Contributor):

So service IP addresses are failing; this is probably the same bug as openshift/openshift-sdn#231

@akram (Contributor) commented Jan 5, 2016

@xelfe can you check your openshift-master logs:

journalctl -u atomic-openshift-master -f

I realised that I was facing the following error:

Jan 05 15:32:16 localhost.localdomain atomic-openshift-master[45927]: F0105 15:32:16.837384   45927 flatsdn.go:27] SDN initialization failed: Failed to obtain IP address from node name: localhost.localdomain
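A hedged fix for that particular error is to give the host a resolvable hostname before restarting the master (the FQDN below is illustrative):

hostnamectl set-hostname master1.example.com   # illustrative FQDN
# make sure it resolves (via DNS or /etc/hosts), then:
systemctl restart atomic-openshift-master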

@abhat (Contributor) commented Jan 6, 2016

In my setup the issue was related to DNS. Kube's DNS was not starting correctly because of another dnsmasq instance on the master that was running to serve the nodes of the cluster. After moving that DNS server to another (non-cluster) node, things started working well.
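A quick way to spot that kind of conflict (run on the master; requires root):

ss -lnup | grep ':53 '   # shows which process is bound to the DNS port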

@danwinship (Contributor):

> So service IP addresses are failing; this is probably the same bug as openshift/openshift-sdn#231

Actually it looks like that is something different; this bug was openshift/openshift-sdn#236, which is now fixed in Origin via #6532, so this can be closed.
