Ambiguous i/o timeouts #13337

Closed
jzelinskie opened this Issue May 19, 2015 · 49 comments

@jzelinskie

jzelinskie commented May 19, 2015

Description of problem:

I've had numerous people report an issue connecting to registries (on prem, Docker Hub, and Quay.io) that has been quite tricky to track down. It can begin and end at seemingly random times.

$ sudo docker pull ...
FATA[0021] Error response from daemon: v1 ping attempt failed with error: Get https://quay.io/v1/_ping: dial tcp: i/o timeout.
If this private registry supports only HTTP or HTTPS with an unknown CA certificate, please add `--insecure-registry quay.io` to the daemon's arguments.
In the case of HTTPS, if you have access to the registry's CA certificate, no need for the flag; simply place the CA certificate at /etc/docker/certs.d/quay.io/ca.crt

It doesn't matter which API call is made: as long as it needs to reach a registry, docker fails to establish a connection (for every registry except the Docker Hub, the first request goes to the /v1/_ping endpoint). This problem persists despite docker daemon being restarted, but does not persist once the machine has been rebooted. Using curl to hit the endpoint works and dig resolves the domain correctly, yet the docker daemon continues to fail to connect to the registry. This leads me to believe the issue is not related to the DNS cache.

The following data is taken from the last person reported suffering from this issue.

docker version:

$ docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.4.1
Git commit (client): a8a31ef
OS/Arch (client): linux/amd64
Server version: 1.5.0
Server API version: 1.17
Go version (server): go1.4.1
Git commit (server): a8a31ef

docker info:

N/A

uname -a:

Linux Mint

$ uname -a
3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:28:38 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="14.04.2 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.2 LTS"
VERSION_ID="14.04"

Uptime for this box was only a few hours.

Environment details (AWS, VirtualBox, physical, etc.):

I've seen this occur specifically on version 1.5.0, build a8a31ef on Debian, Ubuntu, Amazon Linux via residential connections, GCE, and AWS. I'm not sure that this version is necessarily coupled with the issue, though.

How reproducible:

I haven't been able to personally reproduce the issue.

Steps to Reproduce:

  1. normal docker usage
  2. docker i/o timeouts on commands that interact with registries

Actual Results:

Receive tcp i/o timeouts from a perfectly functioning registry.

Expected Results:

Never receive tcp i/o timeouts from a perfectly functioning registry.

Additional info:

See description.

@thaJeztah

Member

thaJeztah commented May 19, 2015

@stevvooe anything here that sounds familiar to you?

@jzelinskie

jzelinskie commented May 19, 2015

I was also shown this, which may be relevant: golang/go#6336

@stevvooe

Contributor

stevvooe commented May 20, 2015

@thaJeztah There are issues with timeouts in the registry client code within the daemon. We are actually working to make the transport instantiation much cleaner to address this.

@jzelinskie Please take a look at docker/docker-registry#286. It sounds really similar. The main issue was that an IP address update was not picked up by DNS in the docker daemon. Read through the comments and see if the behavior seems similar.

Other than that, there isn't much information to go on here. There is a lot of variation that can cause an io timeout, such as machine load and network load. Without knowing the specific conditions of each failure, this might be a wild goose chase. 🐦

> This problem persists despite docker daemon being restarted, but does not persist once the machine has been rebooted.

Based on this, it seems like we may be dealing with an operating system issue (DNS cache or other problem). While dig and curl seem to work, they may not be using the same resolution path. I'd be interested if another Go project had trouble with DNS resolution when a machine is in this state, while curl still worked. In other words, let's not eliminate DNS from the investigation.

I'd recommend you keep collecting information and attempt to reproduce the issue. When this does happen, try to collect as much information as possible.
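
One way to collect that information, sketched in Python under stated assumptions: time the name lookup and the TCP connect separately for every returned address, so a slow or failing lookup can be told apart from a connect timeout. This goes through the system resolver via `socket.getaddrinfo`, which is not necessarily the path the Go-based daemon takes, so a clean result here does not clear DNS for the daemon. The function names `probe` and `report`, the 5-second default timeout, and the output layout are illustrative assumptions.

```python
import socket
import time


def probe(host, port=443, timeout=5.0):
    """Resolve host, then try a TCP connect to every returned address.

    Returns (lookup_seconds, results), where results is a list of
    (address, ok, seconds, error) tuples, one per distinct address.
    A failed lookup raises socket.gaierror and is not caught here.
    """
    start = time.monotonic()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    lookup_seconds = time.monotonic() - start

    results = []
    seen = set()
    for family, socktype, proto, _canon, sockaddr in infos:
        address = sockaddr[0]
        if address in seen:
            continue
        seen.add(address)
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        begin = time.monotonic()
        try:
            sock.connect(sockaddr)
            ok, error = True, ""
        except OSError as exc:  # covers timeouts and refused connections
            ok, error = False, str(exc)
        finally:
            sock.close()
        results.append((address, ok, time.monotonic() - begin, error))
    return lookup_seconds, results


def report(host, port=443, timeout=5.0):
    """Print one line per address; meant to be run while the daemon is failing."""
    lookup_seconds, results = probe(host, port, timeout)
    print(f"lookup {host}: {lookup_seconds:.3f}s, {len(results)} address(es)")
    for address, ok, seconds, error in results:
        status = "ok" if ok else f"FAILED ({error})"
        print(f"  connect {address}:{port}: {status} in {seconds:.3f}s")
```

Calling `report("quay.io")` on an affected machine, next to a failing `docker pull`, would show whether the time is spent in the lookup or in the connect, and against which address.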

@jzelinskie

jzelinskie commented May 22, 2015

@stevvooe I've read over that issue (and those related) in the past when attempting to help people debug the issue. I've semi-disregarded it because I had reports of people experiencing it on docker versions that were using Go 1.3+.

At this point, I'm not sure if all the different i/o timeout reports are even related to the same root cause. For example, I had someone that was using their own DNS servers watch and see what the daemon was requesting, and they claimed that after successfully resolving quay.io, it also attempted to resolve quay.io.local and they postulated that the failure might be trying to connect to the local address. However, another person receiving the same error flushed their DNS and restarted the daemon and the issue persisted.

Would you mind if I pointed people your way on IRC so we could collaborate in real time to collect information from the next report I receive?
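
The `quay.io.local` lookup described above is consistent with how a stub resolver applies the `search` list from `/etc/resolv.conf`. A minimal Python sketch of that expansion, following the rules documented in resolv.conf(5) (default `ndots` of 1, names ending in a dot are never expanded). The function name `candidate_names` and the `local` search domain are assumptions for illustration, and the daemon's actual resolver may order or stop its queries differently.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the fully qualified names a stub resolver may query, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute; the search list is skipped.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list.
        return [absolute] + expanded
    # Too few dots: try the search list first, then the bare name.
    return expanded + [absolute]


print(candidate_names("quay.io", ["local"]))  # → ['quay.io.', 'quay.io.local.']
```

Later candidates are normally tried only when earlier ones yield no answer, so one possible reading of the report is that some query for plain `quay.io` came back empty before `quay.io.local` was tried. That is a hypothesis, not something the thread confirms.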

@stevvooe

Contributor

stevvooe commented May 22, 2015

@jzelinskie I'm not sure if you read all the way through, but did you see docker/docker-registry#286 (comment)? It is not related to Go 1.3 but to fallback DNS servers. Considering the perceived behavior and that this is related to DNS, it is a very likely candidate.

Considering there are different symptoms, it might not be the right thing to place all these users in the same bucket. It sounds like there is an issue that is resolved by restart and another that is slightly different. It's possible they have the same root cause, but that assumption may hinder your analysis.

Yes, please do point them to IRC if they are experiencing an issue. We'll try to collect the right information.

@mpetazzoni

mpetazzoni commented May 27, 2015

Hi all,

This is a big issue for us. We're seeing very frequent i/o timeout errors when trying to interact with quay.io's Docker registry and it's heavily impacting our continuous deployment and automated deployments.

We're seeing this problem on various Docker daemon versions ranging from 1.3.2 to 1.6.0. We're running on Amazon Linux on AWS in us-east-1 and we don't do anything specific about our DNS setup. But I'm less and less convinced that it has anything to do with DNS, as the name resolution works fine and the resolved IP address we see in the error message is valid and points to quay.io.

We've been trying to troubleshoot this problem with @jzelinskie for a long time now, to no avail. Help would be greatly appreciated.

Let me know if there is any information you want that could help debug the problem. I'm also available on #docker as sam`.

@stevvooe

Contributor

stevvooe commented May 27, 2015

@mpetazzoni I'd recommend dropping into the docker-distribution IRC channel where someone can help with your issue. If you could collect information about when it happens, when it does not happen, the state of the network (are packets being dropped?), round trip time, etc, it might help to divide the problem space. Does it happen only with quay or does it happen with other registries?

Unfortunately, IO timeouts can happen for a number of reasons. Eliminating any possible cause without more information will hinder the investigation.

@dmp42 Mind taking a look?

@mpetazzoni

mpetazzoni commented May 27, 2015

For reference, the error message from the Docker daemon's log:

time="2015-05-27T20:03:56Z" level="error" msg="unable to login against registry endpoint https://quay.io/v1/: Get https://quay.io/v1/users/: dial tcp 23.21.77.53:443: connection timed out"
Get https://quay.io/v1/users/: dial tcp 23.21.77.53:443: connection timed out
time="2015-05-27T20:03:56Z" level="info" msg="-job auth() = ERR (1)"
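
The two error shapes reported so far differ in one detail: the message in the original report reads `dial tcp: i/o timeout` with no address, while the one above names `23.21.77.53:443`. A small Python sketch for sorting collected log lines by that detail. The function name `classify_dial_error` and the labels it returns are illustrative assumptions, and reading the address-less form as "timed out before an address was chosen" is a hypothesis, not something confirmed in this thread.

```python
import re

# Matches Go-style dial errors such as
#   dial tcp 23.21.77.53:443: connection timed out
#   dial tcp: i/o timeout
DIAL_ERROR = re.compile(
    r'dial tcp(?: (?P<addr>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+))?: (?P<reason>[^"\n]+)'
)


def classify_dial_error(line):
    """Return (kind, address, reason) for a log line, or None if no dial error."""
    match = DIAL_ERROR.search(line)
    if match is None:
        return None
    reason = match.group("reason").strip().rstrip(".")
    if match.group("addr"):
        return ("with-address", match.group("addr"), reason)
    return ("no-address", None, reason)
```

Grouping reports by the returned `kind` would be one way to test whether the thread is mixing two different failures.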
@dmp42

Contributor

dmp42 commented May 28, 2015

@mpetazzoni so, the error unfortunately is not descriptive. Like @stevvooe pointed out, different conditions might end up there. It might be the client's fault, the server's fault, or a network issue.

Bottom-line: we need more data.

Figuring out how often this occurs would definitely help (once in ten attempts? more? less?), as would knowing whether it affects other registries (Docker Hub) similarly.

Would you be able to collect network information (while this fails) using tcpdump, or wireshark?

I'll be on irc (#docker-distribution) tomorrow at about 10AM PST.

A final note: I don't know what quay.io is running - headers indicate tengine, and I know that at some point they forked out the python registry.
Asking for info there would certainly help (logs).

@dmp42 dmp42 added the Distribution label May 28, 2015

@jzelinskie

jzelinskie commented May 28, 2015

@dmp42 We've never had anyone report getting this error in between requests in something like a pull or push -- only at first connection which is always hitting the ping endpoint for every registry except the Docker Hub. The /v1/_ping endpoint for Quay.io is served statically from nginx (tengine in this case) with an ELB in between; pings never need to reach a Python or Go process. People have been getting these errors entirely unrelated to our traffic/load -- they don't even show up in our ELB logs.

If you need any more info, feel free to query me whenever on IRC: jzelinskie@freenode.

@mpetazzoni

mpetazzoni commented May 28, 2015

The errors are unfortunately seemingly random, but happen with a pretty high frequency in our automated deployment process, and from a variety of hosts in our environment. Which host is affected is also pretty much random, so getting any kind of tcpdump or network trace is close to impossible since (a) we can't predict if it will happen and (b) we can't predict on which host it will happen.

We have not been able to correlate these failures with any kind of other network issues on our end. As far as we can tell, the network sees no glitches while this happens. I've even tried to reproduce the problem by pulling an image from Quay.io in a loop, and in two days it never happened.

As @jzelinskie said it's only at first connection, while hitting the ping endpoint.

@dmp42

Contributor

dmp42 commented May 28, 2015

@jzelinskie thanks for the info.

Off the top of my head, I see no reason why Go would fail on _ping and not on other requests. On the other hand, since every communication with a private registry does start with a _ping, it would make sense that the symptom starts here if for some reason network communication is disrupted.

@mpetazzoni unfortunately, there is no way out without some network traces.

Also, I absolutely need to understand the frequency of this, if only to judge the likelihood of this being a code issue or a network issue.

Given how rare the symptom seems to be (you mentioned 2 days), doing curl at random moments as a sanity check probably holds little value.

Others: if someone has the means to dig into this with tcpdump, can you reach out tomorrow at 10AM PST?

Thanks.

@tim-kretschmer-c2fo

tim-kretschmer-c2fo commented May 28, 2015

Also encountered this problem. A reboot resolved it. System info below.

This is an EC2 instance running CoreOS. It had been up for 12 days, and up for a month before that.

$ docker info
Containers: 2
Images: 112
Storage Driver: btrfs
 Build Version: Btrfs v3.17.1
 Library Version: 101
Execution Driver: native-0.2
Kernel Version: 3.19.3
Operating System: CoreOS 647.0.0
CPUs: 2
Total Memory: 3.863 GiB
Name: ip-10-1-2-196
ID: QEOG:MEK5:LZRJ:N6QM:53O2:GYZS:CIGZ:2FFS:BNAO:H4EK:QFI4:GBGJ
$ docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.3.3
Git commit (client): a8a31ef-dirty
OS/Arch (client): linux/amd64
Server version: 1.5.0
Server API version: 1.17
Go version (server): go1.3.3
Git commit (server): a8a31ef-dirty
$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=647.0.0
VERSION_ID=647.0.0
BUILD_ID=
PRETTY_NAME="CoreOS 647.0.0"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
$ journalctl -u docker
May 28 14:35:09 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:09Z" level="info" msg="POST /v1.17/images/create?fromImage=quay.io%2Fc2fo%2Fauth-manage&tag=1.7.1"
May 28 14:35:09 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:09Z" level="info" msg="+job pull(quay.io/c2fo/auth-manage, 1.7.1)"
May 28 14:35:09 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:09Z" level="info" msg="+job resolve_repository(quay.io/c2fo/auth-manage)"
May 28 14:35:15 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:15Z" level="info" msg="-job resolve_repository(quay.io/c2fo/auth-manage) = OK (0)"
May 28 14:35:25 ip-10-1-2-196 dockerd[526]: Get https://quay.io/v1/_ping: dial tcp: i/o timeout
May 28 14:35:25 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:25Z" level="info" msg="-job pull(quay.io/c2fo/auth-manage, 1.7.1) = ERR (1)"
May 28 14:35:25 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:25Z" level="error" msg="Handler for POST /images/create returned error: Get https://quay.io/v1/_ping: dial tcp: i/o timeout"
May 28 14:35:25 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:25Z" level="error" msg="HTTP Error: statusCode=500 Get https://quay.io/v1/_ping: dial tcp: i/o timeout"
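
The timestamps in the journal above already say something about where the time goes. A small Python sketch that extracts the `time="…"` fields and reports the gap between consecutive entries. The helper name `gaps` is an assumption, and the input is the log excerpt from this comment, trimmed to one line per distinct timestamp.

```python
import re
from datetime import datetime

TIME_FIELD = re.compile(r'time="([^"]+)"')


def gaps(lines):
    """Yield (seconds_since_previous_entry, line) for lines carrying time="..."."""
    previous = None
    for line in lines:
        match = TIME_FIELD.search(line)
        if match is None:
            continue  # e.g. the bare "Get https://..." line has no time field
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
        delta = 0.0 if previous is None else (stamp - previous).total_seconds()
        previous = stamp
        yield delta, line


log = [
    'time="2015-05-28T14:35:09Z" level="info" msg="+job pull(quay.io/c2fo/auth-manage, 1.7.1)"',
    'time="2015-05-28T14:35:15Z" level="info" msg="-job resolve_repository(quay.io/c2fo/auth-manage) = OK (0)"',
    'time="2015-05-28T14:35:25Z" level="info" msg="-job pull(quay.io/c2fo/auth-manage, 1.7.1) = ERR (1)"',
]
for delta, line in gaps(log):
    print(f"+{delta:.0f}s  {line}")
```

For this excerpt the gaps are 0 s, 6 s, and 10 s: `resolve_repository` took about six seconds to return OK, and the pull failed ten seconds after that. Whether those ten seconds correspond to a dial timeout configured in the daemon is not established here.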
@dmp42

Contributor

dmp42 commented May 28, 2015

@tim-kretschmer-c2fo can you clarify?
According to @mpetazzoni this problem happens randomly, without any way to predict when or where it will hit. Is your situation different? Since you are stating that "a reboot fixed it", did you end up in a situation where it was happening all the time?

If you can reproduce such a situation, can you get a tcpdump showing the failed requests?

Also, "a8a31ef-dirty" for a git commit suggests you are running a modified version of docker. Can you provide details on what has been modified?

Thanks.

@tim-kretschmer-c2fo

tim-kretschmer-c2fo commented May 28, 2015

@dmp42 this was a random occurrence, and the first time it happened to us. If the machine enters that state again I will try to get that dump and paste the result here. I do not think we modified anything in docker; whatever version is running on CoreOS is what we have been using.

@dmp42

Contributor

dmp42 commented May 28, 2015

@jzelinskie speaking about DNS resolution, this one is definitely an issue: #10863

IPv6/IPv4 resolution preference at large is not well defined. Probably not a Docker issue per se, but still.

@jzelinskie

jzelinskie commented May 29, 2015

@dmp42 this seems like it could be related. What are the changes that @icecrime mentioned wrt removing the milestone?

@icecrime

Contributor

icecrime commented May 29, 2015

@jzelinskie Removing the milestone simply means this doesn't seem like it's going to make it for 1.7.0. We need more eyes from the networking team on this (@Madhu @mrjana).

@dmp42

Contributor

dmp42 commented May 29, 2015

@jzelinskie IIRC the understanding about this specific ticket so far is that in a mixed world (IPv6/IPv4) resolution precedence is not well defined, not to mention that wildcard matching is a recipe for disaster.

Either way, I would gladly do whatever it takes to figure out any registry issue. Unfortunately, given the very nature of these, it's paramount that we get access to a reproducible test case or tcpdumps.

@mpetazzoni

mpetazzoni commented Jun 1, 2015

I'm not sure I understand exactly why this could be related to IPv4/IPv6 resolution precedence issues. Our DNS setup is completely vanilla from Amazon Linux, and Quay.io does not expose AAAA records at all. Plus, our error message clearly shows an IPv4 address the i/o timeout happens on.

Our /etc/resolv.conf:

search ec2.internal
nameserver 169.254.169.253

And, as done from one of our instances:

$ host -t AAAA quay.io
quay.io has no AAAA record
$ host -t A quay.io
quay.io has address 23.21.59.93
quay.io has address 50.17.199.231
quay.io has address 107.22.188.65
quay.io has address 184.73.154.212

The only thing I can think of, is that Quay.io's set of IP addresses changes, and when it does we somehow still try to hit one of the previous ones and it fails, so it would be more of a cache/TTL issue, either with Go's DNS code, or with Amazon's DNS servers?
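
A minimal Python sketch of how that hypothesis could be checked: compare the address named in a failed dial against the A records currently returned for the name. The function name `stale_addresses` is an assumption. The sample data is the address from the May 27 daemon log earlier in this thread and the `host -t A` output above. Those were captured days apart, so the mismatch it reports is a prompt for further checking rather than evidence.

```python
def stale_addresses(failed, current):
    """Return addresses seen in failed dials that are absent from current records."""
    current_set = set(current)
    return sorted(addr for addr in set(failed) if addr not in current_set)


failed_dials = ["23.21.77.53"]  # from the daemon log dated 2015-05-27
a_records = [  # from the `host -t A quay.io` output posted Jun 1, 2015
    "23.21.59.93",
    "50.17.199.231",
    "107.22.188.65",
    "184.73.154.212",
]
print(stale_addresses(failed_dials, a_records))  # → ['23.21.77.53']
```

To make the comparison meaningful, the A records would need to be recorded at the moment of each failure, on the failing host.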

@dmp42

Contributor

dmp42 commented Jun 2, 2015

@mpetazzoni I was just pointing out one of the many reasons an error like that would happen. Since the error is non-specific, guessing is useless to solve your specific case.

Without tcpdump or a reproducible test case, there is nothing to do...

@mpetazzoni

This comment has been minimized.

Show comment
Hide comment
@mpetazzoni

mpetazzoni Jun 2, 2015

Any suggestions on how to get a tcpdump of something when I have no idea when, or even if, it will happen? I'm all ears.

dmp42 Jun 3, 2015

Contributor

@mpetazzoni capturing all docker traffic for a day on a given machine is the only way I can think of.
If this is not sufficient to get a trace, then we need to assess how often this really happens.

I wish I had a better solution, but short of figuring out an obvious code bug, getting information or reproduction is the only way to figure out what's going on.
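
For a long-running capture like this, tcpdump's file rotation options (`-C` and `-W`) keep disk usage bounded while waiting for the failure to occur. A minimal sketch that only builds the command line, assuming registry traffic on port 443 and DNS on port 53; the output path, file sizes and the helper names `capture_command` and `as_shell` are illustrative assumptions:

```python
import shlex


def capture_command(interface="any", ports=(443, 53),
                    out_path="/var/tmp/registry.pcap",
                    file_mb=100, file_count=20):
    """Build a tcpdump argv that rotates through a fixed number of capture files."""
    port_filter = " or ".join(f"port {p}" for p in ports)
    return [
        "tcpdump",
        "-n",                    # do not resolve addresses to names
        "-i", interface,         # "any" captures on all interfaces (Linux)
        "-w", out_path,          # write raw packets instead of printing them
        "-C", str(file_mb),      # start a new file after roughly this many MB
        "-W", str(file_count),   # keep at most this many files, overwriting the oldest
        port_filter,
    ]


def as_shell(argv):
    """Render the argv as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

With the defaults the capture is capped at roughly 2 GB, so it could be left running until the next timeout and then stopped to preserve the files covering the failure.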

chandra-tp Jun 7, 2015

I faced the same issue on our staging server. Initially we thought there might be some issue with Docker Hub, but after some more investigation we found that there were a lot of dangling images.

After we removed all of them we were able to push the image to docker hub.

It looks like docker is not able to handle this scenario. I'm not sure whether this is due to the dangling images or to the number of images. The error message, "docker timeout exceeded", is very confusing.

sporkmonger Jun 28, 2015

Not sure if this helps w/ tracking down root cause or not, but thought I'd toss it out there as something to investigate.

I got an error similar to this after experimenting w/ setting up SkyDNS as the first nameserver in /etc/resolv.conf. Since SkyDNS was running in a container itself, there was a period during start-up where that first name server wouldn't be running and would therefore always fail over to the second one (which was Google public DNS). Not sure if it will help w/ reproduction or not, but maybe just put a bogus IP address into the first-place spot in /etc/resolv.conf and then good name servers in the second/third places? Expected result is that hostname lookups should take way longer than normal, but shouldn't actually fail. Actual result is that SHTF with error messages very similar to everything people are reporting above. The issues people are seeing could just be an intermittent issue w/ DNS hostname resolution on the first name server that triggers a one-time failover to the second DNS server, and which then quickly recovers. Not sure why that would cause all these problems though.

I've since concluded that what I tried to do might not be a good idea and I've stopped doing it. As soon as I reordered the name servers in /etc/resolv.conf the problem stopped happening. Hopefully it helps someone figure this out.
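
To check whether the first name server is the one misbehaving, each entry in /etc/resolv.conf can be queried directly in the order listed. A minimal sketch, assuming IPv4 name servers, plain UDP on port 53 and a 2-second timeout; `parse_nameservers`, `build_query` and `answers_within` are hypothetical helper names, and the query is a bare A-record question in RFC 1035 wire format:

```python
import socket
import struct


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def answers_within(server, name="quay.io", timeout=2.0):
    """True if the server sends any UDP reply to an A query before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            sock.recvfrom(512)
            return True
        except OSError:  # timeout, unreachable network, or a non-IPv4 address
            return False


# Example (needs network access):
#   with open("/etc/resolv.conf") as handle:
#       for server in parse_nameservers(handle.read()):
#           print(server, answers_within(server))
```

A first entry that prints `False` while later ones print `True` would match the failover scenario described above. The sketch does not validate the reply, only that something answered in time.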

stevvooe Jun 29, 2015

Contributor

@sporkmonger This sounds relevant.

@mpetazzoni I wonder if this has something to do with the fact that the IP 169.254.169.253 is un-routable.

mpetazzoni Jun 30, 2015

So this would be the result of two issues:

  1. Go's DNS library doesn't like Amazon's DNS server IP address, and so sometimes doesn't resolve correctly (or sometimes EC2's DNS server is unreachable, could happen, I guess?)
  2. Because of this DNS issue, the DNS resolution for quay.io can't succeed and the Docker daemon uses whatever was in the DNS cache, which could be a now invalid IP address that Quay is no longer using, which results in the TCP timeout we're seeing since nothing is listening on the other end.

If I understand correctly, switching to a routable, more reliable DNS server address (8.8.8.8) should solve the problem? That's still a sucky failure mode for Go's DNS library.

stevvooe Jul 1, 2015

Contributor

@mpetazzoni That conclusion probably goes a little far (and is pretty unfair to Go's DNS library). We still don't have any evidence that it is part of the problem.

Where is the 169.254.169.253 nameserver coming from? How is that value being populated?

mpetazzoni Jul 1, 2015

It's the default /etc/resolv.conf that comes with the Amazon Linux AMI. Our configuration management system doesn't touch it.

stevvooe Jul 2, 2015

Contributor

@mpetazzoni Ok, so I understand your theory a little bit better. Go's DNS picks up the unresolved value and starts black-holing DNS requests.

Is it possible that docker is coming up too early in the startup process (systemd, init.d, etc.)? Can we make it depend on a full and successful DHCP resolution, including the nameservers, before starting up?
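
If start-up ordering does turn out to matter, one crude gate would be to block until /etc/resolv.conf actually lists a name server before launching the daemon. A minimal sketch, assuming a 60-second limit and 1-second polling; `count_nameservers` and `wait_for_nameserver` are hypothetical names, and how this would be hooked into systemd or init.d is left open:

```python
import time


def count_nameservers(resolv_conf_text):
    """Count `nameserver <addr>` lines, ignoring comments and blank lines."""
    count = 0
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            count += 1
    return count


def wait_for_nameserver(path="/etc/resolv.conf", timeout=60.0, interval=1.0):
    """Poll until the file lists a nameserver; return False if the timeout passes first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as handle:
                if count_nameservers(handle.read()) > 0:
                    return True
        except OSError:
            pass  # file not written yet; keep waiting
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

This only confirms that an entry exists, not that the server behind it answers, so it would address the "DHCP has not finished" case and nothing else.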

mpetazzoni Jul 2, 2015

I think our Docker daemon starts at the right time in the boot sequence, and the networking is already correctly set up by then. Would you suggest we try using 8.8.8.8 as our only nameserver instead of Amazon's, and see if that makes a difference?

By the way, we haven't encountered this timeout in ~3 weeks. We've been progressively upgrading our system packages and Docker version on our instances over the past several weeks, so maybe that helped (newer Go version?). But of course it doesn't prove anything.

stevvooe Jul 2, 2015

Contributor

@mpetazzoni Networking may be set up, but DHCP may not have set a nameserver in resolv.conf.

Digressing, if you haven't seen it in a few weeks, perhaps the root cause has disappeared. Let's keep watching. If we see the timeout again, we'll look at the DHCP -> resolv.conf process and monitor from there.

Thanks for sticking this out.

notnownikki Jul 2, 2015

We're seeing this issue too. We've put our registry's hostname in /etc/hosts to try to work around any possible DNS issue, and we're still seeing it.

We're running on precise, version info:

$ docker version
Client version: 1.4.1
Client API version: 1.16
Go version (client): go1.3.3
Git commit (client): 5bc2ff8
OS/Arch (client): linux/amd64
Server version: 1.4.1
Server API version: 1.16
Go version (server): go1.3.3
Git commit (server): 5bc2ff8

jesusaurus Jul 2, 2015

I work with @notnownikki and I can add that 107 out of the 481 "docker pull" commands our CI system has run over the past week resulted in the i/o timeout error.

stevvooe Jul 2, 2015

Contributor

@jesusaurus @notnownikki Please confirm that the error you're getting is dial tcp: i/o timeout and not something else. This issue specifically covers IO timeout on DNS resolution.
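
Several differently shaped timeout messages appear in this thread, and sorting them apart makes it easier to tell which reports belong here. A minimal sketch of a classifier, assuming the message formats quoted in the comments above; the labels and the `classify` helper are made up, and the interpretation in the comments is a reading of the messages rather than something confirmed from the Go source:

```python
import re

_PATTERNS = [
    # No address after "dial tcp": the dial gave up before an address was chosen,
    # which is the case this issue ties to name resolution.
    ("dial-timeout-no-address", re.compile(r"dial tcp: i/o timeout")),
    # An address is present: a name was resolved (or an IP given) and the
    # connection attempt itself timed out.
    ("dial-timeout-to-address", re.compile(r"dial tcp (\S+): i/o timeout")),
    ("dial-connection-timed-out", re.compile(r"dial tcp (\S+): connection timed out")),
    # The connection was established and a later read timed out.
    ("read-timeout", re.compile(r"read tcp (\S+): i/o timeout")),
]


def classify(message):
    """Return (label, address or None) for a docker error message."""
    for label, pattern in _PATTERNS:
        match = pattern.search(message)
        if match:
            return label, (match.group(1) if match.groups() else None)
    return "other", None
```

Run over a CI log, this would give a per-shape count instead of a single "i/o timeout" bucket.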

notnownikki Jul 2, 2015

Yes, this is dial tcp i/o timeout.

cmheisel Jul 28, 2015

I'm seeing the same issue on my team's servers in our data center (not AWS):

Update: Restarting the box did not fix the issue

Command

docker pull quay.io/myorg/myimage:mytag

Error message

FATA[0020] Error response from daemon: v1 ping attempt failed with error: Get https://quay.io/v1/_ping: dial tcp: i/o timeout. If this private registry supports only HTTP or HTTPS with an unknown CA certificate, please add `--insecure-registry quay.io` to the daemon's arguments. In the case of HTTPS, if you have access to the registry's CA certificate, no need for the flag; simply place the CA certificate at /etc/docker/certs.d/quay.io/ca.crt 

Docker version

Client version: 1.6.0
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 4749651
OS/Arch (client): linux/amd64
Server version: 1.6.0
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 4749651
OS/Arch (server): linux/amd64

SamVerschueren Aug 10, 2015

I am receiving the i/o timeout on AWS.

Get https://registry-1.docker.io/v1/repositories/library/jenkins/tags: read tcp 54.208.130.47:443: i/o timeout
$ docker version
Client version: 1.7.0
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 0baf609
OS/Arch (client): linux/amd64
Server version: 1.7.0
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 0baf609
OS/Arch (server): linux/amd64

Ankitmaroo Aug 12, 2015

I am getting a similar problem while running docker pull from Jenkins. I can't reproduce it when I ssh into the server and run a similar pull command. Once it starts happening, which is very random, it happens 3-4 times in a row and then stops. Most of the time I restart the daemon and the problem goes away. Surprisingly, within a given Jenkins job, some pulls from the same repo work fine and some time out.

First error was this:

time="2015-08-12T10:57:43-07:00" level=fatal msg="Failed to upload layer: Put https://ussf-prd-lndv03:5000/v1/images/bad577d45faa011a1fae043c32204690790e5aa90a618f44e6b790aee8695537/layer: dial tcp 10.50.76.27:5000: connection timed out"

This occurred twice and then changed to this error:

time="2015-08-12T11:16:13-07:00" level=fatal msg="Error response from daemon: v1 ping attempt failed with error: Get https://ussf-prd-lndv03:5000/v1/_ping: dial tcp 10.50.76.27:5000: i/o timeout. If this private registry supports only HTTP or HTTPS with an unknown CA certificate, please add --insecure-registry ussf-prd-lndv03:5000 to the daemon's arguments. In the case of HTTPS, if you have access to the registry's CA certificate, no need for the flag; simply place the CA certificate at /etc/docker/certs.d/ussf-prd-lndv03:5000/ca.crt"
Build step 'Execute shell' marked build as failure

The problem has been resolved for now (as in the past) by restarting the daemon. Restarting the registry didn't help.

$ docker version
Client version: 1.6.0
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 4749651
OS/Arch (client): linux/amd64
Server version: 1.6.0
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 4749651
OS/Arch (server): linux/amd64

jcarpe Nov 4, 2015

I am not entirely sure if this is related, but this seems the most relevant issue. I had been using boot2docker and removed it in order to install the DockerToolbox. The installation completes, and I am presented with the whale screen saying docker is configured to use the default machine. When I try to run any docker command I am presented with the following:

An error occurred trying to connect: Get https://192.168.59.103:2376/v1.20/info: dial tcp 192.168.59.103:2376: i/o timeout

This is very baffling. I have found a few solutions people have claimed to fix this issue, but I have had no luck. It's confusing that I would get the message saying docker is running on a given IP and then would immediately get a message saying it cannot connect.

stevvooe Nov 4, 2015

Contributor

@jcarpe While the error message for your issue is similar, the causes are very different. The problem you describe looks like an installation issue with toolbox (probably routing related). I'd suggest filing an issue over at https://github.com/docker/toolbox or seeking help through IRC.

tzejohn Nov 30, 2015

I am seeing the following reproducibly in a VM that I am provisioning via Ansible. Executing the same steps from the AWS console results in an instance that does not exhibit this same behavior. I am not setting up any custom routing or anything, just using a base AWS Linux image and then installing docker.

time="2015-11-30T14:39:05.870410892Z" level=info msg="POST /v1.19/images/create?fromImage=hello-world%3Alatest"
time="2015-11-30T14:40:06.416661098Z" level=error msg="Error from V2 registry: Get https://registry-1.docker.io/v2/library/hello-world/manifests/latest: read tcp 52.21.249.68:443: i/o timeout"

tzejohn Nov 30, 2015

Actually, I terminated the instances, and started new ones from Ansible, and I can no longer duplicate this issue.

mrfoobar1 Jan 6, 2016

@everyone does anyone know if there's some sort of timeout and retry mechanism on a docker pull? I'm trying to automate a fresh install with the latest update on Debian jessie, and from time to time in Vagrant it hangs for a very long time.

I'm just wondering if we can set up a timeout and retry for nasty dev / production environments. For instance, tonight I'm working on a laptop with a wifi connection and it has a hard time pulling an image. I'm just glad it happened in such conditions, because sometimes with cloud computing you can expect such behaviours when your cloud provider is shy about sharing such information due to SLAs.

Here's a debug session on the docker process:
12925 ? Sl 0:00 | _ docker build -t bebasebox .

root@pxe:/home/vagrant# strace -p 12925
Process 12925 attached
epoll_wait(5, ^CProcess 12925 detached
<detached ...>

Attaching to process 12925
Reading symbols from /usr/bin/docker...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libapparmor.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/x86_64-linux-gnu/libapparmor.so.1
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libpthread-2.19.so...done.
done.
[New LWP 12928]
[New LWP 12927]
[New LWP 12926]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libdl-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2
Reading symbols from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/x86_64-linux-gnu/libselinux.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libselinux.so.1
Reading symbols from /lib/x86_64-linux-gnu/libudev.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libudev.so.1
Reading symbols from /lib/x86_64-linux-gnu/libpcre.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libpcre.so.3
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/librt-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/librt.so.1
0x00000000004dc629 in runtime.epollwait ()
(gdb) bt
#0 0x00000000004dc629 in runtime.epollwait ()
#1 0x00000000004af983 in runtime.netpoll ()
#2 0x00007ffd00000005 in ?? ()
#3 0x00007ffd3aca3a08 in ?? ()
#4 0xffffffff00000080 in ?? ()
#5 0x0000000000000000 in ?? ()

Sorry, I don't have the proper debug info :\ I guess I got a bit lazy here :)
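
In the absence of a built-in option, a timeout and retry can be wrapped around the client from the outside. A minimal sketch, assuming three attempts, a 300-second limit per attempt and a linear backoff; `pull_with_retry` is a hypothetical helper, not a docker feature:

```python
import subprocess
import time


def pull_with_retry(image, attempts=3, timeout=300.0, backoff=5.0,
                    run=subprocess.run, sleep=time.sleep):
    """Run `docker pull image`, retrying an attempt that hangs or fails.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = run(["docker", "pull", image], timeout=timeout)
            if result.returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            pass  # the client process was killed after `timeout` seconds
        if attempt < attempts:
            sleep(backoff * attempt)  # wait a little longer before each retry
    raise RuntimeError(f"docker pull {image} failed after {attempts} attempts")
```

For example, `pull_with_retry("debian:jessie", timeout=120)` would give up on a hung pull after two minutes and try again. Killing the client may not cancel work already in progress in the daemon, so this is a workaround for automation scripts rather than a fix.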

stevvooe Jan 6, 2016

Contributor

@mrfoobar1 I'd recommend opening another issue to request that feature. The right approach here is a timeout to first response header, but this can be hard to control.

Also, if you're not getting the i/o timeout on dial tcp, you might be getting a different issue.

Since this issue is a little vague and we haven't had recent reports, I'm going to go ahead and close this.

@stevvooe stevvooe closed this Jan 6, 2016

ghost Jun 5, 2016

Hello,
We are seeing this issue too, with pretty much the same symptoms described by jzelinskie. However, DNS is clearly unrelated because we use IPs rather than names.

We use the Go Docker library rather than the command-line utility. In our data center this only happens on two out of about a dozen ESXi machines. (I say "about a dozen" because the machines run very different workloads with respect to Docker operations; the machines in question run nightly builds, each of which performs roughly 500 operations a day.) We see this issue intermittently and not very often: it may happen twice a day, or it may not recur for 3-4 days.

We ran several network tests while running nightlies, and we don't see any networking problems. Packet drops are within expected figures. At the same time we are doing a lot of networking against Apache httpd, and we don't see any problems there.

The exact error we are receiving is:

*url.Error 10.11.18.239: Get http://10.11.18.239:2375/_ping: dial tcp 10.11.18.239:2375: i/o timeout

The Docker client is built from revision 8669ea01ba93139a51783ac17658dedd47538b9c.
The Docker daemon runs:

$ docker version
Client:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:08:45 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:08:45 UTC 2015
 OS/Arch:      linux/amd64
$ docker info
Containers: 5
Images: 201
Storage Driver: devicemapper
 Pool Name: docker-8:1-752829-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 5.853 GB
 Data Space Total: 107.4 GB
 Data Space Available: 9.154 GB
 Metadata Space Used: 10.01 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.137 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-123.13.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 31.27 GiB
Name: loader6a
ID: ZXOZ:Y25P:V7YG:EQAJ:MIZA:TUEM:OFBV:4LCN:AZ6H:4ABH:7SAF:PKOV

One thing that may be different in our setup is that we have 6 network interfaces available at any given moment. In principle this shouldn't confuse the client code, but having looked at golang/go#6336 I thought I'd mention it.
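
Since the report comes from a Go client, it may help to separate dial-phase timeouts like the one quoted above from other timeouts programmatically. The following is a rough sketch, assuming a Go version with errors.As and os.ErrDeadlineExceeded (both newer than the go1.4.2 shown in the report). The function names are hypothetical, and the errors are built by hand rather than taken from a real connection attempt.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"os"
)

// isDialTimeout reports whether err (or an error it wraps) is a
// *net.OpError from the "dial" phase that timed out.
func isDialTimeout(err error) bool {
	var opErr *net.OpError
	if !errors.As(err, &opErr) {
		return false
	}
	return opErr.Op == "dial" && opErr.Timeout()
}

// syntheticTimeout is a hypothetical helper that builds a *url.Error
// shaped like the one reported above, for the given network operation,
// without touching the network.
func syntheticTimeout(op string) error {
	return &url.Error{
		Op:  "Get",
		URL: "http://10.11.18.239:2375/_ping",
		Err: &net.OpError{Op: op, Net: "tcp", Err: os.ErrDeadlineExceeded},
	}
}

func main() {
	dialErr := syntheticTimeout("dial") // timeout while connecting
	readErr := syntheticTimeout("read") // timeout on an established connection

	fmt.Println("dial error is dial timeout:", isDialTimeout(dialErr))
	fmt.Println("read error is dial timeout:", isDialTimeout(readErr))
}
```

The check is intended to return true only for the dial-phase error. Logging that distinction alongside the target address can help narrow down whether a failure happened before or after the TCP connection was established.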

jingxu97 Nov 4, 2016

It seems we have a similar issue as well:

run: error getting repository data: Get https://gcr.io/v1/repositories/google_containers/gci-mounter/images: dial tcp 74.125.70.82:443: i/o timeout

Could we reopen this?

stevvooe Nov 8, 2016

Contributor

@jingxu97 This issue describes a mostly generic i/o timeout, which can have a number of causes. In most cases it is an infrastructure issue, so I would recommend contacting Google's support.

MingCHEN-Github Jan 18, 2017

My solution: I got this kind of error when trying to install TensorFlow in Docker. Following the TensorFlow tutorial, I ran the command

sudo docker run -it -p 8888:8888 b.gcr.io/tensorflow/tensorflow

and got the error

Unable to find image 'gcr.io/tensorflow/tensorflow:latest' locally
docker: Error response from daemon: Get https://gcr.io/v1/_ping: dial tcp 64.233.188.82:443: i/o timeout.

I guess it is because of the GFW. I tried a VPN, but it failed again. Finally, I tried pulling the TensorFlow image from Docker Hub instead of from https://gcr.io (Google Cloud Platform). In a terminal, I ran

sudo docker run -it -p 8888:8888 tensorflow/tensorflow

and it worked for me. I hope this is useful to others!
