Distribution DNS Failover #45594
As per docker/cli#4296 (comment), I would say this is by design; any sort of fallback behavior would have to be implemented as part of the daemon itself (or in distribution/distribution), and given that the distribution spec is silent on what to do here, I don't think introducing a new behavior is the right move.
That's fair, I guess perhaps I'm looking for guidance on my current setup then. Using Docker Hub as an example, my current understanding is that if one of the three following records becomes unavailable, image pulls for a large number of folks would start failing.
Is this assumption accurate? How would you recommend mitigating this possible failure scenario? I'm leaning towards putting both of my servers behind a load balancer that is capable of performing health checks, but that wouldn't solve the issue stated above.
I think in general, if an IP 'must' respond without fail, the solution is anycast routing. There are other options like low TTLs and DNS trickery as well; the main objective for a highly available registry is that the DNS name must resolve to an IP where a currently functional registry can be located.
So I think this is the crux of my issue with how the Docker CLI currently behaves. Any production-ready registries, including those available over the internet, will employ a round robin DNS configuration for their domain. Using my example above, registry.hub.docker.com resolves to three distinct IP addresses.
In its current state, one-third of 'docker pull/push' commands would fail if the facility that contains 34.194.164.123 were to catch on fire. These failures would continue to occur until either 1) the DNS record for registry.hub.docker.com were updated to only include the two "good" IP addresses and the clients received the DNS update, or 2) the fire is put out and service is restored. Apache HttpComponents 4+ (Java) resolved this issue by implementing retry logic around connection timeout parameters. Perhaps my inquiry goes deeper into the underlying Go libraries that are being used. I'd love to hear your thoughts on my hypothetical scenario.
That is by design -- all IP addresses are treated as equal by most software. Going out of your way to retry another response from the DNS packet is pretty involved, and I would generally go so far as to call it an anti-feature. This is how libc does DNS. resolv.conf has the same semantics (all nameservers are treated equally; you can't "fall through" to a second nameserver because the first one returned an error; musl libc goes so far as to do lookups in parallel and pick the first to return).
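To make those libc semantics concrete, here is a hypothetical glibc-style /etc/resolv.conf with two nameservers (addresses are illustrative). Per resolv.conf(5), the servers are consulted in listed order and the resolver moves on primarily when a query times out; `options rotate` merely round-robins which server is tried first, and `timeout`/`attempts` tune the per-server retry behavior:

```
nameserver 10.0.0.2
nameserver 10.0.0.3
options rotate timeout:2 attempts:2
```

There is no knob that says "if the first server answers with an error, re-ask the second", which mirrors the "all addresses are equal" stance described above.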
Hmm, I might have to take that back -- apparently the thinking in this area has moved on from
There might be a little more to this than my very systems-C biased experience/recollection indicates. I'm curious as to what @corhere thinks, and I'll need to spend some time figuring out what the Go stdlib actually intends to do (if it's not trying to do a simple
Thanks :-)
I just wanted to clarify one point which I don't think I correctly articulated in my original post: I 100% agree that a retry should not be attempted if a server returns an error response (4xx, etc.). It is when a connection to the attempted server cannot even be established that I would expect another address to be attempted.
Failing over to the next address because the server at one address is responsive but otherwise unable to complete the HTTP request to the client's satisfaction—whatever the reason—could be bad for server operators. In the event that the server is responding with failures because it is overloaded, failing over to the next DNS record would merely compound the problem by overloading the fallback server with a thundering herd of retried requests.
This entire time I've been reasoning in terms of the transport layer, and the successful establishment of a TCP connection. I very much agree that the application layer should have no implications for the connection semantics. However, if our HTTP client is already retrying failed TCP handshakes with the next IP returned by DNS, it sounds like there might only be a docs issue here in the end.
This issue is easily reproduced on my end by setting up a round robin DNS entry configured with one "good" server and one "bad" server. The docker pull/push commands time out after 15 seconds until the "bad" server is removed from the DNS entry after which the commands work without issue.
If this sort of configuration is working for both of you, then perhaps it's the fact that I have an old-ish Docker binary (version 20.10.14 / go1.16.15)?
@jcurtis789 what's bad about the "bad" server in your tests?
In this particular scenario, the server is powered down so nothing is listening on port 80. I can also replicate it by simply shutting down the HAProxy instance in front of Artifactory, which accomplishes the same thing. Restoring power/starting up HAProxy fixes the problem.
Lines 167 to 170 in f510614: in the case of the registry domain name resolving to two addresses, the dialer will fail over to the second address after fifteen seconds.
Lines 121 to 124 in f510614: after fifteen seconds, the http.Client timeout expires and the request fails, defeating the net.Dialer failover. Looks like a bug!
Nice find - thanks! From my perspective, 15 seconds is an order of magnitude higher than I would expect to wait for a connection to get established. Just my two cents, but I would expect that value to be 1-2 seconds at most to further prevent a degradation of service. Thanks again both of you for your time in investigating :-)
Description
Hello,
I have an FQDN configured to an on-premise registry such that two servers answer to the FQDN (i.e. www.my-private-repo.com). These servers run a JFrog Artifactory repo, although I believe that to be insignificant for this issue.
When both servers are healthy, performing a docker push/docker pull works flawlessly. However, I noticed that if one server goes down (maintenance, otherwise) these operations fail.
Reproduce
To reproduce, I have configured a custom FQDN where one DNS entry is one of the real servers hosting my image repository and another is a real DNS record that resolves to a fake server.
nslookup www.my-private-repo.com
...
Address 1.2.3.4
...
Address 5.6.7.8
nslookup 1.2.3.4
... name = my-fake-server
nslookup 5.6.7.8
... name = my-real-server
A 'docker pull' times out after 15 seconds. Throwing --debug doesn't provide any more information. When my-fake-server is removed from www.my-private-repo.com, pulls begin working again.
Expected behavior
I would expect the Docker CLI to realize one of the DNS entries is faulty (i.e. my-fake-server) via connection timeout, 502 or otherwise and attempt the request on another entry (i.e. my-real-server). Perhaps I am missing a configuration to do so.
Docker Hub (registry.hub.docker.com) DNS resolves to three separate IPs. If one of these were to become unavailable, I would expect similar behavior, but with much more widespread impact across the community, unless I'm perhaps missing something.
docker version
docker info
Additional Info
Thank you!