Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
api: add explicit DNS cache #7407
Conversation
|
!!build!! |
cmars
reviewed
May 26, 2017
First pass. I need to read this again more carefully, and read up on DNSCache to give it the review it deserves.
| + DNSCache | ||
| +} | ||
| + | ||
| +func (emptyDNSCache) Lookup(host string) []string { |
| - client, err := listener.Accept() | ||
| - if err != nil { | ||
| - return | ||
| + for i := 0; i < 3; i++ { |
cmars
May 26, 2017
Owner
I'm curious about this. How did this test get three connection attempts prior to this PR? Because there were three addresses set up in s.APIInfo(c)?
rogpeppe
May 26, 2017
Owner
Yup, that's right. But now there's logic to prevent dialing the same IP address twice, we need three separate addresses.
| + // If it is nil, net.DefaultResolver will be used. | ||
| + IPAddrResolver IPAddrResolver | ||
| + | ||
| + DNSCache DNSCache |
| // Clock is used as a time source for retries. | ||
| // If it is nil, clock.WallClock will be used. | ||
| Clock clock.Clock | ||
| } | ||
| +type IPAddrResolver interface { |
| + LookupIPAddr(ctx context.Context, host string) ([]net.IPAddr, error) | ||
| +} | ||
| + | ||
| +type DNSCache interface { |
cmars
approved these changes
May 26, 2017
I think the point of this is to re-resolve names in case the IP address has changed since we last cached the lookup; this will be more resilient than getting stuck on the first address we resolve in controllers.yaml.
LGTM with a suggestion for another test case, just to really exercise the "cache used" logic through a persistent condition of being unable to resolve the hostname. I'd also like someone from core to look at this.
| + c.Assert(dnsCache.Lookup("place1.example"), jc.DeepEquals, []string{"0.2.2.2"}) | ||
| +} | ||
| + | ||
| +func (s *apiclientSuite) TestWithUnresolvableAddr(c *gc.C) { |
cmars
May 26, 2017
Owner
Should we add a test for unresolvable addresses that uses the DNSCache & "cache used" logic as well?
rogpeppe
May 30, 2017
Owner
Added TestWithUnresolvableAddrAfterCacheFallback to address this request.
|
Just to ask, why do a DNS cache at all? The only original reason that I'm aware of is because of places like MAAS where we needed to check that we could use the hostname, by checking if DNS knew about it, and if we couldn't then we wouldn't include it in the saved-list-of-addressess-to-try. |
| + logger.Infof("ignoring IP address with zone %q", addr) | ||
| + continue | ||
| + } | ||
| + ips = append(ips, addr.IP.String()) |
wupeka
May 30, 2017
Member
Wouldn't it be saner to put those addresses on the list in '%' notation?
ip := addr.IP.String()
if addr.Zone != "" {
ip = fmt.Sprintf("%s%%%s", o[, addr.Zone)
}
ips.append(ips, ip)
(although I think that it wouldn't be a problem too, since I can't think of a way we'd get a zoned address from resolver)
rogpeppe
May 30, 2017
Owner
Currently the zoned IP addresses aren't parsable by any of our network code - they're not understood as valid IP addresses AFAIK, so I think it's best to exclude them unless there's a good reason not to.
jameinel
reviewed
May 30, 2017
the actual functionality seems generally ok, except for wondering why we're doing dns caching in the first place.
| // Technically when there's no CACert, we don't need this | ||
| // machinery, because we could just use http.DefaultTransport | ||
| // for everything, but it's easier just to leave it in place. | ||
| bakeryClient.Client.Transport = &hostSwitchingTransport{ | ||
| - primaryHost: apiHost, | ||
| + primaryHost: dialResult.addr, |
jameinel
May 30, 2017
Owner
// addr is the address used to connect to the API server.
sure sounds like it might have a :PORT in there.
Maybe we could update the docs to say it is either a hostname or an IP address?
Passing 'addr' to 'Host' is always something that feels weird to me, as there is no guarantee an "addr" is a hostname.
| // TODO(rogpeppe) We'd like to set Deadline here | ||
| // but that would break lots of tests that rely on | ||
| // setting a zero timeout. | ||
| netDialer := net.Dialer{} | ||
| dialer := &websocket.Dialer{ | ||
| NetDial: func(netw, addr string) (net.Conn, error) { | ||
| + if ipAddr != "" { |
jameinel
May 30, 2017
Owner
This sort of thing always feels bad to me.
We've done "foo.NetDial(X)" but under the covers we actually force that to be "net.Dial(Y)" and we never even check that 'addr' matches the object we're trying to substitute.
If we wanted to dial Y, why didn't we do so from the beginning?
I'm guessing this is again the hostname vs IP thing, but I don't quite understand why we need to replace DNS.
rogpeppe
May 30, 2017
Owner
We're not replacing DNS - we're just doing the DNS lookup separately so that we can do the dials concurrently. The logic inside net.Dial just tries the addresses sequentially, so if the first IP address refers to a controller that's down or unavailable, the dial might take a very long time to succeed even though another of the IP addresses is immediately available.
We need the TLS logic to understand that it's connecting to the pre-DNS-resolution address, but we want the net dial to go directly to the IP address that we've chosen.
| - select { | ||
| - case <-opts.Clock.After(opts.DialAddressInterval): | ||
| - case <-try.Dead(): | ||
| + ips := opts.DNSCache.Lookup(host) |
jameinel
May 30, 2017
Owner
It seems we always do this even if host is an IP address, is that true? Or is it just assumed that IPs would never be in the cache so the lookup is cheap/no-op and won't return ips?
| + cacheUsed = append(cacheUsed, addr) | ||
| + } else { | ||
| + var err error | ||
| + ips, err = lookupIPAddr(ctx, host, opts.IPAddrResolver) |
jameinel
May 30, 2017
Owner
this again seems less cheap in the case of host being an IP already, but maybe lookupIPAddr also has code for fast-pathing things that are already IPs?
rogpeppe
May 30, 2017
Owner
lookupIPAddr does have code for fast-pathing things that are already IPs, but you're absolutely right that this code may end up putting raw IP addrs into cacheUsed and into the cache itself, which is wrong. Fixed - good catch, thanks.
| + continue | ||
| + } | ||
| + opts.DNSCache.Add(host, ips) | ||
| + logger.Debugf("looked up %v -> %v", host, ips) |
jameinel
May 30, 2017
Owner
but that would mean that this would end up caching IP=>[IP].
It definitely doesn't make sense to put raw IP addrs into cacheUsed.
| - conn, err = d.opts.DialWebsocket(d.ctx, d.urlStr, &tlsConfig1) | ||
| + tlsConfig.RootCAs = nil | ||
| + tlsConfig.ServerName = d.serverName | ||
| + conn, err = d.opts.DialWebsocket(d.ctx, d.urlStr, tlsConfig, d.ipAddr) |
jameinel
May 30, 2017
Owner
This will conflict with the patch that Witold just landed to always return the original error, instead of sometimes returning the fallback path.
| + SkipLogin: true, | ||
| + CACert: jtesting.CACert, | ||
| + }, api.DialOpts{ | ||
| + Timeout: 5 * time.Second, |
jameinel
May 30, 2017
Owner
does this mean that this particular test takes 5s to retry 5 times before deciding to fallback to resolver instead of using the local cache?
Can we make it a bit faster (100ms/500ms)?
Maybe it doesn't because we also pass in fakeClock?
| + if err != nil { | ||
| + panic(err) | ||
| + } | ||
| + r[host] = []string{fmt.Sprintf("0.%[1]d.%[1]d.%[1]d", i+1)} |
jameinel
May 30, 2017
Owner
odd side-note, github's golang renderer doesn't understand this syntax and flags the '%' characters as being an error.
rogpeppe
May 30, 2017
Owner
Interesting. I wonder where we could raise an issue about that. I've seen other similar glitches elsewhere but didn't realise it was trying to flag an error.
This is a good question. If I didn't know about the MAAS case then I'd support dropping DNS caching entirely, but if we regularly encounter addresses that take a very long time to resolve then it probably still makes sense. I think that it's important to keep the original set of addresses returned from the API server intact so that when the addresses are used in another context (for example a controllers.yaml entry can be passed between machines and across network zones) all the original addresses are still available. An existing bug relates to the fact that we're dropping unresolved addresses from APIAddresses and leaving them in UnresolvedAPIAddresses, which seems intuitively right, but those two fields don't have an easily explainable relationship, so this PR and its followup are about simplifying that logic by leaving DNS resolution to the API dialing code exclusively, but providing a callback mechanism (DNSCache) so that we can remember DNS entries when needed. I'd be interested to have a chat about this - I'd love to make the code simpler if it were reasonable to do so. |
jameinel
approved these changes
May 30, 2017
I'm not sure that we want to be caching DNS at all, but I don't think this makes what we're doing particularly worse. So we'll have a chat out-of-band towards where we want to be.
|
$$merge$$ |
|
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
rogpeppe commentedMay 26, 2017
The API connection logic in juju.NewAPIConnection implements
a half-hearted DNS cache by resolving DNS addresses itself
and storing them in the controllers.yaml fields.
This obscures the actual addresses that have been
returned from the controller.
Instead of that, we make api.Open responsible for
all DNS resolution and pass a DNS cache interface
into DialOpts so that the NewAPIConnection logic
can still do its own DNS caching.
This means that the API connection logic always knows
the symbolic host name of the controller that it's
dialing, which is important when using public certificate
validation.
It also means that even when a Go client calls api.Open with
a single host name, it will get the benefit of concurrent
dialing to each of the resolved IP addresses.
This is part 1 of 2 - the next PR will change the juju.NewAPIConnection
logic to use the DNSCache machinery.
QA check that API connections still work.