api: add explicit DNS cache #7407

Merged
merged 1 commit into from May 30, 2017

Conversation

Projects
None yet
5 participants
Owner

rogpeppe commented May 26, 2017

The API connection logic in juju.NewAPIConnection implements
a half-hearted DNS cache by resolving DNS addresses itself
and storing them in the controllers.yaml fields.
This obscures the actual addresses that have been
returned from the controller.

Instead of that, we make api.Open responsible for
all DNS resolution and pass a DNS cache interface
into DialOpts so that the NewAPIConnection logic
can still do its own DNS caching.

This means that the API connection logic always knows
the symbolic host name of the controller that it's
dialing, which is important when using public certificate
validation.

It also means that even when a Go client calls api.Open with
a single host name, it will get the benefit of concurrent
dialing to each of the resolved IP addresses.

This is part 1 of 2 - the next PR will change the juju.NewAPIConnection
logic to use the DNSCache machinery.

QA check that API connections still work.

Owner

rogpeppe commented May 26, 2017

!!build!!

First pass. I need to read this again more carefully, and read up on DNSCache to give it the review it deserves.

+ DNSCache
+}
+
+func (emptyDNSCache) Lookup(host string) []string {
@cmars

cmars May 26, 2017

Owner

Could add godoc comments "Lookup implements DNSCache" here and below.

@rogpeppe

rogpeppe May 30, 2017

Owner

could do, but I think the code is clear enough as is tbh.

- client, err := listener.Accept()
- if err != nil {
- return
+ for i := 0; i < 3; i++ {
@cmars

cmars May 26, 2017

Owner

I'm curious about this. How did this test get three connection attempts prior to this PR? Because there were three addresses set up in s.APIInfo(c)?

@rogpeppe

rogpeppe May 26, 2017

Owner

Yup, that's right. But now there's logic to prevent dialing the same IP address twice, we need three separate addresses.

api/interface.go
+ // If it is nil, net.DefaultResolver will be used.
+ IPAddrResolver IPAddrResolver
+
+ DNSCache DNSCache
@cmars

cmars May 26, 2017

Owner

godoc please

api/interface.go
// Clock is used as a time source for retries.
// If it is nil, clock.WallClock will be used.
Clock clock.Clock
}
+type IPAddrResolver interface {
api/interface.go
+ LookupIPAddr(ctx context.Context, host string) ([]net.IPAddr, error)
+}
+
+type DNSCache interface {

cmars approved these changes May 26, 2017

I think the point of this is to re-resolve names in case the IP address has changed since we last cached the lookup; this will be more resilient than getting stuck on the first address we resolve in controllers.yaml.

LGTM with a suggestion for another test case, just to really exercise the "cache used" logic through a persistent condition of being unable to resolve the hostname. I'd also like someone from core to look at this.

+ c.Assert(dnsCache.Lookup("place1.example"), jc.DeepEquals, []string{"0.2.2.2"})
+}
+
+func (s *apiclientSuite) TestWithUnresolvableAddr(c *gc.C) {
@cmars

cmars May 26, 2017

Owner

Should we add a test for unresolvable addresses that uses the DNSCache & "cache used" logic as well?

@rogpeppe

rogpeppe May 26, 2017

Owner

Doesn't TestFallbackToIPLookupWhenCacheOutOfDate fit that bill?

@rogpeppe

rogpeppe May 30, 2017

Owner

Added TestWithUnresolvableAddrAfterCacheFallback to address this request.

Owner

jameinel commented May 30, 2017

Just to ask, why do a DNS cache at all? The only original reason that I'm aware of is because of places like MAAS where we needed to check that we could use the hostname, by checking if DNS knew about it, and if we couldn't then we wouldn't include it in the saved-list-of-addressess-to-try.
(Often maas hosts will return their maas name 'blueberry-pie.maas' which is only valid if you are using MAAS for your DNS resolution.)
Aside from that, why not just say "ok, we can resolve foo.jaas, so leave foo.jaas as the host to connect to". That also handles things like "what hostname should the cert be signed for". Otherwise we end up connecting to 55.xxxx and then have to do a reverse lookup to figure out "oh, that was actually foo.jaas".

+ logger.Infof("ignoring IP address with zone %q", addr)
+ continue
+ }
+ ips = append(ips, addr.IP.String())
@wupeka

wupeka May 30, 2017

Member

Wouldn't it be saner to put those addresses on the list in '%' notation?
ip := addr.IP.String()
if addr.Zone != "" {
ip = fmt.Sprintf("%s%%%s", o[, addr.Zone)
}
ips.append(ips, ip)

(although I think that it wouldn't be a problem too, since I can't think of a way we'd get a zoned address from resolver)

@rogpeppe

rogpeppe May 30, 2017

Owner

Currently the zoned IP addresses aren't parsable by any of our network code - they're not understood as valid IP addresses AFAIK, so I think it's best to exclude them unless there's a good reason not to.

the actual functionality seems generally ok, except for wondering why we're doing dns caching in the first place.

// Technically when there's no CACert, we don't need this
// machinery, because we could just use http.DefaultTransport
// for everything, but it's easier just to leave it in place.
bakeryClient.Client.Transport = &hostSwitchingTransport{
- primaryHost: apiHost,
+ primaryHost: dialResult.addr,
@jameinel

jameinel May 30, 2017

Owner

// addr is the address used to connect to the API server.

sure sounds like it might have a :PORT in there.
Maybe we could update the docs to say it is either a hostname or an IP address?
Passing 'addr' to 'Host' is always something that feels weird to me, as there is no guarantee an "addr" is a hostname.

api/apiclient.go
// TODO(rogpeppe) We'd like to set Deadline here
// but that would break lots of tests that rely on
// setting a zero timeout.
netDialer := net.Dialer{}
dialer := &websocket.Dialer{
NetDial: func(netw, addr string) (net.Conn, error) {
+ if ipAddr != "" {
@jameinel

jameinel May 30, 2017

Owner

This sort of thing always feels bad to me.
We've done "foo.NetDial(X)" but under the covers we actually force that to be "net.Dial(Y)" and we never even check that 'addr' matches the object we're trying to substitute.
If we wanted to dial Y, why didn't we do so from the beginning?
I'm guessing this is again the hostname vs IP thing, but I don't quite understand why we need to replace DNS.

@rogpeppe

rogpeppe May 30, 2017

Owner

We're not replacing DNS - we're just doing the DNS lookup separately so that we can do the dials concurrently. The logic inside net.Dial just tries the addresses sequentially, so if the first IP address refers to a controller that's down or unavailable, the dial might take a very long time to succeed even though another of the IP addresses is immediately available.

We need the TLS logic to understand that it's connecting to the pre-DNS-resolution address, but we want the net dial to go directly to the IP address that we've chosen.

- select {
- case <-opts.Clock.After(opts.DialAddressInterval):
- case <-try.Dead():
+ ips := opts.DNSCache.Lookup(host)
@jameinel

jameinel May 30, 2017

Owner

It seems we always do this even if host is an IP address, is that true? Or is it just assumed that IPs would never be in the cache so the lookup is cheap/no-op and won't return ips?

api/apiclient.go
+ cacheUsed = append(cacheUsed, addr)
+ } else {
+ var err error
+ ips, err = lookupIPAddr(ctx, host, opts.IPAddrResolver)
@jameinel

jameinel May 30, 2017

Owner

this again seems less cheap in the case of host being an IP already, but maybe lookupIPAddr also has code for fast-pathing things that are already IPs?

@rogpeppe

rogpeppe May 30, 2017

Owner

lookupIPAddr does have code for fast-pathing things that are already IPs, but you're absolutely right that this code may end up putting raw IP addrs into cacheUsed and into the cache itself, which is wrong. Fixed - good catch, thanks.

+ continue
+ }
+ opts.DNSCache.Add(host, ips)
+ logger.Debugf("looked up %v -> %v", host, ips)
@jameinel

jameinel May 30, 2017

Owner

but that would mean that this would end up caching IP=>[IP].
It definitely doesn't make sense to put raw IP addrs into cacheUsed.

api/apiclient.go
- conn, err = d.opts.DialWebsocket(d.ctx, d.urlStr, &tlsConfig1)
+ tlsConfig.RootCAs = nil
+ tlsConfig.ServerName = d.serverName
+ conn, err = d.opts.DialWebsocket(d.ctx, d.urlStr, tlsConfig, d.ipAddr)
@jameinel

jameinel May 30, 2017

Owner

This will conflict with the patch that Witold just landed to always return the original error, instead of sometimes returning the fallback path.

+ SkipLogin: true,
+ CACert: jtesting.CACert,
+ }, api.DialOpts{
+ Timeout: 5 * time.Second,
@jameinel

jameinel May 30, 2017

Owner

does this mean that this particular test takes 5s to retry 5 times before deciding to fallback to resolver instead of using the local cache?
Can we make it a bit faster (100ms/500ms)?
Maybe it doesn't because we also pass in fakeClock?

@rogpeppe

rogpeppe May 30, 2017

Owner

It doesn't because we also pass in fakeClock :)

+ if err != nil {
+ panic(err)
+ }
+ r[host] = []string{fmt.Sprintf("0.%[1]d.%[1]d.%[1]d", i+1)}
@jameinel

jameinel May 30, 2017

Owner

odd side-note, github's golang renderer doesn't understand this syntax and flags the '%' characters as being an error.

@rogpeppe

rogpeppe May 30, 2017

Owner

Interesting. I wonder where we could raise an issue about that. I've seen other similar glitches elsewhere but didn't realise it was trying to flag an error.

Owner

rogpeppe commented May 30, 2017

Just to ask, why do a DNS cache at all? The only original reason that I'm aware of is because of places like MAAS where we needed to check that we could use the hostname, by checking if DNS knew about it, and if we couldn't then we wouldn't include it in the saved-list-of-addressess-to-try. (Often maas hosts will return their maas name 'blueberry-pie.maas' which is only valid if you are using MAAS for your DNS resolution.) Aside from that, why not just say "ok, we can resolve foo.jaas, so leave foo.jaas as the host to connect to". That also handles things like "what hostname should the cert be signed for". Otherwise we end up connecting to 55.xxxx and then have to do a reverse lookup to figure out "oh, that was actually foo.jaas".

This is a good question. If I didn't know about the MAAS case then I'd support dropping DNS caching entirely, but if we regularly encounter addresses that take a very long time to resolve then it probably still makes sense.

I think that it's important to keep the original set of addresses returned from the API server intact so that when the addresses are used in another context (for example a controllers.yaml entry can be passed between machines and across network zones) all the original addresses are still available. An existing bug relates to the fact that we're dropping unresolved addresses from APIAddresses and leaving them in UnresolvedAPIAddresses, which seems intuitively right, but those two fields don't have an easily explainable relationship, so this PR and its followup are about simplifying that logic by leaving DNS resolution to the API dialing code exclusively, but providing a callback mechanism (DNSCache) so that we can remember DNS entries when needed.

I'd be interested to have a chat about this - I'd love to make the code simpler if it were reasonable to do so.

api: add explicit DNS cache
The API connection logic in juju.NewAPIConnection implements
a half-hearted DNS cache by resolving DNS addresses itself
and storing them in the controllers.yaml fields.
This obscures the actual addresses that have been
returned from the controller.

Instead of that, we make api.Open responsible for
all DNS resolution and pass a DNS cache interface
into DialOpts so that the NewAPIConnection logic
can still do its own DNS caching.

This means that the API connection logic always knows
the symbolic host name of the controller that it's
dialing, which is important when using public certificate
validation.

It also means that even when a Go client calls api.Open with
a single host name, it will get the benefit of concurrent
dialing to each of the resolved IP addresses.

I'm not sure that we want to be caching DNS at all, but I don't think this makes what we're doing particularly worse. So we'll have a chat out-of-band towards where we want to be.

Owner

rogpeppe commented May 30, 2017

$$merge$$

Contributor

jujubot commented May 30, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 0cb801b into juju:develop May 30, 2017

1 check passed

github-check-merge-juju Built PR, ran unit tests, and tested LXD deploy. Use !!.*!! to request another build. IE, !!build!!, !!retry!!
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment