Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose TTLs on lookups #444

Merged
merged 10 commits into from
May 3, 2018
Merged

Conversation

hawkw
Copy link
Contributor

@hawkw hawkw commented May 1, 2018

This PR changes the trust-dns-resolver API to expose the TTLs of DNS
lookups, which is needed downstream in Conduit (linkerd/linkerd2#711).

Reviewers should note the following:

  • In some cases, Lookups were constructed in places where the TTL is not
    known or there is no TTL (e.g. for localhost in lookup_state.rs).
    Therefore, I made the TTL field on Lookup optional, and added a new
    Lookup::new_with_ttl constructor, to avoid breaking existing code.
  • When appending two Lookups, if both have known TTLs, the returned
    Lookup has the lower of the two TTLs. Thus, the appended Lookup
    should time out when the shorter-lived of either of its constituents
    does. This seemed like the correct behaviour to me, so that append
    doesn't become a way for short-lived Lookups to escape their TTLs,
    but I wasn't sure about this --- please let me know if this intuition
    is incorrect!
  • I've added a simple unit test for the added TTL function in lookup.rs.

Signed-off-by: Eliza Weisman eliza@buoyant.io

hawkw added 3 commits May 1, 2018 15:42
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@briansmith
Copy link
Contributor

  1. It seems better for Trust-DNS to expose an expiration time (a deadline, an Instant) instead of the TTL (a Duration). In the case where the response is not from the internal cache then the time from which it should be measured is the time the query was sent. In the case where the response is from the internal cache then the time from which it should be measured is the time of the original query. How is the application supposed to know which of those two times to use?

  2. In the case the Trust-DNS cache is active, how would the caller use this API to guarantee that its next request's response isn't served from the cache?

@@ -62,7 +81,14 @@ impl Lookup {
rdatas.extend_from_slice(&*self.rdatas);
rdatas.extend_from_slice(&*other.rdatas);

Self::new(Arc::new(rdatas))
// If both lookups have TTLs, choose the lower one.
match (self.ttl(), other.ttl()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we need to compare the expiration times, not the TTLs, here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on what you want to track here. The raw TTL, or the time in which the the record is no longer valid. The RFCs definitely recommend the latter.

Copy link
Member

@bluejekyll bluejekyll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great. Thanks for the PR! I’ll merge this in once the tests pass.


Edit: I missed @briansmith comments. let's work that out.

).wait()
.unwrap()
.ttl(),
Some(Duration::from_secs(86400))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have a test for CNAME, in particular where the CNAME expires earlier than the A records.

It would also be good to have a test for NXDOMAIN that shows how this works for NXDOMAIN. In particular, "The remaining of the current meanings, of being the TTL to be used for negative responses, is the new defined meaning of the SOA minimum field." from https://tools.ietf.org/html/rfc2308; see https://tools.ietf.org/html/rfc2308#section-5.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some tests for TTLs as they relate to CNAME: https://github.com/bluejekyll/trust-dns/blob/master/resolver/src/lookup_state.rs#L946

We might have a gap on the NXDOMAIN/NoData responses, in regards to TTL.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@codecov
Copy link

codecov bot commented May 1, 2018

Codecov Report

Merging #444 into master will increase coverage by 0.1%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #444     +/-   ##
=========================================
+ Coverage   86.54%   86.64%   +0.1%     
=========================================
  Files         133      133             
  Lines       13391    13399      +8     
=========================================
+ Hits        11589    11610     +21     
+ Misses       1802     1789     -13
Impacted Files Coverage Δ
resolver/src/lookup_ip.rs 90.37% <ø> (ø) ⬆️
resolver/src/hosts.rs 95.53% <100%> (ø) ⬆️
resolver/src/lookup.rs 68.71% <100%> (+2.23%) ⬆️
resolver/src/dns_lru.rs 100% <100%> (ø) ⬆️
resolver/src/resolver_future.rs 72.05% <100%> (ø) ⬆️
server/src/config.rs 83.33% <0%> (+1.05%) ⬆️
client/src/rr/dnssec/signer.rs 98.95% <0%> (+1.36%) ⬆️
resolver/src/name_server_pool.rs 86.92% <0%> (+1.93%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 97afd93...1dfc697. Read the comment docs.

@@ -63,7 +63,7 @@ impl DnsLru {
let ttl_until = now + ttl;

// insert into the LRU
let lookup = Lookup::new(Arc::new(rdatas));
let lookup = Lookup::new_with_ttl(Arc::new(rdatas), ttl);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per @briansmith recommendation, it seems ttl_until is what is wanted here, no? I missed that before

@hawkw
Copy link
Contributor Author

hawkw commented May 1, 2018

@briansmith

It seems better for Trust-DNS to expose an expiration time (a deadline, an Instant) instead of the TTL (a Duration). In the case where the response is not from the internal cache then the time from which it should be measured is the time the query was sent. In the case where the response is from the internal cache then the time from which it should be measured is the time of the original query. How is the application supposed to know which of those two times to use?

I agree, that seems much more correct. In 75cbbcd and ffc9cae, I've modified this PR so that the new field represents the expiration time for the lookup rather than the TTL.

@@ -63,7 +63,7 @@ impl DnsLru {
let ttl_until = now + ttl;

// insert into the LRU
let lookup = Lookup::new(Arc::new(rdatas));
let lookup = Lookup::new_with_ttl(Arc::new(rdatas), ttl_until);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest renaming things to be consistent. Your new API functions is called valid_until. I suggest renaming the ttl_util variable here the same thing. I also suggest renaming Lookup::new_with_ttl to Lookup::new_valid_until. Then it becomes easier to understand the code holistically.

}

/// Return a new instance with the given rdatas and TTL.
pub fn new_with_deadline(rdatas: Arc<Vec<RData>>, ttl: Instant,) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ttl should be renamed valid_until.

(Some(my_ttl), Some(other_ttl)) =>
Self::new_with_deadline(Arc::new(rdatas), min(my_ttl, other_ttl)),
(Some(ttl), None) | (None, Some(ttl)) =>
Self::new_with_deadline(Arc::new(rdatas), ttl),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear that this logic is correct when mixing Some(ttl) and None. What's the rationale here? Is it even possible that this could happen for lookup_ip or is this something that could only happen for other kinds of queries?

OTOH maybe RFC 2308 means we should always have a TTL; if not from the record, then from the overall response? I'm not sure since I don't know all the DNS details.

Regardless, we should write down the reasoning, at least.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, the only time Lookup::append is called on a Lookup that doesn't have a TTL is here:
https://github.com/bluejekyll/trust-dns/blob/97afd93b007e0ba4140be4f09ba7245f821f1fa9/resolver/src/hosts.rs#L63-L76

In this case, it seems that the empty lookups are only created as placeholders and should be immediately overwritten with the new lookup, so in this case, the logic seems correct. Everywhere else where Lookup::append is called, the lookups are actual responses to a DNS query and should always have a TTL.

I'd be happy to add a comment explaining this in the method, if you like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hosts is an interesting case. It's driven by a file, as opposed to an actual lookup. I wonder if we could remove the optional, and just pick the max_ttl in this case. At the moment, we don't support live reloading of the hosts file. So if this is indeed the only case, then I say we get rid of it by giving it a better default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the other main case whereLookups are constructed without a TTL are the loopbacks. Do you think it's okay for localhost to appear to have MAX_TTL?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just double checked the code, yes, I believe this is safe. My only concern was if this could have some effect on the LRU cache, but I think both of these code paths return before cache is either used. So we don't have to worry about what that would mean.

So yes, I think we can safely simplify this to default to MAX_TTL and thus get rid of the Option on TTL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bluejekyll okay, sounds good! I'll make that change shortly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bluejekyll: changed in 526b8d7

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
pub fn new(rdatas: Arc<Vec<RData>>) -> Self {
Lookup { rdatas }
/// Return new instance with given rdatas and the maximum TTL.
pub fn new(rdatas: Arc<Vec<RData>>,) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should switch every caller to use new_with_deadline and have them pass in valid_until. That way we don't accidentally default when we shouldn't. Is that practical or is there something that makes that awkward?

(NIT: Are we really going to do this trailing comma in arguments thing in rustfmt?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should switch every caller to use new_with_deadline and have them pass in valid_until. That way we don't accidentally default when we shouldn't. Is that practical or is there something that makes that awkward?

I don't think there's anything that makes that particularly difficult. I originally chose to leave the new constructor as-is and make it default because I wanted to keep the diff small, but I'm happy to change this so the caller is responsible for defaulting if you think that's better.

/// Return new instance with given rdatas and the maximum TTL.
pub fn new(rdatas: Arc<Vec<RData>>,) -> Self {
let valid_until =
Instant::now() + Duration::from_secs(MAX_TTL as u64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an Instant::MAX or something like it that can be used instead?
Instant::now() is relatively expensive (potentially a syscall). MAX_TTL is for parsing/validating input on the wire.

If there's no Instant::MAX or equivalent then maybe we should use something like enum Validity { ValidUntil(Instant), Forever }, isomorphic to Option, with comparison semantics such that ValidUntil(x) < Forever for all x.

Copy link
Contributor Author

@hawkw hawkw May 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@briansmith

If there's no Instant::MAX or equivalent then maybe we should use something like enum Validity { ValidUntil(Instant), Forever }, isomorphic to Option, with comparison semantics such that ValidUntil(x) < Forever for all x.

I like this approach (and you're correct that there's no Instant::MAX).

However, I would like to get @bluejekyll's opinion on whether we should do something like this rather than using MAX_TTL (which he suggested in https://github.com/bluejekyll/trust-dns/pull/444#discussion_r185392627) before I make more changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I’m fine with MAX_TTL. I understand the concern @briansmith has, but I don’t feel like we need to over complicate this. As long as we guarantee that the correct TTL is used on insert.

Forcing the caller to pick a good value, rather than a decent default, seems potentially error prone as well. We could change ‘new’ to ‘with_max_ttl’ to make it more explicit.

@hawkw
Copy link
Contributor Author

hawkw commented May 2, 2018

@bluejekyll It looks to me like the test failures on the most recent build of this branch are unrelated to my changes, do you agree that seems to be the case?

@bluejekyll
Copy link
Member

bluejekyll commented May 2, 2018

I just kicked them off again. There are a couple of tests that have some races between server startup and the test running. I need to track those down and fix them.

Not your issue.

Copy link
Contributor

@briansmith briansmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@bluejekyll
Copy link
Member

Thanks for the great work and the contribution!

@bluejekyll bluejekyll merged commit 1e43522 into hickory-dns:master May 3, 2018
@hawkw
Copy link
Contributor Author

hawkw commented May 3, 2018

@bluejekyll thank you (and @briansmith) for all your guidance and feedback! :)

hawkw referenced this pull request in linkerd/linkerd2 Jun 1, 2018
Closes #711. Depends on #967.

This PR changes the proxy's `destination` module to honor the TTLs associated
with DNS lookups, now that bluejekyll/trust-dns#444 has been merged and we can
access this information from the Trust-DNS Resolver API. 

The `destination::background::DestinationSet` type has been modified so that, 
when a successful result is received for a DNS query, the DNS server will be 
polled again after the deadline associated with that query, rather than after
a fixed deadline. The fixed deadline is still used to determine when to poll
again for negative DNS responses or for errors.

Furthermore, Conduit now accepts an optional CONDUIT_PROXY_DNS_MIN_TTL 
environment variable that will configure a minimum TTL for DNS results. If the
deadline of a DNS response gives it a TTL shorter than the configured minimum,
Conduit will not poll DNS again until after that minimum TTL is elapsed. By
default, there is no minimum value set, as this feature is intended primarily
for when Conduit is run locally for development purposes.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
olix0r referenced this pull request in linkerd/linkerd2-proxy Jul 7, 2018
Closes #711. Depends on #967.

This PR changes the proxy's `destination` module to honor the TTLs associated
with DNS lookups, now that bluejekyll/trust-dns#444 has been merged and we can
access this information from the Trust-DNS Resolver API. 

The `destination::background::DestinationSet` type has been modified so that, 
when a successful result is received for a DNS query, the DNS server will be 
polled again after the deadline associated with that query, rather than after
a fixed deadline. The fixed deadline is still used to determine when to poll
again for negative DNS responses or for errors.

Furthermore, Conduit now accepts an optional CONDUIT_PROXY_DNS_MIN_TTL 
environment variable that will configure a minimum TTL for DNS results. If the
deadline of a DNS response gives it a TTL shorter than the configured minimum,
Conduit will not poll DNS again until after that minimum TTL is elapsed. By
default, there is no minimum value set, as this feature is intended primarily
for when Conduit is run locally for development purposes.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants