Do not retry the same name server on a negative response #1589

peterthejohnston · 2021-11-17T22:24:30Z

Currently in Trust-DNS, there are two mechanisms that allow failed queries to be retried by the resolver:

the name server pool retries negative responses against other name servers in the resolver's pool, depending on its configuration
the RetryDnsHandle reattempts queries against the name server pool if they fail

The name server pool retries basically any unsuccessful response against fallback name servers, unless it gets a trusted NoRecordsFound error, which occurs when NameServerConfig::trust_nx_responses is true for that server and the resolver received an empty NXDomain response.

The RetryDnsHandle uses the RetryableError trait to determine if an error should be retried. However, the implementation of RetryableError::should_retry for ResolveError uses the same criteria as the name server pool, which I think is not the desired behavior. This leads to the same query being retried to the same name server when it shouldn't be.

For example (this was real behavior observed when testing the resolver on a device running Fuchsia):

let's say a resolver has 3 name servers configured
a query gets a NODATA response from the first name server, and the name server pool retries the query on the other ones, getting the same response
the RetryDnsHandle now retries that entire query over the whole name server pool again, because it got an error for which RetryableError::should_retry is true. This happens ResolverOpts::attempts number of times.

in effect, we send this query 3 (# of name servers) * 3 (# of total attempts) = 9 times, 3 times to each name server. The name server pool is used correctly here to retry on a negative response; however, the RetryDnsHandle should probably only be used on IO errors (e.g. we failed to connect to a given server) or other errors on which it's reasonable to ask the same name server again. If we successfully get a negative response from a server, e.g. a NODATA response, it doesn't make sense to expect an OK response when we retry, so we should not be retrying the query to that same name server.

The desired end state is one where, if the resolver encounters no IO errors, only one query is made to each name server in the pool, at most.

codecov · 2021-11-17T22:43:02Z

Codecov Report

Merging #1589 (4471bdf) into main (fd500e4) will decrease coverage by 0.00%.
The diff coverage is 66.67%.

@@            Coverage Diff             @@
##             main    #1589      +/-   ##
==========================================
- Coverage   79.49%   79.49%   -0.00%     
==========================================
  Files         180      180              
  Lines       17971    17973       +2     
==========================================
+ Hits        14286    14287       +1     
- Misses       3685     3686       +1

djc

These changes look good to me. Looking at the changes here, it's difficult to understand how the trust factor still plays into it. Are there sufficient comments that discuss the interaction of name server pool retries and RetryDnsHandle retries? Now that you have state in your head, would be good to make sure it's sufficiently documented.

djc · 2021-11-19T14:27:36Z

(Might also be good to rebase this on top of current main to deal with the tokio issue in CI.)

peterthejohnston · 2021-11-19T21:59:07Z

Looking at the changes here, it's difficult to understand how the trust factor still plays into it.

That's a good point. The trust_nx_responses config is only relevant to the name server pool. Here's how it threads through:

when the name server gets a response, it passes trust_nx_responses to ResolveError::from_response here
ResolveError::from_response uses the parameter to decide whether an nx error should be marked trusted here
finally, the name server pool returns a error immediately rather than trying other name servers if it is trusted here

Now that you have state in your head, would be good to make sure it's sufficiently documented.

That's a great point, I don't think their interaction is very well documented, partly because they are in separate crates (the name server pool is part of the resolver and the RetryDnsHandle is a generic retry wrapper in the proto crate). Where do you think would be a good place to add documentation about their interaction? I'll add some more detail to the doc comments on the respective types as well, to clarify the purpose of retry behavior on each.

peterthejohnston · 2021-11-23T17:39:54Z

I've rebased on top of current main and added some more documentation on RetryDnsHandle and NameServerConfig::trust_nx_responses. PTAL!

djc · 2021-11-24T17:17:42Z

@peterthejohnston the documentation changes here look great, thanks!

djc reviewed Nov 19, 2021

View reviewed changes

peterthejohnston force-pushed the main branch 2 times, most recently from 88253e3 to 336632b Compare November 23, 2021 17:38

Do not retry the same name server on a negative response

4471bdf

peterthejohnston force-pushed the main branch from 336632b to 4471bdf Compare November 23, 2021 18:08

bluejekyll approved these changes Nov 24, 2021

View reviewed changes

bluejekyll merged commit a884254 into hickory-dns:main Nov 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not retry the same name server on a negative response #1589

Do not retry the same name server on a negative response #1589

peterthejohnston commented Nov 17, 2021

codecov bot commented Nov 17, 2021 •

edited

Loading

djc left a comment

djc commented Nov 19, 2021 •

edited

Loading

peterthejohnston commented Nov 19, 2021

peterthejohnston commented Nov 23, 2021

djc commented Nov 24, 2021

Do not retry the same name server on a negative response #1589

Do not retry the same name server on a negative response #1589

Conversation

peterthejohnston commented Nov 17, 2021

codecov bot commented Nov 17, 2021 • edited Loading

Codecov Report

djc left a comment

Choose a reason for hiding this comment

djc commented Nov 19, 2021 • edited Loading

peterthejohnston commented Nov 19, 2021

peterthejohnston commented Nov 23, 2021

djc commented Nov 24, 2021

codecov bot commented Nov 17, 2021 •

edited

Loading

djc commented Nov 19, 2021 •

edited

Loading