Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not retry the same name server on a negative response #1589

Merged
merged 1 commit into from
Nov 24, 2021

Conversation

peterthejohnston
Copy link
Contributor

Currently in Trust-DNS, there are two mechanisms that allow failed queries to be retried by the resolver:

  • the name server pool retries negative responses against other name servers in the resolver's pool, depending on its configuration
  • the RetryDnsHandle reattempts queries against the name server pool if they fail

The name server pool retries basically any unsuccessful response against fallback name servers, unless it gets a trusted NoRecordsFound error, which occurs when NameServerConfig::trust_nx_responses is true for that server and the resolver received an empty NXDomain response.

The RetryDnsHandle uses the RetryableError trait to determine if an error should be retried. However, the implementation of RetryableError::should_retry for ResolveError uses the same criteria as the name server pool, which I think is not the desired behavior. This leads to the same query being retried to the same name server when it shouldn't be.

For example (this was real behavior observed when testing the resolver on a device running Fuchsia):

  • let's say a resolver has 3 name servers configured
  • a query gets a NODATA response from the first name server, and the name server pool retries the query on the other ones, getting the same response
  • the RetryDnsHandle now retries that entire query over the whole name server pool again, because it got an error for which RetryableError::should_retry is true. This happens ResolverOpts::attempts number of times.

in effect, we send this query 3 (# of name servers) * 3 (# of total attempts) = 9 times, 3 times to each name server. The name server pool is used correctly here to retry on a negative response; however, the RetryDnsHandle should probably only be used on IO errors (e.g. we failed to connect to a given server) or other errors on which it's reasonable to ask the same name server again. If we successfully get a negative response from a server, e.g. a NODATA response, it doesn't make sense to expect an OK response when we retry, so we should not be retrying the query to that same name server.

The desired end state is one where, if the resolver encounters no IO errors, only one query is made to each name server in the pool, at most.

@codecov
Copy link

codecov bot commented Nov 17, 2021

Codecov Report

Merging #1589 (4471bdf) into main (fd500e4) will decrease coverage by 0.00%.
The diff coverage is 66.67%.

@@            Coverage Diff             @@
##             main    #1589      +/-   ##
==========================================
- Coverage   79.49%   79.49%   -0.00%     
==========================================
  Files         180      180              
  Lines       17971    17973       +2     
==========================================
+ Hits        14286    14287       +1     
- Misses       3685     3686       +1     

Copy link
Collaborator

@djc djc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me. Looking at the changes here, it's difficult to understand how the trust factor still plays into it. Are there sufficient comments that discuss the interaction of name server pool retries and RetryDnsHandle retries? Now that you have state in your head, would be good to make sure it's sufficiently documented.

@djc
Copy link
Collaborator

djc commented Nov 19, 2021

(Might also be good to rebase this on top of current main to deal with the tokio issue in CI.)

@peterthejohnston
Copy link
Contributor Author

Looking at the changes here, it's difficult to understand how the trust factor still plays into it.

That's a good point. The trust_nx_responses config is only relevant to the name server pool. Here's how it threads through:

  • when the name server gets a response, it passes trust_nx_responses to ResolveError::from_response here
  • ResolveError::from_response uses the parameter to decide whether an nx error should be marked trusted here
  • finally, the name server pool returns a error immediately rather than trying other name servers if it is trusted here

Now that you have state in your head, would be good to make sure it's sufficiently documented.

That's a great point, I don't think their interaction is very well documented, partly because they are in separate crates (the name server pool is part of the resolver and the RetryDnsHandle is a generic retry wrapper in the proto crate). Where do you think would be a good place to add documentation about their interaction? I'll add some more detail to the doc comments on the respective types as well, to clarify the purpose of retry behavior on each.

@peterthejohnston
Copy link
Contributor Author

I've rebased on top of current main and added some more documentation on RetryDnsHandle and NameServerConfig::trust_nx_responses. PTAL!

@bluejekyll bluejekyll merged commit a884254 into hickory-dns:main Nov 24, 2021
@djc
Copy link
Collaborator

djc commented Nov 24, 2021

@peterthejohnston the documentation changes here look great, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants