always retry on DNS error response from name server #1526

Merged 3 commits on Aug 4, 2021

Contributor

@peterthejohnston peterthejohnston commented Jul 26, 2021

This PR makes the two changes proposed in #1519:

  • In the resolver, always try other name servers when we receive a response with an error code. (i.e. only fail the query terminally if we have already tried all of them.)
  • Eventually return the error received from the first name server, so in the case of a "legitimate" error, we prefer the response from the highest priority server we queried.
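
The two bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Rcode`, `resolve_with_fallback`, and the `query` callback are hypothetical names, and the answer is simplified to a `String`.

```rust
/// Hypothetical stand-in for a DNS response code; not the trust-dns type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rcode {
    NoError,
    ServFail,
    NXDomain,
    Refused,
}

/// Queries each name server in priority order through `query` (a hypothetical
/// callback). Any error response falls through to the next server. If every
/// server fails, the error from the first (highest priority) server is returned.
pub fn resolve_with_fallback<S, F>(servers: &[S], mut query: F) -> Result<String, Rcode>
where
    F: FnMut(&S) -> Result<String, Rcode>,
{
    let mut first_err: Option<Rcode> = None;
    for server in servers {
        match query(server) {
            Ok(answer) => return Ok(answer),
            Err(code) => {
                // Keep only the first error; later errors never overwrite it.
                first_err.get_or_insert(code);
            }
        }
    }
    // An empty server list is reported as ServFail purely for illustration.
    Err(first_err.unwrap_or(Rcode::ServFail))
}
```

With servers `["a", "b", "c"]` where `a` answers `Refused` and `b` answers `ServFail`, a success from `c` is returned as-is; if `c` also fails, the caller sees `Refused` from `a`.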

I do have two outstanding questions:

  • For some reason, ResponseCode::BADMODE does not lead to a retry in this test. Any idea what's going on here, or whether ResponseCode::BADMODE has some kind of special treatment I'm unaware of?
  • Where can I add an integration test for the "return first error received" behavior? In name_server_pool_tests? If so, I think I'll need to pull in RetryDnsHandle from the trust-dns-proto crate there.

resolves #1519

@codecov

codecov bot commented Jul 26, 2021

Codecov Report

Merging #1526 (e167528) into main (dcc5289) will decrease coverage by 0.08%.
The diff coverage is 10.53%.

@@            Coverage Diff             @@
##             main    #1526      +/-   ##
==========================================
- Coverage   83.55%   83.47%   -0.08%     
==========================================
  Files         171      171              
  Lines       16879    16896      +17     
==========================================
  Hits        14103    14103              
- Misses       2776     2793      +17     

@bluejekyll
Member

bluejekyll commented Jul 27, 2021

For some reason, ResponseCode::BADMODE does not lead to a retry in this test. Any idea what's going on here, or whether ResponseCode::BADMODE has some kind of special treatment I'm unaware of?

I'm not aware of anything specific to BADMODE. BADVERS is awkward because it's the first value of the extended ResponseCode type, which has additional bits added from EDNS, so that shouldn't matter here. What could be happening (I'll need to review your code and the tests more closely) is that BADMODE requires EDNS, since it falls outside the original 4-bit response code and lands in the extended ResponseCode space. So there may be an issue in the tests related to that... not sure.
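
For reference, the split between the header and EDNS can be sketched like this. Per RFC 6891, the extended response code is 12 bits wide: the low 4 bits live in the DNS header and the upper 8 bits travel in the OPT record. The function names `split_rcode` and `join_rcode` are hypothetical and are not the trust-dns API.

```rust
/// Splits a 12-bit extended response code into (header bits, OPT bits).
/// The first element is the 4-bit value written to the DNS header, the
/// second is the 8-bit value carried in the EDNS OPT record.
pub fn split_rcode(code: u16) -> (u8, u8) {
    let low = (code & 0x000F) as u8;
    let high = ((code >> 4) & 0x00FF) as u8;
    (low, high)
}

/// Reassembles the 12-bit code from the header bits and the OPT bits.
/// Without an OPT record, `high` is effectively 0.
pub fn join_rcode(low: u8, high: u8) -> u16 {
    (u16::from(high) << 4) | u16::from(low & 0x0F)
}
```

A code that needs more than 4 bits only survives the round trip when the upper bits are actually transmitted; with `high` set to 0 the receiver reconstructs a different, smaller code.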

Where can I add an integration test for the "return first error received" behavior? In name_server_pool_tests? If so, I think I'll need to pull in RetryDnsHandle from the trust-dns-proto crate there.

I can't think of a reason why moving RetryDnsHandle into proto would be a problem. I don't think it will pull in any external dependencies. That's the primary reason for the separation of all of these crates. (edit: I think I misunderstood your comment here. The integration tests depend on all crates I think, so I don't see a problem sharing tests across the various use cases)

Member

@bluejekyll bluejekyll left a comment

This all looks good to me.

Member

@djc djc left a comment

Looks pretty good to me, though I still have some suggestions for improvements.

crates/proto/src/xfer/retry_dns_handle.rs (outdated)
crates/proto/src/xfer/retry_dns_handle.rs (outdated)
crates/resolver/src/error.rs (outdated)
@peterthejohnston
Contributor Author

Also, clippy is complaining on the latest commit but it's unrelated to these changes, so maybe that can be resolved in a separate change.

@peterthejohnston
Contributor Author

peterthejohnston commented Jul 29, 2021

Ok, after writing an integration test and doing some investigation, it appears that I had some misconceptions about retry behavior. I thought that the "fallback" behavior that I'm trying to modify was performed by RetryDnsHandle; in fact, NameServerPool is responsible for this and cycles through name servers when an error is received here, always returning the last/most recent error received of the highest specificity.

So, my changes to RetryDnsHandle are actually ill-conceived. The change to "return the error from the highest priority name server" should actually happen in name_server_pool.rs, so I've updated the PR to reflect that. Also, I think this makes more sense in context: when cycling through our pool of name servers, rather than always updating the error unless it is a less specific error, instead we only update the error if it is a more specific error, which allows us to return the error that is

  • first, the most specific
  • second, from the highest priority name server (i.e. received earliest).

Sorry about the thrash, let me know what you think.
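
The difference between the two update rules described above can be sketched as follows. This is an illustration under assumptions: `Kind`, `PoolError`, and `pick` are hypothetical names, and the specificity ranking used here is invented for the example rather than taken from name_server_pool.rs.

```rust
/// Hypothetical error kinds, declared from least to most specific so the
/// derived `Ord` doubles as the specificity ranking (an assumed ranking).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Kind {
    Timeout,
    Io,
    ErrorResponse,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PoolError {
    pub kind: Kind,
    pub server: &'static str,
}

/// Folds the errors seen while cycling through the pool, in priority order.
/// `ties_go_to_latest = true` models "update unless less specific";
/// `ties_go_to_latest = false` models "update only if more specific".
pub fn pick(errors: &[PoolError], ties_go_to_latest: bool) -> Option<PoolError> {
    let mut best: Option<PoolError> = None;
    for e in errors {
        let replace = match &best {
            None => true,
            Some(b) => e.kind > b.kind || (ties_go_to_latest && e.kind == b.kind),
        };
        if replace {
            best = Some(e.clone());
        }
    }
    best
}
```

Given errors from `ns1` (`ErrorResponse`), `ns2` (`Timeout`), and `ns3` (`ErrorResponse`), the first rule returns the error from `ns3` and the second returns the one from `ns1`, which is the most specific error from the highest priority server.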

@peterthejohnston peterthejohnston force-pushed the retry-on-errors branch 2 times, most recently from 4412b87 to fc339bc, on July 29, 2021 20:10
@peterthejohnston
Contributor Author

peterthejohnston commented Jul 29, 2021

Also, you were right @bluejekyll, only the lower 4 bits of the response code were being set in the header. This wasn't caught for the other errors that use more than 4 bits because when truncated they ended up being retryable errors as well; BADMODE (19, i.e. 10011b) turned into NXDomain (3, i.e. 11b).

I've removed all the errors that require EDNS from that test.
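
The truncation, and the filter applied to the test inputs, can be sketched like this. The constants and the helpers `header_only` and `requires_edns` are hypothetical names for illustration; only the numeric values 19 and 3 come from the comment above.

```rust
/// Numeric value of BADMODE, as quoted above.
pub const BADMODE: u16 = 19;
/// Numeric value of NXDomain, as quoted above.
pub const NXDOMAIN: u16 = 3;

/// The code a receiver sees when only the 4 header bits are written
/// and no EDNS OPT record carries the upper bits.
pub fn header_only(code: u16) -> u16 {
    code & 0x000F
}

/// True when the code does not fit in the 4-bit header field.
pub fn requires_edns(code: u16) -> bool {
    code > 0x000F
}

/// Keeps only the codes that a header-only test message can represent.
pub fn header_representable(codes: &[u16]) -> Vec<u16> {
    codes.iter().copied().filter(|c| !requires_edns(*c)).collect()
}
```

Here `header_only(BADMODE)` equals `NXDOMAIN`, which is why the test observed a different code than the one it set.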

@djc
Member

djc commented Jul 30, 2021

So, my changes to RetryDnsHandle are actually ill-conceived. The change to "return the error from the highest priority name server" should actually happen in name_server_pool.rs, so I've updated the PR to reflect that. Also, I think this makes more sense in context: when cycling through our pool of name servers, rather than always updating the error unless it is a less specific error, instead we only update the error if it is a more specific error, which allows us to return the error that is

This makes sense to me. 👍

Member

@bluejekyll bluejekyll left a comment

This all looks good. I think there are possibly some error codes that we're over-correcting for, like YXRRSET, which I think is only relevant to dynamic DNS if I remember correctly, but since this is the resolver, I think that's fine.

Thanks for catching the area where this change was needed. I haven't looked at this code in depth in a while, so I missed that nuance between the retry_dns_handle and name_server_pool. Nice work!

@bluejekyll
Member

The cleanliness target failing here should be resolved in #1527

@peterthejohnston
Contributor Author

I think there are possibly some error codes that we're over-correcting for, like YXRRSET, which I think is only relevant to dynamic DNS if I remember correctly, but since this is the resolver, I think that's fine.

OK, sounds good. That makes sense.

Thanks for catching the area where this change was needed. I haven't looked at this code in depth in a while, so I missed that nuance between the retry_dns_handle and name_server_pool. Nice work!

NP, thanks!

@peterthejohnston
Contributor Author

I've updated a relevant doc comment and the PR you mentioned @bluejekyll did resolve the cleanliness target, so if this looks good it should be ready to merge.

(I think maybe the low code coverage stat for the diff is a red herring because the test coverage for the diff is added in a different crate—tests.)

@peterthejohnston
Contributor Author

Also, I was wondering: what is the process/cadence/criteria for putting out a new point release? We were hoping in Fuchsia to be able to pull in this change relatively soon, without having to diverge from upstream. Could I file an issue to tag a v0.20.4 release?

@bluejekyll
Member

I was planning to cut an alpha release of 0.21.0; I was actually waiting for this PR to land for that. Do you need this to be in a 0.20.0 release? That would require a backport from main to the 0.20 branch at this point (there are some API diffs in main that require a 0.21 version bump).

@peterthejohnston
Contributor Author

Ah gotcha, I didn't realize that. Could you point me to the API changes between v0.20.x and v0.21.0?

@bluejekyll bluejekyll merged commit f08860c into hickory-dns:main Aug 4, 2021
@bluejekyll
Member

@peterthejohnston, I need to review all the PRs that landed, and add a few more, but this is what I had collected so far: https://github.com/bluejekyll/trust-dns/blob/main/CHANGELOG.md#0210-unreleased

@bluejekyll
Member

And thank you for getting this PR in!

Development

Successfully merging this pull request may close these issues.

Always fall back to other name servers on DNS error response codes
3 participants