always retry on DNS error response from name server #1526

peterthejohnston · 2021-07-26T17:10:25Z

This PR makes the two changes proposed in #1519:

In the resolver, always try other name servers when we receive a response with an error code (i.e. only fail the query terminally once we have already tried all of them).
Eventually return the error received from the first name server, so in the case of a "legitimate" error, we prefer the response from the highest priority server we queried.

I do have two outstanding questions:

For some reason, ResponseCode::BADMODE does not lead to a retry in this test. Any idea what's going on here, or whether ResponseCode::BADMODE has some kind of special treatment I'm unaware of?
Where can I add an integration test for the "return first error received" behavior? In name_server_pool_tests? If so, I think I'll need to pull in RetryDnsHandle from the trust-dns-proto crate there.

resolves #1519

codecov · 2021-07-26T17:23:20Z

Codecov Report

Merging #1526 (e167528) into main (dcc5289) will decrease coverage by 0.08%.
The diff coverage is 10.53%.

@@            Coverage Diff             @@
##             main    #1526      +/-   ##
==========================================
- Coverage   83.55%   83.47%   -0.08%     
==========================================
  Files         171      171              
  Lines       16879    16896      +17     
==========================================
  Hits        14103    14103              
- Misses       2776     2793      +17

bluejekyll · 2021-07-27T16:16:32Z

For some reason, ResponseCode::BADMODE does not lead to a retry in this test. Any idea what's going on here, or whether ResponseCode::BADMODE has some kind of special treatment I'm unaware of?

I'm not aware of anything with BADMODE specifically. BADVERS is awkward because it's the first value of the extended ResponseCode type that has additional bits added from EDNS, but that shouldn't matter here. What could be happening (I'll need to review your code and the tests more closely) is that BADMODE requires EDNS, since it falls outside the original 4-bit response code and lands in the extended ResponseCode space. So there may be an issue in the tests related to that... not sure.

Where can I add an integration test for the "return first error received" behavior? In name_server_pool_tests? If so, I think I'll need to pull in RetryDnsHandle from the trust-dns-proto crate there.

I can't think of a reason why moving RetryDnsHandle into proto would be a problem. I don't think it will pull in any external dependencies. That's the primary reason for the separation of all of these crates. (edit: I think I misunderstood your comment here. The integration tests depend on all crates I think, so I don't see a problem sharing tests across the various use cases)

bluejekyll

This all looks good to me.

djc

Looks pretty good to me, though I still have some suggestions for improvements.

crates/proto/src/xfer/retry_dns_handle.rs

crates/resolver/src/error.rs

peterthejohnston · 2021-07-29T16:09:00Z

Also, clippy is complaining on the latest commit but it's unrelated to these changes, so maybe that can be resolved in a separate change.

peterthejohnston · 2021-07-29T19:04:07Z

Ok, after writing an integration test and doing some investigation, it appears that I had some misconceptions about retry behavior. I thought that the "fallback" behavior that I'm trying to modify was performed by RetryDnsHandle; in fact, NameServerPool is responsible for this and cycles through name servers when an error is received here, always returning the last/most recent error received of the highest specificity.

So, my changes to RetryDnsHandle are actually ill-conceived. The change to "return the error from the highest priority name server" should actually happen in name_server_pool.rs, so I've updated the PR to reflect that. Also, I think this makes more sense in context: when cycling through our pool of name servers, rather than always updating the error unless it is a less specific error, instead we only update the error if it is a more specific error, which allows us to return the error that is

first, the most specific
second, from the highest priority name server (i.e. received earliest).
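The selection rule described above can be sketched roughly as follows. This is a simplified illustration, not the code in name_server_pool.rs: `Specificity`, `ServerError`, `err`, and `query_pool` are hypothetical names, and the ordering of the specificity levels is an assumption made for the example.

```rust
/// Hypothetical stand-in for how specific an error is, least specific first.
/// The derived ordering follows declaration order; the real resolver's
/// ranking may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Specificity {
    Io,        // e.g. a timeout or connection failure
    ErrorCode, // e.g. SERVFAIL or REFUSED
    NoRecords, // e.g. NXDOMAIN
}

/// Hypothetical error carrying the name server it came from.
#[derive(Clone, Debug, PartialEq, Eq)]
struct ServerError {
    server: &'static str,
    specificity: Specificity,
}

/// Convenience constructor for an error response (illustration only).
fn err(server: &'static str, specificity: Specificity) -> Result<&'static str, ServerError> {
    Err(ServerError { server, specificity })
}

/// Walks the responses in name server priority order. Returns the first
/// success; otherwise the most specific error, with ties going to the
/// earlier (higher priority) server. Returns `None` for an empty pool.
fn query_pool(
    responses: &[Result<&'static str, ServerError>],
) -> Option<Result<&'static str, ServerError>> {
    let mut best: Option<ServerError> = None;
    for response in responses {
        match response {
            Ok(answer) => return Some(Ok(*answer)),
            Err(e) => {
                // Replace the stored error only when the new one is strictly
                // more specific, so an equally specific error from a lower
                // priority server never displaces the earlier one.
                let replace = match &best {
                    None => true,
                    Some(current) => e.specificity > current.specificity,
                };
                if replace {
                    best = Some(e.clone());
                }
            }
        }
    }
    best.map(Err)
}
```

With two servers that both answer with an `ErrorCode`-level error, the sketch returns the error from the first server; under the previous "update unless less specific" rule it would have returned the one from the second.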

Sorry about the thrash, let me know what you think.

peterthejohnston · 2021-07-29T20:10:48Z

Also, you were right @bluejekyll, only the lower 4 bits of the response code were being set in the header. This wasn't caught for the other errors that use more than 4 bits because when truncated they ended up being retryable errors as well; BADMODE (19 i.e. 10011b) turned into NXDomain (3 i.e. 11b).
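To make the arithmetic concrete, here is a small standalone sketch of how a 12-bit extended response code splits between the 4-bit header field and the 8 bits carried in the EDNS OPT record. The function and constant names are hypothetical and are not the trust-dns-proto API.

```rust
/// Numeric values of the two response codes discussed above.
const NXDOMAIN: u16 = 3;
const BADMODE: u16 = 19;

/// Lower 4 bits of the response code: all that fits in the message header.
fn header_bits(rcode: u16) -> u8 {
    (rcode & 0x000F) as u8
}

/// Upper 8 bits of the 12-bit extended response code: these travel in the
/// EDNS OPT record, so they are lost when no OPT record is sent.
fn edns_bits(rcode: u16) -> u8 {
    ((rcode >> 4) & 0x00FF) as u8
}

/// Reassembles the full code from both parts, as an EDNS-aware receiver would.
fn combine(edns: u8, header: u8) -> u16 {
    (u16::from(edns) << 4) | u16::from(header & 0x0F)
}
```

Without the EDNS part, `header_bits(BADMODE)` is 3, which a receiver reads as NXDomain; `combine(edns_bits(BADMODE), header_bits(BADMODE))` gives back 19.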

I've removed all the errors that require EDNS from that test.

djc · 2021-07-30T06:57:06Z

So, my changes to RetryDnsHandle are actually ill-conceived. The change to "return the error from the highest priority name server" should actually happen in name_server_pool.rs, so I've updated the PR to reflect that. Also, I think this makes more sense in context: when cycling through our pool of name servers, rather than always updating the error unless it is a less specific error, instead we only update the error if it is a more specific error, which allows us to return the error that is

This makes sense to me. 👍

bluejekyll

This all looks good. I think there are possibly some error codes that we're over-correcting for, like YXRRSET, which I think is only relevant to dynamic DNS if I remember correctly, but since this is the resolver, I think that's fine.

Thanks for catching the area where this change was needed. I haven't looked at this code in depth in a while, so I missed that nuance between the retry_dns_handle and name_server_pool. Nice work!

bluejekyll · 2021-08-01T16:38:32Z

The cleanliness target failing here should be resolved in #1527

peterthejohnston · 2021-08-03T15:59:55Z

I think there are possibly some error codes that we're over-correcting for, like YXRRSET, which I think is only relevant to dynamic DNS if I remember correctly, but since this is the resolver, I think that's fine.

OK, sounds good. That makes sense.

Thanks for catching the area where this change was needed. I haven't looked at this code in depth in a while, so I missed that nuance between the retry_dns_handle and name_server_pool. Nice work!

NP, thanks!

peterthejohnston · 2021-08-03T16:19:39Z

I've updated a relevant doc comment and the PR you mentioned @bluejekyll did resolve the cleanliness target, so if this looks good it should be ready to merge.

(I think maybe the low code coverage stat for the diff is a red herring because the test coverage for the diff is added in a different crate—tests.)

peterthejohnston · 2021-08-04T16:27:03Z

Also, I was wondering: what is the process/cadence/criteria for putting out a new point release? We were hoping in Fuchsia to be able to pull in this change relatively soon, without having to diverge from upstream. Could I file an issue to tag a v0.20.4 release?

bluejekyll · 2021-08-04T21:01:41Z

I was planning to cut an alpha release of 0.21.0; I was actually waiting for this PR to land for that. Do you need this to be in a 0.20.0 release? That would require a backport from main to the 0.20 branch at this point (there are some API diffs in main that require a 0.21 version bump).

peterthejohnston · 2021-08-04T21:13:51Z

Ah gotcha, I didn't realize that. Could you point me to the API changes between v0.20.x and v0.21.0?

bluejekyll · 2021-08-04T22:35:24Z

@peterthejohnston, I need to review all the PRs that landed, and add a few more, but this is what I had collected so far: https://github.com/bluejekyll/trust-dns/blob/main/CHANGELOG.md#0210-unreleased

bluejekyll · 2021-08-04T22:35:54Z

And thank you for getting this PR in!

retry on all DNS error response codes

58b0afe

bluejekyll approved these changes Jul 27, 2021

djc reviewed Jul 29, 2021

peterthejohnston force-pushed the retry-on-errors branch 2 times, most recently from 4412b87 to fc339bc Compare July 29, 2021 20:10

bluejekyll approved these changes Aug 1, 2021

return error response from highest priority name server

724ddff

peterthejohnston force-pushed the retry-on-errors branch from 4199b3c to 724ddff Compare August 3, 2021 15:59

Merge branch 'main' into retry-on-errors

e167528

bluejekyll merged commit f08860c into hickory-dns:main Aug 4, 2021

peterthejohnston mentioned this pull request Sep 23, 2021

correct behavior around trust_nx_responses #1556

Merged
