Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can no longer detect NXDOMAIN/empty answer in Resolver queries #1171

Closed
glts opened this issue Jul 23, 2020 · 14 comments · Fixed by #1197
Closed

Can no longer detect NXDOMAIN/empty answer in Resolver queries #1171

glts opened this issue Jul 23, 2020 · 14 comments · Fixed by #1197

Comments

@glts
Copy link
Contributor

glts commented Jul 23, 2020

I have been using the synchronous Resolver to do DNS lookups and switching on ResolveErrorKind::NoRecordsFound { .. } to detect NXDOMAIN errors.

This no longer works, instead I now get an untyped/generic protocol error:

Err(ResolveError { kind: Proto(ProtoError { kind: Message("Nameserver responded with NXDomain"), backtrack: None }), backtrack: None })

I believe #1086 changed this, a breaking change. Is this intentional? How does one recognise NXDOMAIN/empty answers in the Resolver API?

To reproduce:

  • make a DNS query for a nonexistent domain or one that returns an empty answer
  • compare returned answer in Trust-DNS Resolver 0.19.5 with current master (d67acf9)
@bluejekyll
Copy link
Member

We can make that more strongly typed. Not sure if we missed that in the review or if this is a new code path, but that’s probably the easiest fix for this.

@glts
Copy link
Contributor Author

glts commented Jul 23, 2020

To give just a little bit of context, my application is SPF protocol implementation. For the SPF protocol I must be able to distinguish both the NXDOMAIN and the empty answer results from other errors.

@bluejekyll
Copy link
Member

Understood. There are two error codes here, NxDomain and NoError. No error is the empty response, NxDomain is the authoritative non-existent domain. All we need to do is associate the proper error code to a real type in the errors.

olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Aug 27, 2020
Due to hickory-dns/hickory-dns#1171, we cannot trust the DNS library
to return typed NXDOMAIN errors.

This change detects these errors by matching the error string. The
default TTL is updated to 5s.

Co-authored-by: Oliver Gould <ver@buoyant.io>
Co-authored-by: Eliza Weisman <eliza@buoyant.io>
@zaharidichev
Copy link

@bluejekyll what issue does #1171 solve exactly? It seems to me that we can not have the NxDomain and NoError codes as proto errors and just leave them be processed by this logic so we get a NoRecordsFound error. I am fine with putting up a PR for that. Does this approach sound ok to you?

@bluejekyll
Copy link
Member

bluejekyll commented Aug 28, 2020

@zaharidichev I'm sorry, I'm not fully understanding the way in which your suggesting to fix this. If you have a PR ready, I'm happy to take a look.

One of the issues here is really a question of what Error means in this context. I'm not entirely certain we should surface an error to the caller of the lookup as an NXDomain is a valid response from the upstream server. Right now, we don't pass that information back to the lookup caller at all. So I think the first question we should decide on is, should this be an error returned via the Result on the Resolver::lookup, or should we add a new field on Lookup that returns the underlying response code from the request?

@bluejekyll
Copy link
Member

For a little more context here. There are a number of things that make returning NXDomain accurately from the lookup call difficult. Off the top of my head:

  • Misconfigured internal DNS servers some times return NXDomain when they are not in fact authorities, see Spurious resolution failure with concurrent requests #933 or conversely are inappropriately using unregistered public domains
  • The search list for resolution of non-fqdn (fully-qualified-domain-names) will continue on NXDomain, b/c we have to.

So for the API change, I think we need to incorporate this, and also anticipate that NXDomain will only be semi-accurate for fqdns...

@zaharidichev
Copy link

@bluejekyll Sorry I might be missing a big part of the context here but it seems to me that before this commit NameServer::inner_send was returning the response instead of an error when encountering NXDomain. Later that response was processed by CachingClient::inner_lookup by looking at the response code. Are you saying that all this logic should not be there and instead the caller should be responsible for handling all that?

@bluejekyll
Copy link
Member

Maybe I'm overthinking the issue here. Would it be enough to surface the error in this case directly with the ResponseCode?

@zaharidichev
Copy link

Yes I think this is what needs to happen.

@bluejekyll
Copy link
Member

Let me work on a quick PR for this, perhaps there are some folks on this thread that would be interested in testing with it.

Also, please notice the flag distrust_nx_responses, that flag purposefully ignores NXDomain as an upstream permanent failure (which it should be), and instead regards it as a potential issue in upstream resolver configuration and continues lookups. Making that false, should return NXDomain properly, I hope.

@bluejekyll
Copy link
Member

@zaharidichev @glts please see #1197, I believe this will fix the issue.

I know remember why this wasn't correct before, I had punted on doing the type golf needed to support this. I need to take a pass to see if I can simplify any of the type signatures, but this should resolve the concerns here, I hope.

@zaharidichev
Copy link

@bluejekyll This is great. Thanks a lot for the fix! Is there any ETA on when a new release will be out including these changes?

@bluejekyll
Copy link
Member

I will publish today. There are a few library deps I want to update first.

@bluejekyll
Copy link
Member

@zaharidichev 0.20.0-alpha.2 has been published.

panthervis added a commit to panthervis/linkerd2-proxy that referenced this issue Oct 8, 2021
Due to hickory-dns/hickory-dns#1171, we cannot trust the DNS library
to return typed NXDOMAIN errors.

This change detects these errors by matching the error string. The
default TTL is updated to 5s.

Co-authored-by: Oliver Gould <ver@buoyant.io>
Co-authored-by: Eliza Weisman <eliza@buoyant.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants