feat: call getaddrinfo directly and extract its return value #698
base: master
The issue at ooni/probe#2029 is fixed if we directly call getaddrinfo and correctly map its return code. While the main reason to propose this diff is to fix the above mentioned issue, we should note that this diff also paves the way for ooni/probe#1569. (Of course, regarding ooni/probe#1569, we don't have the Rcode when we're calling getaddrinfo, but the spirit of ooni/probe#1569 is that we should include the lowest-level error we have seen and, when we're calling getaddrinfo, such an error is getaddrinfo's retval.)
The problem is explained in ooni/probe#2029. I am also working on a more comprehensive fix for this issue in #698. This diff WILL NOT need forwardporting. It's just meant as a hotfix for the release/3.14 branch, and I would not be happy keeping it in release/3.15+.
Thanks for putting this together, it looks great! In general the approach seems reasonable. Similarly to what you wrote in your comment, I am a bit concerned about the increase in complexity in the codebase and the potential issues that can arise from different platforms and libraries. It's probably worth spending a bit of time testing this on various platforms and perhaps adding some guards to our build tooling so that we don't make builds linking to untested/unsupported libc versions or platforms.

Another issue we should consider is that, if I understand how this works correctly, depending on whether or not we have built with `CGO_ENABLED=0`, we are going to be measuring things in a different way (using our cgo inspired getaddrinfo implementation or using netgo). This might present issues when analyzing or interpreting the data.

Do we perhaps want to add some field to the output data format that gives us an indication of which DNS resolution code was used to generate the metric?

Based on the comment here: ooni/probe#2029 (comment), I gather that we don't want to ship this in the next release anyway, so it's fine to spend a bit more time testing and discussing improvements to it. Is this correct?
This cherry-picks 2b48dcf for the release/3.15 branch. Original commit message follows:

- - -

The problem is explained in ooni/probe#2029. I am also working on a more comprehensive fix for this issue in #698. This diff WILL NOT need forwardporting. It's just meant as a hotfix for the release/3.14 branch, and I would not be happy keeping it in release/3.15+.

Conflicts:

- internal/archival/resolver.go
- internal/archival/trace.go
After #764, the build for CGO_ENABLED=0 has been broken for miniooni: https://github.com/ooni/probe-cli/runs/6636995859?check_suite_focus=true

Likewise, it's not possible to run tests with CGO_ENABLED=0.

To make tests work with `CGO_ENABLED=0`, I needed to sacrifice some unit tests run for the CGO case. It is not fully clear to me what was happening here, but basically `getaddrinfo_cgo_test.go` was compiled with CGO being disabled, even though the `//go:build cgo` flag was specified.

Additionally, @hellais previously raised a valid point in the review of #698:

> Another issue we should consider is that, if I understand how
> this works correctly, depending on whether or not we have built
> with CGO_ENABLED=0, we are going to be measuring
> things in a different way (using our cgo inspired getaddrinfo
> implementation or using netgo). This might present issues when
> analyzing or interpreting the data.
>
> Do we perhaps want to add some field to the output data format that
> gives us an indication of which DNS resolution code was used to
> generate the metric?

This comment is relevant to the current commit because #698 is the previous iteration of #764. So, while fixing the build and test issues, let us also distinguish between the CGO_ENABLED=1 and CGO_ENABLED=0 cases.

Before this commit, OONI used "system" to indicate the case where we were using net.DefaultResolver. This behavior dates back to the Measurement Kit days. While it is true that ooni/probe-engine and ooni/probe-cli could have been using netgo in the past when we said "system" as the resolver, it also seems reasonable to continue to use "system" to indicate getaddrinfo.

So, the choice here is basically to use "netgo" from now on to indicate the cases in which we were built with CGO_ENABLED=0.

This change will need to be documented in ooni/spec along with the introduction of the `android_dns_cache_no_data` error.
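The naming rule described above can be sketched as follows. `ResolverNetworkName` is a hypothetical helper invented for this illustration. In an actual build the choice would presumably be fixed at compile time, for instance by two files guarded by `//go:build cgo` and `//go:build !cgo`; a boolean parameter stands in for that here so the example fits in a single file.

```go
package main

import "fmt"

// ResolverNetworkName returns the resolver label to record in measurements.
// The cgoEnabled parameter is an assumption made for illustration: it stands
// in for a decision that build tags would normally make at compile time.
func ResolverNetworkName(cgoEnabled bool) string {
	if cgoEnabled {
		// We call getaddrinfo through libc.
		return "system"
	}
	// We were built with CGO_ENABLED=0 and use the pure-Go resolver.
	return "netgo"
}

func main() {
	fmt.Println(ResolverNetworkName(true))
	fmt.Println(ResolverNetworkName(false))
}
```

With such a label in the output data, anyone analyzing measurements could separate results produced by getaddrinfo from those produced by netgo.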
## Checklist

- [x] I have read the [contribution guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md)
- [x] reference issue for this pull request: ooni/probe#2029
- [x] if you changed anything related to how experiments work and you need to reflect these changes in the ooni/spec repository, please link to the related ooni/spec pull request: ooni/spec#242
Thank you for your review! I ended up taking just the
Yes, this was good advice. I spent extra time trying to figure out the possible configurations, which led me to improve my analysis of what actually happens and to improve the CI. See:
Yes, good point. So, I actually introduced in
support for distinguishing which resolver we're using. The approach has been quite simple: if we know we're calling Our aim is to use
Yes. For 3.14 and 3.15 we have been using a simplistic patch. Based on my analysis at ooni/probe#2029 (comment), the patch is not entirely correct. Android's BTW, this pull request now only contains a small diff that collects
Converting the PR to draft since the code that remains to be merged in this diff is still a bit of a draft.
## Checklist

- `getaddrinfo_retval` field

## Description
This pull request modifies our "system resolver" implementation as follows. If we're compiled with `CGO_ENABLED=0`, we keep using the `netgo` implementation of the system resolver. Otherwise, we'll attempt to link with `libc` and call `getaddrinfo` directly rather than using the `cgo` system resolver implementation as the middle person.

This new design solves two problems:
- `cgo` implementation does not correctly handle "no answer" on Android (which leads to "unknown failure" and, in turn, to the measurement being marked as failed, as documented in "android: properly handle NXDOMAIN errors", probe#2029).

Implementation-wise, the code I am using here is deeply based on the `cgo` implementation. (I've been quite precise in giving credit where due and in flagging which functions are under the BSD-3-Clause license.)

Here are the key feature highlights of this pull request:
- `getaddrinfo` handling code so as to address Android issues;
- `getaddrinfo` errors so we know the original return value;
- `getaddrinfo` parallelism compared to the Go standard library (we don't need to be as general as the standard library is, therefore we can use a simpler implementation that requires maintaining less code);
- `getaddrinfo_retval` as an observation field.

Now, let's discuss some annoyances of this diff:
- `getaddrinfo_retval`. The reason why historically we have three places is an ongoing refactoring, which is documented by "measurex: unify data model with engine/netx" (probe#2035).
- `EAI_NONAME` is only available on GNU/Linux if you `-D__GNU_SOURCE`. Perhaps, I should just `-D__GNU_SOURCE`? But, I'm not sure about musl. So, maybe we should consider experimenting a little more here and figuring out if there's a compact way of unifying Linux and Android here?
- `getaddrinfo` (for example, see my doubts regarding GNU libc versus musl libc)
- `miniooni`. It now seems a bit annoying that the build we get when cross compiling does not have `cgo` by default, so we're basically getting a default resolver (the one you get when you're not using `cgo`, which is called `netgo`) that does not record the return value of `getaddrinfo`. But, then, maybe this is just telling us that we should completely bypass Go's resolver also in the `CGO_ENABLED=0` case (i.e., mainly cross compiling), by borrowing from the Go stdlib some more code that helps us in reading `/etc/resolv.conf`?
- `getaddrinfo` - and perhaps this should also be tested?