Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: call getaddrinfo directly and extract its return value #698

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

bassosimone
Copy link
Contributor

@bassosimone bassosimone commented Feb 22, 2022

Checklist

Description

This pull request modifies our "system resolver" implementation as follows. If we're compiled with CGO_ENABLED=0, we keep using the netgo implementation of the system resolver. Otherwise, we'll attempt to link with libc and call getaddrinfo directly rather than using the cgo system resolver implementation as the middle person.

This new design solves two problems:

  1. that, to date, we're not collecting the raw result of getaddrinfo (which is one of the steps of websteps: include raw errors probe#2033);
  2. that, to date, the cgo implementation does not correctly handle "no answer" on Android (which leads to "unknown failure" and, in turn, to the measurement being marked as failed, as documented in android: properly handle NXDOMAIN errors probe#2029).

Implementation wise, the code I am using here is deeply based on the cgo implementation. (I've been quite precise in giving credit where due and in flagging which functions are under the BSD-3Clause license.)

Here are the key feature highlights of this pull request:

  1. we write platform-specific getaddrinfo handling code so to address Android issues;
  2. the same platform-specific code also wraps getaddrinfo errors so we know the original return value;
  3. we use a simpler model to handle getaddrinfo parallelism compared to the Go standard library (we don't need to be as general as the standard library is, therefore we can use a simpler implementation that requires to maintain less code);
  4. we modify all the code that saves results to include getaddrifo_retval as an observation field.

Now, let's discuss some annoyances of this diff:

  1. we need to modify three places in the codebase to propagate getaddrinfo_retval. The reason why historically we have three places is an ongoing refactoring, which is documented by measurex: unify data model with engine/netx probe#2035.
  2. this diff increases the complexity of testing the Android codebase. This occurs because there's now a split between the Android and the Linux implementations. This mainly stems from the fact that EAI_NONAME is only available on GNU/Linux if you -D__GNU_SOURCE. Perhaps, I should just -D__GNU_SOURCE? But, I'm not sure about musl. So, maybe we should consider experimenting a little more here and figuring out if there's a compact way of unifying Linux and Android here?
  3. this diff increases the complexity of the codebase, in that now it's our problem to deal with the differences of each libc when we're calling getaddrinfo (for example, see my doubts about regarding GNU libc versus musl libc)
  4. this diff probably requires us to change the way in which we cross compile miniooni. It now seems a bit annoying that the build we get when cross compiling does not have cgo by default, so we're basically getting a default resolver (the one you get when you're not using cgo, which is called netgo) that does not record the return value of getaddrinfo. But, then, maybe this is just telling us that we should completely bypass Go's resolver also in the CGO_ENABLED=0 case (i.e., mainly cross compiling), by borrowing from the Go stdlib some more code that helps us in reading /etc/resolv.conf?
  5. one may argue that a simpler fix was that of just intercepting the string emitted by Android in that specific case and have a somehow hackish diff to handle that (but I am not super happy about piling hacks over hacks, so I felt it was proper to try and solve the underlying problem rather than continuing to apply band aids)
  6. it may be that we'll be facing increasing issues with i18n domain names and perhaps this is more likely to occur in Windows where I suppose we're using the ASCII version of getaddrinfo - and perhaps this should also be tested?

The issue at ooni/probe#2029 is fixed
if we directly call getaddrinfo and correctly map its return code.

While the main reason to propose this diff is to fix the above
mentioned issue, we should note that this diff also paves the way
for ooni/probe#1569.

(Of course, regarding ooni/probe#1569,
we don't have the Rcode when we're calling getaddrinfo, but the
spirit of ooni/probe#1569 is that we
should include the lowest-level error we have seen and, when we're
calling getaddrinfo, such an error is getaddrinfo's retval.)
bassosimone added a commit that referenced this pull request Feb 23, 2022
The problem is explained in ooni/probe#2029.

I am also working on a more comprehensive fix for this
issue in #698.

This diff WILL NOT need forwardporting. It's just meant as
an hotfix for the release/3.14 branch and it's not such that
I'd be happy keeping it in release/3.15+.
@hellais
Copy link
Member

hellais commented Feb 23, 2022

Thanks for putting this together, it looks great!

In general the approach seems reasonable. Similarly to what you wrote in your comment, I am a bit concerned about the increase in complexity in the codebase and the potential issues that can arise from different platforms and libraries.

It's probably worth spending a bit of time testing this on various platforms and perhaps add some guards to our build tooling so that we don't make builds linking to untested/unsupported libc versions or platforms.

Another issue we should consider is that, if I understand how this works correctly, depending on whether or not we have built with CGO_ENABLED=0 on or not, we are going to be measuring things in a different way (using our cgo inspired getaddrinfo implementation or using netgo). This might present issues when analyzing or interpreting the data.

Do we perhaps want to add some field to the output data format that gives us an indication of which DNS resolution code was used to generate the the metric?

Based on the comment here: ooni/probe#2029 (comment), I gather that we don't want to anyways ship this in the next release, so it's fine to spend a bit more time testing and discussing improvements to it. Is this correct?

bassosimone added a commit that referenced this pull request May 20, 2022
This cherry-picks 2b48dcf for
the release/3.15 branch. Original commit message follows:

- - -

The problem is explained in ooni/probe#2029.

I am also working on a more comprehensive fix for this
issue in #698.

This diff WILL NOT need forwardporting. It's just meant as
an hotfix for the release/3.14 branch and it's not such that
I'd be happy keeping it in release/3.15+.
Conflicts:
	internal/archival/resolver.go
	internal/archival/trace.go
bassosimone added a commit that referenced this pull request May 29, 2022
After #764, the build for
CGO_ENABLED=0 has been broken for miniooni:

https://github.com/ooni/probe-cli/runs/6636995859?check_suite_focus=true

Likewise, it's not possible to run tests with CGO_ENABLED=0.

Additionally, @hellais previously raised a valid point in the review
of #698:

> Another issue we should consider is that, if I understand how
> this works correctly, depending on whether or not we have built
> with CGO_ENABLED=0 on or not, we are going to be measuring
> things in a different way (using our cgo inspired getaddrinfo
> implementation or using netgo). This might present issues when
> analyzing or interpreting the data.
>
> Do we perhaps want to add some field to the output data format that
> gives us an indication of which DNS resolution code was used to
> generate the the metric?

This comment is relevant to the current commit because
#698 is the previous
iteration of #764.

So, while fixing the build and test issues, let us also distinguish
between the CGO_ENABLED=1 and CGO_ENABLED=0 cases.

Before this commit, OONI used "system" to indicate the case where
we were using net.DefaultResolver. This behavior dates back to the
Measurement Kit days. While it is true that ooni/probe-engine and
ooni/probe-cli could have been using netgo in the past when we
said "system" as the resolver, it also seems reasonable to continue
to use "system" top indicate getaddrinfo.

So, the choice here is basically to use "netgo" from now on to
indicate the cases in which we were built with CGO_ENABLED=0.

This change will need to be documented into ooni/spec along with
the introduction of the `android_dns_cache_no_data` error.
bassosimone added a commit that referenced this pull request May 30, 2022
After #764, the build for
CGO_ENABLED=0 has been broken for miniooni:

https://github.com/ooni/probe-cli/runs/6636995859?check_suite_focus=true

Likewise, it's not possible to run tests with CGO_ENABLED=0.

To make tests work with `CGO_ENABLED=0`, I needed to sacrifice some
unit tests run for the CGO case. It is not fully clear to me what was happening
here, but basically `getaddrinfo_cgo_test.go` was compiled with CGO
being disabled, even though the ``//go:build cgo` flag was specified.

Additionally, @hellais previously raised a valid point in the review
of #698:

> Another issue we should consider is that, if I understand how
> this works correctly, depending on whether or not we have built
> with CGO_ENABLED=0 on or not, we are going to be measuring
> things in a different way (using our cgo inspired getaddrinfo
> implementation or using netgo). This might present issues when
> analyzing or interpreting the data.
>
> Do we perhaps want to add some field to the output data format that
> gives us an indication of which DNS resolution code was used to
> generate the the metric?

This comment is relevant to the current commit because
#698 is the previous
iteration of #764.

So, while fixing the build and test issues, let us also distinguish
between the CGO_ENABLED=1 and CGO_ENABLED=0 cases.

Before this commit, OONI used "system" to indicate the case where
we were using net.DefaultResolver. This behavior dates back to the
Measurement Kit days. While it is true that ooni/probe-engine and
ooni/probe-cli could have been using netgo in the past when we
said "system" as the resolver, it also seems reasonable to continue
to use "system" top indicate getaddrinfo.

So, the choice here is basically to use "netgo" from now on to
indicate the cases in which we were built with CGO_ENABLED=0.

This change will need to be documented into ooni/spec along with
the introduction of the `android_dns_cache_no_data` error.

## Checklist

- [x] I have read the [contribution guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md)
- [x] reference issue for this pull request: ooni/probe#2029
- [x] if you changed anything related how experiments work and you need to reflect these changes in the ooni/spec repository, please link to the related ooni/spec pull request: ooni/spec#242
@bassosimone
Copy link
Contributor Author

bassosimone commented May 30, 2022

Thanks for putting this together, it looks great!

Thank you for your review!

I ended up taking just the netxlite part of the diff, applying changes to improve testing, and merging it as part of #764.

In general the approach seems reasonable. Similarly to what you wrote in your comment, I am a bit concerned about the increase in complexity in the codebase and the potential issues that can arise from different platforms and libraries.

It's probably worth spending a bit of time testing this on various platforms and perhaps add some guards to our build tooling so that we don't make builds linking to untested/unsupported libc versions or platforms.

Yes, this was a good advice. I spent extra time trying to figure out the possible configurations, which led me to improve my analysis of what actually happens and to improve the CI. See:

Another issue we should consider is that, if I understand how this works correctly, depending on whether or not we have built with CGO_ENABLED=0 on or not, we are going to be measuring things in a different way (using our cgo inspired getaddrinfo implementation or using netgo). This might present issues when analyzing or interpreting the data.

Do we perhaps want to add some field to the output data format that gives us an indication of which DNS resolution code was used to generate the the metric?

Yes, good point. So, I actually introduced in

support for distinguishing which resolver we're using. The approach has been quite simple: if we know we're calling getaddrinfo, continue to name this "system" (as we have been doing since MK). Otherwise, call that "go", which may be "netgo" or actually getaddrinfo depending on the platform and how OONI has been built.

Our aim is to use getaddrinfo (aka "system") whenever possible. To this end, we need to avoid cross compiling, so we can explicitly link with libc. We're currently not cross compiling ooniprobe, but we cross compile miniooni, hence ooni/probe#2119.

Based on the comment here: ooni/probe#2029 (comment), I gather that we don't want to anyways ship this in the next release, so it's fine to spend a bit more time testing and discussing improvements to it. Is this correct?

Yes. For 3.14 and 3.15 we have been using a simplistic patch. Based on my analysis at ooni/probe#2029 (comment), the patch is not so correct. Android's getaddrinfo is really bad in terms of data quality. Because of this, I proposed ooni/probe#2120.

BTW, this pull request now only contains a small diff that collects getaddrinfo's return value, since the underlying mechanism for calling getaddrinfo has been merged in #764. I would like to keep this PR open and work towards merging the remainder of the diff $soon.

@bassosimone bassosimone marked this pull request as draft May 30, 2022 08:26
@bassosimone
Copy link
Contributor Author

Convering the PR to draft since the code that remains to be merged in this diff is still a bit of a draft.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants