wcth: compare old and new test helpers #1707
When a probe gets a local DNS failure, it will continue and nonetheless query the test helper without any IP address, just an empty list. This diff fixes the behavior of cmd/oohelper to do the same. Work part of ooni/probe#1707.
Here's a status update. In late July I ran a first-pass comparison of the whole content of the test lists between the old and the new test helper. To this end, I used the code at https://github.com/ooni/wcthcheck. The codename of this first comparison is
{
"dns/match/exactly": 19518,
"dns/match/same_asn": 3362,
"dns/match/same_org": 122,
"dns/mismatch/cannot_map_to_asn": 6,
"dns/mismatch/new_data_none_addrs": 47,
"dns/mismatch/other": 431,
"dns/total": 23486,
"http_body_length/diff": 9333,
"http_body_length/diff/new_th_larger": 6368,
"http_body_length/diff/new_th_smaller": 2965,
"http_body_length/total": 23480,
"http_failure/match": 22363,
"http_failure/mismatch": 1117,
"http_failure/mismatch/new_is_none": 45,
"http_failure/mismatch/old_is_none": 292,
"http_failure/mismatch/other": 780,
"http_failure/total": 23480,
"http_headers/match/same_set": 19516,
"http_headers/mismatch/new_is_none": 1037,
"http_headers/mismatch/set_diff": 2927,
"http_headers/total": 23480,
"http_status_code/match": 21916,
"http_status_code/mismatch": 1564,
"http_status_code/total": 23480,
"http_title/match/both_empty": 2502,
"http_title/match/equal": 11180,
"http_title/mismatch/new_empty": 749,
"http_title/mismatch/old_empty": 935,
"http_title/mismatch/old_th_smaller": 7683,
"http_title/mismatch/other": 431,
"http_title/mismatch/total": 8114,
"http_title/total": 23480,
"processing/match/has_both_measurements": 23486,
"processing/mismatch/missing_newth_measurement": 7,
"processing/mismatch/missing_oldth_measurement": 4478,
"processing/succeeded": 27971,
"processing/total": 27971,
"tcp_connect/match": 23251,
"tcp_connect/mismatch": 235,
"tcp_connect/mismatch/added_ivp4": 60,
"tcp_connect/mismatch/different_ivp4": 185,
"tcp_connect/mismatch/removed_ivp4": 73,
"tcp_connect/total": 23486
}
After this initial scan, I started trying to figure out the reasons why there was such a difference. Generally speaking, what I did was to look into a subset of measurements to identify bugs.
The first issue I figured out is that the
After which I started looking into the subset of URLs that presents
For this new comparison, I deployed the legacy backend on my local machine inside a Docker container running an
While investigating, there was some serendipity. I assumed the legacy backend was using
Subsequent changes to the new test helper landed at: ooni/probe-cli#504.
This brings us to today. Here's the next-two-weeks agenda:
The latter will be done by censoring the network on the computer I am using via Jafar.
Written status update. We can now move this issue to Sprint 48.
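A quick way to sanity-check the counters from the first-pass comparison is to verify that the buckets add up to the reported total. The snippet below is a minimal sketch, not code from wcthcheck: the counter values are copied from the JSON above, while the helper name `bucket_sum` is hypothetical. It assumes the DNS buckets are mutually exclusive, which seems to hold because they sum exactly to `dns/total`.

```python
# DNS counters copied from the first-pass comparison above.
counters = {
    "dns/match/exactly": 19518,
    "dns/match/same_asn": 3362,
    "dns/match/same_org": 122,
    "dns/mismatch/cannot_map_to_asn": 6,
    "dns/mismatch/new_data_none_addrs": 47,
    "dns/mismatch/other": 431,
    "dns/total": 23486,
}


def bucket_sum(counters, prefix):
    """Sum every counter whose key starts with prefix (hypothetical helper)."""
    return sum(v for k, v in counters.items() if k.startswith(prefix))


dns_match = bucket_sum(counters, "dns/match/")
dns_mismatch = bucket_sum(counters, "dns/mismatch/")

# The two groups should account for every URL that has DNS results.
print(dns_match + dns_mismatch == counters["dns/total"])
print(f"{dns_match / counters['dns/total']:.1%} of DNS results match")
```

The same check does not apply to every group: for example, the `tcp_connect/mismatch/*` sub-buckets add up to more than `tcp_connect/mismatch`, so those buckets presumably overlap.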
Further investigating DNS differences (5/5)
I solved more DNS-related issues and branched off an issue I could not solve easily: #1823. This is the current DNS situation for about 80% of the URLs in the test list:
The So, most of them were transient, we're not looking at this:
There are 17 entries where the old TH cannot handle IPs in the URL, which I think is a behaviour that the new TH does not actually want to mirror. Then there are 8 failures that we may need to investigate further. Overall, though, the DNS part seems to be converging (I am continuing to scan the test lists, but I am now reaching cases where the domains are probably censored where I live, because every measurement is taking such a long time).
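The DNS buckets used in this comparison (exact match, same ASN, same organisation, and the mismatch cases) can be illustrated with a small classifier. This is only a sketch under stated assumptions: the function `classify_dns`, the lookup tables `asn_of` and `org_of`, and the rule that two answers are "same ASN" when their sets of ASNs are equal are all hypothetical, and the actual logic in wcthcheck may differ. Addresses and AS numbers come from the ranges reserved for documentation.

```python
def classify_dns(old_addrs, new_addrs, asn_of, org_of):
    """Classify two DNS answers into a bucket (hypothetical rules).

    asn_of maps an IP address to its AS number; org_of maps an AS number
    to an organisation name. Both are stand-ins for a real ASN database.
    """
    old, new = set(old_addrs), set(new_addrs)
    if old == new:
        return "match/exactly"
    if not new:
        return "mismatch/new_data_none_addrs"
    old_asns = {asn_of.get(addr) for addr in old}
    new_asns = {asn_of.get(addr) for addr in new}
    if None in old_asns or None in new_asns:
        return "mismatch/cannot_map_to_asn"
    if old_asns == new_asns:
        return "match/same_asn"
    old_orgs = {org_of.get(asn) for asn in old_asns}
    new_orgs = {org_of.get(asn) for asn in new_asns}
    if None not in old_orgs and old_orgs == new_orgs:
        return "match/same_org"
    return "mismatch/other"


# Illustrative lookup tables (documentation IP ranges and AS numbers).
asn_of = {
    "192.0.2.1": 64496,
    "192.0.2.2": 64496,
    "198.51.100.1": 64497,
    "203.0.113.1": 64498,
}
org_of = {64496: "Example Org A", 64497: "Example Org A", 64498: "Example Org B"}

print(classify_dns(["192.0.2.1"], ["192.0.2.2"], asn_of, org_of))
print(classify_dns(["192.0.2.1"], ["198.51.100.1"], asn_of, org_of))
```

Checking for an exact match first keeps the common case cheap and means the ASN database is only consulted when the two helpers actually disagree.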
HTTP headers
I also started looking into HTTP headers. I found more cases in which the sets are disjoint, so I added more headers as exceptions, but then I also measured in how many cases both sets ended up being empty. This is not looking good:
There are two issues here: (1) we have many HTTP failures for which we probably need to retry and (2) the more exception headers I add, the more I increase the
Okay, so this is what we have:
This is, in itself, a data quality issue. In many cases, the two implementations return different headers that matter according to Web Connectivity's heuristics. This confuses new probes a bit (to an unclear extent). If we switch to the new TH, we confuse new probes less and old probes more (which is less of a concern). So, okay, I think there is no need to dig further into this headers topic. It's clear we cannot do much about this. The two implementations we are using are too different and too far apart in time for further action to be possible.
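The trade-off described in this comment (a larger ignore list removes disjoint sets but produces more empty ones) can be shown with a short sketch. Everything here is an assumption made for illustration: the ignore list is hypothetical and is not the list used by the probe, and `compare_headers` is not a function from wcthcheck.

```python
# Hypothetical ignore list (lowercase names); NOT the list the probe uses.
DEFAULT_IGNORED = frozenset({
    "cache-control", "content-length", "content-type",
    "date", "expires", "server", "vary",
})


def significant(headers, ignored=DEFAULT_IGNORED):
    """Return the lowercase header names left after dropping ignored ones."""
    return {name.lower() for name in headers} - ignored


def compare_headers(old_headers, new_headers, ignored=DEFAULT_IGNORED):
    """Compare the significant header names returned by the two helpers."""
    old = significant(old_headers, ignored)
    new = significant(new_headers, ignored)
    if not old and not new:
        return "both_empty"
    if old == new:
        return "same_set"
    if old.isdisjoint(new):
        return "disjoint"
    return "partial_overlap"


old_headers = {"Content-Type": "text/html", "Server": "nginx", "X-Cache-Status": "HIT"}
new_headers = {"Content-Type": "text/html", "Server": "nginx", "Content-Language": "en"}

# With the default list the two sets have nothing in common.
print(compare_headers(old_headers, new_headers))

# Adding the offending headers as exceptions empties both sets instead.
extended = DEFAULT_IGNORED | {"x-cache-status", "content-language"}
print(compare_headers(old_headers, new_headers, extended))
```

In this sketch, extending the ignore list turns a disjoint result into an empty one, so the comparison stops carrying any signal rather than becoming a match.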
HTTP status code (1/2)
Regarding the status code, after correcting to avoid noise caused by failures, here's what I get:
It may be worth taking a look into the mismatch cases to see whether there's anything fundamental here. The first thing I noticed is that some URLs return 403 with the old TH and 200 with the new TH. An example is http://www.geocities.ws/marchrecce/, for which we have:
I created rules for both the 403 that becomes 200 and for the sub-case of Cloudflare and now I have:
So, clearly the Cloudflare check was not very selective, while the 403 => 200 check was much more selective. Okay, this is a data quality issue we're fixing by using the new TH ✔️ Now, what's next? We need to look into the remaining mismatch cases. After more refinements, here's what I have:
Some insights based on the above results:
One of the remaining cases (http://www.irna.ir/en/):
// old TH
{
"tcp_connect": {
"178.216.249.78:80": {
"status": true,
"failure": null
},
"217.25.48.64:80": {
"status": true,
"failure": null
}
},
"http_request": {
"body_length": 35805,
"failure": null,
"title": "IRNA English",
"headers": {
"Cache-Control": "max-age=60",
"Content-Type": "text/html;charset=UTF-8",
"Date": "Fri, 15 Oct 2021 03:11:46 GMT",
"Expires": "Fri, 15 Oct 2021 03:12:06 GMT",
"Server": "nginx",
"Vary": "Accept-Encoding",
"X-Cache-Status": "HIT"
},
"status_code": 200
},
"dns": {
"failure": null,
"addrs": [
"217.25.48.64",
"178.216.249.78"
]
}
}
// New TH
{
"tcp_connect": {
"178.216.249.78:80": {
"status": true,
"failure": null
},
"217.25.48.64:80": {
"status": true,
"failure": null
}
},
"http_request": {
"body_length": 1918,
"failure": null,
"title": "\u0635\u0641\u062d\u0647\u0654 \u062f\u0631\u062e\u0648\u0627\u0633\u062a\u06cc \u0634\u0645\u0627 \u06cc\u0627\u0641\u062a \u0646\u0634\u062f.",
"headers": {
"Content-Language": "en",
"Content-Length": "1918",
"Content-Type": "text/html;charset=UTF-8",
"Date": "Fri, 15 Oct 2021 03:11:34 GMT",
"Server": "nginx"
},
"status_code": 404
},
"dns": {
"failure": null,
"addrs": [
"217.25.48.64",
"178.216.249.78"
]
}
}
I'm going to try and see whether there are more cases like this. (I didn't really see this coming!)
Okay, after some refactoring, this is the final analysis for status code. I'll add some comments inline:
All these cases are interesting and are, additionally, a source of confusion for new probes anyway. That is, I looked into what is happening here mostly out of curiosity and for the sake of data quality.
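The http://www.irna.ir/en/ pair quoted above is a useful test case for a field-by-field comparison of two TH responses. The following is a minimal sketch: the function names are hypothetical, the dictionaries are trimmed copies of the two responses (headers omitted), and the real comparison in wcthcheck may be structured differently.

```python
def endpoints_status(response):
    """Map each TCP endpoint to its connect status."""
    return {ep: r["status"] for ep, r in response.get("tcp_connect", {}).items()}


def diff_th_responses(old, new):
    """Return, per field, whether the old and new TH responses agree."""
    report = {
        # Address order differs between runs, so compare as sets.
        "dns_addrs": set(old["dns"].get("addrs") or [])
        == set(new["dns"].get("addrs") or []),
        "tcp_connect": endpoints_status(old) == endpoints_status(new),
    }
    for field in ("failure", "status_code", "title", "body_length"):
        report["http_" + field] = (
            old["http_request"].get(field) == new["http_request"].get(field)
        )
    return report


# Trimmed copies of the two responses quoted above (headers omitted).
TCP = {
    "178.216.249.78:80": {"status": True, "failure": None},
    "217.25.48.64:80": {"status": True, "failure": None},
}
DNS = {"failure": None, "addrs": ["217.25.48.64", "178.216.249.78"]}
old_th = {
    "tcp_connect": TCP,
    "http_request": {"body_length": 35805, "failure": None,
                     "title": "IRNA English", "status_code": 200},
    "dns": DNS,
}
new_th = {
    "tcp_connect": TCP,
    "http_request": {"body_length": 1918, "failure": None,
                     "title": "\u0635\u0641\u062d\u0647\u0654 \u062f\u0631\u062e\u0648\u0627\u0633\u062a\u06cc \u0634\u0645\u0627 \u06cc\u0627\u0641\u062a \u0646\u0634\u062f.",
                     "status_code": 404},
    "dns": DNS,
}

for field, same in diff_th_responses(old_th, new_th).items():
    print(field, "match" if same else "MISMATCH")
```

For this pair, DNS and TCP agree while the status code, title, and body length differ, which is consistent with the two helpers having fetched the page through different endpoints.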
HTTP status code (2/2)
Well, actually, no, it isn't fascinating. It's just a data quality issue with the way in which webconnectivity works. I've explained why this is a systematic issue of webconnectivity in #1824.
Review of the results and conclusions
The objective of this comparison was to figure out whether replacing the old webconnectivity test helper (TH) with the new one would reduce the quality of the data produced by new OONI Probes (i.e., OONI Probes using the Go engine).
The rest of this report is organised as follows. We analyse the Go client to determine what matters. We explain how we ran the comparison. We document actions to reduce the diff in the JSONs produced by the two THs in varying conditions. We document some oddities we observed. We show the results and comment on each aspect. We draw the conclusions.
Analysis of the Go client
In this section, we read through the source code of the Go client. The objective here is to figure out which specific fields emitted by the TH matter the most to determine the blocking status of a measurement. Because we've already discontinued the Python implementation of OONI Probe, we only focus on the Go codebase. As of v3.10.1, the webconnectivity implementation:
Because of the above facts, it is really important for us to make sure there are no DNS discrepancies between the old TH and the new TH. Other aspects are comparatively less important to equalize.
Methodology
I repeatedly ran the old and the new TH with a nearly full version of the test list. To this end, I used the code at https://github.com/ooni/wcthcheck. Many commits in that repository mention this issue. Also, there are comments in this issue explaining how I ran the whole experiment. In a nutshell: I ran the two THs on a Linux box, side by side, and accessed them using
I've also tried to use Jafar to provoke specific errors and see how the two test helpers reacted to them: #1707 (comment).
Actions to reduce the diff between the new and the old TH
The experiments I've run led to the following PRs that attempted to reduce the diff between the old and the new TH:
Oddities
Performing this analysis, I came across these interesting oddities that may have a data quality impact:
Results
Here we classify by "phase" of the experiment: DNS, TCP/IP, and HTTP.
The total number of URLs was 27971. We're missing 16 URLs with both THs, 1 is only missing with the new TH, and 81 are only missing with the old TH. Because the results encompass DNS, TCP/IP, and HTTP, re-running a measurement could in some cases cause failures in one aspect (e.g., DNS) while fixing others (e.g., HTTP): as we already know, the test lists contain many not-so-reliable websites.
DNS
So, regarding DNS, we are in a very good position and have very few differences. The full classification of the 49 DNS mismatches between new and old TH is the following:
It's also worth noting that we have 17 cases in which the old TH could not handle a URL containing an IP address. They are not flagged as failures; rather, this is a known bug, so it goes into a specific, distinct bucket.
HTTP body length
The body length is complicated because we need to see the body but the THs generally do not return it. We have seen that the old TH sometimes encodes HTML tags, which may be a data quality issue because the new probe doesn't do it. See #1707 (comment) for more details on how I learned about this encoding issue.
HTTP failure
5574 of the 6608 failure mismatches occur because the new TH failure is
(Ah, and I also showed in #1707 (comment) that the old TH does not validate TLS certificates, which is a potential data quality issue and source of differences: the Go probe will fail in this case.)
HTTP headers
I basically gave up comparing headers. We need to exclude the common set of headers that the probes ignore. The remaining headers depend heavily on the implementation, and the two implementations are very different (the old TH uses Twisted 13.2.0, which is software released in November 2013). I tried to add more headers to the ignore list, which often led me to an empty intersection between headers. I took this fact as a signal that investigating header differences was a waste of my time. That said, the Go probe uses the same codebase as the new TH, so using the new TH seems better here.
HTTP status code
The status code matched in 19699 cases out of 26465. This leaves us with 6766 mismatches. Of these, 6409 are caused by the old TH failing for HTTP (5474 cases) or the new TH failing (935 cases). The remaining cases are cases in which the status code changes by changing TH. I also showed a case, http://www.irna.ir/en/, in which the result is purely dependent on the IP address chosen by the specific TH to fetch the webpage: not following all endpoints has a data quality cost (#1825).
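The status code figures above can be re-derived with simple arithmetic, which also gives the size of the "remaining cases" bucket. The numbers are the ones quoted in the paragraph; only the variable names are made up for this sketch.

```python
# Figures quoted in the HTTP status code paragraph above.
total_compared = 26465
status_match = 19699
old_th_failed = 5474
new_th_failed = 935

# Mismatches are whatever did not match.
status_mismatch = total_compared - status_match  # 6766

# Mismatches explained by one of the two helpers failing at the HTTP level.
explained_by_failures = old_th_failed + new_th_failed  # 6409

# What is left are cases where both helpers got a response, but the
# status code itself differs.
genuine_changes = status_mismatch - explained_by_failures  # 357

print(status_mismatch, explained_by_failures, genuine_changes)
print(f"{genuine_changes / total_compared:.1%} of compared URLs")
```

So, once HTTP failures are set aside, the status code actually changes between the two helpers for 357 URLs, a small fraction of the compared set.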
HTTP title
When discussing the title, it's important to remember that the Go probe and the new TH use a similar codebase, therefore they should have more consistent results in this respect. I learned that the old TH strips non-Latin characters from the title, which leads to it producing a smaller (6451 cases) or empty title (6309 cases). Summing these two cases we obtain 12760, which is most of the 13862 mismatching cases. There are also a few cases where the new TH title is smaller or empty, which likely depends on the regexp used to extract the title (which is limited to a maximum number of characters in the Go codebase) or on the page size. Either way, doing something similar to what the probe does will be better.
TCP connect
I didn't spend much time looking into this topic. We don't directly string-match values from the TH in the probe and I had already gained confidence that we can handle the common cases using Jafar.
Conclusions
I think it's fine to switch to the new TH. We will have to fix some small bugs, chiefly that our DNS timeout seems too small in one specific case and we cannot properly get a SERVFAIL (#1823). We have spent lots of time trying to reduce the difference in the JSON structure and making sure that DNS looks similar. We should fix new Go clients to always perform the body length comparison (now both bodies should be truncated) as documented in #1827. We should fix the new Go clients to properly handle the status code value as documented in #1825. We should have issues for each possible data quality issue anyway and, on top of this, maybe we need a periodic process for monitoring the TH quality (#1829).
Part of ooni/probe#1707 Backport from master; see ooni/probe#1839
Allows us to get http://www.isa.gov.il/Pages/default.aspx's one. Discovered when working on ooni/probe#1707. Backport from master; see ooni/probe#1839
Reducing the errors is not done in a perfect way. We have documented the most striking differences inside ooni/probe#1707 (comment) and some attempts to improve the situation further inside ooni/probe#1707 (comment). A better strategy for the future would be to introduce more specific timeout errors, such as dns_timeout_error, etc. More testing may be needed to further validate and compare the old and the new TH, but this requires Jafar improvements to more precisely simulate more complex censorship. Backport from master; see ooni/probe#1839
Matches the behavior that the legacy TH implements in this situation and reduces slightly the differences. See ooni/probe#1707 (comment) Backport from master; see ooni/probe#1839
See ooni/probe#1707 (comment) Backport from master; see ooni/probe#1839
We concluded with @hellais that it's okay to replace the old TH with the new TH. The most immediate next step was to create a release of oohelperd suitable for doing that, which I did in #1841. It remains to figure out what the remaining issues are (including addressing issues discovered as part of this one). But we can also safely close this issue.
This issue is about fetching all the URLs in all the test lists with both helpers and comparing the results. We want to understand in which cases the new test helper differs from the old test helper and why.
You may want to jump straight to the conclusions: #1707 (comment).