openrightsgroup / cmp-issues
EPIC: make ooniprobe work with blocked #78
Comparing results between different probes is tracked in openrightsgroup/Blocking-Middleware#32
Hi, I wanted to pick this discussion back up to let you know about some of the progress we have made that may be relevant to you, and to suggest some possible solutions to the problems you listed. Starting from the first issue:
We currently have a new tool, shipped as part of ooniprobe, called oonideckgen. It generates a test deck for the probe depending on the user's country. If you have a repository or an API we can query to get the list of URLs to test for blocked.org.uk, we could add it to oonideckgen, so that when UK ooniprobe users run it they will also test your list of URLs. The long-term goal is to have support in oonibackend for orchestrating the probes (think remote scheduling of measurements); the relevant ticket for that is https://trac.torproject.org/projects/tor/ticket/12551.
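As a rough sketch of what the URL-fetching side of this could look like (the endpoint and the shape of the deck inputs here are hypothetical, not an agreed API):

```python
import urllib.request

# Hypothetical endpoint; the real API and response format would need
# to be agreed between the two projects.
ORG_URL_LIST = "https://api.example.org/urls?country=GB"

def parse_url_list(text):
    """Turn a newline-separated URL list into a list of URLs,
    skipping blank lines and '#' comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls

def fetch_deck_inputs(endpoint=ORG_URL_LIST):
    """Fetch the remote list and shape it as deck input entries
    (the {'url': ...} shape is illustrative, not oonideckgen's format)."""
    with urllib.request.urlopen(endpoint) as resp:
        text = resp.read().decode("utf-8")
    return [{"url": u} for u in parse_url_list(text)]
```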
We recently developed a data pipeline for processing the collected ooniprobe reports. It processes all of the reports submitted by ooniprobe and stores them in a MongoDB database. On top of that, we could expose an HTTP API from the OONI pipeline database that gives you all the results you need in a data format that is simple for you to process.
Is there a spec for what the PyProbe data format looks like? You can find the data format for the ooniprobe tests that probably interest you most here:
It would be awesome if you also ran an instance of ooniprobe on the machines that you currently use for testing. That way you could cross-check that the measurements you are doing are correct (by doing them twice, once with ooniprobe, once with PyProbe). I am also available to talk a bit more about possible collaborations between our two projects, either via chat or voice.
Hi Arturo, thanks so much for getting in touch about recent developments in the OONI project. We're still keen to find ways of collaborating with you folks. I think @dantheta and @anadahz would be best placed to comment on running OONIProbe in parallel with PyProbe. Capacity-wise we should be OK on our landline test networks; however, there are throughput caps on our mobile networks, so we may need to be careful there. An implementation of your ticket 12551 would be very useful!

I'm still not clear about where the best point of integration between our systems would be, though. Should ooniprobe results be reported to the Blocked database directly, or via oonib? Should our middleware implement 12551 and push URLs for testing to ooniprobes directly? (Our middleware has a nice queueing system for orchestrating probes, but I don't know what it would take to adapt this for use with ooniprobe.) Perhaps this might be a good subject for a call or IRC chat?

At the moment we only record a few different statuses as the results of our tests. To use OONI results in Blocked, I think we just need an OK / Blocked answer from ooniprobe. I guess this could be the "response bodies appear to match" field from the ooniprobe results? The Blocked DB spec is available here: https://wiki.openrightsgroup.org/wiki/Censorship_Monitoring_Project_DB#Results_Table
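A minimal sketch of that OK / Blocked mapping might look like the following (the field name is an assumption about the report format, not the actual ooniprobe schema):

```python
def blocked_status(measurement):
    """Map an ooniprobe http_requests-style result onto the Blocked DB's
    ok/blocked statuses.  'body_length_match' is a hypothetical stand-in
    for the 'response bodies appear to match' field."""
    match = measurement.get("body_length_match")  # assumed key name
    if match is None:
        return "error"      # test failed or field missing
    return "ok" if match else "blocked"
```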
Rather than sharing data between the two projects, perhaps the thing to do would be to link to each other's results, so that each can be presented in the context of its respective project. This way we wouldn't have to modify anything about how results are reported on either end - except to recognise when the other project has results for the URL in question and insert a link.
Hi Arturo, sorry it's taken me so long to respond to this thread! I've started to look at more integrated ways of getting candidate URLs from the Blocked test scheduler (which uses AMQP queues) into ooniprobe. I've had a pretty detailed look through the ooniprobe code, and I've found the NetTestCase.inputProcessor method, which looks like a pretty good place to yield test URLs from a network service. The only trouble is that it would require quite a bit of refactoring to make it work asynchronously. Do you have any thoughts on that approach, or any other approaches that might be easier?
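To make the idea concrete, here is a sketch of an inputProcessor-style generator fed from a queue, using the stdlib queue module. Each .get() blocks the calling thread, which is exactly why this shape is awkward inside Twisted's single-threaded reactor and would need the refactoring mentioned above:

```python
import queue

STOP = object()  # sentinel telling the generator to finish

def queue_input_processor(work_queue):
    """Generator in the spirit of NetTestCase.inputProcessor that yields
    test URLs from a queue instead of a file.  This is a blocking
    illustration, not the actual ooniprobe implementation."""
    while True:
        item = work_queue.get()  # blocks until a URL (or STOP) arrives
        if item is STOP:
            return
        yield item
```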
Another approach I thought of was to enter a queue wait at the end of oonicli.runWithDirector.post_director_start. Each incoming URL task would call director.startNetTest to kick off a nettest task, and each URL would get its own report back to the ooni-backend. Does that sound more suitable?
Hi @dantheta, thanks for thinking a bit about this! I think that by far the easiest short-term fix is simply to write fetching URLs from ORG as a new processor in oonideckgen. But I agree that this is sort of hackish, as it doesn't apply very well to the dynamic, real-time nature of the ORG inputs. The alternative would be to implement methods for downloading the URL list directly. Note that you would also need to edit https://github.com/TheTorProject/ooni-probe/blob/master/ooni/nettest.py#L178 to support setting the 'url', 'address' and 'hash' keys.

I think the idea of having inputs be content-addressable is not so solid any more; we should just rely on the transport being secure instead of doing this verification, so it's OK to remove the hash-checking logic and treat all URLs equally.

I am not sure whether, with your second suggestion, you want to avoid having to manually trigger ooniprobe. To implement that, I think the direction to go is to pick up oonid.

I am more than happy to continue this discussion here or on IRC (irc.oftc.net #ooni)
Oonid does sound like the kind of thing I was thinking of. There's a minor semantic difference, in that oonid expects tests to be pushed to it, while the original Blocked probes waited to pull tests from a queue, but that's not a huge deal. I remember seeing an API doc for pushing tests to ooniprobe; is that still current? I've had a look on the OONI Trac page, but can't seem to turn up much information on oonid. Any pointers welcome!
@dantheta development of oonid stopped some time ago, and the current version of it is probably broken. I would suggest taking the route of working on oonid only if you plan to spend a considerable amount of effort implementing this. Otherwise, I think URL fetching is the quickest way of implementing what you need.
Yes, I think you're right about the URL fetching. I've followed your suggestion and added a small amount of code to allow retrieval of URL lists from arbitrary URLs inside the deck. It seems to be working nicely! If we keep ooniprobe running in a loop, fetching the accumulated URLs from the queue on each iteration, we should be able to keep the system quite responsive. I'll push the work-in-progress code to a GitHub fork in the next few days. Thanks!
Hi @hellais, I've been playing about with a few different ideas, and wanted to check whether you have a preference among the options we've been trying out:
Option (1) is probably the lowest amount of code, but means that we have to run ooni-deckgen and ooni-probe in a loop on the target machine, effectively polling the middleware API for new URLs to test. As you point out, we'd have to deactivate some of ooni-deckgen's caching and validation code to make it re-fetch named input lists on each run. Running a polling loop also means there is a delay between the submission of a URL on the frontend and the start of a test on the probe machines. We could run multiple ooni-probe processes in parallel (on a staggered loop schedule) to improve responsiveness, but any hardcoded paths that ooni-deckgen saves to may cause problems when multiple processes run concurrently.

Option (2) is a bit more dynamic than option (1). Since it doesn't save assets to the filesystem, there's no problem running multiple instances of the probe to improve responsiveness. This approach does make it harder to use a target URL list with multiple tests, though, since the URL list would have to be fetched once inside each test.

Option (3) is the most elaborate change: we'd add a queue listener to ooni-probe, so that it stays running as a daemon process. When a URL arrives on the queue, it fires off a callback that sets up a deck with the blocking/http_request and DNS consistency tests. While those are running, it listens on the queue again for the next input URL. This option gives the best responsiveness, but the main question I have is whether a modification like this would be useful in the wider OONI project. While it's not completely specific to the Blocked project, it could help provide a way to orchestrate tests by pushing target URLs to remote probes (similar to the goals of oonid).

Are any of these approaches preferable to the OONI project, do you think? I hope that makes sense! Cheers, Daniel.
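Option (3)'s control flow can be sketched as follows, using asyncio in place of Twisted's Deferreds purely for illustration (run_nettest stands in for something like director.startNetTest; none of these names come from the actual ooniprobe code):

```python
import asyncio

async def probe_daemon(url_queue, run_nettest):
    """Option (3) as a control-flow sketch: wait for a URL, run the
    nettest for it, then go back to listening on the queue."""
    while True:
        url = await url_queue.get()
        if url is None:          # shutdown signal
            break
        await run_nettest(url)   # one deck/report per URL

async def demo():
    """Drive the daemon with a fake nettest to show the cycle."""
    results = []

    async def fake_nettest(url):
        results.append(url)

    q = asyncio.Queue()
    for u in ("http://a", "http://b", None):
        q.put_nowait(u)
    await probe_daemon(q, fake_nettest)
    return results
```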
Hi @hellais, just an update on this: I've been looking more at probe performance on Raspberry Pi hardware, and the startup time for ooniprobe on that platform is quite prohibitive (a Pi takes 14 seconds just to load the Twisted network library). It's going to be quite difficult to get the performance we're looking for if we're repeatedly executing ooniprobe processes.

However, I have successfully integrated the blocked.org.uk queue driver into ooniprobe (option (3) in the list from the previous post), which has the advantage of using a long-running ooniprobe process that runs multiple rounds of tests. We can also parallelise by having multiple ooniprobes listening on the same queue, which helps a lot with our responsiveness objectives. We added only a couple of other things: HTTPS support for collectors, and a few named environment variables in the report data that is sent back. These allow us to authenticate probes against the database and to attribute results to the corresponding ISPs. I'll post a URL with the ooniprobe branch that we're working on in the near future.
@dantheta that is awesome, thanks for doing this! Regarding Pi performance, which model have you run this on? The Pi B+ is indeed very slow, but the Pi 2 B is supposed to be much better, though I have not tried it myself yet. Once we implement the daemonization of OONI, you will be able to run measurements in a much more performant fashion. HTTPS collector support is something we have had on our wishlist for some time, and it's great to hear that you have implemented it.
I've tested it on the Model B and the Pi 2 (I have both!). The Model B does a single-URL blocking/http_request test in 1m 30s, and the Pi 2 does the same test in 30s. I've done a bit of profiling on the process, and a lot of the time seems largely unavoidable: most of it is spent in Python, allocating memory (slowly!). HTTPS collector support was surprisingly easy: the Agent has SSL support built in, so it's mostly just a matter of changing the regexes to allow HTTPS URLs. I do need to test it a bit more, since it currently rejects self-signed certificates, and it is incredibly hard to make Twisted SSL simply ignore certificate verification. The changes that have gone into runWithDirector go some way towards making ooniprobe a daemon: instead of being a one-shot function, runWithDirector now alternates between waiting for a URL and running a test, all using Twisted Deferreds. I've got a few more bits of tidying up to do, and then I'll post the GitHub URL.
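For comparison, this is what blanket-disabling verification looks like with the standard library's ssl module; in Twisted the equivalent is supplying Agent with a policy whose connection creator skips verification, which is the fiddly part being described above. Either way, this should only ever be pointed at collectors you control:

```python
import ssl

def unverified_client_context():
    """Build a TLS client context with hostname checking and certificate
    verification switched off (the stdlib analogue of 'just ignore the
    certificate').  check_hostname must be disabled before verify_mode,
    or ssl raises a ValueError."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```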
@dantheta ah yes, I know the pain of getting Twisted to do custom SSL validation. I just pushed the code for a little project I'd had sitting on my hard drive for some time that may be of use to you: https://github.com/hellais/sslpin I originally started writing it when we wanted to add SSL support to OONI, so that we could have the nice self-verifying property that Tor hidden services have and not have to rely on CAs. If you read the README you'll get an idea of how it works. Note that the code was written for an older version of Twisted and does not work with the latest version because of the changes they made to the Agent, but you should still get an idea of how to implement custom SSL validation by looking at: If I have some spare time in the next few days I will fix it to work with the latest version of Twisted, but to give you some pointers, what needs to be done is conditionally call
The branch I'm working on is here: I don't think it's quite ready for merge review yet - there's a lot of duplication (mainly to avoid breaking things), which can probably be taken care of with judicious refactoring.
@dantheta awesome, thanks. I will take a look at it and let you know what needs to be done to get it ready for merge into OONI. This is also very useful for getting a better understanding of how the ORG architecture works and how organisations with similar requirements can use ooniprobe for their purposes.
We've got a basic system diagram here: https://github.com/openrightsgroup/cmp-spec/blob/master/Queuing/queues.png The system has been enhanced a bit since the diagrams were made, but it covers the basics. Let me know if there's anything you'd like me to expand on.
I've made a couple of further refinements to improve performance.

The GeoIP and Tor connection routines have been amended so that they keep their settings for the second and subsequent tests. This isn't something that would affect ooniprobe normally, since the program would exit after finishing its test run, but since we're adapting it to run multiple tests, we wanted to save some time by not repeating parts of the setup routine.

I've also added a command-line flag that disables the serialization of the target URL's page body. On large webpages, serializing the page body into a YAML document is a major chunk of the time taken: almost 20 seconds for a 0.5MB webpage on a Pi 2.

The current blocked.org.uk system runs two streams of test URLs: there are the live submissions, where website users enter their test URLs, and there's also a background reprocessing job, which submits previously tested URLs for periodic re-testing. What we're thinking of doing is running ooniprobe with a very stripped-down configuration for the live submission URLs, in order to get the fastest possible answer for website users. This configuration would just run the blocking/http_requests test, with no bouncer or helpers (since that test doesn't require any) and no page response body. This runs very quickly indeed! The background reprocessing job would have more options enabled: we could run the DNS consistency test there and include the full page bodies in the reports, which could be submitted to the OONI report repository. This part of the system is much less time-sensitive, and one or two minutes per test is entirely fine.
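The effect of the body-stripping flag can be sketched like this (the report shape and the 'response_body' key are assumptions for illustration, not the actual report schema):

```python
def strip_bodies(report_entry, include_body=False):
    """Illustrative version of the 'no page body' flag: drop the response
    body from a report entry before serialization, since serializing
    large bodies dominated per-test time on the Pi."""
    if include_body:
        return report_entry
    slim = dict(report_entry)           # leave the caller's dict intact
    slim.pop("response_body", None)     # key name is an assumption
    return slim
```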
The updated branch is here: https://github.com/dantheta/ooni-probe/tree/org_queue_simplified (it's called "simplified" because we had also tried reading URLs off the queue in batches, to minimize the impact of the test deck setup/teardown time; once we'd made the time savings on GeoIP, Tor connection and bouncer querying, the system was sufficiently responsive even running one URL per deck, so the batch approach was abandoned).
Hi @dantheta, thank you so much for this excellent work. I will begin reviewing your PR today. I have rebased it against the current master and opened pull request #395 to discuss the changes and code.
The queue driver has been accepted upstream (many thanks!). There's still some work to do on how the results are uploaded back: the queue driver turns each received URL into a deck, which then creates a report for a single URL, and this isn't optimal for ingestion through oonibackend. Having the middleware gather the results into a single report for submission would seem to be a viable option. I'll spin that out as a separate issue.
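That middleware-side aggregation could be as simple as the following sketch (the report shape is illustrative; the real oonibackend report format would dictate the actual fields):

```python
def aggregate_reports(per_url_results):
    """Fold many single-URL results into one report structure for
    submission, instead of one report per URL.  'test_name' and
    'measurements' are hypothetical field names."""
    report = {"test_name": "http_requests", "measurements": []}
    for result in per_url_results:
        report["measurements"].append(result)
    return report
```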
There are several parts to this:
I think we need to plan this out in stages, raising new issues for each part, and probably in discussion with others from the ooni project.