
EPIC: make ooniprobe work with blocked #78

Closed
graphiclunarkid opened this issue Jul 16, 2014 · 21 comments
@graphiclunarkid commented Jul 16, 2014

There are several parts to this:

  • Robust mechanism for sending URLs to ooniprobe for testing. (I know @dantheta started, and possibly completed, a wrapper script. It would be nice to do this in native ooni code though).
  • Accepting results from ooniprobe into the Blocking Middleware database.
  • Comparing ooniprobe results with PyProbe results
  • Sharing data between Blocked and ooni.

I think we need to plan this out in stages, raising new issues for each part, and probably in discussion with others from the ooni project.

@graphiclunarkid commented Jul 16, 2014

Comparing results between different probes is openrightsgroup/Blocking-Middleware#32

@hellais commented Nov 6, 2014

Hi,

I wanted to pick this discussion back up to inform you of some of the progress we have made that may be relevant to you as well as list some possible ways of providing solutions to those problems you have listed.

Starting from the first issue:

  • Robust mechanism for sending URLs to ooniprobe for testing. (I know @dantheta started, and possibly completed, a wrapper script. It would be nice to do this in native ooni code though).

We currently ship a new tool as part of ooniprobe called oonideckgen. It generates a test deck for the probe based on the user's country. If you have a repository or an API we can query to get the list of URLs to test for blocked.co.uk, we could add it to oonideckgen so that when UK ooniprobe users run it they will also test your list of URLs.
At the moment we use the lists provided by Citizen Lab as the testing corpus.

The long term goal is to have support in oonibackend for orchestrating the probes (think remote scheduling of measurements) and the relevant ticket for that is this: https://trac.torproject.org/projects/tor/ticket/12551.

  • Accepting results from ooniprobe into the Blocking Middleware database.

We recently developed a data pipeline for processing the collected ooniprobe reports. It processes all of the reports that are submitted to ooniprobe and puts them into a MongoDB database.
Is there a specification for the data format of your Blocking Middleware database?

If that were available, we could expose an HTTP API from the ooni pipeline database that gives you all the results you need in a data format that is simple for you to process.
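A rough sketch of what such an API could return and how it might be queried. Everything here is illustrative: the record layout, field names, and the idea of filtering by tested URL are assumptions, since the real schema would come from the pipeline's MongoDB documents.

```python
import json

# Hypothetical sketch of a pipeline query endpoint: filter stored
# measurement records by tested URL and return them as JSON, as a
# /results?input=<url> endpoint might. Field names are assumptions.

def query_results(records, url):
    matches = [r for r in records if r.get("input") == url]
    return json.dumps(matches)

records = [
    {"input": "http://example.com/", "probe_cc": "GB", "probe_asn": "AS2856"},
    {"input": "http://example.org/", "probe_cc": "GB", "probe_asn": "AS5089"},
]
print(query_results(records, "http://example.com/"))
```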

  • Comparing ooniprobe results with PyProbe results

Is there a spec for the PyProbe data format? You can find the data formats for the ooniprobe tests that probably interest you most here:
https://gitweb.torproject.org/ooni/spec.git/blob/HEAD:/test-specs/ts-002-dnsconsistency.md
https://gitweb.torproject.org/ooni/spec.git/blob/HEAD:/test-specs/ts-003-http-requests.md

  • Sharing data between Blocked and ooni.

It would be awesome if you also ran an instance of ooniprobe on the machines you currently use for testing. That way you can cross-check that your measurements are correct (by doing them twice, once with ooniprobe and once with PyProbe).

I am also available to chat a bit more about possible collaborations between our two projects, via text chat or voice.

@graphiclunarkid commented Nov 19, 2014

Hi Arturo

Thanks so much for getting in touch about recent developments in the OONI project. We're still keen to find ways of collaborating with you folks 😄

I think @dantheta and @anadahz would be best placed to comment on running OONIProbe in parallel with PyProbe. Capacity-wise we should be OK on our landline test networks; however, there are throughput caps on our mobile networks, so we may need to be careful here.

An implementation of your ticket 12551 would be very useful! I'm still not clear about where the best point of integration between our systems would be though. Should ooniprobe results be reported to the Blocked database directly or via oonib? Should our middleware implement 12551 and push URLs for testing to ooniprobes directly? (Our middleware has a nice queueing system for orchestrating probes - but I don't know what it would take to adapt this for use with ooniprobe). Perhaps this might be a good subject for a call or IRC chat?

At the moment we are only recording a few different statuses as the results of our tests. In order to use OONI results in Blocked I think we just need an OK / Blocked answer from ooniprobe. I guess this could be the "If the response bodies appear to match" field from the ooniprobe results?
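Something like the following is what I have in mind: reduce an ooniprobe http_requests result to the handful of statuses Blocked records. The field names ("body_length_match", "headers_match") are my reading of the ts-003 data format, so treat the mapping as an assumption rather than the canonical one.

```python
# Hypothetical mapping from an ooniprobe http_requests measurement to a
# Blocked-style status. Field names are assumed from the ts-003 spec.

def ooni_to_blocked_status(measurement):
    body_match = measurement.get("body_length_match")
    headers_match = measurement.get("headers_match")
    if body_match is None and headers_match is None:
        return "error"  # no control/experiment comparison available
    if body_match and headers_match:
        return "ok"
    return "blocked"

print(ooni_to_blocked_status({"body_length_match": True, "headers_match": True}))   # ok
print(ooni_to_blocked_status({"body_length_match": False, "headers_match": True}))  # blocked
```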

The Blocked DB spec is available here: https://wiki.openrightsgroup.org/wiki/Censorship_Monitoring_Project_DB#Results_Table

@graphiclunarkid commented Dec 29, 2014

Rather than sharing data between the two projects, perhaps the thing to do would be to link to each other's results, so that each can be presented in the context of its respective project. This way we wouldn't have to modify anything about how results are reported on either end - except to recognise when the other project has results for the URL in question and insert a link.

@dantheta commented Feb 19, 2015

Hi Arturo,

Sorry it's taken me so long to respond to this thread!

I've started to look at more integrated ways of getting candidate URLs from the blocked test scheduler (which are AMQP queues) into ooniprobe.

I've had a pretty detailed look through the ooniprobe code and found the NetTestCase.inputProcessor method, which looks like a good place to yield test URLs from a network service. The only trouble is that it would take quite a bit of refactoring to make it work asynchronously. Do you have any thoughts on that approach, or any others that might be easier?
Ideally we'd like to keep a long-running ooniprobe process that can have URLs sent to it as individual test cases, returns results as soon as they are available, and then keeps waiting for more URLs. This might require quite a bit of work to change the internal test runner from iterating over a fixed list into something that reads asynchronous events from an incoming queue. Any suggestions or pointers welcome!
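To make the idea concrete, here's a plain-Python sketch of an inputProcessor-style generator that yields inputs from a queue instead of a fixed list, stopping on a sentinel. This is only an illustration of the pattern: the real NetTestCase.inputProcessor lives inside ooniprobe's Twisted machinery and would need the asynchronous refactoring described above.

```python
import queue

# Sentinel object that tells the generator to stop waiting for input.
STOP = object()

def queue_input_processor(q):
    """Yield test inputs from a queue until the STOP sentinel arrives."""
    while True:
        url = q.get()
        if url is STOP:
            return
        yield url

q = queue.Queue()
for u in ["http://example.com/", "http://example.org/"]:
    q.put(u)
q.put(STOP)

print(list(queue_input_processor(q)))
```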

@dantheta commented Feb 22, 2015

Another approach that I thought of looking at, was to enter a queue wait at the end of oonicli.runWithDirector.post_director_start. Each incoming url task would call director.startNetTest to kick off a nettest task, and each URL gets its own report back to the ooni-backend.

Does that sound more suitable?

@hellais commented Feb 24, 2015

Hi @dantheta thanks for thinking a bit about this!

I think by far the easiest short-term fix is simply to write a new oonideckgen processor that fetches the URLs from ORG.

But I agree that this is somewhat hackish, as it doesn't fit the dynamic, real-time nature of the ORG inputs very well.

The alternative would be to implement methods for downloading the URL list in the Deck code. In particular look at the fetchAndVerifyNetTestInput method: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/deck.py#L276.

Note that you will also need to edit https://github.com/TheTorProject/ooni-probe/blob/master/ooni/nettest.py#L178 to support setting the 'url', 'address' and 'hash' keys for it.

I think the idea of having inputs be content-addressable is no longer so solid; we should rely on the transport being secure instead of doing this verification, so it's OK to remove the hash-checking logic and just treat all URLs equally.

I am not sure whether your second suggestion is meant to avoid having to trigger ooniprobe manually. To implement that, I think the direction to go is to pick up oonid and finish implementing it.

I am more than happy to continue this discussion here or on IRC (irc.oftc.net #ooni)

@dantheta commented Feb 27, 2015

Oonid does sound like the kind of thing I was thinking of. There's a minor semantic difference, in that Oonid expects tests to be pushed to it, while the original blocked probes waited to pull tests from a queue, but that's not a huge deal.

I remember seeing an API doc for pushing tests to ooniprobe. Is that still current? I've had a look on the ooni Trac page, but can't seem to turn up too much information on Oonid.

Any pointers welcome!

@hellais commented Feb 27, 2015

@dantheta development of oonid stopped some time ago and the current version of it is probably broken.
The code is still there so that somebody interested can pick it up again, but it is not officially supported, so we don't have any documentation for it.
Its installation is also disabled by default.

I would suggest you take the route of working on oonid only if you have plans to spend a considerable amount of effort implementing this. Otherwise I think the URL fetching is the quickest way of implementing what you need.

@dantheta commented Feb 28, 2015

Yes, I think you're right about the URL fetching. I've followed your suggestion and added a small amount of code to allow retrieval of URL lists from arbitrary URLs inside the deck. Seems to be working nicely! If we keep ooniprobe running in a loop, fetching the accumulated URLs in the queue on each iteration, we should be able to keep the system quite responsive.

I'll push the work-in-progress code to a github fork in the next few days.

Thanks!

@dantheta commented Mar 4, 2015

Hi @hellais ,

I've been playing about with a few different ideas, and wanted to check to see if you had a preference in some of the options we'd been trying out:

  1. Setting up a URL endpoint and input processor for ooni-deckgen; ooni-deckgen fetches the URLs to a file, and ooni-probe runs against it as a standard deck.

  2. Adding code to ooni-probe so that a deck can point to an API endpoint, and have ooni-probe retrieve the test URL list from there

  3. Add a queue client into ooni-probe, so that a long running ooni-probe process can wait for URLs to arrive, and kick off tests as soon as they come in

Option (1) is probably the lowest amount of code, but means that we have to run ooni-deckgen and ooni-probe in a loop on the target machine, effectively polling the middleware API for new URLs to test. As you point out, we'd have to deactivate some of ooni-deckgen's caching and validation code to make it re-fetch named input lists on each run. Running a polling loop also means that there is a delay between the submission of a URL on the frontend and having a test started on the probe machines. We could look at running multiple ooni-probe processes in parallel (on a staggered loop schedule) in order to improve responsiveness, but any hardcoded paths that ooni-deckgen saves to may give us problems when multiple processes are trying to run concurrently.

Option (2) is a bit more dynamic than option (1). Since it doesn't save assets to the filesystem, there's no problem running multiple instances of the probe to improve responsiveness. This approach does make it harder to use a target URL list with multiple tests, though (since the URL list would have to be fetched once inside each test).

Option (3) is the most elaborate change; we'd add a queue listener to ooni-probe, so that it stays running as a daemon process. When a URL arrives on the queue, it fires off a callback that sets up a deck with the blocking/http_request and the DNS consistency test. While those are running, it listens on the queue again for the next input URL. This option gives the best responsiveness, but the main question I have is: would a modification like this be useful in the wider OONI project? While it's not completely specific to the blocked project, it could help provide a way to orchestrate tests by pushing target URLs to remote probes (similar to the goals of oonid).
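As a toy model of option (3)'s control flow: a long-running worker waits on a queue and fires a test callback for each arriving URL. Real code would use an AMQP client and ooniprobe's director; this stdlib sketch (with a stand-in lambda for the nettest) only shows the shape of the loop.

```python
import queue
import threading

def run_probe_loop(q, run_test, results):
    """Long-running loop: block on the queue, run a test per URL,
    stop when the None sentinel arrives."""
    while True:
        url = q.get()
        if url is None:  # shutdown sentinel
            return
        results.append(run_test(url))

q = queue.Queue()
results = []
worker = threading.Thread(
    target=run_probe_loop,
    args=(q, lambda url: (url, "ok"), results),  # stand-in for a nettest
)
worker.start()
q.put("http://example.com/")
q.put(None)
worker.join()
print(results)
```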

Are any of these approaches preferable to the ooni project, do you think?

I hope that makes sense!

Cheers,

Daniel.

@dantheta commented Mar 17, 2015

Hi @hellais

Just an update on this - I've been looking more at probe performance on raspberry pi hardware, and the startup time for ooniprobe on that platform is quite prohibitive (a Pi takes 14 seconds just to load the twisted network library). It's going to be quite difficult to get the performance we're looking for if we're repeatedly executing ooniprobe processes.

However, I have successfully integrated the blocked.org.uk queue driver into ooniprobe (which was option (3) in the list from the previous post), which has the advantage of using a long-running ooniprobe process and running multiple rounds of tests inside it. We can also parallelise by having multiple ooniprobes listening on the same queue, which helps a lot with our responsiveness objectives.

We only added a couple of other things: HTTPS support for collectors, and a few named environment variables folded into the report data that is sent back. These let us authenticate probes against the database and attribute the results to the corresponding ISPs.
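The environment-variable part amounts to something like the sketch below: fold a few named variables into the report metadata sent to the collector. The variable and key names here are made up for illustration; they are not the ones in our branch.

```python
import os

# Hypothetical sketch: copy selected environment variables into the
# report metadata so the middleware can authenticate the probe and
# attribute results to an ISP. Names are illustrative only.

def report_metadata(base):
    meta = dict(base)
    for env_key, report_key in [
        ("PROBE_AUTH_TOKEN", "auth_token"),
        ("PROBE_ISP", "isp"),
    ]:
        if env_key in os.environ:
            meta[report_key] = os.environ[env_key]
    return meta

os.environ["PROBE_ISP"] = "ExampleNet"
print(report_metadata({"test_name": "http_requests"})["isp"])
```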

I'll post a URL with the ooniprobe branch that we're working on in the near future.

@hellais commented Mar 17, 2015

@dantheta that is awesome!

Thanks for doing this.

Regarding Pi performance, which model have you run this on? The Pi B+ is indeed very slow, but the Pi 2 B is supposed to be much better though I have not tried it out yet.

Once we implement the daemonization of OONI you will probably be able to run measurements in a much more performant fashion.

HTTPS collector support is something we have had on our wishlist for some time and it's great to hear that you have implemented it.
Let me know when I can review the code and work on integrating it into upstream.

@dantheta commented Mar 17, 2015

I've tested it on the model B and the Pi2 - The model B does a single URL blocking/http_request test in 1m 30s, and the Pi 2 does the same test in 30s (I have both!). I've done a bit of profiling on the process, and a lot of the time seems largely unavoidable - most time is taken in Python, allocating memory (slowly!).

HTTPS collector support was surprisingly easy: the Agent has SSL support built in, so it's mostly just a matter of changing the regexes to allow HTTPS URLs. I do need to test it a bit more; it rejects self-signed certificates, and it is incredibly hard to make Twisted SSL simply skip certificate verification.

The changes that have gone into runWithDirector go some way towards making ooniprobe a daemon: instead of being a one-shot function, runWithDirector now alternates between waiting for a URL and running a test, all using Twisted Deferreds.

I've got a few more bits of tidying up, and then I'll post the Github URL.

@hellais commented Mar 17, 2015

@dantheta ah yes I know the pain of getting Twisted to do custom SSL validation.

I just pushed code for a little project I had sitting on my hard drive for some time that may be of use to you: https://github.com/hellais/sslpin

I originally started writing it when we wanted to add SSL support to OONI so that we could have the nice self-verifying property that Tor hidden services have, without relying on CAs.

If you read the README you will get an idea of how it works.

Note that the code was written for an older version of Twisted and does not work with the latest version because of changes to the Agent, but you should still get an idea of how to implement custom SSL validation by looking at:
https://github.com/hellais/sslpin/blob/master/sslpin/contexts.py
https://github.com/hellais/sslpin/blob/master/sslpin/agent.py#L6

If I have some spare time in the next few days I will fix it to work with the latest version of Twisted. To give you some pointers, what needs to be done is to conditionally call _wrapContextFactory (< 14.0), _policyForHTTPS.creatorForNetloc (>= 14.0, < 15.0) or usingEndpointFactory (>= 15.0), depending on which version of Twisted you are targeting.
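The version dispatch could look roughly like this. The function names come from the pointers above; this standalone helper only chooses the branch by version and does not touch Twisted itself, so treat it as a sketch of the conditional, not working integration code.

```python
# Sketch: pick the Agent TLS wiring strategy based on the Twisted
# version being targeted. Returns the name of the API to use.

def agent_tls_strategy(twisted_version):
    major, minor = twisted_version[:2]
    if (major, minor) < (14, 0):
        return "_wrapContextFactory"
    if (major, minor) < (15, 0):
        return "_policyForHTTPS.creatorForNetloc"
    return "usingEndpointFactory"

print(agent_tls_strategy((13, 2)))  # _wrapContextFactory
print(agent_tls_strategy((14, 0)))  # _policyForHTTPS.creatorForNetloc
print(agent_tls_strategy((15, 1)))  # usingEndpointFactory
```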

@dantheta commented Mar 18, 2015

The branch I'm working on is here:
https://github.com/dantheta/ooni-probe/tree/org_queue

I don't think it's quite ready for merge review yet - there's a lot of duplication (mainly to avoid breaking things) which can probably be taken care of with judicious refactoring.

@hellais commented Mar 19, 2015

@dantheta awesome, thanks. I will take a look at it and let you know what needs to be done to get it ready for merge into OONI.

This is also very useful for getting a better understanding of how the ORG architecture works and of how organisations with similar requirements can use ooniprobe for their purposes.

@dantheta commented Mar 19, 2015

We've got a basic system diagram here: https://github.com/openrightsgroup/cmp-spec/blob/master/Queuing/queues.png

The system has been enhanced a bit since the diagrams were made, but it covers the basics. Let me know if there's anything you'd like me to expand on.

@dantheta commented Apr 14, 2015

I've made a couple of further refinements to improve performance:

The GeoIP and Tor connection routines have been amended so that they keep their settings for second and subsequent tests. This isn't something that would affect ooniprobe normally, since the program would exit after it had finished its test run, but since we're adapting it to run multiple tests we wanted to save some time by not repeating bits of the setup routine.

I've also added a commandline flag that disables the serialization of the target URL page body. On large webpages, serializing the page body into a YAML document is a major chunk of the time taken - almost 20 seconds for a 0.5MB webpage on a Pi v2.
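The effect of that flag is essentially the following: when body serialization is disabled, drop the response body from each report entry before it is YAML-serialized, since dumping a large body dominates the test time on a Pi. Key names are illustrative, not those used in the branch.

```python
# Sketch of the body-serialization flag: optionally strip the (large)
# response body from a report entry before serialization.

def prepare_entry(entry, include_body=True):
    if include_body:
        return entry
    slim = dict(entry)
    slim.pop("response_body", None)  # illustrative key name
    return slim

entry = {"url": "http://example.com/", "status": 200,
         "response_body": "<html>" + "x" * 10000 + "</html>"}
print("response_body" in prepare_entry(entry, include_body=False))  # False
```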

The current blocked.org.uk system runs two streams of test URLs - there are the live submissions, where website users enter their test URLs, and then there's also a background reprocessing job, which submits previously tested URLs for periodic re-testing.

What we're thinking of doing is running ooniprobe with a very stripped down configuration for the live submission URLs, in order to get the very fastest answer for the website users. This configuration would just run the blocking/http_requests test, with no bouncer or helpers (since that test doesn't require any) and no page response body. This runs very quickly indeed!

The background reprocessing job would have more options enabled: we could run the DNS consistency test there and include the full page bodies in the reports, which could then be submitted to the OONI report repository. This part of the system is much less time-sensitive, and one or two minutes per test is entirely fine.

The updated branch is here: https://github.com/dantheta/ooni-probe/tree/org_queue_simplified

(it's called simplified because we'd also tried reading URLs off the queue in batches, in order to minimize the impact of the test deck setup/teardown time; once we'd made the time savings on GeoIP, Tor connection and bouncer querying, the system was sufficiently responsive even running one url per deck, so the batch approach was abandoned).

@hellais commented May 26, 2015

Hi @dantheta, thank you so much for this excellent work 👍

I will today begin reviewing your PR.

I rebased it against the current master and opened pull request #395 to discuss the changes and code.

@dantheta commented Jun 11, 2016

The queue driver has been accepted upstream (many thanks!). There's some work to do in how the results are uploaded back (the queue driver turns each received URL into a deck, which then creates a report for a single URL. This isn't optimal for ingestion through oonibackend). Having the middleware gather the results for submission into a single report would seem to be a viable option. I'll spin that out as a separate issue.
