Here's a quick outline (by Joe) of the ooni-pipeline process at the moment.

  • ooni-backend machines upload "raw reports" (yaml files) to the ooni-incoming S3 bucket. The following is run daily by cron:
    find /data/bouncer/archive -type f -print0 \
        | xargs -0 -I FILE \
                aws s3 mv FILE s3://ooni-incoming/yaml/
  • those reports then get moved to the ooni-private S3 bucket (this 2nd step is for permissions trickery). Another daily cron job:
    date_bin=$(date -I)
    if [ -z "$date_bin" ]; then exit -1; fi
    
    # should be able to do this filtering with the aws --exclude and --include
    # flags, but I can't get that to work (see the rough sketch after this list).
    
    aws s3 ls s3://ooni-incoming/yaml/ \
            | awk '{print $4}' \
            | grep '\.yamloo$' \
            | xargs -I FILE \
                    aws s3 mv s3://ooni-incoming/yaml/FILE \
                              s3://ooni-private/reports-raw/yaml/$date_bin/
  • the invoke bins_to_sanitised_streams command from this repo does some sanitisation and aggregates the reports by date into json streams in the ooni-public bucket (the folders ("bins") correspond to a pipeline date, not the report measurement date):
    invoke bins_to_sanitised_streams \
        --unsanitised-dir "s3n://ooni-private/reports-raw" \
        --sanitised-dir "s3n://ooni-public" \
        --date-interval 2012-12-01-2016-01-01 \
        --workers 32
  • the invoke streams_to_db command reads the json streams and puts each report entry (there are many entries in a report) as a row into the postgres db:
    invoke streams_to_db --streams-dir "s3n://ooni-public/json"
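
One aside on the second step's filtering: in principle a single aws s3 mv with --recursive plus --exclude/--include should replace the ls | awk | grep | xargs pipe above. A rough, untested sketch of that form (same buckets and $date_bin as in the cron job):
    # untested: exclude everything, then include only the .yamloo reports
    aws s3 mv s3://ooni-incoming/yaml/ \
              s3://ooni-private/reports-raw/yaml/$date_bin/ \
              --recursive --exclude "*" --include "*.yamloo"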

We currently run the bins->streams step on a c3.8xlarge (32 cores) with 32 processes and 80GB of EBS (the S3 files get cached on disk on their way in and out, so this can eat a lot of space). It takes about a day to run through the whole dataset.

We run the streams->db step on an m4.xlarge with 1 process, and it also takes about a day to run. I haven't looked into what the speed bottleneck is here.
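
If someone wants to chase that bottleneck down, a rough first check is whether the single process is CPU-bound or mostly waiting on I/O; something like the sketch below should do (the process-name match is a guess, and pidstat comes from the sysstat package):
    # per-process CPU and disk I/O stats, sampled every 5 seconds
    pidstat -u -d -p "$(pgrep -f -d, streams_to_db)" 5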
