Following the incident in #396, we now have 867286 measurements stored in the `ooniprobe-failed-reports-eu-central-1-1d24426a` bucket that need to be reprocessed.
During the call yesterday we discussed several options for reprocessing them, and we seemed to be leaning towards making use of the existing fastpath as much as possible.
After some investigation into the fastpath code, this seems to be possible, and it would not affect the size of buckets, contrary to what we thought during the call. The bucket timestamp is derived from the measurement_uid, which means that a measurement with an older measurement_uid ends up with an older bucket date.
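To illustrate why an older measurement_uid lands in an older bucket, here is a minimal sketch. It assumes the uid starts with a `YYYYmmddHHMMSS` timestamp and that buckets are keyed by day; both the helper name and the sample uid are hypothetical, and the real fastpath code may use a different granularity.

```python
from datetime import datetime


def bucket_date_from_uid(measurement_uid: str) -> str:
    """Derive a bucket date from the timestamp prefix of a measurement_uid.

    ASSUMPTION: the first 14 characters of the uid are a timestamp in
    YYYYmmddHHMMSS form. This is an illustration, not the fastpath code.
    """
    ts = datetime.strptime(measurement_uid[:14], "%Y%m%d%H%M%S")
    return ts.strftime("%Y-%m-%d")


# Hypothetical uid: the date comes from the uid itself, not from the
# moment the measurement is (re)processed.
example_uid = "20240115093000.123456_IT_webconnectivity_abcdef0123456789"
print(bucket_date_from_uid(example_uid))
```

Because the date comes from the uid rather than from the processing time, replaying old measurements should not inflate the current bucket.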
One thing that is critical, though, is that we MUST NOT run the `ooniapi-uploader` with the same configuration it is running with in the fastpath, otherwise the postcans and jsonl will be overwritten due to a path conflict (see: https://github.com/ooni/devops/blob/main/ansible/roles/fastpath/templates/ooni_api_uploader.py#L190). What needs to happen instead is that the `collector_id` should be set to something unique (e.g. `s3`) and the `ooni_api_uploader` should be run on a different host.
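The path conflict can be pictured with a small sketch. The key layout below is hypothetical and only meant to show the mechanism; the actual template lives in the linked `ooni_api_uploader.py`, and the `"1"` collector id is a placeholder.

```python
def upload_key(bucket_ts: str, cc: str, test_name: str,
               collector_id: str, ext: str) -> str:
    """Build an object key for an uploaded file.

    HYPOTHETICAL layout: the only point is that collector_id is the one
    component that distinguishes two uploaders writing the same bucket.
    """
    return (f"raw/{bucket_ts}/{cc}/{test_name}/"
            f"{bucket_ts}_{cc}_{test_name}.{collector_id}.{ext}")


# Two uploaders sharing a collector_id produce the same key, so the
# second upload replaces the first.
existing = upload_key("2024011509", "IT", "webconnectivity", "1", "jsonl.gz")
clashing = upload_key("2024011509", "IT", "webconnectivity", "1", "jsonl.gz")

# A unique collector_id (e.g. "s3") gives the reprocessed data its own key.
separate = upload_key("2024011509", "IT", "webconnectivity", "s3", "jsonl.gz")

print(existing == clashing)  # same key: overwrite
print(existing == separate)  # different key: no conflict
```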
In summary, what we should do to reprocess measurements is as follows:
- Set up a new fastpath host with a different, unique `collector_id`
- Write a script that
  - takes measurements from s3 and performs a POST to the localhttpfeeder as if they were sent directly from the OONI API, setting the correct measurement_uid
  - upon a successful read, deletes the original file from s3
- Once all of these have been posted, wait for the `ooniapi-uploader` to run and populate the buckets in s3
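The script from the steps above could be sketched roughly as follows. This is a minimal outline, not a finished tool: the S3 access, the POST to the localhttpfeeder, and the mapping from an S3 key to its measurement_uid are passed in as callables because their exact shape is not specified here, and all function and parameter names are hypothetical.

```python
import time
from typing import Callable, Iterable, Tuple


def reprocess(
    keys: Iterable[str],
    fetch: Callable[[str], bytes],        # e.g. wraps an S3 get-object call
    post: Callable[[str, bytes], bool],   # e.g. wraps the POST to the feeder
    delete: Callable[[str], object],      # e.g. wraps an S3 delete-object call
    uid_for_key: Callable[[str], str],    # ASSUMPTION: uid is recoverable from the key
    max_per_second: float = 50.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Replay stored measurements one by one; return (posted, failed)."""
    interval = 1.0 / max_per_second
    posted = failed = 0
    for key in keys:
        started = time.monotonic()
        body = fetch(key)
        if post(uid_for_key(key), body):
            # Delete only after the feeder accepted the measurement.
            delete(key)
            posted += 1
        else:
            # Leave the object in place so a later run can retry it.
            failed += 1
        # Simple throttle: never start more than max_per_second per second.
        elapsed = time.monotonic() - started
        if elapsed < interval:
            sleep(interval - elapsed)
    return posted, failed


# Usage with an in-memory stand-in for the bucket and the feeder.
store = {"k1": b"one", "k2": b"two"}
result = reprocess(
    list(store),
    fetch=store.__getitem__,
    post=lambda uid, body: uid != "uid-k2",  # pretend k2 is rejected
    delete=store.pop,
    uid_for_key=lambda k: "uid-" + k,
    sleep=lambda s: None,
)
print(result, sorted(store))
```

Deleting only after a successful POST keeps the run safe to restart: anything still in the bucket has not been accepted yet.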
We could probably optimize step 2 a bit by updating the reprocessor code to batch writes instead of doing them sequentially, but it is likely simpler to just take more time and apply some kind of throttling to the requests.
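If we throttle the requests to 50 per second, we should be done reprocessing the data in ~4.8h (867286 / 50 ≈ 17346 seconds); finishing in ~2.5h would require roughly 100 requests per second.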
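The duration can be checked by dividing the number of measurements by the request rate. The helper name below is hypothetical, and the estimate ignores retries and per-request latency above the throttle interval.

```python
TOTAL_MEASUREMENTS = 867286  # count quoted above


def estimated_hours(total: int, per_second: float) -> float:
    """Hours needed to send `total` requests at a steady rate."""
    return total / per_second / 3600


for rate in (50, 100):
    print(f"{rate} req/s -> {estimated_hours(TOTAL_MEASUREMENTS, rate):.1f} h")
# 50 req/s -> 4.8 h
# 100 req/s -> 2.4 h
```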