
Data batching and indexing details


Before the Data Access API can provide researchers with their data, everything uploaded by the apps must be processed and indexed.

Data Processing provides the following data guarantees:

  • All data are guaranteed to be sorted by timestamp. (The app very occasionally produces out-of-sequence data points.)
  • All data are guaranteed to have their first two columns contain a human-readable timecode (in UTC, to avoid ambiguity) in addition to the original unix millisecond timestamp.
  • All data are "binned" into the hour in which they were recorded so that data can be queried on an hourly basis (a sketch of these guarantees follows this list).
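
A minimal sketch of what these guarantees amount to, assuming raw rows carry a unix millisecond timestamp in their first column; the function name, column order, and timecode format here are illustrative, not the pipeline's actual code:

```python
from collections import defaultdict
from datetime import datetime, timezone

def sort_and_bin_rows(rows):
    """rows: lists whose first column is a unix millisecond timestamp (as a string)."""
    rows = sorted(rows, key=lambda row: int(row[0]))  # enforce timestamp ordering
    hourly_bins = defaultdict(list)
    for row in rows:
        ts_ms = int(row[0])
        dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
        # Keep the original millisecond timestamp and add a human-readable UTC timecode.
        output_row = [str(ts_ms), dt.strftime("%Y-%m-%dT%H:%M:%S")] + row[1:]
        # Bin each row into the hour it was recorded in, so it can be queried hourly.
        hour_key = dt.replace(minute=0, second=0, microsecond=0)
        hourly_bins[hour_key].append(output_row)
    return hourly_bins
```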

Data fixes

There are several "fixes" applied to various data streams. These address missing data components, values stored outside of the file in the original data set (for example, in the file name), and incorrectly formatted data from the original source. A sketch of the file-name class of fix follows the list below.

  • The Survey Timings data stream has the survey ID inserted into each row; the app uploads it with the survey ID only in the file name.
  • The Call Log data stream does not conform to the timestamping and associated column names of the other data streams, and is therefore corrected.
  • Device Identifiers does have millisecond precision in its timestamp, but that timestamp is pulled out of the file name.
  • All rows of a Wifi scan occur at the same instant (the data stream is a snapshot of the currently visible local wifi networks), and the scan's timestamp was originally stored only in the file name; that timestamp needs to be inserted into every data point.
  • The (Android) App Log was originally an unstructured development tool, but it proved useful and was transitioned into a data stream. Fixes include inserting the file creation time as a data point, dropping certain data points that exist purely for development debugging, and (fixed in newer versions of the app) inserting missing timestamps on log messages to maintain consistency.
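
As an illustration of the "value stored in the file name" class of fix (the Survey Timings survey ID, the Device Identifiers timestamp, the Wifi scan timestamp), a hypothetical helper might look like the following; the path layout and function name are assumptions for illustration, not the pipeline's actual code:

```python
def insert_filename_value_into_rows(file_path, rows, header, column_name, path_index):
    """Pull a value out of an uploaded file's path and append it to every row.

    e.g. a hypothetical path "patient/surveyTimings/<survey_id>/<unix_ms>.csv"
    with path_index=2 would attach the survey ID to each row.
    """
    value = file_path.split("/")[path_index]
    new_header = header + [column_name]
    new_rows = [row + [value] for row in rows]
    return new_header, new_rows
```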

Operation

As files are uploaded and re-encrypted for storage on AWS S3, they are added to a database of to-be-processed files. The Data Processing Manager server checks every 6 minutes for new files and creates new data processing tasks using the Python Celery framework. All tasks are scheduled to expire after 5 minutes and 30 seconds, and the manager ensures that no currently-running task receives a duplicate queue entry. This guarantees that there will be no overlapping data processing tasks for a particular study participant. Each processing task pulls data from S3, applies fixes, enforces ordering, and then merges and deduplicates the data (see operational details below) with existing data that has already been processed (e.g. data available via the Data Download API).
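
A rough sketch of how this enqueueing could look with Celery, using the 330-second (5 minute 30 second) expiry described above; the task name, arguments, and broker URL are placeholders, not the project's actual code:

```python
from celery import Celery

app = Celery("data_processing", broker="redis://localhost:6379/0")

@app.task(name="process_participant_files")
def process_participant_files(participant_id):
    ...  # pull files from S3, apply fixes, enforce order, merge and deduplicate

def enqueue_new_tasks(participants_with_new_files, currently_running_participants):
    for participant_id in participants_with_new_files:
        # Skip participants that already have a running task, so no overlapping
        # processing can occur for the same study participant.
        if participant_id in currently_running_participants:
            continue
        # The task is discarded if a worker has not picked it up within 330 seconds.
        process_participant_files.apply_async(args=[participant_id], expires=330)
```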

Important Operational Details For System Admins

  • The exact timing of new tasks is normalized: tasks are enqueued at "30 seconds before the beginning of the next 6 minute block", not after a naive 330 second wait (see the sketch after this list).

  • Note that if you inspect the status of Celery's task queue you may see duplicate entries. This is because tasks are discarded only when a Celery worker gets to them and checks the expiry timestamp.

  • In addition to the other data guarantees, all rows are deduplicated. As a result, if there is an operational error and an uploaded file is processed more than once, the output of data processing will be the same as if that file had been processed only once. Note that this feature was introduced in 2016; data processed before then may not be deduplicated.

  • Under situations of high load, where more data is uploaded than can be processed in real time, processing usually catches up during off-peak times of day. The upload peak commonly occurs during the evening; our best guess is that study participants return home to stable wifi. The effect is even more pronounced on the night preceding a weekend (which isn't always Friday).

  • As of the end of January 2019, data processing is considerably more aggressive about processing data as it comes in, and as a result compute utilization is much better. (There were also some pathological cases that could clog up the multiprocessing aspect of the older method.) Based on metrics from one cluster with 43 gigabytes of uploaded data per week, a c5.large server uses less than 20% CPU (per AWS monitoring statistics) to process all data.
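
A small sketch of the scheduling normalization mentioned in the first bullet above ("30 seconds before the beginning of the next 6 minute block"); the function name is illustrative:

```python
from datetime import datetime, timedelta, timezone

def next_enqueue_time(now=None):
    """Return the next '30 seconds before the start of a 6-minute block'."""
    now = now or datetime.now(timezone.utc)
    start_of_hour = now.replace(minute=0, second=0, microsecond=0)
    minutes_into_hour = (now - start_of_hour).total_seconds() / 60
    next_block_start = start_of_hour + timedelta(minutes=6 * (int(minutes_into_hour // 6) + 1))
    return next_block_start - timedelta(seconds=30)
```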
