Pre-calculate reports, store them in CouchDB for retrieval #27

Closed · wants to merge 16 commits

@nmalkin (owner) commented on Jul 13, 2012

Overview

As part of implementing #10, we're changing the architecture so that reports are pre-calculated (as much as possible) and stored in CouchDB. Then, when a request for a report is received, the server just gets the information from the database and passes it on to the user.

More details

Prior process

On every request, data was downloaded, the report was prepared, then sent back to the user.

Now

Separately from the server, data is downloaded and stored in CouchDB. (This is done, manually for now, using the script server/bin/update.) The server sets up views in CouchDB that map/reduce the data into a form ready for the report.
On user request, the data is retrieved from the appropriate view and sent back.
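To make the pre-calculation concrete, here is a minimal sketch of what such a view could look like. The design document name, view name, and document fields are assumptions for illustration, not necessarily the ones this repo uses:

```js
// Hypothetical design document: counts, per (date, step), how many data
// points were recorded. All names here are illustrative.
{
  "_id": "_design/reports",
  "views": {
    "new_user_flow": {
      // Map: one row per (date, step) pair found in a data point.
      "map": "function (doc) { if (doc.date && doc.step) { emit([doc.date, doc.step], 1); } }",
      // Reduce: CouchDB's built-in counter sums up the emitted 1s.
      "reduce": "_count"
    }
  }
}
```

The server can then answer a report request with a single grouped read, e.g. `GET /<db>/_design/reports/_view/new_user_flow?group=true`, instead of recomputing anything.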

Benefit

Data download and report calculation need to happen only once.

This means:
    - The fake data server generates data using that format.
    - The data downloaded from the server is expected to be in that format too.

The difference:
    KPIggybank wraps the payload of KPI data in an object that has an
    ID, with the data itself under the "value" field.
    (This is how the data is extracted from CouchDB.)
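For illustration, the wrapping looks roughly like this (the ID and field values are made up):

```js
// A raw KPI data point...
var raw = { date: "2012-07-13", number_sites_logged_in: 3 };

// ...versus the same data point as KPIggybank/CouchDB hands it back:
var wrapped = {
  id: "4cbb6711fd9f1d44b9a607ddc5002adb",  // hypothetical document ID
  value: { date: "2012-07-13", number_sites_logged_in: 3 }
};

// Consumers therefore have to unwrap each row before using it:
var data = wrapped.value;
```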
Turns out, it's needed in multiple places.
Ensures that, whenever settings are requested, they have already been loaded.
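A rough sketch of that guarantee, with hypothetical names (`loadSettingsFromDisk` is a stand-in for however the settings are actually read):

```js
// Settings are loaded once; every caller waits until the load finishes.
var settings = null;
var waiting = [];

function getSettings(callback) {
  if (settings !== null) return callback(settings);
  waiting.push(callback);
  if (waiting.length === 1) {           // first caller triggers the load
    loadSettingsFromDisk(function (loaded) {
      settings = loaded;
      waiting.forEach(function (cb) { cb(settings); });
      waiting = [];
    });
  }
}
```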

Scope:
    This commit only switches over the "new user flow over time" report.

This is the first major part of implementing #10.
Uses the database backend for cumulative requests; requests for
segmented data are still processed by the legacy code, as sketched below.
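A rough sketch of that dispatch, assuming an Express-style handler; the function and view names are hypothetical, and `db.view` stands in for whatever CouchDB client the server uses:

```js
function handleReportRequest(req, res) {
  if (req.query.segmentation) {
    // Segmented data: fall back to the old on-the-fly pipeline.
    legacyReport(req, res);
  } else {
    // Cumulative data: read the pre-calculated view straight from CouchDB.
    db.view('reports', 'new_user_flow', { group: true }, function (err, body) {
      if (err) return res.send(500);
      res.json(body.rows);
    });
  }
}
```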
"Known" means the ones listed in the config file.

Fixes the way we populate the database to be consistent with how data
aggregation used to happen.
Due to code copy/pasting, segmentation values in the database were
stored inside a size-one array (e.g., ["value"]). Nothing was breaking,
but there is no reason for the wrapper.
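In map-function terms, the fix amounts to something like this (illustrative, not the repo's actual code):

```js
// Before: the copy/pasted code wrapped the segmentation value in an array,
// so keys came out as ["2012-07-13", ["value"]].
emit([doc.date, [segmentation]], 1);

// After: the value is stored directly, giving ["2012-07-13", "value"].
emit([doc.date, segmentation], 1);
```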
The report is also converted to display the mean number of sites logged in
instead of the median (closes #29).

The reason behind it (from #29):

Report #1 is the median number of sites a user logs into with Persona.

As part of migrating to CouchDB as the backend (#27), finding the median
of the data series becomes a significantly harder technical challenge:
the median needs access to the full sorted series, so computing it in a
map/reduce framework requires something like a quick-select algorithm,
and there doesn't seem to be a good way to implement that in CouchDB.

Alternatively, the median value for each day could be precalculated when
data arrives and then stored in the database. However, this would
require either a new database (cumbersome) or a change to the data
format and code of the current one (very undesirable).

Calculating the mean of the dataset, however, is much easier.
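For contrast, a mean fits map/reduce naturally, because a running sum and count can be merged at any level of rereduce. A sketch of such a reduce function, assuming the map emits the raw number of sites as its value:

```js
function (keys, values, rereduce) {
  // Accumulate {sum, count} so partial results can be merged at rereduce
  // time; the mean is then sum / count, computed by the reader.
  var sum = 0, count = 0;
  for (var i = 0; i < values.length; i++) {
    if (rereduce) {
      sum += values[i].sum;
      count += values[i].count;
    } else {
      sum += values[i];   // raw number of sites logged in
      count += 1;
    }
  }
  return { sum: sum, count: count };
}
```

CouchDB's built-in `_stats` reduce would work just as well here, since it already tracks sum and count.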

While the median is a more sensible value to look at (it is less
sensitive to outliers), it was agreed previously that this entire
report is not hugely meaningful. The median value itself doesn't really
say anything; the only way we'd use it is to watch the number and hope
it trends up. For that purpose, the mean is just about as good: we can
look at it and watch its trend.
The problem was that, in the data returned by the server, some days
didn't have a value for every step.

A missing value indicates that 0 people completed that step, but this
needed to be stated explicitly.
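The fix amounts to filling the gaps with explicit zeroes, roughly like this (function and variable names are illustrative):

```js
// For a given day, make sure every known step has a value, defaulting
// any step the server omitted to 0.
function fillMissingSteps(dayData, allSteps) {
  allSteps.forEach(function (step) {
    if (!(step in dayData)) {
      dayData[step] = 0;   // nobody completed this step that day
    }
  });
  return dayData;
}
```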
@nmalkin closed this in 57412af on Jul 25, 2012