Pre-calculate reports, store them in CouchDB for retrieval #27

Closed · wants to merge 16 commits

@nmalkin (owner) commented on Jul 13, 2012

Overview

As part of implementing #10, we're changing the architecture so that reports are pre-calculated (as much as possible) and stored in CouchDB. Then, when a request for a report is received, the server just gets the information from the database and passes it on to the user.

More details

Prior process

On every request, data was downloaded, the report was prepared, then sent back to the user.

Now

Separately from the server, data is downloaded and stored in CouchDB. (This is done, manually for now, using the script server/bin/update.) The server sets up views in CouchDB that map/reduce the data into a form ready for the report.
On user request, the data is retrieved from the appropriate view and sent back.
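To make the pre-calculation concrete, here is a minimal sketch of what such a view could look like. The design document name, view name, and document fields are assumptions for illustration, not necessarily the ones this repo uses:

```js
// Hypothetical design document: counts, per (date, step), how many data
// points were recorded. All names here are illustrative.
{
  "_id": "_design/reports",
  "views": {
    "new_user_flow": {
      // Map: one row per (date, step) pair found in a data point.
      "map": "function (doc) { if (doc.date && doc.step) { emit([doc.date, doc.step], 1); } }",
      // Reduce: CouchDB's built-in counter sums up the emitted 1s.
      "reduce": "_count"
    }
  }
}
```

The server can then answer a report request with a single grouped read, e.g. `GET /<db>/_design/reports/_view/new_user_flow?group=true`, instead of recomputing anything.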

Benefit

Data download and report calculation need to happen only once.

This means:
    - The fake data server generates data using that format.
    - The data downloaded from the server is expected to be in that format too.

The difference:
    KPIggybank wraps the payload of KPI data in an object that has an
    ID, with the data itself under the "value" field.
    (This is how the data is extracted from CouchDB.)
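For illustration, the wrapping looks roughly like this (the ID and field values are made up):

```js
// A raw KPI data point...
var raw = { date: "2012-07-13", number_sites_logged_in: 3 };

// ...versus the same data point as KPIggybank/CouchDB hands it back:
var wrapped = {
  id: "4cbb6711fd9f1d44b9a607ddc5002adb",  // hypothetical document ID
  value: { date: "2012-07-13", number_sites_logged_in: 3 }
};

// Consumers therefore have to unwrap each row before using it:
var data = wrapped.value;
```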
Turns out, it's needed in multiple places.
Ensures that, whenever settings are requested, they have already been loaded.
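A rough sketch of that guarantee, with hypothetical names (`loadSettingsFromDisk` is a stand-in for however the settings are actually read):

```js
// Settings are loaded once; every caller waits until the load finishes.
var settings = null;
var waiting = [];

function getSettings(callback) {
  if (settings !== null) return callback(settings);
  waiting.push(callback);
  if (waiting.length === 1) {           // first caller triggers the load
    loadSettingsFromDisk(function (loaded) {
      settings = loaded;
      waiting.forEach(function (cb) { cb(settings); });
      waiting = [];
    });
  }
}
```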

Scope:
    This commit only switches over the "new user flow over time" report.

This is the first major part of implementing #10.
Uses the database backend for cumulative requests; requests for
segmented data are still processed by the legacy code, as sketched below.
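A rough sketch of that dispatch, assuming an Express-style handler; the function and view names are hypothetical, and `db.view` stands in for whatever CouchDB client the server uses:

```js
function handleReportRequest(req, res) {
  if (req.query.segmentation) {
    // Segmented data: fall back to the old on-the-fly pipeline.
    legacyReport(req, res);
  } else {
    // Cumulative data: read the pre-calculated view straight from CouchDB.
    db.view('reports', 'new_user_flow', { group: true }, function (err, body) {
      if (err) return res.send(500);
      res.json(body.rows);
    });
  }
}
```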
"Known" means the ones listed in the config file.

Fixes the way we populate the database to be consistent with how data
aggregation used to happen.
Due to code copy/pasting, segmentation values in the database were
stored inside a size-one array (e.g., ["value"]). Nothing was breaking,
but there is no reason for the wrapper.
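In map-function terms, the fix amounts to something like this (illustrative, not the repo's actual code):

```js
// Before: the copy/pasted code wrapped the segmentation value in an array,
// so keys came out as ["2012-07-13", ["value"]].
emit([doc.date, [segmentation]], 1);

// After: the value is stored directly, giving ["2012-07-13", "value"].
emit([doc.date, segmentation], 1);
```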
The report is also converted to display the mean number of sites logged in
instead of the median (closes #29).

The reason behind it (from #29):

Report #1 is the median number of sites a user logs into with Persona.

As part of migrating to CouchDB as the backend (#27), finding the median
of the data series becomes a significantly harder technical challenge:
the median needs access to the full sorted series, so computing it in a
map/reduce framework requires something like a quick-select algorithm,
and there doesn't seem to be a good way to implement that in CouchDB.

Alternatively, the median value for each day could be precalculated when
data arrives and then stored in the database. However, this would
require either a new database (cumbersome) or a change to the data
format and code of the current one (very undesirable).

Calculating the mean of the dataset, however, is much easier.
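For contrast, a mean fits map/reduce naturally, because a running sum and count can be merged at any level of rereduce. A sketch of such a reduce function, assuming the map emits the raw number of sites as its value:

```js
function (keys, values, rereduce) {
  // Accumulate {sum, count} so partial results can be merged at rereduce
  // time; the mean is then sum / count, computed by the reader.
  var sum = 0, count = 0;
  for (var i = 0; i < values.length; i++) {
    if (rereduce) {
      sum += values[i].sum;
      count += values[i].count;
    } else {
      sum += values[i];   // raw number of sites logged in
      count += 1;
    }
  }
  return { sum: sum, count: count };
}
```

CouchDB's built-in `_stats` reduce would work just as well here, since it already tracks sum and count.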

While the median is a more sensible value to look at (it is less
sensitive to outliers), it was agreed previously that this entire
report is not hugely meaningful. The median value itself doesn't really
say anything; the only way we'd use it is to watch the number and hope
it trends up. For that purpose, the mean is just about as good: we can
look at it and watch its trend.
The problem was that, in the data returned by the server, some days
didn't have a value for every step.

A missing value indicates that 0 people completed that step, but this
needed to be stated explicitly.
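The fix amounts to filling the gaps with explicit zeroes, roughly like this (function and variable names are illustrative):

```js
// For a given day, make sure every known step has a value, defaulting
// any step the server omitted to 0.
function fillMissingSteps(dayData, allSteps) {
  allSteps.forEach(function (step) {
    if (!(step in dayData)) {
      dayData[step] = 0;   // nobody completed this step that day
    }
  });
  return dayData;
}
```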
@nmalkin closed this in 57412af on Jul 25, 2012