Pre-calculate reports, store them in CouchDB for retrieval #27
Closed
This means:
- The fake data server generates data using that format.
- The data downloaded from the server is expected to be in that format too.
The difference: KPIggybank's format wraps the KPI data payload in an object that has an ID, with the data itself under the "value" field. (This is how the data is extracted from CouchDB.)
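The wrapping described above could be handled with a small helper like the one below. This is only a sketch: the payload field names (`timestamp`, `number_sites_logged_in`) and the helper name `unwrap` are assumptions for illustration, not taken from the project.

```javascript
// Hypothetical record as extracted from CouchDB: an ID plus the KPI payload
// under "value". The payload field names are invented for this example.
const wrapped = {
  id: "abc123",
  value: { timestamp: Date.UTC(2012, 6, 13), number_sites_logged_in: 3 }
};

// Return just the KPI payload, whether or not the record is wrapped.
function unwrap(record) {
  const isWrapped =
    record !== null &&
    typeof record === "object" &&
    "id" in record &&
    "value" in record;
  return isWrapped ? record.value : record;
}

const payload = unwrap(wrapped);
console.log(payload.number_sites_logged_in);
```

Accepting both shapes keeps the same code path usable for the fake data server and for data read back from the database.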
Turns out, it's needed in multiple places
Ensures that, whenever settings are requested, they have already been loaded.
Prior process: On every request, data was downloaded, the report was prepared, then sent back to the user.
Now: Separately from the server, data is downloaded and stored in CouchDB. (This is done, manually for now, using the script server/bin/update.) The server sets up views in CouchDB that map/reduce the data into a form ready for the report. On user request, the data is retrieved from the appropriate view and sent back.
Benefit: Data download and report calculation only need to happen once.
Scope: This commit only switches over the "new user flow over time" report.
This is the first major part of implementing #10.
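To make the view idea concrete, here is a minimal sketch of what such a map/reduce view might look like. The document fields (`value.timestamp`, `value.steps`), the view name, and the design document ID are assumptions for illustration; `emit` and the built-in `_sum` reduce are provided by CouchDB. The `runView` helper is only a local stand-in so the map function can be tried without a database.

```javascript
// Map function in CouchDB's style: called once per stored document.
// `emit` is supplied by CouchDB at view-build time. Field names are hypothetical.
function mapStepsByDay(doc) {
  const day = new Date(doc.value.timestamp).toISOString().slice(0, 10);
  doc.value.steps.forEach(function (step) {
    emit([day, step], 1);
  });
}

// Hypothetical design document; CouchDB stores the map function as a string.
const designDoc = {
  _id: "_design/new_user",
  views: {
    steps_by_day: { map: mapStepsByDay.toString(), reduce: "_sum" }
  }
};

// Local stand-in for the view engine: runs the map function over the docs
// and sums emitted values per key, as the "_sum" reduce would.
function runView(docs, mapFn) {
  const totals = {};
  globalThis.emit = function (key, value) {
    const k = JSON.stringify(key);
    totals[k] = (totals[k] || 0) + value;
  };
  docs.forEach(mapFn);
  delete globalThis.emit;
  return totals;
}

const docs = [
  { _id: "a", value: { timestamp: Date.UTC(2012, 6, 13, 9), steps: ["email", "password"] } },
  { _id: "b", value: { timestamp: Date.UTC(2012, 6, 13, 15), steps: ["email"] } }
];
console.log(runView(docs, mapStepsByDay));
```

Because CouchDB builds and updates view results incrementally, a request handler only has to read the already-reduced rows.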
This was referenced Jul 13, 2012
Uses database backend for cumulative requests; requests for segmented data get processed using legacy code.
Segmentation by OS
"Known" means the ones listed in the config file. Fixes the way we populate the database to be consistent with how data aggregation used to happen.
Due to code copy/pasting, segmentation values in the database were stored inside a size-one array (e.g., ["value"]). Though things weren't breaking, there is absolutely no reason for that here.
The report is also converted to display the mean number of sites logged in instead of the median (closes #29).
The reasoning (from #29): Report #1 is the median number of sites a user logs into with Persona. As part of migrating to CouchDB as the backend (#27), finding the median of the data series becomes a significantly harder technical challenge. (Doing it in a map/reduce framework requires a quick-select algorithm, which there doesn't seem to be a good way to implement in CouchDB.)
Alternatively, the median value for each day could be precalculated when data arrives and then stored in the database. However, this would require either a new database (cumbersome) or a change to the data format and code of the current one (very undesirable).
Calculating the mean of the dataset, however, is much easier. While the median is a more sensible value to look at (it is less sensitive to outliers), it was agreed earlier that this entire report is not hugely meaningful. The median value itself doesn't really say anything. The only way we'd use it is to watch the number and hope it trends up. In that case, however, the mean is just about as good: we can look at it and watch its trend.
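The mean fits map/reduce well because it decomposes into a sum and a count, and partial sums and counts can be merged in any order; the median has no such decomposition. The sketch below shows one way to write this as a CouchDB-style reduce function using the standard `(keys, values, rereduce)` signature. The function names are illustrative, not the project's; CouchDB's built-in `_stats` reduce also returns a sum and a count and could serve the same purpose.

```javascript
// Reduce in CouchDB's style. On the first pass, `values` holds the numbers
// emitted by the map function; on rereduce, it holds earlier {sum, count}
// results that need to be merged.
function reduceSumCount(keys, values, rereduce) {
  const acc = { sum: 0, count: 0 };
  values.forEach(function (v) {
    if (rereduce) {
      acc.sum += v.sum;
      acc.count += v.count;
    } else {
      acc.sum += v;
      acc.count += 1;
    }
  });
  return acc;
}

// The mean is derived from the reduced result when the report is served.
function meanOf(result) {
  return result.count === 0 ? 0 : result.sum / result.count;
}

// Two partial reductions merged by a rereduce step.
const partA = reduceSumCount(null, [1, 2, 3], false);
const partB = reduceSumCount(null, [10], false);
const merged = reduceSumCount(null, [partA, partB], true);
console.log(meanOf(merged));
```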
The problem was that, in the data returned by the server, some days didn't have a value for every step. That is an indicator that 0 people completed the missing step, but that needed to be stated explicitly.
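A fix along these lines could look like the following sketch, which fills in an explicit 0 for any step missing from a day's data. The data shape (`date`, `steps`) and the step names are assumptions made for the example, not the project's actual format.

```javascript
// For each day, make sure every known step has a value, using 0 when the
// server returned nothing for that step. The input is not modified.
function fillMissingSteps(days, stepNames) {
  return days.map(function (day) {
    const steps = {};
    stepNames.forEach(function (name) {
      const present = Object.prototype.hasOwnProperty.call(day.steps, name);
      steps[name] = present ? day.steps[name] : 0;
    });
    return { date: day.date, steps: steps };
  });
}

// Hypothetical step names and data.
const stepNames = ["email", "password", "verified"];
const raw = [
  { date: "2012-07-13", steps: { email: 5, password: 3 } },
  { date: "2012-07-14", steps: { email: 2, password: 2, verified: 1 } }
];
const filled = fillMissingSteps(raw, stepNames);
console.log(filled[0].steps);
```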
Overview
As part of implementing #10, we're changing the architecture so that reports are pre-calculated (as much as possible) and stored in CouchDB. Then, when a request for a report is received, the server just gets the information from the database and passes it on to the user.
More details
Prior process
On every request, data was downloaded, the report was prepared, then sent back to the user.
Now
Separately from the server, data is downloaded and stored in CouchDB. (This is done, manually for now, using the script server/bin/update.) The server sets up views in CouchDB that map/reduce the data into a form ready for the report. On user request, the data is retrieved from the appropriate view and sent back.
Benefit
Data download and report calculation only need to happen once.