-
Notifications
You must be signed in to change notification settings - Fork 469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bugs with Update Data button on experiment results #1373
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Your preview environment pr-1373-bttf has been deployed. Preview environment endpoints are available at: |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Completely rewrites query running logic to be more robust and fault tolerant. Should hopefully fix various bugs with the current "Update Data" button on experiments.
Changes
QueryRunner classes
Every set of queries we run now has a dedicated class that extends
QueryRunner
on the back-end. This class does a number of things:Persist analysis errors
Before, only query execution errors were persisted in Mongo. If there was an error afterwards with the stats engine for example, it was only shown briefly on the front-end. If the query was executed in a background job, analysis errors were never surfaced at all.
Now, both analysis and query errors are persisted in Mongo.
expireOldQueries Job
If the Node.js process is killed while a query is being executed, it could be left in a perpetual
running
state in Mongo. There is now a dedicated Agenda job that marks queries asfailed
if they haven't had a heartbeat in a while and are still marked asrunning
.It also updates any models (snapshots, reports, metric, etc.) that are referencing these queries.
Simplified RunQueriesButton
Before, the
RunQueriesButton
front-end component was actually controlling the data pipeline. This means, if you closed the window while queries were running, it would never actually finish doing the analysis. It also means, if multiple people had the same page open while queries were running, there would be race conditions.Now, all of the data pipeline logic lives in the back-end within the QueryRunner classes. This front-end component is now much simpler. When queries are running, it is now only responsible for showing a progress indicator, periodically telling the page to check for updates (calling
mutate
), and providing a "cancel" link.Better query caching layer
When starting a new query, we first check to see if there was an identical one that was started recently. If it's already finished, we use the result immediately. Otherwise, we set up a listener to wait for the query to finish. This is especially useful when doing quick exploratory analysis. You might start updating results, realize you made a mistake, cancel the queries, and update again. Now, any queries that haven't changed will pick up where they left off instead of starting brand new db queries.
Simpler background experiment refresh job
The logic in the Agenda job to update experiment results in the background is also now much simpler and more performant. Instead of periodically checking for results to be available, it just listens for an emitted event from the QueryRunner class. We also had duplicated code to process query results all over the code base. Now, everything is contained within the QueryRunner class.
More efficient North Star Metric updates
Previously, we were refreshing all North Star Metrics every 24 hours. However, the way Agenda works, jobs can run much more frequently than the scheduled interval. For example, whenever a new Node.js process is started, it can also kick off a job. This means every deploy we do to production could cause every North Star Metric to be refreshed 3+ times (once for each new server we bring online during the rolling deploy). In addition, the "unique" key for the job included all organization settings, which resulted in many duplicate job entries in Mongo.
Now, we only refresh a metric if it hasn't been refreshed in the past 24 hours, either manually or via this update job. Also, the unique job attributes were reduced to a minimum - metricId and organizationId.
Future Work
There are still 3 critical problems with our query running architecture, even after this PR:
To fix these issues, we need to build a true dedicated QueryRunner microservice.
Testing Plan
Testing: