Fix bugs with Update Data button on experiment results #1373

Merged: 21 commits, Jul 17, 2023

jdorn (Member) commented Jun 15, 2023

Description

Rewrites the query-running logic to be more robust and fault-tolerant. This should fix various bugs with the current "Update Data" button on experiments.

Changes

QueryRunner classes

Every set of queries we run now has a dedicated class that extends QueryRunner on the back-end. This class does a number of things:

  1. Starts executing all required queries
  2. Runs any post-processing and/or stats analysis on the query results when they all finish
  3. Keeps Mongo up-to-date with the current status of queries as they finish or error
  4. Implements a query caching layer to re-use recent identical queries
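
As a rough illustration of steps 1 to 3 above, here is a minimal sketch of what such a base class could look like. It is not the code from this PR: QueryRunnerSketch, QueryPointer, persist, and the SumRunner example are all hypothetical names, and persist stands in for the Mongo update.

```typescript
// Illustrative sketch only: all names and shapes here are assumptions,
// not the actual classes added in this PR.
type QueryStatus = "running" | "succeeded" | "failed";

interface QueryPointer {
  name: string;
  status: QueryStatus;
  error?: string;
}

function toMessage(e: unknown): string {
  return e instanceof Error ? e.message : String(e);
}

abstract class QueryRunnerSketch<Row, Result> {
  queries: QueryPointer[] = [];
  result?: Result;
  error?: string;

  // Subclasses declare the queries to run and how to analyze their rows.
  protected abstract getQueries(): Record<string, () => Promise<Row[]>>;
  protected abstract runAnalysis(rows: Record<string, Row[]>): Promise<Result>;
  // Stand-in for writing the current state to Mongo.
  protected abstract persist(): Promise<void>;

  async run(): Promise<void> {
    const rows: Record<string, Row[]> = {};
    const queries = this.getQueries();
    // Step 1: start every query without waiting for the others.
    const pending = Object.keys(queries).map(async (name) => {
      const pointer: QueryPointer = { name, status: "running" };
      this.queries.push(pointer);
      try {
        rows[name] = await queries[name]();
        pointer.status = "succeeded";
      } catch (e) {
        pointer.status = "failed";
        pointer.error = toMessage(e);
      }
      // Step 3: persist the status as each query finishes or errors.
      await this.persist();
    });
    await Promise.all(pending);

    // Step 2: run the analysis only when every query succeeded.
    if (this.queries.some((q) => q.status === "failed")) {
      this.error = "One or more queries failed";
    } else {
      try {
        this.result = await this.runAnalysis(rows);
      } catch (e) {
        this.error = toMessage(e); // analysis errors are kept as well
      }
    }
    await this.persist();
  }
}

// Hypothetical subclass: two fake queries whose rows are summed.
class SumRunner extends QueryRunnerSketch<number, number> {
  persistCount = 0;
  protected getQueries(): Record<string, () => Promise<number[]>> {
    return { a: async () => [1, 2], b: async () => [3] };
  }
  protected async runAnalysis(rows: Record<string, number[]>): Promise<number> {
    let total = 0;
    for (const name of Object.keys(rows)) {
      for (const n of rows[name]) total += n;
    }
    return total;
  }
  protected async persist(): Promise<void> {
    this.persistCount++;
  }
}
```

In this sketch, each kind of query set (experiment results, metric analysis, and so on) would get its own subclass, and callers only ever invoke run().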

Persist analysis errors

Before, only query execution errors were persisted in Mongo. If an error occurred afterwards (in the stats engine, for example), it was only shown briefly on the front-end. If the query was executed in a background job, analysis errors were never surfaced at all.

Now, both analysis and query errors are persisted in Mongo.

expireOldQueries Job

If the Node.js process is killed while a query is being executed, it could be left in a perpetual running state in Mongo. There is now a dedicated Agenda job that marks queries as failed if they haven't had a heartbeat in a while and are still marked as running.

It also updates any models (snapshots, reports, metrics, etc.) that reference these queries.
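
The core check of such a job can be sketched as a pure function. The QueryDoc shape, the error message, and the 10-minute threshold below are assumptions for illustration; the PR text does not state the actual document fields or timeout.

```typescript
// Illustrative sketch: field names and the threshold are assumptions.
interface QueryDoc {
  id: string;
  status: "running" | "succeeded" | "failed";
  heartbeat: Date;
  error?: string;
}

const STALE_AFTER_MS = 10 * 60 * 1000; // assumed: 10 minutes without a heartbeat

// Marks stale running queries as failed and returns their ids, so the caller
// can update any snapshots, reports, or metrics that reference them.
function expireStaleQueries(
  queries: QueryDoc[],
  now: Date,
  staleAfterMs: number = STALE_AFTER_MS
): string[] {
  const expired: string[] = [];
  for (const q of queries) {
    const age = now.getTime() - q.heartbeat.getTime();
    if (q.status === "running" && age > staleAfterMs) {
      q.status = "failed";
      q.error = "Query execution was interrupted";
      expired.push(q.id);
    }
  }
  return expired;
}
```

Queries that already finished, or that are running with a recent heartbeat, are left untouched.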

Simplified RunQueriesButton

Before, the RunQueriesButton front-end component was actually controlling the data pipeline. This meant that if you closed the window while queries were running, the analysis would never finish. It also meant that if multiple people had the same page open while queries were running, there would be race conditions.

Now, all of the data pipeline logic lives in the back-end within the QueryRunner classes. This front-end component is now much simpler. When queries are running, it is now only responsible for showing a progress indicator, periodically telling the page to check for updates (calling mutate), and providing a "cancel" link.

Better query caching layer

When starting a new query, we first check to see if there was an identical one that was started recently. If it's already finished, we use the result immediately. Otherwise, we set up a listener to wait for the query to finish. This is especially useful when doing quick exploratory analysis. You might start updating results, realize you made a mistake, cancel the queries, and update again. Now, any queries that haven't changed will pick up where they left off instead of starting brand new database queries.
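
The lookup described above could be sketched roughly as follows. This is an in-memory illustration under stated assumptions: the real layer is backed by Mongo, and QueryCacheSketch, the trimmed SQL text as cache key, and the maxAgeMs parameter are all hypothetical.

```typescript
// Illustrative sketch: the cache key, max age, and class name are assumptions.
type CacheEntry<T> =
  | { status: "succeeded"; startedAt: number; result: T }
  | { status: "running"; startedAt: number; promise: Promise<T> };

class QueryCacheSketch<T> {
  executions = 0; // how many queries actually reached the database
  private entries = new Map<string, CacheEntry<T>>();
  private maxAgeMs: number;

  constructor(maxAgeMs: number) {
    this.maxAgeMs = maxAgeMs;
  }

  run(sql: string, execute: (sql: string) => Promise<T>): Promise<T> {
    const key = sql.trim(); // assumption: identical text means identical query
    const cached = this.entries.get(key);
    if (cached && Date.now() - cached.startedAt <= this.maxAgeMs) {
      // Finished: reuse the result. Still running: wait on the same promise.
      return cached.status === "succeeded"
        ? Promise.resolve(cached.result)
        : cached.promise;
    }

    this.executions++;
    const startedAt = Date.now();
    const promise = execute(sql).then(
      (result) => {
        // Only update the cache if this run is still the current entry.
        if (this.entries.get(key) === entry) {
          this.entries.set(key, { status: "succeeded", startedAt, result });
        }
        return result;
      },
      (err) => {
        // Failed queries are not cached, so the next attempt starts fresh.
        if (this.entries.get(key) === entry) this.entries.delete(key);
        throw err;
      }
    );
    const entry: CacheEntry<T> = { status: "running", startedAt, promise };
    this.entries.set(key, entry);
    return promise;
  }
}
```

With this shape, two callers asking for the same SQL within maxAgeMs share a single execution, which mirrors the cancel-and-retry scenario described above.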

Simpler background experiment refresh job

The logic in the Agenda job to update experiment results in the background is also now much simpler and more performant. Instead of periodically checking for results to be available, it just listens for an emitted event from the QueryRunner class. We also had duplicated code to process query results all over the code base. Now, everything is contained within the QueryRunner class.
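
To illustrate the difference from polling, here is a minimal sketch of waiting on a single emitted event. RunnerEventsSketch is a hand-rolled stand-in (an assumption, not the PR's event mechanism), and the timeout is an added safeguard that the PR text does not mention.

```typescript
// Illustrative sketch: a tiny stand-in for whatever event mechanism the
// QueryRunner class actually uses.
type RunnerEvent = { status: "succeeded" | "failed"; error?: string };

class RunnerEventsSketch {
  private listeners: Array<(event: RunnerEvent) => void> = [];

  // Register a listener that fires at most once.
  once(listener: (event: RunnerEvent) => void): void {
    this.listeners.push(listener);
  }

  emit(event: RunnerEvent): void {
    const current = this.listeners;
    this.listeners = [];
    for (const listener of current) listener(event);
  }
}

// Resolves when the runner reports it is done, instead of re-checking
// Mongo on a timer. The timeout is an assumed safety net.
function waitForRunner(
  runner: RunnerEventsSketch,
  timeoutMs: number
): Promise<RunnerEvent> {
  return new Promise<RunnerEvent>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error("Timed out waiting for queries")),
      timeoutMs
    );
    runner.once((event) => {
      clearTimeout(timer);
      resolve(event);
    });
  });
}
```

A background job would then await waitForRunner(...) once and continue with whatever the event reports.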

More efficient North Star Metric updates

Previously, we were refreshing all North Star Metrics every 24 hours. However, because of how Agenda works, jobs can run much more frequently than the scheduled interval. For example, whenever a new Node.js process is started, it can also kick off a job. This means every deploy we do to production could cause every North Star Metric to be refreshed 3+ times (once for each new server we bring online during the rolling deploy). In addition, the "unique" key for the job included all organization settings, which resulted in many duplicate job entries in Mongo.

Now, we only refresh a metric if it hasn't been refreshed in the past 24 hours, either manually or via this update job. Also, the unique job attributes were reduced to the minimum: metricId and organizationId.
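
The refresh guard and the reduced unique attributes can be sketched like this. NorthStarMetricInfo and lastRefreshedAt are hypothetical names; only metricId, organizationId, and the 24-hour window come from the description above.

```typescript
// Illustrative sketch: the interface and lastRefreshedAt field are assumptions.
const REFRESH_INTERVAL_MS = 24 * 60 * 60 * 1000;

interface NorthStarMetricInfo {
  metricId: string;
  organizationId: string;
  // Latest refresh, whether manual or from the background job.
  lastRefreshedAt: Date | null;
}

// True only when the metric has never been refreshed or the last refresh
// is at least 24 hours old, no matter how often the job itself fires.
function shouldRefreshMetric(metric: NorthStarMetricInfo, now: Date): boolean {
  if (!metric.lastRefreshedAt) return true;
  const age = now.getTime() - metric.lastRefreshedAt.getTime();
  return age >= REFRESH_INTERVAL_MS;
}

// The minimal attributes used to de-duplicate jobs; organization settings
// are deliberately left out.
function uniqueJobAttributes(metric: NorthStarMetricInfo): {
  metricId: string;
  organizationId: string;
} {
  return { metricId: metric.metricId, organizationId: metric.organizationId };
}
```

Because the guard depends on the stored refresh time rather than on the schedule, extra job runs triggered by a rolling deploy become no-ops.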

Future Work

There are still 3 critical problems with our query running architecture, even after this PR:

  1. Queries cannot recover/restart if a Node.js process dies. This is especially bad for long-running queries during the day when we deploy to Cloud once an hour. We mitigate this today by keeping old containers around for 5 minutes before killing them. That's enough time for >95% of queries to finish executing.
  2. Queries cannot be cancelled once they are started. This can lead to expensive bills in BigQuery and others when you make a typo, realize it, and cancel the query. It will still keep running in the background.
  3. There's no way to control query concurrency. This can be an issue for data warehouses like Redshift where compute does not scale elastically. So it's possible for us to bombard someone's data warehouse with 20+ queries all at once and bring it down.

To fix these issues, we need to build a true dedicated QueryRunner microservice.

  • Queries can be queued and cancelled.
  • Cancelled queries will also cancel the job in the data warehouse (if supported).
  • Concurrency can be configured and controlled at the data source level.
  • Query running jobs can run independently of the main app servers and will not be killed during rolling deploys.
  • If a query fails for a temporary reason (e.g. networking error), it can be restarted automatically if desired.
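
As one possible shape for the per-data-source concurrency control listed above, here is a small limiter sketch. Nothing like this exists in the code base yet; ConcurrencyLimiterSketch, limiterFor, and the data source id are hypothetical, and a real service would also need queuing across processes and cancellation.

```typescript
// Illustrative sketch of a possible design, not an existing implementation.
class ConcurrencyLimiterSketch {
  peak = 0; // highest number of tasks observed running at once
  private active = 0;
  private waiting: Array<() => void> = [];
  private limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Queue until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    this.peak = Math.max(this.peak, this.active);
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // transfer the slot, so `active` stays the same
      } else {
        this.active--;
      }
    }
  }
}

// One limiter per data source, keyed by a hypothetical data source id.
const limiters = new Map<string, ConcurrencyLimiterSketch>();

function limiterFor(datasourceId: string, limit: number): ConcurrencyLimiterSketch {
  let limiter = limiters.get(datasourceId);
  if (!limiter) {
    limiter = new ConcurrencyLimiterSketch(limit);
    limiters.set(datasourceId, limiter);
  }
  return limiter;
}
```

With a limit of 2 for a Redshift data source, for example, a burst of 20 queries would run two at a time instead of all at once.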

Testing Plan

Testing:

  • Manually update experiment results successfully
  • Manually update experiment results with a query error
  • Manually update experiment results with an analysis error
  • Close the tab while experiment results are being updated and make sure it finishes the analysis in Mongo. Coming back to the page during or after the queries finish should show the proper state.
  • Kill Node.js while a query is running, then restart and make sure it eventually recovers.
  • Check that the query cache is used for recently completed queries
  • Manually update experiment results, cancel the query, then immediately update data again. Check to see that already in-progress queries are re-used instead of starting new queries.
  • Background experiment update job
  • Metric analysis
  • Metric analysis background job (North Star)
  • Past experiments query
  • Manually update report data
  • Save changes to a report's configuration and make sure it refreshes results

github-actions bot commented Jun 15, 2023

Your preview environment pr-1373-bttf has been deployed.

Preview environment endpoints are available at:

@jdorn jdorn marked this pull request as ready for review July 17, 2023 05:28
@jdorn jdorn merged commit d6ef009 into main Jul 17, 2023
4 checks passed
@jdorn jdorn deleted the update-data-bug branch July 17, 2023 13:44