Fix bugs with Update Data button on experiment results #1373

jdorn · 2023-06-15T04:15:47Z

Description

Completely rewrites the query-running logic to be more robust and fault tolerant. This should fix various bugs with the current "Update Data" button on experiments.

Changes

QueryRunner classes

Every set of queries we run now has a dedicated class that extends QueryRunner on the back-end. This class does a number of things:

Starts executing all required queries
Runs any post-processing and/or stats analysis on the query results when they all finish
Keeps Mongo up-to-date with the current status of queries as they finish or error
Implements a query caching layer to re-use recent identical queries
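
As a rough illustration of the responsibilities listed above, here is a minimal sketch of what such a base class could look like. It is not the code from this PR: every name (QueryRunnerSketch, onQueryDone, ConversionTotalRunner) is hypothetical, an in-memory Map stands in for the Mongo documents, and failing the whole run as soon as one query errors is an assumption. Caching is left out to keep it short.

```typescript
// Hypothetical sketch only; names and shapes are assumptions, not this PR's code.
type QueryStatus = "running" | "succeeded" | "failed";

interface QueryRecord {
  id: string;
  status: QueryStatus;
  rows?: number[];
  error?: string;
}

abstract class QueryRunnerSketch<Result> {
  // Stand-in for the Mongo documents that track each query
  readonly queries = new Map<string, QueryRecord>();
  status: QueryStatus = "running";
  result?: Result;
  error?: string;

  // Each subclass declares which queries it needs and how to analyze them
  abstract queryIds(): string[];
  abstract analyze(rows: Map<string, number[]>): Result;

  // Registers every required query as running
  start(): void {
    for (const id of this.queryIds()) {
      this.queries.set(id, { id, status: "running" });
    }
  }

  // Called whenever a single query finishes or errors
  onQueryDone(id: string, outcome: { rows?: number[]; error?: string }): void {
    const q = this.queries.get(id);
    if (!q || q.status !== "running") return;
    if (outcome.error !== undefined) {
      q.status = "failed";
      q.error = outcome.error;
    } else {
      q.status = "succeeded";
      q.rows = outcome.rows ?? [];
    }
    this.refreshStatus();
  }

  private refreshStatus(): void {
    const all = Array.from(this.queries.values());
    const failed = all.find((q) => q.status === "failed");
    if (failed) {
      this.status = "failed";
      this.error = failed.error;
      return;
    }
    if (all.some((q) => q.status === "running")) return;
    // Every query succeeded: run the analysis and record analysis errors too
    try {
      const rows = new Map<string, number[]>();
      for (const q of all) rows.set(q.id, q.rows ?? []);
      this.result = this.analyze(rows);
      this.status = "succeeded";
    } catch (e) {
      this.status = "failed";
      this.error = e instanceof Error ? e.message : String(e);
    }
  }
}

// Example subclass: sums the rows of a "conversions" query
class ConversionTotalRunner extends QueryRunnerSketch<number> {
  queryIds(): string[] {
    return ["users", "conversions"];
  }
  analyze(rows: Map<string, number[]>): number {
    const conversions = rows.get("conversions") ?? [];
    if (conversions.length === 0) throw new Error("no conversion rows");
    return conversions.reduce((a, b) => a + b, 0);
  }
}
```

In this sketch an exception thrown by analyze ends up in the same error field as a query failure, which is the behavior the next section describes.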

Persist analysis errors

Before, only query execution errors were persisted in Mongo. If the stats engine failed afterwards, for example, the error was only shown briefly on the front-end. If the query was executed in a background job, analysis errors were never surfaced at all.

Now, both analysis and query errors are persisted in Mongo.

expireOldQueries Job

If the Node.js process is killed while a query is being executed, it could be left in a perpetual running state in Mongo. There is now a dedicated Agenda job that marks queries as failed if they haven't had a heartbeat in a while and are still marked as running.

It also updates any models (snapshots, reports, metrics, etc.) that are referencing these queries.
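
The heartbeat check can be sketched roughly as follows. This is an illustration under assumptions, not the actual job: the field names, the 10-minute threshold (the description only says "a while"), and the error message are all made up, and a plain array stands in for the Mongo collection.

```typescript
// Hypothetical sketch; field names and the threshold are assumptions.
interface TrackedQuerySketch {
  id: string;
  status: "running" | "succeeded" | "failed";
  heartbeat: Date;
  error?: string;
}

// Assumed cutoff; the real value is not stated in this PR description
const STALE_AFTER_MS = 10 * 60 * 1000;

// Marks running queries with an old heartbeat as failed and returns their ids,
// so the caller can update any snapshots, reports, or metrics that reference them.
function expireStaleQueriesSketch(
  queries: TrackedQuerySketch[],
  now: Date,
  staleAfterMs: number = STALE_AFTER_MS
): string[] {
  const expired: string[] = [];
  for (const q of queries) {
    const silentForMs = now.getTime() - q.heartbeat.getTime();
    if (q.status === "running" && silentForMs > staleAfterMs) {
      q.status = "failed";
      q.error = "No heartbeat received; the process was likely killed";
      expired.push(q.id);
    }
  }
  return expired;
}
```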

Simplified RunQueriesButton

Before, the RunQueriesButton front-end component was actually controlling the data pipeline. This meant that if you closed the window while queries were running, the analysis would never actually finish. It also meant that if multiple people had the same page open while queries were running, there would be race conditions.

Now, all of the data pipeline logic lives in the back-end within the QueryRunner classes. This front-end component is now much simpler. When queries are running, it is now only responsible for showing a progress indicator, periodically telling the page to check for updates (calling mutate), and providing a "cancel" link.

Better query caching layer

When starting a new query, we first check to see if there was an identical one that was started recently. If it's already finished, we use the result immediately. Otherwise, we set up a listener to wait for the query to finish. This is especially useful when doing quick exploratory analysis. You might start updating results, realize you made a mistake, cancel the queries, and update again. Now, any queries that haven't changed will pick up where they left off instead of starting brand new db queries.
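
A minimal sketch of that lookup is shown below. It is illustrative only: the class and method names are hypothetical, keying the cache on the raw SQL text is an assumption (the PR only says "identical"), and the caller passes in timestamps and an execute callback so the example stays synchronous.

```typescript
// Hypothetical sketch; names and the cache key are assumptions.
type RowsSketch = Record<string, unknown>[];
type RunOutcome = "started-new" | "joined-running" | "reused-result";

interface CachedQuerySketch {
  startedAt: number;
  status: "running" | "succeeded" | "failed";
  rows?: RowsSketch;
  listeners: Array<(rows: RowsSketch) => void>;
}

class QueryCacheSketch {
  private byKey = new Map<string, CachedQuerySketch>();
  private maxAgeMs: number;

  constructor(maxAgeMs: number) {
    this.maxAgeMs = maxAgeMs;
  }

  run(
    sql: string,
    now: number,
    onResult: (rows: RowsSketch) => void,
    execute: (sql: string) => void
  ): RunOutcome {
    const existing = this.byKey.get(sql);
    if (existing && existing.status !== "failed" && now - existing.startedAt <= this.maxAgeMs) {
      if (existing.status === "succeeded") {
        // Already finished: use the result immediately
        onResult(existing.rows ?? []);
        return "reused-result";
      }
      // Still running: wait for it instead of starting a duplicate
      existing.listeners.push(onResult);
      return "joined-running";
    }
    // No recent identical query: start a brand new one
    this.byKey.set(sql, { startedAt: now, status: "running", listeners: [onResult] });
    execute(sql);
    return "started-new";
  }

  // Called when the underlying database query completes
  finish(sql: string, rows: RowsSketch): void {
    const q = this.byKey.get(sql);
    if (!q || q.status !== "running") return;
    q.status = "succeeded";
    q.rows = rows;
    for (const listener of q.listeners) listener(rows);
    q.listeners = [];
  }
}
```

In the cancel-and-retry scenario described above, the second run call would return "joined-running" or "reused-result" for every unchanged query, so only the corrected query reaches the database.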

Simpler background experiment refresh job

The logic in the Agenda job to update experiment results in the background is also now much simpler and more performant. Instead of periodically checking for results to be available, it just listens for an emitted event from the QueryRunner class. We also had duplicated code to process query results all over the code base. Now, everything is contained within the QueryRunner class.

More efficient North Star Metric updates

Previously, we were refreshing all North Star Metrics every 24 hours. However, because of the way Agenda works, jobs can run much more frequently than the scheduled interval. For example, whenever a new Node.js process is started, it can also kick off a job. This means every deploy we do to production could cause every North Star Metric to be refreshed 3+ times (once for each new server we bring online during the rolling deploy). In addition, the "unique" key for the job included all organization settings, which resulted in many duplicate job entries in Mongo.

Now, we only refresh a metric if it hasn't been refreshed in the past 24 hours, either manually or via this update job. Also, the unique job attributes were reduced to a minimum: metricId and organizationId.
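
The two changes can be sketched like this. The interface and function names and the lastRefreshed field are hypothetical; only the 24-hour window and the metricId / organizationId attributes come from the description above.

```typescript
// Hypothetical sketch; the metric shape and function names are assumptions.
interface NorthStarMetricSketch {
  id: string;
  organizationId: string;
  // Updated by both manual refreshes and the background job
  lastRefreshed?: Date;
}

const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// True only if the metric has not been refreshed in the past 24 hours
function needsRefreshSketch(metric: NorthStarMetricSketch, now: Date): boolean {
  if (!metric.lastRefreshed) return true;
  return now.getTime() - metric.lastRefreshed.getTime() >= ONE_DAY_MS;
}

// Minimal attributes used to de-duplicate the job, with no organization settings
function uniqueJobAttributesSketch(metric: NorthStarMetricSketch): {
  metricId: string;
  organizationId: string;
} {
  return { metricId: metric.id, organizationId: metric.organizationId };
}
```

With a check like this, a job that fires again right after a deploy becomes a no-op for any metric that was refreshed recently.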

Future Work

There are still 3 critical problems with our query running architecture, even after this PR:

Queries cannot recover/restart if a Node.js process dies. This is especially bad for long-running queries during the day when we deploy to Cloud once an hour. We mitigate this today by keeping old containers around for 5 minutes before killing them. That's enough time for >95% of queries to finish executing.
Queries cannot be cancelled once they are started. If you make a typo, realize it, and cancel the query, it will still keep running in the background. This can lead to expensive bills in BigQuery and other data warehouses.
There's no way to control query concurrency. This can be an issue for data warehouses like Redshift where compute does not scale elastically. So it's possible for us to bombard someone's data warehouse with 20+ queries all at once and bring it down.

To fix these issues, we need to build a true, dedicated QueryRunner microservice with the following properties:

Queries can be queued and cancelled.
Cancelled queries will also cancel the job in the data warehouse (if supported)
Concurrency can be configured and controlled at the data source level.
Query running jobs can run independently of the main app servers and will not be killed during rolling deploys.
If a query fails for a temporary reason (e.g. networking error), it can be restarted automatically if desired.
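
To make the queueing and concurrency ideas concrete, here is a small sketch of a per-data-source queue with a concurrency limit. Nothing like this exists in this PR; the class, its methods, and the in-memory bookkeeping are purely hypothetical, and a real service would also need persistence, retries, and warehouse-side cancellation.

```typescript
// Hypothetical sketch of future work; not part of this PR.
class DataSourceQueueSketch {
  private running = new Set<string>();
  private waiting: string[] = [];
  private limit: number;

  // limit = maximum number of concurrent queries for this data source
  constructor(limit: number) {
    this.limit = limit;
  }

  // Starts the query if there is capacity, otherwise queues it
  enqueue(queryId: string): "running" | "queued" {
    if (this.running.size < this.limit) {
      this.running.add(queryId);
      return "running";
    }
    this.waiting.push(queryId);
    return "queued";
  }

  // Cancels a queued or running query; returns false if the id is unknown
  cancel(queryId: string): boolean {
    const idx = this.waiting.indexOf(queryId);
    if (idx >= 0) {
      this.waiting.splice(idx, 1);
      return true;
    }
    if (this.running.delete(queryId)) {
      // A real service would also cancel the job in the data warehouse here
      this.promoteNext();
      return true;
    }
    return false;
  }

  // Frees a slot when a query finishes and starts the next queued one
  complete(queryId: string): void {
    if (this.running.delete(queryId)) this.promoteNext();
  }

  runningIds(): string[] {
    return Array.from(this.running);
  }

  private promoteNext(): void {
    const next = this.waiting.shift();
    if (next !== undefined) this.running.add(next);
  }
}
```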

Testing Plan

Testing:

github-actions · 2023-06-15T04:25:15Z

Your preview environment pr-1373-bttf has been deployed.

Preview environment endpoints are available at:

Fix bugs with Update Data button on experiment results

36237c8

jdorn added 13 commits July 1, 2023 08:08

Merge branch 'main' into update-data-bug

5719d79

Finish query refactor

3d644ea

Separate job to expire old queries

789cafc

Refactor all query status endpoints

9fb92f3

Rewrite query running logic from scratch

0ef27cd

Remove old status endpoints which aren't used anymore

2538bb2

Add placeholder TODOs for expiring stale models

34296f9

Update expireQueriesJob to clean up models as well

6610e08

Merge remote-tracking branch 'origin/main' into update-data-bug

2234ea6

Add debug logging, fix bugs from testing

c7715c2

Fix useCache behavior. Add logging to expireStaleQueries job

0feb23a

Fix report update on save method

d051109

Merge remote-tracking branch 'origin/main' into update-data-bug

5b374b7

jdorn marked this pull request as ready for review July 17, 2023 05:28

jdorn added 7 commits July 17, 2023 08:02

Fix bug when expiring stale queries

3332b98

Add logging when experiment successfully refreshes in background job

14af8d4

Fix north star metric update logic

4b72e6d

Don't set dateUpdated on a metric if only the analysis data is changing

fb33718

Fix UI when datasource encryption key changes

3aac4f3

Simplify report query runner

a42f92f

Merge branch 'main' into update-data-bug

5974d2c

jdorn merged commit d6ef009 into main Jul 17, 2023
4 checks passed

jdorn deleted the update-data-bug branch July 17, 2023 13:44
