Chain filters cpu utilization #16817
* Breadth first search on graph for chain filters
  - abandons at max-depth and looks in breadth first fashion rather than descending as far as possible on each possible path
* Add loop detection
  not actually fooled by loops since we abandon after a certain path length. But will remove some false starts before they have to be killed by the path length check.
* Add seen optimization of nodes
  We reach nodes in the shortest path possible, so keep a list of nodes visited and don't revisit them: we got to them as fast as possible, and once there, there are no new nodes that we can see from that vantage point. This allows a serious speedup and should hopefully fully resolve the cpu issues seen by some.
Codecov Report
@@ Coverage Diff @@
## master #16817 +/- ##
==========================================
+ Coverage 84.74% 84.95% +0.21%
==========================================
Files 406 415 +9
Lines 32333 33197 +864
Branches 2329 2378 +49
==========================================
+ Hits 27400 28202 +802
- Misses 2604 2617 +13
- Partials 2329 2378 +49
High confidence in this as there are several tests that run on top, using the actual joins constructed with this, that are still passing:
(deftest multi-hop-test
  (mt/dataset airports
    (testing "Should be able to filter against other tables with that require multiple joins\n"
      (testing "single direct join: Airport -> Municipality"
        (is (= ["San Francisco International Airport"]
               (chain-filter airport.name {municipality.name ["San Francisco"]}))))
      (testing "2 joins required: Airport -> Municipality -> Region"
        (is (= ["Beale Air Force Base"
                "Edwards Air Force Base"
                "John Wayne Airport-Orange County Airport"]
               (take 3 (chain-filter airport.name {region.name ["California"]})))))
      (testing "3 joins required: Airport -> Municipality -> Region -> Country"
        (is (= ["Abraham Lincoln Capital Airport"
                "Albuquerque International Sunport"
                "Altus Air Force Base"]
               (take 3 (chain-filter airport.name {country.name ["United States"]})))))
      (testing "4 joins required: Airport -> Municipality -> Region -> Country -> Continent"
        (is (= ["Afonso Pena Airport"
                "Alejandro Velasco Astete International Airport"
                "Carrasco International /General C L Berisso Airport"]
               (take 3 (chain-filter airport.name {continent.name ["South America"]})))))
      ...
This change is below this level and matches the 70 or so assertions that were using it, in addition to the new tests. The Snowflake error is related to monthly quota and not relevant.
looks like new mega graph tests add like 10 whole minutes to the CI. it would be happy times if they were less mega
How did you determine this? Checking test runs: this branch | master
times taken from https://app.circleci.com/pipelines/github/metabase/metabase?branch=master I'm not sure I see any time that is particularly longer. Happy to adjust down though if we do see something like that. Can you show me numbers somewhere that show that @howonlee ?
you're right, the mariadb threw me off. on my machine it's a solid 15 secs longer but not big enough to be really annoying. the real but off-topic question is wtf is going on with the mariadb tests!
               (#'chain-filter/traverse-graph megagraph :start :end 5)))
        (is (= [[:start 90] [90 200] [200 :end]]
               (#'chain-filter/traverse-graph megagraph-single-path :start :end 5))))
      (testing "Returns nil if there is no path"
nit: should prolly check that it respects the max-hops param in a graph where it would get a path if it didn't respect the max-hops param, since putting that in is the thing that's gonna fix this bug either way
line 129 checks this:
(testing "Finds over a few hops"
  (let [graph {:start {:a [:start->a]}
               :a     {:b [:a->b]}
               :b     {:c [:b->c]}
               :c     {:end [:c->end]}}]
    (is (= [:start->a :a->b :b->c :c->end]
           (#'chain-filter/traverse-graph graph :start :end 5)))
    (testing "But will not exceed the max depth" ;; max depth set to 2 and it returns nil here
      (is (nil? (#'chain-filter/traverse-graph graph :start :end 2))))))
The thing that actually fixes this is the BFS change. Descending into traversals was the wrong strategy to find our way across big databases.
one nit wrt test, but otherwise works great on repl and clickin around in the product. what an interesting bug
* Breadth first search on graph for chain filters
  - abandons at max-depth and looks in breadth first fashion rather than descending as far as possible on each possible path
* Add loop detection
  not actually fooled by loops since we abandon after a certain path length. But will remove some false starts before they have to be killed by the path length check.
* Add seen optimization of nodes
  We reach nodes in the shortest path possible, so keep a list of nodes visited and don't revisit them: we got to them as fast as possible, and once there, there's no new nodes that we can see from that vantage point. This allows a serious speedup and should hopefully fully resolve the cpu issues seen by some.
* alignment
Fixes reports of high cpu usage traced to chain filters.
Bug: when getting filter values for the UI dropdowns, chain filtering takes the other current filters into account and tries to give possible values for the dropdown given the constraints of the others. It does this by creating a join on the underlying tables with those fields, and searching for values of field X (the dropdown filter field) where fields Y, Z, ... have the values given by the constraints.
In order to find these joins, we load up all of the join information for the db, and then attempt to get from one table to another. This was doing a depth first search of the entire FK relationship graph. On a large production database this could take an enormous amount of time and space.
The newer version is a breadth-first search through the FK relationship graph that won't revisit any nodes (not just watching for circular paths) and gives up after a certain path length (currently 5).
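As a rough illustration of that approach (breadth-first, never revisiting a node, giving up after a maximum path length), here is a minimal sketch. It is not the PR's actual traverse-graph: the name bfs-path is hypothetical, and it assumes the graph is a map of node -> {neighbor -> [edge-info ...]}, the shape used by the small graph in the review thread above.

```clojure
(defn bfs-path
  "Hypothetical sketch: breadth-first search from `start` to `end` in `graph`,
  a map of node -> {neighbor -> [edge-info ...]}. Returns the concatenated
  edge-infos along a shortest path, or nil if `end` cannot be reached within
  `max-depth` hops."
  [graph start end max-depth]
  (loop [frontier [[start []]]   ; pairs of [node, edge-infos used to reach it]
         seen     #{start}       ; nodes already reached by a shortest path
         depth    0]
    (cond
      (empty? frontier)    nil   ; nothing left to explore
      (>= depth max-depth) nil   ; give up rather than search deeper
      :else
      (let [candidates (for [[node path]     frontier
                             [neighbor info] (get graph node)
                             :when (not (seen neighbor))]
                         [neighbor (into path info)])
            found      (first (filter (fn [[node _]] (= node end)) candidates))]
        (if found
          (second found)
          ;; keep only the first entry per newly reached node
          (let [[next-frontier next-seen]
                (reduce (fn [[acc visited] [node _ :as entry]]
                          (if (visited node)
                            [acc visited]
                            [(conj acc entry) (conj visited node)]))
                        [[] seen]
                        candidates)]
            (recur next-frontier next-seen (inc depth))))))))
```

With the four-hop graph from the review thread, (bfs-path graph :start :end 5) returns [:start->a :a->b :b->c :c->end], and the same call with a max depth of 2 returns nil. Because every node enters the frontier at most once, the work is bounded by the number of edges rather than the number of distinct paths.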
These optimizations and the algorithm change allow us to find a path through some really nasty graphs. In the tests are megagraph and megagraph-single-path.
Megagraph
This graph has a structure in which many traversals exist. A depth-first search would find [:start 1 2 ... 49 50 :end], but this is horrific. A far easier path exists, and it is the one found here. It should arrive at this after 50 checks thanks to the seen optimization.
megagraph-single-path
This graph has only a single path to the end, [:start 90 200 :end]. This test will OOM and/or take an enormous amount of time in a depth first search, or even in a breadth first search without recording the nodes we've already seen. However, with the optimizations we can achieve this traversal.
Actual FK relationship graph
These tests make it easy to test horrific versions (we could even test the horror show schema if it had FKs in it). The graph that we are using in the real code path is a map of table-id->reachable-table->join-info.
This is the graph we are searching through to find how to construct the joins to satisfy the queries for one field given constraints on other fields.
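To make the table-id->reachable-table->join-info shape concrete, here is an illustrative value. The table IDs, field IDs, and the :lhs/:rhs keys are assumptions chosen for the example and are not taken from this PR.

```clojure
;; Hypothetical IDs: table 1 = airport, 2 = municipality, 3 = region.
;; The join-info maps below are illustrative; the real keys may differ.
(def fk-graph
  {1 {2 [{:lhs {:table 1, :field 10}, :rhs {:table 2, :field 20}}]}
   2 {3 [{:lhs {:table 2, :field 21}, :rhs {:table 3, :field 30}}]}})

;; Tables directly reachable from table 1.
(def reachable-from-airport (keys (get fk-graph 1)))

;; Join info needed for the single hop from table 1 to table 2.
(def airport->municipality (get-in fk-graph [1 2]))
```

Finding a path from table 1 to table 3 then amounts to walking the outer keys and collecting the join-info vectors along the way.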