Chain filters cpu utilization #16817

Merged
merged 4 commits into master from chain-filters-cpu-utilization on Jun 30, 2021

Conversation

dpsutton
Contributor

@dpsutton dpsutton commented Jun 29, 2021

Fixes reports of high CPU usage traced to chain filters.

Bug: when getting filter values for the UI dropdowns, we take the other current filters into account and try to return only the possible values for the dropdown given the constraints imposed by the others. The way this is accomplished is to create a join across the underlying tables containing those fields, and search for values of field X (the dropdown filter field) where fields Y, Z, ... satisfy the constraints.
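
For a concrete sense of the API involved, here is a sketch using the airports dataset and the chain-filter test helper that appear later in this thread: a call like this asks for Airport Name values constrained by a Region Name, which gets answered by joining Airport through Municipality to Region.

;; values for the Airport Name dropdown given the constraint Region Name = "California"
(chain-filter airport.name {region.name ["California"]})
;; under the hood this builds a query roughly like:
;;   SELECT airport.name FROM airport
;;   JOIN municipality ON ... JOIN region ON ...
;;   WHERE region.name = 'California'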

In order to find these joins, we load up all of the join information for the database and then attempt to get from one table to another. This was done with a depth-first search of the entire FK relationship graph. On a large production database this could take an enormous amount of time and memory, since the number of paths explored grows combinatorially with the number of FK relationships:

;; NB: some logging omitted
(defn- find-joins* [database-id source-table-id other-table-id enable-reverse-joins?]
  (let [fk-relationships (database-fk-relationships database-id enable-reverse-joins?)]
    (letfn [(find-relationship [lhs-table-id rhs-table-id path]
              ;; get the tables LHS can join directly against.
              (when-let [direct-joins (not-empty (get fk-relationships lhs-table-id))]
                ;; first, see if there is a direct relationship between LHS and RHS. If so, use that.
                (or (get direct-joins rhs-table-id)
                    ;; if not, see if we can find an indirect path by recursing on the directly joined tables
                    (some not-empty
                          (for [[joined-id join-info] direct-joins
                                :when                 (and (not= joined-id lhs-table-id)
                                                           ;; only checks against cycles in the path, not if already visited node
                                                           (not (contains? (set path) joined-id)))
                                ;; depth first recursive call with a stack
                                :let                  [recursive-joins (find-relationship joined-id rhs-table-id
                                                                                          (conj path lhs-table-id))]
                                :when                 recursive-joins]
                            (concat join-info recursive-joins))))))]
      (find-relationship source-table-id other-table-id []))))

The new version is a breadth-first search through the FK relationship graph that won't revisit any nodes (rather than merely watching for circular paths) and gives up on a path after a certain length (currently 5):

(defn- traverse-graph
  "A breadth first traversal of graph, not probing any paths that are over `max-depth` in length."
  [graph start end max-depth]
  (letfn [(transform [path] (let [edges (partition 2 1 path)]
                              (not-empty (vec (mapcat (fn [[x y]] (get-in graph [x y])) edges)))))]
    (loop [paths (conj clojure.lang.PersistentQueue/EMPTY [start])
           seen  #{start}]
      (let [path (peek paths)
            node (peek path)]
        (cond (nil? node)
              nil
              ;; found a path, bfs finds shortest first
              (= node end)
              (transform path)
              ;; abandon this path.
              (= (count path) max-depth)
              (recur (pop paths) seen)
              ;; probe further and throw them on the queue
              :else
              (let [next-nodes (->> (get graph node)
                                    keys
                                    (remove seen))]
                (recur (into (pop paths) (for [n next-nodes] (conj path n)))
                       (set/union seen (set next-nodes)))))))))
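
For a feel of the mechanics, here is a minimal REPL sketch on a hypothetical toy graph (the node names and edge data are illustrative, not from the codebase; the var access assumes the same chain-filter alias used in the tests below). The dead-end branch through :c is pruned by the seen set because :d has already been reached via :b.

;; edge data lives at (get-in graph [from to]), as in database-fk-relationships
(def toy-graph
  {:a {:b [[:a :b]], :c [[:a :c]]}
   :b {:d [[:b :d]]}
   :c {:d [[:c :d]]}
   :d {:end [[:d :end]]}})

(#'chain-filter/traverse-graph toy-graph :a :end 5)
;; => [[:a :b] [:b :d] [:d :end]]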

These optimizations and the algorithm change allow us to find a path through some really nasty graphs. The tests include megagraph and megagraph-single-path.

Megagraph

This graph has a structure

{:start {links to 1-50}
 1      {links to 1-50 except 1}
 2      {links to 1-50 except 2}
 ...
 49     {links to 1-50 except 49}
 50     {links to 1-50 except 50
         link to :end}}

Many traversals exist, and a depth-first search would find [:start 1 2 ... 49 50 :end], which is horrific. The far easier path, and the one found here, is

      (is (= [[:start 50] [50 :end]]
             (#'chain-filter/traverse-graph megagraph :start :end 5)))

It should arrive at this after 50 checks thanks to the seen optimization.
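
For reference, here is a sketch of how a graph with this shape could be built (illustrative only; the actual test fixture may differ). Edge data is stored per (from, to) pair, so the BFS result is exactly the two edges shown above.

(def megagraph
  (let [nodes (range 1 51)]
    (-> (into {:start (into {} (for [n nodes] [n [[:start n]]]))}
              (for [n nodes]
                [n (into {} (for [m nodes :when (not= m n)] [m [[n m]]]))]))
        ;; node 50 is the only one that links to :end
        (assoc-in [50 :end] [[50 :end]]))))

(#'chain-filter/traverse-graph megagraph :start :end 5)
;; => [[:start 50] [50 :end]]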

megagraph-single-path

This graph has the shape

{:start {links to 1-199}
 1      {links to 1-199 except 1}
 2      {links to 1-199 except 2}
 ...
 90     {links to 1-199 except 90
         link to 200}
 91     {links to 1-199 except 91}
 ...
 198    {links to 1-199 except 198}
 199    {links to 1-199 except 199}
 200    {link to :end}}

This graph has only a single path to the end: [:start 90 200 :end]. This test will OOM and/or take an enormous amount of time with a depth-first search, or even with a breadth-first search that doesn't record the nodes already seen. With the optimizations, however, the traversal completes in

(time (#'chain-filter/traverse-graph megagraph-single-path :start :end 5))
"Elapsed time: 9.881863 msecs"
[[:start 90] [90 200] [200 :end]]
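
And a similar sketch for this shape (again illustrative; the real fixture may differ): every node 1-199 links to every other node in 1-199, only node 90 also links to 200, and only 200 links to :end.

(def megagraph-single-path
  (let [nodes (range 1 200)]
    (-> (into {:start (into {} (for [n nodes] [n [[:start n]]]))
               200    {:end [[200 :end]]}}
              (for [n nodes]
                [n (into {} (for [m nodes :when (not= m n)] [m [[n m]]]))]))
        ;; the single bottleneck: 90 is the only way to reach 200
        (assoc-in [90 200] [[90 200]]))))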

Actual FK relationship graph

These tests make it easy to exercise horrific graphs (we could even test the horror-show schema if it had FKs in it). The actual shape of the graph we use in the real code path looks like the following:

chain-filter=> (pprint
                 (let [enable-reverse-joins? true
                       database-id 1 #_ sample-db]
                   (database-fk-relationships database-id enable-reverse-joins?)))
{2
 {3 [{:lhs {:table 2, :field 3}, :rhs {:table 3, :field 22}}],
  1 [{:lhs {:table 2, :field 5}, :rhs {:table 1, :field 30}}],
  4 [{:lhs {:table 2, :field 4}, :rhs {:table 4, :field 34}}]},
 3 {2 [{:lhs {:table 3, :field 22}, :rhs {:table 2, :field 3}}]},
 1
 {2 [{:lhs {:table 1, :field 30}, :rhs {:table 2, :field 5}}],
  4 [{:lhs {:table 1, :field 30}, :rhs {:table 4, :field 33}}]},
 4
 {2 [{:lhs {:table 4, :field 34}, :rhs {:table 2, :field 4}}],
  1 [{:lhs {:table 4, :field 33}, :rhs {:table 1, :field 30}}]}}

This is a map of table-id -> reachable-table-id -> join-info.

This is the graph we search to figure out how to construct the joins that satisfy the query for one field given constraints on other fields.
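
As a quick sanity check (a REPL sketch; sample-fk-graph is a hypothetical binding for the map printed above), asking for a path from table 1 to table 3 returns the two joins along the shortest path, 1 -> 2 -> 3, rather than anything going through table 4:

(def sample-fk-graph
  (database-fk-relationships 1 true)) ; the map printed above

(#'chain-filter/traverse-graph sample-fk-graph 1 3 5)
;; => [{:lhs {:table 1, :field 30}, :rhs {:table 2, :field 5}}
;;     {:lhs {:table 2, :field 3}, :rhs {:table 3, :field 22}}]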

The commits, in summary:

- Breadth-first search on the graph for chain filters: abandon a path at max-depth and search breadth-first rather than descending as far as possible down each possible path.
- Loop detection: loops can't actually fool us since we abandon a path after a certain length, but this removes some false starts before the path-length check has to kill them.
- Seen-node optimization: BFS reaches each node along a shortest path, so keep a set of visited nodes and don't revisit them; we got to them as fast as possible, and once there, there are no new nodes to see from that vantage point.

This allows a serious speedup and should hopefully fully resolve the CPU issues seen by some.
@dpsutton dpsutton requested a review from a team June 29, 2021 20:48
@codecov

codecov bot commented Jun 29, 2021

Codecov Report

Merging #16817 (9ebf825) into master (02686bb) will increase coverage by 0.21%.
The diff coverage is 92.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16817      +/-   ##
==========================================
+ Coverage   84.74%   84.95%   +0.21%     
==========================================
  Files         406      415       +9     
  Lines       32333    33197     +864     
  Branches     2329     2378      +49     
==========================================
+ Hits        27400    28202     +802     
- Misses       2604     2617      +13     
- Partials     2329     2378      +49     
Impacted Files Coverage Δ
...etabase_enterprise/audit/pages/question_detail.clj 68.75% <66.66%> (-1.25%) ⬇️
...end/src/metabase_enterprise/audit/pages/common.clj 86.04% <77.77%> (+0.09%) ⬆️
...base_enterprise/enhancements/integrations/ldap.clj 86.76% <90.00%> (+0.55%) ⬆️
...backend/src/metabase_enterprise/search/scoring.clj 90.90% <90.90%> (ø)
...nd/src/metabase_enterprise/audit/pages/queries.clj 95.55% <92.30%> (-3.12%) ⬇️
...e/audit/pages/common/card_and_dashboard_detail.clj 93.67% <95.12%> (+0.19%) ⬆️
...c/metabase_enterprise/audit/pages/common/cards.clj 100.00% <100.00%> (ø)
...se_enterprise/enhancements/integrations/google.clj 100.00% <100.00%> (ø)
...d/src/metabase_enterprise/sso/integrations/jwt.clj 93.54% <100.00%> (+0.82%) ⬆️
.../src/metabase_enterprise/sso/integrations/saml.clj 75.22% <100.00%> (+2.32%) ⬆️
... and 92 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update be98b10...9ebf825.

@dpsutton
Contributor Author

High confidence in this change: there are several tests that run on top of it, using the actual joins it constructs, and they are still passing:

(deftest multi-hop-test
  (mt/dataset airports
    (testing "Should be able to filter against other tables with that require multiple joins\n"
      (testing "single direct join: Airport -> Municipality"
        (is (= ["San Francisco International Airport"]
               (chain-filter airport.name {municipality.name ["San Francisco"]}))))
      (testing "2 joins required: Airport -> Municipality -> Region"
        (is (= ["Beale Air Force Base"
                "Edwards Air Force Base"
                "John Wayne Airport-Orange County Airport"]
               (take 3 (chain-filter airport.name {region.name ["California"]})))))
      (testing "3 joins required: Airport -> Municipality -> Region -> Country"
        (is (= ["Abraham Lincoln Capital Airport"
                "Albuquerque International Sunport"
                "Altus Air Force Base"]
               (take 3 (chain-filter airport.name {country.name ["United States"]})))))
      (testing "4 joins required: Airport -> Municipality -> Region -> Country -> Continent"
        (is (= ["Afonso Pena Airport"
                "Alejandro Velasco Astete International Airport"
                "Carrasco International /General C L Berisso Airport"]
               (take 3 (chain-filter airport.name {continent.name ["South America"]})))))
...

This change sits below that level and still satisfies the 70 or so assertions that were already exercising it, in addition to the new tests.

The Snowflake failure is related to the monthly quota and is not relevant.

@howonlee
Contributor

looks like new mega graph tests add like 10 whole minutes to the CI

it would be happy times if they were less mega

@dpsutton
Contributor Author

looks like new mega graph tests add like 10 whole minutes to the CI

it would be happy times if they were less mega

How did you determine this?

Checking test runs: this branch | master

  • be-tests-java-16-ee 5m 40s | 5m 37s
  • be-tests-java-11-oss 9m 44s | 9m 0s
  • be-tests-mariadb-ee 9m 43s | 11 seconds for some reason
  • be-tests-mysql-ee 13m 59s | 14m 4s

times taken from https://app.circleci.com/pipelines/github/metabase/metabase?branch=master

I'm not sure I see any time that is particularly longer. Happy to adjust down, though, if we do see something like that. Can you point me to numbers showing that, @howonlee?

@howonlee
Contributor

howonlee commented Jun 29, 2021

you're right, the mariadb threw me off. on my machine it's a solid 15 secs longer but not big enough to be really annoying

the real but off-topic question is wtf is going on with the mariadb tests!

(#'chain-filter/traverse-graph megagraph :start :end 5)))
(is (= [[:start 90] [90 200] [200 :end]]
(#'chain-filter/traverse-graph megagraph-single-path :start :end 5))))
(testing "Returns nil if there is no path"
Contributor

@howonlee howonlee Jun 29, 2021

nit: should prolly check that it respects the max-hops param in a graph where it would get a path if it didn't respect the max-hops param, since putting that in is the thing that's gonna fix this bug either way

Contributor Author

line 129 checks this:

   (testing "Finds over a few hops"
      (let [graph {:start {:a [:start->a]}
                   :a     {:b [:a->b]}
                   :b     {:c [:b->c]}
                   :c     {:end [:c->end]}}]
        (is (= [:start->a :a->b :b->c :c->end]
               (#'chain-filter/traverse-graph graph :start :end 5)))
        (testing "But will not exceed the max depth" ;; max depth set to 2 and it returns nil here
          (is (nil? (#'chain-filter/traverse-graph graph :start :end 2))))))

The thing that actually fixes this is the BFS change. Descending into traversals was the wrong strategy to find our way across big databases.

Contributor

@howonlee howonlee left a comment

one nit wrt test, but otherwise works great on repl and clickin around in the product. what an interesting bug

@dpsutton dpsutton merged commit 8e684e3 into master Jun 30, 2021
@dpsutton dpsutton deleted the chain-filters-cpu-utilization branch June 30, 2021 15:11
dpsutton added a commit that referenced this pull request Jul 1, 2021
* Breadth first search on graph for chain filters

- abandons at max-depth and looks in breadth first fashion rather than
descending as far as possible on each possible path

* Add loop detection

not actually fooled by loops since we abandon after a certain path
length. But will remove some false starts before they have to be
killed by the path length check.

* Add seen optimization of nodes

We reach nodes in the shortest path possible, so keep a list of nodes
visited and don't revisit them: we got to them as fast as possible,
and once there, there's no new nodes that we can see from that vantage
point.

This allows a serious speedup and should hopefully fully resolve the
cpu issues seen by some.

* alignment