Chain filters cpu utilization #16817

Merged
merged 4 commits into master from chain-filters-cpu-utilization on Jun 30, 2021

Conversation

dpsutton
Contributor

@dpsutton dpsutton commented Jun 29, 2021

Fixes reports of high CPU usage traced to chain filters.

Bug: when getting filter values for the UI dropdowns, we take the other current filters into account and try to return only the possible values for the dropdown given the constraints imposed by the others. The way this is accomplished is to create a join across the underlying tables containing those fields, and search for values of field X (the dropdown filter field) where fields Y, Z, ... satisfy the constraints.
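
For a concrete sense of the API involved, here is a sketch using the airports dataset and the chain-filter test helper that appear later in this thread: a call like this asks for Airport Name values constrained by a Region Name, which gets answered by joining Airport through Municipality to Region.

;; values for the Airport Name dropdown given the constraint Region Name = "California"
(chain-filter airport.name {region.name ["California"]})
;; under the hood this builds a query roughly like:
;;   SELECT airport.name FROM airport
;;   JOIN municipality ON ... JOIN region ON ...
;;   WHERE region.name = 'California'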

In order to find these joins, we load up all of the join information for the database and then attempt to get from one table to another. This was done with a depth-first search of the entire FK relationship graph. On a large production database this could take an enormous amount of time and memory, since the number of paths explored grows combinatorially with the number of FK relationships:

;; NB: some logging omitted
(defn- find-joins* [database-id source-table-id other-table-id enable-reverse-joins?]
  (let [fk-relationships (database-fk-relationships database-id enable-reverse-joins?)]
    (letfn [(find-relationship [lhs-table-id rhs-table-id path]
              ;; get the tables LHS can join directly against.
              (when-let [direct-joins (not-empty (get fk-relationships lhs-table-id))]
                ;; first, see if there is a direct relationship between LHS and RHS. If so, use that.
                (or (get direct-joins rhs-table-id)
                    ;; if not, see if we can find an indirect path by recursing on the directly joined tables
                    (some not-empty
                          (for [[joined-id join-info] direct-joins
                                :when                 (and (not= joined-id lhs-table-id)
                                                           ;; only checks against cycles in the path, not if already visited node
                                                           (not (contains? (set path) joined-id)))
                                ;; depth first recursive call with a stack
                                :let                  [recursive-joins (find-relationship joined-id rhs-table-id
                                                                                          (conj path lhs-table-id))]
                                :when                 recursive-joins]
                            (concat join-info recursive-joins))))))]
      (find-relationship source-table-id other-table-id []))))

The new version is a breadth-first search through the FK relationship graph that won't revisit any nodes (rather than merely watching for circular paths) and gives up on a path after a certain length (currently 5):

(defn- traverse-graph
  "A breadth first traversal of graph, not probing any paths that are over `max-depth` in length."
  [graph start end max-depth]
  (letfn [(transform [path] (let [edges (partition 2 1 path)]
                              (not-empty (vec (mapcat (fn [[x y]] (get-in graph [x y])) edges)))))]
    (loop [paths (conj clojure.lang.PersistentQueue/EMPTY [start])
           seen  #{start}]
      (let [path (peek paths)
            node (peek path)]
        (cond (nil? node)
              nil
              ;; found a path, bfs finds shortest first
              (= node end)
              (transform path)
              ;; abandon this path.
              (= (count path) max-depth)
              (recur (pop paths) seen)
              ;; probe further and throw them on the queue
              :else
              (let [next-nodes (->> (get graph node)
                                    keys
                                    (remove seen))]
                (recur (into (pop paths) (for [n next-nodes] (conj path n)))
                       (set/union seen (set next-nodes)))))))))
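
For a feel of the mechanics, here is a minimal REPL sketch on a hypothetical toy graph (the node names and edge data are illustrative, not from the codebase; the var access assumes the same chain-filter alias used in the tests below). The dead-end branch through :c is pruned by the seen set because :d has already been reached via :b.

;; edge data lives at (get-in graph [from to]), as in database-fk-relationships
(def toy-graph
  {:a {:b [[:a :b]], :c [[:a :c]]}
   :b {:d [[:b :d]]}
   :c {:d [[:c :d]]}
   :d {:end [[:d :end]]}})

(#'chain-filter/traverse-graph toy-graph :a :end 5)
;; => [[:a :b] [:b :d] [:d :end]]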

These optimizations and the algorithm change allow us to find a path through some really nasty graphs. The tests include megagraph and megagraph-single-path.

Megagraph

This graph has a structure

{:start {links to 1-50}
 1      {links to 1-50 except 1}
 2      {links to 1-50 except 2}
 ...
 49     {links to 1-50 except 49}
 50     {links to 1-50 except 50
         link to :end}}

Many traversals exist, and a depth-first search would find [:start 1 2 ... 49 50 :end], which is horrific. The far easier path, and the one found here, is

      (is (= [[:start 50] [50 :end]]
             (#'chain-filter/traverse-graph megagraph :start :end 5)))

It should arrive at this after 50 checks thanks to the seen optimization.
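
For reference, here is a sketch of how a graph with this shape could be built (illustrative only; the actual test fixture may differ). Edge data is stored per (from, to) pair, so the BFS result is exactly the two edges shown above.

(def megagraph
  (let [nodes (range 1 51)]
    (-> (into {:start (into {} (for [n nodes] [n [[:start n]]]))}
              (for [n nodes]
                [n (into {} (for [m nodes :when (not= m n)] [m [[n m]]]))]))
        ;; node 50 is the only one that links to :end
        (assoc-in [50 :end] [[50 :end]]))))

(#'chain-filter/traverse-graph megagraph :start :end 5)
;; => [[:start 50] [50 :end]]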

megagraph-single-path

This graph has the shape

{:start {links to 1-199}
 1      {links to 1-199 except 1}
 2      {links to 1-199 except 2}
 ...
 90     {links to 1-199 except 90
         link to 200}
 91     {links to 1-199 except 91}
 ...
 198    {links to 1-199 except 198}
 199    {links to 1-199 except 199}
 200    {link to :end}}

This graph has only a single path to the end: [:start 90 200 :end]. This test will OOM and/or take an enormous amount of time with a depth-first search, or even with a breadth-first search that doesn't record the nodes already seen. With the optimizations, however, the traversal completes in

(time (#'chain-filter/traverse-graph megagraph-single-path :start :end 5))
"Elapsed time: 9.881863 msecs"
[[:start 90] [90 200] [200 :end]]
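
And a similar sketch for this shape (again illustrative; the real fixture may differ): every node 1-199 links to every other node in 1-199, only node 90 also links to 200, and only 200 links to :end.

(def megagraph-single-path
  (let [nodes (range 1 200)]
    (-> (into {:start (into {} (for [n nodes] [n [[:start n]]]))
               200    {:end [[200 :end]]}}
              (for [n nodes]
                [n (into {} (for [m nodes :when (not= m n)] [m [[n m]]]))]))
        ;; the single bottleneck: 90 is the only way to reach 200
        (assoc-in [90 200] [[90 200]]))))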

Actual FK relationship graph

These tests make it easy to exercise horrific graphs (we could even test the horror-show schema if it had FKs in it). The actual shape of the graph we use in the real code path looks like the following:

chain-filter=> (pprint
                 (let [enable-reverse-joins? true
                       database-id 1 #_ sample-db]
                   (database-fk-relationships database-id enable-reverse-joins?)))
{2
 {3 [{:lhs {:table 2, :field 3}, :rhs {:table 3, :field 22}}],
  1 [{:lhs {:table 2, :field 5}, :rhs {:table 1, :field 30}}],
  4 [{:lhs {:table 2, :field 4}, :rhs {:table 4, :field 34}}]},
 3 {2 [{:lhs {:table 3, :field 22}, :rhs {:table 2, :field 3}}]},
 1
 {2 [{:lhs {:table 1, :field 30}, :rhs {:table 2, :field 5}}],
  4 [{:lhs {:table 1, :field 30}, :rhs {:table 4, :field 33}}]},
 4
 {2 [{:lhs {:table 4, :field 34}, :rhs {:table 2, :field 4}}],
  1 [{:lhs {:table 4, :field 33}, :rhs {:table 1, :field 30}}]}}

This is a map of table-id -> reachable-table-id -> join-info.

This is the graph we search to figure out how to construct the joins that satisfy the query for one field given constraints on other fields.
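
As a quick sanity check (a REPL sketch; sample-fk-graph is a hypothetical binding for the map printed above), asking for a path from table 1 to table 3 returns the two joins along the shortest path, 1 -> 2 -> 3, rather than anything going through table 4:

(def sample-fk-graph
  (database-fk-relationships 1 true)) ; the map printed above

(#'chain-filter/traverse-graph sample-fk-graph 1 3 5)
;; => [{:lhs {:table 1, :field 30}, :rhs {:table 2, :field 5}}
;;     {:lhs {:table 2, :field 3}, :rhs {:table 3, :field 22}}]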

The commits, in summary:

- Breadth-first search on the graph for chain filters: abandon a path at max-depth and search breadth-first rather than descending as far as possible down each possible path.
- Loop detection: loops can't actually fool us since we abandon a path after a certain length, but this removes some false starts before the path-length check has to kill them.
- Seen-node optimization: BFS reaches each node along a shortest path, so keep a set of visited nodes and don't revisit them; we got to them as fast as possible, and once there, there are no new nodes to see from that vantage point.

This allows a serious speedup and should hopefully fully resolve the CPU issues seen by some.
@dpsutton dpsutton requested a review from a team June 29, 2021 20:48
@codecov

codecov bot commented Jun 29, 2021

Codecov Report

Merging #16817 (9ebf825) into master (02686bb) will increase coverage by 0.21%.
The diff coverage is 92.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16817      +/-   ##
==========================================
+ Coverage   84.74%   84.95%   +0.21%     
==========================================
  Files         406      415       +9     
  Lines       32333    33197     +864     
  Branches     2329     2378      +49     
==========================================
+ Hits        27400    28202     +802     
- Misses       2604     2617      +13     
- Partials     2329     2378      +49     
Impacted Files Coverage Δ
...etabase_enterprise/audit/pages/question_detail.clj 68.75% <66.66%> (-1.25%) ⬇️
...end/src/metabase_enterprise/audit/pages/common.clj 86.04% <77.77%> (+0.09%) ⬆️
...base_enterprise/enhancements/integrations/ldap.clj 86.76% <90.00%> (+0.55%) ⬆️
...backend/src/metabase_enterprise/search/scoring.clj 90.90% <90.90%> (ø)
...nd/src/metabase_enterprise/audit/pages/queries.clj 95.55% <92.30%> (-3.12%) ⬇️
...e/audit/pages/common/card_and_dashboard_detail.clj 93.67% <95.12%> (+0.19%) ⬆️
...c/metabase_enterprise/audit/pages/common/cards.clj 100.00% <100.00%> (ø)
...se_enterprise/enhancements/integrations/google.clj 100.00% <100.00%> (ø)
...d/src/metabase_enterprise/sso/integrations/jwt.clj 93.54% <100.00%> (+0.82%) ⬆️
.../src/metabase_enterprise/sso/integrations/saml.clj 75.22% <100.00%> (+2.32%) ⬆️
... and 92 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update be98b10...9ebf825.

@dpsutton
Contributor Author

High confidence in this change: there are several tests that run on top of it, using the actual joins it constructs, and they are still passing:

(deftest multi-hop-test
  (mt/dataset airports
    (testing "Should be able to filter against other tables with that require multiple joins\n"
      (testing "single direct join: Airport -> Municipality"
        (is (= ["San Francisco International Airport"]
               (chain-filter airport.name {municipality.name ["San Francisco"]}))))
      (testing "2 joins required: Airport -> Municipality -> Region"
        (is (= ["Beale Air Force Base"
                "Edwards Air Force Base"
                "John Wayne Airport-Orange County Airport"]
               (take 3 (chain-filter airport.name {region.name ["California"]})))))
      (testing "3 joins required: Airport -> Municipality -> Region -> Country"
        (is (= ["Abraham Lincoln Capital Airport"
                "Albuquerque International Sunport"
                "Altus Air Force Base"]
               (take 3 (chain-filter airport.name {country.name ["United States"]})))))
      (testing "4 joins required: Airport -> Municipality -> Region -> Country -> Continent"
        (is (= ["Afonso Pena Airport"
                "Alejandro Velasco Astete International Airport"
                "Carrasco International /General C L Berisso Airport"]
               (take 3 (chain-filter airport.name {continent.name ["South America"]})))))
...

This change sits below that level and still satisfies the 70 or so assertions that were already exercising it, in addition to the new tests.

The Snowflake failure is related to the monthly quota and is not relevant.

@howonlee
Contributor

looks like new mega graph tests add like 10 whole minutes to the CI

it would be happy times if they were less mega

@dpsutton
Contributor Author

looks like new mega graph tests add like 10 whole minutes to the CI

it would be happy times if they were less mega

How did you determine this?

Checking test runs: this branch | master

  • be-tests-java-16-ee 5m 40s | 5m 37s
  • be-tests-java-11-oss 9m 44s | 9m 0s
  • be-tests-mariadb-ee 9m 43s | 11 seconds for some reason
  • be-tests-mysql-ee 13m 59s | 14m 4s

times taken from https://app.circleci.com/pipelines/github/metabase/metabase?branch=master

I'm not sure I see any time that is particularly longer. Happy to adjust down, though, if we do see something like that. Can you point me to numbers showing that, @howonlee?

@howonlee
Contributor

howonlee commented Jun 29, 2021

you're right, the mariadb threw me off. on my machine it's a solid 15 secs longer but not big enough to be really annoying

the real but off-topic question is wtf is going on with the mariadb tests!

(#'chain-filter/traverse-graph megagraph :start :end 5)))
(is (= [[:start 90] [90 200] [200 :end]]
(#'chain-filter/traverse-graph megagraph-single-path :start :end 5))))
(testing "Returns nil if there is no path"
Contributor

@howonlee howonlee Jun 29, 2021

nit: should prolly check that it respects the max-hops param in a graph where it would get a path if it didn't respect the max-hops param, since putting that in is the thing that's gonna fix this bug either way

Contributor Author

line 129 checks this:

   (testing "Finds over a few hops"
      (let [graph {:start {:a [:start->a]}
                   :a     {:b [:a->b]}
                   :b     {:c [:b->c]}
                   :c     {:end [:c->end]}}]
        (is (= [:start->a :a->b :b->c :c->end]
               (#'chain-filter/traverse-graph graph :start :end 5)))
        (testing "But will not exceed the max depth" ;; max depth set to 2 and it returns nil here
          (is (nil? (#'chain-filter/traverse-graph graph :start :end 2))))))

The thing that actually fixes this is the BFS change. Descending into traversals was the wrong strategy to find our way across big databases.

Contributor

@howonlee howonlee left a comment

one nit wrt test, but otherwise works great on repl and clickin around in the product. what an interesting bug

@dpsutton dpsutton merged commit 8e684e3 into master Jun 30, 2021
@dpsutton dpsutton deleted the chain-filters-cpu-utilization branch June 30, 2021 15:11
dpsutton added a commit that referenced this pull request Jul 1, 2021
* Breadth first search on graph for chain filters

- abandons at max-depth and looks in breadth first fashion rather than
descending as far as possible on each possible path

* Add loop detection

not actually fooled by loops since we abandon after a certain path
length. But will remove some false starts before they have to be
killed by the path length check.

* Add seen optimization of nodes

We reach nodes in the shortest path possible, so keep a list of nodes
visited and don't revisit them: we got to them as fast as possible,
and once there, there's no new nodes that we can see from that vantage
point.

This allows a serious speedup and should hopefully fully resolve the
cpu issues seen by some.

* alignment