Faster sync-fks #38970

calherries · 2024-02-20T16:30:33Z

This is part 1 towards resolving #38492
Part 2 is #38828

This PR adds an alternative implementation of the sync-fks step of sync where we sync the foreign key information of a database.

It adds a new driver feature: describe-fks. A driver that supports this feature needs to implement the new driver method metabase.driver/describe-fks. If the driver doesn't support the feature, we'll continue to use metabase.driver/describe-table-fks, which has been deprecated. The plan is for driver authors to implement describe-fks in terms of driver/describe-table-fks, which should be a really simple change for most drivers, assuming it doesn't need to be performant for large DBs.
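For illustration, here is a rough sketch of what building a whole-database FK description on top of a per-table lookup could look like. Everything in it is an assumption for the example: `describe-table-fks*`, `describe-fks*`, the fake data, and the map keys are hypothetical stand-ins, not the real Metabase multimethods or their exact return shapes.

```clojure
;; Hypothetical per-table FK data, standing in for what a driver would fetch.
(def fake-fks-by-table
  {"country_0"   #{{:fk-column-name   "continent_id"
                    :dest-table       {:name "continent_0" :schema "public"}
                    :dest-column-name "id"}}
   "continent_0" #{}})

(defn describe-table-fks*
  "Hypothetical stand-in for a per-table FK lookup. Returns a set of FK maps."
  [table]
  (get fake-fks-by-table (:name table) #{}))

(defn describe-fks*
  "Returns a reducible of flat FK maps for all `tables`, built by calling the
  per-table lookup once per table. Simple, but not fast for large DBs."
  [tables]
  (eduction
   (mapcat (fn [table]
             (for [fk (describe-table-fks* table)]
               {:fk-table-schema (:schema table)
                :fk-table-name   (:name table)
                :fk-column-name  (:fk-column-name fk)
                :pk-table-schema (get-in fk [:dest-table :schema])
                :pk-table-name   (get-in fk [:dest-table :name])
                :pk-column-name  (:dest-column-name fk)})))
   tables))
```

Calling `(into [] (describe-fks* tables))` realizes the rows. The per-table call still happens once per table, which is why this shape only suits drivers that don't need to be fast on large DBs.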

The way this works: instead of querying the customer DB one table at a time using the getImportedKeys JDBC method, we execute a single query that gets all the foreign key data at once, with paginated results. We then reduce over those results (yay transducers), updating our App DB one foreign key at a time.
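A minimal sketch of the reduce-and-update idea, assuming an atom as a stand-in for the App DB. The names `mark-fk!*` and `sync-fks-sketch!` are hypothetical and the real implementation in this PR differs. The point is only that rows are consumed one at a time and never collected into a full list.

```clojure
(defn mark-fk!*
  "Hypothetical stand-in for updating one FK in the App DB. Here the 'App DB'
  is an atom holding a set. Returns true if the row changed anything."
  [app-db row]
  (let [[old new] (swap-vals! app-db conj row)]
    (not= old new)))

(defn sync-fks-sketch!
  "Reduces over `fk-rows` (any reducible), updating `app-db` one row at a time.
  Returns the number of rows that resulted in an update."
  [app-db fk-rows]
  (transduce
   (map (fn [row] (if (mark-fk!* app-db row) 1 0)))
   +
   0
   fk-rows))
```

Because `transduce` accepts any reducible, `fk-rows` can be a vector in tests or a result-set-backed reducible in practice.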

Only redshift is implemented right now but its implementation is 22 LOC, so more drivers can be added easily.

I tested with the redshift dev instance using a local postgres app DB with a schema with 10 foreign keys over 20 tables.

(mt/with-driver :redshift
  (mt/with-temp-test-data
    (let [n-foreign-keys 10]
      (mapcat
       (fn [i]
         [[(str "continent_" i)
           [{:field-name "name", :base-type :type/Text}]
           [["ok"]]]
          [(str "country_" i)
           [{:field-name "name", :base-type :type/Text}
            {:field-name "continent_id", :base-type :type/Integer :fk (str "continent_" i)}]
           [["ok" 1]]]])
       (range n-foreign-keys)))))

I got these times from the sync logs:

Before: 11.5 seconds 🐌
After: 867.7 ms 🚀

I don't want to test any bigger DBs yet because they take a long time to sync their fields. The real perf tests will come with #38828

Low-level test coverage could probably be improved but there's so much that depends on sync I think it's pretty safe without adding more tests.

replay-io · 2024-02-20T16:49:02Z

Status	Complete ↗︎
Commit	`ac6359c`
Results	⚠️ 5 Flaky ✅ 2355 Passed

modules/drivers/redshift/src/metabase/driver/redshift.clj

src/metabase/sync/fetch_metadata.clj

src/metabase/driver/sql_jdbc/sync/describe_table.clj

modules/drivers/redshift/src/metabase/driver/redshift.clj

src/metabase/sync/sync_metadata/fks.clj

src/metabase/driver/sql_jdbc/execute.clj

qnkhuat · 2024-03-12T05:28:06Z

src/metabase/driver/sql_jdbc/execute.clj

@@ -714,6 +712,29 @@
              results-metadata {:cols (column-metadata driver rsmeta)}]
          (respond results-metadata (reducible-rows driver rs rsmeta qp.pipeline/*canceled-chan*))))))))

+(defn sql->reducible-rows


why do we need this when we have jdbc/reducible-query?

high-level reason:

do-with-connection-with-options, statement-or-prepared-statement and execute-statement-or-prepared-statement! reuse logic that we have in place for preparing query statements. We should reuse that logic.

low-level reasons:

jdbc/reducible-query doesn't use metabase.driver.sql-jdbc.execute/prepared-statement or metabase.driver.sql-jdbc.execute/statement, which do things like set the fetch size and ResultSet/TYPE_FORWARD_ONLY

jdbc/reducible-query doesn't seem to work with java.sql.Statement for some reason, only strings or java.sql.PreparedStatement. We use java.sql.Statement by default instead of java.sql.PreparedStatement, and though I'm not sure why, I figured we should use the same configuration as we have for normal queries.

I've added a docstring and renamed the function to make the purpose a little clearer, and also nod towards it being similar to jdbc/reducible-query

metabase/src/metabase/driver/sql_jdbc/execute.clj

Lines 717 to 720 in 2d5975e

(defn simple-reducible-query
  "Returns a reducible collection of rows as maps from `db` and a given SQL query. This is similar to [[jdbc/reducible-query]] but reuses the
  driver-specific configuration for the Connection and Statement/PreparedStatement. This is slightly different from [[execute-reducible-query]]
  in that it is not intended to be used as part of middleware. Keywordizes column names."

I'm not sure about the name simple-reducible-query, but I wanted to distinguish it as the simpler version of execute-reducible-query in the same namespace.
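To make "a reducible collection of rows as maps" concrete, here is a rough, database-free sketch of the general pattern: the resource is only opened when the collection is reduced, and it is always closed afterwards. `reducible-rows-sketch`, `open!` and `close!` are hypothetical names for this example. In the real function the resource would be a Connection and Statement set up with the driver-specific options.

```clojure
(defn reducible-rows-sketch
  "Returns a reducible. When reduced, it calls `open!` to get a resource with
  a :rows seq of string-keyed maps, keywordizes the column names of each row,
  and always calls `close!` on the resource, even if reducing throws."
  [open! close!]
  (reify clojure.lang.IReduceInit
    (reduce [_ rf init]
      (let [resource (open!)]
        (try
          (reduce rf
                  init
                  (map (fn [row]
                         (into {} (map (fn [[k v]] [(keyword k) v])) row))
                       (:rows resource)))
          (finally
            (close! resource)))))))
```

Nothing is opened until something like `into`, `reduce` or `transduce` consumes the result, so the caller never holds a connection longer than the reduction itself.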

why not call it reducible-query?

and I would mention the difference with jdbc/reducible-query in the docstring, I'm sure people will have the same question as I did.

> why not call it reducible-query?

I wanted to distinguish it as the simpler version of execute-reducible-query in the same namespace. That's not so obvious though. I will accept your suggestion :)

> and I would mention the difference with jdbc/reducible-query in the docstring, I'm sure people will have the same question as I did.

I have included this already, though not in such detail:

metabase/src/metabase/driver/sql_jdbc/execute.clj

Lines 718 to 719 in 2d5975e

"Returns a reducible collection of rows as maps from `db` and a given SQL query. This is similar to [[jdbc/reducible-query]] but reuses the
driver-specific configuration for the Connection and Statement/PreparedStatement. This is slightly different from [[execute-reducible-query]]

src/metabase/driver/sql_jdbc/sync/describe_table.clj

qnkhuat · 2024-03-12T05:34:49Z

src/metabase/sync/util.clj

+
+(defn set-initial-table-sync-complete-for-db!
+  "Marks initial sync for all tables in `db` as complete so that it becomes usable in the UI, if not already
+  set"


Suggested change

set"

set."

Docstrings need to be complete sentences.

qnkhuat

Generally, I think this is good, but I still have some suggestions that need resolving.

Also CI is failing for redshift, not sure if it's related to this PR, looks like a flake tho.

calherries · 2024-03-12T09:22:30Z

@qnkhuat can you take another look? I've resolved all your suggestions with a 👍 in the last commit: 8172cbd

Indeed redshift looks like a flake.

qnkhuat

LGTM!

github-actions · 2024-03-13T10:21:47Z

@calherries Did you forget to add a milestone to the issue for this PR? When and where should I add a milestone?

* Faster sync-fks (#38970)
* Change driver changelog and remove describe-table-fks in 52 instead of 53

---------

Co-authored-by: Cal Herries <39073188+calherries@users.noreply.github.com>
Co-authored-by: Callum Herries <calherries@gmail.com>

…38970) (#40092)

calherries requested a review from camsaul as a code owner February 20, 2024 16:30

metabase-bot bot assigned calherries Feb 20, 2024

metabase-bot bot added the .Team/BackendComponents also known as BEC label Feb 20, 2024

calherries force-pushed the fast-sync-fks branch from 7917aa1 to a99e552 Compare February 20, 2024 16:37

calherries mentioned this pull request Feb 20, 2024

Faster sync-fields #38828

Merged

calherries added the backport Automatically create PR on current release branch on merge label Feb 21, 2024

calherries mentioned this pull request Feb 22, 2024

Remove special describe-table-fks for redshift #39011

Merged

calherries mentioned this pull request Mar 1, 2024

Getting fields in (Redshift) is slow #38492

Closed

calherries added 8 commits March 6, 2024 11:52

Make mark-fk! faster and update existing FKs

4d5eefa

Add foreign key update test

f5af96d

Tidy

650f3bc

Comment out failing test

0265654

add issue #

e2cd368

Make sync-fks! reducible

4b8610e

remove call count

80d01a5

tidy

da4d3a4

calherries mentioned this pull request Mar 6, 2024

Update existing foreign keys during metadata sync #39679

Merged

calherries added 5 commits March 6, 2024 15:28

Deduplicate

d479731

set semantic_type

04d71be

Use jdbc/execute! not toucan

23f11bd

Fix test

40f1ecd

Fix

e863a2d

calherries force-pushed the fast-sync-fks branch from 1968bdc to 26e9a44 Compare March 6, 2024 17:26

calherries changed the base branch from master to fast-mark-fk March 6, 2024 17:51

calherries added 5 commits March 6, 2024 20:17

Fix

a7488e1

Faster sync-fks

e8075d0

Set initial table sync complete in sync fks

46c1672

Fix docstring of get-fks

e531d05

Tidy up and add schema for fk-metadata

c1589ac