Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster sync-fks #38970

Merged
merged 69 commits into from
Mar 13, 2024
Merged

Faster sync-fks #38970

merged 69 commits into from
Mar 13, 2024

Conversation

calherries
Copy link
Contributor

@calherries calherries commented Feb 20, 2024

This is part 1 towards resolving #38492
Part 2 is #38828

This PR adds an alternative implementation of the sync-fks step of sync where we sync the foreign key information of a database.

It adds a new driver feature: describe-fks. If a driver supports this feature, they need to implement the new driver method metabase.driver/describe-fks. If the driver doesn't support the feature, we'll continue to use metabase.driver/describe-table-fks, which has been deprecated. The plan for driver authors to implement describe-fks in terms of driver/describe-table-fks, which should be a really simple change for most drivers, assuming it doesn't need to be performant for large DBs.

The way this works is instead of querying the customer DB one table at a time using the getImportedKeys JDBC method, we execute a single query that gets all the data at once with paginated results. We then reduce over those results (yay transducers), updating our App DB one foreign key at a time.

Only redshift is implemented right now but its implementation is 22 LOC, so more drivers can be added easily.

I tested with the redshift dev instance using a local postgres app DB with a schema with 10 foreign keys over 20 tables.

(mt/with-driver :redshift
  (mt/with-temp-test-data
    (let [n-foreign-keys 10]
      (mapcat
       (fn [i]
         [[(str "continent_" i)
           [{:field-name "name", :base-type :type/Text}]
           [["ok"]]]
          [(str "country_" i)
           [{:field-name "name", :base-type :type/Text}
            {:field-name "continent_id", :base-type :type/Integer :fk (str "continent_" i)}]
           [["ok" 1]]]])
       (range n-foreign-keys)))))

I got these times from the sync logs:

Before: 11.5 seconds 🐌
After: 867.7 ms 🚀

I don't want to test any bigger DBs yet because they take a long time to sync their fields. The real perf tests will come with #38828

Low-level test coverage could probably be improved but there's so much that depends on sync I think it's pretty safe without adding more tests.

Copy link

replay-io bot commented Feb 20, 2024

Status Complete ↗︎
Commit ac6359c
Results
⚠️ 5 Flaky
2355 Passed

@calherries calherries mentioned this pull request Feb 20, 2024
@calherries calherries added the backport Automatically create PR on current release branch on merge label Feb 21, 2024
@calherries calherries changed the base branch from master to fast-mark-fk March 6, 2024 17:51
@@ -714,6 +712,29 @@
results-metadata {:cols (column-metadata driver rsmeta)}]
(respond results-metadata (reducible-rows driver rs rsmeta qp.pipeline/*canceled-chan*))))))))

(defn sql->reducible-rows
Copy link
Contributor

@qnkhuat qnkhuat Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this when we have jdbc/reducible-query?

Copy link
Contributor Author

@calherries calherries Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high-level reason:

  • do-with-connection-with-options, statement-or-prepared-statement and execute-statement-or-prepared-statement! reuse logic that we have in place for preparing query statements. We should reuse that logic.

low-level reasons:

  1. jdbc/reducible-query doesn't use metabase.driver.sql-jdbc.execute/prepared-statement or metabase.driver.sql-jdbc.execute/statement, which does things like sets the fetch size and ResultSet/TYPE_FORWARD_ONLY
  2. jdbc/reducible-query doesn't seem to work with java.sql.Statement for some reason, only strings or java.sql.PreparedStatement. We use java.sql.Statement by default instead of java.sql.PreparedStatement, and though I'm not sure why, I figured we should use the same configuration as we have for normal queries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a docstring and renamed the function to make the purpose a little clearer, and also nod towards it being similar to jdbc/reducible-query

(defn simple-reducible-query
"Returns a reducible collection of rows as maps from `db` and a given SQL query. This is similar to [[jdbc/reducible-query]] but reuses the
driver-specific configuration for the Connection and Statement/PreparedStatement. This is slightly different from [[execute-reducible-query]]
in that it is not intended to be used as part of middleware. Keywordizes column names. "

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the name simple-reducible-query, but I wanted to distinguish it as the simpler version of execute-reducible-query in the same namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not call it reducible-query?

and I would mention the difference with jdbc/reducible-query in the docstring, I'm sure people will have the same question as I did.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not call it reducible-query?

I wanted to distinguish it as the simpler version of execute-reducible-query in the same namespace. That's not so obvious though. I will accept your suggestion :)

and I would mention the difference with jdbc/reducible-query in the docstring, I'm sure people will have the same question as I did.

I have included this already, though not in such detail:

"Returns a reducible collection of rows as maps from `db` and a given SQL query. This is similar to [[jdbc/reducible-query]] but reuses the
driver-specific configuration for the Connection and Statement/PreparedStatement. This is slightly different from [[execute-reducible-query]]


(defn set-initial-table-sync-complete-for-db!
"Marks initial sync for all tables in `db` as complete so that it becomes usable in the UI, if not already
set"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
set"
set."

docstring need to be a complete sentence

Copy link
Contributor

@qnkhuat qnkhuat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I think this is good, but I still have some suggestions that need resolving.

Also CI is failing for redshift, not sure if it's related to this PR, looks like a flake tho.

@calherries calherries requested a review from qnkhuat March 12, 2024 09:18
@calherries
Copy link
Contributor Author

@qnkhuat can you take another look? I've resolved all your suggestions with a 👍 in the last commit: 8172cbd

Indeed redshift looks like a flake.

@calherries calherries mentioned this pull request Mar 12, 2024
Copy link
Contributor

@qnkhuat qnkhuat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@calherries calherries merged commit f0afae8 into master Mar 13, 2024
106 checks passed
@calherries calherries deleted the fast-sync-fks branch March 13, 2024 10:21
Copy link

@calherries Did you forget to add a milestone to the issue for this PR? When and where should I add a milestone?

github-actions bot pushed a commit that referenced this pull request Mar 13, 2024
paoliniluis pushed a commit that referenced this pull request Mar 13, 2024
* Faster sync-fks (#38970)

* Change driver changelog and remove describe-table-fks in 52 instead of 53

---------

Co-authored-by: Cal Herries <39073188+calherries@users.noreply.github.com>
Co-authored-by: Callum Herries <calherries@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Automatically create PR on current release branch on merge .Team/BackendComponents also known as BEC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants