Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster sync-fields #38828

Merged
merged 116 commits into from
Mar 14, 2024
Merged

Faster sync-fields #38828

merged 116 commits into from
Mar 14, 2024

Conversation

calherries
Copy link
Contributor

@calherries calherries commented Feb 15, 2024

This is part 2 towards resolving #38492
Part 1 is #38970

This PR adds an alternative implementation of the sync-fields step of sync where we sync the foreign key information of a database.

It adds a new driver feature: describe-fields. If a driver supports this feature, they need to implement the new driver method metabase.driver/describe-fields. If the driver doesn't support the feature, we'll continue to use metabase.driver/describe-table. The plan for driver authors to implement describe-fields in terms of driver/describe-table, which should be a really simple change for most drivers, assuming it doesn't need to be performant for large DBs.

The way this works is instead of querying the customer DB one table at a time using the getColumns JDBC method, we execute a single query that gets all the data at once with paginated results. We then reduce over those results, updating our App DB one foreign key at a time. This approach mirrors #38970

I've tested this with the "massive" redshift instance, and this is how long each sync step took:

  • "sync-tables": 13.3 s
  • "sync-fields": 1.0 mins (this is the step we're speeding up)
  • "sync-fks": 1.3 s (there are no FKs to sync though)
  • total time for all sync metadata steps: 5.0 minutes

Before this PR, sync-fields took hours (too long to bother measuring precisely anyway).

Copy link

replay-io bot commented Feb 15, 2024

Status Complete ↗︎
Commit ea9f93e
Results
⚠️ 2 Flaky
2359 Passed

@calherries calherries added the no-backport Do not backport this PR to any branch label Feb 16, 2024
@metabase metabase deleted a comment from luizarakaki Feb 16, 2024
@calherries calherries mentioned this pull request Feb 20, 2024
Comment on lines 87 to 88
[nil :database-is-auto-increment] ; only needed for actions, which redshift doesn't support yet
[nil :database-required] ; only needed for actions, which redshift doesn't support yet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably don't even need to return it as these fields are optional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, 14fb6a5

Comment on lines +372 to +377
(defn CollectionOf
"Helper for creating schemas to check whether something is an instance of a collection."
[item-schema]
[:fn
{:error/message (format "Collection of %s" item-schema)}
#(and (coll? %) (every? (partial mc/validate item-schema) %))])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually dead code, but I think it's generally useful so I'm inclined to keep it around.

At the moment we have no way to type something that could be a set or a vector for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprised it's not a built-in!

Copy link
Contributor

@crisptrutski crisptrutski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting notified that diffs are out of date, flushing comments before I continue.

@@ -87,9 +121,15 @@
;; custom Redshift type handling

(def ^:private database-type->base-type
(sql-jdbc.sync/pattern-based-database-type->base-type
[[#"(?i)CHARACTER VARYING" :type/Text] ; Redshift uses CHARACTER VARYING (N) as a synonym for VARCHAR(N)
[#"(?i)NUMERIC" :type/Decimal]])) ; and also has a NUMERIC(P,S) type, which is the same as DECIMAL(P,S)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why not keep this comment on NUMERIC(P, S)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this got deleted by accident 9edacd0

(some-fn (sql-jdbc.sync/pattern-based-database-type->base-type
[[#"(?i)CHARACTER VARYING" :type/Text] ; Redshift uses CHARACTER VARYING (N) as a synonym for VARCHAR(N)
[#"(?i)NUMERIC" :type/Decimal]])
{:super :type/* ; (requested support in metabase#36642)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a short comment here explaining how these "*" types are treated by the application, and that we only list these mappings here to disable warnings in the logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear that type of comment could get out of date here. If you have any questions about type/* it should be documented metabase.types namespace. I see that it isn't, but I don't think I could do it justice if I tried right now

I'm mostly listing these types to document these types are deliberately not being handled, as opposed to accidentally not being handled. Removing the log warnings is a nice side-effect though.

@@ -241,8 +241,8 @@
view-nm))
(let [table-id (t2/select-one-pk Table :db_id (u/the-id database), :name view-nm)]
;; and its columns' :base_type should have been identified correctly
(is (= [{:name "numeric_col", :database_type "numeric(10,2)", :base_type :type/Decimal}
{:name "weird_varchar", :database_type "character varying(50)", :base_type :type/Text}]
(is (= [{:name "numeric_col", :database_type "numeric", :base_type :type/Decimal}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: are there any consequences to us no longer recording the size of the types?

nit: I've been fooled by bad GH rendering before, but it looks like :base_type is unaligned now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I'm aware, we don't use the size / precision data from these strings anywhere.

src/metabase/driver.clj Show resolved Hide resolved
src/metabase/sync/fetch_metadata.clj Show resolved Hide resolved
src/metabase/sync/sync_metadata/fields.clj Show resolved Hide resolved
@@ -355,7 +355,7 @@
metabase.sync.field-values field-values
metabase.sync.interface i
metabase.sync.schedules sync.schedules
metabase.sync.sync-metadata.fields.fetch-metadata fetch-metadata
metabase.sync.sync-metadata.fields.our-metadata our-metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metabase.sync.sync-metadata.fields.our-metadata our-metadata
metabase.sync.sync-metadata.fields.our-metadata fields.our-metadata

without this prefix, it's hard to know whether it's fields metadata or table metadata or db metadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. we don't use this namespace much so the extra characters doesn't have much downside
40df070


(mu/defn sync-fields-for-db!
"Sync the Fields in the Metabase application database for a specific `table`."
[database :- i/DatabaseInstance]
Copy link
Contributor

@qnkhuat qnkhuat Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can make this take tables and create a [schema name] -> table mapping that you can use inside the transducer to get the table.

Always good to reduce the number of DB calls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also bothered by the N+1 table queries, but ended up not commenting as for large N we'd rather not fetch all the tables into memory.

It seems unfortunate to me that there's a deep nest of methods to require a full table model, because in practice they're mostly using the id, name, and schema mostly. I suspect we could just add one or two more table fields to the educed rows and get rid of this query, but it would be brittle to use a partially populated instance, and refactoring all those methods to depend only on their required table fields would be pretty hairy and ugly.

Copy link
Contributor Author

@calherries calherries Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What Chris said, I'm fetching each table because I'm avoiding fetching all the tables into memory. The good thing is app DB queries should be quite cheap compared to customer DB queries, so this is still a huge win.

In the future we can avoid the N+1 by doing a refactoring like the one I did for sync-fks as I did here https://github.com/metabase/metabase/pull/39679/files#diff-e09c063342055336a262f66623a6a1783a89871e4426c33c92f8f9b5f2bcaaec

But I don't think it's worth it right now. That refactoring took longer than I thought because of the quirks between different app DBs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, that makes sense; in that case, you need to "not" fetch all tables in metabase.sync.sync-metadata.fields line 117

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see why you made that comment now. Nice spot. Yes that was left over from a previous version where I was doing what you described.

Copy link
Contributor

@crisptrutski crisptrutski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done 👌

Comment on lines +372 to +377
(defn CollectionOf
"Helper for creating schemas to check whether something is an instance of a collection."
[item-schema]
[:fn
{:error/message (format "Collection of %s" item-schema)}
#(and (coll? %) (every? (partial mc/validate item-schema) %))])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprised it's not a built-in!

src/metabase/sync/sync_metadata/fields.clj Outdated Show resolved Hide resolved
@@ -4,6 +4,16 @@ title: Driver interface changelog

# Driver Interface Changelog

## Metabase 0.49.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, didn't you tell me that we don't introduce driver methods in the patch releases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe I did say that :) I can't remember how I justified it though, probably because we've never done it before. But putting it in the 0.49.0 section doesn't make much sense.

@calherries calherries requested a review from qnkhuat March 14, 2024 09:51
@calherries
Copy link
Contributor Author

calherries commented Mar 14, 2024

@qnkhuat can you take another look? It would be nice to get this in 49.0. It could be if it were approved in the next few hours I think. If it does, I'll update the changelog in the backport PR.

Copy link
Contributor

@qnkhuat qnkhuat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolve this then it LGTM!

@calherries
Copy link
Contributor Author

calherries commented Mar 14, 2024

@qnkhuat done, I also squeezed this one in: ea9f93e. In theory PKs should be optional because they'll default to false for databases that don't support them.

@calherries calherries enabled auto-merge (squash) March 14, 2024 10:17
@calherries calherries added backport Automatically create PR on current release branch on merge and removed no-backport Do not backport this PR to any branch labels Mar 14, 2024
@calherries calherries merged commit 37b5067 into master Mar 14, 2024
120 of 133 checks passed
@calherries calherries deleted the fast-sync-fields branch March 14, 2024 11:54
Copy link
Contributor

@calherries Did you forget to add a milestone to the issue for this PR? When and where should I add a milestone?

github-actions bot pushed a commit that referenced this pull request Mar 14, 2024
metabase-bot bot added a commit that referenced this pull request Mar 14, 2024
Co-authored-by: Cal Herries <39073188+calherries@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Automatically create PR on current release branch on merge .Team/BackendComponents also known as BEC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants