CSV appends: Use a transaction for inserting data #36995
Conversation
What are your thoughts on memory usage?
Are you referring to the memory footprint on the JVM, or the database? Memory usage of the JVM should be independent of whether transactions are used, because the objects can be garbage collected after they are inserted. As for the database, MySQL and Postgres are smart enough to write large volumes of data to disk, so memory isn't an issue there either; otherwise large transactions would be impossible to execute.
I was thinking about database memory usage.
Cool, that's good to know. In that case we don't need to worry about blowing up the data warehouse.
src/metabase/driver/mysql.clj
Outdated
(let [temp-file (File/createTempFile table-name ".tsv")
      file-path (.getAbsolutePath temp-file)]
  (try
    (let [tsvs (map (partial row->tsv driver (count column-names)) values)
          sql  (sql/format {::load [file-path (keyword table-name)]
                            :columns (map keyword column-names)}
                           :quoted true
                           :dialect (sql.qp/quote-style driver))]
      (with-open [^java.io.Writer writer (jio/writer file-path)]
        (doseq [value (interpose \newline tsvs)]
          (.write writer (str value)))))
This should be done outside of the transaction.
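A minimal sketch of what that suggestion could look like, assuming next.jdbc is required as jdbc and java.io.File is imported; write-tsv! and load-data-sql are hypothetical helper names, not Metabase's actual internals:

;; Sketch only: the file I/O happens before the transaction opens, so the
;; transaction covers nothing but the LOAD DATA statement itself.
;; `write-tsv!` and `load-data-sql` are hypothetical helpers.
(let [temp-file (doto (File/createTempFile table-name ".tsv")
                  (write-tsv! tsv-rows))]
  (jdbc/with-transaction [tx datasource]
    (jdbc/execute! tx [(load-data-sql temp-file)])))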
src/metabase/driver/sql_jdbc.clj
Outdated
(let [table-name (keyword table-name)
      columns    (map keyword column-names)
      ;; We need to partition the insert into multiple statements for both performance and correctness.
      ;;
      ;; On Postgres with a large file, 100 (3.76m) was significantly faster than 50 (4.03m) and 25 (4.27m).
      ;; 1,000 was a little faster but not by much (3.63m), and 10,000 threw an error:
      ;;   PreparedStatement can have at most 65,535 parameters
      ;; One imagines that `(long (/ 65535 (count columns)))` might be best, but I don't trust the 65K limit
      ;; to apply across all drivers. With that in mind, 100 seems like a safe compromise.
      ;; There's nothing magic about 100, but it felt good in testing. There could well be a better number.
      chunks     (partition-all (or driver/*insert-chunk-rows* 100) values)
      sqls       (map #(sql/format {:insert-into table-name
                                    :columns     columns
                                    :values      %}
                                   :quoted true
                                   :dialect (sql.qp/quote-style driver))
                      chunks)]
Let's move this preparation outside of the transaction.
Some preparation steps should be done outside of the transaction; other than that, this looks good.
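A sketch of the shape the reviewers are asking for, again assuming next.jdbc; the datasource binding is illustrative, and table-name, columns, and chunks are the bindings from the diff above. Format all the SQL up front, then open the transaction only for the actual INSERTs:

;; Preparation (formatting each chunk's INSERT) happens outside the transaction;
;; only execution happens inside it, keeping the transaction as short as possible.
(let [sqls (mapv #(sql/format {:insert-into table-name
                               :columns     columns
                               :values      %})
                 chunks)]
  (jdbc/with-transaction [tx datasource]
    (doseq [sql-params sqls]
      (jdbc/execute! tx sql-params))))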
(try
  (driver/insert-into! driver (:id database) (table-identifier table) normed-header parsed-rows)
  (catch Throwable e
LGTM
1 failed test on run #904
Details: Review all test suite changes for PR #36995
cdb50f5 into appends/milestone-0-endpoint
This PR will merge into the feature branch for Merge 1 of Milestone 0 (PR).
Epic: Allow appending more data to CSV uploads
product doc
eng doc
This PR mostly preserves observable behaviour. It replaces the logic I created to fake transactions when handling insertion errors with real transactions.

Originally I wanted to use _mb_row_id for deleting inserted rows if there was a failure, because I thought the reason we weren't using transactions before was to make insertions faster. It turns out I was wrong: using a transaction is actually slightly faster, because a transaction doesn't add much overhead but each commit does.

I say this PR "mostly" preserves observable behaviour because it makes one change: if a failure occurs while inserting data and the table didn't have a _mb_row_id column before, no _mb_row_id column is created. That behaviour can be seen in append-no-mb-row-id-failure-test.
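For context, a minimal sketch of the real-transaction pattern described here, using next.jdbc; the names (append-rows!, datasource, table, columns, rows) are illustrative, not Metabase's actual implementation:

(ns example.csv-append
  (:require [next.jdbc :as jdbc]
            [next.jdbc.sql :as sql]))

;; Insert every chunk inside a single transaction: if any chunk fails, the
;; transaction rolls back and no partially-appended rows are left behind,
;; removing the need to track and delete them via _mb_row_id.
(defn append-rows! [datasource table columns rows]
  (jdbc/with-transaction [tx datasource]
    (doseq [chunk (partition-all 100 rows)]   ; 100 matches the chunk size above
      (sql/insert-multi! tx table columns chunk))))

Because the rollback is handled by the database, the failure path needs no compensating deletes, which is why the _mb_row_id bookkeeping could be dropped.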