
CSV appends: Use a transaction for inserting data #36995

Merged
@calherries merged 6 commits into appends/milestone-0-endpoint from appends/use-transaction on Dec 22, 2023

Conversation

@calherries (Contributor) commented Dec 20, 2023

This PR will merge into the feature branch for Merge 1 of Milestone 0 (PR)

Epic: Allow appending more data to CSV uploads
product doc
eng doc

This PR mostly preserves observable behaviour. It replaces the logic I created for handling insertion errors, which faked a transaction, with real transactions.

Originally I wanted to use _mb_row_id to delete inserted rows if there was a failure, because I thought the reason we weren't using transactions before was to make insertions faster. It turns out I was wrong: using a transaction is actually slightly faster, because the transaction itself adds little overhead, whereas each individual commit does.

I say "this PR mostly preserves observable behaviour" because it makes one change: if a failure occurs while inserting data and the table didn't have a _mb_row_id column before, no _mb_row_id column is created. That behaviour is exercised in append-no-mb-row-id-failure-test.
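To illustrate, the shape of the change is roughly the following. This is a minimal sketch using clojure.java.jdbc directly rather than our actual driver helpers; db-spec, table, columns, and rows are illustrative placeholders, not the PR's code:

(require '[clojure.java.jdbc :as jdbc])

;; Sketch only, not Metabase's implementation. One transaction wraps
;; every chunk, so a failure in any chunk rolls back all rows inserted
;; so far, and there is exactly one commit at the end, which is why a
;; real transaction is slightly faster than committing per chunk.
(defn insert-rows! [db-spec table columns rows]
  (jdbc/with-db-transaction [tx db-spec]
    (doseq [chunk (partition-all 100 rows)]
      (jdbc/insert-multi! tx table columns chunk))))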

@metabase-bot added the .Team/BackendComponents label (also known as BEC) on Dec 20, 2023
@calherries requested a review from a team on December 20, 2023 at 17:07
@qnkhuat (Contributor) commented Dec 21, 2023

What are your thoughts on memory usage?

@calherries (Contributor, Author) commented:

Are you referring to the memory footprint on the JVM, or the database?

Memory usage of the JVM should be independent of whether transactions are used or not, because the objects can be garbage collected after they are inserted.

As for the database, MySQL and Postgres are smart enough to spill large amounts of data to disk, so memory isn't an issue there either; otherwise large transactions would be impossible to execute.

@qnkhuat (Contributor) commented Dec 22, 2023

I was thinking about database memory usage.

> MySQL and Postgres are smart enough to spill large amounts of data to disk

Cool, that's good to know. In that case we don't need to worry about blowing up the data warehouse.

Comment on lines 751 to 761
;; Write the rows to a temp TSV file, then build the statement that
;; will bulk-load that file into the target table.
(let [temp-file (File/createTempFile table-name ".tsv")
      file-path (.getAbsolutePath temp-file)]
  (try
    (let [tsvs (map (partial row->tsv driver (count column-names)) values)
          sql  (sql/format {::load   [file-path (keyword table-name)]
                            :columns (map keyword column-names)}
                           :quoted true
                           :dialect (sql.qp/quote-style driver))]
      (with-open [^java.io.Writer writer (jio/writer file-path)]
        (doseq [value (interpose \newline tsvs)]
          (.write writer (str value))))

This should be done outside of the transaction.
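Something like this ordering, sketched with hypothetical helpers (write-tsv!, load-file!, and with-transaction are stand-ins, not our actual functions):

;; Sketch only. The point is the ordering: the slow file I/O happens
;; before the transaction opens, so the transaction covers only the
;; load statement itself and doesn't hold locks while writing to disk.
(let [temp-file (File/createTempFile table-name ".tsv")]
  (try
    (write-tsv! temp-file values)      ; preparation, outside the transaction
    (with-transaction [tx db]
      (load-file! tx temp-file))       ; only the load runs inside
    (finally
      (.delete temp-file))))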

Comment on lines 144 to 160
(let [table-name (keyword table-name)
      columns    (map keyword column-names)
      ;; We need to partition the insert into multiple statements for both performance and correctness.
      ;;
      ;; On Postgres with a large file, 100 (3.76m) was significantly faster than 50 (4.03m) and 25 (4.27m). 1,000 was a
      ;; little faster but not by much (3.63m), and 10,000 threw an error:
      ;;     PreparedStatement can have at most 65,535 parameters
      ;; One imagines that `(long (/ 65535 (count columns)))` might be best, but I don't trust the 65K limit to apply
      ;; across all drivers. With that in mind, 100 seems like a safe compromise.
      ;; There's nothing magic about 100, but it felt good in testing. There could well be a better number.
      chunks (partition-all (or driver/*insert-chunk-rows* 100) values)
      sqls   (map #(sql/format {:insert-into table-name
                                :columns     columns
                                :values      %}
                               :quoted true
                               :dialect (sql.qp/quote-style driver))
                  chunks)]

let's move this preparation outside of the transaction.
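For reference, the parameter-limit arithmetic mentioned in the comment above works out like this (a throwaway illustration, not code from this PR):

;; Each row binds (count columns) parameters, so Postgres's 65,535
;; prepared-statement parameter cap limits the rows per INSERT.
(defn max-rows-per-statement [columns]
  (long (/ 65535 (count columns))))

(max-rows-per-statement (range 10)) ;=> 6553 rows for a 10-column table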

@qnkhuat (Contributor) left a comment

Some preparation steps should be done outside of the transaction; other than that, this looks good.

(try
  (driver/insert-into! driver (:id database) (table-identifier table) normed-header parsed-rows)
  (catch Throwable e


Suggested change

@qnkhuat (Contributor) left a comment

LGTM


cypress bot commented Dec 22, 2023

1 failed test on run #904 ↗︎

Failed: 1 · Passed: 2177 · Pending: 159 · Skipped: 0 · Flakiness: 0

Details:

CSV appends: Use a transaction for inserting data
Project: Metabase e2e
Commit: 2cdf9bef30
Status: Failed
Duration: 14:42
Started: Dec 22, 2023 10:40 AM
Ended: Dec 22, 2023 10:55 AM

Review all test suite changes for PR #36995 ↗︎

@calherries merged commit cdb50f5 into appends/milestone-0-endpoint on Dec 22, 2023
70 of 72 checks passed
@calherries deleted the appends/use-transaction branch on December 22, 2023 at 10:27