CSV appends: Use a transaction for inserting data #36995
Conversation
What are your thoughts on memory usage?
Are you referring to the memory footprint on the JVM, or the database? Memory usage of the JVM should be independent of whether transactions are used, because the objects can be garbage collected after they are inserted. As for the database, MySQL and Postgres are smart enough to write large volumes of data to disk, so memory isn't an issue there either; otherwise large transactions would be impossible to execute.
I was thinking about database memory usage.
Cool, that's good to know. In that case we don't need to worry about blowing up the data warehouse.
src/metabase/driver/mysql.clj
Outdated
(let [temp-file (File/createTempFile table-name ".tsv")
      file-path (.getAbsolutePath temp-file)]
  (try
    (let [tsvs (map (partial row->tsv driver (count column-names)) values)
          sql  (sql/format {::load [file-path (keyword table-name)]
                            :columns (map keyword column-names)}
                           :quoted true
                           :dialect (sql.qp/quote-style driver))]
      (with-open [^java.io.Writer writer (jio/writer file-path)]
        (doseq [value (interpose \newline tsvs)]
          (.write writer (str value)))))
This should be done outside of the transaction.
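A minimal sketch of what that suggestion could look like, assuming next.jdbc is required as jdbc and java.io.File is imported; write-tsv! and load-data-sql are hypothetical helper names, not Metabase's actual internals:

;; Sketch only: the file I/O happens before the transaction opens, so the
;; transaction covers nothing but the LOAD DATA statement itself.
;; `write-tsv!` and `load-data-sql` are hypothetical helpers.
(let [temp-file (doto (File/createTempFile table-name ".tsv")
                  (write-tsv! tsv-rows))]
  (jdbc/with-transaction [tx datasource]
    (jdbc/execute! tx [(load-data-sql temp-file)])))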
src/metabase/driver/sql_jdbc.clj
Outdated
(let [table-name (keyword table-name)
      columns    (map keyword column-names)
      ;; We need to partition the insert into multiple statements for both performance and correctness.
      ;;
      ;; On Postgres with a large file, 100 (3.76m) was significantly faster than 50 (4.03m) and 25 (4.27m).
      ;; 1,000 was a little faster but not by much (3.63m), and 10,000 threw an error:
      ;;   PreparedStatement can have at most 65,535 parameters
      ;; One imagines that `(long (/ 65535 (count columns)))` might be best, but I don't trust the 65K limit
      ;; to apply across all drivers. With that in mind, 100 seems like a safe compromise.
      ;; There's nothing magic about 100, but it felt good in testing. There could well be a better number.
      chunks     (partition-all (or driver/*insert-chunk-rows* 100) values)
      sqls       (map #(sql/format {:insert-into table-name
                                    :columns     columns
                                    :values      %}
                                   :quoted true
                                   :dialect (sql.qp/quote-style driver))
                      chunks)]
Let's move this preparation outside of the transaction.
Some preparation steps should be done outside of the transaction; other than that, this looks good.
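A sketch of the shape the reviewers are asking for, again assuming next.jdbc; the datasource binding is illustrative, and table-name, columns, and chunks are the bindings from the diff above. Format all the SQL up front, then open the transaction only for the actual INSERTs:

;; Preparation (formatting each chunk's INSERT) happens outside the transaction;
;; only execution happens inside it, keeping the transaction as short as possible.
(let [sqls (mapv #(sql/format {:insert-into table-name
                               :columns     columns
                               :values      %})
                 chunks)]
  (jdbc/with-transaction [tx datasource]
    (doseq [sql-params sqls]
      (jdbc/execute! tx sql-params))))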
(try
  (driver/insert-into! driver (:id database) (table-identifier table) normed-header parsed-rows)
  (catch Throwable e
LGTM
1 failed test on run #904
Details: Review all test suite changes for PR #36995
cdb50f5 into appends/milestone-0-endpoint
This PR will merge into the feature branch for Merge 1 of Milestone 0 (PR).
Epic: Allow appending more data to CSV uploads
product doc
eng doc
This PR mostly preserves observable behaviour. It replaces the logic I created to fake transactions when handling insertion errors with real transactions.

Originally I wanted to use _mb_row_id for deleting inserted rows if there was a failure, because I thought the reason we weren't using transactions before was to make insertions faster. It turns out I was wrong: using a transaction is actually slightly faster, because a transaction doesn't add much overhead but each commit does.

I say this PR "mostly" preserves observable behaviour because it makes one change: if a failure occurs while inserting data and the table didn't have a _mb_row_id column before, no _mb_row_id column is created. That behaviour can be seen in append-no-mb-row-id-failure-test.
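For context, a minimal sketch of the real-transaction pattern described here, using next.jdbc; the names (append-rows!, datasource, table, columns, rows) are illustrative, not Metabase's actual implementation:

(ns example.csv-append
  (:require [next.jdbc :as jdbc]
            [next.jdbc.sql :as sql]))

;; Insert every chunk inside a single transaction: if any chunk fails, the
;; transaction rolls back and no partially-appended rows are left behind,
;; removing the need to track and delete them via _mb_row_id.
(defn append-rows! [datasource table columns rows]
  (jdbc/with-transaction [tx datasource]
    (doseq [chunk (partition-all 100 rows)]   ; 100 matches the chunk size above
      (sql/insert-multi! tx table columns chunk))))

Because the rollback is handled by the database, the failure path needs no compensating deletes, which is why the _mb_row_id bookkeeping could be dropped.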