CSV appends: Handle _mb_row_id columns #36705
Conversation
src/metabase/upload.clj (outdated)

{:keys [parse-rows table-col-names table-has-auto-pk?]} (check-schema table header)
driver (driver.u/database->driver database)
auto-pk-type (driver/upload-type->database-type driver ::auto-incrementing-int-pk)
create-auto-pk? (not table-has-auto-pk?)]
(try
  (let [parsed-rows (parse-rows rows)]
Just realized that we are realizing the whole "rows" seq here. I think it'd be nice if driver/insert-into! took a lazy seq and returned the number of items inserted; that way we wouldn't have to realize the whole seq beforehand. This is not a blocker, but a potential TODO.
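A rough sketch of what that could look like (insert-rows-lazily! and insert-batch! are hypothetical names for illustration, not the actual driver API):

```clojure
;; Consume a possibly lazy seq of rows in fixed-size batches, so the
;; caller never has to hold the fully realized seq, and return the
;; total number of rows inserted.
(defn insert-rows-lazily!
  "Feeds `rows` to `insert-batch!` (a fn of one batch) in chunks of 1000,
  returning the total count of inserted rows."
  [insert-batch! rows]
  (transduce (comp (partition-all 1000)
                   (map (fn [batch]
                          (insert-batch! batch)
                          (count batch))))
             + 0 rows))
```

Because transduce drives the seq incrementally, only one batch needs to be in memory at a time.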
src/metabase/upload.clj (outdated)

;; check for duplicates
(when (some #(< 1 %) (vals (frequencies normalized-header)))
  (throw (ex-info (tru "The CSV file contains duplicate column names.")
                  {:status-code 422})))
(if (and (empty? extra) (empty? missing))
  {:parse-rows (fn [rows]
This is duplicated with what we have in detect-schema. It makes me think about whether we should stop having these functions return a lexically scoped function like this. WDYT about having a function

(defn parse-rows [col->types rows excluded-columns-idx])

This way we can decouple schema checking from parsing the rows.
You have a point: ideally these shouldn't be coupled, and the duplicated logic can be separated. Let's think about this. check-schema is doing a lot here, and its logic can definitely be split.

Schema checking and parsing the rows are coupled in the sense that:
(1) both the header and the rows need to have their auto-pk columns removed.
(2) the schema checking uses the types of the fields to know which parser to apply to each column.

Creating (defn parse-rows [col->types rows excluded-columns-idx]) might help reduce some coupling, because the coupling from (1) becomes explicit in the form of the excluded-columns-idx argument. But they're still coupled by the shared decision to remove the auto-pk columns. To decouple them fully, I could remove the auto-pk column from both the header argument of schema-check and the rows argument of parse-rows.

If we have these functions:

(defn auto-pk-col-indices [header])
(defn schema-check [table-fields header-without-auto-pk-columns])
(defn parse-rows [col-upload-types rows-without-auto-pk-columns])

Then have them called like so in append-csv!*:
excluded-columns-indices (auto-pk-col-indices header)
header' (remove-indices excluded-columns-indices header)
rows' (map (partial remove-indices excluded-columns-indices) rows)
table-fields (t2/select :model/Field :table_id (:id table) :active true)
fields-by-normed-name (m/index-by (comp normalize-column-name :name) table-fields)
table-has-auto-pk? (contains? fields-by-normed-name auto-pk-column-name)
col-upload-types (map (comp base-type->upload-type :base_type fields-by-normed-name normalize-column-name) header')
_ (check-schema table-fields header') ;; all this does is raise errors
;; ... then later
parsed-rows (parse-rows col-upload-types rows')
Now (1) and (2) are both outside check-schema, and parse-rows doesn't have any knowledge of (1). WDYT?
awesome, I think this is much better :)
Have a couple of suggestions. Also, you mentioned:

"This column is assumed to be auto-incrementing. We use the max value of this to roll back changes when failures occur during insertion."

I don't see this being handled anywhere. Is it part of this PR?

No, that will be implemented with the type-mismatch exception handling.
No failed tests 🎉
@qnkhuat I've implemented those tidy-ups, but the MySQL tests are still failing because of this bug I just found. I will fix that test before this PR merges to the feature branch, either by fixing the bug first or by temporarily excluding the datetime column from the test and fixing the bug later. Probably the latter.
src/metabase/upload.clj (outdated)

(let [[header & rows] (rows-without-auto-pk-columns (csv/read-csv reader))
      {:keys [parse-rows table-col-names]} (check-schema table header)
      driver (driver.u/database->driver database)
      fields-by-normed-name (m/index-by (comp normalize-column-name :name)
Suggested change:
-fields-by-normed-name (m/index-by (comp normalize-column-name :name)
+normed-name->field (m/index-by (comp normalize-column-name :name)

I think this matches the naming convention we often use for maps.
src/metabase/upload.clj (outdated)

[header]
(set (indices-where #(= auto-pk-column-name (normalize-column-name %)) header)))

(defn- rows-without-auto-pk-columns
Suggested change:
-(defn- rows-without-auto-pk-columns
+(defn- csv-without-auto-pk-columns

This deals with both the header and the rows, so calling it rows is misleading.
csv is misleading too, because it's somewhat independent of the format: the CSV has already been parsed into vectors of rows and columns. How about without-auto-pk-columns? Then anything that has columns (like the header and the rows) is acceptable and there's no confusion.
I renamed it to without-auto-pk-columns and used header-and-rows in the arglist:
(defn- without-auto-pk-columns
[header-and-rows]
(let [header (first header-and-rows)
auto-pk-indices (auto-pk-column-indices header)]
(cond->> header-and-rows
auto-pk-indices
(map (partial remove-indices auto-pk-indices)))))
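A quick usage sketch (assuming normalize-column-name maps "_mb_row_id" to auto-pk-column-name, as elsewhere in this file, and that remove-indices is the keep-indexed helper above):

```clojure
;; The _mb_row_id column is dropped from the header and the matching
;; cell is dropped from each row; other columns pass through untouched.
(without-auto-pk-columns [["_mb_row_id" "name"]
                          ["1"          "Obi-Wan Kenobi"]])
;; => (("name") ("Obi-Wan Kenobi"))
```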
src/metabase/driver/sql_jdbc.clj (outdated)

@@ -158,6 +158,17 @@
(doseq [sql sqls]
  (qp.writeback/execute-write-sql! db-id sql))))

(defmethod driver/add-columns! :sql-jdbc
  [driver db-id table-name col->type]
  (let [table-name (keyword table-name)
Suggested change:
-(let [table-name (keyword table-name)
+(let [

You're keyword-ing it twice.
src/metabase/upload.clj (outdated)

(when-not (contains? indices idx)
  item))
coll))

(defn- get-auto-pk-column [table-id]
Suggested change:
-(defn- get-auto-pk-column [table-id]
+(defn- table-id->auto-pk-field-id [table-id]

I think this is more descriptive.
src/metabase/upload.clj (outdated)

parsers (map #(upload-parsing/upload-type->parser % settings) col-upload-types)]
(for [row rows]
  (for [[value parser] (map-with-nils vector row parsers)]
    (when (not (str/blank? value))
Suggested change:
-(when (not (str/blank? value))
+(when-not (str/blank? value)
test/metabase/upload_test.clj (outdated)

(testing "Check the data was not uploaded into the table"
  (is (= [[1 "Obi-Wan Kenobi"]]
         (rows-for-table table))))
(io/delete-file file))))))))

(deftest append-mb-row-id-1-test
I'm not a fan of naming tests with numbers; I think we can come up with a short, meaningful name here. It helps the reader quickly scan what has been covered.
The new structure is awesome! I have some suggestions, but they're not blocking.
Wrong button ✅
Suggestions implemented here: dc9ddb8. Thanks for the great review, as always.
This PR will merge into the feature branch for Merge 1 of Milestone 0 (PR)
Epic: Allow appending more data to CSV uploads
product doc
eng doc
This PR implements special behaviour for _mb_row_id columns for CSV appends. It corresponds to the following section of the eng doc:
Special behaviour around _mb_row_id:
- If a CSV with _mb_row_id is uploaded, we will ignore the _mb_row_id column in the CSV
- If a CSV without _mb_row_id is uploaded, we will create the column anyway
- If a CSV with _mb_row_id is uploaded to a table that doesn't have it, we will ignore the _mb_row_id column in the CSV but create the column anyway
- If a CSV without _mb_row_id is uploaded and the table doesn't have it either, we will create the column anyway
- If a CSV without _mb_row_id is uploaded and the table does have it, we can upload just fine (it's not required)
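The behaviour above boils down to two independent decisions. A tiny hypothetical helper (illustration only, not the PR's actual code) makes the matrix explicit:

```clojure
;; Hypothetical sketch of the _mb_row_id decision matrix: the CSV's own
;; _mb_row_id column is never trusted, and the table column is created
;; whenever it doesn't already exist.
(defn mb-row-id-plan
  "Given whether the CSV and the target table have an _mb_row_id column,
  return what to do with the CSV column and the table column."
  [csv-has-mb-row-id? table-has-mb-row-id?]
  {:ignore-csv-column? csv-has-mb-row-id?
   :create-column?     (not table-has-mb-row-id?)})
```

For example, a CSV with _mb_row_id appended to a table without it yields {:ignore-csv-column? true, :create-column? true}, matching the third bullet.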