CSV appends: Handle _mb_row_id columns #36705

Merged 17 commits into appends/milestone-0-endpoint on Dec 15, 2023

Conversation

@calherries calherries commented Dec 12, 2023

This PR will merge into the feature branch for Merge 1 of Milestone 0 (PR)

Epic: Allow appending more data to CSV uploads
product doc
eng doc

This PR implements special behaviour for _mb_row_id columns for CSV appends.

It corresponds to the following section of the eng doc (block link):

Special behaviour around _mb_row_id (a minimal sketch follows this list):

  • If a CSV file with _mb_row_id is uploaded, we will ignore the _mb_row_id column in the CSV
  • If a CSV file without _mb_row_id is uploaded, we will create the column anyway
  • This column is assumed to be auto-incrementing. We use the max value of this to roll back changes when failures occur during insertion
  • Test: If a CSV file with _mb_row_id is uploaded to a table that doesn't have it, we will ignore the _mb_row_id column in the CSV but create the column anyway
  • Test: If a CSV file without _mb_row_id is uploaded and the table doesn't have it either, we will create the column anyway
  • Test: If a CSV file without _mb_row_id is uploaded and the table does have it, the upload works just fine (it's not required)
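A minimal, self-contained sketch of the first bullet: dropping any _mb_row_id column from the incoming CSV (hypothetical helper, not the PR's actual code; the column-name normalization here is a simplifying assumption):

(require '[clojure.string :as str])

(def auto-pk-column-name "_mb_row_id")

(defn- drop-auto-pk-column
  "Removes any _mb_row_id column from a parsed CSV: a seq of row vectors
  whose first element is the header."
  [[header & rows]]
  (let [keep-idxs (vec (keep-indexed (fn [idx col]
                                       (when-not (= auto-pk-column-name
                                                    (str/lower-case (str/trim col)))
                                         idx))
                                     header))]
    (map (fn [row] (mapv #(nth row % nil) keep-idxs))
         (cons header rows))))

;; (drop-auto-pk-column [["_mb_row_id" "name"] ["1" "Obi-Wan Kenobi"]])
;; => (["name"] ["Obi-Wan Kenobi"])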

@metabase-bot added the .Team/BackendComponents (also known as BEC) label Dec 12, 2023
@calherries changed the title from "Handle _mb_row_id columns in the uploaded table and CSV" to "CSV appends: Handle _mb_row_id columns in the uploaded table and CSV" Dec 12, 2023
@calherries requested a review from a team December 12, 2023 16:02
@calherries changed the title from "CSV appends: Handle _mb_row_id columns in the uploaded table and CSV" to "CSV appends: Handle _mb_row_id columns" Dec 12, 2023
src/metabase/upload.clj (outdated review threads, resolved)

{:keys [parse-rows table-col-names table-has-auto-pk?]} (check-schema table header)
driver          (driver.u/database->driver database)
auto-pk-type    (driver/upload-type->database-type driver ::auto-incrementing-int-pk)
create-auto-pk? (not table-has-auto-pk?)]
(try
  (let [parsed-rows (parse-rows rows)]
Contributor:

Just realized that we are realizing the whole "rows" seq here.
I think it'd be nice if driver/insert-into! took a lazy seq and returned the number of items inserted; that way we wouldn't have to realize the whole seq beforehand.

This is not a blocker, but a potential TODO.
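
For illustration, a rough sketch of the lazy approach suggested here (hypothetical: insert-batch! stands in for the driver-specific batch insert; this is not the PR's actual code):

(defn insert-rows!
  "Consumes rows lazily in fixed-size batches, calling insert-batch! on each
  batch, and returns the total number of rows inserted."
  [insert-batch! rows]
  (reduce (fn [n batch]
            (insert-batch! batch)
            (+ n (count batch)))
          0
          (partition-all 1000 rows)))

Because partition-all is lazy, at most one batch of 1000 rows is realized at a time.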

src/metabase/upload.clj (outdated review threads, resolved)

;; check for duplicates
(when (some #(< 1 %) (vals (frequencies normalized-header)))
  (throw (ex-info (tru "The CSV file contains duplicate column names.")
                  {:status-code 422})))
(if (and (empty? extra) (empty? missing))
  {:parse-rows (fn [rows]
Contributor:

This is duplicated with what we have in detect-schema.

It makes me think about whether we should stop having these functions return a lexically scoped function like this.

WDYT about having a function

(defn parse-rows [col->types rows excluded-columns-idx])

This way we can decouple schema checking from parsing the rows.

@calherries (Author) commented Dec 13, 2023:

You have a point: ideally these shouldn't be coupled, and the duplicated logic can be separated. Let's think about this. check-schema is doing a lot here, and its logic can definitely be split.

Schema checking and parsing the rows are coupled in the sense that:
(1) both the header and the rows need to have their auto-pk columns removed, and
(2) the schema checking uses the types of the fields to know which parser to apply to each column.

Creating (defn parse-rows [col->types rows excluded-columns-idx]) might help reduce some coupling, because the coupling from (1) becomes explicit in the form of the excluded-columns-idx argument. But the functions are still coupled by the shared decision to remove the auto-pk columns. To decouple them fully, I could remove the auto-pk columns from both the header argument of schema-check and the rows argument of parse-rows.

If we have these functions:

(defn auto-pk-col-indices [header])
(defn schema-check [table-fields header-without-auto-pk-columns])
(defn parse-rows [col-upload-types rows-without-auto-pk-columns])

Then have them called like so in append-csv!*:

          excluded-columns-indices (auto-pk-col-indices header)
          header'                  (remove-indices excluded-columns-indices header)
          rows'                    (map (partial remove-indices excluded-columns-indices) rows)
          table-fields             (t2/select :model/Field :table_id (:id table) :active true)
          fields-by-normed-name    (m/index-by (comp normalize-column-name :name) table-fields)
          table-has-auto-pk?       (contains? fields-by-normed-name auto-pk-column-name)
          col-upload-types         (map (comp base-type->upload-type :base_type fields-by-normed-name normalize-column-name) header')
          _                        (check-schema table-fields header') ;; all this does is raise errors
          ... then later
          parsed-rows              (parse-rows col-upload-types rows')

Now (1) and (2) are both outside check-schema, and parse-rows doesn't have any knowledge about (1).

wdyt?
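
To make this concrete, a rough sketch of the two key helpers (hypothetical: the settings argument and the exact shapes are assumptions pieced together from snippets elsewhere in this PR, not its final code):

(defn- auto-pk-col-indices
  "Returns the set of header indices whose normalized name is the auto-pk
  column name."
  [header]
  (set (keep-indexed (fn [idx col]
                       (when (= auto-pk-column-name (normalize-column-name col))
                         idx))
                     header)))

(defn- parse-rows
  "Parses rows of strings using one upload type per column; blank cells
  become nil."
  [col-upload-types rows settings]
  (let [parsers (map #(upload-parsing/upload-type->parser % settings) col-upload-types)]
    (for [row rows]
      (mapv (fn [value parser]
              (when-not (str/blank? value)
                (parser value)))
            row parsers))))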

Contributor:

awesome, I think this is much better :)

@qnkhuat (Contributor) left a comment:

Have a couple of suggestions.

Also you mentioned

This column is assumed to be auto-incrementing. We use the max value of this to roll back changes when failures occur during insertion

I don't see this being handled anywhere. Is it part of this PR?

@calherries (Author):

We use the max value of this to roll back changes when failures occur during insertion

No, that will be implemented with type mismatches exception handling.


deploysentinel bot commented Dec 13, 2023

No failed tests 🎉

@calherries (Author) commented Dec 13, 2023:

@qnkhuat I've implemented those tidy-ups, but MySQL tests are still failing because of this bug I just found:
Uploaded CSVs with datetime fields create TIMESTAMP columns instead of DATETIME

I will fix that test before this PR merges to the feature branch, either by fixing the bug first or by temporarily excluding the datetime column from the test and fixing the bug later. Probably the latter.

{:keys [parse-rows table-col-names]} (check-schema table header)]
(let [[header & rows]       (rows-without-auto-pk-columns (csv/read-csv reader))
      driver                (driver.u/database->driver database)
      fields-by-normed-name (m/index-by (comp normalize-column-name :name)
Contributor:

Suggested change:
- fields-by-normed-name (m/index-by (comp normalize-column-name :name)
+ normed-name->field    (m/index-by (comp normalize-column-name :name)

I think this is what we often use when naming maps

  [header]
  (set (indices-where #(= auto-pk-column-name (normalize-column-name %)) header)))

(defn- rows-without-auto-pk-columns
Contributor:

Suggested change:
- (defn- rows-without-auto-pk-columns
+ (defn- csv-without-auto-pk-columns

This deals with both the header and the rows, so calling it rows is misleading.

@calherries (Author) commented Dec 14, 2023:

csv is misleading too, because it's kind of independent of the format: the CSV has already been parsed into vectors of rows and columns. How about without-auto-pk-columns? Then anything that has columns (like the header and rows) is acceptable and there's no confusion.

@calherries (Author):

I renamed it to without-auto-pk-columns and used header-and-rows in the arglist:

(defn- without-auto-pk-columns
  [header-and-rows]
  (let [header (first header-and-rows)
        auto-pk-indices (auto-pk-column-indices header)]
    (cond->> header-and-rows
      auto-pk-indices
      (map (partial remove-indices auto-pk-indices)))))
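
For reference, a plausible definition of the indices-where helper used by auto-pk-column-indices above (an assumption inferred from the diff, not necessarily the PR's exact code):

(defn- indices-where
  "Returns a lazy seq of the indices of elements in coll for which pred
  returns truthy."
  [pred coll]
  (keep-indexed (fn [idx item]
                  (when (pred item)
                    idx))
                coll))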

@@ -158,6 +158,17 @@
  (doseq [sql sqls]
    (qp.writeback/execute-write-sql! db-id sql))))

(defmethod driver/add-columns! :sql-jdbc
  [driver db-id table-name col->type]
  (let [table-name (keyword table-name)
Contributor:

Suggested change:
- (let [table-name (keyword table-name)
+ (let [

You're keyword-ing it twice.
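
As a hedged sketch (assumptions: HoneySQL 2.x's :alter-table/:add-column clauses and the qp.writeback/execute-write-sql! helper from the diff above; the PR's actual body may differ), the method could build one ALTER TABLE statement per new column:

(defmethod driver/add-columns! :sql-jdbc
  [driver db-id table-name col->type]
  (let [sqls (for [[col-name col-type] col->type]
               ;; e.g. ["ALTER TABLE t ADD COLUMN c INT"]
               (first (sql/format {:alter-table (keyword table-name)
                                   :add-column  [col-name col-type]})))]
    (doseq [sql sqls]
      (qp.writeback/execute-write-sql! db-id sql))))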

(defn- remove-indices [indices coll]
  (keep-indexed (fn [idx item]
                  (when-not (contains? indices idx) item))
                coll))

(defn- get-auto-pk-column [table-id]
Contributor:

Suggested change:
- (defn- get-auto-pk-column [table-id]
+ (defn- table-id->auto-pk-field-id [table-id]

I think this is more descriptive

parsers (map #(upload-parsing/upload-type->parser % settings) col-upload-types)]
(for [row rows]
  (for [[value parser] (map-with-nils vector row parsers)]
    (when (not (str/blank? value))
Contributor:

Suggested change:
- (when (not (str/blank? value))
+ (when-not (str/blank? value)
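
A hypothetical sketch of map-with-nils, assuming (from its name and this usage) that it behaves like clojure.core/map but continues to the end of the longest collection, padding exhausted ones with nil:

(defn- map-with-nils
  [f & colls]
  (lazy-seq
   (when (some seq colls)
     (cons (apply f (map first colls))
           (apply map-with-nils f (map rest colls))))))

With that behaviour, a CSV row shorter than the header still lines up with its parsers, and the missing trailing cells parse to nil.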

(testing "Check the data was not uploaded into the table"
  (is (= [[1 "Obi-Wan Kenobi"]]
         (rows-for-table table))))
(io/delete-file file))))))))

(deftest append-mb-row-id-1-test
Contributor:

Suggested change:
- (deftest append-mb-row-id-1-test
+ (deftest append-mb-row-id-1-test

I'm not a fan of naming tests with numbers; I think we can come up with a short and meaningful name here. It helps readers quickly scan what has been covered.

@qnkhuat (Contributor) left a comment:

The new structure is awesome!

I have some suggestions but those are not blocking.

@qnkhuat (Contributor) left a comment:

Wrong button ✅

@calherries (Author):

Suggestions implemented here: dc9ddb8

Thanks for the great review, as always

@calherries calherries merged commit c5d0767 into appends/milestone-0-endpoint Dec 15, 2023
103 checks passed
@calherries calherries deleted the appends/mb-row-id branch December 15, 2023 10:07
Labels: .Team/BackendComponents (also known as BEC)
2 participants