CSV appends: Handle _mb_row_id columns #36705

Merged 17 commits into appends/milestone-0-endpoint on Dec 15, 2023

Conversation

@calherries calherries commented Dec 12, 2023

This PR will merge into the feature branch for Merge 1 of Milestone 0 (PR)

Epic: Allow appending more data to CSV uploads
product doc
eng doc

This PR implements special behaviour for _mb_row_id columns for CSV appends.

It corresponds to the following section of the eng doc (block link):

Special behaviour around _mb_row_id (a minimal sketch follows this list):

  • If a CSV file with _mb_row_id is uploaded, we will ignore the _mb_row_id column in the CSV
  • If a CSV file without _mb_row_id is uploaded, we will create the column anyway
  • This column is assumed to be auto-incrementing. We use the max value of this to roll back changes when failures occur during insertion
  • Test: If a CSV file with _mb_row_id is uploaded to a table that doesn't have it, we will ignore the _mb_row_id column in the CSV but create the column anyway
  • Test: If a CSV file without _mb_row_id is uploaded and the table doesn't have it either, we will create the column anyway
  • Test: If a CSV file without _mb_row_id is uploaded and the table does have it, the upload works just fine (it's not required)
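A minimal, self-contained sketch of the first bullet: dropping any _mb_row_id column from the incoming CSV (hypothetical helper, not the PR's actual code; the column-name normalization here is a simplifying assumption):

(require '[clojure.string :as str])

(def auto-pk-column-name "_mb_row_id")

(defn- drop-auto-pk-column
  "Removes any _mb_row_id column from a parsed CSV: a seq of row vectors
  whose first element is the header."
  [[header & rows]]
  (let [keep-idxs (vec (keep-indexed (fn [idx col]
                                       (when-not (= auto-pk-column-name
                                                    (str/lower-case (str/trim col)))
                                         idx))
                                     header))]
    (map (fn [row] (mapv #(nth row % nil) keep-idxs))
         (cons header rows))))

;; (drop-auto-pk-column [["_mb_row_id" "name"] ["1" "Obi-Wan Kenobi"]])
;; => (["name"] ["Obi-Wan Kenobi"])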

@metabase-bot added the .Team/BackendComponents (also known as BEC) label Dec 12, 2023
@calherries changed the title from "Handle _mb_row_id columns in the uploaded table and CSV" to "CSV appends: Handle _mb_row_id columns in the uploaded table and CSV" Dec 12, 2023
@calherries requested a review from a team December 12, 2023 16:02
@calherries changed the title from "CSV appends: Handle _mb_row_id columns in the uploaded table and CSV" to "CSV appends: Handle _mb_row_id columns" Dec 12, 2023
src/metabase/upload.clj (outdated review threads, resolved)

{:keys [parse-rows table-col-names table-has-auto-pk?]} (check-schema table header)
driver          (driver.u/database->driver database)
auto-pk-type    (driver/upload-type->database-type driver ::auto-incrementing-int-pk)
create-auto-pk? (not table-has-auto-pk?)]
(try
  (let [parsed-rows (parse-rows rows)]
Contributor:

Just realized that we are realizing the whole "rows" seq here.
I think it'd be nice if driver/insert-into! took a lazy seq and returned the number of items inserted; that way we wouldn't have to realize the whole seq beforehand.

This is not a blocker, but a potential TODO.
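
For illustration, a rough sketch of the lazy approach suggested here (hypothetical: insert-batch! stands in for the driver-specific batch insert; this is not the PR's actual code):

(defn insert-rows!
  "Consumes rows lazily in fixed-size batches, calling insert-batch! on each
  batch, and returns the total number of rows inserted."
  [insert-batch! rows]
  (reduce (fn [n batch]
            (insert-batch! batch)
            (+ n (count batch)))
          0
          (partition-all 1000 rows)))

Because partition-all is lazy, at most one batch of 1000 rows is realized at a time.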

src/metabase/upload.clj (outdated review threads, resolved)

;; check for duplicates
(when (some #(< 1 %) (vals (frequencies normalized-header)))
  (throw (ex-info (tru "The CSV file contains duplicate column names.")
                  {:status-code 422})))
(if (and (empty? extra) (empty? missing))
  {:parse-rows (fn [rows]
Contributor:

This is duplicated with what we have in detect-schema.

It makes me think about whether we should stop having these functions return a lexically scoped function like this.

WDYT about having a function

(defn parse-rows [col->types rows excluded-columns-idx])

This way we can decouple schema checking from parsing the rows.

@calherries (Author) commented Dec 13, 2023:

You have a point: ideally these shouldn't be coupled, and the duplicated logic can be separated. Let's think about this. check-schema is doing a lot here, and its logic can definitely be split.

Schema checking and parsing the rows are coupled in the sense that:
(1) both the header and the rows need to have their auto-pk columns removed, and
(2) the schema checking uses the types of the fields to know which parser to apply to each column.

Creating (defn parse-rows [col->types rows excluded-columns-idx]) might help reduce some coupling, because the coupling from (1) becomes explicit in the form of the excluded-columns-idx argument. But the functions are still coupled by the shared decision to remove the auto-pk columns. To decouple them fully, I could remove the auto-pk columns from both the header argument of schema-check and the rows argument of parse-rows.

If we have these functions:

(defn auto-pk-col-indices [header])
(defn schema-check [table-fields header-without-auto-pk-columns])
(defn parse-rows [col-upload-types rows-without-auto-pk-columns])

Then have them called like so in append-csv!*:

          excluded-columns-indices (auto-pk-col-indices header)
          header'                  (remove-indices excluded-columns-indices header)
          rows'                    (map (partial remove-indices excluded-columns-indices) rows)
          table-fields             (t2/select :model/Field :table_id (:id table) :active true)
          fields-by-normed-name    (m/index-by (comp normalize-column-name :name) table-fields)
          table-has-auto-pk?       (contains? fields-by-normed-name auto-pk-column-name)
          col-upload-types         (map (comp base-type->upload-type :base_type fields-by-normed-name normalize-column-name) header')
          _                        (check-schema table-fields header') ;; all this does is raise errors
          ... then later
          parsed-rows              (parse-rows col-upload-types rows')

Now (1) and (2) are both outside check-schema, and parse-rows doesn't have any knowledge about (1).

wdyt?
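
To make this concrete, a rough sketch of the two key helpers (hypothetical: the settings argument and the exact shapes are assumptions pieced together from snippets elsewhere in this PR, not its final code):

(defn- auto-pk-col-indices
  "Returns the set of header indices whose normalized name is the auto-pk
  column name."
  [header]
  (set (keep-indexed (fn [idx col]
                       (when (= auto-pk-column-name (normalize-column-name col))
                         idx))
                     header)))

(defn- parse-rows
  "Parses rows of strings using one upload type per column; blank cells
  become nil."
  [col-upload-types rows settings]
  (let [parsers (map #(upload-parsing/upload-type->parser % settings) col-upload-types)]
    (for [row rows]
      (mapv (fn [value parser]
              (when-not (str/blank? value)
                (parser value)))
            row parsers))))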

Contributor:

awesome, I think this is much better :)

@qnkhuat (Contributor) left a comment:

Have a couple of suggestions.

Also you mentioned

This column is assumed to be auto-incrementing. We use the max value of this to roll back changes when failures occur during insertion

I don't see this being handled anywhere. Is it part of this PR?

@calherries (Author):

We use the max value of this to roll back changes when failures occur during insertion

No, that will be implemented with type mismatches exception handling.


deploysentinel bot commented Dec 13, 2023

No failed tests 🎉

@calherries (Author) commented Dec 13, 2023:

@qnkhuat I've implemented those tidy-ups, but MySQL tests are still failing because of this bug I just found:
Uploaded CSVs with datetime fields create TIMESTAMP columns instead of DATETIME

I will fix that test before this PR merges to the feature branch, either by fixing the bug first or by temporarily excluding the datetime column from the test and fixing the bug later. Probably the latter.

{:keys [parse-rows table-col-names]} (check-schema table header)]
(let [[header & rows]       (rows-without-auto-pk-columns (csv/read-csv reader))
      driver                (driver.u/database->driver database)
      fields-by-normed-name (m/index-by (comp normalize-column-name :name)
Contributor:

Suggested change:
- fields-by-normed-name (m/index-by (comp normalize-column-name :name)
+ normed-name->field    (m/index-by (comp normalize-column-name :name)

I think this is what we often use when naming maps

  [header]
  (set (indices-where #(= auto-pk-column-name (normalize-column-name %)) header)))

(defn- rows-without-auto-pk-columns
Contributor:

Suggested change:
- (defn- rows-without-auto-pk-columns
+ (defn- csv-without-auto-pk-columns

This deals with both the header and the rows, so calling it rows is misleading.

@calherries (Author) commented Dec 14, 2023:

csv is misleading too, because it's kind of independent of the format: the CSV has already been parsed into vectors of rows and columns. How about without-auto-pk-columns? Then anything that has columns (like the header and rows) is acceptable and there's no confusion.

@calherries (Author):

I renamed it to without-auto-pk-columns and used header-and-rows in the arglist:

(defn- without-auto-pk-columns
  [header-and-rows]
  (let [header (first header-and-rows)
        auto-pk-indices (auto-pk-column-indices header)]
    (cond->> header-and-rows
      auto-pk-indices
      (map (partial remove-indices auto-pk-indices)))))
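
For reference, a plausible definition of the indices-where helper used by auto-pk-column-indices above (an assumption inferred from the diff, not necessarily the PR's exact code):

(defn- indices-where
  "Returns a lazy seq of the indices of elements in coll for which pred
  returns truthy."
  [pred coll]
  (keep-indexed (fn [idx item]
                  (when (pred item)
                    idx))
                coll))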

@@ -158,6 +158,17 @@
  (doseq [sql sqls]
    (qp.writeback/execute-write-sql! db-id sql))))

(defmethod driver/add-columns! :sql-jdbc
  [driver db-id table-name col->type]
  (let [table-name (keyword table-name)
Contributor:

Suggested change:
- (let [table-name (keyword table-name)
+ (let [

You're keyword-ing it twice.
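
As a hedged sketch (assumptions: HoneySQL 2.x's :alter-table/:add-column clauses and the qp.writeback/execute-write-sql! helper from the diff above; the PR's actual body may differ), the method could build one ALTER TABLE statement per new column:

(defmethod driver/add-columns! :sql-jdbc
  [driver db-id table-name col->type]
  (let [sqls (for [[col-name col-type] col->type]
               ;; e.g. ["ALTER TABLE t ADD COLUMN c INT"]
               (first (sql/format {:alter-table (keyword table-name)
                                   :add-column  [col-name col-type]})))]
    (doseq [sql sqls]
      (qp.writeback/execute-write-sql! db-id sql))))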

(defn- remove-indices [indices coll]
  (keep-indexed (fn [idx item]
                  (when-not (contains? indices idx) item))
                coll))

(defn- get-auto-pk-column [table-id]
Contributor:

Suggested change:
- (defn- get-auto-pk-column [table-id]
+ (defn- table-id->auto-pk-field-id [table-id]

I think this is more descriptive

parsers (map #(upload-parsing/upload-type->parser % settings) col-upload-types)]
(for [row rows]
  (for [[value parser] (map-with-nils vector row parsers)]
    (when (not (str/blank? value))
Contributor:

Suggested change:
- (when (not (str/blank? value))
+ (when-not (str/blank? value)
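
A hypothetical sketch of map-with-nils, assuming (from its name and this usage) that it behaves like clojure.core/map but continues to the end of the longest collection, padding exhausted ones with nil:

(defn- map-with-nils
  [f & colls]
  (lazy-seq
   (when (some seq colls)
     (cons (apply f (map first colls))
           (apply map-with-nils f (map rest colls))))))

With that behaviour, a CSV row shorter than the header still lines up with its parsers, and the missing trailing cells parse to nil.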

(testing "Check the data was not uploaded into the table"
  (is (= [[1 "Obi-Wan Kenobi"]]
         (rows-for-table table))))
(io/delete-file file))))))))

(deftest append-mb-row-id-1-test
Contributor:

Suggested change:
- (deftest append-mb-row-id-1-test
+ (deftest append-mb-row-id-1-test

I'm not a fan of naming tests with numbers; I think we can come up with a short and meaningful name here. It helps readers quickly scan what has been covered.

@qnkhuat (Contributor) left a comment:

The new structure is awesome!

I have some suggestions but those are not blocking.

@qnkhuat (Contributor) left a comment:

Wrong button ✅

@calherries (Author):

Suggestions implemented here: dc9ddb8

Thanks for the great review, as always

@calherries calherries merged commit c5d0767 into appends/milestone-0-endpoint Dec 15, 2023
103 checks passed
@calherries calherries deleted the appends/mb-row-id branch December 15, 2023 10:07
Labels: .Team/BackendComponents (also known as BEC)
2 participants