
Convert column to float when appending floats to an integer column #39493

Merged · 19 commits merged into master on Mar 8, 2024

Conversation

crisptrutski (Contributor) commented Mar 3, 2024

Closes #37069

Description

This adds a type-detection step during append, which checks for columns whose types would need to be relaxed to fit the incoming values, and relaxes them if we've configured that transition as allowed.
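
In outline, the new append logic does something like the following (a sketch assembled from the excerpts reviewed below; the binding names match the diff, but the surrounding wiring is illustrative, not the exact code):

;; fold each sampled row into the running per-column types, then only
;; apply a relaxation if that transition is configured as allowed
(let [type->value->type (partial relax-type (settings->type->check settings))
      relaxed-types     (->> (sample-rows rows)
                             (reduce #(u/map-all type->value->type %1 %2) old-column-types)
                             (map column-type))
      new-column-types  (map #(if (matching-or-upgradable? %1 %2) %2 %1)
                             old-column-types relaxed-types)]
  ...)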

How to verify

Steps to verify that the changes work as expected (e.g. with the two sample CSVs sketched after the list):

  1. Upload a CSV with an all-integer column as a new table.
  2. View the table and confirm the column has an integer type.
  3. Append a CSV in which the same column contains floats.
  4. View the table and confirm the column now has a float type, and that the numbers were not truncated.
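
A pair of hypothetical CSVs that would exercise the transition (the file and column names are made up for illustration):

;; scores.csv — initial upload; the "points" column is detected as integer
points
1
2

;; scores-more.csv — appended; the same column now contains a float,
;; so the column should be relaxed to float
points
3.5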

Demo

Watch it in action

crisptrutski (Contributor, Author) commented Mar 3, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.


@metabase-bot metabase-bot bot added the .Team/BackendComponents also known as BEC label Mar 3, 2024
@crisptrutski crisptrutski changed the title from "WIP test" to "Alter" Mar 3, 2024
@crisptrutski crisptrutski changed the title from "Alter" to "Convert column to float when appending floats to an integer column" Mar 3, 2024
@crisptrutski crisptrutski added the no-backport Do not backport this PR to any branch label Mar 3, 2024
@crisptrutski crisptrutski force-pushed the toposort branch 2 times, most recently from 32f5afa to e30f710 on March 5, 2024 04:03
Base automatically changed from toposort to master March 5, 2024 14:01
@crisptrutski crisptrutski marked this pull request as ready for review March 5, 2024 22:00
@crisptrutski crisptrutski requested review from a team and calherries March 5, 2024 22:00
replay-io bot commented Mar 5, 2024

Status: In Progress (51 / 52) · Commit: 444c2ed · Results: ⚠️ 4 flaky, 2345 passed

crisptrutski (Contributor, Author) commented:

There's a regression where we're upgrading the type unnecessarily:

Uploading 2.0 into a column of type int should be coerced to 2
expected: [[1 2]]
  actual: [[1 2.0]]
    diff: - [[nil 2]]
          + [[nil 2.0]]

(Review threads on src/metabase/driver.clj and src/metabase/upload.clj were marked outdated and resolved.)
;; for now we just plan for the worst and perform a fairly expensive operation to detect any type changes
;; we can come back and optimize this to an optimistic-with-fallback approach later.
type->value->type (partial relax-type (settings->type->check settings))
relaxed-types     (->> (sample-rows rows)
                       (reduce #(u/map-all type->value->type %1 %2) old-column-types)
                       (map column-type))
calherries (Contributor) commented:

I think we should forget sampling rows here. Uploaded CSVs can only be 50 MB and parsing them should be lightning fast (in the future that is). What we want right now is correctness first, speed second.

crisptrutski (Contributor, Author) replied:

Should we remove it from uploads as well then, as a short-term fix for missing the floats in a mostly int column?

calherries (Contributor) replied:

Yes, although I'm not sure that's the cause of the bug.

crisptrutski (Contributor, Author) replied:

Interesting, I'll try to reproduce it with and without.

@@ -619,8 +655,25 @@
        (driver/create-auto-pk-with-append-csv? driver)
        (not (contains? normed-name->field auto-pk-column-name)))
   _                (check-schema (dissoc normed-name->field auto-pk-column-name) header)
   col-upload-types (map (comp base-type->upload-type :base_type normed-name->field) normed-header)
   parsed-rows      (parse-rows col-upload-types rows)]
   settings         (upload-parsing/get-settings)
calherries (Contributor) commented Mar 6, 2024:

This optimization around upload-parsing/get-settings is premature, IMO. We just don't care much about the few milliseconds per 1000 rows right now, and it's adding a lot of noise to the code.

By my measurement, (upload-parsing/get-settings) is really cheap compared to everything else we're doing. You could even memoize it to make it really, really cheap, but even that is questionably worth the keystrokes.

(def type->check (settings->type->check (upload-parsing/get-settings)))

(time
 (dotimes [_ 100000]
   (value->type type->check "1")))
"Elapsed time: 6440.837959 msecs"

(time
 (dotimes [_ 100000]
   (upload-parsing/get-settings)))
"Elapsed time: 46.310292 msecs"

crisptrutski (Contributor, Author) replied:

The purity is the main thing for me.

calherries (Contributor) replied:

Oh. Well, in that case I just disagree that purity is worth it here, but I do see some benefits, so if you feel strongly about it, let's keep it.

crisptrutski (Contributor, Author) replied:

My stance is that you need a really good reason for making things impure. I might get worn down over time by how little existing code is pure.

crisptrutski (Contributor, Author) commented Mar 6, 2024:

I like the way you're sanity-checking the performance of things we've been discussing recently; it's very easy to reason about this stuff from a comfortable armchair, and be wrong.

I know you're using time just for ballpark numbers, and in this case it was fit for purpose, but similarly to how we've started using that nice memory-measurement tool, I'm thinking it would be nice to have a little Criterium¹ harness for these kinds of micro-benchmarks. The Java configuration we use for our REPL is not that representative of production; tables may turn once escape analysis and inlining kick in, for example.

Even when microbenchmarks are done rigorously, they're often misleading for real-world applications as soon as allocation or memory look-ups are involved; in a tight loop, memory locality is huge. This is obviously not a tight loop, and Clojure is in general debaucherous when it comes to memory layout, but good habits go a long way toward preventing headaches when things do matter. Non-local changes can often bring old naive code into a hot path.

My general approach is to avoid any work whenever it is simple to do so, especially when it has other benefits. Pure code is so much easier to test.

Footnotes

  1. I see it's been 3 years since it got an update. Maybe there's a new kid in town? Hopefully this is just good old Clojure stability.
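
For what it's worth, such a harness can be tiny. criterium.core/quick-bench is the library's actual entry point; value->type and type->check here are the definitions from the benchmark snippet above:

(require '[criterium.core :as criterium])

;; quick-bench warms up the JIT before measuring, so inlining and other
;; optimizations get a chance to kick in, unlike a bare `time` loop
(criterium/quick-bench (value->type type->check "1"))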

calherries (Contributor) commented Mar 6, 2024:

+1 on Criterium, we should add it to :dev. I sometimes go to the trouble of using it myself. Point noted on not taking microbenchmarks too seriously, although I do find them handy as ballpark estimates.

> My stance is that you need a really good reason for making things impure.

I guess I'm not so much of a purist, haha. Purity makes some code hard to read for little benefit. Sometimes you need a really good reason to make things pure.

crisptrutski (Contributor, Author) replied:

All benchmarks are better than guesses, but it'll be really nice to have a solid suite of real-world benchmarks too.

Code being complex can be a good reason in itself; unfortunately, it's so subjective.

When you just need environmental state, bundling that all up in the first argument is a really easy convention that composes pretty well. You can always partially apply the function to get back the impure version.

This "context first arg" is the approach I typically take for injecting things like database connections, metrics emitters, etc, even in imperative languages like Java. I find it immensely useful to be able to trace where effects can happen - things get so complex otherwise!

Comment on lines 212 to 220
(defn- relax-type
  "Given an existing column type, and a value to insert into it, relax the type until it can parse the value."
  [type->check current-type value]
  (cond (nil? value)        current-type
        (nil? current-type) (value->type type->check value)
        :else               (let [trimmed (str/trim value)]
                              (->> (cons current-type (ancestors h current-type))
                                   (filter #((type->check %) trimmed))
                                   first))))
calherries (Contributor) commented:

Could we use allowed-type-upgrades to limit the type checks? Then relax-type can return only allowed upgrades.

crisptrutski (Contributor, Author) commented Mar 6, 2024:

That's a nice idea, but I was thinking of leaving this method agnostic of whether it's being used for append. That way we could also use it if we chunk up the original insertion, in which case we'd always allow the upgrade. I guess we could pass in allow-upgrade? as an argument, though...

Another subtle issue with this approach would occur if we had any non-leaf ambiguous nodes; see my other comment about wanting to wait until we've converted to column-type. Perhaps we'll never have a use case for non-leaf non-column types, but I'd prefer not to make limiting assumptions in areas where there is bigger low-hanging fruit for optimization.

Simplicity is another consideration, but in this case having separate lines where we relax and then check whether it's allowed is arguably simpler, in a Hickey-ian sense, than a function that interleaves both.

I'll add this idea to a larger backlog optimization issue, since it can be made to work around the abstract-node issue by either putting the abstract nodes into the allowed set (I don't like this), or by implicitly putting all the nodes between each pair into the set as well. Perhaps that's a bit too magic, but from the start I liked the idea of just defining the "convex hull" of relaxations we'll allow.
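
As an illustration, the gating being discussed might look something like this sketch (the map contents and the exact shape of matching-or-upgradable? are assumptions, not necessarily the PR's code):

;; hypothetical: integer columns may relax to float, nothing else may change
(def allowed-type-upgrades
  {::int #{::float}})

(defn matching-or-upgradable? [current-type new-type]
  (or (= current-type new-type)
      (contains? (get allowed-type-upgrades current-type #{}) new-type)))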

relaxed-types    (->> (sample-rows rows)
                      (reduce #(u/map-all type->value->type %1 %2) old-column-types)
                      (map column-type))
new-column-types (map #(if (matching-or-upgradable? %1 %2) %2 %1) old-column-types relaxed-types)
calherries (Contributor) commented:

Intuitively, I feel like we should be able to skip this line. During the reduction we should be able to get a list of columns that have new column types by construction, rather than checking whether they are matching afterwards. Maybe there's a reason why you've separated these two things though.

crisptrutski (Contributor, Author) replied:

The idea is to avoid projecting to concrete types until we've finished relaxing, to avoid a case like turning ::boolean-or-int into ::boolean before we see a later ::int. Only once we turn each into a column-type can we check whether each "upgrade" is allowed.

This is a bit theoretical, since our only abstract type is a leaf, but I'd rather have a bit more code and computation than make assumptions that could be painful to remove later.
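
A hypothetical trace of the case described above, assuming a hierarchy in which "1" parses as the ambiguous ::boolean-or-int while "2" parses only as ::int:

;; the first value alone is ambiguous, so the running type stays abstract
(reduce (partial relax-type type->check) nil ["1"])
;; => ::boolean-or-int

;; a later value resolves the ambiguity; projecting to a concrete
;; column-type after the first value would have committed to ::boolean
;; too early
(reduce (partial relax-type type->check) nil ["1" "2"])
;; => ::int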

calherries (Contributor) left a review:

Works for me!

crisptrutski (Contributor, Author) commented:

@calherries This should be good to go now; I just need feedback on where I put the new multimethod.

@crisptrutski crisptrutski enabled auto-merge (squash) March 8, 2024 15:10
@crisptrutski crisptrutski merged commit a0d78d1 into master Mar 8, 2024
111 checks passed
@crisptrutski crisptrutski deleted the upload-integer-to-float branch March 8, 2024 17:00
@crisptrutski crisptrutski added this to the 0.50 milestone Mar 10, 2024
Labels
no-backport: Do not backport this PR to any branch
.Team/BackendComponents: also known as BEC
Development

Successfully merging this pull request may close these issues.

CSV appends: integer columns should be convertable to float
2 participants