
Convert column to float when appending floats to an integer column #39493

Merged · 19 commits merged into master on Mar 8, 2024

Conversation

crisptrutski (Contributor) commented Mar 3, 2024

Closes #37069

Description

This adds a type-detection step during append, which checks for columns whose types would need to be relaxed to fit the incoming values, and relaxes them if we've configured that transition as allowed.
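
In outline, the new append logic does something like the following (a sketch assembled from the excerpts reviewed below; the binding names match the diff, but the surrounding wiring is illustrative, not the exact code):

;; fold each sampled row into the running per-column types, then only
;; apply a relaxation if that transition is configured as allowed
(let [type->value->type (partial relax-type (settings->type->check settings))
      relaxed-types     (->> (sample-rows rows)
                             (reduce #(u/map-all type->value->type %1 %2) old-column-types)
                             (map column-type))
      new-column-types  (map #(if (matching-or-upgradable? %1 %2) %2 %1)
                             old-column-types relaxed-types)]
  ...)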

How to verify

Steps to verify that the changes work as expected (e.g. with the two sample CSVs sketched after the list):

  1. Upload a CSV with an all-integer column as a new table.
  2. View the table and confirm the column has an integer type.
  3. Append a CSV in which the same column contains floats.
  4. View the table and confirm the column now has a float type, and that the numbers were not truncated.
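
A pair of hypothetical CSVs that would exercise the transition (the file and column names are made up for illustration):

;; scores.csv — initial upload; the "points" column is detected as integer
points
1
2

;; scores-more.csv — appended; the same column now contains a float,
;; so the column should be relaxed to float
points
3.5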

Demo

Watch it in action

crisptrutski (Contributor, Author) commented Mar 3, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.


@metabase-bot metabase-bot bot added the .Team/BackendComponents also known as BEC label Mar 3, 2024
@crisptrutski crisptrutski changed the title from "WIP test" to "Alter" Mar 3, 2024
@crisptrutski crisptrutski changed the title from "Alter" to "Convert column to float when appending floats to an integer column" Mar 3, 2024
@crisptrutski crisptrutski added the no-backport Do not backport this PR to any branch label Mar 3, 2024
@crisptrutski crisptrutski force-pushed the toposort branch 2 times, most recently from 32f5afa to e30f710 on March 5, 2024 04:03
Base automatically changed from toposort to master March 5, 2024 14:01
@crisptrutski crisptrutski marked this pull request as ready for review March 5, 2024 22:00
@crisptrutski crisptrutski requested review from a team and calherries March 5, 2024 22:00
replay-io bot commented Mar 5, 2024

Status: In Progress (51 / 52) · Commit: 444c2ed · Results: ⚠️ 4 flaky, 2345 passed

crisptrutski (Contributor, Author) commented:

There's a regression where we're upgrading the type unnecessarily:

Uploading 2.0 into a column of type int should be coerced to 2
expected: [[1 2]]
  actual: [[1 2.0]]
    diff: - [[nil 2]]
          + [[nil 2.0]]

(Review threads on src/metabase/driver.clj and src/metabase/upload.clj were marked outdated and resolved.)
;; for now we just plan for the worst and perform a fairly expensive operation to detect any type changes
;; we can come back and optimize this to an optimistic-with-fallback approach later.
type->value->type (partial relax-type (settings->type->check settings))
relaxed-types     (->> (sample-rows rows)
                       (reduce #(u/map-all type->value->type %1 %2) old-column-types)
                       (map column-type))
calherries (Contributor) commented:

I think we should forget sampling rows here. Uploaded CSVs can only be 50 MB and parsing them should be lightning fast (in the future that is). What we want right now is correctness first, speed second.

crisptrutski (Contributor, Author) replied:

Should we remove it from uploads as well then, as a short-term fix for missing the floats in a mostly int column?

calherries (Contributor) replied:

Yes, although I'm not sure that's the cause of the bug.

crisptrutski (Contributor, Author) replied:

Interesting, I'll try to reproduce it with and without.

@@ -619,8 +655,25 @@
        (driver/create-auto-pk-with-append-csv? driver)
        (not (contains? normed-name->field auto-pk-column-name)))
   _                (check-schema (dissoc normed-name->field auto-pk-column-name) header)
   col-upload-types (map (comp base-type->upload-type :base_type normed-name->field) normed-header)
   parsed-rows      (parse-rows col-upload-types rows)]
   settings         (upload-parsing/get-settings)
calherries (Contributor) commented Mar 6, 2024:

This optimization around upload-parsing/get-settings is premature, IMO. We just don't care much about the few milliseconds per 1000 rows right now, and it's adding a lot of noise to the code.

By my measurement, (upload-parsing/get-settings) is really cheap compared to everything else we're doing. You could even memoize it to make it really, really cheap, but even that is questionably worth the keystrokes.

(def type->check (settings->type->check (upload-parsing/get-settings)))

(time
 (dotimes [_ 100000]
   (value->type type->check "1")))
"Elapsed time: 6440.837959 msecs"

(time
 (dotimes [_ 100000]
   (upload-parsing/get-settings)))
"Elapsed time: 46.310292 msecs"

crisptrutski (Contributor, Author) replied:

The purity is the main thing for me.

calherries (Contributor) replied:

Oh. Well, in that case I just disagree that purity is worth it here, but I do see some benefits, so if you feel strongly about it, let's keep it.

crisptrutski (Contributor, Author) replied:

My stance is that you need a really good reason for making things impure. I might get worn down over time by how little existing code is pure.

crisptrutski (Contributor, Author) commented Mar 6, 2024:

I like the way you're sanity-checking the performance of things we've been discussing recently; it's very easy to reason about this stuff from a comfortable armchair, and be wrong.

I know you're using time just for ballpark numbers, and in this case it was fit for purpose, but similarly to how we've started using that nice memory-measurement tool, I'm thinking it would be nice to have a little Criterium¹ harness for these kinds of micro-benchmarks. The Java configuration we use for our REPL is not that representative of production; tables may turn once escape analysis and inlining kick in, for example.

Even when microbenchmarks are done rigorously, they're often misleading for real-world applications as soon as allocation or memory look-ups are involved; in a tight loop, memory locality is huge. This is obviously not a tight loop, and Clojure is in general debaucherous when it comes to memory layout, but good habits go a long way toward preventing headaches when things do matter. Non-local changes can often bring old naive code into a hot path.

My general approach is to avoid any work whenever it is simple to do so, especially when it has other benefits. Pure code is so much easier to test.

Footnotes

  1. I see it's been 3 years since it got an update. Maybe there's a new kid in town? Hopefully this is just good old Clojure stability.
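
For what it's worth, such a harness can be tiny. criterium.core/quick-bench is the library's actual entry point; value->type and type->check here are the definitions from the benchmark snippet above:

(require '[criterium.core :as criterium])

;; quick-bench warms up the JIT before measuring, so inlining and other
;; optimizations get a chance to kick in, unlike a bare `time` loop
(criterium/quick-bench (value->type type->check "1"))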

calherries (Contributor) commented Mar 6, 2024:

+1 on Criterium, we should add it to :dev. I sometimes go to the trouble of using it myself. Point noted on not taking microbenchmarks too seriously, although I do find them handy as ballpark estimates.

> My stance is that you need a really good reason for making things impure.

I guess I'm not so much of a purist, haha. Purity makes some code hard to read for little benefit. Sometimes you need a really good reason to make things pure.

crisptrutski (Contributor, Author) replied:

All benchmarks are better than guesses, but it'll be really nice to have a solid suite of real-world benchmarks too.

Code being complex can be a good reason in itself; unfortunately, it's so subjective.

When you just need environmental state, bundling that all up in the first argument is a really easy convention that composes pretty well. You can always partially apply the function to get back the impure version.

This "context first arg" is the approach I typically take for injecting things like database connections, metrics emitters, etc, even in imperative languages like Java. I find it immensely useful to be able to trace where effects can happen - things get so complex otherwise!

Comment on lines 212 to 220
(defn- relax-type
  "Given an existing column type, and a value to insert into it, relax the type until it can parse the value."
  [type->check current-type value]
  (cond (nil? value)        current-type
        (nil? current-type) (value->type type->check value)
        :else               (let [trimmed (str/trim value)]
                              (->> (cons current-type (ancestors h current-type))
                                   (filter #((type->check %) trimmed))
                                   first))))
calherries (Contributor) commented:

Could we use allowed-type-upgrades to limit the type checks? Then relax-type can return only allowed upgrades.

crisptrutski (Contributor, Author) commented Mar 6, 2024:

That's a nice idea, but I was thinking of leaving this method agnostic of whether it's being used for append. That way we could also use it if we chunk up the original insertion, in which case we'd always allow the upgrade. I guess we could pass in allow-upgrade? as an argument, though...

Another subtle issue with this approach would occur if we had any non-leaf ambiguous nodes; see my other comment about wanting to wait until we've converted to column-type. Perhaps we'll never have a use case for non-leaf non-column types, but I'd prefer not to make limiting assumptions in areas where there is bigger low-hanging fruit for optimization.

Simplicity is another consideration, but in this case having separate lines where we relax and then check whether it's allowed is arguably simpler, in a Hickey-ian sense, than a function that interleaves both.

I'll add this idea to a larger backlog optimization issue, since it can be made to work around the abstract-node issue by either putting the abstract nodes into the allowed set (I don't like this), or by implicitly putting all the nodes between each pair into the set as well. Perhaps that's a bit too magic, but from the start I liked the idea of just defining the "convex hull" of relaxations we'll allow.
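
As an illustration, the gating being discussed might look something like this sketch (the map contents and the exact shape of matching-or-upgradable? are assumptions, not necessarily the PR's code):

;; hypothetical: integer columns may relax to float, nothing else may change
(def allowed-type-upgrades
  {::int #{::float}})

(defn matching-or-upgradable? [current-type new-type]
  (or (= current-type new-type)
      (contains? (get allowed-type-upgrades current-type #{}) new-type)))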

relaxed-types    (->> (sample-rows rows)
                      (reduce #(u/map-all type->value->type %1 %2) old-column-types)
                      (map column-type))
new-column-types (map #(if (matching-or-upgradable? %1 %2) %2 %1) old-column-types relaxed-types)
calherries (Contributor) commented:

Intuitively, I feel like we should be able to skip this line. During the reduction we should be able to get a list of columns that have new column types by construction, rather than checking whether they are matching afterwards. Maybe there's a reason why you've separated these two things though.

crisptrutski (Contributor, Author) replied:

The idea is to avoid projecting to concrete types until we've finished relaxing, to avoid a case like turning ::boolean-or-int into ::boolean before we see a later ::int. Only once we turn each into a column-type can we check whether each "upgrade" is allowed.

This is a bit theoretical, since our only abstract type is a leaf, but I'd rather have a bit more code and computation than make assumptions that could be painful to remove later.
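
A hypothetical trace of the case described above, assuming a hierarchy in which "1" parses as the ambiguous ::boolean-or-int while "2" parses only as ::int:

;; the first value alone is ambiguous, so the running type stays abstract
(reduce (partial relax-type type->check) nil ["1"])
;; => ::boolean-or-int

;; a later value resolves the ambiguity; projecting to a concrete
;; column-type after the first value would have committed to ::boolean
;; too early
(reduce (partial relax-type type->check) nil ["1" "2"])
;; => ::int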

calherries (Contributor) left a review:

Works for me!

crisptrutski (Contributor, Author) commented:

@calherries This should be good to go now; I just need feedback on where I put the new multimethod.

@crisptrutski crisptrutski enabled auto-merge (squash) March 8, 2024 15:10
@crisptrutski crisptrutski merged commit a0d78d1 into master Mar 8, 2024
111 checks passed
@crisptrutski crisptrutski deleted the upload-integer-to-float branch March 8, 2024 17:00
@crisptrutski crisptrutski added this to the 0.50 milestone Mar 10, 2024
Labels
no-backport: Do not backport this PR to any branch
.Team/BackendComponents: also known as BEC
Development

Successfully merging this pull request may close these issues.

CSV appends: integer columns should be convertable to float
2 participants