Convert column to float when appending floats to an integer column #39493
Conversation
There's a regression, where we're upgrading the type unnecessarily.
src/metabase/upload.clj
Outdated
;; for now we just plan for the worst and perform a fairly expensive operation to detect any type changes
;; we can come back and optimize this to an optimistic-with-fallback approach later.
type->value->type (partial relax-type (settings->type->check settings))
relaxed-types     (->> (sample-rows rows)
I think we should forget sampling rows here. Uploaded CSVs can only be 50 MB and parsing them should be lightning fast (in the future that is). What we want right now is correctness first, speed second.
Should we remove it from uploads as well then, as a short-term fix for missing the floats in a mostly int column?
yes, although I'm not sure that's the cause of the bug
Interesting, I'll try to reproduce it with and without.
@@ -619,8 +655,25 @@
(driver/create-auto-pk-with-append-csv? driver)
(not (contains? normed-name->field auto-pk-column-name)))
_ (check-schema (dissoc normed-name->field auto-pk-column-name) header)
col-upload-types (map (comp base-type->upload-type :base_type normed-name->field) normed-header)
parsed-rows (parse-rows col-upload-types rows)]
settings (upload-parsing/get-settings)
This optimization around upload-parsing/get-settings is premature IMO. We just don't care much about the few milliseconds per 1000 rows right now, and it's adding a lot of noise to the code.
By my measure, (upload-parsing/get-settings) is really cheap compared to everything else we're doing. You could even memoize it to make it really, really cheap, but even that is questionably worth the keystrokes.
(def type->check (settings->type->check (upload-parsing/get-settings)))
(time
(dotimes [_ 100000]
(value->type type->check "1")))
"Elapsed time: 6440.837959 msecs"
(time
(dotimes [_ 100000]
(upload-parsing/get-settings)))
"Elapsed time: 46.310292 msecs"
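The memoization suggested above is a one-liner. A minimal sketch (hypothetical: get-settings* stands in for the real upload-parsing/get-settings, and assumes the settings are stable for the duration of the operation):

```clojure
;; Hypothetical stand-in for metabase's upload-parsing/get-settings.
(defn get-settings* []
  {:number-separators ".,"})

;; Cache the lookup so repeated calls return the same cached value.
(def memoized-get-settings
  (memoize get-settings*))

;; Repeated calls now hit the cache rather than re-reading settings.
(memoized-get-settings)
```

Note that memoize caches forever, so this sketch is only safe if stale settings within one upload are acceptable.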
The purity is the main thing for me.
Oh. Well in that case I just disagree that purity is worth it here, but I do see some benefits so if you feel strongly about it let's keep it
My stance is that you need a really good reason for making things impure. I might get worn down over time by how little existing code is pure.
I like the way you're sanity checking the performance of things we've been discussing recently; it's very easy to reason about this stuff from a comfortable armchair, and be wrong.
I know you're using time just for ballpark numbers, and in this case it was fit for purpose, but similarly to how we've started using that nice memory measurement tool, I'm thinking that it would be nice to have a little Criterium¹ harness to use for these kinds of micro benchmarks. The Java configuration we use for our REPL is not that representative of production; tables may turn once escape analysis and inlining kick in, for example.
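A minimal Criterium harness for this kind of check might look like the following. This is a sketch: value->type* is a hypothetical stand-in for the real value->type, and the benchmark call assumes the criterium dependency is on the classpath.

```clojure
;; Sketch: Criterium handles JVM warm-up and reports statistics
;; (mean, std-deviation, outliers), unlike a bare `time` call.
(defn value->type* [s]
  ;; hypothetical stand-in for the real value->type under test
  (if (re-matches #"-?\d+" s) ::int ::text))

(comment
  ;; requires criterium on the classpath, e.g. in the :dev alias
  (require '[criterium.core :refer [quick-bench]])
  (quick-bench (value->type* "1")))
```

quick-bench is the fast variant; criterium.core/bench runs longer for tighter confidence intervals.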
Even when doing microbenchmarks rigorously, it's often misleading for real world applications as soon as allocation or memory look-ups are involved - in a tight loop memory locality is huge. This is obviously not a tight loop, and Clojure is in general debaucherous when it comes to memory layout, but good habits go a long way to preventing headaches when things do matter. Non-local changes can often bring old naive code into a hot path.
My general approach is to avoid any work whenever it is simple to do so, especially when it has other benefits. Pure code is so much easier to test.
Footnotes
1. I see it's been 3 years since it got an update. Maybe there's a new kid in town? Hopefully this is just good-old Clojure stability.
+1 on criterium, we should add it to :dev. I sometimes go to the trouble of using it myself. Point noted on not taking microbenchmarks too seriously, although I do find them handy as ballpark estimates.
My stance is that you need a really good reason for making things impure.
I guess I'm not so much of a purist haha. Purity makes some code hard to read for little benefit. Sometimes you need a really good reason to make things pure.
All benchmarks are better than guesses, but it'll be really nice to have a solid suite of real-world benchmarks too.
Code being complex can be a good reason itself; unfortunately it's so subjective.
When you just need environmental state, bundling that all up in the first argument is a really easy convention that composes pretty well. You can always partially apply the function to get back the impure version.
This "context first arg" is the approach I typically take for injecting things like database connections, metrics emitters, etc, even in imperative languages like Java. I find it immensely useful to be able to trace where effects can happen - things get so complex otherwise!
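The "context first arg" convention described above might be sketched like this (hypothetical names, not Metabase code):

```clojure
(require '[clojure.string :as str])

;; All environmental state arrives via the first `settings` argument,
;; so the function stays pure and trivially testable.
(defn parse-number
  [settings s]
  (let [sep (:decimal-separator settings)]
    (bigdec (str/replace s sep "."))))

;; Partially apply the context to recover a convenient one-arg version.
(def parse-number-eu (partial parse-number {:decimal-separator ","}))

(parse-number-eu "3,14") ;; => 3.14M
```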
src/metabase/upload.clj
Outdated
(defn- relax-type
  "Given an existing column type, and a value to insert into it, relax the type until it can parse the value."
  [type->check current-type value]
  (cond (nil? value)        current-type
        (nil? current-type) (value->type type->check value)
        :else (let [trimmed (str/trim value)]
                (->> (cons current-type (ancestors h current-type))
                     (filter #((type->check %) trimmed))
                     first))))
Could we use allowed-type-upgrades to limit the type checks? Then relax-type can return only allowed upgrades.
That's a nice idea, but I was thinking to leave this method agnostic of whether it's being used for append. That way we could also use it if we chunk up the original insertion, in which case we'd always allow the upgrade. I guess we could pass in allow-upgrade? as an argument though...
Another subtle issue with this approach occurs if we had any non-leaf ambiguous nodes; see my other comment about wanting to wait until we've converted to column-type. Perhaps we'll never have a use case for non-leaf non-column types, but I'd prefer not to make limiting assumptions in areas where there is bigger low-hanging fruit for optimization.
Simplicity is another thing, but in this case having separate lines where we relax and then check if it's allowed is arguably simpler in a Hickey-ian sense than a function that interleaves both.
I'll add this idea to a larger backlog optimization issue, since it can be made to work around the abstract node thing by either putting the abstract nodes into the allowed set (I don't like this), or by implicitly putting all the nodes between each pair into the set as well - perhaps that's a bit too magic, but from the start I liked the idea of just defining the "convex hull" of relaxations we'll allow.
relaxed-types (->> (sample-rows rows)
                   (reduce #(u/map-all type->value->type %1 %2) old-column-types)
                   (map column-type))
new-column-types (map #(if (matching-or-upgradable? %1 %2) %2 %1) old-column-types relaxed-types)
Intuitively, I feel like we should be able to skip this line. During the reduction we should be able to get a list of columns that have new column types by construction, rather than checking whether they are matching afterwards. Maybe there's a reason why you've separated these two things though.
The idea is to avoid projecting to concrete types until we've finished relaxing the type, to avoid a case like turning ::boolean-or-int into ::boolean before we see a later ::int. Only once we turn each into a column-type can we check whether each "upgrade" is allowed.
This is a bit theoretical, since our only abstract type is a leaf, but I'd rather have a bit more code and computation than make assumptions that could be painful to remove later.
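The ordering concern can be illustrated with a toy reduction (hypothetical type names and checks; relax* and column-type* are stand-ins, not the real upload code):

```clojure
;; Toy illustration: keep the abstract type while scanning values,
;; and only project to a concrete column type at the very end.
(defn relax* [t v]
  (cond
    ;; 0/1 values keep the ambiguous type alive
    (and (= t ::boolean-or-int) (re-matches #"[01]" v)) ::boolean-or-int
    ;; any other integer forces the decision toward ::int
    (and (#{::boolean-or-int ::int} t) (re-matches #"-?\d+" v)) ::int
    :else ::text))

(defn column-type* [t]
  ;; project abstract nodes onto a concrete column type
  (if (= t ::boolean-or-int) ::boolean t))

;; Scanning ["1" "0" "5"]: projecting early after "1" would yield
;; ::boolean and then mis-handle "5"; reducing first keeps ::int.
(column-type* (reduce relax* ::boolean-or-int ["1" "0" "5"]))
;; => ::int
```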
Works for me!
@calherries This should be good-to-go now, just need feedback on where I put the new multi-method.
Co-authored-by: Cal Herries <39073188+calherries@users.noreply.github.com>
Closes #37069
Description
This adds a type-detection step during append which checks for columns whose types would need to be relaxed, and relaxes them if we've configured that transition as allowed.
How to verify
Describe the steps to verify that the changes are working as expected.
Demo
Watch it in action