-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize upload type inference #39741
Optimize upload type inference #39741
Conversation
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @crisptrutski and the rest of your teammates on |
e0f9f9d
to
002b41f
Compare
e9d5982
to
26e1d0b
Compare
ab3b933
to
e05d22f
Compare
|
26e1d0b
to
812244f
Compare
e05d22f
to
049abfc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support the theory and uploaded a fat file that was fairly quick, but don't have enough background with previous perf benchmarks to make definitive claims.
Would be nice to maybe do some simple benchmarks? Even a time $(curl ...)
sort of thing would be great
;; It's important to realise this lazy sequence, because otherwise we can build a huge stack and overflow. | ||
(vec (u/map-all type->value->type value-types row))))) | ||
|
||
(defn- relax-types [settings current-types rows] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relax-types
should have either a name, docstring or malli schema that explains that its return value is a list of concrete column types.
The key thing to understand is that relax-types
returns a sequence of concrete types. This is a key property that's used in append-csv*
's implementation.
Now that you've extracted this function from append-csv!*
, that information is lost if you're just reading the body of that function. To understand how append-csv!*
works, you would have to understand the implementation of relax-types
and see that it calls column-type
.
That's exactly why column-types-from-rows
has a malli signature added for its return type: to make this property easier to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should also rename column-type
to be concrete-type
, just for extra clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points, especially that I was too cavalier hiding the projection that had nothing to do with relaxing. I'll think about renaming the type, pulled in both directions. Made a note to come back to this after this chunky review stack is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great btw. Faster with less code!
Yeah I really want benchmarks in this area. There are still a bunch more optimizations on the table, I'm planning to cut an issue today outlining them, with its first step to add a criterium benchmark. In retrospect it would have been nice to quantity both the line length where the previously implementation exploded, and long that took. I could also have added vec to the old implementation and given it a fair comparison, and also compared to the time when we still had sampling. Since the subjective difference is so large, I'm going to save this kind of comparison for future. I even have 3 different sampling algorithms hiding on in my no-commit that I'd be curious to try out for even more ludicrous file sizes... |
Co-authored-by: Tim Macdonald <tim@metabase.com>
Co-authored-by: Cal Herries <39073188+calherries@users.noreply.github.com>
7b24b00
to
db2a9a2
Compare
Closes #38958
Description
This change signifantly optimizes how we infer the types of the columns in a CSV upload.
Without this change we relied on sampling the rows of large files to infer the schema, and after removing the sampling we would now spin for a while before throwing a stack overflow.
With this change the reference file from the Slack conversation gets inferred almost instantly.
There are two small tricks:
Since I'd already started using this optimization for the append flow, this unifies the code nicely.
If we want to eek out a bit more performance here, we could make a
u/mapv-all
function using a vector transient, but evenclojure.core/mapv
uses laziness for more than 1 collection.How to verify