Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize upload type inference #39741

Merged
merged 5 commits into from
Mar 7, 2024

Conversation

crisptrutski
Copy link
Contributor

@crisptrutski crisptrutski commented Mar 7, 2024

Closes #38958

Description

This change signifantly optimizes how we infer the types of the columns in a CSV upload.

Without this change we relied on sampling the rows of large files to infer the schema, and after removing the sampling we would now spin for a while before throwing a stack overflow.

With this change the reference file from the Slack conversation gets inferred almost instantly.

There are two small tricks:

  • Strict evaluation at every step, to avoid building a huge call stack.
  • Skip types we've ruled out already, i.e. start from the current type.

Since I'd already started using this optimization for the append flow, this unifies the code nicely.

If we want to eek out a bit more performance here, we could make a u/mapv-all function using a vector transient, but even clojure.core/mapv uses laziness for more than 1 collection.

How to verify

  1. Upload the CSV file for which we incorrectly inferred the and int column as float before.
  2. It loads quickly, and the column has type float, with uncoerced values.

Copy link
Contributor Author

crisptrutski commented Mar 7, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @crisptrutski and the rest of your teammates on Graphite Graphite

@metabase-bot metabase-bot bot added the .Team/BackendComponents also known as BEC label Mar 7, 2024
@crisptrutski crisptrutski requested a review from a team March 7, 2024 00:35
@crisptrutski crisptrutski removed the .Team/BackendComponents also known as BEC label Mar 7, 2024 — with Graphite App
@crisptrutski crisptrutski changed the base branch from upload-integer-to-float to __experiment__2024_03_6__float_or_int March 7, 2024 00:37
@crisptrutski crisptrutski force-pushed the __experiment__2024_03_6__float_or_int branch from e9d5982 to 26e1d0b Compare March 7, 2024 00:50
Copy link

replay-io bot commented Mar 7, 2024

Status Complete ↗︎
Commit 049abfc
Results
⚠️ 3 Flaky
2329 Passed

@crisptrutski crisptrutski force-pushed the __experiment__2024_03_6__float_or_int branch from 26e1d0b to 812244f Compare March 7, 2024 01:48
@crisptrutski crisptrutski added the .Team/BackendComponents also known as BEC label Mar 7, 2024
src/metabase/upload.clj Outdated Show resolved Hide resolved
Copy link
Member

@tsmacdonald tsmacdonald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support the theory and uploaded a fat file that was fairly quick, but don't have enough background with previous perf benchmarks to make definitive claims.

Would be nice to maybe do some simple benchmarks? Even a time $(curl ...) sort of thing would be great

src/metabase/upload.clj Outdated Show resolved Hide resolved
;; It's important to realise this lazy sequence, because otherwise we can build a huge stack and overflow.
(vec (u/map-all type->value->type value-types row)))))

(defn- relax-types [settings current-types rows]
Copy link
Contributor

@calherries calherries Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relax-types should have either a name, docstring or malli schema that explains that its return value is a list of concrete column types.

The key thing to understand is that relax-types returns a sequence of concrete types. This is a key property that's used in append-csv*'s implementation.

Now that you've extracted this function from append-csv!*, that information is lost if you're just reading the body of that function. To understand how append-csv!* works, you would have to understand the implementation of relax-types and see that it calls column-type.

That's exactly why column-types-from-rows has a malli signature added for its return type: to make this property easier to understand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should also rename column-type to be concrete-type, just for extra clarity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points, especially that I was too cavalier hiding the projection that had nothing to do with relaxing. I'll think about renaming the type, pulled in both directions. Made a note to come back to this after this chunky review stack is merged.

Copy link
Contributor

@calherries calherries left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great btw. Faster with less code!

@crisptrutski
Copy link
Contributor Author

crisptrutski commented Mar 7, 2024

Would be nice to maybe do some simple benchmarks? Even a time $(curl ...) sort of thing would be great

Yeah I really want benchmarks in this area. There are still a bunch more optimizations on the table, I'm planning to cut an issue today outlining them, with its first step to add a criterium benchmark.

In retrospect it would have been nice to quantity both the line length where the previously implementation exploded, and long that took. I could also have added vec to the old implementation and given it a fair comparison, and also compared to the time when we still had sampling.

Since the subjective difference is so large, I'm going to save this kind of comparison for future. I even have 3 different sampling algorithms hiding on in my no-commit that I'd be curious to try out for even more ludicrous file sizes...

Co-authored-by: Cal Herries <39073188+calherries@users.noreply.github.com>
@crisptrutski crisptrutski changed the base branch from __experiment__2024_03_6__float_or_int to upload-integer-to-float March 7, 2024 20:26
@crisptrutski crisptrutski merged commit bcac3a3 into upload-integer-to-float Mar 7, 2024
3 of 4 checks passed
@crisptrutski crisptrutski deleted the optimize-upload-inference branch March 7, 2024 21:40
crisptrutski added a commit that referenced this pull request Mar 7, 2024
crisptrutski added a commit that referenced this pull request Mar 7, 2024
crisptrutski added a commit that referenced this pull request Mar 8, 2024
crisptrutski added a commit that referenced this pull request Mar 8, 2024
crisptrutski added a commit that referenced this pull request Mar 8, 2024
crisptrutski added a commit that referenced this pull request Mar 8, 2024
Contains two child PRs:

* Optimize upload type inference (#39741)
* Take upload settings as an argument for easier testing (#39706)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
.Team/BackendComponents also known as BEC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants