Optimize upload type inference #39741

crisptrutski · 2024-03-07T00:34:57Z

Description

This change significantly optimizes how we infer the types of the columns in a CSV upload.

Previously we relied on sampling the rows of large files to infer the schema. After removing the sampling, inference on large files would spin for a while before throwing a stack overflow.

With this change the reference file from the Slack conversation gets inferred almost instantly.

There are two small tricks:

Strict evaluation at every step, to avoid building a huge call stack.
Skip types we've ruled out already, i.e. start from the current type.
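
The two tricks above can be sketched in a few lines of Clojure. This is a minimal illustration rather than the code in src/metabase/upload.clj: the linear `type-order`, the `matches?` predicates, and the names `relax` and `infer-column-types` are all hypothetical, and it assumes every row has the same width (the real code uses u/map-all, which presumably exists to cope with ragged rows).

```clojure
(require '[clojure.string :as str])

;; Hypothetical ordering from most specific to most general type.
(def type-order [:int :float :text])

;; Hypothetical predicates deciding whether a raw CSV string fits a type.
(def matches?
  {:int   #(boolean (re-matches #"-?\d+" %))
   :float #(boolean (re-matches #"-?\d+(\.\d+)?" %))
   :text  (constantly true)})

(defn relax
  "Return the most specific type, starting from `current-type`, that accepts
  `value`. Types before `current-type` were already ruled out by earlier
  rows, so they are never re-checked."
  [current-type value]
  (if (str/blank? value)
    current-type ; blank cells carry no type information
    (->> (drop-while #(not= % current-type) type-order)
         (filter #((matches? %) value))
         first)))

(defn infer-column-types
  "Fold over the rows, relaxing each column's type as needed. `mapv` returns
  a realized vector at every step, so the accumulator never becomes a lazy
  seq wrapped around the previous step's lazy seq."
  [column-count rows]
  (reduce (fn [types row] (mapv relax types row))
          (vec (repeat column-count (first type-order)))
          rows))

(infer-column-types 2 [["1" "a"] ["2.5" "b"] ["" "c"]])
;; => [:float :text]
```

Swapping `mapv` for `map` in the reducing function would leave one unrealized lazy seq nested per row, and realizing that chain for a file with many rows is the kind of thing that can overflow the stack.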

Since I'd already started using this optimization for the append flow, this unifies the code nicely.

If we want to eke out a bit more performance here, we could make a u/mapv-all function using a vector transient, but even clojure.core/mapv uses laziness for more than 1 collection.
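
A rough sketch of what such a helper might look like, assuming u/map-all pads the shorter collection with nil instead of truncating. The name `mapv-all` comes from the sentence above; the two-collection arity and the loop below are illustrative guesses, not an existing Metabase function.

```clojure
(defn mapv-all
  "Hypothetical strict variant of a `map-all` helper: applies `f` across `xs`
  and `ys` position by position, padding the shorter one with nil, and
  accumulates the results in a transient vector."
  [f xs ys]
  (loop [xs  (seq xs)
         ys  (seq ys)
         acc (transient [])]
    (if (or xs ys)
      ;; `first` and `next` both return nil on an exhausted seq,
      ;; which gives the nil padding for free.
      (recur (next xs) (next ys) (conj! acc (f (first xs) (first ys))))
      (persistent! acc))))

(mapv-all vector [1 2 3] [:a])
;; => [[1 :a] [2 nil] [3 nil]]
```

Whether this beats `(vec (u/map-all ...))` by a meaningful margin would need measuring.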

How to verify

Upload the CSV file for which we previously incorrectly inferred an int column as float.
It loads quickly, and the column has type float, with uncoerced values.

crisptrutski · 2024-03-07T00:35:13Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Optimize upload type inference #39741 👈
Convert column to float when appending floats to an integer column #39493: 4 other dependent PRs (#39706, #39724, #39821 and 1 other)
Use toposort to simplify and fix CSV column type operations #39491
master

replay-io · 2024-03-07T01:04:01Z

Status	Complete ↗︎
Commit	`049abfc`
Results	⚠️ 3 Flaky ✅ 2329 Passed

src/metabase/upload.clj

tsmacdonald

I support the theory and uploaded a fat file that was fairly quick, but don't have enough background with previous perf benchmarks to make definitive claims.

Would be nice to maybe do some simple benchmarks? Even a time $(curl ...) sort of thing would be great

src/metabase/upload.clj

calherries · 2024-03-07T13:28:22Z

src/metabase/upload.clj

+      ;; It's important to realise this lazy sequence, because otherwise we can build a huge stack and overflow.
+      (vec (u/map-all type->value->type value-types row)))))
+
+(defn- relax-types [settings current-types rows]


relax-types should have either a name, docstring or malli schema that explains that its return value is a list of concrete column types.

The key thing to understand is that relax-types returns a sequence of concrete types. This is a key property that's used in append-csv!*'s implementation.

Now that you've extracted this function from append-csv!*, that information is lost if you're just reading the body of that function. To understand how append-csv!* works, you would have to understand the implementation of relax-types and see that it calls column-type.

That's exactly why column-types-from-rows has a malli signature added for its return type: to make this property easier to understand.

Maybe we should also rename column-type to be concrete-type, just for extra clarity.

Good points, especially that I was too cavalier in hiding the projection, which had nothing to do with relaxing. I'll think about renaming the type; I'm pulled in both directions. Made a note to come back to this after this chunky review stack is merged.

calherries

This is great btw. Faster with less code!

crisptrutski · 2024-03-07T13:40:09Z

Would be nice to maybe do some simple benchmarks? Even a time $(curl ...) sort of thing would be great

Yeah I really want benchmarks in this area. There are still a bunch more optimizations on the table, I'm planning to cut an issue today outlining them, with its first step to add a criterium benchmark.

In retrospect it would have been nice to quantify both the line length where the previous implementation exploded, and how long that took. I could also have added vec to the old implementation and given it a fair comparison, and also compared to the time when we still had sampling.

Since the subjective difference is so large, I'm going to save this kind of comparison for the future. I even have 3 different sampling algorithms hiding in my no-commit that I'd be curious to try out for even more ludicrous file sizes...
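
Until a criterium benchmark exists, a dependency-free timing harness can give a rough number. The sketch below is a crude stand-in: `mean-ms`, `rows`, and `strict-fold` are hypothetical names, the workload is synthetic rather than the real inference code, and a single warm-up call is far less careful about JIT effects than criterium.

```clojure
(defn mean-ms
  "Call `f` once to warm up, then `n` more times, and return the mean
  wall-clock time per call in milliseconds."
  [n f]
  (f)
  (let [start (System/nanoTime)]
    (dotimes [_ n] (f))
    (/ (- (System/nanoTime) start) 1e6 n)))

;; Hypothetical workload: 100,000 three-column rows.
(def rows (vec (repeat 100000 ["1" "2.5" "x"])))

(defn strict-fold
  "Fold over `rows` with a strict `mapv` at each step, tracking the longest
  cell seen per column as a placeholder for per-column inference work."
  []
  (reduce (fn [acc row] (mapv #(max %1 (count %2)) acc row))
          [0 0 0]
          rows))

(println "mean ms per run:" (mean-ms 5 strict-fold))
```

The same harness could wrap the old implementation with and without `vec`, or a sampling variant, to get the comparisons described above.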


crisptrutski requested a review from camsaul as a code owner March 7, 2024 00:34

This was referenced Mar 7, 2024

Use toposort to simplify and fix CSV column type operations #39491

Merged

Convert column to float when appending floats to an integer column #39493

Merged

This was referenced Mar 7, 2024

Take upload settings as an argument for easier testing #39706

Merged

Graph gymnastics to handle type coercing and promotion #39724

Closed

metabase-bot bot assigned crisptrutski Mar 7, 2024

crisptrutski requested a review from calherries March 7, 2024 00:35

metabase-bot bot added the .Team/BackendComponents also known as BEC label Mar 7, 2024

crisptrutski requested a review from a team March 7, 2024 00:35

crisptrutski removed the .Team/BackendComponents also known as BEC label Mar 7, 2024 — with Graphite App

crisptrutski force-pushed the optimize-upload-inference branch from e0f9f9d to 002b41f Compare March 7, 2024 00:37

crisptrutski changed the base branch from upload-integer-to-float to __experiment__2024_03_6__float_or_int March 7, 2024 00:37

crisptrutski force-pushed the __experiment__2024_03_6__float_or_int branch from e9d5982 to 26e1d0b Compare March 7, 2024 00:50

crisptrutski force-pushed the optimize-upload-inference branch from ab3b933 to e05d22f Compare March 7, 2024 00:50

crisptrutski force-pushed the __experiment__2024_03_6__float_or_int branch from 26e1d0b to 812244f Compare March 7, 2024 01:48

crisptrutski force-pushed the optimize-upload-inference branch from e05d22f to 049abfc Compare March 7, 2024 01:48

crisptrutski added the .Team/BackendComponents also known as BEC label Mar 7, 2024

tsmacdonald reviewed Mar 7, 2024

tsmacdonald approved these changes Mar 7, 2024

calherries reviewed Mar 7, 2024

calherries reviewed Mar 7, 2024

calherries approved these changes Mar 7, 2024

calherries reviewed Mar 7, 2024

crisptrutski and others added 4 commits March 7, 2024 13:25

Optimize upload type inference

c1a9784

whitespace

d7876cc

Fix inference with blank rows

794cb0f

Americanize

d358fbe

Co-authored-by: Tim Macdonald <tim@metabase.com>

Skip binding partial application

db2a9a2

Co-authored-by: Cal Herries <39073188+calherries@users.noreply.github.com>

crisptrutski force-pushed the optimize-upload-inference branch from 7b24b00 to db2a9a2 Compare March 7, 2024 20:26

crisptrutski changed the base branch from __experiment__2024_03_6__float_or_int to upload-integer-to-float March 7, 2024 20:26

crisptrutski merged commit bcac3a3 into upload-integer-to-float Mar 7, 2024
3 of 4 checks passed

crisptrutski deleted the optimize-upload-inference branch March 7, 2024 21:40

crisptrutski added a commit that referenced this pull request Mar 7, 2024

Optimize upload type inference (#39741)

c3bcc39

crisptrutski added a commit that referenced this pull request Mar 7, 2024

Optimize upload type inference (#39741)

28c9fa3

This was referenced Mar 7, 2024

Publish audit logs for CSV uploads #39821

Merged

Clean up type inference interface #39822

Merged

crisptrutski added a commit that referenced this pull request Mar 8, 2024

Optimize upload type inference (#39741)

0baf153

crisptrutski added a commit that referenced this pull request Mar 8, 2024

Optimize upload type inference (#39741)

7cefdba

crisptrutski added a commit that referenced this pull request Mar 8, 2024

Optimize upload type inference (#39741)

01fc46d

crisptrutski added a commit that referenced this pull request Mar 8, 2024

Upgrade column type when appending floats to an integer column (#39493)

a0d78d1

Contains two child PRs: * Optimize upload type inference (#39741) * Take upload settings as an argument for easier testing (#39706)
