feat: add scanner.plan_splits function #5792
hamersaw wants to merge 6 commits into lance-format:main from
Conversation
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Before I plumb this through to the Lance Spark connector, I just wanted to get some input from interested parties: @majin1102 / @fangbo, I know you expressed interest in a solution in this thread. This does currently work with zone maps. @fangbo, you'll recognize a large bit of code from your PR - thanks! @Jay-ju, IIUC your PR here is targeted at estimating row counts to achieve similar ends. I really like the idea of index hinting, as in my testing I noticed the filtering index choices were not always what I expected them to be.
python/python/lance/dataset.py
Outdated
    return self._scanner.analyze_plan()

def plan_splits(
    self, max_split_size_bytes: Optional[int] = None
Will need to update this to include both max_split_size_bytes and max_row_count options, with one trumping the other if both are provided. I'm interested in whether people think this paradigm is useful? My intuition is that since we are estimating row sizes based on the schema, we could be VERY wrong (we just use 64B for everything that is not a known size - a string / blob could be 1B - 1M+). In these scenarios a user will know their data better and can use a max_row_count to target a partition size. So basically, hopefully in most use-cases we're close and estimation works well, but there are knobs to fine-tune in the other cases.
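To make the two-knob interaction concrete, here is a minimal sketch of how the limits could be resolved into a single per-split row budget. The function name, parameters, and the 64-byte fallback constant are illustrative assumptions for this discussion, not the actual Lance API; the "stricter limit wins" rule follows the later commit that takes the min when both are provided.

```python
from typing import Optional

# Assumed fallback for variable-width fields (string / blob) whose size
# is unknown from the schema alone; illustrative only.
DEFAULT_UNKNOWN_FIELD_SIZE = 64

def effective_max_rows(
    estimated_row_size_bytes: int,
    max_split_size_bytes: Optional[int] = None,
    max_row_count: Optional[int] = None,
) -> Optional[int]:
    """Resolve the byte limit and the row limit into one row budget.

    The byte limit is converted to rows via the schema-based size
    estimate; if both knobs are set, the stricter (smaller) one wins.
    Returns None when neither knob is provided.
    """
    candidates = []
    if max_split_size_bytes is not None:
        candidates.append(max_split_size_bytes // estimated_row_size_bytes)
    if max_row_count is not None:
        candidates.append(max_row_count)
    return min(candidates) if candidates else None
```

Under this sketch, a user whose rows are dominated by large blobs (where the 64B estimate is badly wrong) would simply pass max_row_count and ignore the byte-based estimate entirely.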
…nd use the min if both provided Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
} else {
    Arc::new(self.dataset.fragments().as_ref().clone())
};
Does this change need to take care of the scanner range, e.g. scan_range_before_filter?
Great point. I'm going to have to look into this, but I think it probably should!
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Closing in favor of #5863.
This PR adds a plan_splits function to the Scanner struct. The goal is for this to serve as a single endpoint where distributed compute frameworks can effectively partition a Lance dataset for parallelized processing. The main goals are:

(1) Prune fragments that do not satisfy a filter (if one exists): We use an index lookup to determine which fragments contain matching rows (and which do not) to prune unnecessary fragments.

(2) Bin-pack fragments into splits: Distributed compute frameworks typically work best with a "sweet-spot" partition size. Within Lance, this means a partition should typically contain multiple fragments. We expose a user-configurable strategy, namely max row count or split size, and then estimate row sizes based on the schema to determine the size of the resultant split.
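The bin-packing step described above can be sketched as a simple greedy pass over the (already filter-pruned) fragments. This is an illustrative model, not the actual Lance implementation: the Fragment dataclass, function name, and greedy strategy are assumptions for the sake of discussion.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Fragment:
    """Minimal stand-in for a Lance fragment: an id and a row count."""
    id: int
    num_rows: int

def plan_splits(
    fragments: List[Fragment],
    estimated_row_size_bytes: int,
    max_split_size_bytes: int,
) -> List[List[int]]:
    """Greedily pack fragments into splits of roughly the target size.

    Each fragment's size is estimated as num_rows * the schema-derived
    per-row estimate; a split is flushed once adding the next fragment
    would push it past max_split_size_bytes.
    """
    splits: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for frag in fragments:
        frag_bytes = frag.num_rows * estimated_row_size_bytes
        if current and current_bytes + frag_bytes > max_split_size_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(frag.id)
        current_bytes += frag_bytes
    if current:
        splits.append(current)
    return splits
```

A greedy pass like this preserves fragment order (useful for ordered scans) at the cost of occasionally uneven splits; a max_row_count strategy would work the same way with the byte estimate replaced by a row counter.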