feat: add scanner.plan_splits function #5792

Closed

hamersaw wants to merge 6 commits into lance-format:main from hamersaw:feat/scanner_plan_splits

Conversation

@hamersaw hamersaw commented Jan 22, 2026

This PR adds a plan_splits function to the Scanner struct. The goal is for this to serve as a single endpoint where distributed compute frameworks can effectively partition a Lance dataset for parallelized processing. The main goals are:
(1) Prune fragments that do not satisfy a filter (if one exists): we use an index lookup to determine which fragments contain matching rows (and which do not) so unnecessary fragments can be pruned.
(2) Bin-pack fragments into splits: distributed compute frameworks typically work best with a "sweet-spot" partition size. Within Lance, this means a partition should typically contain multiple fragments. We expose a user-configurable strategy, namely max row count or max split size, and then estimate row sizes based on the schema to determine the size of the resultant split.
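The bin-packing step in (2) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the function name `pack_fragments`, the `(fragment_id, estimated_bytes)` input shape, and the greedy in-order strategy are all assumptions for the example.

```python
def pack_fragments(fragment_sizes, max_split_size_bytes):
    """Greedily group (fragment_id, estimated_bytes) pairs into splits.

    Hypothetical sketch only -- names and strategy are illustrative,
    not the PR's real API.
    """
    splits, current, current_bytes = [], [], 0
    for frag_id, est_bytes in fragment_sizes:
        # Start a new split once adding this fragment would exceed the
        # target (a split always keeps at least one fragment, even an
        # oversized one).
        if current and current_bytes + est_bytes > max_split_size_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(frag_id)
        current_bytes += est_bytes
    if current:
        splits.append(current)
    return splits

# Example: 100 MiB target, four fragments of ~60/50/30/80 MiB.
MiB = 1024 * 1024
print(pack_fragments(
    [(0, 60 * MiB), (1, 50 * MiB), (2, 30 * MiB), (3, 80 * MiB)],
    100 * MiB,
))  # → [[0], [1, 2], [3]]
```

Fragments stay in dataset order here, so each split covers a contiguous run of fragments; a size-sorted (first-fit-decreasing) packing would pack tighter but scatter the row order.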

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
@github-actions github-actions bot added enhancement New feature or request python labels Jan 22, 2026
@hamersaw commented:

Before I plumb this through to the Lance Spark connector, I just wanted to get some input from interested parties:

@majin1102 / @fangbo, I know you expressed interest in a solution in this thread. This does currently work with zone maps. @fangbo, you'll recognize a large portion of the code from your PR - thanks!

@Jay-ju IIUC your PR here is targeted at estimating row counts to achieve similar ends. I really like the idea of index hinting, as in my testing I noticed filtering index choices were not always what I expected them to be.

        return self._scanner.analyze_plan()

    def plan_splits(
        self, max_split_size_bytes: Optional[int] = None
@hamersaw commented:

Will need to update this to include both max_split_size_bytes and max_row_count options, with one trumping the other if both are provided. I'm interested in whether people think this paradigm is useful. My intuition is that since we are estimating row sizes based on the schema, we could be VERY wrong (we just use 64B for everything that is not a known size - a string / blob could be 1B - 1M+). In these scenarios a user will know their data better and can use max_row_count to target a partition size. So basically, hopefully in most use-cases we're close and estimation works well, but there are knobs to fine-tune in the other cases.
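A minimal sketch of what is described in this comment. The 64B fallback for variable-size fields and the take-the-stricter-knob behavior come from this comment and the later commit note ("use the min if both provided"); everything else (function names, the type-name keys, the exact fixed widths) is hypothetical, not the PR's actual code.

```python
# Fixed byte widths for known-size types; anything else (string, blob, ...)
# falls back to a 64B guess, which is why the estimate can be very wrong.
FIXED_WIDTHS = {"int32": 4, "int64": 8, "float32": 4, "float64": 8, "bool": 1}
VARIABLE_WIDTH_FALLBACK = 64  # bytes

def estimate_row_size(schema_types):
    """Sum per-field byte widths, guessing 64B for unknown-size fields."""
    return sum(FIXED_WIDTHS.get(t, VARIABLE_WIDTH_FALLBACK) for t in schema_types)

def effective_max_rows(row_size, max_split_size_bytes=None, max_row_count=None):
    """Resolve the two knobs: when both are given, the stricter (min) wins."""
    candidates = []
    if max_split_size_bytes is not None:
        candidates.append(max_split_size_bytes // row_size)
    if max_row_count is not None:
        candidates.append(max_row_count)
    return min(candidates) if candidates else None

row_size = estimate_row_size(["int64", "string", "float32"])  # 8 + 64 + 4 = 76
print(effective_max_rows(row_size, max_split_size_bytes=7600, max_row_count=50))
# → 50 (7600 // 76 = 100 rows by size, but the row-count knob is stricter)
```

The min makes max_row_count a hard ceiling the user can set when they know the schema-based estimate is off for their data.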

@hamersaw hamersaw marked this pull request as ready for review January 23, 2026 15:54
…nd use the min if both provided

    } else {
        Arc::new(self.dataset.fragments().as_ref().clone())
    };

A contributor asked:

does this change need to take care of the scanner range e.g. scan_range_before_filter?

@hamersaw hamersaw commented Jan 27, 2026:

Great point. I'm going to have to look into this, but I think it probably should!

@github-actions github-actions bot added the java label Jan 30, 2026
@hamersaw commented:

closing in favor of #5863

@hamersaw hamersaw closed this Jan 30, 2026
@hamersaw hamersaw deleted the feat/scanner_plan_splits branch February 4, 2026 18:33