[DataLoader] Fix for batch sizes greater than 8192 not being honored when a table transformer is used #568
Merged
DataFusion defaults to batch_size=8192. When the user sets a larger batch_size, transforms fragment batches at the 8192 boundary. Pass the user's batch_size through to the DataFusion SessionConfig so output batches respect the requested size.
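The fragmentation described above can be illustrated with a small arithmetic sketch. This is a simplified model, not DataFusion code: it assumes the engine slices each input into full pieces of its execution batch size plus one remainder, and the name `fragment_sizes` is hypothetical.

```python
def fragment_sizes(num_rows: int, engine_batch_size: int = 8192) -> list[int]:
    """Model how an input of num_rows is cut at the engine's batch boundary.

    Simplified assumption: full pieces of engine_batch_size rows, followed by
    one smaller remainder piece if the row count does not divide evenly.
    """
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    if engine_batch_size <= 0:
        raise ValueError("engine_batch_size must be positive")
    full_pieces, remainder = divmod(num_rows, engine_batch_size)
    sizes = [engine_batch_size] * full_pieces
    if remainder:
        sizes.append(remainder)
    return sizes


# With the default of 8192, a requested 10,000-row batch is split in two.
print(fragment_sizes(10_000))
# With the engine batch size raised to the requested size, it stays whole.
print(fragment_sizes(10_000, engine_batch_size=10_000))
```

Under this model, raising the engine batch size to at least the requested `batch_size` is what keeps each transformed batch in one piece.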
Pull request overview
This PR updates the Python DataLoader transform path so split-level DataFusion sessions can use the caller's requested batch_size, addressing batch fragmentation when table transformers are active. It fits into the dataloader's split execution layer, where each DataLoaderSplit reads Arrow batches and optionally runs transform SQL through DataFusion.
Changes:
- Add optional `batch_size` plumbing to `_create_transform_session` and apply it to the DataFusion `SessionConfig`.
- Pass `DataLoaderSplit._batch_size` into the transform-session creation path in `DataLoaderSplit.__iter__`.
- Add a regression test covering transformed batches larger than DataFusion's default batch size.
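A minimal sketch of what this plumbing could look like, assuming the `datafusion` Python package is installed. The helper names `transform_session_options` and `create_transform_session` are hypothetical and are not the PR's actual code; only the `datafusion.execution.batch_size` key and the `SessionConfig`/`SessionContext` classes come from the description above.

```python
from typing import Optional

BATCH_SIZE_KEY = "datafusion.execution.batch_size"


def transform_session_options(batch_size: Optional[int] = None) -> dict[str, str]:
    """Build the option overrides for a transform session (hypothetical helper).

    Returns an empty dict when no batch size is requested, so DataFusion keeps
    its own default.
    """
    if batch_size is None:
        return {}
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    # DataFusion configuration values are passed as strings.
    return {BATCH_SIZE_KEY: str(batch_size)}


def create_transform_session(batch_size: Optional[int] = None):
    """Create a SessionContext whose execution batch size follows batch_size."""
    # Imported lazily so the option-building helper can be used without datafusion.
    from datafusion import SessionConfig, SessionContext

    config = SessionConfig(transform_session_options(batch_size))
    return SessionContext(config)
```

A caller in the split iterator would then pass its own batch size, for example `create_transform_session(batch_size=10_000)`.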
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py` | Propagates split batch_size into the DataFusion session used for per-batch transforms. |
| `integrations/python/dataloader/tests/test_data_loader_split.py` | Adds a regression test asserting transformed output is not fragmented at DataFusion's default batch boundary. |
ShreyeshArangath added a commit to ShreyeshArangath/openhouse that referenced this pull request on May 30, 2026

…size

linkedin#568 propagated the dataloader's read batch_size into datafusion.execution.batch_size. When a transform is active, a small batch_size then forces DataFusion to slice each transform input into batch_size-row pieces and invoke the (expensive) UDFs once per piece -- the per-batch overhead collapses throughput on large rows. This is the perf regression observed with batch_size=128 on embedding-sized rows (15s read_batch, starved prefetch queue, jittery s/step).

- _create_transform_session now only RAISES execution.batch_size to honor a large requested batch_size (the case linkedin#568 fixed); it never LOWERS it below DataFusion's 8192 default by default.
- Adds a tunable, engine-agnostic transform_batch_size parameter so callers can set the transform execution batch size explicitly (e.g. trade throughput for memory).
- Adds regression + resolution + end-to-end wiring tests.
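The raise-only rule in this commit message can be sketched as a small resolution function. This is an illustration under assumptions, not the commit's code: the function name `resolve_transform_batch_size` is hypothetical, and the precedence shown (an explicit `transform_batch_size` wins, otherwise the read batch size can only raise the 8192 default) is inferred from the bullets above.

```python
from typing import Optional

DATAFUSION_DEFAULT_BATCH_SIZE = 8192


def resolve_transform_batch_size(
    read_batch_size: Optional[int],
    transform_batch_size: Optional[int] = None,
) -> int:
    """Pick the execution batch size for the transform session (hypothetical).

    An explicit transform_batch_size is used as given. Otherwise the read
    batch size may raise the engine default but never lowers it.
    """
    if transform_batch_size is not None:
        if transform_batch_size <= 0:
            raise ValueError("transform_batch_size must be positive")
        return transform_batch_size
    if read_batch_size is None:
        return DATAFUSION_DEFAULT_BATCH_SIZE
    return max(read_batch_size, DATAFUSION_DEFAULT_BATCH_SIZE)


# A small read batch size no longer shrinks the transform batches.
print(resolve_transform_batch_size(128))
# A large read batch size is still honored.
print(resolve_transform_batch_size(10_000))
# Callers can opt in to a smaller value explicitly.
print(resolve_transform_batch_size(128, transform_batch_size=512))
```

Keeping the explicit parameter separate from the read batch size lets a caller trade throughput for memory without changing how many rows each read returns.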
Summary

DataFusion defaults its internal `batch_size` to 8192. When a DataLoader user sets `batch_size` larger than 8192 and a table transformer is active, the DataFusion transform step silently fragments output batches at the 8192 boundary (e.g. a 10,000-row batch becomes `[8192, 1808]`). This PR propagates the user's `batch_size` to the DataFusion `SessionConfig` so output batches respect the requested size.

Changes

`_create_transform_session` now accepts an optional `batch_size` parameter. When set, it configures `datafusion.execution.batch_size` on the `SessionConfig` before creating the `SessionContext`. The call site in `DataLoaderSplit.__iter__` passes through `self._batch_size`.

Testing Done

Added `test_split_batch_size_honored_with_transform`

Additional Information