[DataLoader] Fix for batch sizes greater than 8192 not being honored when a table transformer is used by robreeves · Pull Request #568 · linkedin/openhouse

robreeves · 2026-05-04T18:19:37Z

Summary

DataFusion defaults its internal batch_size to 8192. When a DataLoader user sets batch_size larger than 8192 and a table transformer is active, the DataFusion transform step silently fragments output batches at the 8192 boundary (e.g. a 10,000-row batch becomes [8192, 1808]). This PR propagates the user's batch_size to the DataFusion SessionConfig so output batches respect the requested size.
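To make the fragmentation concrete, here is a minimal sketch of the arithmetic only. `fragment_sizes` is a hypothetical helper, not part of the dataloader or DataFusion; it assumes the engine re-chunks each input into pieces of at most its configured batch size, which matches the [8192, 1808] example above.

```python
from typing import List


def fragment_sizes(num_rows: int, engine_batch_size: int = 8192) -> List[int]:
    """Return the row counts of the pieces a num_rows batch is split into.

    Hypothetical illustration: assumes the engine emits chunks of at most
    engine_batch_size rows, in order, until the input is exhausted.
    """
    if engine_batch_size <= 0:
        raise ValueError("engine_batch_size must be positive")
    sizes = []
    remaining = num_rows
    while remaining > 0:
        take = min(remaining, engine_batch_size)
        sizes.append(take)
        remaining -= take
    return sizes


# With the default engine batch size, a 10,000-row batch is fragmented.
print(fragment_sizes(10_000))          # [8192, 1808]
# Once the engine batch size matches the request, the batch stays whole.
print(fragment_sizes(10_000, 10_000))  # [10000]
```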

Changes

_create_transform_session now accepts an optional batch_size parameter. When set, it configures datafusion.execution.batch_size on the SessionConfig before creating the SessionContext. The call site in DataLoaderSplit.__iter__ passes through self._batch_size.
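A rough sketch of what this plumbing might look like follows. The function names and bodies are illustrative assumptions, not the code from this PR; the only details taken from the description are the optional batch_size parameter and the datafusion.execution.batch_size key. It assumes the datafusion Python package exposes `SessionConfig.set(key, value)` and `SessionContext(config)`, and imports it lazily so the option-building helper can be used without it.

```python
from typing import Dict, Optional


def transform_session_options(batch_size: Optional[int] = None) -> Dict[str, str]:
    """Build the config overrides for a transform session (hypothetical helper).

    When batch_size is None, no override is produced and the engine keeps
    its own default.
    """
    options: Dict[str, str] = {}
    if batch_size is not None:
        options["datafusion.execution.batch_size"] = str(batch_size)
    return options


def create_transform_session(batch_size: Optional[int] = None):
    """Create a DataFusion session with the requested batch size applied.

    Illustrative stand-in for _create_transform_session; requires the
    datafusion package to be installed.
    """
    from datafusion import SessionConfig, SessionContext  # lazy import

    config = SessionConfig()
    for key, value in transform_session_options(batch_size).items():
        config = config.set(key, value)
    return SessionContext(config)
```

A call site along the lines of `create_transform_session(batch_size=self._batch_size)` would then mirror the pass-through described for DataLoaderSplit.__iter__.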

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

Added test_split_batch_size_honored_with_transform

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

DataFusion defaults to batch_size=8192. When the user sets a larger batch_size, transforms fragment batches at the 8192 boundary. Pass the user's batch_size through to the DataFusion SessionConfig so output batches respect the requested size.

Copilot

Pull request overview

This PR updates the Python DataLoader transform path so split-level DataFusion sessions can use the caller's requested batch_size, addressing batch fragmentation when table transformers are active. It fits into the dataloader's split execution layer, where each DataLoaderSplit reads Arrow batches and optionally runs transform SQL through DataFusion.

Changes:

Add optional batch_size plumbing to _create_transform_session and apply it to the DataFusion SessionConfig.
Pass DataLoaderSplit._batch_size into the transform-session creation path in DataLoaderSplit.__iter__.
Add a regression test covering transformed batches larger than DataFusion's default batch size.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py` | Propagates split `batch_size` into the DataFusion session used for per-batch transforms. |
| `integrations/python/dataloader/tests/test_data_loader_split.py` | Adds a regression test asserting transformed output is not fragmented at DataFusion's default batch boundary. |

ShreyeshArangath

Good catch!

…size

linkedin#568 propagated the dataloader's read batch_size into datafusion.execution.batch_size. When a transform is active, a small batch_size then forces DataFusion to slice each transform input into batch_size-row pieces and invoke the (expensive) UDFs once per piece -- the per-batch overhead collapses throughput on large rows. This is the perf regression observed with batch_size=128 on embedding-sized rows (15s read_batch, starved prefetch queue, jittery s/step).

- _create_transform_session now only RAISES execution.batch_size to honor a large requested batch_size (the case linkedin#568 fixed); it never LOWERS it below DataFusion's 8192 default by default.
- Adds a tunable, engine-agnostic transform_batch_size parameter so callers can set the transform execution batch size explicitly (e.g. trade throughput for memory).
- Adds regression + resolution + end-to-end wiring tests.
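The resolution rule described in this follow-up could be sketched roughly as below. `resolve_transform_batch_size` and its signature are hypothetical and are not taken from #624; the sketch only encodes the stated behavior: an explicit transform_batch_size wins, otherwise the read batch_size can raise the engine batch size above the 8192 default but never lower it.

```python
from typing import Optional

# DataFusion's default execution batch size, as stated in the PR description.
DATAFUSION_DEFAULT_BATCH_SIZE = 8192


def resolve_transform_batch_size(
    read_batch_size: Optional[int],
    transform_batch_size: Optional[int] = None,
    engine_default: int = DATAFUSION_DEFAULT_BATCH_SIZE,
) -> int:
    """Pick the execution batch size for the transform engine (hypothetical)."""
    if transform_batch_size is not None:
        # Explicit override: the caller chooses the trade-off, even below default.
        return transform_batch_size
    if read_batch_size is None:
        return engine_default
    # Only raise above the engine default; never lower it implicitly.
    return max(read_batch_size, engine_default)


print(resolve_transform_batch_size(128))                            # 8192
print(resolve_transform_batch_size(10_000))                         # 10000
print(resolve_transform_batch_size(128, transform_batch_size=512))  # 512
```

Under this rule, the batch_size=128 case from the regression report keeps the engine at 8192 rows per transform invocation, while the 10,000-row case fixed by #568 is still honored.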

robreeves added 3 commits May 4, 2026 18:19

Read DataFusion default batch_size from Config instead of hardcoding

c93412c

Simplify batch_size transform test assertions

ce18c24

robreeves changed the title ~~[DataLoader] Propagate batch_size to DataFusion SessionConfig~~ [DataLoader] Fix for batch sizes greater than 8192 not being honored when a table transformer is used May 4, 2026

robreeves requested a review from Copilot May 4, 2026 18:26

Copilot started reviewing on behalf of robreeves May 4, 2026 18:27

Copilot AI reviewed May 4, 2026

Comment thread integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py

robreeves marked this pull request as ready for review May 4, 2026 18:41

robreeves requested review from ShreyeshArangath, cbb330 and sumedhsakdeo May 4, 2026 18:41

ShreyeshArangath approved these changes May 4, 2026

robreeves merged commit 4b29235 into linkedin:main May 4, 2026
6 of 7 checks passed

ShreyeshArangath mentioned this pull request May 30, 2026

[dataloader] Decouple transform execution batch size from read batch_size #624

Open

17 tasks

robreeves merged 3 commits into linkedin:main from robreeves:batchsize
