Skip to content

[DataLoader] Fix for batch sizes greater than 8192 not being honored when a table transformer is used#568

Merged
robreeves merged 3 commits into
linkedin:mainfrom
robreeves:batchsize
May 4, 2026
Merged

[DataLoader] Fix for batch sizes greater than 8192 not being honored when a table transformer is used#568
robreeves merged 3 commits into
linkedin:mainfrom
robreeves:batchsize

Conversation

@robreeves
Copy link
Copy Markdown
Collaborator

@robreeves robreeves commented May 4, 2026

Summary

DataFusion defaults its internal batch_size to 8192. When a DataLoader user sets batch_size larger than 8192 and a table transformer is active, the DataFusion transform step silently fragments output batches at the 8192 boundary (e.g. a 10,000-row batch becomes [8192, 1808]). This PR propagates the user's batch_size to the DataFusion SessionConfig so output batches respect the requested size.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

_create_transform_session now accepts an optional batch_size parameter. When set, it configures datafusion.execution.batch_size on the SessionConfig before creating the SessionContext. The call site in DataLoaderSplit.__iter__ passes through self._batch_size.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Added test_split_batch_size_honored_with_transform

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

robreeves added 3 commits May 4, 2026 18:19
DataFusion defaults to batch_size=8192. When the user sets a larger
batch_size, transforms fragment batches at the 8192 boundary. Pass
the user's batch_size through to the DataFusion SessionConfig so
output batches respect the requested size.
@robreeves robreeves changed the title [DataLoader] Propagate batch_size to DataFusion SessionConfig [DataLoader] Fix for batch sizes greater than 8192 not being honored when a table transformer is used May 4, 2026
@robreeves robreeves requested a review from Copilot May 4, 2026 18:26
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Python DataLoader transform path so split-level DataFusion sessions can use the caller's requested batch_size, addressing batch fragmentation when table transformers are active. It fits into the dataloader's split execution layer, where each DataLoaderSplit reads Arrow batches and optionally runs transform SQL through DataFusion.

Changes:

  • Add optional batch_size plumbing to _create_transform_session and apply it to the DataFusion SessionConfig.
  • Pass DataLoaderSplit._batch_size into the transform-session creation path in DataLoaderSplit.__iter__.
  • Add a regression test covering transformed batches larger than DataFusion's default batch size.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py Propagates split batch_size into the DataFusion session used for per-batch transforms.
integrations/python/dataloader/tests/test_data_loader_split.py Adds a regression test asserting transformed output is not fragmented at DataFusion's default batch boundary.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@robreeves robreeves marked this pull request as ready for review May 4, 2026 18:41
Copy link
Copy Markdown
Collaborator

@ShreyeshArangath ShreyeshArangath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@robreeves robreeves merged commit 4b29235 into linkedin:main May 4, 2026
6 of 7 checks passed
ShreyeshArangath added a commit to ShreyeshArangath/openhouse that referenced this pull request May 30, 2026
…size

linkedin#568 propagated the dataloader's read batch_size into datafusion.execution.batch_size.
When a transform is active, a small batch_size then forces DataFusion to slice each
transform input into batch_size-row pieces and invoke the (expensive) UDFs once per
piece -- the per-batch overhead collapses throughput on large rows. This is the perf
regression observed with batch_size=128 on embedding-sized rows (15s read_batch,
starved prefetch queue, jittery s/step).

- _create_transform_session now only RAISES execution.batch_size to honor a large
  requested batch_size (the case linkedin#568 fixed); it never LOWERS it below DataFusion's
  8192 default by default.
- Adds a tunable, engine-agnostic transform_batch_size parameter so callers can set
  the transform execution batch size explicitly (e.g. trade throughput for memory).
- Adds regression + resolution + end-to-end wiring tests.
ShreyeshArangath added a commit to ShreyeshArangath/openhouse that referenced this pull request May 30, 2026
…size

linkedin#568 propagated the dataloader's read batch_size into datafusion.execution.batch_size.
When a transform is active, a small batch_size then forces DataFusion to slice each
transform input into batch_size-row pieces and invoke the (expensive) UDFs once per
piece -- the per-batch overhead collapses throughput on large rows. This is the perf
regression observed with batch_size=128 on embedding-sized rows (15s read_batch,
starved prefetch queue, jittery s/step).

- _create_transform_session now only RAISES execution.batch_size to honor a large
  requested batch_size (the case linkedin#568 fixed); it never LOWERS it below DataFusion's
  8192 default by default.
- Adds a tunable, engine-agnostic transform_batch_size parameter so callers can set
  the transform execution batch size explicitly (e.g. trade throughput for memory).
- Adds regression + resolution + end-to-end wiring tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants