
feat: thread data_size through decode pipeline#6391

Merged
westonpace merged 1 commit into lance-format:main from westonpace:feat/thread-data-size-through-decode
Apr 2, 2026

Conversation

@westonpace
Member

Summary

  • Threads accurate data_size (in bytes) from DataBlock::data_size() at the encoding layer through the full decode pipeline to the final RecordBatch
  • Implements DataBlock::data_size() for Struct and Dictionary variants (were todo!())
  • Uses the accurate data size for the "batch is too large" warning instead of Arrow's get_array_memory_size(), which over-reports due to shared page buffers
  • Changes DecodeArrayTask::decode() to return (ArrayRef, u64) so data size flows through naturally
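
The changed `decode()` signature can be sketched with simplified stand-in types (a minimal sketch only: `ArrayRef` here is a plain `Vec`, not arrow's `ArrayRef`, and `PrimitiveDecodeTask` is a hypothetical implementor, not one of the real lance-encoding tasks):

```rust
use std::mem::size_of;

// Stand-in for arrow's ArrayRef, to keep the sketch self-contained.
type ArrayRef = Vec<i64>;

trait DecodeArrayTask {
    // After this PR, decode returns the array together with its
    // accurate data size in bytes.
    fn decode(self: Box<Self>) -> (ArrayRef, u64);
}

// Hypothetical task that "decodes" a primitive i64 array.
struct PrimitiveDecodeTask {
    values: Vec<i64>,
}

impl DecodeArrayTask for PrimitiveDecodeTask {
    fn decode(self: Box<Self>) -> (ArrayRef, u64) {
        // The size of the actual data, not the allocated page buffers.
        let data_size = (self.values.len() * size_of::<i64>()) as u64;
        (self.values, data_size)
    }
}

fn main() {
    let task: Box<dyn DecodeArrayTask> = Box::new(PrimitiveDecodeTask {
        values: vec![1, 2, 3, 4],
    });
    let (arr, data_size) = task.decode();
    println!("{} values, {} bytes", arr.len(), data_size); // 4 values, 32 bytes
}
```

Because the size is computed where the data is decoded, callers receive it for free as the tuple's second element rather than re-deriving it from the finished array.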

Test plan

  • All 364 existing lance-encoding tests pass
  • cargo clippy -p lance-encoding --tests -- -D warnings clean
  • cargo clippy -p lance-file --tests -- -D warnings clean
  • cargo fmt --all -- --check clean

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Apr 2, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@westonpace westonpace changed the title Thread data_size through decode pipeline feat: thread data_size through decode pipeline Apr 2, 2026
@github-actions github-actions bot added the enhancement New feature or request label Apr 2, 2026
The decode pipeline now tracks the actual data size (in bytes) of decoded
arrays from the encoding layer (DataBlock::data_size()) through to the
final RecordBatch. This replaces the use of Arrow's get_array_memory_size()
for the "batch is too large" warning, providing more accurate byte counts
that don't over-report due to shared page buffers.

Changes:
- Add data_size field to DecodedArray
- Implement DataBlock::data_size() for Struct and Dictionary (were todo!())
- Change DecodeArrayTask::decode() to return (ArrayRef, u64)
- Populate data_size in all 5 StructuralDecodeArrayTask implementations
- Update all 6 legacy DecodeArrayTask implementations to return (arr, 0)
- Thread data_size through NextDecodeTask::into_batch()
- Use data_size for the batch-too-large warning instead of Arrow overhead

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
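
The Struct and Dictionary cases can be illustrated with a hypothetical, heavily simplified `DataBlock` (the variant names and fields below are illustrative, not the real lance-encoding definitions): a struct's data size is the sum of its children's sizes, and a dictionary's covers both the indices and the dictionary items.

```rust
// Simplified sketch of a DataBlock enum; the real type has more variants
// and carries actual buffers rather than a plain byte count.
enum DataBlock {
    Fixed { bytes: u64 },
    Struct { children: Vec<DataBlock> },
    Dictionary { indices: Box<DataBlock>, items: Box<DataBlock> },
}

impl DataBlock {
    fn data_size(&self) -> u64 {
        match self {
            DataBlock::Fixed { bytes } => *bytes,
            // A struct holds no data of its own; sum the children.
            DataBlock::Struct { children } => {
                children.iter().map(|c| c.data_size()).sum()
            }
            // A dictionary's data is its indices plus the item values.
            DataBlock::Dictionary { indices, items } => {
                indices.data_size() + items.data_size()
            }
        }
    }
}

fn main() {
    let block = DataBlock::Struct {
        children: vec![
            DataBlock::Fixed { bytes: 128 },
            DataBlock::Dictionary {
                indices: Box::new(DataBlock::Fixed { bytes: 16 }),
                items: Box::new(DataBlock::Fixed { bytes: 64 }),
            },
        ],
    };
    println!("data_size = {}", block.data_size()); // 128 + 16 + 64 = 208
}
```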
```rust
// thread for a long time. By spawning it as a new task, we allow Tokio's
// worker threads to keep making progress.
tokio::spawn(async move { next_task.into_batch(emitted_batch_size_warning) })
let (batch, _data_size) =
```
Member Author

I plan on using this in an upcoming PR

@westonpace
Member Author

Ostensibly this PR stands on its own because it makes the warning log message that we print more accurate. The true reason for the change, though, is to enable #6388 to be more accurate.

@westonpace westonpace force-pushed the feat/thread-data-size-through-decode branch from dbd186b to ac66494 on April 2, 2026 at 17:31
Contributor

@wjones127 wjones127 left a comment

I'm excited for byte-size batches!

@westonpace westonpace merged commit 36b344f into lance-format:main Apr 2, 2026
29 checks passed

Labels

enhancement New feature or request

2 participants