feat(scan): deferred materialization for heavy columns#175

Merged
Xuanwo merged 2 commits into lance-format:main from jiaoew1991:feat/deferred-materialization
Mar 18, 2026

@jiaoew1991
Contributor

Summary

  • Adds C++-side deferred materialization for heavy columns (BLOBs, large embeddings) when DuckDB's filter cannot be pushed down to the Lance SDK
  • When the filter is pushed down, the Lance SDK's internal MaterializationStyle::Heuristic already handles late materialization, so this optimization does not activate
  • Adds Rust FFI lance_dataset_list_named_field_stats for stats-based heavy column detection
  • Controlled by SET lance_deferred_materialization = true|false (default: true)

How it works

  1. Detect heavy columns via bytes_on_disk stats (>1KB/row threshold) with type-based fallback (BLOB, LIST, ARRAY≥256, MAP)
  2. Phase 1: Scan only light columns + _rowid from Lance (without heavy cols)
  3. DuckDB filters the light columns post-scan
  4. Phase 2: take_rows(surviving_rowids, heavy_cols) fetches heavy data only for surviving rows
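
The two phases above can be sketched with a toy in-memory dataset. This is only an illustration of the idea, not the extension's code: `ToyDataset`, `ResultRow`, `deferred_scan`, and `id_contains_42` are hypothetical names, vector indices stand in for Lance's `_rowid`, and the `take_rows` member only imitates the role of the real take call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a Lance dataset: one light column
// and one heavy column, both addressed by row ID (here: the vector index).
struct ToyDataset {
    std::vector<int64_t> id;              // light column
    std::vector<std::string> blob;        // heavy column
    mutable std::size_t heavy_reads = 0;  // counts heavy values fetched

    // Phase 2 helper: fetch the heavy column only for the given row IDs.
    std::vector<std::string> take_rows(const std::vector<uint64_t>& rowids) const {
        std::vector<std::string> out;
        out.reserve(rowids.size());
        for (uint64_t r : rowids) {
            assert(r < blob.size());
            out.push_back(blob[r]);
            ++heavy_reads;
        }
        return out;
    }
};

struct ResultRow {
    int64_t id;
    std::string blob;
};

// Two-phase scan: filter on the light column first, then fetch heavy data.
std::vector<ResultRow> deferred_scan(const ToyDataset& ds,
                                     const std::function<bool(int64_t)>& filter) {
    // Phase 1: read only the light column plus the row ID and apply the filter.
    std::vector<uint64_t> surviving;
    for (uint64_t r = 0; r < ds.id.size(); ++r) {
        if (filter(ds.id[r])) surviving.push_back(r);
    }
    // Phase 2: take heavy values for the surviving rows only.
    std::vector<std::string> heavy = ds.take_rows(surviving);

    // Merge the deferred column back into the output rows.
    std::vector<ResultRow> out;
    out.reserve(surviving.size());
    for (std::size_t i = 0; i < surviving.size(); ++i) {
        out.push_back({ds.id[surviving[i]], std::move(heavy[i])});
    }
    return out;
}

// Filter modeled on the benchmark's non-pushable predicate:
// id::VARCHAR LIKE '%42%'
inline bool id_contains_42(int64_t id) {
    return std::to_string(id).find("42") != std::string::npos;
}
```

Calling `deferred_scan(ds, id_contains_42)` on a dataset with many rows touches the heavy column once per surviving row, which is the saving the PR targets; `heavy_reads` makes that visible in the toy model.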

Benchmark

20K rows, 5 fragments, 100KB blob/row, ~2GB on disk, non-pushable filter (id::VARCHAR LIKE '%42%'):

| Query selectivity | Without optimization | With optimization | Speedup |
|---|---|---|---|
| ~3% | 555 ms | 53 ms | 10.5x |
| ~10% | 558 ms | 162 ms | 3.4x |
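
For readers who want to check the numbers, the speedup column is simply baseline latency divided by optimized latency. The helper below also gives a rough estimate of heavy-column bytes fetched in phase 2. It assumes 100KB means 100,000 bytes and ignores light columns, metadata, and read amplification, so treat it as back-of-the-envelope only; `speedup` and `heavy_bytes_fetched` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Speedup = baseline latency / optimized latency.
inline double speedup(double baseline_ms, double optimized_ms) {
    assert(optimized_ms > 0.0);
    return baseline_ms / optimized_ms;
}

// Rough heavy-column bytes fetched in phase 2, assuming a fixed blob size
// per row (an assumption; real blobs and on-disk encoding will vary).
inline double heavy_bytes_fetched(double rows, double selectivity, double blob_bytes) {
    return rows * selectivity * blob_bytes;
}
```

Under these assumptions, 20K rows at ~3% selectivity means roughly 60 MB of blob data is fetched instead of the full ~2 GB, and ~10% selectivity means roughly 200 MB.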

Test plan

  • New test: test/sql/scan_deferred_materialization.test (37 assertions)
    • Plan verification: deferred ON for non-pushable filters, OFF for pushable/no-filter/light-only
    • Setting toggle: SET lance_deferred_materialization = false disables it
    • Correctness: correct data for single and multi-fragment datasets
  • All existing tests pass (42 test cases, 3904 assertions)

Closes #170

🤖 Generated with Claude Code

jiaoew1991 and others added 2 commits March 17, 2026 20:03
…shdown fails

When DuckDB's filter cannot be pushed down to Lance SDK (unsupported
functions, CAST+LIKE, join predicates, etc.), Lance reads all columns
eagerly since it has no filter to optimize around. This wastes I/O on
heavy columns (BLOBs, large vectors) for rows that DuckDB will filter
out post-scan.

This adds C++-side deferred materialization for this specific gap:

1. During scan init, detect heavy columns via `bytes_on_disk` stats
   (>1KB/row threshold) with type-based fallback (BLOB, LIST of
   numeric/BLOB, ARRAY>=256, MAP, recursive STRUCT).
2. Remove heavy columns from the Lance scan projection, add `_rowid`.
3. After DuckDB applies its post-scan filter, call
   `lance_create_dataset_take_stream_unfiltered` with surviving row IDs
   to fetch only the heavy columns for surviving rows.
4. Merge deferred columns into the output.
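
Step 1's detection rule could look roughly like the sketch below. The thresholds (more than 1KB/row, ARRAY length of at least 256) and the type list come from this message. Everything else is assumed for illustration: the `Kind` and `TypeInfo` types are hypothetical stand-ins for DuckDB's logical types, stats are assumed to take precedence over the type rule whenever they are available, and "recursive STRUCT" is read as "heavy if any field is heavy".

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified type model (not DuckDB's LogicalType).
enum class Kind { Integer, Double, Varchar, Blob, List, Array, Map, Struct };

struct TypeInfo {
    Kind kind;
    std::vector<TypeInfo> children;  // element type for List/Array, fields for Struct
    uint64_t array_size = 0;         // fixed length, only meaningful for Array
};

constexpr uint64_t kHeavyBytesPerRow = 1024;  // ">1KB/row" threshold
constexpr uint64_t kHeavyArrayLength = 256;   // "ARRAY>=256"

inline bool is_numeric(Kind k) { return k == Kind::Integer || k == Kind::Double; }

// Type-based fallback, used when no stats are available.
bool is_heavy_by_type(const TypeInfo& t) {
    switch (t.kind) {
    case Kind::Blob:
    case Kind::Map:
        return true;
    case Kind::List:
        // LIST of numeric or BLOB elements.
        return !t.children.empty() &&
               (is_numeric(t.children[0].kind) || t.children[0].kind == Kind::Blob);
    case Kind::Array:
        return t.array_size >= kHeavyArrayLength;
    case Kind::Struct:
        // Assumed reading of "recursive STRUCT": heavy if any field is heavy.
        for (const TypeInfo& child : t.children) {
            if (is_heavy_by_type(child)) return true;
        }
        return false;
    default:
        return false;
    }
}

// Stats first (assumed precedence), type-based fallback otherwise.
bool is_heavy(const TypeInfo& t, std::optional<uint64_t> bytes_on_disk,
              uint64_t row_count) {
    if (bytes_on_disk && row_count > 0) {
        // Compare totals to avoid integer-division truncation.
        return *bytes_on_disk > kHeavyBytesPerRow * row_count;
    }
    return is_heavy_by_type(t);
}
```

With stats present, a VARCHAR column averaging 100KB/row would be classified as heavy even though its type is not on the fallback list, which is the reason to prefer stats over types.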

The optimization only activates when:
- Filter pushdown to Lance failed (`!filter_pushed_down`)
- DuckDB has pending table filters
- Heavy columns exist in the projection but not in the filter
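
The activation conditions can be expressed as a single predicate over the scan state. The sketch below is an assumption-laden illustration: `ScanState` and `columns_to_defer` are hypothetical names, and the real extension's bind and init structures will differ. It also folds in the `lance_deferred_materialization` setting as a first gate.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical snapshot of what the scan knows at init time.
struct ScanState {
    bool setting_enabled;                 // lance_deferred_materialization
    bool filter_pushed_down;              // did Lance accept the filter?
    bool has_pending_table_filters;       // does DuckDB still filter post-scan?
    std::vector<std::string> projected;   // columns requested by the query
    std::set<std::string> filter_columns; // columns referenced by the filter
    std::set<std::string> heavy_columns;  // result of heavy-column detection
};

// Returns the columns to fetch in phase 2; empty means "scan eagerly".
std::vector<std::string> columns_to_defer(const ScanState& s) {
    std::vector<std::string> deferred;
    if (!s.setting_enabled || s.filter_pushed_down || !s.has_pending_table_filters) {
        return deferred;
    }
    for (const std::string& col : s.projected) {
        // Defer only heavy columns the filter does not need to read.
        if (s.heavy_columns.count(col) != 0 && s.filter_columns.count(col) == 0) {
            deferred.push_back(col);
        }
    }
    return deferred;
}
```

A heavy column that the filter itself reads has to be scanned in phase 1 anyway, so it is left in the normal projection in this sketch.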

Controlled by: `SET lance_deferred_materialization = true|false`
(default: true).

Benchmark (20K rows, 5 fragments, 100KB blob/row, ~2GB):
- Filter 3% selectivity: 555ms → 53ms (10.5x faster)
- Filter 10% selectivity: 558ms → 162ms (3.4x faster)

Note: When filter IS pushed to Lance, Lance SDK's internal
MaterializationStyle::Heuristic already handles late materialization
(LanceRead → TakeExec), so this optimization correctly does not
activate in that case.

Also adds:
- Rust FFI: `lance_dataset_list_named_field_stats` for stats-based
  heavy column detection (resolves field IDs to column names)
- Test: `scan_deferred_materialization.test` with plan verification,
  correctness, setting toggle, and multi-fragment coverage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@Xuanwo Xuanwo left a comment

Thank you!

@Xuanwo Xuanwo merged commit e555fb8 into lance-format:main Mar 18, 2026
5 checks passed

Development

Successfully merging this pull request may close these issues.

feat: deferred materialization for heavy columns (blobs, large embeddings)
