feat(python): support nested blob fields in to_pandas by geruh · Pull Request #6791 · lance-format/lance

geruh · 2026-05-14T22:30:51Z

This PR adds support for blob fields inside structs in to_pandas() and in take_blobs("nested.identifier").

The original work in #6424 added support for top-level primitive blob columns. The flow there was to drop the blob descriptors from the Arrow table, convert the rest to pandas, and insert the column back with BlobFile values. This behavior remains unchanged.

For nested blobs, the parent struct may have siblings that are not blobs, so those are preserved. The struct is converted to pandas, which gives a series of dicts, and each row is then walked to swap every descriptor leaf with a BlobFile.
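The row-by-row swap described above can be sketched in plain Python. This is an illustration only, not the code in this PR: FakeBlobFile and replace_leaf are hypothetical names, and the descriptor field names (position, size) are assumed for the example.

```python
from typing import Any, Callable, Optional, Sequence


class FakeBlobFile:
    """Hypothetical stand-in for BlobFile; it only remembers its descriptor."""

    def __init__(self, descriptor: dict):
        self.descriptor = descriptor


def replace_leaf(
    row: Optional[dict], path: Sequence[str], make: Callable[[dict], Any]
) -> Optional[dict]:
    """Return a copy of `row` with the value at `path` replaced by make(value).

    Null structs and null leaves pass through unchanged.
    """
    if row is None:
        return None
    head, rest = path[0], path[1:]
    out = dict(row)  # shallow copy, so sibling fields are kept as they are
    if rest:
        out[head] = replace_leaf(row.get(head), rest, make)
    elif row.get(head) is not None:
        out[head] = make(row[head])
    return out


# A struct column after conversion to pandas behaves like a sequence of dicts.
rows = [
    {"name": "a", "image": {"position": 0, "size": 5}},
    {"name": "b", "image": None},
    None,
]
swapped = [replace_leaf(r, ["image"], FakeBlobFile) for r in rows]
```

Copying each row instead of mutating it keeps the non-blob siblings (here, name) intact and leaves the input rows untouched.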

On the Rust side, validate_blob_column now resolves dotted paths with Schema::field instead of handling top-level columns only, and leaf_descriptor_struct walks a descriptor RecordBatch down through nested structs. I also exposed parse_field_path / format_field_path to Python so it doesn't have to reimplement path parsing.
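To show why path parsing is worth sharing rather than reimplementing, here is a minimal pure-Python sketch of splitting and joining dotted paths. It assumes that a segment containing a dot is quoted with backticks; parse_path and format_path are hypothetical names, and the real parse_field_path / format_field_path may differ in their quoting and escaping rules.

```python
from typing import List


def parse_path(path: str) -> List[str]:
    """Split a dotted path into segments; backticks quote a literal segment."""
    segments: List[str] = []
    current: List[str] = []
    quoted = False
    for ch in path:
        if ch == "`":
            quoted = not quoted  # toggle quoting; the backtick itself is dropped
        elif ch == "." and not quoted:
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if quoted:
        raise ValueError(f"unterminated backtick in {path!r}")
    segments.append("".join(current))
    return segments


def format_path(segments: List[str]) -> str:
    """Join segments with dots, quoting any segment that itself contains a dot."""
    return ".".join(f"`{s}`" if "." in s else s for s in segments)


nested = parse_path("info.image")
dotted = parse_path("info.`blob.payload`")
round_trip = format_path(dotted)
```

The sketch does not handle backticks inside a field name, which is one more reason to reuse a single shared parser.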

Lists and maps containing blobs are out of scope: they have 1:N cardinality and need a different read path, which can be done as a follow-up.

Testing

Added some tests covering nested replace
Existing tests continue to pass

cc: @BubbleCal @westonpace

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

westonpace · 2026-05-15T03:19:21Z

@claude review once

claude

No bugs found in the inline review, but this is a feature PR with non-trivial logic in the blob read path (both Python helpers and Rust descriptor walking) plus newly exposed pybindings — worth a human pass on the design before merging.

Extended reasoning...

Overview

This PR extends blob support to nested struct fields for both to_pandas() and take_blobs("nested.identifier"). The Python side (python/python/lance/dataset.py) replaces the top-level-only flow with path-based helpers: _field_blob_paths, _descriptors_at_path, _dict_with_value_at_path, _replace_in_struct_column, _materialize_blobs_for_paths, and _place_materialized_blobs. The Rust side (rust/lance/src/dataset/blob.rs) rewrites validate_blob_column to use Schema::field (which already understands dotted paths) and adds leaf_descriptor_struct to walk the descriptor RecordBatch through nested struct children. parse_field_path / format_field_path are newly exposed as private pyfunctions (_parse_field_path / _format_field_path) so Python does not reimplement quoting/parsing.
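The path-based approach starts from knowing where the blob leaves are. A minimal sketch of that discovery step follows, using a hypothetical tuple-based schema model rather than the real Lance or Arrow schema types; blob_paths is an illustrative name and is not claimed to match _field_blob_paths.

```python
from typing import Iterator, List, Tuple

# Hypothetical schema model: each field is (name, is_blob, children).
Field = Tuple[str, bool, list]


def blob_paths(
    fields: List[Field], prefix: Tuple[str, ...] = ()
) -> Iterator[Tuple[str, ...]]:
    """Yield the segment path of every blob leaf, descending into struct children."""
    for name, is_blob, children in fields:
        path = prefix + (name,)
        if is_blob:
            yield path
        else:
            yield from blob_paths(children, path)


schema = [
    ("id", False, []),
    ("info", False, [
        ("caption", False, []),
        ("image", True, []),
        ("blob.payload", True, []),  # a field name that itself contains a dot
    ]),
    ("thumbnail", True, []),
]
paths = list(blob_paths(schema))
```

Keeping paths as tuples of segments, not joined strings, avoids ambiguity when a field name contains a dot.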

Security risks

No security-sensitive surface. No auth, crypto, permissions, or external I/O paths are touched. The new pyfunctions are thin wrappers around existing lance_core::datatypes helpers and propagate errors as PyValueError.

Level of scrutiny

Moderate-to-high. While the Rust delta is small, it changes a critical read-path validator and introduces expect() calls in leaf_descriptor_struct whose invariants depend on validate_blob_column having already vetted the path — that coupling is reasonable but worth a human confirming. The Python restructure is more substantial: a previously linear flow becomes a path-walking pattern across six new helpers. Lists/maps are explicitly out of scope, which is fine but worth a maintainer endorsing.

Other factors

Test coverage is solid: lazy/bytes/descriptions modes for nested blobs, nested take_blobs, and the dotted-field-name-inside-struct edge case are all covered. westonpace explicitly requested a review, which by itself argues against auto-approval — they want feedback, not a rubber stamp.

codecov · 2026-05-15T03:57:44Z

Codecov Report

❌ Patch coverage is 52.63158% with 9 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/blob.rs	52.63%	5 Missing and 4 partials ⚠️

BubbleCal · 2026-05-15T09:08:50Z

+    (row,) = df["info"].tolist()
+
+    assert isinstance(row["blob.payload"], BlobFile)
+    assert row["blob.payload"].readall() == b"hello"


We need a test like this:

def test_scanner_to_pandas_nested_blob_projection_lazy(dataset_with_nested_blobs):
    df = dataset_with_nested_blobs.scanner(columns=["info.image"]).to_pandas()
    assert list(df.columns) == ["info.image"]
    assert isinstance(df.iloc[0]["info.image"], BlobFile)
    assert df.iloc[0]["info.image"].readall() == b"hello"
    assert df.iloc[1]["info.image"] is None
    assert df.iloc[2]["info.image"].readall() == b"world"

It seems nested blobs don't work well with to_pandas() in lazy mode.

westonpace

Approved but I'm not sure I really understand the purpose of _BLOB_PANDAS_MODE_LAZY. It seems to have the same end result as _BLOB_PANDAS_MODE_BYTES but involves more steps.

westonpace · 2026-05-15T16:29:17Z

+                "blob-aware to_pandas only supports direct blob column "
+                f"references; got {expr!r} for output column {segments[0]!r}"


What does it mean for a reference to be "direct"?

My understanding is that "direct" means the alias source is a single top-level column. I can probably remove this check and try it out.

geruh · 2026-05-16T01:11:09Z

Approved but I'm not sure I really understand the purpose of _BLOB_PANDAS_MODE_LAZY. It seems to have the same end result as _BLOB_PANDAS_MODE_BYTES but involves more steps.

The difference is control over which blobs get loaded. Bytes mode loads every blob in the result set immediately, so you pay the full cost up front regardless of what you end up using. Lazy mode returns BlobFile handles, so we can decide which ones to actually read.

As far as I can tell, blob columns are dropped from the Arrow table before .to_pandas() is called, so Arrow never materializes the bytes. BlobFile is a Rust object and is injected directly into plain Python object-dtype columns. Pandas has no knowledge of it, and bytes only hit memory when you call .read(), which I believe routes back through PyO3 into the Rust implementation.

Does this sound correct, @BubbleCal?
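The distinction between the two modes can be illustrated with a small stand-alone sketch. LazyBlob, fetch_row, and storage are hypothetical stand-ins for BlobFile and the underlying storage; nothing here calls the Lance API.

```python
from typing import Callable, List, Optional


class LazyBlob:
    """Hypothetical handle: stores how to fetch bytes, not the bytes themselves."""

    def __init__(self, fetch: Callable[[], bytes]):
        self._fetch = fetch

    def readall(self) -> bytes:
        return self._fetch()


reads: List[int] = []  # records which rows were actually fetched
storage: List[Optional[bytes]] = [b"hello", None, b"world"]


def fetch_row(i: int) -> bytes:
    reads.append(i)
    return storage[i]


# "bytes" mode: every non-null blob is fetched up front.
eager = [fetch_row(i) if storage[i] is not None else None
         for i in range(len(storage))]
eager_reads = list(reads)

# "lazy" mode: only handles are created, and nothing is fetched yet.
reads.clear()
lazy = [LazyBlob(lambda i=i: fetch_row(i)) if storage[i] is not None else None
        for i in range(len(storage))]
reads_before = list(reads)

# Only the blob that is actually read triggers a fetch.
first = lazy[0].readall()
```

Both modes can yield the same bytes in the end. What differs is when the fetch happens and whether it happens for rows that are never read.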

BubbleCal · 2026-05-18T07:23:49Z

yeah it's accurate and correct

feat(python): support nested blob fields in to_pandas

7b8e393

claude Bot reviewed May 14, 2026

github-actions Bot added the enhancement and A-python labels May 14, 2026

claude Bot reviewed May 15, 2026

BubbleCal reviewed May 15, 2026

westonpace approved these changes May 15, 2026

address feedback add scanner support

933cea7

add aliasing support and refactor

cda37f9

BubbleCal merged commit 46e9efe into lance-format:main May 18, 2026
28 checks passed

claude Bot mentioned this pull request May 18, 2026

fix(python): pass batch size through fragment pandas export #6829

Merged

BubbleCal merged 3 commits into lance-format:main from geruh:nested-blobs
