Skip to content

feat(python): support nested blob fields in to_pandas#6791

Merged
BubbleCal merged 3 commits into
lance-format:mainfrom
geruh:nested-blobs
May 18, 2026
Merged

feat(python): support nested blob fields in to_pandas#6791
BubbleCal merged 3 commits into
lance-format:mainfrom
geruh:nested-blobs

Conversation

@geruh
Copy link
Copy Markdown
Contributor

@geruh geruh commented May 14, 2026

closes #6436

This PR adds support for blob fields inside of structs for to_pandas(), and take_blobs("nested.identifier").

In the original code done in #6424, the support was added for top level primitive type blobs. Here the flow was to drop the blob secriptors from the arrow table convert the rest to pandas, and insert the column back with the BlobFile types. This behavior remains unchanged.

Now for nested blobs it's possible that the parent struct has sibilings that aren't of blob type so we preserve. Converting the struct to pandas so we get a series of dicts, then walk row by row swapping each descriptor leaf with a BlobFile.

On the Rust side, validate_blob_column resolves dotted paths with the Schema::field now instead of top-level only, and leaf_descriptor_struct walks a descriptor RecordBatch down to nested structs. I also exposed parse_field_path / format_field_path to Python so it doesn't have to reimplement path parsing.

Lists and maps containing blobs are out of scope since that's 1:N cardinality and needs a different read path, and can be done as a follow up.

Testing

  • Added some tests covering nested replace
  • Existing tests continue to pass

cc: @BubbleCal @westonpace

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added enhancement New feature or request A-python Python bindings labels May 14, 2026
@westonpace
Copy link
Copy Markdown
Member

@claude review once

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No bugs found in the inline review, but this is a feature PR with non-trivial logic in the blob read path (both Python helpers and Rust descriptor walking) plus newly exposed pybindings — worth a human pass on the design before merging.

Extended reasoning...

Overview

This PR extends blob support to nested struct fields for both to_pandas() and take_blobs("nested.identifier"). The Python side (python/python/lance/dataset.py) replaces the top-level-only flow with path-based helpers: _field_blob_paths, _descriptors_at_path, _dict_with_value_at_path, _replace_in_struct_column, _materialize_blobs_for_paths, and _place_materialized_blobs. The Rust side (rust/lance/src/dataset/blob.rs) rewrites validate_blob_column to use Schema::field (which already understands dotted paths) and adds leaf_descriptor_struct to walk the descriptor RecordBatch through nested struct children. parse_field_path / format_field_path are newly exposed as private pyfunctions (_parse_field_path / _format_field_path) so Python does not reimplement quoting/parsing.

Security risks

No security-sensitive surface. No auth, crypto, permissions, or external I/O paths are touched. The new pyfunctions are thin wrappers around existing lance_core::datatypes helpers and propagate errors as PyValueError.

Level of scrutiny

Moderate-to-high. While the Rust delta is small, it changes a critical read-path validator and introduces expect() calls in leaf_descriptor_struct whose invariants depend on validate_blob_column having already vetted the path — that coupling is reasonable but worth a human confirming. The Python restructure is more substantial: a previously linear flow becomes a path-walking pattern across six new helpers. Lists/maps are explicitly out of scope, which is fine but worth a maintainer endorsing.

Other factors

Test coverage is solid: lazy/bytes/descriptions modes for nested blobs, nested take_blobs, and the dotted-field-name-inside-struct edge case are all covered. westonpace explicitly requested a review, which by itself argues against auto-approval — they want feedback, not a rubber stamp.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 52.63158% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/blob.rs 52.63% 5 Missing and 4 partials ⚠️

📢 Thoughts on this report? Let us know!

Comment thread python/python/tests/test_blob.py Outdated
(row,) = df["info"].tolist()

assert isinstance(row["blob.payload"], BlobFile)
assert row["blob.payload"].readall() == b"hello"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need such test:

def test_scanner_to_pandas_nested_blob_projection_lazy(dataset_with_nested_blobs):
    df = dataset_with_nested_blobs.scanner(columns=["info.image"]).to_pandas()

    assert list(df.columns) == ["info.image"]
    assert isinstance(df.iloc[0]["info.image"], BlobFile)
    assert df.iloc[0]["info.image"].readall() == b"hello"
    assert df.iloc[1]["info.image"] is None
    assert df.iloc[2]["info.image"].readall() == b"world"

it seems nested blob can't work well with to_pandas() lazy mode.

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved but I'm not sure I really understand the purpose of _BLOB_PANDAS_MODE_LAZY. It seems to have the same end result as _BLOB_PANDAS_MODE_BYTES but involves more steps.

Comment thread python/python/lance/dataset.py Outdated
Comment on lines +190 to +191
"blob-aware to_pandas only supports direct blob column "
f"references; got {expr!r} for output column {segments[0]!r}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean for a reference to be "direct"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understaning here is "direct" means the alias source is a single top level column I can remove probably remove this check and try it out.

@geruh
Copy link
Copy Markdown
Contributor Author

geruh commented May 16, 2026

Approved but I'm not sure I really understand the purpose of _BLOB_PANDAS_MODE_LAZY. It seems to have the same end result as _BLOB_PANDAS_MODE_BYTES but involves more steps.

The difference here is the control on which blobs get loaded. the bytes mode loads every blob in the result set immediately, so you pay the full cost upfront regardless of what you end up using. lazy mode returns BlobFile handles so we can decide which ones to actually read.

As far as I can tell, blob columns are dropped from the Arrow table before .to_pandas() is called, so Arrow never materializes the bytes. BlobFile is a rust object and is injected directly as plain python object-dtype columns. Pandas doesn't have any knowledge of, and bytes only hit memory when you call .read(), which I believe routes back through PyO3 into the Rust implementation.

does this sound correct @BubbleCal

@BubbleCal
Copy link
Copy Markdown
Contributor

Approved but I'm not sure I really understand the purpose of _BLOB_PANDAS_MODE_LAZY. It seems to have the same end result as _BLOB_PANDAS_MODE_BYTES but involves more steps.

The difference here is the control on which blobs get loaded. the bytes mode loads every blob in the result set immediately, so you pay the full cost upfront regardless of what you end up using. lazy mode returns BlobFile handles so we can decide which ones to actually read.

As far as I can tell, blob columns are dropped from the Arrow table before .to_pandas() is called, so Arrow never materializes the bytes. BlobFile is a rust object and is injected directly as plain python object-dtype columns. Pandas doesn't have any knowledge of, and bytes only hit memory when you call .read(), which I believe routes back through PyO3 into the Rust implementation.

does this sound correct @BubbleCal

yeah it's accurate and correct

@BubbleCal BubbleCal merged commit 46e9efe into lance-format:main May 18, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-python Python bindings enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: support to read nested blob field

3 participants