feat(python): support nested blob fields in to_pandas#6791
Conversation
|
@claude review once |
There was a problem hiding this comment.
No bugs found in the inline review, but this is a feature PR with non-trivial logic in the blob read path (both Python helpers and Rust descriptor walking) plus newly exposed pybindings — worth a human pass on the design before merging.
Extended reasoning...
Overview
This PR extends blob support to nested struct fields for both to_pandas() and take_blobs("nested.identifier"). The Python side (python/python/lance/dataset.py) replaces the top-level-only flow with path-based helpers: _field_blob_paths, _descriptors_at_path, _dict_with_value_at_path, _replace_in_struct_column, _materialize_blobs_for_paths, and _place_materialized_blobs. The Rust side (rust/lance/src/dataset/blob.rs) rewrites validate_blob_column to use Schema::field (which already understands dotted paths) and adds leaf_descriptor_struct to walk the descriptor RecordBatch through nested struct children. parse_field_path / format_field_path are newly exposed as private pyfunctions (_parse_field_path / _format_field_path) so Python does not reimplement quoting/parsing.
Security risks
No security-sensitive surface. No auth, crypto, permissions, or external I/O paths are touched. The new pyfunctions are thin wrappers around existing lance_core::datatypes helpers and propagate errors as PyValueError.
Level of scrutiny
Moderate-to-high. While the Rust delta is small, it changes a critical read-path validator and introduces expect() calls in leaf_descriptor_struct whose invariants depend on validate_blob_column having already vetted the path — that coupling is reasonable but worth a human confirming. The Python restructure is more substantial: a previously linear flow becomes a path-walking pattern across six new helpers. Lists/maps are explicitly out of scope, which is fine but worth a maintainer endorsing.
Other factors
Test coverage is solid: lazy/bytes/descriptions modes for nested blobs, nested take_blobs, and the dotted-field-name-inside-struct edge case are all covered. westonpace explicitly requested a review, which by itself argues against auto-approval — they want feedback, not a rubber stamp.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| (row,) = df["info"].tolist() | ||
|
|
||
| assert isinstance(row["blob.payload"], BlobFile) | ||
| assert row["blob.payload"].readall() == b"hello" |
There was a problem hiding this comment.
We need such test:
def test_scanner_to_pandas_nested_blob_projection_lazy(dataset_with_nested_blobs):
df = dataset_with_nested_blobs.scanner(columns=["info.image"]).to_pandas()
assert list(df.columns) == ["info.image"]
assert isinstance(df.iloc[0]["info.image"], BlobFile)
assert df.iloc[0]["info.image"].readall() == b"hello"
assert df.iloc[1]["info.image"] is None
assert df.iloc[2]["info.image"].readall() == b"world"
it seems nested blob can't work well with to_pandas() lazy mode.
westonpace
left a comment
There was a problem hiding this comment.
Approved but I'm not sure I really understand the purpose of _BLOB_PANDAS_MODE_LAZY. It seems to have the same end result as _BLOB_PANDAS_MODE_BYTES but involves more steps.
| "blob-aware to_pandas only supports direct blob column " | ||
| f"references; got {expr!r} for output column {segments[0]!r}" |
There was a problem hiding this comment.
What does it mean for a reference to be "direct"?
There was a problem hiding this comment.
My understaning here is "direct" means the alias source is a single top level column I can remove probably remove this check and try it out.
The difference here is the control on which blobs get loaded. the As far as I can tell, blob columns are dropped from the Arrow table before does this sound correct @BubbleCal |
yeah it's accurate and correct |
closes #6436
This PR adds support for blob fields inside of structs for
to_pandas(), andtake_blobs("nested.identifier").In the original code done in #6424, the support was added for top level primitive type blobs. Here the flow was to drop the blob secriptors from the arrow table convert the rest to pandas, and insert the column back with the
BlobFiletypes. This behavior remains unchanged.Now for nested blobs it's possible that the parent struct has sibilings that aren't of blob type so we preserve. Converting the struct to pandas so we get a series of dicts, then walk row by row swapping each descriptor leaf with a
BlobFile.On the Rust side,
validate_blob_columnresolves dotted paths with theSchema::fieldnow instead of top-level only, andleaf_descriptor_structwalks a descriptorRecordBatchdown to nested structs. I also exposedparse_field_path/format_field_pathto Python so it doesn't have to reimplement path parsing.Listsandmapscontaining blobs are out of scope since that's 1:N cardinality and needs a different read path, and can be done as a follow up.Testing
cc: @BubbleCal @westonpace