Version: Lance 3.0.0 (via pylance)
Description:
Calling LanceDataset.take_blobs(blob_column, indices=[...]) on a multi-fragment table with a blob column fails with a schema error. Lance internally tries to append a _rowaddr column, but the blob column's struct schema already contains a _rowaddr field, causing a collision.
Reproduction:
import lance
import random
ds = lance.dataset("path/to/multi_fragment_table.lance")
indices = sorted(random.sample(range(ds.count_rows()), 10))
# This fails:
blobs = ds.take_blobs("my_blob_column", indices=indices)
Error:
RuntimeError: LanceError(Arrow): Schema error: Can not append column _rowaddr on schema:
Schema { fields: [
Field { name: "my_blob_column", data_type: Struct([
Field { name: "position", data_type: UInt64, nullable: true },
Field { name: "size", data_type: UInt64, nullable: true }
]), metadata: {"lance-encoding:blob": "true"} },
Field { name: "_rowaddr", data_type: UInt64, nullable: true }
], metadata: {} },
/home/runner/work/lance/lance/rust/lance/src/dataset/take.rs:384:21
Root cause:
The indices code path in take.rs:384 attempts to add a _rowaddr column to resolve logical indices into physical row addresses. However, the internal blob schema already includes _rowaddr as a field, causing a duplicate column error.
Workaround:
Use ids= instead of indices= with properly encoded row IDs (fragment_id << 32 | row_offset):
fragments = ds.get_fragments()
frag = fragments[0]
offsets = random.sample(range(frag.count_rows()), 10)
row_ids = [(frag.fragment_id << 32) | offset for offset in offsets]
blobs = ds.take_blobs("my_blob_column", ids=row_ids) # works
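The workaround above only samples from the first fragment. To cover arbitrary logical positions across all fragments, the same encoding can be applied per fragment. The following is a minimal sketch, not part of Lance: `indices_to_row_ids` is a hypothetical helper, and it assumes that fragments are listed in dataset order and that no rows have been deleted, so that a logical offset within a fragment equals its physical offset.

```python
from bisect import bisect_right
from itertools import accumulate


def indices_to_row_ids(fragment_sizes, indices):
    """Map logical row positions to encoded row IDs (hypothetical helper).

    fragment_sizes: list of (fragment_id, num_rows) pairs in dataset order.
    Assumption: no deleted rows, so logical offset == physical offset.
    """
    # ends[i] is the logical index one past the last row of fragment i.
    ends = list(accumulate(n for _, n in fragment_sizes))
    total = ends[-1] if ends else 0
    row_ids = []
    for idx in indices:
        if not 0 <= idx < total:
            raise IndexError(f"index {idx} out of range for {total} rows")
        pos = bisect_right(ends, idx)  # position of the fragment holding idx
        start = ends[pos - 1] if pos > 0 else 0
        frag_id, _ = fragment_sizes[pos]
        # Same encoding as the workaround: fragment ID in the upper 32 bits.
        row_ids.append((frag_id << 32) | (idx - start))
    return row_ids


# Intended usage with the dataset from the reproduction (not run here):
# sizes = [(f.fragment_id, f.count_rows()) for f in ds.get_fragments()]
# blobs = ds.take_blobs("my_blob_column", ids=indices_to_row_ids(sizes, indices))
```

If the table has deletions, this mapping may point at the wrong rows, so treat it as a stopgap until the `indices=` path is fixed.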
Additional notes:
- take_blobs(ids=...) with flat sequential integers (not encoded row IDs) also fails; the IDs must encode the fragment ID in the upper 32 bits.
- LanceDataset.take() does not support with_row_id or with_row_address kwargs, so there's no convenient way to map logical indices to row IDs for use with take_blobs(ids=...).
- The indices path is the only ergonomic API for random access to blobs by logical position, and it's broken on any table with more than one fragment.
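To make the encoding in the first note concrete, here is a small sketch that splits a 64-bit ID into its two halves. `split_row_id` is a hypothetical helper for illustration only. It relies solely on the `fragment_id << 32 | row_offset` layout described above; that a flat integer fails because it is read as an offset into fragment 0 is my reading, not something confirmed against the Lance source.

```python
def split_row_id(row_id):
    """Split an encoded row ID into (fragment_id, row_offset).

    Hypothetical helper; assumes the layout fragment_id << 32 | row_offset.
    """
    fragment_id = row_id >> 32          # upper 32 bits
    row_offset = row_id & 0xFFFFFFFF    # lower 32 bits
    return fragment_id, row_offset


# A flat logical index has nothing in the upper 32 bits, so under this
# layout it decodes as an offset into fragment 0.
flat = split_row_id(7)
encoded = split_row_id((3 << 32) | 7)
```

Here `flat` is `(0, 7)` and `encoded` is `(3, 7)`, which shows why a flat index cannot address rows outside fragment 0 under this layout.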
Environment:
Version: Lance 3.0.0 (via pylance)