take_blobs(indices=...) fails with _rowaddr schema collision on multi-fragment tables #6227

@justinrmiller

Version: Lance 3.0.0 (via pylance)

Description:

Calling LanceDataset.take_blobs(blob_column, indices=[...]) on a multi-fragment table with a blob column fails with a schema error. Lance internally tries to append a _rowaddr column, but the schema it appends to already contains a _rowaddr field (it appears as a top-level field next to the blob column's struct in the error below), causing a collision.

Reproduction:

import lance
import random

ds = lance.dataset("path/to/multi_fragment_table.lance")

indices = sorted(random.sample(range(ds.count_rows()), 10))

# This fails:
blobs = ds.take_blobs("my_blob_column", indices=indices)

Error:

RuntimeError: LanceError(Arrow): Schema error: Can not append column _rowaddr on schema:
Schema { fields: [
  Field { name: "my_blob_column", data_type: Struct([
    Field { name: "position", data_type: UInt64, nullable: true },
    Field { name: "size", data_type: UInt64, nullable: true }
  ]), metadata: {"lance-encoding:blob": "true"} },
  Field { name: "_rowaddr", data_type: UInt64, nullable: true }
], metadata: {} },
/home/runner/work/lance/lance/rust/lance/src/dataset/take.rs:384:21

Root cause:

The indices code path at take.rs:384 tries to add a _rowaddr column so it can resolve logical indices into physical row addresses. However, the schema used for the blob read already includes a top-level _rowaddr field, as the error output shows, so the append fails with a duplicate-column error.

Workaround:

Pass ids= instead of indices=, with each row ID encoded as (fragment_id << 32) | row_offset:

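fragments = ds.get_fragments()
frag = fragments[0]  # this example samples from the first fragment only
offsets = random.sample(range(frag.count_rows()), 10)
# Fragment ID goes in the upper 32 bits, row offset in the lower 32 bits
row_ids = [(frag.fragment_id << 32) | offset for offset in offsets]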
blobs = ds.take_blobs("my_blob_column", ids=row_ids)  # works
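
For anyone checking their encoded IDs, here is a minimal sketch of the packing used in the workaround, plus the reverse operation. The helper names encode_row_id and decode_row_id are hypothetical and not part of the Lance API; the layout (fragment ID in the upper 32 bits, row offset in the lower 32 bits) is the one described above.

```python
# Hypothetical helpers for illustration; these are not Lance functions.
FRAGMENT_SHIFT = 32
OFFSET_MASK = (1 << FRAGMENT_SHIFT) - 1


def encode_row_id(fragment_id: int, row_offset: int) -> int:
    """Pack a fragment ID and a row offset into one 64-bit integer."""
    if fragment_id < 0:
        raise ValueError("fragment_id must be non-negative")
    if not 0 <= row_offset <= OFFSET_MASK:
        raise ValueError("row_offset must fit in 32 bits")
    return (fragment_id << FRAGMENT_SHIFT) | row_offset


def decode_row_id(row_id: int) -> tuple[int, int]:
    """Split an encoded row ID back into (fragment_id, row_offset)."""
    return row_id >> FRAGMENT_SHIFT, row_id & OFFSET_MASK
```

Decoding a value that was passed to ids= is a quick way to confirm it points at the fragment and offset you intended.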

Additional notes:

  • take_blobs(ids=...) with flat sequential integers (not encoded row IDs) also fails — the IDs must encode the fragment ID in the upper 32 bits.
  • LanceDataset.take() does not support with_row_id or with_row_address kwargs, so there's no convenient way to map logical indices to row IDs for use with take_blobs(ids=...).
  • The indices path is the only ergonomic API for random access to blobs by logical position, and it's broken on any table with more than one fragment.
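
Until the indices path is fixed, the mapping from logical indices to encoded row IDs can be done by hand. The sketch below is one possible approach, not something Lance provides: indices_to_row_ids is a hypothetical helper, and it assumes that logical indices follow the order of the fragment list, that no rows have been deleted (so each fragment's row count equals its number of physical rows), and that the encoded value is accepted by ids= as in the workaround above.

```python
from bisect import bisect_right
from itertools import accumulate


def indices_to_row_ids(indices, fragment_ids, fragment_row_counts):
    """Map logical row indices to (fragment_id << 32) | row_offset values.

    Hypothetical helper. Assumes fragments are listed in logical order
    and contain no deleted rows.
    """
    # ends[i] is the number of logical rows in fragments 0..i inclusive.
    ends = list(accumulate(fragment_row_counts))
    total = ends[-1] if ends else 0

    row_ids = []
    for idx in indices:
        if not 0 <= idx < total:
            raise IndexError(f"index {idx} out of range for {total} rows")
        # First fragment whose cumulative end is strictly greater than idx.
        pos = bisect_right(ends, idx)
        start = ends[pos - 1] if pos else 0
        row_ids.append((fragment_ids[pos] << 32) | (idx - start))
    return row_ids
```

With a dataset, the inputs can be built from calls already used in this report: take frags = ds.get_fragments(), then pass [f.fragment_id for f in frags] and [f.count_rows() for f in frags], and hand the result to ds.take_blobs("my_blob_column", ids=...). If the table has deletions, the computed offsets may not line up with physical rows, so treat the output as unverified in that case.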

Environment:

  • Lance 3.0.0
  • Python 3.12
