take_blobs(indices=...) fails with _rowaddr schema collision on multi-fragment tables

**Version:** Lance 3.0.0 (via pylance)

**Description:**

Calling `LanceDataset.take_blobs(blob_column, indices=[...])` on a multi-fragment table with a blob column fails with a schema error. Lance internally tries to append a `_rowaddr` column, but the schema it is appending to already has a top-level `_rowaddr` field next to the blob column (see the error below), so the append is rejected as a collision.

**Reproduction:**

```python
import lance
import random

ds = lance.dataset("path/to/multi_fragment_table.lance")

indices = sorted(random.sample(range(ds.count_rows()), 10))

# This fails:
blobs = ds.take_blobs("my_blob_column", indices=indices)
```

**Error:**

```
RuntimeError: LanceError(Arrow): Schema error: Can not append column _rowaddr on schema:
Schema { fields: [
  Field { name: "my_blob_column", data_type: Struct([
    Field { name: "position", data_type: UInt64, nullable: true },
    Field { name: "size", data_type: UInt64, nullable: true }
  ]), metadata: {"lance-encoding:blob": "true"} },
  Field { name: "_rowaddr", data_type: UInt64, nullable: true }
], metadata: {} },
/home/runner/work/lance/lance/rust/lance/src/dataset/take.rs:384:21
```

**Root cause:**

The `indices` code path appears to add a `_rowaddr` column so that it can resolve logical indices into physical row addresses; the error is raised at `take.rs:384`. By that point the schema used for the blob read already contains `_rowaddr` (the last top-level field in the schema printed above), so the second append fails with a duplicate column error.

**Workaround:**

Pass `ids=` instead of `indices=`, with each row ID encoded as `(fragment_id << 32) | row_offset`. The example below samples rows from a single fragment:

```python
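# Reuses `ds` and `random` from the reproduction above.
fragments = ds.get_fragments()
frag = fragments[0]
n_rows = frag.count_rows()
offsets = random.sample(range(n_rows), min(10, n_rows))
# Row ID = fragment ID in the upper 32 bits, row offset in the lower 32 bits.
row_ids = [(frag.fragment_id << 32) | offset for offset in offsets]
blobs = ds.take_blobs("my_blob_column", ids=row_ids)  # works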
```
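
For debugging, it can help to round-trip the encoding and check which fragment a given ID points at. The following is a minimal sketch of the bit layout described above; `encode_row_id` and `decode_row_id` are hypothetical helper names, not part of the Lance API.

```python
ROW_OFFSET_BITS = 32
ROW_OFFSET_MASK = (1 << ROW_OFFSET_BITS) - 1


def encode_row_id(fragment_id: int, row_offset: int) -> int:
    """Pack a fragment ID (upper 32 bits) and a row offset (lower 32 bits)."""
    if fragment_id < 0:
        raise ValueError("fragment_id must be non-negative")
    if not 0 <= row_offset <= ROW_OFFSET_MASK:
        raise ValueError("row_offset must fit in 32 bits")
    return (fragment_id << ROW_OFFSET_BITS) | row_offset


def decode_row_id(row_id: int) -> tuple[int, int]:
    """Split an encoded row ID back into (fragment_id, row_offset)."""
    return row_id >> ROW_OFFSET_BITS, row_id & ROW_OFFSET_MASK
```

For example, `decode_row_id(4294967301)` returns `(1, 5)`, i.e. row offset 5 in fragment 1.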

**Additional notes:**

- `take_blobs(ids=...)` with flat sequential integers (not encoded row IDs) also fails; the IDs must encode the fragment ID in the upper 32 bits.
- `LanceDataset.take()` does not support `with_row_id` or `with_row_address` kwargs, so there's no convenient way to map logical indices to row IDs for use with `take_blobs(ids=...)`.
- The `indices` path is the only ergonomic API for random access to blobs by logical position, and it's broken on any table with more than one fragment.
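
Until the `indices` path is fixed, the mapping from logical indices to encoded row IDs can be done by hand from per-fragment row counts. The sketch below is one possible approach, not Lance's own logic: `indices_to_row_ids` is a hypothetical helper, and it assumes that fragments are listed in logical row order and that no fragment has deleted rows (otherwise the offsets computed from row counts would not match physical offsets).

```python
from bisect import bisect_right
from itertools import accumulate


def indices_to_row_ids(fragment_sizes, indices):
    """Map logical row indices to encoded row IDs.

    fragment_sizes: list of (fragment_id, num_rows) in dataset order.
    Assumes no deleted rows, so logical offset == physical offset.
    """
    # ends[i] is the exclusive logical end index of fragment i.
    ends = list(accumulate(num_rows for _, num_rows in fragment_sizes))
    total = ends[-1] if ends else 0

    row_ids = []
    for idx in indices:
        if not 0 <= idx < total:
            raise IndexError(f"index {idx} out of range for {total} rows")
        # First fragment whose end lies strictly beyond idx.
        pos = bisect_right(ends, idx)
        fragment_id, num_rows = fragment_sizes[pos]
        offset = idx - (ends[pos] - num_rows)
        row_ids.append((fragment_id << 32) | offset)
    return row_ids


# Usage with the calls from this report (left commented, since it needs a dataset):
# sizes = [(f.fragment_id, f.count_rows()) for f in ds.get_fragments()]
# blobs = ds.take_blobs("my_blob_column", ids=indices_to_row_ids(sizes, indices))
```

Fragment IDs are taken from the input rather than assumed to be `0, 1, 2, ...`, since IDs need not be contiguous.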

**Environment:**
- Lance 3.0.0
- Python 3.12
