Useless Scalar Index Optimization When No New Fragment Is Present #6932

@zhangyue19921010

Description

Phenomenon

Calling Dataset.optimize_indices() on a scalar index that is already in steady state (no unindexed fragments, num_indices_to_merge <= 1) is supposed to be a no-op, but instead it:

  • rebuilds the existing index segment,
  • assigns it a new UUID,
  • writes a new manifest, advancing the dataset version on every call.

Effectively, every "idle" optimize call churns the dataset and the index, producing useless versions and invalidating any cache keyed by the index UUID.
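The steady-state condition described above can be written as a small predicate. This is only a sketch of the expected behavior, not Lance's actual implementation: `IndexState`, `is_steady_state`, and the `unindexed_fragments` field are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexState:
    """Hypothetical summary of a scalar index; not a Lance type."""

    unindexed_fragments: int  # fragments not yet covered by the index


def is_steady_state(state: IndexState, num_indices_to_merge: int) -> bool:
    """Return True when an optimize call should have nothing to do.

    Mirrors the condition from the report: no unindexed fragments and
    num_indices_to_merge <= 1.
    """
    return state.unindexed_fragments == 0 and num_indices_to_merge <= 1


print(is_steady_state(IndexState(unindexed_fragments=0), 1))  # True
print(is_steady_state(IndexState(unindexed_fragments=3), 1))  # False
```

When such a predicate holds, the expectation in this report is that the call returns without rebuilding the segment, assigning a new UUID, or writing a manifest.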

Reproduction

Repro script: write/verify_optimize_noop.py

import tempfile

import lance
import pyarrow as pa

tmp = tempfile.mkdtemp()
uri = f"{tmp}/ds"

ds = lance.write_dataset(pa.table({"id": [f"song-{i}" for i in range(32)]}), uri)
ds.create_scalar_index("id", index_type="BTREE", name="id_idx")
ds = lance.dataset(uri)

def uuid_of(d):
    return next(i["uuid"] for i in d._ds.load_indices() if i["name"] == "id_idx")

print(ds.version, uuid_of(ds))

ds.optimize.optimize_indices(num_indices_to_merge=1)   # should be a no-op
ds = lance.dataset(uri)
print(ds.version, uuid_of(ds))

ds.optimize.optimize_indices()                          # should be a no-op
ds = lance.dataset(uri)
print(ds.version, uuid_of(ds))

Observed output

Dataset version across the three prints: 2 → 3 → 4, with a different index UUID printed each time.
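To turn the manual comparison into an automated check, one option is a small helper that snapshots some state, runs an action, and reports whether the state changed. The sketch below is an assumption about how such a check could look: `changed_by` and the stub functions are hypothetical, and the stub only simulates the version bump reported here instead of calling Lance.

```python
from typing import Callable, Hashable, Tuple


def changed_by(
    snapshot: Callable[[], Hashable],
    action: Callable[[], None],
) -> Tuple[Hashable, Hashable, bool]:
    """Run `action` and report the snapshot before, after, and whether it changed."""
    before = snapshot()
    action()
    after = snapshot()
    return before, after, before != after


# Stub standing in for a dataset whose version advances on every optimize call.
state = {"version": 2}


def fake_snapshot() -> int:
    return state["version"]


def fake_buggy_optimize() -> None:
    state["version"] += 1  # simulates the unwanted manifest write


before, after, changed = changed_by(fake_snapshot, fake_buggy_optimize)
print(before, after, changed)  # 2 3 True
```

Against the real dataset from the repro script, `snapshot` could return `(ds.version, uuid_of(ds))` for a freshly opened `lance.dataset(uri)`, and `action` could call `ds.optimize.optimize_indices()`; a true no-op would then report `changed == False`.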
