Phenomenon
Calling Dataset.optimize_indices() on a scalar index that is already in steady state (no unindexed fragments, num_indices_to_merge <= 1) is supposed to be a no-op, but instead it:
- rebuilds the existing index segment,
- assigns it a new UUID,
- writes a new manifest, advancing the dataset version on every call.
Effectively, every "idle" optimize call churns the dataset and the index, producing useless versions and invalidating any cache keyed by the index UUID.
Reproduction
Repro script: write/verify_optimize_noop.py
import shutil
import tempfile

import lance
import pyarrow as pa

tmp = tempfile.mkdtemp()
uri = f"{tmp}/ds"

# Write a small dataset and build a BTREE scalar index on "id".
ds = lance.write_dataset(pa.table({"id": [f"song-{i}" for i in range(32)]}), uri)
ds.create_scalar_index("id", index_type="BTREE", name="id_idx")
ds = lance.dataset(uri)

def uuid_of(d):
    # UUID of the "id_idx" index as recorded in the current manifest.
    return next(i["uuid"] for i in d._ds.load_indices() if i["name"] == "id_idx")

# Baseline: version and index UUID right after index creation.
print(ds.version, uuid_of(ds))

ds.optimize.optimize_indices(num_indices_to_merge=1)  # should be a no-op
ds = lance.dataset(uri)
print(ds.version, uuid_of(ds))

ds.optimize.optimize_indices()  # should be a no-op
ds = lance.dataset(uri)
print(ds.version, uuid_of(ds))
Observed output
2 → 3 → 4
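The manual comparison above could be turned into an assertion. The following is a minimal sketch only: `changed_by` is a hypothetical helper, not part of Lance, and it assumes that "no-op" means the pair (dataset version, index UUID) is identical before and after the call.

```python
from typing import Callable, Tuple

# (dataset version, index UUID); the shape of the snapshot is an assumption
# made for this sketch.
Snapshot = Tuple[int, str]


def changed_by(action: Callable[[], None], snapshot: Callable[[], Snapshot]) -> bool:
    """Run `action` and report whether the snapshot differs afterwards."""
    before = snapshot()
    action()
    after = snapshot()
    return before != after


# Hypothetical usage with the repro script above (requires lance, `uri`,
# `ds` and `uuid_of` from that script):
#
#   def snap():
#       d = lance.dataset(uri)
#       return d.version, uuid_of(d)
#
#   assert not changed_by(lambda: ds.optimize.optimize_indices(), snap)
```

With the behavior reported here, that final assertion would be expected to fail, since both the version and the UUID change on every call.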