perf: remove O(n²) performance regression in take() with duplicate indices#6351

Merged
wjones127 merged 1 commit into lance-format:main from YSBF:fix/take-duplicate-indices-perf
Mar 31, 2026

Conversation

@YSBF
Contributor

@YSBF YSBF commented Mar 30, 2026

Problem:
take() with sorted-but-duplicate indices (e.g., ML sliding window
sampling) degrades from O(n) to O(n²). Benchmarked: 2.4M indices
with 90% duplicates takes 57 seconds instead of 120ms.

Root cause (two bugs):

  1. check_row_addrs() uses strict `>` to detect sorted order. Duplicate adjacent values (addr == last_offset) are misclassified as "unsorted", routing them to the slow fallback path.
  2. The unsorted fallback remaps via a `.position()` linear scan — O(N*M), where N = original indices and M = deduplicated rows. For 2.4M × 249K, that is roughly 308 billion comparisons ≈ 57 seconds.
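The sortedness check can be sketched like this (a minimal illustration, not Lance's actual `check_row_addrs` code — names and signatures are assumptions):

```rust
// Buggy check: strict `>` means a duplicate adjacent address
// (w[1] == w[0]) fails, so the batch is treated as unsorted and
// routed to the slow fallback path.
fn is_sorted_strict(addrs: &[u64]) -> bool {
    addrs.windows(2).all(|w| w[1] > w[0])
}

// Fixed check: `>=` accepts duplicates, keeping sorted-with-duplicates
// input on the fast "sorted" branch.
fn is_sorted_with_dups(addrs: &[u64]) -> bool {
    addrs.windows(2).all(|w| w[1] >= w[0])
}
```

An input like `[1, 2, 2, 3]` fails the strict check but passes the fixed one, which is exactly the sliding-window case described above.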

Fix:

  1. Change `>` to `>=` in check_row_addrs so sorted-with-duplicates correctly takes the fast "sorted" branch.
  2. Replace the `.position()` linear scan with a HashMap lookup (O(1) per element) as defense-in-depth for truly unsorted input.
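The remap change in fix 2 can be sketched as follows (illustrative names and signatures, not the actual Lance implementation): `deduped` holds the unique row addresses that were fetched, and each original index must be mapped to its position within `deduped`.

```rust
use std::collections::HashMap;

// Old approach: one linear scan per original index — O(N*M) overall.
fn remap_linear(original: &[u64], deduped: &[u64]) -> Vec<usize> {
    original
        .iter()
        .map(|a| deduped.iter().position(|d| d == a).unwrap())
        .collect()
}

// New approach: build an address -> position index once, then do
// O(1) lookups — O(N + M) overall.
fn remap_hashmap(original: &[u64], deduped: &[u64]) -> Vec<usize> {
    let index: HashMap<u64, usize> =
        deduped.iter().enumerate().map(|(i, &a)| (a, i)).collect();
    original.iter().map(|a| index[a]).collect()
}
```

Both return the same mapping; only the asymptotic cost differs, which is why the HashMap version matters even though fix 1 already keeps sorted input off this path.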

Benchmark (1024 sliding windows, 2410 rows each, stride 241):
Before: 57,000ms
After: 120ms (475× speedup)

The sorted fast path already handles duplicates correctly via fragment-level take_as_batch() which has its own dedup logic.

@github-actions github-actions bot added the bug Something isn't working label Mar 30, 2026
@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@codecov

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Contributor

@wjones127 wjones127 left a comment

This looks good to me.

@wjones127 wjones127 changed the title fix: O(n²) performance regression in take() with duplicate indices perf: remove O(n²) performance regression in take() with duplicate indices Mar 31, 2026
@wjones127 wjones127 removed the bug Something isn't working label Mar 31, 2026
@wjones127 wjones127 merged commit 2e10283 into lance-format:main Mar 31, 2026
31 of 32 checks passed
eddyxu pushed a commit that referenced this pull request Mar 31, 2026