perf: remove O(n²) performance regression in take() with duplicate indices#6351

Merged
wjones127 merged 1 commit into lance-format:main from YSBF:fix/take-duplicate-indices-perf
Mar 31, 2026

Conversation

@YSBF
Contributor

@YSBF YSBF commented Mar 30, 2026

Problem:
take() with sorted-but-duplicate indices (e.g., ML sliding window
sampling) degrades from O(n) to O(n²). Benchmarked: 2.4M indices
with 90% duplicates takes 57 seconds instead of 120ms.

Root cause (two bugs):

  1. check_row_addrs() uses strict `>` to detect sorted order. Duplicate adjacent values (addr == last_offset) are misclassified as "unsorted", routing them to the slow fallback path.
  2. The unsorted fallback remaps via a `.position()` linear scan — O(N*M), where N = original indices and M = deduplicated rows. For 2.4M × 249K, that is roughly 308 billion comparisons ≈ 57 seconds.
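The sortedness check can be sketched like this (a minimal illustration, not Lance's actual `check_row_addrs` code — names and signatures are assumptions):

```rust
// Buggy check: strict `>` means a duplicate adjacent address
// (w[1] == w[0]) fails, so the batch is treated as unsorted and
// routed to the slow fallback path.
fn is_sorted_strict(addrs: &[u64]) -> bool {
    addrs.windows(2).all(|w| w[1] > w[0])
}

// Fixed check: `>=` accepts duplicates, keeping sorted-with-duplicates
// input on the fast "sorted" branch.
fn is_sorted_with_dups(addrs: &[u64]) -> bool {
    addrs.windows(2).all(|w| w[1] >= w[0])
}
```

An input like `[1, 2, 2, 3]` fails the strict check but passes the fixed one, which is exactly the sliding-window case described above.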

Fix:

  1. Change `>` to `>=` in check_row_addrs so sorted-with-duplicates correctly takes the fast "sorted" branch.
  2. Replace the `.position()` linear scan with a HashMap lookup (O(1) per element) as defense-in-depth for truly unsorted input.
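The remap change in fix 2 can be sketched as follows (illustrative names and signatures, not the actual Lance implementation): `deduped` holds the unique row addresses that were fetched, and each original index must be mapped to its position within `deduped`.

```rust
use std::collections::HashMap;

// Old approach: one linear scan per original index — O(N*M) overall.
fn remap_linear(original: &[u64], deduped: &[u64]) -> Vec<usize> {
    original
        .iter()
        .map(|a| deduped.iter().position(|d| d == a).unwrap())
        .collect()
}

// New approach: build an address -> position index once, then do
// O(1) lookups — O(N + M) overall.
fn remap_hashmap(original: &[u64], deduped: &[u64]) -> Vec<usize> {
    let index: HashMap<u64, usize> =
        deduped.iter().enumerate().map(|(i, &a)| (a, i)).collect();
    original.iter().map(|a| index[a]).collect()
}
```

Both return the same mapping; only the asymptotic cost differs, which is why the HashMap version matters even though fix 1 already keeps sorted input off this path.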

Benchmark (1024 sliding windows, 2410 rows each, stride 241):
Before: 57,000ms
After: 120ms (475× speedup)

The sorted fast path already handles duplicates correctly via fragment-level take_as_batch() which has its own dedup logic.

@github-actions github-actions bot added the bug Something isn't working label Mar 30, 2026
@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@codecov

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Contributor

@wjones127 wjones127 left a comment

This looks good to me.

@wjones127 wjones127 changed the title fix: O(n²) performance regression in take() with duplicate indices perf: remove O(n²) performance regression in take() with duplicate indices Mar 31, 2026
@wjones127 wjones127 removed the bug Something isn't working label Mar 31, 2026
@wjones127 wjones127 merged commit 2e10283 into lance-format:main Mar 31, 2026
31 of 32 checks passed
eddyxu pushed a commit that referenced this pull request Mar 31, 2026