perf: remove O(n²) performance regression in take() with duplicate indices #6351
Merged
wjones127 merged 1 commit into lance-format:main on Mar 31, 2026
Conversation
Problem:
take() with sorted-but-duplicate indices (e.g., ML sliding window
sampling) degrades from O(n) to O(n²). Benchmarked: a take() of 2.4M
indices with 90% duplicates runs in 57 seconds instead of 120ms.
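For context, a minimal sketch of how that workload arises; the generator below is a hypothetical reconstruction from the benchmark parameters (1024 windows, 2410 rows each, stride 241), not code from this patch:

```rust
// Overlapping sliding windows: each row is covered by roughly
// rows_per_window / stride windows (here ~10), so the index list is
// sorted but ~90% duplicates.
fn sliding_window_indices(windows: u64, rows_per_window: u64, stride: u64) -> Vec<u64> {
    let mut indices = Vec::with_capacity((windows * rows_per_window) as usize);
    for w in 0..windows {
        let start = w * stride;
        indices.extend(start..start + rows_per_window);
    }
    indices.sort_unstable(); // sorted order with duplicates: the case this PR fixes
    indices
}
```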
Root cause (two bugs):
1. check_row_addrs() uses strict `>` to detect sorted order.
Duplicate adjacent values (addr == last_offset) are misclassified
as "unsorted", routing to the slow fallback path.
2. The unsorted fallback remaps via a `.position()` linear scan, which
   is O(N*M) where N = original indices and M = deduplicated rows.
   For 2.4M × 249K, with each scan covering half the list on average,
   that is ≈ 308 billion comparisons ≈ 57 seconds (see the sketch
   after this list).
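A minimal sketch of the two problem spots described above; the function bodies are assumed reconstructions, not the actual Lance source:

```rust
// Bug 1: strict `>` means a duplicate adjacent address (w[1] == w[0])
// fails the check, so the whole batch is misclassified as unsorted.
fn check_row_addrs(addrs: &[u64]) -> bool {
    addrs.windows(2).all(|w| w[1] > w[0])
}

// Bug 2: the unsorted fallback remaps every original index with a
// linear scan over the deduplicated rows: O(N * M) comparisons total.
fn remap(original: &[u64], deduped: &[u64]) -> Vec<usize> {
    original
        .iter()
        .map(|addr| deduped.iter().position(|d| d == addr).unwrap())
        .collect()
}
```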
Fix:
1. Change `>` to `>=` in check_row_addrs so sorted-with-duplicates
correctly takes the fast "sorted" branch.
2. Replace the `.position()` linear scan with a HashMap lookup (O(1) per
   element) as defense-in-depth for truly unsorted input (sketched below).
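Sketched below, under the same assumptions as above, is what the two fixes look like:

```rust
use std::collections::HashMap;

// Fix 1: `>=` accepts equal adjacent addresses, so sorted-with-duplicates
// input takes the fast "sorted" branch.
fn check_row_addrs(addrs: &[u64]) -> bool {
    addrs.windows(2).all(|w| w[1] >= w[0])
}

// Fix 2: build an addr -> position map once (O(M)), then remap each
// original index with an O(1) lookup: O(N + M) overall.
fn remap(original: &[u64], deduped: &[u64]) -> Vec<usize> {
    let pos: HashMap<u64, usize> = deduped
        .iter()
        .enumerate()
        .map(|(i, &addr)| (addr, i))
        .collect();
    original.iter().map(|addr| pos[addr]).collect()
}
```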
Benchmark (1024 sliding windows, 2410 rows each, stride 241):
Before: 57,000ms
After: 120ms (475× speedup)
The sorted fast path already handles duplicates correctly via
fragment-level take_as_batch() which has its own dedup logic.
Contributor
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, inspect the "PR Title Check" action.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
wjones127 approved these changes on Mar 31, 2026
Contributor wjones127 left a comment:
This looks good to me.
eddyxu pushed a commit that referenced this pull request on Mar 31, 2026
…dices (#6351)
The commit message mirrors the PR description above.
Co-authored-by: YSBF <noreply@users.noreply.github.com>