Skip to content

fix: increase nprobes in test_query_delta_indices for deterministic recall#6953

Merged
Xuanwo merged 1 commit into
lance-format:mainfrom
Jay-ju:fix/nprobes-flaky-test
May 28, 2026
Merged

fix: increase nprobes in test_query_delta_indices for deterministic recall#6953
Xuanwo merged 1 commit into
lance-format:mainfrom
Jay-ju:fix/nprobes-flaky-test

Conversation

@Jay-ju
Copy link
Copy Markdown
Contributor

@Jay-ju Jay-ju commented May 27, 2026

Summary

Increase nprobes from the default (1) to 2 in test_query_delta_indices to fix a non-deterministic test failure.

Problem

The test creates an IVF index with 2 partitions but uses the default nprobes=1, which only probes 1 partition per segment. After optimize_indices, the index has 2 delta segments. With nprobes=1, the search may miss the partition containing ID 0 in the first segment, causing the assertion id_arr == [0, 1000] to fail non-deterministically (e.g., returning [889, 1000] instead).

This flaky failure was observed during development of PR #6890, where CI runs would intermittently fail on this test.

Fix

Setting nprobes=2 ensures all partitions are probed in every segment, making the vector search exhaustive and the test deterministic.

Test Plan

The fix is itself a test fix. The test test_query_delta_indices should no longer fail intermittently on CI.

…ecall

The test uses IVF with 2 partitions but default nprobes=1, which only
probes 1 partition per segment. With delta indices (2 segments), the
search may miss the partition containing ID 0 in the first segment,
causing the assertion to fail non-deterministically (e.g., returning
[889, 1000] instead of [0, 1000]).

Setting nprobes=2 ensures all partitions are probed, making the search
exhaustive and the test deterministic.
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 27, 2026
Jay-ju added a commit to Jay-ju/lance that referenced this pull request May 27, 2026
The .nprobes(2) fix for test_query_delta_indices has been moved to
a separate PR (lance-format#6953) to keep this PR focused on compaction planning.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@Xuanwo Xuanwo merged commit 5bd6412 into lance-format:main May 28, 2026
28 checks passed
BubbleCal pushed a commit that referenced this pull request May 29, 2026
)

## Problem

`test_build_ivf_pq::case_3` (`DistanceType::Dot`) fails on linux-arm
with:

```
assertion `left == right` failed
  left: 4
  right: 5
at rust/lance/src/index/vector/ivf/v2.rs:4321
```

The test partitions top-k results around `part_dist = dists[part_idx]`
using:
- left filter: strict `d < part_dist`, limit `part_idx`
- right filter: inclusive `d >= part_dist`, limit `k - part_idx`

It then asserts `left_res.num_rows() == part_idx` and that the row-id
ordering matches the original top-k.

## Root cause

When `dists[part_idx - 1] == dists[part_idx]`, two vectors tie at the
partition boundary:
- the strict-less left filter excludes both tied vectors → `left_res`
has `part_idx - 1` rows
- the inclusive right filter includes both tied vectors → one row shifts
from left to right, and the trailing original row drops off
`right_res`'s limit
- the original row-id ordering of the tied vectors is determined by an
internal tiebreaker, so the row-id assertions can fail even when the
count matches

The failure is deterministic per architecture but architecture-dependent
because **ARM NEON uses fused multiply-add (FMA) for SIMD dot products**
by default, while x86 (without `-mfma`) does separate mul+add. FMA
produces a single rounding for `a*b + c`; separate ops produce two. For
Dot distance specifically — which has fewer transformations than L2 or
Cosine — this ULP-level difference is enough to make two distances tie
on ARM that don't tie on x86. Random vectors generated from
`StdRng::seed_from_u64([13; 32])` happen to hit such a tie at the
partition boundary on ARM.

## Fix

Detect the tie explicitly (`dists[part_idx - 1] == part_dist`) and
weaken the count and row-id assertions for the affected positions only.
The unconditional distance-value assertions immediately below
(`left_dists < part_dist`, `right_dists >= part_dist`) still verify
partition correctness in both branches.

## Related

Similar test relaxations on this code area:
- #6620 — relaxed recall assertion for HNSW-based indexes
- #6953 — increased nprobes for deterministic recall in
`test_query_delta_indices`

## Test plan

- [x] `cargo test -p lance --lib --release test_build_ivf_pq` passes
locally on x86 macOS (exercises the no-tie branch, all 12 cases
including case_3)
- [ ] CI: linux-arm should now pass on `test_build_ivf_pq::case_3` (the
failing case from PR #6985)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants