Fix duplicate-cluster trigger to require >=2 kept candidates #327

Merged: ethanglaser merged 1 commit into main from ib/fix_duplicates on May 5, 2026

Conversation

ibhati (Member) commented on May 5, 2026

PR #282 added a post-prune diversity step that fires when "all kept neighbors share the same distance". The original guard, `!result.empty()`, is also satisfied when `result.size() == 1`, because `all_duplicates` stays trivially true when only one candidate has been pushed into `result`. As a result, the branch fires on nodes where alpha-pruning has collapsed `result` to a single retained neighbor and replaces that sole nearest-neighbor edge with a longer "diversity" edge, a case that has nothing to do with duplicate clusters.

Restrict the trigger to `result.size() >= 2` so that it activates only when two or more retained candidates actually share the same distance, matching the intent described in PR #282 / issue #80.
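
The difference between the two guards can be sketched in isolation. This is a minimal illustration, not the code from the PR: `Candidate`, `all_same_distance`, `should_diversify_old`, and `should_diversify_new` are hypothetical names, and the distance comparison is assumed to be exact equality.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a retained neighbor; only the distance matters here.
struct Candidate {
    std::size_t id;
    float distance;
};

// Returns true when every kept candidate has the same distance as the first.
// Vacuously true for an empty list and for a single-element list, which is
// the property the original guard did not account for.
bool all_same_distance(const std::vector<Candidate>& result) {
    for (const auto& c : result) {
        if (c.distance != result.front().distance) {
            return false;
        }
    }
    return true;
}

// Sketch of the original guard: also fires when result.size() == 1.
bool should_diversify_old(const std::vector<Candidate>& result) {
    return !result.empty() && all_same_distance(result);
}

// Sketch of the fixed guard: needs at least two candidates at one distance.
bool should_diversify_new(const std::vector<Candidate>& result) {
    return result.size() >= 2 && all_same_distance(result);
}
```

With a single retained neighbor, the old guard returns true and the new one returns false; with two co-located candidates, both return true, so handling of a real duplicate cluster is unchanged.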

…llow-up)

Empirical impact (sift-scaled, ws=64, deg=64, pool=32, alpha=1.2, L2):
  - Pre-#282 baseline:                            recall = 0.925
  - Post-#282 (current behavior):                 recall = 0.945
    (434 triggers, 184 replacements -- all from result.size()==1)
  - Post-#282 + this fix (result.size() >= 2):    recall = 0.925
    (0 triggers, 0 replacements -- no real duplicate clusters present)

This restores the documented baseline for sift-scaled CI while keeping
the genuine duplicate-cluster handling for datasets that actually
contain co-located vectors (e.g. issue #80 reproducer).

ibhati requested a review from ahuber21 as a code owner on May 5, 2026 18:49
ibhati requested a review from ethanglaser on May 5, 2026 18:49
ethanglaser merged commit 9cee0f1 into main on May 5, 2026
20 checks passed
ethanglaser deleted the ib/fix_duplicates branch on May 5, 2026 19:46
ethanglaser mentioned this pull request on May 13, 2026