IndexSVSIVF: enable IDFilter and runtime intra_query_threads#2
Draft
ibhati wants to merge 1 commit into
Draft
Conversation
- search() now plumbs SearchParameters::sel into the SVS runtime via
make_faiss_id_filter, mirroring IndexSVSVamana::search.
- search() also syncs intra_query_threads to the backend (in addition
to num_threads), so the field can be tuned post-train()/load().
- Tests:
* Rewrite IVFSearchWithIDSelector to actually verify returned labels
fall inside the selector range (drops the 'IVF does not support
IDSelector via SVS runtime' workaround comment).
* IVFSearchWithRestrictiveFilter: narrow filter (~10% of dataset),
asserts every non-sentinel label is in range.
* IVF/DynamicIVFSetIntraQueryThreadsAfterTrain: change
intra_query_threads after add(), confirm next search() rebuilds
the backend pool and get_intra_query_threads() reflects the new
value.
ibhati
pushed a commit
that referenced
this pull request
May 7, 2026
…ult handlers (facebookresearch#5185) Summary: Pull Request resolved: facebookresearch#5185 Three sequential post-BLAS / end_multiple loops in faiss were leaving OMP threads idle while a single thread did all the work. Each is parallelized with `#pragma omp parallel for schedule(static)`, gated by an `if (...)` clause to avoid spawn-cost regressions on small workloads. **Changes** 1. `exhaustive_L2sqr_blas_cmax` (AVX2 + ARM SVE): after `sgemm_` completes, the per-query result accumulation loop ran single-threaded while all OMP threads were idle. Each query `i` reads a distinct row of `ip_block` and writes to `dis_tab[i]/ids_tab[i]` — no cross-query dependencies. Added `#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)` to both ISA specializations. 2. `HeapBlockResultHandler::end_multiple`: `heap_reorder` is O(k log k) per query and was sequential. The original author left a `// maybe parallel for` comment. `add_results` in the same class already has `#pragma omp parallel for`; `end_multiple` was the only remaining sequential step. Gate: `if ((i1 - i0) * k >= 1024)`. 3. `ReservoirBlockResultHandler::end_multiple`: same pattern — reservoir `to_result` (partial sort, O(capacity)) was sequential despite `add_results` being parallelized. `// maybe parallel for` comment removed and replaced with the actual pragma. Gate: `if ((i1 - i0) * this->k >= 1024)`. The `if (...)` thresholds were chosen from microbenchmark data: below the threshold, OMP fanout cost exceeds the work, producing 3-6× regressions on small batches. Above the threshold, parallelization yields 9-14× speedups at 16 threads. Data independence verified for all three: each loop iteration operates on a disjoint slice of `dis_tab`/`ids_tab` indexed by query `i`. **Benchmark results** A local microbench (not landed) was used for A/B measurement. Host: Intel Sapphire Rapids, 28 physical cores, AVX-512. Pinned with `taskset -c 0-15` (OMP=16) and `taskset -c 0` (OMP=1). Median of 5 reps. Synthetic uniform-random distance distributions. `HeapBlockResultHandler::end_multiple` (us, lower better): | nq | k | parent t=1 | this t=1 | parent t=16 | this t=16 | speedup t=16 | |------:|-----:|-----------:|---------:|------------:|----------:|--------------:| | 64 | 10 | 9.2 | 7.2 | 8.1 | 8.3 | 0.98× (gated) | | 64 | 100 | 340 | 345 | 318 | 67 | 4.79× | | 64 | 1000 | 5,796 | 5,700 | 5,886 | 501 | 11.76× | | 512 | 100 | 2,811 | 2,769 | 2,677 | 312 | 8.59× | | 512 | 1000 | 46,109 | 46,070 | 45,758 | 3,778 | 12.11× | | 4096 | 100 | 22,041 | 21,588 | 21,672 | 1,869 | 11.60× | | 4096 | 1000 | 369,069 | 376,541 | 372,481 | 25,442 | 14.64× | `ReservoirBlockResultHandler::end_multiple` (us): | nq | k | parent t=16 | this t=16 | speedup | |------:|-----:|------------:|----------:|--------------:| | 64 | 10 | 18.0 | 18.1 | 0.99× (gated) | | 64 | 100 | 659 | 96 | 6.86× | | 64 | 1000 | 7,592 | 553 | 13.73× | | 512 | 100 | 5,498 | 490 | 11.21× | | 512 | 1000 | 59,548 | 4,677 | 12.73× | | 4096 | 100 | 44,064 | 3,230 | 13.64× | | 4096 | 1000 | 476,388 | 32,237 | 14.78× | `IndexFlatL2::search` end-to-end — drives `exhaustive_L2sqr_blas_cmax` (ms): | nb | nq | k | parent t=16 | this t=16 | speedup | |------:|------:|----:|------------:|----------:|--------:| | 1024 | 1024 | 10 | 1.71 | 1.45 | 1.18× | | 1024 | 4096 | 100 | 58.5 | 15.5 | 3.78× | | 4096 | 4096 | 100 | 76.9 | 39.4 | 1.95× | Single-threaded paths (OMP=1) are within ±5% of parent across all configurations — the `if (...)` clause makes the pragma a no-op below the threshold, eliminating overhead for serial callers. Caveats: the `IndexFlatL2::search` numbers measure the full search path, so the speedup attributed to change #1 also includes contributions from change #2 (heap handler, also called by this path). The `end_multiple` numbers isolate the changed function via `PauseTiming`/`ResumeTiming` around setup. ARM SVE not measured directly (no Graviton host); the AVX2 numbers are the strongest available proxy. Reviewed By: mnorris11 Differential Revision: D103830810 fbshipit-source-id: 8434fa6f16b78c5ff7b2244ac5d5fe9cc8c012a5
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to facebookresearch#4801. Wires up two
IndexSVSIVFfeatures that became available after the SVS runtime was extended.Changes
search()—SearchParameters::selis now plumbed into the SVS runtime viamake_faiss_id_filter, mirroringIndexSVSVamana::search(). Previously the selector was silently ignored.intra_query_threads—search()now syncsintra_query_threadsto the backend (alongsidenum_threads), so it can be tuned aftertrain()/deserialize_impl(). The doc comment inIndexSVSIVF.his updated; the prior "must be set beforetrain()" limitation no longer applies.Tests
IVFSearchWithIDSelector— rewritten to assert every non-sentinel label falls in the selector range.IVFSearchWithRestrictiveFilter(new) — narrow ID window (~10% of dataset) withn_probes = num_centroidsto exercise the adaptive filter loop end-to-end.IVFSetIntraQueryThreadsAfterTrain/DynamicIVFSetIntraQueryThreadsAfterTrain(new) — bumpintra_query_threadsafteradd(), then confirm the nextsearch()rebuilds the backend pool andget_intra_query_threads()reports the new value.