
IndexSVSIVF: enable IDFilter and runtime intra_query_threads #2

Draft
ibhati wants to merge 1 commit into ib/svs_ivf from ib/svs_ivf_filter_intraquery

Conversation


@ibhati ibhati commented Apr 29, 2026

Follow-up to facebookresearch#4801. Wires up two IndexSVSIVF features that became available after the SVS runtime was extended.

Changes

  • IDFilter support in search(): SearchParameters::sel is now plumbed into the SVS runtime via make_faiss_id_filter, mirroring IndexSVSVamana::search(). Previously the selector was silently ignored.
  • Runtime intra_query_threads: search() now syncs intra_query_threads to the backend (alongside num_threads), so it can be tuned after train() / deserialize_impl(). The doc comment in IndexSVSIVF.h is updated; the prior "must be set before train()" limitation no longer applies.

Tests

  • IVFSearchWithIDSelector — rewritten to assert every non-sentinel label falls in the selector range.
  • IVFSearchWithRestrictiveFilter (new) — narrow ID window (~10% of dataset) with n_probes = num_centroids to exercise the adaptive filter loop end-to-end.
  • IVFSetIntraQueryThreadsAfterTrain / DynamicIVFSetIntraQueryThreadsAfterTrain (new) — bump intra_query_threads after add(), then confirm the next search() rebuilds the backend pool and get_intra_query_threads() reports the new value.

- search() now plumbs SearchParameters::sel into the SVS runtime via
  make_faiss_id_filter, mirroring IndexSVSVamana::search.
- search() also syncs intra_query_threads to the backend (in addition
  to num_threads), so the field can be tuned post-train()/load().
- Tests:
  * Rewrite IVFSearchWithIDSelector to actually verify returned labels
    fall inside the selector range (drops the 'IVF does not support
    IDSelector via SVS runtime' workaround comment).
  * IVFSearchWithRestrictiveFilter: narrow filter (~10% of dataset),
    asserts every non-sentinel label is in range.
  * IVF/DynamicIVFSetIntraQueryThreadsAfterTrain: change
    intra_query_threads after add(), confirm next search() rebuilds
    the backend pool and get_intra_query_threads() reflects the new
    value.
ibhati pushed a commit that referenced this pull request May 7, 2026
…ult handlers (facebookresearch#5185)

Summary:
Pull Request resolved: facebookresearch#5185

Three sequential post-BLAS / end_multiple loops in faiss were leaving OMP threads idle while a single thread did all the work. Each is parallelized with `#pragma omp parallel for schedule(static)`, gated by an `if (...)` clause to avoid spawn-cost regressions on small workloads.

**Changes**

1. `exhaustive_L2sqr_blas_cmax` (AVX2 + ARM SVE): after `sgemm_` completes, the per-query result accumulation loop ran single-threaded while all OMP threads were idle. Each query `i` reads a distinct row of `ip_block` and writes to `dis_tab[i]/ids_tab[i]` — no cross-query dependencies. Added `#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)` to both ISA specializations.

2. `HeapBlockResultHandler::end_multiple`: `heap_reorder` is O(k log k) per query and was sequential. The original author left a `// maybe parallel for` comment. `add_results` in the same class already has `#pragma omp parallel for`; `end_multiple` was the only remaining sequential step. Gate: `if ((i1 - i0) * k >= 1024)`.

3. `ReservoirBlockResultHandler::end_multiple`: same pattern — reservoir `to_result` (partial sort, O(capacity)) was sequential despite `add_results` being parallelized. `// maybe parallel for` comment removed and replaced with the actual pragma. Gate: `if ((i1 - i0) * this->k >= 1024)`.

The `if (...)` thresholds were chosen from microbenchmark data: below the threshold, OMP fanout cost exceeds the work, producing 3-6× regressions on small batches. Above the threshold, parallelization yields 9-14× speedups at 16 threads. Data independence verified for all three: each loop iteration operates on a disjoint slice of `dis_tab`/`ids_tab` indexed by query `i`.

**Benchmark results**

A local microbench (not landed) was used for A/B measurement. Host: Intel Sapphire Rapids, 28 physical cores, AVX-512. Pinned with `taskset -c 0-15` (OMP=16) and `taskset -c 0` (OMP=1). Median of 5 reps. Synthetic uniform-random distance distributions.

`HeapBlockResultHandler::end_multiple` (us, lower better):

| nq    | k    | parent t=1 | this t=1 | parent t=16 | this t=16 | speedup t=16  |
|------:|-----:|-----------:|---------:|------------:|----------:|--------------:|
| 64    | 10   | 9.2        | 7.2      | 8.1         | 8.3       | 0.98× (gated) |
| 64    | 100  | 340        | 345      | 318         | 67        | 4.79×         |
| 64    | 1000 | 5,796      | 5,700    | 5,886       | 501       | 11.76×        |
| 512   | 100  | 2,811      | 2,769    | 2,677       | 312       | 8.59×         |
| 512   | 1000 | 46,109     | 46,070   | 45,758      | 3,778     | 12.11×        |
| 4096  | 100  | 22,041     | 21,588   | 21,672      | 1,869     | 11.60×        |
| 4096  | 1000 | 369,069    | 376,541  | 372,481     | 25,442    | 14.64×        |

`ReservoirBlockResultHandler::end_multiple` (us):

| nq    | k    | parent t=16 | this t=16 | speedup       |
|------:|-----:|------------:|----------:|--------------:|
| 64    | 10   | 18.0        | 18.1      | 0.99× (gated) |
| 64    | 100  | 659         | 96        | 6.86×         |
| 64    | 1000 | 7,592       | 553       | 13.73×        |
| 512   | 100  | 5,498       | 490       | 11.21×        |
| 512   | 1000 | 59,548      | 4,677     | 12.73×        |
| 4096  | 100  | 44,064      | 3,230     | 13.64×        |
| 4096  | 1000 | 476,388     | 32,237    | 14.78×        |

`IndexFlatL2::search` end-to-end — drives `exhaustive_L2sqr_blas_cmax` (ms):

| nb    | nq    | k   | parent t=16 | this t=16 | speedup |
|------:|------:|----:|------------:|----------:|--------:|
| 1024  | 1024  | 10  | 1.71        | 1.45      | 1.18×   |
| 1024  | 4096  | 100 | 58.5        | 15.5      | 3.78×   |
| 4096  | 4096  | 100 | 76.9        | 39.4      | 1.95×   |

Single-threaded paths (OMP=1) are within ±5% of parent across all configurations — the `if (...)` clause makes the pragma a no-op below the threshold, eliminating overhead for serial callers.

Caveats: the `IndexFlatL2::search` numbers measure the full search path, so the speedup attributed to change #1 also includes contributions from change #2 (heap handler, also called by this path). The `end_multiple` numbers isolate the changed function via `PauseTiming`/`ResumeTiming` around setup. ARM SVE not measured directly (no Graviton host); the AVX2 numbers are the strongest available proxy.

Reviewed By: mnorris11

Differential Revision: D103830810

fbshipit-source-id: 8434fa6f16b78c5ff7b2244ac5d5fe9cc8c012a5