Enable options for leanvec/LVQ data to reside in SSD or have primary only leanvec data #1

Draft

ibhati wants to merge 91 commits into ib/svs_ivf from ib/svs_ssd

Conversation

ibhati (Owner) commented Apr 15, 2026

This PR adds primary_only support to LeanVec indexes in FAISS, allowing users to trade recall for significantly reduced memory usage. It also introduces IndexSVSVamanaSSD for SSD-backed Vamana search.

Changes
Core (faiss/svs/)

IndexSVSVamanaLeanVec: Add primary_only constructor parameter. When enabled, only reduced-dimension primary vectors are stored/used — no full-dimension secondary data for reranking. Serialization via index_write.cpp/index_read.cpp preserves the flag.
IndexSVSVamanaSSD (new): SSD-backed static Vamana index with configurable data placement (RAM/SSD for primary and secondary data). Supports LeanVec and LVQ compression, custom search parameters, and primary_only mode.
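
A hypothetical usage sketch of the new `primary_only` option (the header path follows the `faiss/svs/` layout above, but the constructor's exact argument list is assumed for illustration, not taken from the actual headers):

```cpp
#include <faiss/svs/IndexSVSVamanaLeanVec.h> // path assumed from "Core (faiss/svs/)"

// primary_only = true keeps only the reduced-dimension primary vectors:
// lower memory usage, no full-dimension secondary data for reranking.
faiss::IndexSVSVamanaLeanVec index(
        /*d=*/768,
        /*degree=*/64,
        faiss::METRIC_L2,
        /*leanvec_d=*/192,
        /*primary_only=*/true); // hypothetical argument order
```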

alibeklfc and others added 30 commits April 6, 2026 13:29
…form assertions (facebookresearch#5047)

Summary:
Pull Request resolved: facebookresearch#5047

## Summary

This diff fixes a bug and improves error message quality in `VectorTransform.cpp`.

### Bug Fix (line 153)
`VectorTransform::check_identical()` had a copy-paste bug where `d_in` was checked twice and `d_out` was never checked:
```cpp
// Before (buggy):
FAISS_THROW_IF_NOT(other.d_in == d_in && other.d_in == d_in);
// After (fixed):
FAISS_THROW_IF_NOT_MSG(
        other.d_in == d_in && other.d_out == d_out,
        "input and output dimensions must match");
```
This meant two VectorTransforms with matching `d_in` but different `d_out` would incorrectly pass the identity check. This could lead to subtle bugs when comparing or serializing transform chains (e.g., in `IndexPreTransform`).

### Error Message Improvements
All 28 bare `FAISS_THROW_IF_NOT()` calls in `VectorTransform.cpp` have been converted to `FAISS_THROW_IF_NOT_MSG()` with clear, actionable error messages. Previously, assertion failures would only show the raw C++ condition (e.g., `"Error: 'p > 0' failed"`), which is unhelpful for users. Now each assertion provides semantic context:

- **Dynamic cast failures**: `"failed to cast to HadamardRotation"` instead of `"hr"`
- **Dimension mismatches**: `"input and output dimensions must match when PCA is disabled"` instead of `"din == dout"`
- **Training state errors**: `"CenteringTransform has not been trained"` instead of `"is_trained"`
- **LAPACK errors**: `"LAPACK dgesvd workspace query failed"` instead of `"info == 0"`
- **Parameter validation**: `"map entries must be -1 (unused) or valid input dimension indices"` instead of raw condition

### Affected classes
- `VectorTransform` (base class)
- `LinearTransform`
- `HadamardRotation`
- `PCAMatrix`
- `ITQMatrix`
- `ITQTransform`
- `OPQMatrix`
- `NormalizationTransform`
- `CenteringTransform`
- `RemapDimensionsTransform`

### Design decisions
- Used `FAISS_THROW_IF_NOT_MSG` (not `FAISS_THROW_IF_NOT_FMT`) since all messages are static strings — no runtime formatting needed, keeping zero overhead.
- Error messages follow existing Faiss patterns seen in `index_read.cpp` and other files.
- Each message describes the semantic meaning of the condition, not just the code.

Reviewed By: mnorris11

Differential Revision: D99674067

fbshipit-source-id: cf0fe9a8a7f047013011683d76221682d97beb6c
…arch#4996)

Summary: Pull Request resolved: facebookresearch#4996

Reviewed By: alibeklfc

Differential Revision: D99569811

Pulled By: mnorris11

fbshipit-source-id: 127c6b6b771b81b1f11b0f28dc4936959fafac09
…ookresearch#5034)

Summary:
In GCC, `-mtune=sapphirerapids` sets prefer-vector-width to 256 via `X86_TUNE_AVX256_OPTIMAL`. In LLVM, the same default is set via the prefer-256-bit subtarget feature. This was originally added to avoid AVX-512 frequency throttling on Skylake-SP, but the penalty is negligible since Sapphire Rapids. Switching to explicit ISA flags allows the auto-vectorizer to use zmm registers. I don't see any performance regression. Below is an example.

```cpp
// bench_ip.cpp

#include <benchmark/benchmark.h>
#include <vector>
#include <random>
#include <thread>
#include <numeric>
#include <cstdlib>

_Pragma("GCC push_options") \
_Pragma("GCC optimize (\"unroll-loops,associative-math,no-signed-zeros\")")
static float inner_product(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
_Pragma("GCC pop_options")

static void BM_InnerProduct(benchmark::State &state) {
    const int n = state.range(0);

    std::mt19937 rng(42 + state.thread_index());
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    std::vector<float> a(n), b(n);
    for (int i = 0; i < n; i++) {
        a[i] = dist(rng);
        b[i] = dist(rng);
    }

    for (auto _ : state) {
        float result = inner_product(a.data(), b.data(), n);
        benchmark::DoNotOptimize(result);
    }

    state.SetItemsProcessed(state.iterations() * n);
    state.SetBytesProcessed(state.iterations() * n * 2 * sizeof(float));
}

int main(int argc, char **argv) {
    int num_threads = 1;

    // Parse --threads=N before passing to benchmark
    for (int i = 1; i < argc; i++) {
        if (std::string(argv[i]).rfind("--threads=", 0) == 0) {
            num_threads = std::atoi(argv[i] + 10);
            // Remove from argv so benchmark doesn't choke
            for (int j = i; j < argc - 1; j++)
                argv[j] = argv[j + 1];
            argc--;
            i--;
        }
    }

    std::vector<int64_t> sizes = {384, 768, 1536};
    for (auto sz : sizes) {
        benchmark::RegisterBenchmark("BM_InnerProduct", BM_InnerProduct)
            ->Arg(sz)
            ->Threads(num_threads)
            ->UseRealTime();
    }

    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    return 0;
}
```

**Current**:
`g++ -O3 -march=sapphirerapids -mtune=sapphirerapids bench_ip.cpp -lbenchmark -lpthread`

```text
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_InnerProduct/384/real_time/threads:1        20.7 ns         20.6 ns     33860063 bytes_per_second=138.529Gi/s items_per_second=18.5931G/s
BM_InnerProduct/768/real_time/threads:1        43.0 ns         43.0 ns     16293169 bytes_per_second=133.043Gi/s items_per_second=17.8567G/s
BM_InnerProduct/1536/real_time/threads:1       86.5 ns         86.3 ns      7699745 bytes_per_second=132.321Gi/s items_per_second=17.7598G/s

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
BM_InnerProduct/384/real_time/threads:64        31.2 ns         31.1 ns     22750464 bytes_per_second=91.7077Gi/s items_per_second=12.3088G/s
BM_InnerProduct/768/real_time/threads:64        59.3 ns         59.2 ns     10872768 bytes_per_second=96.4611Gi/s items_per_second=12.9468G/s
BM_InnerProduct/1536/real_time/threads:64        130 ns          130 ns      5561152 bytes_per_second=87.9984Gi/s items_per_second=11.8109G/s
```

**This PR**: `g++ -O3 -mavx512f bench_ip.cpp -lbenchmark -lpthread`

```text
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_InnerProduct/384/real_time/threads:1        17.5 ns         17.5 ns     40056065 bytes_per_second=163.685Gi/s items_per_second=21.9695G/s
BM_InnerProduct/768/real_time/threads:1        34.2 ns         34.1 ns     20446203 bytes_per_second=167.326Gi/s items_per_second=22.4582G/s
BM_InnerProduct/1536/real_time/threads:1       72.4 ns         72.3 ns      9451952 bytes_per_second=158.094Gi/s items_per_second=21.219G/s

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
BM_InnerProduct/384/real_time/threads:64        23.4 ns         23.4 ns     27661760 bytes_per_second=122.456Gi/s items_per_second=16.4358G/s
BM_InnerProduct/768/real_time/threads:64        49.1 ns         48.9 ns     13923776 bytes_per_second=116.614Gi/s items_per_second=15.6517G/s
BM_InnerProduct/1536/real_time/threads:64        105 ns          105 ns      6088320 bytes_per_second=108.736Gi/s items_per_second=14.5943G/s
```

Pull Request resolved: facebookresearch#5034

Test Plan: Verified flag consistency across all three CMake files. Added missing -mavx512vpopcntdq required by hamming_distance/avx512-inl.h and rabitq_avx512.cpp.

Reviewed By: mnorris11

Differential Revision: D99687322

Pulled By: alibeklfc

fbshipit-source-id: 1a27191149f9d0ff9dc392183bbd3c97c9915aa3
…ch#5044)

Summary:
- Fix duplicate word "the the" to "the" in `faiss/utils/quantize_lut.h` (comment) and `benchs/README.md`
- Fix duplicate word "to to" to "to" in `faiss/IndexBinaryHNSW.cpp` (comment)
- Fix subject-verb agreement "This produce" to "This produces" in `INSTALL.md`
- Fix broken grammar "it does not the case" to "it is not the case" in `tests/test_residual_quantizer.py`

Pull Request resolved: facebookresearch#5044

Test Plan:
- [ ] No functional code changes; only comments, documentation, and test comments are modified
- [ ] Verified each fix is a clear typo/grammar correction

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed By: mnorris11

Differential Revision: D99692631

Pulled By: alibeklfc

fbshipit-source-id: e3ae88ba4ca732e9f620e1c205f51e4b827d0730
Summary:
- Replace two `http://github.com` links with `https://github.com` in the README
  - Wiki page link (line 38)
  - Issues page link (line 85)

Pull Request resolved: facebookresearch#5043

Test Plan:
- [x] Both links resolve correctly with HTTPS
- [x] No other changes

Reviewed By: mnorris11

Differential Revision: D99690282

Pulled By: alibeklfc

fbshipit-source-id: 42e94ac7e45e457b1d6b5b511c4092990c696c54
Summary:
- Update C++ language level from C++17 to C++20 in `CONTRIBUTING.md` to match the actual CMake configuration (`CMAKE_CXX_STANDARD 20` in the root `CMakeLists.txt`)
- Remove outdated "progressively dropping python2 support" note from `contrib/README.md` (Python 2 reached end-of-life in January 2020 and Faiss requires Python 3)
- Update shebangs from `python2` to `python3` in three benchmark scripts: `benchs/kmeans_mnist.py`, `benchs/bench_gpu_1bn.py`, and `benchs/bench_vector_ops.py`

Pull Request resolved: facebookresearch#5045

Test Plan:
- [ ] No functional code changes; only documentation and shebangs are modified
- [ ] Verified C++20 is the actual standard by checking `CMakeLists.txt` (`set(CMAKE_CXX_STANDARD 20)`) and `INSTALL.md` (which already references "a C++20 compiler")

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed By: mnorris11

Differential Revision: D99691731

Pulled By: alibeklfc

fbshipit-source-id: 579cf21054cdf6bdaea27f5abb1c56e0b709a922
Summary:
Pull Request resolved: facebookresearch#5025

In C++, when running in dev mode:
```
  Crash chain:
  1. index->add(1, vec) → IndexIVF::add_with_ids() (line 190)
  2. quantizer->assign() → Index::assign() → IndexHNSW::search() — searches the HNSW coarse quantizer with the NaN
  vector
  3. HNSW::search() computes d_nearest = qdis(entry_point) — this returns NaN because the input vector has NaN
  4. NaN is pushed into MinimaxHeap candidates. All comparisons with NaN return false, corrupting heap ordering
  5. search_from_candidates() pops a garbage node ID v0 from the corrupted heap
  6. neighbor_range(v0, ...) hits FAISS_CHECK_RANGE_DEBUG — v0 is out of bounds → crash
```

In the Python test or opt mode:
```
The test timed out — it hung instead of crashing. This is consistent with the analysis: in the Python bindings (which
  run with mode/opt, so no debug assertions), pop_min returns -1, neighbor_range(-1, ...) doesn't crash on the
  assertion (it's a debug-only check), and instead accesses offsets[-1] which is undefined behavior. The NaN corrupts
  the heap, the search loops forever because nvalid never decrements when pop_min returns -1.

  So in Python/opt mode: no stack trace, just a hang. The C++ debug build crashes with the assertion. Both are fixed by
  the same push() NaN→+inf change in HNSW.cpp.
```
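
A minimal sketch of the fix's core idea (the helper name is illustrative; the real change lives in `push()` in HNSW.cpp):

```cpp
#include <cmath>
#include <limits>

// NaN compares false against everything, which silently corrupts heap
// ordering; mapping it to +inf restores a total order and makes the value
// the worst possible candidate under the "smaller = better" convention.
inline float sanitize_distance(float d) {
    return std::isnan(d) ? std::numeric_limits<float>::infinity() : d;
}
```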

Performance impact?
--
The check is in the tightest loop in search_from_candidates. Claude seems to think it is fine:
```
  In search_from_candidates (line 707), push() is called for every neighbor visited during greedy graph traversal — batched 4 at a time via
  add_to_heap. This is indeed the tightest inner loop in HNSW search.

  What std::isnan actually costs

  std::isnan(float) on x86_64 compiles to essentially v != v — a single ucomiss + conditional jump (jp). That's 1-2 instructions.

  Context: what else happens per push call

  Each push call is preceded by significantly more expensive work:
  1. Distance computation (distances_batch_4 at line 718): ~2×d FLOPs per vector (e.g., ~256 FLOPs for d=128)  NOTE: HNSW.cpp:718-726 (qdis.distances_batch_4(...))
  2. Existing comparisons in push: k == n, v >= dis[0], potential heap_pop/heap_push which are O(log efSearch) with memory accesses    NOTE: HNSW.cpp:1401  (k == n), 1402 (v >= dis[0]), 1408 (heap_pop), 1410 (heap_push)
  3. Visited table check (vt.set) with prefetching   NOTE:  HNSW.cpp:689 (vt.prefetch(v1)) and 714 (vt.set(v1))

  The isnan check is ~1 instruction compared to hundreds for the distance computation. Even against just the heap operations (several
  comparisons + pointer chasing), it's a rounding error.

  Branch prediction

  Since NaN distances essentially never occur in normal operation, this branch is always not-taken. The branch predictor will learn this
  within the first few iterations and predict it perfectly for the remainder of the search. A perfectly-predicted not-taken branch is ~0
  cycles on modern CPUs.
```

And mini benchmark:
```
  ┌────────┬────────────────────────┬────────────────────────┬───────────────────┐
  │        │ With NaN check (run 1) │ With NaN check (run 2) │ Without NaN check │
  ├────────┼────────────────────────┼────────────────────────┼───────────────────┤
  │ Mean   │ 401.146 ms             │ 400.515 ms             │ 401.191 ms        │
  ├────────┼────────────────────────┼────────────────────────┼───────────────────┤
  │ Median │ 396.746 ms             │ 398.172 ms             │ 398.520 ms        │
  ├────────┼────────────────────────┼────────────────────────┼───────────────────┤
  │ Min    │ 389.939 ms             │ 388.670 ms             │ 391.850 ms        │
  ├────────┼────────────────────────┼────────────────────────┼───────────────────┤
  │ Stddev │ 18.903 ms              │ 7.833 ms               │ 8.164 ms          │
  └────────┴────────────────────────┴────────────────────────┴───────────────────┘
```

I wanted to make sure it worked regardless of metric type:
```
  For inner product (and cosine, which is IP after normalization), HNSW wraps the distance computer in NegativeDistanceComputer, which
  negates the result. This means all metrics flow through the MinimaxHeap with the same convention: smaller = better.

  So the NaN → +inf replacement works correctly for all metrics:

  ┌────────┬──────────────────────────────┬──────┬───────┬─────────────────┐
  │ Metric │        qdis() returns        │ Best │ Worst │   +inf means    │
  ├────────┼──────────────────────────────┼──────┼───────┼─────────────────┤
  │ L2     │ ‖x-q‖²                       │ 0    │ +inf  │ worst (correct) │
  ├────────┼──────────────────────────────┼──────┼───────┼─────────────────┤
  │ IP     │ -<x,q> (negated)             │ -∞   │ +inf  │ worst (correct) │
  ├────────┼──────────────────────────────┼──────┼───────┼─────────────────┤
  │ Cosine │ -<x̂,q̂> (negated, normalized) │ -1   │ +inf  │ worst (correct) │
  └────────┴──────────────────────────────┴──────┴───────┴─────────────────┘

  In all cases, +inf sits at the top of the CMax heap and gets evicted first when the heap is full (v >= dis[0] returns early at line 1402).
   And pop_min() will never select it over any finite-distance candidate. The semantics are correct regardless of metric type.
```

Reviewed By: mdouze

Differential Revision: D99036639

fbshipit-source-id: e5d6392e800f243f66ce283c8cd35fe0e7558229
Summary:
Pull Request resolved: facebookresearch#4854

This re-enables the AMD ROCm runner that was previously disabled in D86250489.

Changes from the original configuration:
- Updated runner from `faiss-amd-MI200` to `linux-amd-rocm-mi325-ubuntu-24` to match the currently available GitHub Actions runner
- Updated container image from Ubuntu 22.04 to Ubuntu 24.04 to align with the runner environment

Test change:
- Seems like CUDA and HIP disagree about some small rounding errors, so AI updated the test.

Reviewed By: subhadeepkaran

Differential Revision: D94941142

fbshipit-source-id: d5158b7939e3b7327432aa89a9a0d2e5ed1ad190
…decode_impl (facebookresearch#5051)

Summary:
Pull Request resolved: facebookresearch#5051

In `sa_decode_impl<StorageMinMaxT>()`, a local variable `std::vector<StorageMinMaxFP16> minmax(...)` was:

1. Using the wrong type: It hardcoded `StorageMinMaxFP16` instead of the template parameter `StorageMinMaxT`. When this function is instantiated with `StorageMinMaxFP32` (via `IndexRowwiseMinMax::sa_decode`), this would create a vector of the wrong type.

2. Dead code: The vector was allocated but never actually used in the function body. The decoding logic reads `StorageMinMaxT` values directly from the input byte buffer, making this allocation unnecessary.

This change removes the unused variable, eliminating both the type mismatch and the unnecessary memory allocation. The allocation was O(min(chunk_size, n)) per decode call, so removing it also provides a minor performance benefit.

Note: The corresponding `sa_encode_impl` function does NOT have this issue (it correctly uses a local `minmax` that IS used), and the `train_inplace_impl` / `train_impl` functions also correctly use their `minmax` vectors. Only `sa_decode_impl` had this issue.

Reviewed By: mnorris11

Differential Revision: D99851973

fbshipit-source-id: a288a4cd355ccc7d9e13d1f7d61bc54fc524675c
…cebookresearch#5052)

Summary:
Pull Request resolved: facebookresearch#5052

Multiple core Faiss C++ source files use bare `assert()` for runtime invariant checks. Since `assert()` is compiled out in release builds (when `NDEBUG` is defined), these checks silently disappear in production, potentially masking bugs and data corruption.

This diff replaces bare `assert()` calls with Faiss's own `FAISS_THROW_IF_NOT` / `FAISS_THROW_IF_NOT_MSG` macros in 11 core index files. These macros throw `FaissException` with descriptive error messages and remain active in all build modes (debug and release).
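
The before/after pattern, sketched (the condition and message are illustrative, not taken from a specific file):

```cpp
// Before: compiled out when NDEBUG is defined, so release builds skip the check
assert(nlist > 0);

// After: throws faiss::FaissException in all build modes, debug and release
FAISS_THROW_IF_NOT_MSG(nlist > 0, "nlist must be positive");
```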

Files modified:
- AutoTune.cpp
- Clustering.cpp
- IVFlib.cpp
- IndexBinaryIVF.cpp
- IndexIVFFlat.cpp
- IndexIVFPQ.cpp
- IndexIVFPQFastScan.cpp
- IndexIVFPQR.cpp
- IndexLSH.cpp
- IndexPQ.cpp
- IndexRefine.cpp

Impact:
- Prevents silent failures in release builds
- Provides actionable error messages for debugging
- Aligns with Faiss coding conventions (most of the codebase already uses FAISS_THROW_IF_NOT)

Reviewed By: mnorris11

Differential Revision: D99857988

fbshipit-source-id: 89b01e022958495b0883c5faebc82bfe9b17da18
Summary:
Pull Request resolved: facebookresearch#5050

- Implement balanced assignment in clustering.py based on notebook N10159950
- Add a test that shows we improve the imbalance at some cost in MSE

Reviewed By: algoriddle

Differential Revision: D99819394

fbshipit-source-id: 568b6deb7d2b95b8228dbb276c5578df23b01a96
Summary: Pull Request resolved: facebookresearch#5059

Reviewed By: DenisYaroshevskiy

Differential Revision: D99854747

fbshipit-source-id: 8bca36ec90475771ef17356d9a16b0d680a6296b
Summary:
Pull Request resolved: facebookresearch#5060

Five small fixes for Dynamic Dispatch (DD) mode issues found during the DD migration audit.

**Preprocessor guard fixes (3):**

These files use `#ifdef __AVX2__` (or `__AVX__`) in common translation units. In DD mode, common TUs are compiled without `-mavx2`, so `__AVX2__` is never defined and the guarded code is silently disabled. The DD-era equivalent is `COMPILE_SIMD_AVX2`, which is defined target-wide for all TUs in DD mode (and also in static AVX2 builds).

- `ProductQuantizer.cpp`: The dsub=2 fast path for `compute_distance_tables` and `compute_inner_prod_tables` was gated on `__AVX2__ || __aarch64__`. Changed to `COMPILE_SIMD_AVX2 || COMPILE_SIMD_ARM_NEON`. Without this, PQ distance table computation for dsub=2 falls back to the generic path in DD mode.

- `LocalSearchQuantizer.cpp`: The prefetch include and usage were gated on `__AVX2__`. Changed to `COMPILE_SIMD_AVX2`. Without this, LSQ's ICM encoding loop loses prefetch hints in DD mode.

- `prefetch.h`: The x86 prefetch path (`_mm_prefetch` via `<xmmintrin.h>`) was gated on `__AVX__`. This is an SSE intrinsic available on all x86_64 — the correct guard is `__x86_64__ || _M_X64`. The `__AVX__` guard was too restrictive even outside DD mode (it excluded SSE-only x86 builds, though those are rare in practice).
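
The guard change, sketched (simplified from the bullets above):

```cpp
// Before: dead in DD mode, where common TUs are compiled without -mavx2
#ifdef __AVX2__
// ... AVX2 fast path ...
#endif

// After: COMPILE_SIMD_AVX2 is defined target-wide in DD mode and in
// static AVX2 builds, so the fast path stays live in both
#if defined(COMPILE_SIMD_AVX2) || defined(COMPILE_SIMD_ARM_NEON)
// ... fast path (simdlib wrappers work on both x86 and ARM) ...
#endif
```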

**CMake build fixes (2):**

- CMake DD mode (`FAISS_OPT_LEVEL=dd`) was missing `COMPILE_SIMD_AVX512_SPR` from the x86 compile definitions. Buck defines it via `arch_specific_compiler_flags`, but CMake only had `COMPILE_SIMD_AVX2 COMPILE_SIMD_AVX512`. Any code guarded on `COMPILE_SIMD_AVX512_SPR` (AMX-based fast scan kernels) was dead in CMake DD builds.

- `distances_dispatch.h` was listed in `FAISS_HEADERS` (the CMake install list), but it's a private DD-internal header not meant for downstream consumers. Removed from the install list. (It remains in Buck's `header_files()` since Buck uses that list for compilation visibility, not just install.)

Reviewed By: mdouze

Differential Revision: D99966090

fbshipit-source-id: 05f0d5ee6353f850671f8be3932eb18e84cf8f92
Summary:
Pull Request resolved: facebookresearch#5061

Roll out the `for_all_simd_levels` decorator to 62 test classes across
11 test files. This ensures that every available SIMD level (NONE, AVX2,
AVX512, etc.) is exercised in CI, rather than only testing at the
auto-detected (highest) level.

Previously only `TestExtraDistances` in `test_extra_distances.py` used
the decorator. Now all test files covering DD-dispatched code paths are
parameterized: distances, PQ, SQ, fast scan, RaBitQ, partitioning,
HNSW, and binary indices.

Changes:
- Add `for_all_simd_levels` to 62 test classes across 11 test files
- BUCK: move decorated test targets to `supports_static_listing = False`
  (required because the decorator replaces class names with None,
  breaking TPX's static test enumeration)
- test_fast_scan.py, test_fast_scan_ivf.py: apply decorator manually
  after dynamic `setattr` method generation for TestAQFastScan and
  TestIVFAQFastScan (the `setattr` loops reference the class by name)
- test_rabitq_fastscan.py: extract `_create_fastscan_index` as
  module-level helper to fix cross-class method reference that broke
  when TestRaBitQFastScan was replaced by the decorator
- IndexFastScan.cpp: fix `search_implem_14` to auto-cap `qbs` based
  on `bbs`. The accumulate loop dispatch table only instantiates
  certain (nq, BB) pairs (e.g. BB=2 only has nq=1,2). Previously,
  using bbs=64 with the default qbs (batch of 4) would crash with
  "nq=3 bbs=64 not instantiated". Now the query batch size is
  automatically capped to the max nq supported for the given BB.
  Exposed by per-level testing of test_factory_with_batch_size.

Reviewed By: mdouze

Differential Revision: D99978401

fbshipit-source-id: 54858f3810bab91ac5ec8bf0ce0d55a77710727b
Summary:
Pull Request resolved: facebookresearch#4842

Some tools which depend on FAISS recently became much slower because they were accidentally changed to depend on `faiss:faiss_no_multithreading` instead of `faiss:faiss`.

This adds `faiss::has_omp()`, which returns true if a `#pragma omp parallel` region had any effect through the use of a `reduction(max)` which would otherwise be stripped out.

Note:
1. Compile-time check is not sufficient, as the `faiss_no_multithreading` and/or `faiss_omp_mock` targets control whether the `faiss/*.cpp` implementations have effective `#pragma omp` blocks.
2. Depending on the BUCK build mode, a `cpp_binary` which depends on `faiss:faiss_no_multithreading` and `faiss:faiss` *may or may not* link to implementations with OpenMP support.
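
A minimal sketch of the detection idea (illustrative, not the verbatim implementation; on a single-core run the max can legitimately stay 0):

```cpp
#include <omp.h>

// If the parallel pragma was stripped (or OpenMP is mocked out), only one
// thread runs and the reduction never observes a nonzero thread id.
bool has_omp_sketch() {
    int max_tid = 0;
#pragma omp parallel reduction(max : max_tid)
    { max_tid = omp_get_thread_num(); }
    return max_tid > 0;
}
```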

Reviewed By: subhadeepkaran

Differential Revision: D94394555

fbshipit-source-id: ea477dd4146619ba1106a16f3021e0437ece9074
…#5064)

Summary:
Pull Request resolved: facebookresearch#5064

Useful when splitting codes that do and don't need packing. For example, in rabitq, output codes from encode_vectors look like [rqfs_codes][flat_factors], and input to pq4_pack_codes should only be a block of [rqfs_codes][rqfs_codes]...

Reviewed By: alibeklfc

Differential Revision: D100047797

fbshipit-source-id: 9cbad95beba8ddbbbe4e6ce8c4541692d3d7b0fa
…acebookresearch#5065)

Summary:
Pull Request resolved: facebookresearch#5065

The `reservePriorityQueue` helper in HNSW.cpp defines a local `Access` struct inheriting from `std::priority_queue` but uses parenthesized initialization `Access access(std::move(q))`. Apple Clang with libc++ on macOS-14 correctly rejects this because the implicit move constructor of `Access` takes `Access&&`, not `std::priority_queue&&`.

The fix changes from parenthesized initialization to brace initialization `Access access{std::move(q)}`, which uses C++17 aggregate initialization. Since `Access` is an aggregate (no user-declared constructors, only a `using` declaration for member access), brace initialization directly initializes the base class sub-object from the `priority_queue&&` argument. This is backward-compatible with GCC, MSVC, and all Clang versions.

Note: The alternative approach of adding `using std::priority_queue<T, Container, Compare>::priority_queue;` to inherit base constructors was considered but rejected because it removes `Access`'s aggregate status, breaking C++20 parenthesized aggregate initialization that the Linux toolchain (clang19) relies on.
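
A self-contained illustration of the initialization issue (simplified; the real `Access` exposes the protected container of the queue so its capacity can be reserved):

```cpp
#include <queue>

template <class T>
struct Access : std::priority_queue<T> {
    using std::priority_queue<T>::c; // expose the protected container
};

void demo() {
    std::priority_queue<int> q;
    // Access<int> a(std::move(q)); // libc++ rejects: move ctor wants Access&&
    Access<int> a{std::move(q)};    // OK: C++17 aggregate init of the base
    a.c.reserve(128);               // underlying vector is now reachable
}
```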

Reviewed By: junjieqi

Differential Revision: D100090366

fbshipit-source-id: 07821475d2f2fbc205fb3288cf25a6ebea0ca3a5
Summary:
Pull Request resolved: facebookresearch#5062

Convert `utils/partitioning.cpp` SIMD code to dynamic dispatch so that
IVF partition assignment and histogram functions use the correct SIMD
implementation at runtime instead of being dead-coded in DD mode.

The partitioning code has two SIMD blocks:
1. `simd_partitioning` namespace — SIMD-accelerated uint16_t partition
   using simdlib wrappers (simd16uint16, simd32uint8)
2. Histogram subroutines — SIMD 8-bin and 16-bin histogram computation

Both were guarded by `#ifdef __AVX2__` / `#if defined(__AVX2__) || defined(__aarch64__)`
which are always false in DD mode on x86, silently disabling the fast paths.

Approach:
- Extract all SIMD code into `partitioning_simdlib256.h`, a shared header
  included once per ISA TU (AVX2, NEON). The code uses simdlib portable
  wrappers so it works on both x86 and ARM without changes.
- Create per-ISA TUs (`partitioning_avx2.cpp`, `partitioning_neon.cpp`)
  that include the shared header with the correct compiler flags.
- In the common TU, replace `#ifdef` guards with `with_simd_level_256bit`
  dispatch. NONE level falls through to dedicated scalar fallbacks.
- No AVX512 TU needed — code uses only 256-bit ops; AVX512 falls through
  to AVX2 via the dispatch mechanism.

Reviewed By: mdouze

Differential Revision: D99991775

fbshipit-source-id: 726cdf3a46db31ed1ff1f9a8966e471d9f5ac0b1
…earch#5069)

Summary:
Pull Request resolved: facebookresearch#5069

Remove the 10 global bare-name using declarations (simd16uint16,
simd32uint8, simd8uint32, simd8float32, simd256bit, simd512bit,
simd32uint16, simd64uint8, simd16float32) from simdlib_dispatch.h.

These aliases resolved through SINGLE_SIMD_LEVEL which is NONE in DD
mode, creating a trap where per-ISA TU code accidentally uses scalar
emulation. Each file that needs the aliases now declares its own using
with an explicit SIMD level, making the dependency visible.

Behavior-preserving: all files use the same SINGLE_SIMD_LEVEL they
used before, but now explicitly.

Reviewed By: mdouze

Differential Revision: D100033901

fbshipit-source-id: 2db034c2868de275a5f018d886264762776548c5
Summary:
Pull Request resolved: facebookresearch#5063

Fix two bugs in `fbcode/faiss/impl/NSG.cpp`:

**Bug 1: `init_ids[i]` → `init_ids[num_ids]` in `search_on_graph`**

The init loop in `search_on_graph` reads neighbors of the enterpoint from the
knn_graph. When an entry has `id >= ntotal`, it is skipped via `continue`. The
loop variable `i` advances but `num_ids` (the write pointer) does not. The old
code wrote `init_ids[i] = id`, placing valid entries at non-contiguous positions
and leaving gaps in between. The gap-filling loop that follows starts from
`num_ids`, so it never overwrites the internal gaps.

Example with neighbors `[5, 99999, 3, 99999, 7]` and `ntotal=100`:

| i | id    | old: `init_ids[i]=id` | new: `init_ids[num_ids]=id` |
|---|-------|-----------------------|-----------------------------|
| 0 | 5     | `init_ids[0] = 5`     | `init_ids[0] = 5`           |
| 1 | 99999 | skip (gap at [1])     | skip                        |
| 2 | 3     | `init_ids[2] = 3`     | `init_ids[1] = 3`           |
| 3 | 99999 | skip (gap at [3])     | skip                        |
| 4 | 7     | `init_ids[4] = 7`     | `init_ids[2] = 7`           |

Old result: `[5, 0, 3, 0, 7, ...]` — gaps contain 0 (from `vector<int>`
zero-initialization). The consumption loop reads these zeros as node IDs,
biasing the search pool toward node 0 during graph construction and degrading
graph quality.

Fixed result: `[5, 3, 7, ...]` — valid entries packed contiguously, gap-filling
starts from the correct position.
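
A simplified sketch of the packed-write fix (names follow the summary, not the verbatim NSG code):

```cpp
#include <cstdint>
#include <vector>

void collect_init_ids(
        const std::vector<int64_t>& neighbors,
        int64_t ntotal,
        std::vector<int64_t>& init_ids) {
    size_t num_ids = 0;
    for (size_t i = 0; i < neighbors.size() && num_ids < init_ids.size();
         i++) {
        int64_t id = neighbors[i];
        if (id < 0 || id >= ntotal) {
            continue; // skip invalid entry without advancing the write pointer
        }
        init_ids[num_ids++] = id; // was init_ids[i] = id, which left gaps
    }
    // the gap-filling loop then starts from num_ids, the correct position
}
```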

**Bug 2: `sync_prune` out-of-bounds access**

`sync_prune` accesses `pool[start]` without bounds checking. Two cases cause
out-of-bounds reads:
- Pool is empty after augmentation
- Pool contains only the query node itself (e.g., `ntotal=1`)

In both cases `start` advances past `pool.size()` and `pool[start]` is an
out-of-bounds vector read — undefined behavior that crashes under ASAN and
silently corrupts the graph in release builds.

Trace for `ntotal=1, L=1`:
1. `search_on_graph` returns `pool = [{id:0, dist:0}]`
2. `sync_prune(q=0)`: `pool[0].id == q` → `start++` → `start=1 == pool.size()`
3. Old code: `result.push_back(pool[1])` — OOB read
4. Fix: guard checks `start >= pool.size()`, fills graph row with `EMPTY_ID`

**Other fixes:**
- Replace `min = 1e6` (float-to-int truncation) with `std::numeric_limits<int>::max()`
- Remove `srand(0x1998)` from the NSG constructor (global RNG side effect)

Reviewed By: mnorris11

Differential Revision: D100024850

fbshipit-source-id: 0d290801658e381e198b6c6ab54ebe981e0f09f3
…region (facebookresearch#5053)

Summary:
Pull Request resolved: facebookresearch#5053

C++ exceptions thrown inside `#pragma omp parallel` regions that are not
caught within the region call `std::terminate` — they cannot propagate
across thread boundaries.

`IndexIVF::range_search_preassigned` had the same class of bugs fixed in
`search_preassigned` by D99455250:

1. **`scan_list_func` lambda**:
   `FAISS_THROW_IF_NOT_FMT(key < nlist)` was above the try-catch block,
   so a corrupt key >= nlist would throw uncaught and call
   `std::terminate`.

2. **Outer parallel region**:
   `get_InvertedListScanner()`, `scanner->set_query()`, and
   `FAISS_THROW_IF_NOT(scanner.get())` had no try-catch at all.

Fixes:

1. Moved the existing try-catch in `scan_list_func` up to also cover the
   key validation.

2. Wrapped the entire `#pragma omp parallel` body in a try-catch that
   uses the existing `interrupt`/`exception_string`/`exception_mutex`
   pattern to safely propagate exceptions out of the parallel region.
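
The capture-and-rethrow pattern, sketched (the names `interrupt`/`exception_string`/`exception_mutex` follow the summary; the body is simplified):

```cpp
#include <exception>
#include <mutex>
#include <stdexcept>
#include <string>

void parallel_region_sketch() {
    bool interrupt = false;
    std::string exception_string;
    std::mutex exception_mutex;

#pragma omp parallel
    {
        try {
            // ... per-thread work that may throw ...
        } catch (const std::exception& e) {
            std::lock_guard<std::mutex> lock(exception_mutex);
            exception_string = e.what();
            interrupt = true; // workers poll this flag and bail out early
        }
    }
    // rethrow on the calling thread, outside the parallel region
    if (interrupt) {
        throw std::runtime_error(exception_string);
    }
}
```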

Reviewed By: mnorris11

Differential Revision: D99879998

fbshipit-source-id: e768372cbaf8a22a9459fc3fd9b9df6e019897a6
…5068)

Summary:
Fix a bug so that parallelization actually happens.

Pull Request resolved: facebookresearch#5068

Reviewed By: junjieqi

Differential Revision: D100213935

Pulled By: alibeklfc

fbshipit-source-id: 31f4d8b05843d0b97f0539f69db0aebecb0063a8
…random bounds (facebookresearch#5072)

Summary:
Pull Request resolved: facebookresearch#5072

Fix correctness bugs in `NNDescent` `Nhood` copy/move operations and `gen_random` bounds.

## Bug 1: Broken Nhood copy constructor and copy assignment operator

The copy constructor and copy assignment operator for `Nhood` were incomplete:
- Copy assignment used `std::back_inserter` to append to `nn_new` instead of replacing it, leading to duplicate entries on reassignment and heap-use-after-free on self-assignment.
- Neither operation copied `pool`, `nn_old`, `rnn_old`, or `rnn_new`, meaning copied `Nhood` objects had missing neighbor data.
- This caused data loss when `std::vector<Nhood>` reallocated during `push_back`.

Fixed both operations to properly copy all 6 data members. Added self-assignment guard (`if (this != &other)`) to the copy assignment operator. Changed the copy constructor to use a member initializer list in declaration order to avoid `-Wreorder` warnings.

**Proof:** `NhoodCopy.CopyConstructorPreservesAllFields` and `NhoodCopy.CopyAssignmentPreservesAllFields` fail without the fix — `pool.size()` is 0 (expected 3), `nn_old`, `rnn_new`, `rnn_old` are all empty. `NhoodCopy.CopyAssignmentSelfAssign` triggers heap-use-after-free without the self-assignment guard. `NhoodCopy.VectorReallocationPreservesData` shows data loss during `std::vector<Nhood>` reallocation.

## Bug 2: Division by zero in `gen_random`

When `size == N`, the expression `rng() % (N - size)` is a division by zero (undefined behavior). This occurs in `search()` when `search_L` or `topk` equals `ntotal`, because `L_2 = max(search_L, topk)` is passed to `gen_random(rng, init_ids.data(), L_2, ntotal)`.

Added a precondition assertion (`size > 0 && size <= N`) and a Fisher-Yates shuffle for the `size == N` special case.

**Proof:** `TestNNDescentGenRandom.test_search_L_equals_ntotal` crashes (process killed by SIGFPE) without the fix, passes with it.
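
A sketch of the fixed entry point (signature simplified; the `size < N` sampling path is unchanged and omitted):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>

// When size == N the original code evaluated rng() % (N - size), a
// division by zero; a full permutation covers that case instead.
void gen_random_sketch(std::mt19937& rng, int* addr, int size, int N) {
    assert(size > 0 && size <= N); // the added precondition
    if (size == N) {
        std::iota(addr, addr + size, 0);
        std::shuffle(addr, addr + size, rng); // Fisher-Yates
        return;
    }
    // ... original sampling path for size < N ...
}
```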

## Performance: Added move constructor and move assignment operator

Since `std::mutex` is neither copyable nor movable, the compiler cannot generate implicit move operations for `Nhood`. With user-defined copy operations, implicit move generation is suppressed entirely. Without explicit move operations, `std::vector<Nhood>::push_back(Nhood&&)` falls back to copy — 5 unnecessary vector allocations per element.

Added `noexcept` move constructor and move assignment operator. `noexcept` is required for `std::vector` to prefer move over copy during reallocation. The move assignment operator is included for Rule of Five consistency.

**Proof:** All tests pass both with and without move operations, confirming these are a performance optimization, not a correctness fix.

## Cleanup: Removed misleading `omp_get_thread_num()`

The RNG seed in `nndescent()` was `random_seed * 6577 + omp_get_thread_num()`. This function is not inside any `#pragma omp parallel` region — the call chain is `IndexNNDescent::add()` -> `NNDescent::build()` -> `NNDescent::nndescent()`, all sequential. Per the OpenMP specification, `omp_get_thread_num()` returns 0 in sequential context. The `+ 0` is dead code.

**Proof:** No behavioral change. The seed was always `random_seed * 6577`.

Reviewed By: mnorris11

Differential Revision: D100155792

fbshipit-source-id: 042a7d0a53a7696915a96bf1e48a464507f044b3
…h#5040)

Summary:
Pull Request resolved: facebookresearch#5040

Add the missing `key < nlist` upper-bound check in
`IndexIVF::search1()`, which was the only IVF search entry point
lacking this validation. The other two paths —
`search_preassigned()` and `range_search_preassigned()` — already
had this check.

Also add deserialization acceptance tests verifying that IVF indexes
with various quantizer states deserialize correctly:

1. **Surplus centroids** (`ntotal > nlist`):
   Produced by `shard_ivf_index_centroids()`, which distributes
   all of the original quantizer's centroids across shards without
   adjusting `nlist`. The search-time `key < nlist` bounds check
   prevents OOB access if the quantizer returns out-of-range keys.

2. **Trained quantizer** (`ntotal == nlist`):
   The normal trained IVF state.

3. **Sharded quantizer** (`0 < ntotal < nlist`):
   Also produced by `shard_ivf_index_centroids()`, when the
   original quantizer has `ntotal == nlist` and centroids are
   split across shards.

4. **Untrained quantizer** (`ntotal == 0`):
   Legitimate for custom inverted list management.

Reviewed By: mnorris11

Differential Revision: D99494237

fbshipit-source-id: 6a76b55f104b9c233dfdd2625bb0336ed8061463
…#5054)

Summary:
Pull Request resolved: facebookresearch#5054

`IndexSVSVamana::storage_kind` was declared without a default initializer,
and the default constructor is `= default`, so the field was left
uninitialized in default-constructed instances. This is undefined behavior
any time the value is read — including serialization via `write_index`,
which writes the garbage value to disk.

Add `= SVS_FP32` as the default initializer, matching the default used by
the parameterized constructor `IndexSVSVamana(d, degree, metric, storage)`.

This is a safe, behavior-preserving change:

- The parameterized constructor already defaults to `SVS_FP32`, so any
  code constructing an index with arguments is unaffected.

- The two derived classes (`IndexSVSVamanaLVQ`, `IndexSVSVamanaLeanVec`)
  explicitly set `storage_kind` in their own default constructors, so
  they are also unaffected.

- The only code path that changes behavior is default construction of
  `IndexSVSVamana` itself, which previously produced an uninitialized
  (UB) value and now produces `SVS_FP32`.

Reviewed By: mnorris11

Differential Revision: D99891611

fbshipit-source-id: da6acff7bdeb5668a2bf5f3b585bc1a3179004b9
Summary:
Pull Request resolved: facebookresearch#5055

Add deserialization-time validation for the `storage_kind` field in SVS
index types (IndexSVSVamana, IndexSVSVamanaLVQ, IndexSVSVamanaLeanVec)
to reject corrupted or malicious index files before they can cause
crashes.

1. **Read `storage_kind` as int and range-check before cast**:
   `storage_kind` was previously read directly into the `SVSStorageKind`
   enum via `READ1`, which is undefined behavior for out-of-range values.
   Now read into a temporary `int`, validate the value is in
   `[0, SVS_count)`, and only then cast to `SVSStorageKind`. This rejects
   invalid values at deserialization time with a `FaissException` instead
   of reaching `to_svs_storage_kind()` where the `default` branch calls
   `FAISS_ASSERT(false)` and aborts the process.

2. **Add `SVS_count` sentinel to `SVSStorageKind` enum**:
   Follows the convention used by `QT_count`, `ST_count`, and
   `DMT_count` in other FAISS enums. The deserialization validation
   uses this sentinel so it automatically stays correct when new
   storage kinds are added.
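
The read-validate-cast pattern, sketched (`READ1` is FAISS's deserialization macro; the surrounding code is simplified):

```cpp
int storage_kind_raw = 0;
READ1(storage_kind_raw); // read as int, never directly into the enum
FAISS_THROW_IF_NOT_MSG(
        storage_kind_raw >= 0 && storage_kind_raw < SVS_count,
        "invalid SVSStorageKind in serialized index");
idx->storage_kind = static_cast<SVSStorageKind>(storage_kind_raw);
```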

Reviewed By: mnorris11

Differential Revision: D99722676

fbshipit-source-id: 557cf91a963d1d93171fea2a67ba99f19b9b3420
…arch#5056)

Summary:
Pull Request resolved: facebookresearch#5056

Add deserialization-time validation for FastScan `M2` across all six
FastScan index types to reject corrupted or malicious index files that
would cause heap buffer overflows during search.

During normal construction, `M2 = roundup(M, 2)` is an invariant
maintained by `init_fastscan()`. During deserialization, `M2` is read
directly from the file and was not validated. A corrupted file with
`M2 < M` causes `compute_quantized_LUT` to write `M * ksub` bytes
into a buffer sized for `M2 * ksub` bytes, producing an out-of-bounds
write. The `memset` that zeroes padding from M to M2 additionally
underflows as an unsigned subtraction when `M2 < M`.

1. **Added `validate_fastscan_fields()` helper**:
   Consolidates all FastScan field validation into a single function:
   M > 0, ksub > 0, bbs > 0 and 32-aligned, M2 == roundup(M, 2),
   and overflow checks for ksub * M and ksub * M2.

2. **Non-IVF FastScan paths (already had partial validation)**:
   Replaced inline checks in IndexPQFastScan, IndexAdditiveQuantizer-
   FastScan, and IndexRaBitQFastScan with calls to the new helper,
   adding the missing M2 consistency check.

3. **IVF FastScan paths (had no validation at all)**:
   Added validation to IndexIVFPQFastScan, IndexIVFAdditiveQuantizer-
   FastScan, and IndexIVFRaBitQFastScan, which previously had zero
   checks on any FastScan fields.
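
A hedged sketch of the consolidated helper described in item 1 (the exact signature is assumed; `roundup` is the FAISS utility):

```cpp
void validate_fastscan_fields(size_t M, size_t ksub, size_t bbs, size_t M2) {
    FAISS_THROW_IF_NOT_MSG(M > 0, "M must be positive");
    FAISS_THROW_IF_NOT_MSG(ksub > 0, "ksub must be positive");
    FAISS_THROW_IF_NOT_MSG(
            bbs > 0 && bbs % 32 == 0, "bbs must be a positive multiple of 32");
    FAISS_THROW_IF_NOT_MSG(
            M2 == roundup(M, 2), "M2 must equal roundup(M, 2)");
    // plus overflow checks on ksub * M and ksub * M2 (omitted here)
}
```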

Reviewed By: mnorris11

Differential Revision: D99738294

fbshipit-source-id: 8e334993b0e8b4375f9ec173c20754e301b7c9f6
…earch#5077)

Summary:
Pull Request resolved: facebookresearch#5077

Mirror of D100047797 for the non-IVF IndexFastScan hierarchy.
Adds a pure virtual fast_scan_code_size() to IndexFastScan with
concrete implementations in IndexPQFastScan (M2/2),
IndexAdditiveQuantizerFastScan (M2/2), and IndexRaBitQFastScan ((d+7)/8).

Reviewed By: alibeklfc

Differential Revision: D100342866

fbshipit-source-id: 3e5edcb1f45d53eec2b41ca63c7854fd8f1f4280
Summary:
Extend the POSIX mmap reader to Apple platforms and use MAP_FAILED for mmap error checks while keeping madvise best-effort.

Update the C++ and Python mmap tests to exercise Darwin, and stop linking faiss_test against the Python example extension so Python-enabled test builds can run on macOS.

Pull Request resolved: facebookresearch#5058

Reviewed By: junjieqi

Differential Revision: D100351635

Pulled By: alibeklfc

fbshipit-source-id: f384251a634c7d3154103dbc763293b52b093ee8
…okup in sorting.cpp (facebookresearch#5078)

Summary:
Pull Request resolved: facebookresearch#5078

Four fixes in `faiss/utils/sorting.cpp`:

**1. OpenMP directive fix in `fvec_argsort_parallel`**

The initialization loop used `#pragma omp parallel` without the `for` directive. This caused every thread to execute the entire loop independently rather than distributing iterations. With `nt` threads, each `permA[i]` was written by all `nt` threads concurrently — a data race under the C++ memory model (multiple unsynchronized writes to the same non-atomic location), and O(n * nt) wasted work instead of O(n). Fixed by changing to `#pragma omp parallel for`.

In practice, all threads write the same value (`permA[i] = i`), so the output was always correct despite the UB. The fix eliminates the undefined behavior and the redundant work.
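
The one-word fix, sketched:

```cpp
// Before: every thread executes the full loop (nt redundant, racy passes)
#pragma omp parallel
for (size_t i = 0; i < n; i++) {
    permA[i] = i;
}

// After: iterations are divided among threads (one pass total)
#pragma omp parallel for
for (size_t i = 0; i < n; i++) {
    permA[i] = i;
}
```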

**2. RAII memory management in `fvec_argsort_parallel`**

Replaced `new size_t[n]` / `delete[] perm2` with `std::vector<size_t>`. The old code had no realistic exception path between allocation and deallocation (all intermediate code is either C functions or non-throwing OpenMP regions), but the manual `new`/`delete` pattern is fragile against future edits that might introduce a throwing path. The `std::vector` provides RAII lifetime management with no behavioral change.

**3. Removed debug `printf` in `fvec_argsort_parallel`**

A leftover `printf("merge %d %d, %d threads\n", ...)` in the parallel merge loop wrote to stdout during normal operation. Removed.

**4. Missing early termination in `hashtable_int64_to_int64_lookup`**

The linear probing loop did not check for empty slots (`tab[slot * 2] == -1`). In an open-addressing hash table with no deletion support, an empty slot is definitive proof that the key was not inserted — the insert function would have placed it there or earlier. Without this check, lookups for absent keys probed every slot in the bucket before the wrap-around termination at `slot == hk_i`. The fix adds the standard empty-slot check, matching the structure of the insert function (`hashtable_int64_to_int64_add`). This is a performance optimization — the old code always returned the correct result (`-1` after a full bucket scan), just slower.

Reviewed By: mnorris11

Differential Revision: D100317917

fbshipit-source-id: aadfe33b1d76c34e04db7fe0c9b7ca53b4a30c71
scsiguy and others added 29 commits April 19, 2026 08:54
…rch#5112)

Summary:
Pull Request resolved: facebookresearch#5112

Add validation in read_ivf_header() to reject a null quantizer sub-index read from serialized data. The IVF deserialization reads the quantizer via read_index(), which returns nullptr when the stream contains the "null" fourcc. A null quantizer is fundamentally invalid for any IVF index type. Without this check, downstream code (e.g. initialize_IVFPQ_precomputed_table, IndexIVF::search) dereferences the null pointer.

This single validation protects all IVF index types that share read_ivf_header: IndexIVFFlat, IndexIVFPQ, IndexIVFScalarQuantizer, IndexIVFAdditiveQuantizer, and others.

Reviewed By: mnorris11

Differential Revision: D101236489

fbshipit-source-id: d9eb6759024ee2a4a59b838367ebf9299759ff23
…ch (facebookresearch#5113)

Summary:
Pull Request resolved: facebookresearch#5113

Add validation that IndexHNSW2Level (fourcc "IHN2") has storage of an appropriate type, both at deserialization time and at search time.

IndexHNSW2Level::search() uses dynamic_cast to dispatch between Index2Layer and IndexIVFPQ storage types. When storage is null or a different type (e.g. IndexFlat from corrupt serialized data, or a programmatically misconfigured index), the dynamic_cast returns nullptr which is then unconditionally dereferenced, causing a segfault.

Deserialization-time fix: After reading the HNSW storage sub-index for IHN2, validate that storage is non-null and is either Index2Layer or IndexIVFPQ.

Search-time defense-in-depth: Add a FAISS_THROW_IF_NOT check on the dynamic_cast result in IndexHNSW2Level::search() before dereferencing. This protects against programmatically constructed indexes that bypass deserialization validation.
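
The search-time guard, sketched (simplified from the description above):

```cpp
const Index2Layer* index_2l = dynamic_cast<const Index2Layer*>(storage);
const IndexIVFPQ* index_ivfpq = dynamic_cast<const IndexIVFPQ*>(storage);
FAISS_THROW_IF_NOT_MSG(
        index_2l || index_ivfpq,
        "IndexHNSW2Level storage must be Index2Layer or IndexIVFPQ");
// only now is it safe to dispatch on the concrete storage type
```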

Reviewed By: mnorris11

Differential Revision: D101243603

fbshipit-source-id: f3d75c1b19e68bf8539c55877c94749ef2899445
…uerying untrained indexes (facebookresearch#5114)

Summary:
Pull Request resolved: facebookresearch#5114

Add FAISS_THROW_IF_NOT(is_trained) to IndexIVF::search(), IndexIVF::search_preassigned(), IndexIVF::range_search(), and IndexIVF::range_search_preassigned(), mirroring the existing check in IndexScalarQuantizer::search(). This prevents querying untrained IVF indexes deserialized from corrupt data where the ScalarQuantizer trained vector is empty.

The existing deserialization validation in read_ScalarQuantizer correctly allows untrained indexes (is_trained=false with empty trained) to be deserialized, since these are legitimately produced by index_factory before training. However, IndexIVF search methods lacked the is_trained guard that IndexScalarQuantizer::search() has, allowing a deserialized untrained IndexIVFScalarQuantizer to be queried, which causes null-deref in QuantizerTemplate when it indexes into the empty trained vector.

Reviewed By: mnorris11

Differential Revision: D101243973

fbshipit-source-id: eca68dc82e5cca37d4c461b735c5d59a66349248
…facebookresearch#5115)

Summary:
Pull Request resolved: facebookresearch#5115

Add deserialization-time validation for VectorTransform dimension invariants that are enforced by constructors but not by deserialization:

1. NormalizationTransform (VNrm): Require d_in == d_out. The constructor enforces this (both set to d), but deserialization reads them independently. A crafted file with d_in > d_out causes memcpy in apply_noalloc to overflow the output buffer (allocated as n * d_out floats but copied as n * d_in).

2. CenteringTransform (VCnt): Same d_in == d_out invariant.

3. IndexPreTransform (IxPT) chain consistency: Validate that chain[0].d_in == index.d, chain[i].d_in == chain[i-1].d_out for consecutive transforms, and chain.back().d_out == sub_index.d. Without this, mismatched dimensions between transforms cause out-of-bounds reads when one transform produces fewer elements than the next expects.
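
The chain-consistency checks of item 3, sketched (`index_pt` and `sub_index` are illustrative names for the IndexPreTransform being read and its wrapped index; `chain` holds the `VectorTransform` pointers):

```cpp
FAISS_THROW_IF_NOT_MSG(
        chain.front()->d_in == index_pt->d,
        "first transform input dim must match index dim");
for (size_t i = 1; i < chain.size(); i++) {
    FAISS_THROW_IF_NOT_MSG(
            chain[i]->d_in == chain[i - 1]->d_out,
            "consecutive transforms must have matching dims");
}
FAISS_THROW_IF_NOT_MSG(
        chain.back()->d_out == sub_index->d,
        "last transform output dim must match sub-index dim");
```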

Reviewed By: mnorris11

Differential Revision: D101244181

fbshipit-source-id: fbe88bf63d42381297319d4125e750f6d47bc333
…facebookresearch#5117)

Summary:
Pull Request resolved: facebookresearch#5117

Add deserialization byte limit checks before vector::resize calls in read_InvertedLists_up() for both ArrayInvertedListsPanorama ("ilpn") and ArrayInvertedLists ("ilar") paths. Previously, per-list sizes read from serialized data were used directly in .resize() calls without checking against get_deserialization_vector_byte_limit(). The READVECTOR macro enforces this limit, but explicit .resize() calls bypassed it.

Also add mul_no_overflow protection for the ilpn codes allocation (num_elems * code_size) which previously had no overflow check.

Reviewed By: mnorris11

Differential Revision: D101260923

fbshipit-source-id: 24287740642cc9647115676c71508faf8bf8f48e
…rrupt index data (facebookresearch#5118)

Summary:
Pull Request resolved: facebookresearch#5118

Add per-read byte limit enforcement to the ReaderStreambuf bridge between faiss IOReader and std::istream, used by SVS index deserialization. SVS third-party code reads sizes from the stream and immediately allocates (e.g. string::resize, vector::resize) without any size validation. Since SVS operates through std::istream, it completely bypasses faiss's deserialization_vector_byte_limit mechanism enforced in the IOReader/READVECTOR layer.

The fix adds a per_read_byte_limit parameter to ReaderStreambuf. When set, xsgetn() rejects individual read requests that meet or exceed the limit by returning 0 (EOF). This matches READVECTOR semantics where each individual vector allocation is independently checked against deserialization_vector_byte_limit. Small reads (header fields, size values) pass through unimpeded; only oversized bulk reads that correspond to data allocations in the SVS code are rejected. All three SVS deserialization call sites now pass get_deserialization_vector_byte_limit() as the per-read limit.
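
A simplified sketch of the guard (member names are assumed, not copied from the code; `faiss::IOReader::operator()` reads `nitems` items of `size` bytes):

```cpp
#include <streambuf>

#include <faiss/impl/io.h>

class ReaderStreambufSketch : public std::streambuf {
    faiss::IOReader* reader;
    size_t per_read_byte_limit; // 0 = unlimited

   protected:
    std::streamsize xsgetn(char* s, std::streamsize n) override {
        if (per_read_byte_limit > 0 &&
            static_cast<size_t>(n) >= per_read_byte_limit) {
            return 0; // oversized bulk read: report EOF instead of allocating
        }
        return (*reader)(s, 1, n); // read n single-byte items via the IOReader
    }

   public:
    ReaderStreambufSketch(faiss::IOReader* r, size_t limit)
            : reader(r), per_read_byte_limit(limit) {}
};
```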

Reviewed By: mnorris11

Differential Revision: D101261327

fbshipit-source-id: 6e45aec63de42e5b5eaf811f4bb9b06732b09eb5
facebookresearch#5125)

Summary:
Pull Request resolved: facebookresearch#5125

The faiss-gpu conda recipe pins `{{ compiler('cxx') }} =12.4` (GCC 12.4). GCC 12.4 miscompiles the 16-bin SIMD histogram reduction in `partitioning_simdlib256.h`, producing correct results for bins 0-7 but near-zero for bins 8-15. This causes `test_16bin_bounded_bigrange` in `TestHistograms_AVX2` to fail in the CUDA 12.6 GPU nightly.

The bug is in GCC 12's code generation for the AVX2 cross-lane reduction chain (`_mm256_hadd_epi16` → `_mm256_permute2f128_si256` → `_mm256_permutevar8x32_epi32`). GCC 13 and 14 both compile this correctly. The CPU-only `faiss/meta.yaml` leaves the compiler unpinned (gets GCC 14), which is why only the GPU nightly fails.

The GCC 12.4 pin was introduced in D84193438 as part of a batch nightly fix — not a deliberate CUDA compatibility constraint. CUDA 12.6 supports up to GCC 13.x as host compiler (GCC 14 requires CUDA 12.9+), so we widen the pin to `>=12.4,<14`.

Reproduced locally: GCC 12.4 fails, GCC 13.4 passes, GCC 14.2 passes — all on the same faiss source, same test, same machine.

Reviewed By: mdouze

Differential Revision: D101601476

fbshipit-source-id: 8e36c83a9df67ba66408faa4ca392e1bd46d7c87
)

Summary:
Pull Request resolved: facebookresearch#5074

Move `with_simd_level` / `with_simd_level_256bit` calls outside the
enclosing loops so the SIMD level is resolved once rather than on every
iteration.

Sites fixed:
- distances.cpp: knn_inner_products_by_idx, knn_L2sqr_by_idx
- NeuralNet.cpp: ZnLUTCodec::encode
- ClusteringInitialization.cpp: init_kmpp_plus_plus

Reviewed By: mdouze

Differential Revision: D100144174

fbshipit-source-id: bd2369ed4fd9c3b5b54e435c7ee66a03f0e152df
…esearch#5126)

Summary:
Pull Request resolved: facebookresearch#5126

Replace the dispatch_HammingComputer + Run_XXX consumer struct pattern with
with_HammingComputer that takes a C++20 template lambda directly. This
eliminates boilerplate wrapper structs across 8 files.

Before:
  struct Run_foo { using T = void; template<class HC, class... Args> void f(Args... a) { foo<HC>(a...); } };
  Run_foo r; dispatch_HammingComputer(code_size, r, args...);

After:
  with_HammingComputer(code_size, [&]<class HC>() { foo<HC>(args...); });

Reviewed By: algoriddle

Differential Revision: D101350351

fbshipit-source-id: 02a346e8c33ffdb49153cbe13415b748f0a1e847
Summary: Pull Request resolved: facebookresearch#5048

Reviewed By: mnorris11, hanle11

Differential Revision: D99419595

fbshipit-source-id: 9c1214c6f4b88bf41e9d1851dd0acb5c7c5001ef
…#5132)

Summary: Pull Request resolved: facebookresearch#5132

Reviewed By: limqiying, junjieqi

Differential Revision: D101359141

fbshipit-source-id: 7d78875eed114367d4a45215e058f5fa9ebf06a1
…bookresearch#5031)

Summary:
Pull Request resolved: facebookresearch#5031

`IndexHNSW` allocates and initializes locks for `ntotal+n` nodes on every call to `add()`. This makes batched insertion very costly, and incremental insertion prohibitively so.

This diff introduces optional persistent locks for `IndexHNSW` to improve incremental `add()` performance. Previously, `omp_lock_t` arrays of size `ntotal+n` were created/destroyed on each `add()` call. Now locks can be retained via a new `retain_locks` flag (default: false), using a new `HNSW::Lock` RAII wrapper with geometric growth.

RFC: Instead of `retain_locks` being the only way to opt into this new behavior, it could be inferred on the first incremental add: that is, clear the locks after insertion iff `n0 == 0`. Workloads which call `add()` once would be unaffected, while workloads which call `add()` repeatedly would forego the clearing of the lock vector after the second `add()` call and reuse locks for all subsequent calls. The downside would be losing the ability to reclaim the locks after insertion without HNSW-specific behavior at the call site.
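
A hypothetical usage sketch (the flag name `retain_locks` comes from the summary; the index type and batching loop are illustrative):

```cpp
#include <algorithm>

#include <faiss/IndexHNSW.h>

void incremental_build(const float* data, size_t nb, size_t batch, int d) {
    faiss::IndexHNSWFlat index(d, /*M=*/32);
    index.retain_locks = true; // keep per-node locks across add() calls
    for (size_t i = 0; i < nb; i += batch) {
        index.add(std::min(batch, nb - i), data + i * d);
    }
}
```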

Reviewed By: mdouze

Differential Revision: D98232750

fbshipit-source-id: ef55cd9e4eb79793267a29a06502a582873e6a74
Summary:
Pull Request resolved: facebookresearch#5129

## Bug

`HNSW::add_with_locks()` updates two shared member variables — `max_level` and `entry_point` — after releasing the per-node lock, without any synchronization:

```cpp
omp_unset_lock(&locks[pt_id]);

if (pt_level > max_level) {   // read shared state
    max_level = pt_level;     // write shared state
    entry_point = pt_id;      // write shared state
}
```

This function is called from inside `#pragma omp for` in `hnsw_add_vertices()` (IndexHNSW.cpp), meaning multiple threads execute it concurrently. The unprotected check-then-act pattern is a classic TOCTOU race condition.

## Proof by interleaving

Suppose `max_level = 2` and two threads finish their link-building simultaneously:

- Thread A: `pt_level = 4`, `pt_id = 100`
- Thread B: `pt_level = 3`, `pt_id = 200`

| Step | Thread A                              | Thread B                              | max_level | entry_point |
|------|---------------------------------------|---------------------------------------|-----------|-------------|
| 0    | —                                     | —                                     | 2         | (level-2 node) |
| 1    | reads `4 > 2` -> true                 |                                       | 2         |             |
| 2    |                                       | reads `3 > 2` -> true                 | 2         |             |
| 3    | writes `max_level = 4`                |                                       | 4         |             |
| 4    | writes `entry_point = 100`            |                                       | 4         | 100         |
| 5    |                                       | writes `max_level = 3`                | 3         | 100         |
| 6    |                                       | writes `entry_point = 200`            | 3         | 200         |

**Result**: `max_level = 3`, `entry_point = 200` (a node at level 3). But node 100 exists at level 4 — the true maximum. The HNSW invariant that `entry_point` is a node at `max_level` is violated.

## Consequence

Search starts from `entry_point` and walks down from `max_level`. With a wrong entry point at a lower level, the upper levels of the graph are never traversed during search, leading to silently degraded recall. The index does not crash and still returns results — they are just worse.

## Fix

Wrap the check-and-update in `#pragma omp critical` to make it atomic:

```cpp
#pragma omp critical
{
    if (pt_level > max_level) {
        max_level = pt_level;
        entry_point = pt_id;
    }
}
```

This guarantees that only one thread executes the block at a time. In the interleaving above, Thread B would enter the critical section after Thread A completes, see `max_level = 4`, evaluate `3 > 4` as false, and correctly skip the write.

## Note on the read at line 561

`int level = max_level` reads `max_level` without synchronization. This is technically a data race under the C++ memory model, but it is benign: reading a stale value just means the greedy search starts one level too low, which the algorithm handles correctly (it still finds correct neighbors, just slightly less efficiently). Adding synchronization here would introduce overhead on every iteration of a hot loop for negligible benefit.

## Why existing tests did not catch this

1. **Tiny race window**: both threads must pass the `if` check in the few nanoseconds before either writes — extremely unlikely per run.
2. **Subtle consequence**: a wrong entry point degrades recall slightly but does not crash or return wrong types. Tests assert recall thresholds (e.g., recall > 0.9), not exact values.
3. **Rare trigger condition**: the race only fires when two nodes added concurrently both exceed the current `max_level`. Higher HNSW levels are exponentially less probable by design — most nodes are level 0, and the highest levels typically have only 1-2 nodes, making concurrent contention on `max_level` near-impossible in practice.

Reviewed By: mnorris11

Differential Revision: D101444067

fbshipit-source-id: 82b9fdafed0b7c3cc26eb4d6c7e3536e6e12bee3
…search#5130)

Summary:
Pull Request resolved: facebookresearch#5130

This diff fixes four bugs in `Clustering.cpp`: three trigger only for datasets with more than 2,147,483,647 vectors (`INT_MAX`), and one can trigger regardless of dataset size.

## Bug 1: Integer truncation in fast subsampling — out-of-bounds memory access

**Location**: `subsample_training_set()`, line 96

**Before**:
```cpp
std::vector<int> perm;
// ...
perm[i] = rng.rand_int(nx);
```

**Bug**: `rand_int(int max)` takes an `int` parameter. When `nx` is `idx_t` (`int64_t`) and exceeds `INT_MAX`, the implicit narrowing conversion truncates `nx` to `int`. On two's complement (all target platforms), a value like `3,000,000,000` becomes `-1,294,967,296`. The function then generates a "random" index in a garbage range. These values are stored in `perm` and used as array indices:

```cpp
memcpy(x_new + i * line_size, x + perm[i] * line_size, line_size);
```

A negative `perm[i]` produces an out-of-bounds read from before the start of `x`. This is undefined behavior that can crash or silently corrupt data.

**Fix**:
```cpp
std::vector<idx_t> perm;
// ...
perm[i] = rng.rand_int64() % nx;
```

Two changes: (1) `perm` is now `std::vector<idx_t>` so it can hold indices > `INT_MAX`. (2) `rand_int64()` returns `int64_t`, and `% nx` produces a value in `[0, nx)` without any narrowing. The result is stored losslessly in `idx_t`.
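
A standalone illustration of the narrowing, assuming nothing beyond the standard library (`take_int` is a stand-in for `rand_int`'s `int` parameter):

```cpp
#include <cstdint>
#include <cstdio>

// Passing an int64_t above INT_MAX through an int parameter wraps on
// two's complement targets, producing the garbage range described above.
int take_int(int max) {
    return max;
}

int main() {
    int64_t nx = 3000000000; // > INT_MAX
    std::printf("%d\n", take_int(nx)); // prints -1294967296
}
```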

## Bug 2: Missing guard in standard subsampling path

**Location**: `subsample_training_set()`, lines 99-108

**Before**:
```cpp
} else {
    perm.resize(nx);
    rand_perm(perm.data(), nx, actual_seed);
}
```

**Bug**: `rand_perm(int* perm, size_t n, int64_t seed)` takes `int*` and internally does `perm[i] = i`. When `nx > INT_MAX`, the value `i` (a `size_t`) is narrowed to `int` on assignment, wrapping to negative values. These negative values are then used as dataset indices — same out-of-bounds access as Bug 1.

**Fix**:
```cpp
} else {
    FAISS_THROW_IF_NOT_FMT(
            nx <= static_cast<idx_t>(std::numeric_limits<int>::max()),
            "Dataset too large (%" PRId64
            ") for standard subsampling; "
            "set use_faster_subsampling=true",
            nx);
    std::vector<int> int_perm(nx);
    rand_perm(int_perm.data(), nx, actual_seed);
    perm.assign(int_perm.begin(), int_perm.end());
}
```

Three parts: (1) A guard that fails early with a clear error message directing the user to the fast path (which handles large datasets correctly via the Bug 1 fix). (2) A temporary `std::vector<int>` to satisfy `rand_perm`'s `int*` API — safe because the guard guarantees `nx <= INT_MAX`. (3) Copy into the `idx_t` perm vector so both paths produce the same type for downstream code.

We chose not to change `rand_perm`'s signature from `int*` to `idx_t*` because it is a public API in `faiss/utils/random.h` and changing it would break all callers.

## Bug 3: Infinite loop in split_clusters

**Location**: `split_clusters()`, lines 239-265

**Before**:
```cpp
for (cj = 0; true; cj = (cj + 1) % k) {
    float p = (hassign[cj] - 1.0) / (float)(n - k);
    float r = rng.rand_float();
    if (r < p) {
        break;
    }
}
```

**Bug**: This loop probabilistically selects a cluster to split (to replace an empty cluster). The probability of picking cluster `cj` is `p = (hassign[cj] - 1) / (n - k)`. When `hassign[cj] = 1` (cluster has exactly one vector), `p = 0 / (n - k) = 0`. No random float `r` satisfies `r < 0`, so that cluster is never picked.

**Proof of infinite loop**: If all non-empty clusters have exactly 1 vector assigned (which happens with bad initialization, adversarial data, or too many clusters), then every `p = 0` and the loop condition `true` is never broken. The loop spins forever, hanging the process.

Even in non-degenerate cases, the loop can be extremely slow. Example: `n = 10,000,000`, `k = 1000`, largest cluster has 50,000 vectors. Per-cluster probability: `p = 49999 / 9999000 ≈ 0.005`. Expected iterations to find a match: ~200. But with smaller clusters or larger `n`, this grows without bound.

**Fix**:
```cpp
size_t max_tries = 10 * k;
size_t n_tries = 0;
bool found = false;
for (cj = 0; n_tries < max_tries; cj = (cj + 1) % k) {
    float p = (hassign[cj] - 1.0) / (float)(n - k);
    float r = rng.rand_float();
    if (r < p) {
        found = true;
        break;
    }
    n_tries++;
}
if (!found) {
    cj = 0;
    for (size_t j = 1; j < k; j++) {
        if (hassign[j] > hassign[cj]) {
            cj = j;
        }
    }
}
```

After `10 * k` attempts (10 full passes through all clusters), the loop falls back to deterministically picking the largest cluster. This is semantically correct because the probabilistic selection is already weighted by cluster size — larger clusters have higher `p`. The deterministic fallback produces the most likely outcome of the probabilistic selection. Termination is guaranteed in O(k) time.

## Bug 4: Integer overflow in objective accumulation loop

**Location**: `Clustering::train_encoded()`, line 535

**Before**:
```cpp
for (int j = 0; j < nx; j++) {
    obj += dis[j];
}
```

**Bug**: `nx` is `idx_t` (`int64_t`). When `nx > INT_MAX`, `int j` overflows at 2,147,483,647. Signed integer overflow is undefined behavior per the C++ standard. In practice on two's complement, `j` wraps to `-2,147,483,648`, which satisfies `j < nx`, so the loop continues with a negative index. `dis[j]` with negative `j` is an out-of-bounds read — crash or garbage accumulation.

**Proof**: For `nx = 3,000,000,000`:
- `j` increments from 0 to 2,147,483,647 (correct)
- Next increment: UB, typically wraps to -2,147,483,648
- `-2,147,483,648 < 3,000,000,000` is true (the `int` operand is converted to `int64_t` for the comparison, and the inequality holds)
- `dis[-2147483648]` — out-of-bounds access

**Fix**:
```cpp
for (idx_t j = 0; j < nx; j++) {
    obj += dis[j];
}
```

`idx_t` matches `nx`'s type. The loop variable can represent all valid indices up to `nx`.

Reviewed By: mnorris11

Differential Revision: D101624009

fbshipit-source-id: b961f2677f7e7b93642fe795cfe6ca77812573d3
…#5075)

Summary:
Pull Request resolved: facebookresearch#5075

In DD mode, the QBS (bbs=32) accumulate path always used 256-bit kernels,
even in the AVX512 per-ISA TU. The 512-bit kernels in kernels_simd512.h
were dead because bare simdlib aliases resolve to _tpl<NONE> in DD mode,
and 512-bit NONE types don't exist (empty primary templates).

Fix: add function-local using declarations in both 512-bit kernel functions
to bind types to explicit AVX512/AVX2 levels. Create accumulate_loops_512.h
with FixedStorage512 (a non-virtual intermediate handler that bridges the
AVX2→NONE type gap via storeu/loadu at the handler boundary) and the 512-bit
QBS accumulate loop. Wire it into dispatching.h's ScannerMixIn behind an

Reviewed By: mdouze

Differential Revision: D100151879

fbshipit-source-id: b801f897f2d061a8448842f42edcdeb3a447eafd
…ebookresearch#5136)

Summary:
Pull Request resolved: facebookresearch#5136

Fixes integer truncation in `IDSelectorBatch::is_member` on platforms where `long` is 32-bit (Windows LLP64).

**Root cause.** `IDSelectorBatch::mask` is declared as `idx_t` (i.e. `int64_t` — see `MetricType.h:51`) and is computed in the constructor as `mask = ((idx_t)1 << nbits) - 1`, where `nbits = ceil(log2(n)) + 5`. For bloom filters sized for `n >= ~134M` ids, `nbits >= 32` and `mask` requires more than 32 bits to represent. The expression `i & mask` therefore yields a 64-bit `idx_t`. The previous code stored the result in a local `long im`:

- LP64 ABI (Linux, macOS x86_64/arm64): `long` is 64-bit — no truncation, behaves correctly.
- LLP64 ABI (Windows x86_64, MinGW): `long` is 32-bit — silently truncates the upper bits.

After truncation, `im >> 3` indexes the wrong bloom slot and `1 << (im & 7)` tests the wrong bit. This produces false negatives in the bloom filter, causing `is_member` to incorrectly return `false` for ids that are in the set, which silently drops legitimate matches during selection.

**Fix.** Change the local from `long im` to `idx_t im` so its type matches both operands of `i & mask`. This eliminates the platform-dependent truncation. As a small follow-on cleanup, change the early `return 0;` to `return false;` to match the function's `bool` return type (no behavior change — both compile to the same value).
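
A minimal sketch of the fixed lookup, assuming the member names used in this description (`mask`, `bloom`, `set`) rather than the exact faiss source:

```cpp
#include <cstdint>

#include <unordered_set>
#include <vector>

using idx_t = int64_t;

struct IDSelectorBatchSketch {
    std::unordered_set<idx_t> set; // exact id set
    std::vector<uint8_t> bloom;    // bloom filter bits
    idx_t mask = 0;                // (1 << nbits) - 1, may need > 32 bits

    bool is_member(idx_t i) const {
        idx_t im = i & mask; // was `long im`: truncated on LLP64 (Windows)
        if (!(bloom[im >> 3] & (1 << (im & 7)))) {
            return false; // bloom filter says definitely not present
        }
        return set.count(i) != 0; // confirm against the exact set
    }
};
```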

**Scope.** Intentionally narrow. Earlier iterations of this diff also widened `DirectMap::update_codes` from `int n` to `idx_t n` and added an `if (ii < 0) return false;` guard in `IDSelectorBitmap::is_member`. Both were reverted after review:

- The `DirectMap::update_codes` widening was cosmetic: its sole caller `IndexIVF::update_vectors` still takes `int n` (see `IndexIVF.h:357`), so widening the inner type cannot unlock any larger batch size. Lifting the 2^31 cap would require widening the public virtual `update_vectors`, all overrides, and the C API in `IndexIVFFlat_c.{h,cpp}` — out of scope here, and a separate diff if desired.
- The `IDSelectorBitmap` negative-id guard was redundant: per `[conv.integral]` the existing `uint64_t i = ii;` for negative `ii` produces a value in `[2^63, 2^64)`, so `i >> 3 >= 2^60`, which is unconditionally `>= n` for any physically realizable bitmap (`n` is bounded far below 2^60 by addressable memory). The pre-existing `(i >> 3) >= n` check already handles the case correctly.

Reviewed By: mnorris11

Differential Revision: D101801522

fbshipit-source-id: 719d6dcc26ece5faf0dfb927e4639e322cf1a6fd
…ch#5134)

Summary:
Pull Request resolved: facebookresearch#5134

Expand DD test coverage by applying for_all_simd_levels to existing test
classes that exercise DD-dispatched code paths but were previously pinned
to a single SIMD level.

This is the "mega decorator" diff from the DD test coverage gaps plan --
pure decorator additions, no new test logic. Follow-up diffs add new
test files and numerical cross-level assertions for gaps the decorator
alone cannot close.

Classes decorated (grouped by area):

* Binary Hamming non-IVF: TestRange and TestKnn in
  test_binary_hashindex.py; TestBinarySearchParams in
  test_binary_search_params.py; TestIndexBinaryFromFloat in
  test_index_binary_from_float.py; TestSpectralHash in
  test_index_accuracy.py.

* IVFPQ / search: EvalIVFPQAccuracy in test_index.py;
  TestSelector and TestSearchParams in test_search_params.py.

* Flat / refine / Panorama: TestIndexFlat, TestIndexFlatL2,
  TestIndexFlatL2Panorama, TestScalarQuantizer in test_index.py;
  TestDistanceComputer, TestIndexRefineSearchParams,
  TestIndexRefineRangeSearch in test_refine.py;
  TestIndexRefinePanorama, TestIndexFlatPanorama,
  TestIndexHNSWFlatPanorama, TestIndexIVFFlatPanorama in their
  respective files.

* Quantizer encode: TestResidualQuantizer,
  TestIndexResidualQuantizerSearch in test_residual_quantizer.py;
  TestComponents, TestLocalSearchQuantizer in
  test_local_search_quantizer.py.

* Fast scan: TestFastScanFiltering, TestBlockSkipConsistency,
  TestFastScanRangeSearchFilter in test_fastscan_filter.py.

* Broader index tests (search-exercising classes only):
  TestParameterSpace in test_autotune.py; TestSpectralHash in
  test_factory.py; TestMerge1, TestMerge2, TestRemoveFastScan in
  test_merge_index.py.

BUCK changes move decorated tests from the legacy-listing lists to the
simd_levels lists (which use supports_static_listing = False, required
for dynamic class name generation), and add the new entries
test_binary_search_params, test_fastscan_filter, test_refine_panorama,
test_hnsw_panorama.

___

overriding_review_checks_triggers_an_audit_and_retroactive_review
Oncall Short Name: fair_umami_cluster

Differential Revision: D101822335

fbshipit-source-id: 9a432c6d6bee201c3731713226405a8c8ecebbe6
Summary:
_Note: Should be merged before facebookresearch#4970 (IVFPQPanorama)._

## Changes
### Performance

This PR implements various optimizations to Panorama (L2Flat and IVFFlat).
1. Disaggregate distance computation from pruning decisions to avoid branches in distance computation hotpath.
2. Early batch processing termination when no points are remaining.
3. Manually unrolled distance kernel.
4. Template distance computation on level width for autovectorization.
5. `if constexpr (C::is_max)` instead of `C::cmp` for autovectorized pruning.
6. Byteset for vectorized compacting of active indices using `_pext_u64` (see the sketch after this list).
7. Template distance computation and pruning on first level (no `active_indices` indirection) to let it autovectorize.
8. Hoist buffer allocations into `IndexFlat`/`IVFFlatScannerPanorama`.
9. Expose `batch_size` as a parameter for IVFFlatPanorama (for consistency with `IndexFlatPanorama`, but also because a `batch_size` of 1024 can improve performance).
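
As a concrete illustration of item 6, here is a minimal sketch of byteset compaction, assuming each lane's byte is 0xFF (keep) or 0x00 (prune); names are illustrative, not the PR's actual kernel:

```cpp
#include <immintrin.h>

#include <cstdint>
#include <cstring>

// _pext_u64 gathers the bits of `identity` selected by the byte mask, so
// whole bytes of lane indices are packed to the front. Requires BMI2.
static inline int compact_active8(const uint8_t keep[8], uint8_t out[8]) {
    uint64_t mask;
    std::memcpy(&mask, keep, 8);
    const uint64_t identity = 0x0706050403020100ULL; // lane indices 0..7
    uint64_t packed = _pext_u64(identity, mask);
    std::memcpy(out, &packed, 8);
    return __builtin_popcountll(mask) / 8; // number of surviving lanes
}
```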

### Other

 - Define `kDefaultBatchSize` once in `Panorama.h` (previously defined in 5 separate locations).
 - Allow `bench_flat_l2_panorama.py` and `bench_ivf_flat_panorama.py` to accept `gist1M` or `sift1M` as dataset to bench on.

## Results

Together, these optimizations yield substantial additional speedups, especially on lower-dimensional datasets like SIFT (128d), by sharply reducing Panorama's overhead:

**GIST1M (IVF128, nlist=128, nlevels=16)**
| nprobe | Recall@10 | Old Speedup | New Speedup | _Additional_ Speedup |
|--------|-----------|-------------------------|-------------------------|--------------------|
|      1 | 0.1439    |                    3.92x |                    3.93x |               1.00x |
|      2 | 0.2605    |                    4.71x |                    5.19x |               1.10x |
|      4 | 0.4369    |                    5.53x |                    6.75x |               1.22x |
|      8 | 0.6470    |                    6.37x |                    8.21x |               1.29x |
|     16 | 0.8780    |                    7.30x |                    9.74x |               1.33x |
|     32 | 0.9764    |                    8.33x |                   11.29x |               1.36x |
|     64 | 0.9868    |                    9.38x |                   12.74x |               1.36x |

**SIFT1M (IVF128, nlist=128, nlevels=8)**
| nprobe | Recall@10 | Old Speedup | New Speedup | _Additional_ Speedup |
|--------|-----------|-------------------------|-------------------------|--------------------|
|      1 | 0.2678    |                    1.20x |                    1.81x |               1.52x |
|      2 | 0.4584    |                    1.38x |                    2.23x |               1.62x |
|      4 | 0.6855    |                    1.59x |                    2.70x |               1.70x |
|      8 | 0.8760    |                    1.83x |                    3.44x |               1.88x |
|     16 | 0.9679    |                    2.11x |                    4.72x |               2.24x |
|     32 | 0.9855    |                    2.44x |                    5.61x |               2.30x |
|     64 | 0.9861    |                    2.74x |                    6.39x |               2.33x |

### Raw Data

Collected by running the new benches on `main` and this branch. On main, you cannot specify `batch_size` so remove the `{1024}` from the factory string in the new benches to run them there. The results above are calculated from the following raw data as follows:
1. For each experiment (e.g., GIST old or SIFT new), compute the Panorama speedup at each `nprobe` as (original ms per query) / (Panorama ms per query).
2. For each pairing of (old) and (new) results, compute the additional speedup as (new speedup) / (old speedup).

#### Before (`main`)

GIST1M:
```
======IVF128,Flat
	nprobe   1, Recall@10: 0.145200, speed: 2.705442 ms/query, dims scanned: 100.00%
	nprobe   2, Recall@10: 0.260800, speed: 5.456891 ms/query, dims scanned: 100.00%
	nprobe   4, Recall@10: 0.441900, speed: 10.895120 ms/query, dims scanned: 100.00%
	nprobe   8, Recall@10: 0.648200, speed: 21.676788 ms/query, dims scanned: 100.00%
	nprobe  16, Recall@10: 0.878000, speed: 43.142261 ms/query, dims scanned: 100.00%
	nprobe  32, Recall@10: 0.975400, speed: 84.498397 ms/query, dims scanned: 100.00%
	nprobe  64, Recall@10: 0.986800, speed: 160.092644 ms/query, dims scanned: 100.00%
======PCA960,IVF128,FlatPanorama16
	nprobe   1, Recall@10: 0.143900, speed: 0.689507 ms/query, dims scanned: 12.96%
	nprobe   2, Recall@10: 0.260500, speed: 1.158416 ms/query, dims scanned: 11.18%
	nprobe   4, Recall@10: 0.436900, speed: 1.968814 ms/query, dims scanned: 9.90%
	nprobe   8, Recall@10: 0.647000, speed: 3.401469 ms/query, dims scanned: 8.91%
	nprobe  16, Recall@10: 0.878000, speed: 5.912757 ms/query, dims scanned: 8.10%
	nprobe  32, Recall@10: 0.976400, speed: 10.147847 ms/query, dims scanned: 7.44%
	nprobe  64, Recall@10: 0.986800, speed: 17.074573 ms/query, dims scanned: 6.93%
```

SIFT1M:

```
======IVF128,Flat
	nprobe   1, Recall@10: 0.267480, speed: 0.285990 ms/query, dims scanned: 100.00%
	nprobe   2, Recall@10: 0.457520, speed: 0.564067 ms/query, dims scanned: 100.00%
	nprobe   4, Recall@10: 0.685320, speed: 1.111833 ms/query, dims scanned: 100.00%
	nprobe   8, Recall@10: 0.877210, speed: 2.195088 ms/query, dims scanned: 100.00%
	nprobe  16, Recall@10: 0.967730, speed: 4.338444 ms/query, dims scanned: 100.00%
	nprobe  32, Recall@10: 0.985400, speed: 8.500538 ms/query, dims scanned: 100.00%
	nprobe  64, Recall@10: 0.986100, speed: 16.349893 ms/query, dims scanned: 100.00%
======PCA128,IVF128,FlatPanorama8
	nprobe   1, Recall@10: 0.267670, speed: 0.239243 ms/query, dims scanned: 27.97%
	nprobe   2, Recall@10: 0.458320, speed: 0.408590 ms/query, dims scanned: 24.42%
	nprobe   4, Recall@10: 0.685480, speed: 0.699694 ms/query, dims scanned: 21.50%
	nprobe   8, Recall@10: 0.875930, speed: 1.197310 ms/query, dims scanned: 19.06%
	nprobe  16, Recall@10: 0.967760, speed: 2.055968 ms/query, dims scanned: 16.98%
	nprobe  32, Recall@10: 0.985370, speed: 3.481555 ms/query, dims scanned: 15.26%
	nprobe  64, Recall@10: 0.985980, speed: 5.977346 ms/query, dims scanned: 14.02%
```

#### After (`optimize-pano`)

GIST1M:
```
======IVF128,Flat
	nprobe   1, Recall@10: 0.145200, speed: 2.625779 ms/query, dims scanned: 100.00%
	nprobe   2, Recall@10: 0.260800, speed: 5.285007 ms/query, dims scanned: 100.00%
	nprobe   4, Recall@10: 0.441900, speed: 10.555867 ms/query, dims scanned: 100.00%
	nprobe   8, Recall@10: 0.648200, speed: 21.012494 ms/query, dims scanned: 100.00%
	nprobe  16, Recall@10: 0.878000, speed: 41.794143 ms/query, dims scanned: 100.00%
	nprobe  32, Recall@10: 0.975400, speed: 81.865038 ms/query, dims scanned: 100.00%
	nprobe  64, Recall@10: 0.986800, speed: 155.067333 ms/query, dims scanned: 100.00%
======PCA960,IVF128,FlatPanorama16_1024
	nprobe   1, Recall@10: 0.143900, speed: 0.668800 ms/query, dims scanned: 20.33%
	nprobe   2, Recall@10: 0.260500, speed: 1.018440 ms/query, dims scanned: 14.81%
	nprobe   4, Recall@10: 0.436900, speed: 1.563622 ms/query, dims scanned: 11.72%
	nprobe   8, Recall@10: 0.647000, speed: 2.557981 ms/query, dims scanned: 9.82%
	nprobe  16, Recall@10: 0.878000, speed: 4.292616 ms/query, dims scanned: 8.56%
	nprobe  32, Recall@10: 0.976400, speed: 7.248832 ms/query, dims scanned: 7.68%
	nprobe  64, Recall@10: 0.986800, speed: 12.171319 ms/query, dims scanned: 7.06%
```

SIFT1M:

```
======IVF128,Flat
        nprobe   1, Recall@10: 0.267480, speed: 0.295904 ms/query, dims scanned: 100.00%
        nprobe   2, Recall@10: 0.457520, speed: 0.583204 ms/query, dims scanned: 100.00%
        nprobe   4, Recall@10: 0.685320, speed: 1.150055 ms/query, dims scanned: 100.00%
        nprobe   8, Recall@10: 0.877210, speed: 2.425575 ms/query, dims scanned: 100.00%
        nprobe  16, Recall@10: 0.967730, speed: 5.509365 ms/query, dims scanned: 100.00%
        nprobe  32, Recall@10: 0.985400, speed: 10.794491 ms/query, dims scanned: 100.00%
        nprobe  64, Recall@10: 0.986100, speed: 20.727924 ms/query, dims scanned: 100.00%
======PCA128,IVF128,FlatPanorama8_1024
        nprobe   1, Recall@10: 0.267750, speed: 0.163266 ms/query, dims scanned: 34.97%
        nprobe   2, Recall@10: 0.458370, speed: 0.261109 ms/query, dims scanned: 27.97%
        nprobe   4, Recall@10: 0.685540, speed: 0.425977 ms/query, dims scanned: 23.30%
        nprobe   8, Recall@10: 0.875990, speed: 0.704580 ms/query, dims scanned: 19.98%
        nprobe  16, Recall@10: 0.967860, speed: 1.167465 ms/query, dims scanned: 17.45%
        nprobe  32, Recall@10: 0.985470, speed: 1.925296 ms/query, dims scanned: 15.50%
        nprobe  64, Recall@10: 0.986080, speed: 3.245793 ms/query, dims scanned: 14.14%
```

Pull Request resolved: facebookresearch#5041

Reviewed By: alibeklfc

Differential Revision: D101753364

Pulled By: mnorris11

fbshipit-source-id: e6da1aa05e465e83632239bc69548bf8f5353d49
…acebookresearch#5138)

Summary:
Pull Request resolved: facebookresearch#5138

## Summary

This diff fixes several bugs and memory safety issues in VectorTransform.cpp:

### 1. Bug fix: Wrong beta parameter in PCAMatrix::train (sgemm_ call)
In the code path where n < d_in (Gram matrix approach), the sgemm_ call that computes
`PCAMat = xc * gram` incorrectly uses beta=1.0 instead of beta=0.0. This means the
computation is actually `PCAMat = xc * gram + PCAMat` instead of `PCAMat = xc * gram`.

On the first training call this works by accident because std::vector::resize
zero-initializes new elements. However, if PCAMatrix::train() is called a second time
(e.g., retraining with different data), PCAMat retains stale values from the previous
training, corrupting the PCA matrix and producing incorrect dimensionality reduction results.
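
A toy reference gemm (not the faiss call) to make the beta semantics explicit; with stale values left in C from a previous train() and beta = 1, the result silently accumulates, while beta = 0 overwrites C as retraining requires:

```cpp
#include <vector>

// C = alpha * A * B + beta * C, all matrices row-major.
void gemm_ref(int m, int n, int k, float alpha,
              const std::vector<float>& A,  // m x k
              const std::vector<float>& B,  // k x n
              float beta, std::vector<float>& C) {  // m x n
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```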

### 2. Memory leak fix: eig() function
Replaced raw `new double[]` with `std::vector<double>` for the LAPACK workspace buffer.
The old code would leak memory if dsyev_ threw an exception.

### 3. Memory leak fix: LinearTransform::transform_transpose()
Replaced raw `new float[]` with `std::vector<float>` for the bias-corrected buffer.
The old code would leak memory if sgemm_ threw an exception between allocation and
the manual delete[].

### 4. Missing error check: OPQ SVD workspace query
Added FAISS_THROW_IF_NOT_FMT check after the sgesvd_ workspace query in OPQMatrix::train().
Previously, if the workspace query failed, the returned workspace size would be garbage,
leading to either a crash or silent data corruption in the subsequent SVD computation.

Reviewed By: junjieqi, mnorris11

Differential Revision: D101975473

fbshipit-source-id: 57e74d8cc55d119bfee99f164caaf4d64b08a7ce
…ontrib.TestBigBatchSearch) (facebookresearch#5139)

Summary: Pull Request resolved: facebookresearch#5139

Reviewed By: junjieqi

Differential Revision: D101979704

fbshipit-source-id: b3b1575fd3431dca7d3b5e7ec86e4009810095fb
Summary:
Pull Request resolved: facebookresearch#5093

Fix remaining miscellaneous lint warnings across 10 files:
- `facebook-hte-MultTypeDeclaration`: Split mixed-type declaration in AutoTune.cpp
- `facebook-hte-IdenticalOperands`: Rename variable in build.cpp to avoid false positive
- `facebook-hte-BadImplicitCast`: Add explicit cast in Index.cpp
- `performance-inefficient-vector-operation`: Add reserve() in IndexBinaryIVF.cpp
- `performance-for-range-copy`: Use const reference in IndexBinaryHash.cpp range-for
- `facebook-hte-UnassignedReleasedUniquePointer`: Capture release() results in IVFlib.cpp, IndexPreTransform.cpp
- `facebook-hte-UnqualifiedCall-sqrt`: Use std::sqrt() in MatrixStats.cpp
- `facebook-unused-include-check`: Remove unused includes in IndexIVF.cpp, IndexNNDescent.cpp, IndexNSG.cpp
- `clang-diagnostic-switch-enum`: Add missing enum cases in IndexAdditiveQuantizer.cpp, IndexIVFAdditiveQuantizer.cpp

Reviewed By: pankajsingh88

Differential Revision: D100592786

fbshipit-source-id: 1d324e30d79e967c345737bae6991e9a443622ee
…search#5091)

Summary:
Pull Request resolved: facebookresearch#5091

Fix 172 `clang-diagnostic-shorten-64-to-32` lint warnings across 29 files by adding explicit `static_cast<int>()` or widening variable types where `size_t`/`idx_t` (64-bit) values were implicitly narrowed to `int`/`int32_t` (32-bit).

The fixes fall into two categories:
- **Explicit casts**: Where the receiving API requires `int` and the value is known to fit (e.g., vector dimensions, sub-quantizer counts, cluster counts, BLAS parameters)
- **Type widening**: Where the variable was unnecessarily narrow (e.g., `int nprobe` → `size_t nprobe`, `int list_no` → `size_t list_no`)

Reviewed By: limqiying

Differential Revision: D100588996

fbshipit-source-id: 153fe3d557a102c5adb7831915c1b3c8cecae22b
Summary:
Pull Request resolved: facebookresearch#5140

Full logs: P2284358998

## TL;DR: Fix vs. no-fix: is the improvement consistent? Yes; applying the fix is better than doing nothing.

  Net result: faster overall. 26/60 configs improve by >5%, only 4 regress by >5%.

  ┌───────────────────────────────┬─────────────┬────────────┐
  │       Worst regressions       │ ms increase │ % increase │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=100, bs=10000, np=16, k=100 │ +10.0 ms    │ +8.4%      │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=16, k=100  │ +6.3 ms     │ +7.7%      │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=64, k=100  │ +10.2 ms    │ +6.2%      │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=100, bs=1000, np=32, k=100  │ +11.8 ms    │ +5.2%      │
  └───────────────────────────────┴─────────────┴────────────┘

  ┌────────────────────────────────┬───────────┬──────────┐
  │       Best improvements        │ ms saved  │ % faster │
  ├────────────────────────────────┼───────────┼──────────┤
  │ M=50, bs=1000, np=16, k=100    │ -35.8 ms  │ -22.7%   │
  ├────────────────────────────────┼───────────┼──────────┤
  │ M=100, bs=10000, np=256, k=500 │ -359.8 ms │ -10.7%   │
  ├────────────────────────────────┼───────────┼──────────┤
  │ M=100, bs=1000, np=128, k=1000 │ -240.9 ms │ -10.4%   │
  └────────────────────────────────┴───────────┴──────────┘

  No-fix comparison: Faiss D101399711 (migration only, no fix) vs. baseline

  More regressions: 14/60 configs regress >5%, 2 exceed 10%.

  ┌───────────────────────────────┬─────────────┬────────────┐
  │       Worst regressions       │ ms increase │ % increase │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=32, k=1000 │ +110.5 ms   │ +10.6%     │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=16, k=1000 │ +68.3 ms    │ +10.1%     │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=64, k=1000 │ +147.7 ms   │ +9.2%      │
  └───────────────────────────────┴─────────────┴────────────┘

## what about variance?

 Variance summary (see mnorris11 notes, I don't agree with everything)

  The search is mostly stable, with occasional single-run outliers.

  ┌─────────────────────┬───────────────────────────────────┐
  │       Metric        │               Value               │
  ├─────────────────────┼───────────────────────────────────┤
  │ Median CV           │ 3.3%                              │
  ├─────────────────────┼───────────────────────────────────┤
  │ 90th percentile CV  │ 9.8%                              │
  ├─────────────────────┼───────────────────────────────────┤
  │ Most stable configs │ bs=1000, nprobe≥64: CV 0.3-1.7%   │
  ├─────────────────────┼───────────────────────────────────┤
  │ Noisiest configs    │ bs=10000, nprobe=16: CV up to 39% │
  └─────────────────────┴───────────────────────────────────┘

  What drives variance: absolute runtime. Sub-200ms configs (low nprobe, k=100) are noisy because GPU
  scheduling jitter is proportionally large. Configs >1 second are rock-solid (CV < 2%).

  Outlier pattern: Every high-CV config is caused by a single extreme outlier (e.g., one run at 1124ms
   when 49 others are at 120-155ms), not general measurement noise. Using median instead of mean would
   remove these. The median values in the results file should be more reliable for these fast configs.

  Are the cross-trial differences real?

  - For stable configs (bs=1000, nprobe≥64, CV < 2%): the 8-11% improvements at high nprobe are
  definitely real — intra-trial noise is well below 2%, so a 10% shift is ~5 standard errors.
  - For noisy configs (bs=10000, nprobe=16, CV 5-15%): the 5-8% regressions are borderline — they're
  statistically detectable with 50 runs, but could partly reflect system-state differences between
  trials (GPU thermals, background load) rather than code changes. **[mnorris11 note: this explanation
  sounds like BS, there was nothing else going on in the GPU...]**
  - To be fully confident about the small regressions at low nprobe, you'd want same-machine A/B
  testing (build both versions, alternate runs within one script). **[mnorris11 note: this is nonsense,
  this is what we did?]**

Reviewed By: weidbd2025

Differential Revision: D101500046

fbshipit-source-id: 9a2759a64a3ba8e398b6b36ca40e31cf6aaa5ba0
Summary:
Pull Request resolved: facebookresearch#5127

Attempt to enable FAISS dynamic dispatch (DD) mode on Windows/MSVC.

Changes:
- CMakeLists.txt: Remove if(NOT WIN32) guard from DD section, add MSVC per-file
  SIMD flags (/arch:AVX2, /arch:AVX512, /bigobj) alongside existing GCC/Clang flags
- build-pull-request.yml: Add windows-x86_64-DD-cmake job that runs immediately
  (no dependency on linux-x86_64-cmake), builds with MSVC and FAISS_OPT_LEVEL=dd,
  runs C++ tests and Python tests

This diff is expected to fail on Windows due to MSVC requiring explicit template
specialization declarations (C++ §17.8.3) which GCC/Clang don't enforce. The CI
failure will surface the exact errors to guide the fix.

UPDATE: in fact, MSVC does not seem to require this. After fixing build scripts and a few classic Windows errors, the C++ and Python tests pass.

Reviewed By: algoriddle

Differential Revision: D101649751

fbshipit-source-id: 765ef8483d02652ce58625cd22baab8870acf718
…kresearch#5143)

Summary:
Pull Request resolved: facebookresearch#5143

This diff fixes four bugs in `IndexRefine.cpp` and adds a regression test.

**Bug 1 (critical): `sa_decode` reads wrong bytes**

`IndexRefine::sa_encode` writes each vector's codes as `[base_codes (cs1 bytes) | refine_codes (cs2 bytes)]` (lines 197-199). The encode writes base codes at offset 0 and refine codes at offset `cs1`:

```
memcpy(b, tmp1.get() + cs1 * i, cs1);        // base at b+0
memcpy(b + cs1, tmp2.get() + cs2 * i, cs2);  // refine at b+cs1
```

`sa_decode` must extract the refine portion to pass to `refine_index->sa_decode`. The old code read from `bytes + i * (cs1 + cs2)` (offset 0), which extracted the base codes instead of the refine codes. The fix adds `+ cs1` to skip past the base codes:

```
// Old (wrong): memcpy(..., bytes + i * (cs1 + cs2),       cs2);
// New (fixed): memcpy(..., bytes + i * (cs1 + cs2) + cs1, cs2);
```

This mirrors the write offset in `sa_encode` line 199. Without this fix, `sa_decode` silently produces wrong reconstructions by feeding base-index codes to the refine decoder.
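
A hedged sketch of the corrected extraction, with names (`cs1`, `cs2`) following this description rather than the exact source:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

using idx_t = int64_t;

// Copy the refine portion of each packed [base | refine] code into a
// contiguous buffer before handing it to the refine decoder.
void extract_refine_codes(
        idx_t n,
        size_t cs1,
        size_t cs2,
        const uint8_t* bytes,
        uint8_t* refine_codes) {
    for (idx_t i = 0; i < n; i++) {
        std::memcpy(refine_codes + i * cs2,
                    bytes + i * (cs1 + cs2) + cs1, // + cs1 skips base codes
                    cs2);
    }
}
```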

**Bug 2: `int` loop variable causes overflow with large inputs**

Three validation loops used `int i` as the loop counter:
```
for (int i = 0; i < n * k_base; i++)
```

Both `n` and `k_base` are `idx_t` (int64_t), so `n * k_base` can exceed `INT_MAX` (2^31 - 1). When the `int` counter reaches `INT_MAX`, incrementing it is signed integer overflow (undefined behavior). In practice this causes an infinite loop or out-of-bounds access. Changed to `idx_t i` in all three search methods: `IndexRefine::search`, `IndexRefineFlat::search`, and `IndexRefinePanorama::search`.

**Bug 3: Wrong class name in error message**

`IndexRefinePanorama::search` had the error message `"IndexRefineFlat params have incorrect type"` -- a copy-paste error from `IndexRefineFlat::search`. Fixed to `"IndexRefinePanorama params have incorrect type"`.

**Bug 4: Missing overflow guard on `n * k_base` allocation**

The product `n * k_base` is used in `new idx_t[n * k_base]` before the loop. If the product overflows int64_t, it could wrap to a small positive value, causing a too-small allocation followed by out-of-bounds writes from `base_index->search`. Added `FAISS_THROW_IF_NOT_MSG(n <= INT64_MAX / k_base, ...)` before the allocation in all three search methods. Division by zero is impossible because `k_base >= k > 0` is checked earlier.
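
A minimal sketch of the guard, assuming the names from this description; `k_base >= k > 0` is validated earlier, so the division is safe:

```cpp
#include <faiss/MetricType.h>
#include <faiss/impl/FaissAssert.h>

#include <limits>
#include <memory>

void alloc_base_results(faiss::idx_t n, faiss::idx_t k_base) {
    // Reject n before the product n * k_base can wrap int64_t.
    FAISS_THROW_IF_NOT_MSG(
            n <= std::numeric_limits<faiss::idx_t>::max() / k_base,
            "n * k_base would overflow int64_t");
    std::unique_ptr<faiss::idx_t[]> base_labels(
            new faiss::idx_t[n * k_base]);
    // ... base_index->search(...) fills base_labels ...
}
```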

**Cleanup: Redundant `sa_code_size()` call**

The allocation in `sa_decode` called `refine_index->sa_code_size()` a second time instead of using the already-computed `cs2`. Replaced with `cs2`.

Reviewed By: junjieqi

Differential Revision: D101903134

fbshipit-source-id: 37848d280d50447216ec3c76c598c3e212ea0971
- Fix double negatives: use FAISS_THROW_IF_MSG(is_static, ...) instead of
  dynamic_impl()
- Make LVQ default storage consistent: both default and parameterized
  constructors now use SVS_LVQ4x0 for IndexSVSIVFLVQ and IndexSVSVamanaLVQ
- Document intra_query_threads limitation: must be set before train() or
  deserialize_impl(); runtime changes not yet supported by SVS runtime API
- Fix is_lvq_leanvec_enabled() to check both IVFIndex and DynamicIVFIndex
  storage kind availability
Adds IVFIntraQueryThreadsSetBeforeTrain test that demonstrates the
supported usage pattern: set intra_query_threads before train(), not
after. Changes after index creation are silently ignored due to a
current SVS runtime API limitation.
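
A hedged usage sketch of that pattern; the constructor arguments and header path are illustrative assumptions, and only the ordering constraint is the point:

```cpp
#include <faiss/svs/IndexSVSIVFLVQ.h> // assumed header path

#include <cstddef>

void build(size_t n, const float* xb) {
    faiss::IndexSVSIVFLVQ index(/*d=*/128, /*nlist=*/1024); // assumed ctor
    index.intra_query_threads = 4; // honored: set before train()
    index.train(n, xb);
    index.add(n, xb);
    index.intra_query_threads = 8; // silently ignored after creation
}
```
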
Document that IVF search-time ID filtering is a pending item in the
SVS runtime — IVFIndex::search() does not yet accept an IDFilter
parameter (unlike VamanaIndex::search()). Once exposed, it can be
wired up using the same make_faiss_id_filter() pattern as Vamana.
ibhati pushed a commit that referenced this pull request May 7, 2026
…ult handlers (facebookresearch#5185)

Summary:
Pull Request resolved: facebookresearch#5185

Three sequential post-BLAS / end_multiple loops in faiss were leaving OMP threads idle while a single thread did all the work. Each is parallelized with `#pragma omp parallel for schedule(static)`, gated by an `if (...)` clause to avoid spawn-cost regressions on small workloads.

**Changes**

1. `exhaustive_L2sqr_blas_cmax` (AVX2 + ARM SVE): after `sgemm_` completes, the per-query result accumulation loop ran single-threaded while all OMP threads were idle. Each query `i` reads a distinct row of `ip_block` and writes to `dis_tab[i]/ids_tab[i]` — no cross-query dependencies. Added `#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)` to both ISA specializations.

2. `HeapBlockResultHandler::end_multiple`: `heap_reorder` is O(k log k) per query and was sequential. The original author left a `// maybe parallel for` comment. `add_results` in the same class already has `#pragma omp parallel for`; `end_multiple` was the only remaining sequential step. Gate: `if ((i1 - i0) * k >= 1024)`.

3. `ReservoirBlockResultHandler::end_multiple`: same pattern — reservoir `to_result` (partial sort, O(capacity)) was sequential despite `add_results` being parallelized. `// maybe parallel for` comment removed and replaced with the actual pragma. Gate: `if ((i1 - i0) * this->k >= 1024)`.

The `if (...)` thresholds were chosen from microbenchmark data: below the threshold, OMP fanout cost exceeds the work, producing 3-6× regressions on small batches. Above the threshold, parallelization yields 9-14× speedups at 16 threads. Data independence verified for all three: each loop iteration operates on a disjoint slice of `dis_tab`/`ids_tab` indexed by query `i`.
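
A minimal sketch of the gating pattern from change 2, assuming the flat `heap_dis_tab`/`heap_ids_tab` layout; the wrapper function is hypothetical, but `faiss::heap_reorder` is the real helper:

```cpp
#include <faiss/utils/Heap.h>

#include <cstddef>
#include <cstdint>

// The if() clause keeps the loop serial below the threshold, so small
// batches pay no thread-fanout cost; each iteration touches only its own
// k-sized slice, so iterations are independent.
template <class C>
void reorder_range(
        int64_t i0,
        int64_t i1,
        size_t k,
        typename C::T* heap_dis_tab,
        typename C::TI* heap_ids_tab) {
#pragma omp parallel for schedule(static) if ((i1 - i0) * k >= 1024)
    for (int64_t i = i0; i < i1; i++) {
        faiss::heap_reorder<C>(k, heap_dis_tab + i * k, heap_ids_tab + i * k);
    }
}
```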

**Benchmark results**

A local microbench (not landed) was used for A/B measurement. Host: Intel Sapphire Rapids, 28 physical cores, AVX-512. Pinned with `taskset -c 0-15` (OMP=16) and `taskset -c 0` (OMP=1). Median of 5 reps. Synthetic uniform-random distance distributions.

`HeapBlockResultHandler::end_multiple` (us, lower better):

| nq    | k    | parent t=1 | this t=1 | parent t=16 | this t=16 | speedup t=16  |
|------:|-----:|-----------:|---------:|------------:|----------:|--------------:|
| 64    | 10   | 9.2        | 7.2      | 8.1         | 8.3       | 0.98× (gated) |
| 64    | 100  | 340        | 345      | 318         | 67        | 4.79×         |
| 64    | 1000 | 5,796      | 5,700    | 5,886       | 501       | 11.76×        |
| 512   | 100  | 2,811      | 2,769    | 2,677       | 312       | 8.59×         |
| 512   | 1000 | 46,109     | 46,070   | 45,758      | 3,778     | 12.11×        |
| 4096  | 100  | 22,041     | 21,588   | 21,672      | 1,869     | 11.60×        |
| 4096  | 1000 | 369,069    | 376,541  | 372,481     | 25,442    | 14.64×        |

`ReservoirBlockResultHandler::end_multiple` (us):

| nq    | k    | parent t=16 | this t=16 | speedup       |
|------:|-----:|------------:|----------:|--------------:|
| 64    | 10   | 18.0        | 18.1      | 0.99× (gated) |
| 64    | 100  | 659         | 96        | 6.86×         |
| 64    | 1000 | 7,592       | 553       | 13.73×        |
| 512   | 100  | 5,498       | 490       | 11.21×        |
| 512   | 1000 | 59,548      | 4,677     | 12.73×        |
| 4096  | 100  | 44,064      | 3,230     | 13.64×        |
| 4096  | 1000 | 476,388     | 32,237    | 14.78×        |

`IndexFlatL2::search` end-to-end — drives `exhaustive_L2sqr_blas_cmax` (ms):

| nb    | nq    | k   | parent t=16 | this t=16 | speedup |
|------:|------:|----:|------------:|----------:|--------:|
| 1024  | 1024  | 10  | 1.71        | 1.45      | 1.18×   |
| 1024  | 4096  | 100 | 58.5        | 15.5      | 3.78×   |
| 4096  | 4096  | 100 | 76.9        | 39.4      | 1.95×   |

Single-threaded paths (OMP=1) are within ±5% of parent across all configurations — the `if (...)` clause makes the pragma a no-op below the threshold, eliminating overhead for serial callers.

Caveats: the `IndexFlatL2::search` numbers measure the full search path, so the speedup attributed to change #1 also includes contributions from change #2 (heap handler, also called by this path). The `end_multiple` numbers isolate the changed function via `PauseTiming`/`ResumeTiming` around setup. ARM SVE not measured directly (no Graviton host); the AVX2 numbers are the strongest available proxy.

Reviewed By: mnorris11

Differential Revision: D103830810

fbshipit-source-id: 8434fa6f16b78c5ff7b2244ac5d5fe9cc8c012a5