Enable options for leanvec/LVQ data to reside in SSD or have primary only leanvec data #1

Draft

ibhati wants to merge 91 commits into ib/svs_ivf from ib/svs_ssd

Conversation

ibhati (Owner) commented Apr 15, 2026

This PR adds primary_only support to LeanVec indexes in FAISS, allowing users to trade recall for significantly reduced memory usage. It also introduces IndexSVSVamanaSSD for SSD-backed Vamana search.

Changes
Core (faiss/svs/)

IndexSVSVamanaLeanVec: Add primary_only constructor parameter. When enabled, only reduced-dimension primary vectors are stored/used — no full-dimension secondary data for reranking. Serialization via index_write.cpp/index_read.cpp preserves the flag.
IndexSVSVamanaSSD (new): SSD-backed static Vamana index with configurable data placement (RAM/SSD for primary and secondary data). Supports LeanVec and LVQ compression, custom search parameters, and primary_only mode.
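
A hypothetical usage sketch of the new `primary_only` option (the header path follows the `faiss/svs/` layout above, but the constructor's exact argument list is assumed for illustration, not taken from the actual headers):

```cpp
#include <faiss/svs/IndexSVSVamanaLeanVec.h> // path assumed from "Core (faiss/svs/)"

// primary_only = true keeps only the reduced-dimension primary vectors:
// lower memory usage, no full-dimension secondary data for reranking.
faiss::IndexSVSVamanaLeanVec index(
        /*d=*/768,
        /*degree=*/64,
        faiss::METRIC_L2,
        /*leanvec_d=*/192,
        /*primary_only=*/true); // hypothetical argument order
```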

alibeklfc and others added 30 commits April 6, 2026 13:29
…form assertions (facebookresearch#5047)

Summary:
Pull Request resolved: facebookresearch#5047

## Summary

This diff fixes a bug and improves error message quality in `VectorTransform.cpp`.

### Bug Fix (line 153)
`VectorTransform::check_identical()` had a copy-paste bug where `d_in` was checked twice and `d_out` was never checked:
```cpp
// Before (buggy):
FAISS_THROW_IF_NOT(other.d_in == d_in && other.d_in == d_in);
// After (fixed):
FAISS_THROW_IF_NOT_MSG(
        other.d_in == d_in && other.d_out == d_out,
        "input and output dimensions must match");
```
This meant two VectorTransforms with matching `d_in` but different `d_out` would incorrectly pass the identity check. This could lead to subtle bugs when comparing or serializing transform chains (e.g., in `IndexPreTransform`).

### Error Message Improvements
All 28 bare `FAISS_THROW_IF_NOT()` calls in `VectorTransform.cpp` have been converted to `FAISS_THROW_IF_NOT_MSG()` with clear, actionable error messages. Previously, assertion failures would only show the raw C++ condition (e.g., `"Error: 'p > 0' failed"`), which is unhelpful for users. Now each assertion provides semantic context:

- **Dynamic cast failures**: `"failed to cast to HadamardRotation"` instead of `"hr"`
- **Dimension mismatches**: `"input and output dimensions must match when PCA is disabled"` instead of `"din == dout"`
- **Training state errors**: `"CenteringTransform has not been trained"` instead of `"is_trained"`
- **LAPACK errors**: `"LAPACK dgesvd workspace query failed"` instead of `"info == 0"`
- **Parameter validation**: `"map entries must be -1 (unused) or valid input dimension indices"` instead of raw condition

### Affected classes
- `VectorTransform` (base class)
- `LinearTransform`
- `HadamardRotation`
- `PCAMatrix`
- `ITQMatrix`
- `ITQTransform`
- `OPQMatrix`
- `NormalizationTransform`
- `CenteringTransform`
- `RemapDimensionsTransform`

### Design decisions
- Used `FAISS_THROW_IF_NOT_MSG` (not `FAISS_THROW_IF_NOT_FMT`) since all messages are static strings — no runtime formatting needed, keeping zero overhead.
- Error messages follow existing Faiss patterns seen in `index_read.cpp` and other files.
- Each message describes the semantic meaning of the condition, not just the code.

Reviewed By: mnorris11

Differential Revision: D99674067

fbshipit-source-id: cf0fe9a8a7f047013011683d76221682d97beb6c
…arch#4996)

Summary: Pull Request resolved: facebookresearch#4996

Reviewed By: alibeklfc

Differential Revision: D99569811

Pulled By: mnorris11

fbshipit-source-id: 127c6b6b771b81b1f11b0f28dc4936959fafac09
…ookresearch#5034)

Summary:
In GCC, `-mtune=sapphirerapids` sets prefer-vector-width to 256 via `X86_TUNE_AVX256_OPTIMAL`. In LLVM, the same default is set via the prefer-256-bit subtarget feature. This was originally added to avoid AVX-512 frequency throttling on Skylake-SP, but the penalty is negligible since Sapphire Rapids. Switching to explicit ISA flags allows the auto-vectorizer to use zmm registers. I don't see any performance regression. Below is an example.

```cpp
// bench_ip.cpp

#include <benchmark/benchmark.h>
#include <vector>
#include <random>
#include <thread>
#include <numeric>
#include <cstdlib>

_Pragma("GCC push_options") \
_Pragma("GCC optimize (\"unroll-loops,associative-math,no-signed-zeros\")")
static float inner_product(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
_Pragma("GCC pop_options")

static void BM_InnerProduct(benchmark::State &state) {
    const int n = state.range(0);

    std::mt19937 rng(42 + state.thread_index());
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    std::vector<float> a(n), b(n);
    for (int i = 0; i < n; i++) {
        a[i] = dist(rng);
        b[i] = dist(rng);
    }

    for (auto _ : state) {
        float result = inner_product(a.data(), b.data(), n);
        benchmark::DoNotOptimize(result);
    }

    state.SetItemsProcessed(state.iterations() * n);
    state.SetBytesProcessed(state.iterations() * n * 2 * sizeof(float));
}

int main(int argc, char **argv) {
    int num_threads = 1;

    // Parse --threads=N before passing to benchmark
    for (int i = 1; i < argc; i++) {
        if (std::string(argv[i]).rfind("--threads=", 0) == 0) {
            num_threads = std::atoi(argv[i] + 10);
            // Remove from argv so benchmark doesn't choke
            for (int j = i; j < argc - 1; j++)
                argv[j] = argv[j + 1];
            argc--;
            i--;
        }
    }

    std::vector<int64_t> sizes = {384, 768, 1536};
    for (auto sz : sizes) {
        benchmark::RegisterBenchmark("BM_InnerProduct", BM_InnerProduct)
            ->Arg(sz)
            ->Threads(num_threads)
            ->UseRealTime();
    }

    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    return 0;
}
```

**Current**:
`g++ -O3 -march=sapphirerapids -mtune=sapphirerapids bench_ip.cpp -lbenchmark -lpthread`

```text
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_InnerProduct/384/real_time/threads:1        20.7 ns         20.6 ns     33860063 bytes_per_second=138.529Gi/s items_per_second=18.5931G/s
BM_InnerProduct/768/real_time/threads:1        43.0 ns         43.0 ns     16293169 bytes_per_second=133.043Gi/s items_per_second=17.8567G/s
BM_InnerProduct/1536/real_time/threads:1       86.5 ns         86.3 ns      7699745 bytes_per_second=132.321Gi/s items_per_second=17.7598G/s

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
BM_InnerProduct/384/real_time/threads:64        31.2 ns         31.1 ns     22750464 bytes_per_second=91.7077Gi/s items_per_second=12.3088G/s
BM_InnerProduct/768/real_time/threads:64        59.3 ns         59.2 ns     10872768 bytes_per_second=96.4611Gi/s items_per_second=12.9468G/s
BM_InnerProduct/1536/real_time/threads:64        130 ns          130 ns      5561152 bytes_per_second=87.9984Gi/s items_per_second=11.8109G/s
```

**This PR**: `g++ -O3 -mavx512f bench_ip.cpp -lbenchmark -lpthread`

```text
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_InnerProduct/384/real_time/threads:1        17.5 ns         17.5 ns     40056065 bytes_per_second=163.685Gi/s items_per_second=21.9695G/s
BM_InnerProduct/768/real_time/threads:1        34.2 ns         34.1 ns     20446203 bytes_per_second=167.326Gi/s items_per_second=22.4582G/s
BM_InnerProduct/1536/real_time/threads:1       72.4 ns         72.3 ns      9451952 bytes_per_second=158.094Gi/s items_per_second=21.219G/s

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
BM_InnerProduct/384/real_time/threads:64        23.4 ns         23.4 ns     27661760 bytes_per_second=122.456Gi/s items_per_second=16.4358G/s
BM_InnerProduct/768/real_time/threads:64        49.1 ns         48.9 ns     13923776 bytes_per_second=116.614Gi/s items_per_second=15.6517G/s
BM_InnerProduct/1536/real_time/threads:64        105 ns          105 ns      6088320 bytes_per_second=108.736Gi/s items_per_second=14.5943G/s
```

Pull Request resolved: facebookresearch#5034

Test Plan: Verified flag consistency across all three CMake files. Added missing -mavx512vpopcntdq required by hamming_distance/avx512-inl.h and rabitq_avx512.cpp.

Reviewed By: mnorris11

Differential Revision: D99687322

Pulled By: alibeklfc

fbshipit-source-id: 1a27191149f9d0ff9dc392183bbd3c97c9915aa3
…ch#5044)

Summary:
- Fix duplicate word "the the" to "the" in `faiss/utils/quantize_lut.h` (comment) and `benchs/README.md`
- Fix duplicate word "to to" to "to" in `faiss/IndexBinaryHNSW.cpp` (comment)
- Fix subject-verb agreement "This produce" to "This produces" in `INSTALL.md`
- Fix broken grammar "it does not the case" to "it is not the case" in `tests/test_residual_quantizer.py`

Pull Request resolved: facebookresearch#5044

Test Plan:
- [ ] No functional code changes; only comments, documentation, and test comments are modified
- [ ] Verified each fix is a clear typo/grammar correction

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed By: mnorris11

Differential Revision: D99692631

Pulled By: alibeklfc

fbshipit-source-id: e3ae88ba4ca732e9f620e1c205f51e4b827d0730
Summary:
- Replace two `http://github.com` links with `https://github.com` in the README
  - Wiki page link (line 38)
  - Issues page link (line 85)

Pull Request resolved: facebookresearch#5043

Test Plan:
- [x] Both links resolve correctly with HTTPS
- [x] No other changes

Reviewed By: mnorris11

Differential Revision: D99690282

Pulled By: alibeklfc

fbshipit-source-id: 42e94ac7e45e457b1d6b5b511c4092990c696c54
Summary:
- Update C++ language level from C++17 to C++20 in `CONTRIBUTING.md` to match the actual CMake configuration (`CMAKE_CXX_STANDARD 20` in the root `CMakeLists.txt`)
- Remove outdated "progressively dropping python2 support" note from `contrib/README.md` (Python 2 reached end-of-life in January 2020 and Faiss requires Python 3)
- Update shebangs from `python2` to `python3` in three benchmark scripts: `benchs/kmeans_mnist.py`, `benchs/bench_gpu_1bn.py`, and `benchs/bench_vector_ops.py`

Pull Request resolved: facebookresearch#5045

Test Plan:
- [ ] No functional code changes; only documentation and shebangs are modified
- [ ] Verified C++20 is the actual standard by checking `CMakeLists.txt` (`set(CMAKE_CXX_STANDARD 20)`) and `INSTALL.md` (which already references "a C++20 compiler")

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed By: mnorris11

Differential Revision: D99691731

Pulled By: alibeklfc

fbshipit-source-id: 579cf21054cdf6bdaea27f5abb1c56e0b709a922
Summary:
Pull Request resolved: facebookresearch#5025

In C++, when running in dev mode:
```
  Crash chain:
  1. index->add(1, vec) → IndexIVF::add_with_ids() (line 190)
  2. quantizer->assign() → Index::assign() → IndexHNSW::search() — searches the HNSW coarse quantizer with the NaN
  vector
  3. HNSW::search() computes d_nearest = qdis(entry_point) — this returns NaN because the input vector has NaN
  4. NaN is pushed into MinimaxHeap candidates. All comparisons with NaN return false, corrupting heap ordering
  5. search_from_candidates() pops a garbage node ID v0 from the corrupted heap
  6. neighbor_range(v0, ...) hits FAISS_CHECK_RANGE_DEBUG — v0 is out of bounds → crash
```

In the Python test or opt mode:
```
The test timed out — it hung instead of crashing. This is consistent with the analysis: in the Python bindings (which
  run with mode/opt, so no debug assertions), pop_min returns -1, neighbor_range(-1, ...) doesn't crash on the
  assertion (it's a debug-only check), and instead accesses offsets[-1] which is undefined behavior. The NaN corrupts
  the heap, the search loops forever because nvalid never decrements when pop_min returns -1.

  So in Python/opt mode: no stack trace, just a hang. The C++ debug build crashes with the assertion. Both are fixed by
  the same push() NaN→+inf change in HNSW.cpp.
```
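
A minimal sketch of the fix's core idea (the helper name is illustrative; the real change lives in `push()` in HNSW.cpp):

```cpp
#include <cmath>
#include <limits>

// NaN compares false against everything, which silently corrupts heap
// ordering; mapping it to +inf restores a total order and makes the value
// the worst possible candidate under the "smaller = better" convention.
inline float sanitize_distance(float d) {
    return std::isnan(d) ? std::numeric_limits<float>::infinity() : d;
}
```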

Performance impact?
--
The check is in the tightest loop in search_from_candidates. Claude seems to think it is fine:
```
  In search_from_candidates (line 707), push() is called for every neighbor visited during greedy graph traversal — batched 4 at a time via
  add_to_heap. This is indeed the tightest inner loop in HNSW search.

  What std::isnan actually costs

  std::isnan(float) on x86_64 compiles to essentially v != v — a single ucomiss + conditional jump (jp). That's 1-2 instructions.

  Context: what else happens per push call

  Each push call is preceded by significantly more expensive work:
  1. Distance computation (distances_batch_4 at line 718): ~2×d FLOPs per vector (e.g., ~256 FLOPs for d=128)  NOTE: HNSW.cpp:718-726 (qdis.distances_batch_4(...))
  2. Existing comparisons in push: k == n, v >= dis[0], potential heap_pop/heap_push which are O(log efSearch) with memory accesses    NOTE: HNSW.cpp:1401  (k == n), 1402 (v >= dis[0]), 1408 (heap_pop), 1410 (heap_push)
  3. Visited table check (vt.set) with prefetching   NOTE:  HNSW.cpp:689 (vt.prefetch(v1)) and 714 (vt.set(v1))

  The isnan check is ~1 instruction compared to hundreds for the distance computation. Even against just the heap operations (several
  comparisons + pointer chasing), it's a rounding error.

  Branch prediction

  Since NaN distances essentially never occur in normal operation, this branch is always not-taken. The branch predictor will learn this
  within the first few iterations and predict it perfectly for the remainder of the search. A perfectly-predicted not-taken branch is ~0
  cycles on modern CPUs.
```

And mini benchmark:
```
  ┌────────┬────────────────────────┬────────────────────────┬───────────────────┐
  │        │ With NaN check (run 1) │ With NaN check (run 2) │ Without NaN check │
  ├────────┼────────────────────────┼────────────────────────┼───────────────────┤
  │ Mean   │ 401.146 ms             │ 400.515 ms             │ 401.191 ms        │
  ├────────┼────────────────────────┼────────────────────────┼───────────────────┤
  │ Median │ 396.746 ms             │ 398.172 ms             │ 398.520 ms        │
  ├────────┼────────────────────────┼────────────────────────┼───────────────────┤
  │ Min    │ 389.939 ms             │ 388.670 ms             │ 391.850 ms        │
  ├────────┼────────────────────────┼────────────────────────┼───────────────────┤
  │ Stddev │ 18.903 ms              │ 7.833 ms               │ 8.164 ms          │
  └────────┴────────────────────────┴────────────────────────┴───────────────────┘
```

I wanted to make sure it worked regardless of metric type:
```
  For inner product (and cosine, which is IP after normalization), HNSW wraps the distance computer in NegativeDistanceComputer, which
  negates the result. This means all metrics flow through the MinimaxHeap with the same convention: smaller = better.

  So the NaN → +inf replacement works correctly for all metrics:

  ┌────────┬──────────────────────────────┬──────┬───────┬─────────────────┐
  │ Metric │        qdis() returns        │ Best │ Worst │   +inf means    │
  ├────────┼──────────────────────────────┼──────┼───────┼─────────────────┤
  │ L2     │ ‖x-q‖²                       │ 0    │ +inf  │ worst (correct) │
  ├────────┼──────────────────────────────┼──────┼───────┼─────────────────┤
  │ IP     │ -<x,q> (negated)             │ -∞   │ +inf  │ worst (correct) │
  ├────────┼──────────────────────────────┼──────┼───────┼─────────────────┤
  │ Cosine │ -<x̂,q̂> (negated, normalized) │ -1   │ +inf  │ worst (correct) │
  └────────┴──────────────────────────────┴──────┴───────┴─────────────────┘

  In all cases, +inf sits at the top of the CMax heap and gets evicted first when the heap is full (v >= dis[0] returns early at line 1402).
   And pop_min() will never select it over any finite-distance candidate. The semantics are correct regardless of metric type.
```

Reviewed By: mdouze

Differential Revision: D99036639

fbshipit-source-id: e5d6392e800f243f66ce283c8cd35fe0e7558229
Summary:
Pull Request resolved: facebookresearch#4854

This re-enables the AMD ROCm runner that was previously disabled in D86250489.

Changes from the original configuration:
- Updated runner from `faiss-amd-MI200` to `linux-amd-rocm-mi325-ubuntu-24` to match the currently available GitHub Actions runner
- Updated container image from Ubuntu 22.04 to Ubuntu 24.04 to align with the runner environment

Test change:
- Seems like CUDA and HIP disagree about some small rounding errors, so AI updated the test.

Reviewed By: subhadeepkaran

Differential Revision: D94941142

fbshipit-source-id: d5158b7939e3b7327432aa89a9a0d2e5ed1ad190
…decode_impl (facebookresearch#5051)

Summary:
Pull Request resolved: facebookresearch#5051

In `sa_decode_impl<StorageMinMaxT>()`, a local variable `std::vector<StorageMinMaxFP16> minmax(...)` was:

1. Using the wrong type: It hardcoded `StorageMinMaxFP16` instead of the template parameter `StorageMinMaxT`. When this function is instantiated with `StorageMinMaxFP32` (via `IndexRowwiseMinMax::sa_decode`), this would create a vector of the wrong type.

2. Dead code: The vector was allocated but never actually used in the function body. The decoding logic reads `StorageMinMaxT` values directly from the input byte buffer, making this allocation unnecessary.

This change removes the unused variable, eliminating both the type mismatch and the unnecessary memory allocation. The allocation was O(min(chunk_size, n)) per decode call, so removing it also provides a minor performance benefit.

Note: The corresponding `sa_encode_impl` function does NOT have this issue (it correctly uses a local `minmax` that IS used), and the `train_inplace_impl` / `train_impl` functions also correctly use their `minmax` vectors. Only `sa_decode_impl` had this issue.

Reviewed By: mnorris11

Differential Revision: D99851973

fbshipit-source-id: a288a4cd355ccc7d9e13d1f7d61bc54fc524675c
…cebookresearch#5052)

Summary:
Pull Request resolved: facebookresearch#5052

Multiple core Faiss C++ source files use bare `assert()` for runtime invariant checks. Since `assert()` is compiled out in release builds (when `NDEBUG` is defined), these checks silently disappear in production, potentially masking bugs and data corruption.

This diff replaces bare `assert()` calls with Faiss's own `FAISS_THROW_IF_NOT` / `FAISS_THROW_IF_NOT_MSG` macros in 11 core index files. These macros throw `FaissException` with descriptive error messages and remain active in all build modes (debug and release).
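
The before/after pattern, sketched (the condition and message are illustrative, not taken from a specific file):

```cpp
// Before: compiled out when NDEBUG is defined, so release builds skip the check
assert(nlist > 0);

// After: throws faiss::FaissException in all build modes, debug and release
FAISS_THROW_IF_NOT_MSG(nlist > 0, "nlist must be positive");
```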

Files modified:
- AutoTune.cpp
- Clustering.cpp
- IVFlib.cpp
- IndexBinaryIVF.cpp
- IndexIVFFlat.cpp
- IndexIVFPQ.cpp
- IndexIVFPQFastScan.cpp
- IndexIVFPQR.cpp
- IndexLSH.cpp
- IndexPQ.cpp
- IndexRefine.cpp

Impact:
- Prevents silent failures in release builds
- Provides actionable error messages for debugging
- Aligns with Faiss coding conventions (most of the codebase already uses FAISS_THROW_IF_NOT)

Reviewed By: mnorris11

Differential Revision: D99857988

fbshipit-source-id: 89b01e022958495b0883c5faebc82bfe9b17da18
Summary:
Pull Request resolved: facebookresearch#5050

- Implement balanced assignment in clustering.py based on notebook N10159950
- Add a test that shows we improve the imbalance at some cost in MSE

Reviewed By: algoriddle

Differential Revision: D99819394

fbshipit-source-id: 568b6deb7d2b95b8228dbb276c5578df23b01a96
Summary: Pull Request resolved: facebookresearch#5059

Reviewed By: DenisYaroshevskiy

Differential Revision: D99854747

fbshipit-source-id: 8bca36ec90475771ef17356d9a16b0d680a6296b
Summary:
Pull Request resolved: facebookresearch#5060

Five small fixes for Dynamic Dispatch (DD) mode issues found during the DD migration audit.

**Preprocessor guard fixes (3):**

These files use `#ifdef __AVX2__` (or `__AVX__`) in common translation units. In DD mode, common TUs are compiled without `-mavx2`, so `__AVX2__` is never defined and the guarded code is silently disabled. The DD-era equivalent is `COMPILE_SIMD_AVX2`, which is defined target-wide for all TUs in DD mode (and also in static AVX2 builds).

- `ProductQuantizer.cpp`: The dsub=2 fast path for `compute_distance_tables` and `compute_inner_prod_tables` was gated on `__AVX2__ || __aarch64__`. Changed to `COMPILE_SIMD_AVX2 || COMPILE_SIMD_ARM_NEON`. Without this, PQ distance table computation for dsub=2 falls back to the generic path in DD mode.

- `LocalSearchQuantizer.cpp`: The prefetch include and usage were gated on `__AVX2__`. Changed to `COMPILE_SIMD_AVX2`. Without this, LSQ's ICM encoding loop loses prefetch hints in DD mode.

- `prefetch.h`: The x86 prefetch path (`_mm_prefetch` via `<xmmintrin.h>`) was gated on `__AVX__`. This is an SSE intrinsic available on all x86_64 — the correct guard is `__x86_64__ || _M_X64`. The `__AVX__` guard was too restrictive even outside DD mode (it excluded SSE-only x86 builds, though those are rare in practice).
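
The guard change, sketched (simplified from the bullets above):

```cpp
// Before: dead in DD mode, where common TUs are compiled without -mavx2
#ifdef __AVX2__
// ... AVX2 fast path ...
#endif

// After: COMPILE_SIMD_AVX2 is defined target-wide in DD mode and in
// static AVX2 builds, so the fast path stays live in both
#if defined(COMPILE_SIMD_AVX2) || defined(COMPILE_SIMD_ARM_NEON)
// ... fast path (simdlib wrappers work on both x86 and ARM) ...
#endif
```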

**CMake build fixes (2):**

- CMake DD mode (`FAISS_OPT_LEVEL=dd`) was missing `COMPILE_SIMD_AVX512_SPR` from the x86 compile definitions. Buck defines it via `arch_specific_compiler_flags`, but CMake only had `COMPILE_SIMD_AVX2 COMPILE_SIMD_AVX512`. Any code guarded on `COMPILE_SIMD_AVX512_SPR` (AMX-based fast scan kernels) was dead in CMake DD builds.

- `distances_dispatch.h` was listed in `FAISS_HEADERS` (the CMake install list), but it's a private DD-internal header not meant for downstream consumers. Removed from the install list. (It remains in Buck's `header_files()` since Buck uses that list for compilation visibility, not just install.)

Reviewed By: mdouze

Differential Revision: D99966090

fbshipit-source-id: 05f0d5ee6353f850671f8be3932eb18e84cf8f92
Summary:
Pull Request resolved: facebookresearch#5061

Roll out the `for_all_simd_levels` decorator to 62 test classes across
11 test files. This ensures that every available SIMD level (NONE, AVX2,
AVX512, etc.) is exercised in CI, rather than only testing at the
auto-detected (highest) level.

Previously only `TestExtraDistances` in `test_extra_distances.py` used
the decorator. Now all test files covering DD-dispatched code paths are
parameterized: distances, PQ, SQ, fast scan, RaBitQ, partitioning,
HNSW, and binary indices.

Changes:
- Add `for_all_simd_levels` to 62 test classes across 11 test files
- BUCK: move decorated test targets to `supports_static_listing = False`
  (required because the decorator replaces class names with None,
  breaking TPX's static test enumeration)
- test_fast_scan.py, test_fast_scan_ivf.py: apply decorator manually
  after dynamic `setattr` method generation for TestAQFastScan and
  TestIVFAQFastScan (the `setattr` loops reference the class by name)
- test_rabitq_fastscan.py: extract `_create_fastscan_index` as
  module-level helper to fix cross-class method reference that broke
  when TestRaBitQFastScan was replaced by the decorator
- IndexFastScan.cpp: fix `search_implem_14` to auto-cap `qbs` based
  on `bbs`. The accumulate loop dispatch table only instantiates
  certain (nq, BB) pairs (e.g. BB=2 only has nq=1,2). Previously,
  using bbs=64 with the default qbs (batch of 4) would crash with
  "nq=3 bbs=64 not instantiated". Now the query batch size is
  automatically capped to the max nq supported for the given BB.
  Exposed by per-level testing of test_factory_with_batch_size.

Reviewed By: mdouze

Differential Revision: D99978401

fbshipit-source-id: 54858f3810bab91ac5ec8bf0ce0d55a77710727b
Summary:
Pull Request resolved: facebookresearch#4842

Some tools which depend on FAISS recently became much slower because they were accidentally changed to depend on `faiss:faiss_no_multithreading` instead of `faiss:faiss`.

This adds `faiss::has_omp()`, which returns true if a `#pragma omp parallel` region had any effect through the use of a `reduction(max)` which would otherwise be stripped out.

Note:
1. Compile-time check is not sufficient, as the `faiss_no_multithreading` and/or `faiss_omp_mock` targets control whether the `faiss/*.cpp` implementations have effective `#pragma omp` blocks.
2. Depending on the BUCK build mode, a `cpp_binary` which depends on `faiss:faiss_no_multithreading` and `faiss:faiss` *may or may not* link to implementations with OpenMP support.
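
A minimal sketch of the detection idea (illustrative, not the verbatim implementation; on a single-core run the max can legitimately stay 0):

```cpp
#include <omp.h>

// If the parallel pragma was stripped (or OpenMP is mocked out), only one
// thread runs and the reduction never observes a nonzero thread id.
bool has_omp_sketch() {
    int max_tid = 0;
#pragma omp parallel reduction(max : max_tid)
    { max_tid = omp_get_thread_num(); }
    return max_tid > 0;
}
```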

Reviewed By: subhadeepkaran

Differential Revision: D94394555

fbshipit-source-id: ea477dd4146619ba1106a16f3021e0437ece9074
…#5064)

Summary:
Pull Request resolved: facebookresearch#5064

Useful when splitting codes that do and don't need packing. For example, in rabitq, output codes from encode_vectors look like [rqfs_codes][flat_factors], and input to pq4_pack_codes should only be a block of [rqfs_codes][rqfs_codes]...

Reviewed By: alibeklfc

Differential Revision: D100047797

fbshipit-source-id: 9cbad95beba8ddbbbe4e6ce8c4541692d3d7b0fa
…acebookresearch#5065)

Summary:
Pull Request resolved: facebookresearch#5065

The `reservePriorityQueue` helper in HNSW.cpp defines a local `Access` struct inheriting from `std::priority_queue` but uses parenthesized initialization `Access access(std::move(q))`. Apple Clang with libc++ on macOS-14 correctly rejects this because the implicit move constructor of `Access` takes `Access&&`, not `std::priority_queue&&`.

The fix changes from parenthesized initialization to brace initialization `Access access{std::move(q)}`, which uses C++17 aggregate initialization. Since `Access` is an aggregate (no user-declared constructors, only a `using` declaration for member access), brace initialization directly initializes the base class sub-object from the `priority_queue&&` argument. This is backward-compatible with GCC, MSVC, and all Clang versions.

Note: The alternative approach of adding `using std::priority_queue<T, Container, Compare>::priority_queue;` to inherit base constructors was considered but rejected because it removes `Access`'s aggregate status, breaking C++20 parenthesized aggregate initialization that the Linux toolchain (clang19) relies on.
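
A self-contained illustration of the initialization issue (simplified; the real `Access` exposes the protected container of the queue so its capacity can be reserved):

```cpp
#include <queue>

template <class T>
struct Access : std::priority_queue<T> {
    using std::priority_queue<T>::c; // expose the protected container
};

void demo() {
    std::priority_queue<int> q;
    // Access<int> a(std::move(q)); // libc++ rejects: move ctor wants Access&&
    Access<int> a{std::move(q)};    // OK: C++17 aggregate init of the base
    a.c.reserve(128);               // underlying vector is now reachable
}
```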

Reviewed By: junjieqi

Differential Revision: D100090366

fbshipit-source-id: 07821475d2f2fbc205fb3288cf25a6ebea0ca3a5
Summary:
Pull Request resolved: facebookresearch#5062

Convert `utils/partitioning.cpp` SIMD code to dynamic dispatch so that
IVF partition assignment and histogram functions use the correct SIMD
implementation at runtime instead of being dead-coded in DD mode.

The partitioning code has two SIMD blocks:
1. `simd_partitioning` namespace — SIMD-accelerated uint16_t partition
   using simdlib wrappers (simd16uint16, simd32uint8)
2. Histogram subroutines — SIMD 8-bin and 16-bin histogram computation

Both were guarded by `#ifdef __AVX2__` / `#if defined(__AVX2__) || defined(__aarch64__)`
which are always false in DD mode on x86, silently disabling the fast paths.

Approach:
- Extract all SIMD code into `partitioning_simdlib256.h`, a shared header
  included once per ISA TU (AVX2, NEON). The code uses simdlib portable
  wrappers so it works on both x86 and ARM without changes.
- Create per-ISA TUs (`partitioning_avx2.cpp`, `partitioning_neon.cpp`)
  that include the shared header with the correct compiler flags.
- In the common TU, replace `#ifdef` guards with `with_simd_level_256bit`
  dispatch. NONE level falls through to dedicated scalar fallbacks.
- No AVX512 TU needed — code uses only 256-bit ops; AVX512 falls through
  to AVX2 via the dispatch mechanism.

Reviewed By: mdouze

Differential Revision: D99991775

fbshipit-source-id: 726cdf3a46db31ed1ff1f9a8966e471d9f5ac0b1
…earch#5069)

Summary:
Pull Request resolved: facebookresearch#5069

Remove the 10 global bare-name using declarations (simd16uint16,
simd32uint8, simd8uint32, simd8float32, simd256bit, simd512bit,
simd32uint16, simd64uint8, simd16float32) from simdlib_dispatch.h.

These aliases resolved through SINGLE_SIMD_LEVEL which is NONE in DD
mode, creating a trap where per-ISA TU code accidentally uses scalar
emulation. Each file that needs the aliases now declares its own using
with an explicit SIMD level, making the dependency visible.

Behavior-preserving: all files use the same SINGLE_SIMD_LEVEL they
used before, but now explicitly.

Reviewed By: mdouze

Differential Revision: D100033901

fbshipit-source-id: 2db034c2868de275a5f018d886264762776548c5
Summary:
Pull Request resolved: facebookresearch#5063

Fix two bugs in `fbcode/faiss/impl/NSG.cpp`:

**Bug 1: `init_ids[i]` → `init_ids[num_ids]` in `search_on_graph`**

The init loop in `search_on_graph` reads neighbors of the enterpoint from the
knn_graph. When an entry has `id >= ntotal`, it is skipped via `continue`. The
loop variable `i` advances but `num_ids` (the write pointer) does not. The old
code wrote `init_ids[i] = id`, placing valid entries at non-contiguous positions
and leaving gaps in between. The gap-filling loop that follows starts from
`num_ids`, so it never overwrites the internal gaps.

Example with neighbors `[5, 99999, 3, 99999, 7]` and `ntotal=100`:

| i | id    | old: `init_ids[i]=id` | new: `init_ids[num_ids]=id` |
|---|-------|-----------------------|-----------------------------|
| 0 | 5     | `init_ids[0] = 5`     | `init_ids[0] = 5`           |
| 1 | 99999 | skip (gap at [1])     | skip                        |
| 2 | 3     | `init_ids[2] = 3`     | `init_ids[1] = 3`           |
| 3 | 99999 | skip (gap at [3])     | skip                        |
| 4 | 7     | `init_ids[4] = 7`     | `init_ids[2] = 7`           |

Old result: `[5, 0, 3, 0, 7, ...]` — gaps contain 0 (from `vector<int>`
zero-initialization). The consumption loop reads these zeros as node IDs,
biasing the search pool toward node 0 during graph construction and degrading
graph quality.

Fixed result: `[5, 3, 7, ...]` — valid entries packed contiguously, gap-filling
starts from the correct position.
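
A simplified sketch of the packed-write fix (names follow the summary, not the verbatim NSG code):

```cpp
#include <cstdint>
#include <vector>

void collect_init_ids(
        const std::vector<int64_t>& neighbors,
        int64_t ntotal,
        std::vector<int64_t>& init_ids) {
    size_t num_ids = 0;
    for (size_t i = 0; i < neighbors.size() && num_ids < init_ids.size();
         i++) {
        int64_t id = neighbors[i];
        if (id < 0 || id >= ntotal) {
            continue; // skip invalid entry without advancing the write pointer
        }
        init_ids[num_ids++] = id; // was init_ids[i] = id, which left gaps
    }
    // the gap-filling loop then starts from num_ids, the correct position
}
```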

**Bug 2: `sync_prune` out-of-bounds access**

`sync_prune` accesses `pool[start]` without bounds checking. Two cases cause
out-of-bounds reads:
- Pool is empty after augmentation
- Pool contains only the query node itself (e.g., `ntotal=1`)

In both cases `start` advances past `pool.size()` and `pool[start]` is an
out-of-bounds vector read — undefined behavior that crashes under ASAN and
silently corrupts the graph in release builds.

Trace for `ntotal=1, L=1`:
1. `search_on_graph` returns `pool = [{id:0, dist:0}]`
2. `sync_prune(q=0)`: `pool[0].id == q` → `start++` → `start=1 == pool.size()`
3. Old code: `result.push_back(pool[1])` — OOB read
4. Fix: guard checks `start >= pool.size()`, fills graph row with `EMPTY_ID`

**Other fixes:**
- Replace `min = 1e6` (float-to-int truncation) with `std::numeric_limits<int>::max()`
- Remove `srand(0x1998)` from the NSG constructor (global RNG side effect)

Reviewed By: mnorris11

Differential Revision: D100024850

fbshipit-source-id: 0d290801658e381e198b6c6ab54ebe981e0f09f3
…region (facebookresearch#5053)

Summary:
Pull Request resolved: facebookresearch#5053

C++ exceptions thrown inside `#pragma omp parallel` regions that are not
caught within the region call `std::terminate` — they cannot propagate
across thread boundaries.

`IndexIVF::range_search_preassigned` had the same class of bugs fixed in
`search_preassigned` by D99455250:

1. **`scan_list_func` lambda**:
   `FAISS_THROW_IF_NOT_FMT(key < nlist)` was above the try-catch block,
   so a corrupt key >= nlist would throw uncaught and call
   `std::terminate`.

2. **Outer parallel region**:
   `get_InvertedListScanner()`, `scanner->set_query()`, and
   `FAISS_THROW_IF_NOT(scanner.get())` had no try-catch at all.

Fixes:

1. Moved the existing try-catch in `scan_list_func` up to also cover the
   key validation.

2. Wrapped the entire `#pragma omp parallel` body in a try-catch that
   uses the existing `interrupt`/`exception_string`/`exception_mutex`
   pattern to safely propagate exceptions out of the parallel region.
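
The capture-and-rethrow pattern, sketched (the names `interrupt`/`exception_string`/`exception_mutex` follow the summary; the body is simplified):

```cpp
#include <exception>
#include <mutex>
#include <stdexcept>
#include <string>

void parallel_region_sketch() {
    bool interrupt = false;
    std::string exception_string;
    std::mutex exception_mutex;

#pragma omp parallel
    {
        try {
            // ... per-thread work that may throw ...
        } catch (const std::exception& e) {
            std::lock_guard<std::mutex> lock(exception_mutex);
            exception_string = e.what();
            interrupt = true; // workers poll this flag and bail out early
        }
    }
    // rethrow on the calling thread, outside the parallel region
    if (interrupt) {
        throw std::runtime_error(exception_string);
    }
}
```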

Reviewed By: mnorris11

Differential Revision: D99879998

fbshipit-source-id: e768372cbaf8a22a9459fc3fd9b9df6e019897a6
…5068)

Summary:
Fix a bug so that parallelization actually happens.

Pull Request resolved: facebookresearch#5068

Reviewed By: junjieqi

Differential Revision: D100213935

Pulled By: alibeklfc

fbshipit-source-id: 31f4d8b05843d0b97f0539f69db0aebecb0063a8
…random bounds (facebookresearch#5072)

Summary:
Pull Request resolved: facebookresearch#5072

Fix correctness bugs in `NNDescent` `Nhood` copy/move operations and `gen_random` bounds.

## Bug 1: Broken Nhood copy constructor and copy assignment operator

The copy constructor and copy assignment operator for `Nhood` were incomplete:
- Copy assignment used `std::back_inserter` to append to `nn_new` instead of replacing it, leading to duplicate entries on reassignment and heap-use-after-free on self-assignment.
- Neither operation copied `pool`, `nn_old`, `rnn_old`, or `rnn_new`, meaning copied `Nhood` objects had missing neighbor data.
- This caused data loss when `std::vector<Nhood>` reallocated during `push_back`.

Fixed both operations to properly copy all 6 data members. Added self-assignment guard (`if (this != &other)`) to the copy assignment operator. Changed the copy constructor to use a member initializer list in declaration order to avoid `-Wreorder` warnings.

**Proof:** `NhoodCopy.CopyConstructorPreservesAllFields` and `NhoodCopy.CopyAssignmentPreservesAllFields` fail without the fix — `pool.size()` is 0 (expected 3), `nn_old`, `rnn_new`, `rnn_old` are all empty. `NhoodCopy.CopyAssignmentSelfAssign` triggers heap-use-after-free without the self-assignment guard. `NhoodCopy.VectorReallocationPreservesData` shows data loss during `std::vector<Nhood>` reallocation.

## Bug 2: Division by zero in `gen_random`

When `size == N`, the expression `rng() % (N - size)` is a division by zero (undefined behavior). This occurs in `search()` when `search_L` or `topk` equals `ntotal`, because `L_2 = max(search_L, topk)` is passed to `gen_random(rng, init_ids.data(), L_2, ntotal)`.

Added a precondition assertion (`size > 0 && size <= N`) and a Fisher-Yates shuffle for the `size == N` special case.

**Proof:** `TestNNDescentGenRandom.test_search_L_equals_ntotal` crashes (process killed by SIGFPE) without the fix, passes with it.
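
A sketch of the fixed entry point (signature simplified; the `size < N` sampling path is unchanged and omitted):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>

// When size == N the original code evaluated rng() % (N - size), a
// division by zero; a full permutation covers that case instead.
void gen_random_sketch(std::mt19937& rng, int* addr, int size, int N) {
    assert(size > 0 && size <= N); // the added precondition
    if (size == N) {
        std::iota(addr, addr + size, 0);
        std::shuffle(addr, addr + size, rng); // Fisher-Yates
        return;
    }
    // ... original sampling path for size < N ...
}
```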

## Performance: Added move constructor and move assignment operator

Since `std::mutex` is neither copyable nor movable, the compiler cannot generate implicit move operations for `Nhood`. With user-defined copy operations, implicit move generation is suppressed entirely. Without explicit move operations, `std::vector<Nhood>::push_back(Nhood&&)` falls back to copy — 5 unnecessary vector allocations per element.

Added `noexcept` move constructor and move assignment operator. `noexcept` is required for `std::vector` to prefer move over copy during reallocation. The move assignment operator is included for Rule of Five consistency.

**Proof:** All tests pass both with and without move operations, confirming these are a performance optimization, not a correctness fix.

## Cleanup: Removed misleading `omp_get_thread_num()`

The RNG seed in `nndescent()` was `random_seed * 6577 + omp_get_thread_num()`. This function is not inside any `#pragma omp parallel` region — the call chain is `IndexNNDescent::add()` -> `NNDescent::build()` -> `NNDescent::nndescent()`, all sequential. Per the OpenMP specification, `omp_get_thread_num()` returns 0 in sequential context. The `+ 0` is dead code.

**Proof:** No behavioral change. The seed was always `random_seed * 6577`.

Reviewed By: mnorris11

Differential Revision: D100155792

fbshipit-source-id: 042a7d0a53a7696915a96bf1e48a464507f044b3
…h#5040)

Summary:
Pull Request resolved: facebookresearch#5040

Add the missing `key < nlist` upper-bound check in
`IndexIVF::search1()`, which was the only IVF search entry point
lacking this validation. The other two paths —
`search_preassigned()` and `range_search_preassigned()` — already
had this check.

Also add deserialization acceptance tests verifying that IVF indexes
with various quantizer states deserialize correctly:

1. **Surplus centroids** (`ntotal > nlist`):
   Produced by `shard_ivf_index_centroids()`, which distributes
   all of the original quantizer's centroids across shards without
   adjusting `nlist`. The search-time `key < nlist` bounds check
   prevents OOB access if the quantizer returns out-of-range keys.

2. **Trained quantizer** (`ntotal == nlist`):
   The normal trained IVF state.

3. **Sharded quantizer** (`0 < ntotal < nlist`):
   Also produced by `shard_ivf_index_centroids()`, when the
   original quantizer has `ntotal == nlist` and centroids are
   split across shards.

4. **Untrained quantizer** (`ntotal == 0`):
   Legitimate for custom inverted list management.

Reviewed By: mnorris11

Differential Revision: D99494237

fbshipit-source-id: 6a76b55f104b9c233dfdd2625bb0336ed8061463
…#5054)

Summary:
Pull Request resolved: facebookresearch#5054

`IndexSVSVamana::storage_kind` was declared without a default initializer,
and the default constructor is `= default`, so the field was left
uninitialized in default-constructed instances. This is undefined behavior
any time the value is read — including serialization via `write_index`,
which writes the garbage value to disk.

Add `= SVS_FP32` as the default initializer, matching the default used by
the parameterized constructor `IndexSVSVamana(d, degree, metric, storage)`.

This is a safe, behavior-preserving change:

- The parameterized constructor already defaults to `SVS_FP32`, so any
  code constructing an index with arguments is unaffected.

- The two derived classes (`IndexSVSVamanaLVQ`, `IndexSVSVamanaLeanVec`)
  explicitly set `storage_kind` in their own default constructors, so
  they are also unaffected.

- The only code path that changes behavior is default construction of
  `IndexSVSVamana` itself, which previously produced an uninitialized
  (UB) value and now produces `SVS_FP32`.

Reviewed By: mnorris11

Differential Revision: D99891611

fbshipit-source-id: da6acff7bdeb5668a2bf5f3b585bc1a3179004b9
Summary:
Pull Request resolved: facebookresearch#5055

Add deserialization-time validation for the `storage_kind` field in SVS
index types (IndexSVSVamana, IndexSVSVamanaLVQ, IndexSVSVamanaLeanVec)
to reject corrupted or malicious index files before they can cause
crashes.

1. **Read `storage_kind` as int and range-check before cast**:
   `storage_kind` was previously read directly into the `SVSStorageKind`
   enum via `READ1`, which is undefined behavior for out-of-range values.
   Now read into a temporary `int`, validate the value is in
   `[0, SVS_count)`, and only then cast to `SVSStorageKind`. This rejects
   invalid values at deserialization time with a `FaissException` instead
   of reaching `to_svs_storage_kind()` where the `default` branch calls
   `FAISS_ASSERT(false)` and aborts the process.

2. **Add `SVS_count` sentinel to `SVSStorageKind` enum**:
   Follows the convention used by `QT_count`, `ST_count`, and
   `DMT_count` in other FAISS enums. The deserialization validation
   uses this sentinel so it automatically stays correct when new
   storage kinds are added.
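
The read-validate-cast pattern, sketched (`READ1` is FAISS's deserialization macro; the surrounding code is simplified):

```cpp
int storage_kind_raw = 0;
READ1(storage_kind_raw); // read as int, never directly into the enum
FAISS_THROW_IF_NOT_MSG(
        storage_kind_raw >= 0 && storage_kind_raw < SVS_count,
        "invalid SVSStorageKind in serialized index");
idx->storage_kind = static_cast<SVSStorageKind>(storage_kind_raw);
```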

Reviewed By: mnorris11

Differential Revision: D99722676

fbshipit-source-id: 557cf91a963d1d93171fea2a67ba99f19b9b3420
…arch#5056)

Summary:
Pull Request resolved: facebookresearch#5056

Add deserialization-time validation for FastScan `M2` across all six
FastScan index types to reject corrupted or malicious index files that
would cause heap buffer overflows during search.

During normal construction, `M2 = roundup(M, 2)` is an invariant
maintained by `init_fastscan()`. During deserialization, `M2` is read
directly from the file and was not validated. A corrupted file with
`M2 < M` causes `compute_quantized_LUT` to write `M * ksub` bytes
into a buffer sized for `M2 * ksub` bytes, producing an out-of-bounds
write. The `memset` that zeroes padding from M to M2 additionally
underflows as an unsigned subtraction when `M2 < M`.

1. **Added `validate_fastscan_fields()` helper**:
   Consolidates all FastScan field validation into a single function:
   M > 0, ksub > 0, bbs > 0 and 32-aligned, M2 == roundup(M, 2),
   and overflow checks for ksub * M and ksub * M2.

2. **Non-IVF FastScan paths (already had partial validation)**:
   Replaced inline checks in IndexPQFastScan, IndexAdditiveQuantizer-
   FastScan, and IndexRaBitQFastScan with calls to the new helper,
   adding the missing M2 consistency check.

3. **IVF FastScan paths (had no validation at all)**:
   Added validation to IndexIVFPQFastScan, IndexIVFAdditiveQuantizer-
   FastScan, and IndexIVFRaBitQFastScan, which previously had zero
   checks on any FastScan fields.
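
A hedged sketch of the consolidated helper described in item 1 (the exact signature is assumed; `roundup` is the FAISS utility):

```cpp
void validate_fastscan_fields(size_t M, size_t ksub, size_t bbs, size_t M2) {
    FAISS_THROW_IF_NOT_MSG(M > 0, "M must be positive");
    FAISS_THROW_IF_NOT_MSG(ksub > 0, "ksub must be positive");
    FAISS_THROW_IF_NOT_MSG(
            bbs > 0 && bbs % 32 == 0, "bbs must be a positive multiple of 32");
    FAISS_THROW_IF_NOT_MSG(
            M2 == roundup(M, 2), "M2 must equal roundup(M, 2)");
    // plus overflow checks on ksub * M and ksub * M2 (omitted here)
}
```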

Reviewed By: mnorris11

Differential Revision: D99738294

fbshipit-source-id: 8e334993b0e8b4375f9ec173c20754e301b7c9f6
…earch#5077)

Summary:
Pull Request resolved: facebookresearch#5077

Mirror of D100047797 for the non-IVF IndexFastScan hierarchy.
Adds a pure virtual fast_scan_code_size() to IndexFastScan with
concrete implementations in IndexPQFastScan (M2/2),
IndexAdditiveQuantizerFastScan (M2/2), and IndexRaBitQFastScan ((d+7)/8).

Reviewed By: alibeklfc

Differential Revision: D100342866

fbshipit-source-id: 3e5edcb1f45d53eec2b41ca63c7854fd8f1f4280
Summary:
Extend the POSIX mmap reader to Apple platforms and use MAP_FAILED for mmap error checks while keeping madvise best-effort.

Update the C++ and Python mmap tests to exercise Darwin, and stop linking faiss_test against the Python example extension so Python-enabled test builds can run on macOS.

Pull Request resolved: facebookresearch#5058

Reviewed By: junjieqi

Differential Revision: D100351635

Pulled By: alibeklfc

fbshipit-source-id: f384251a634c7d3154103dbc763293b52b093ee8
…okup in sorting.cpp (facebookresearch#5078)

Summary:
Pull Request resolved: facebookresearch#5078

Four fixes in `faiss/utils/sorting.cpp`:

**1. OpenMP directive fix in `fvec_argsort_parallel`**

The initialization loop used `#pragma omp parallel` without the `for` directive. This caused every thread to execute the entire loop independently rather than distributing iterations. With `nt` threads, each `permA[i]` was written by all `nt` threads concurrently — a data race under the C++ memory model (multiple unsynchronized writes to the same non-atomic location), and O(n * nt) wasted work instead of O(n). Fixed by changing to `#pragma omp parallel for`.

In practice, all threads write the same value (`permA[i] = i`), so the output was always correct despite the UB. The fix eliminates the undefined behavior and the redundant work.
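
The one-word fix, sketched:

```cpp
// Before: every thread executes the full loop (nt redundant, racy passes)
#pragma omp parallel
for (size_t i = 0; i < n; i++) {
    permA[i] = i;
}

// After: iterations are divided among threads (one pass total)
#pragma omp parallel for
for (size_t i = 0; i < n; i++) {
    permA[i] = i;
}
```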

**2. RAII memory management in `fvec_argsort_parallel`**

Replaced `new size_t[n]` / `delete[] perm2` with `std::vector<size_t>`. The old code had no realistic exception path between allocation and deallocation (all intermediate code is either C functions or non-throwing OpenMP regions), but the manual `new`/`delete` pattern is fragile against future edits that might introduce a throwing path. The `std::vector` provides RAII lifetime management with no behavioral change.

**3. Removed debug `printf` in `fvec_argsort_parallel`**

A leftover `printf("merge %d %d, %d threads\n", ...)` in the parallel merge loop wrote to stdout during normal operation. Removed.

**4. Missing early termination in `hashtable_int64_to_int64_lookup`**

The linear probing loop did not check for empty slots (`tab[slot * 2] == -1`). In an open-addressing hash table with no deletion support, an empty slot is definitive proof that the key was not inserted — the insert function would have placed it there or earlier. Without this check, lookups for absent keys probed every slot in the bucket before the wrap-around termination at `slot == hk_i`. The fix adds the standard empty-slot check, matching the structure of the insert function (`hashtable_int64_to_int64_add`). This is a performance optimization — the old code always returned the correct result (`-1` after a full bucket scan), just slower.

Reviewed By: mnorris11

Differential Revision: D100317917

fbshipit-source-id: aadfe33b1d76c34e04db7fe0c9b7ca53b4a30c71
scsiguy and others added 29 commits April 19, 2026 08:54
…rch#5112)

Summary:
Pull Request resolved: facebookresearch#5112

Add validation in read_ivf_header() to reject a null quantizer sub-index read from serialized data. The IVF deserialization reads the quantizer via read_index(), which returns nullptr when the stream contains the "null" fourcc. A null quantizer is fundamentally invalid for any IVF index type. Without this check, downstream code (e.g. initialize_IVFPQ_precomputed_table, IndexIVF::search) dereferences the null pointer.

This single validation protects all IVF index types that share read_ivf_header: IndexIVFFlat, IndexIVFPQ, IndexIVFScalarQuantizer, IndexIVFAdditiveQuantizer, and others.

Reviewed By: mnorris11

Differential Revision: D101236489

fbshipit-source-id: d9eb6759024ee2a4a59b838367ebf9299759ff23
…ch (facebookresearch#5113)

Summary:
Pull Request resolved: facebookresearch#5113

Add validation that IndexHNSW2Level (fourcc "IHN2") has storage of an appropriate type, both at deserialization time and at search time.

IndexHNSW2Level::search() uses dynamic_cast to dispatch between Index2Layer and IndexIVFPQ storage types. When storage is null or a different type (e.g. IndexFlat from corrupt serialized data, or a programmatically misconfigured index), the dynamic_cast returns nullptr which is then unconditionally dereferenced, causing a segfault.

Deserialization-time fix: After reading the HNSW storage sub-index for IHN2, validate that storage is non-null and is either Index2Layer or IndexIVFPQ.

Search-time defense-in-depth: Add a FAISS_THROW_IF_NOT check on the dynamic_cast result in IndexHNSW2Level::search() before dereferencing. This protects against programmatically constructed indexes that bypass deserialization validation.
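
The search-time guard, sketched (simplified from the description above):

```cpp
const Index2Layer* index_2l = dynamic_cast<const Index2Layer*>(storage);
const IndexIVFPQ* index_ivfpq = dynamic_cast<const IndexIVFPQ*>(storage);
FAISS_THROW_IF_NOT_MSG(
        index_2l || index_ivfpq,
        "IndexHNSW2Level storage must be Index2Layer or IndexIVFPQ");
// only now is it safe to dispatch on the concrete storage type
```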

Reviewed By: mnorris11

Differential Revision: D101243603

fbshipit-source-id: f3d75c1b19e68bf8539c55877c94749ef2899445
…uerying untrained indexes (facebookresearch#5114)

Summary:
Pull Request resolved: facebookresearch#5114

Add FAISS_THROW_IF_NOT(is_trained) to IndexIVF::search(), IndexIVF::search_preassigned(), IndexIVF::range_search(), and IndexIVF::range_search_preassigned(), mirroring the existing check in IndexScalarQuantizer::search(). This prevents querying untrained IVF indexes deserialized from corrupt data where the ScalarQuantizer trained vector is empty.

The existing deserialization validation in read_ScalarQuantizer correctly allows untrained indexes (is_trained=false with empty trained) to be deserialized, since these are legitimately produced by index_factory before training. However, IndexIVF search methods lacked the is_trained guard that IndexScalarQuantizer::search() has, allowing a deserialized untrained IndexIVFScalarQuantizer to be queried, which causes null-deref in QuantizerTemplate when it indexes into the empty trained vector.

Reviewed By: mnorris11

Differential Revision: D101243973

fbshipit-source-id: eca68dc82e5cca37d4c461b735c5d59a66349248
…facebookresearch#5115)

Summary:
Pull Request resolved: facebookresearch#5115

Add deserialization-time validation for VectorTransform dimension invariants that are enforced by constructors but not by deserialization:

1. NormalizationTransform (VNrm): Require d_in == d_out. The constructor enforces this (both set to d), but deserialization reads them independently. A crafted file with d_in > d_out causes memcpy in apply_noalloc to overflow the output buffer (allocated as n * d_out floats but copied as n * d_in).

2. CenteringTransform (VCnt): Same d_in == d_out invariant.

3. IndexPreTransform (IxPT) chain consistency: Validate that chain[0].d_in == index.d, chain[i].d_in == chain[i-1].d_out for consecutive transforms, and chain.back().d_out == sub_index.d. Without this, mismatched dimensions between transforms cause out-of-bounds reads when one transform produces fewer elements than the next expects.
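
The chain-consistency checks of item 3, sketched (`index_pt` and `sub_index` are illustrative names for the IndexPreTransform being read and its wrapped index; `chain` holds the `VectorTransform` pointers):

```cpp
FAISS_THROW_IF_NOT_MSG(
        chain.front()->d_in == index_pt->d,
        "first transform input dim must match index dim");
for (size_t i = 1; i < chain.size(); i++) {
    FAISS_THROW_IF_NOT_MSG(
            chain[i]->d_in == chain[i - 1]->d_out,
            "consecutive transforms must have matching dims");
}
FAISS_THROW_IF_NOT_MSG(
        chain.back()->d_out == sub_index->d,
        "last transform output dim must match sub-index dim");
```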

Reviewed By: mnorris11

Differential Revision: D101244181

fbshipit-source-id: fbe88bf63d42381297319d4125e750f6d47bc333
…facebookresearch#5117)

Summary:
Pull Request resolved: facebookresearch#5117

Add deserialization byte limit checks before vector::resize calls in read_InvertedLists_up() for both ArrayInvertedListsPanorama ("ilpn") and ArrayInvertedLists ("ilar") paths. Previously, per-list sizes read from serialized data were used directly in .resize() calls without checking against get_deserialization_vector_byte_limit(). The READVECTOR macro enforces this limit, but explicit .resize() calls bypassed it.

Also add mul_no_overflow protection for the ilpn codes allocation (num_elems * code_size) which previously had no overflow check.

Reviewed By: mnorris11

Differential Revision: D101260923

fbshipit-source-id: 24287740642cc9647115676c71508faf8bf8f48e
…rrupt index data (facebookresearch#5118)

Summary:
Pull Request resolved: facebookresearch#5118

Add per-read byte limit enforcement to the ReaderStreambuf bridge between faiss IOReader and std::istream, used by SVS index deserialization. SVS third-party code reads sizes from the stream and immediately allocates (e.g. string::resize, vector::resize) without any size validation. Since SVS operates through std::istream, it completely bypasses faiss's deserialization_vector_byte_limit mechanism enforced in the IOReader/READVECTOR layer.

The fix adds a per_read_byte_limit parameter to ReaderStreambuf. When set, xsgetn() rejects individual read requests that meet or exceed the limit by returning 0 (EOF). This matches READVECTOR semantics where each individual vector allocation is independently checked against deserialization_vector_byte_limit. Small reads (header fields, size values) pass through unimpeded; only oversized bulk reads that correspond to data allocations in the SVS code are rejected. All three SVS deserialization call sites now pass get_deserialization_vector_byte_limit() as the per-read limit.
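
A simplified sketch of the guard (member names are assumed, not copied from the code; `faiss::IOReader::operator()` reads `nitems` items of `size` bytes):

```cpp
#include <streambuf>

#include <faiss/impl/io.h>

class ReaderStreambufSketch : public std::streambuf {
    faiss::IOReader* reader;
    size_t per_read_byte_limit; // 0 = unlimited

   protected:
    std::streamsize xsgetn(char* s, std::streamsize n) override {
        if (per_read_byte_limit > 0 &&
            static_cast<size_t>(n) >= per_read_byte_limit) {
            return 0; // oversized bulk read: report EOF instead of allocating
        }
        return (*reader)(s, 1, n); // read n single-byte items via the IOReader
    }

   public:
    ReaderStreambufSketch(faiss::IOReader* r, size_t limit)
            : reader(r), per_read_byte_limit(limit) {}
};
```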

Reviewed By: mnorris11

Differential Revision: D101261327

fbshipit-source-id: 6e45aec63de42e5b5eaf811f4bb9b06732b09eb5
facebookresearch#5125)

Summary:
Pull Request resolved: facebookresearch#5125

The faiss-gpu conda recipe pins `{{ compiler('cxx') }} =12.4` (GCC 12.4). GCC 12.4 miscompiles the 16-bin SIMD histogram reduction in `partitioning_simdlib256.h`, producing correct results for bins 0-7 but near-zero for bins 8-15. This causes `test_16bin_bounded_bigrange` in `TestHistograms_AVX2` to fail in the CUDA 12.6 GPU nightly.

The bug is in GCC 12's code generation for the AVX2 cross-lane reduction chain (`_mm256_hadd_epi16` → `_mm256_permute2f128_si256` → `_mm256_permutevar8x32_epi32`). GCC 13 and 14 both compile this correctly. The CPU-only `faiss/meta.yaml` leaves the compiler unpinned (gets GCC 14), which is why only the GPU nightly fails.

The GCC 12.4 pin was introduced in D84193438 as part of a batch nightly fix — not a deliberate CUDA compatibility constraint. CUDA 12.6 supports up to GCC 13.x as host compiler (GCC 14 requires CUDA 12.9+), so we widen the pin to `>=12.4,<14`.

Reproduced locally: GCC 12.4 fails, GCC 13.4 passes, GCC 14.2 passes — all on the same faiss source, same test, same machine.

Reviewed By: mdouze

Differential Revision: D101601476

fbshipit-source-id: 8e36c83a9df67ba66408faa4ca392e1bd46d7c87
)

Summary:
Pull Request resolved: facebookresearch#5074

Move `with_simd_level` / `with_simd_level_256bit` calls outside the
enclosing loops so the SIMD level is resolved once rather than on every
iteration.

Sites fixed:
- distances.cpp: knn_inner_products_by_idx, knn_L2sqr_by_idx
- NeuralNet.cpp: ZnLUTCodec::encode
- ClusteringInitialization.cpp: init_kmpp_plus_plus

Reviewed By: mdouze

Differential Revision: D100144174

fbshipit-source-id: bd2369ed4fd9c3b5b54e435c7ee66a03f0e152df
…esearch#5126)

Summary:
Pull Request resolved: facebookresearch#5126

Replace the dispatch_HammingComputer + Run_XXX consumer struct pattern with
with_HammingComputer that takes a C++20 template lambda directly. This
eliminates boilerplate wrapper structs across 8 files.

Before:
  struct Run_foo { using T = void; template<class HC, class... Args> void f(Args... a) { foo<HC>(a...); } };
  Run_foo r; dispatch_HammingComputer(code_size, r, args...);

After:
  with_HammingComputer(code_size, [&]<class HC>() { foo<HC>(args...); });

Reviewed By: algoriddle

Differential Revision: D101350351

fbshipit-source-id: 02a346e8c33ffdb49153cbe13415b748f0a1e847
Summary: Pull Request resolved: facebookresearch#5048

Reviewed By: mnorris11, hanle11

Differential Revision: D99419595

fbshipit-source-id: 9c1214c6f4b88bf41e9d1851dd0acb5c7c5001ef
…#5132)

Summary: Pull Request resolved: facebookresearch#5132

Reviewed By: limqiying, junjieqi

Differential Revision: D101359141

fbshipit-source-id: 7d78875eed114367d4a45215e058f5fa9ebf06a1
…bookresearch#5031)

Summary:
Pull Request resolved: facebookresearch#5031

`IndexHNSW` allocates and initializes locks for `ntotal+n` nodes on every call to `add()`. This makes batched insertion very costly, and incremental insertion prohibitively so.

This diff introduces optional persistent locks for `IndexHNSW` to improve incremental `add()` performance. Previously, `omp_lock_t` arrays of size `ntotal+n` were created/destroyed on each `add()` call. Now locks can be retained via a new `retain_locks` flag (default: false), using a new `HNSW::Lock` RAII wrapper with geometric growth.

RFC: Instead of `retain_locks` being the only way to opt into this new behavior, it could be inferred on the first incremental add: that is, clear the locks after insertion iff `n0 == 0`. Workloads which call `add()` once would be unaffected, while workloads which call `add()` repeatedly would forego the clearing of the lock vector after the second `add()` call and reuse locks for all subsequent calls. The downside would be losing the ability to reclaim the locks after insertion without HNSW-specific behavior at the call site.
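
A hypothetical usage sketch (the flag name `retain_locks` comes from the summary; the index type and batching loop are illustrative):

```cpp
#include <algorithm>

#include <faiss/IndexHNSW.h>

void incremental_build(const float* data, size_t nb, size_t batch, int d) {
    faiss::IndexHNSWFlat index(d, /*M=*/32);
    index.retain_locks = true; // keep per-node locks across add() calls
    for (size_t i = 0; i < nb; i += batch) {
        index.add(std::min(batch, nb - i), data + i * d);
    }
}
```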

Reviewed By: mdouze

Differential Revision: D98232750

fbshipit-source-id: ef55cd9e4eb79793267a29a06502a582873e6a74
Summary:
Pull Request resolved: facebookresearch#5129

## Bug

`HNSW::add_with_locks()` updates two shared member variables — `max_level` and `entry_point` — after releasing the per-node lock, without any synchronization:

```cpp
omp_unset_lock(&locks[pt_id]);

if (pt_level > max_level) {   // read shared state
    max_level = pt_level;     // write shared state
    entry_point = pt_id;      // write shared state
}
```

This function is called from inside `#pragma omp for` in `hnsw_add_vertices()` (IndexHNSW.cpp), meaning multiple threads execute it concurrently. The unprotected check-then-act pattern is a classic TOCTOU race condition.

## Proof by interleaving

Suppose `max_level = 2` and two threads finish their link-building simultaneously:

- Thread A: `pt_level = 4`, `pt_id = 100`
- Thread B: `pt_level = 3`, `pt_id = 200`

| Step | Thread A                              | Thread B                              | max_level | entry_point |
|------|---------------------------------------|---------------------------------------|-----------|-------------|
| 0    | —                                     | —                                     | 2         | (level-2 node) |
| 1    | reads `4 > 2` -> true                 |                                       | 2         |             |
| 2    |                                       | reads `3 > 2` -> true                 | 2         |             |
| 3    | writes `max_level = 4`                |                                       | 4         |             |
| 4    | writes `entry_point = 100`            |                                       | 4         | 100         |
| 5    |                                       | writes `max_level = 3`                | 3         | 100         |
| 6    |                                       | writes `entry_point = 200`            | 3         | 200         |

**Result**: `max_level = 3`, `entry_point = 200` (a node at level 3). But node 100 exists at level 4 — the true maximum. The HNSW invariant that `entry_point` is a node at `max_level` is violated.

## Consequence

Search starts from `entry_point` and walks down from `max_level`. With a wrong entry point at a lower level, the upper levels of the graph are never traversed during search, leading to silently degraded recall. The index does not crash and still returns results — they are just worse.

## Fix

Wrap the check-and-update in `#pragma omp critical` to make it atomic:

```cpp
#pragma omp critical
{
    if (pt_level > max_level) {
        max_level = pt_level;
        entry_point = pt_id;
    }
}
```

This guarantees that only one thread executes the block at a time. In the interleaving above, Thread B would enter the critical section after Thread A completes, see `max_level = 4`, evaluate `3 > 4` as false, and correctly skip the write.

## Note on the read at line 561

`int level = max_level` reads `max_level` without synchronization. This is technically a data race under the C++ memory model, but it is benign: reading a stale value just means the greedy search starts one level too low, which the algorithm handles correctly (it still finds correct neighbors, just slightly less efficiently). Adding synchronization here would introduce overhead on every iteration of a hot loop for negligible benefit.

## Why existing tests did not catch this

1. **Tiny race window**: both threads must pass the `if` check in the few nanoseconds before either writes — extremely unlikely per run.
2. **Subtle consequence**: a wrong entry point degrades recall slightly but does not crash or return wrong types. Tests assert recall thresholds (e.g., recall > 0.9), not exact values.
3. **Rare trigger condition**: the race only fires when two nodes added concurrently both exceed the current `max_level`. Higher HNSW levels are exponentially less probable by design — most nodes are level 0, and the highest levels typically have only 1-2 nodes, making concurrent contention on `max_level` near-impossible in practice.

Reviewed By: mnorris11

Differential Revision: D101444067

fbshipit-source-id: 82b9fdafed0b7c3cc26eb4d6c7e3536e6e12bee3
…search#5130)

Summary:
Pull Request resolved: facebookresearch#5130

This diff fixes four bugs in `Clustering.cpp`: three trigger only for datasets with more than 2,147,483,647 vectors (`INT_MAX`), and one can trigger regardless of dataset size.

## Bug 1: Integer truncation in fast subsampling — out-of-bounds memory access

**Location**: `subsample_training_set()`, line 96

**Before**:
```cpp
std::vector<int> perm;
// ...
perm[i] = rng.rand_int(nx);
```

**Bug**: `rand_int(int max)` takes an `int` parameter. When `nx` is `idx_t` (`int64_t`) and exceeds `INT_MAX`, the implicit narrowing conversion truncates `nx` to `int`. On two's complement (all target platforms), a value like `3,000,000,000` becomes `-1,294,967,296`. The function then generates a "random" index in a garbage range. These values are stored in `perm` and used as array indices:

```cpp
memcpy(x_new + i * line_size, x + perm[i] * line_size, line_size);
```

A negative `perm[i]` produces an out-of-bounds read from before the start of `x`. This is undefined behavior that can crash or silently corrupt data.

**Fix**:
```cpp
std::vector<idx_t> perm;
// ...
perm[i] = rng.rand_int64() % nx;
```

Two changes: (1) `perm` is now `std::vector<idx_t>` so it can hold indices > `INT_MAX`. (2) `rand_int64()` returns `int64_t`, and `% nx` produces a value in `[0, nx)` without any narrowing. The result is stored losslessly in `idx_t`.
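
A standalone illustration of the narrowing, assuming nothing beyond the standard library (`take_int` is a stand-in for `rand_int`'s `int` parameter):

```cpp
#include <cstdint>
#include <cstdio>

// Passing an int64_t above INT_MAX through an int parameter wraps on
// two's complement targets, producing the garbage range described above.
int take_int(int max) {
    return max;
}

int main() {
    int64_t nx = 3000000000; // > INT_MAX
    std::printf("%d\n", take_int(nx)); // prints -1294967296
}
```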

## Bug 2: Missing guard in standard subsampling path

**Location**: `subsample_training_set()`, lines 99-108

**Before**:
```cpp
} else {
    perm.resize(nx);
    rand_perm(perm.data(), nx, actual_seed);
}
```

**Bug**: `rand_perm(int* perm, size_t n, int64_t seed)` takes `int*` and internally does `perm[i] = i`. When `nx > INT_MAX`, the value `i` (a `size_t`) is narrowed to `int` on assignment, wrapping to negative values. These negative values are then used as dataset indices — same out-of-bounds access as Bug 1.

**Fix**:
```cpp
} else {
    FAISS_THROW_IF_NOT_FMT(
            nx <= static_cast<idx_t>(std::numeric_limits<int>::max()),
            "Dataset too large (%" PRId64
            ") for standard subsampling; "
            "set use_faster_subsampling=true",
            nx);
    std::vector<int> int_perm(nx);
    rand_perm(int_perm.data(), nx, actual_seed);
    perm.assign(int_perm.begin(), int_perm.end());
}
```

Three parts: (1) A guard that fails early with a clear error message directing the user to the fast path (which handles large datasets correctly via the Bug 1 fix). (2) A temporary `std::vector<int>` to satisfy `rand_perm`'s `int*` API — safe because the guard guarantees `nx <= INT_MAX`. (3) Copy into the `idx_t` perm vector so both paths produce the same type for downstream code.

We chose not to change `rand_perm`'s signature from `int*` to `idx_t*` because it is a public API in `faiss/utils/random.h` and changing it would break all callers.

## Bug 3: Infinite loop in split_clusters

**Location**: `split_clusters()`, lines 239-265

**Before**:
```cpp
for (cj = 0; true; cj = (cj + 1) % k) {
    float p = (hassign[cj] - 1.0) / (float)(n - k);
    float r = rng.rand_float();
    if (r < p) {
        break;
    }
}
```

**Bug**: This loop probabilistically selects a cluster to split (to replace an empty cluster). The probability of picking cluster `cj` is `p = (hassign[cj] - 1) / (n - k)`. When `hassign[cj] = 1` (cluster has exactly one vector), `p = 0 / (n - k) = 0`. No random float `r` satisfies `r < 0`, so that cluster is never picked.

**Proof of infinite loop**: If all non-empty clusters have exactly 1 vector assigned (which happens with bad initialization, adversarial data, or too many clusters), then every `p = 0` and the loop condition `true` is never broken. The loop spins forever, hanging the process.

Even in non-degenerate cases, the loop can be extremely slow. Example: `n = 10,000,000`, `k = 1000`, largest cluster has 50,000 vectors. Per-cluster probability: `p = 49999 / 9999000 ≈ 0.005`. Expected iterations to find a match: ~200. But with smaller clusters or larger `n`, this grows without bound.

**Fix**:
```cpp
size_t max_tries = 10 * k;
size_t n_tries = 0;
bool found = false;
for (cj = 0; n_tries < max_tries; cj = (cj + 1) % k) {
    float p = (hassign[cj] - 1.0) / (float)(n - k);
    float r = rng.rand_float();
    if (r < p) {
        found = true;
        break;
    }
    n_tries++;
}
if (!found) {
    cj = 0;
    for (size_t j = 1; j < k; j++) {
        if (hassign[j] > hassign[cj]) {
            cj = j;
        }
    }
}
```

After `10 * k` attempts (10 full passes through all clusters), the loop falls back to deterministically picking the largest cluster. This is semantically correct because the probabilistic selection is already weighted by cluster size — larger clusters have higher `p`. The deterministic fallback produces the most likely outcome of the probabilistic selection. Termination is guaranteed in O(k) time.

## Bug 4: Integer overflow in objective accumulation loop

**Location**: `Clustering::train_encoded()`, line 535

**Before**:
```cpp
for (int j = 0; j < nx; j++) {
    obj += dis[j];
}
```

**Bug**: `nx` is `idx_t` (`int64_t`). When `nx > INT_MAX`, `int j` overflows at 2,147,483,647. Signed integer overflow is undefined behavior per the C++ standard. In practice on two's complement, `j` wraps to `-2,147,483,648`, which satisfies `j < nx`, so the loop continues with a negative index. `dis[j]` with negative `j` is an out-of-bounds read — crash or garbage accumulation.

**Proof**: For `nx = 3,000,000,000`:
- `j` increments from 0 to 2,147,483,647 (correct)
- Next increment: UB, typically wraps to -2,147,483,648
- `-2,147,483,648 < 3,000,000,000` is true (the `int` operand is converted to `int64_t` for the comparison, and the inequality holds)
- `dis[-2147483648]` — out-of-bounds access

**Fix**:
```cpp
for (idx_t j = 0; j < nx; j++) {
    obj += dis[j];
}
```

`idx_t` matches `nx`'s type. The loop variable can represent all valid indices up to `nx`.

Reviewed By: mnorris11

Differential Revision: D101624009

fbshipit-source-id: b961f2677f7e7b93642fe795cfe6ca77812573d3
…#5075)

Summary:
Pull Request resolved: facebookresearch#5075

In DD mode, the QBS (bbs=32) accumulate path always used 256-bit kernels,
even in the AVX512 per-ISA TU. The 512-bit kernels in kernels_simd512.h
were dead because bare simdlib aliases resolve to _tpl<NONE> in DD mode,
and 512-bit NONE types don't exist (empty primary templates).

Fix: add function-local using declarations in both 512-bit kernel functions
to bind types to explicit AVX512/AVX2 levels. Create accumulate_loops_512.h
with FixedStorage512 (a non-virtual intermediate handler that bridges the
AVX2→NONE type gap via storeu/loadu at the handler boundary) and the 512-bit
QBS accumulate loop. Wire it into dispatching.h's ScannerMixIn behind an

Reviewed By: mdouze

Differential Revision: D100151879

fbshipit-source-id: b801f897f2d061a8448842f42edcdeb3a447eafd
…ebookresearch#5136)

Summary:
Pull Request resolved: facebookresearch#5136

Fixes integer truncation in `IDSelectorBatch::is_member` on platforms where `long` is 32-bit (Windows LLP64).

**Root cause.** `IDSelectorBatch::mask` is declared as `idx_t` (i.e. `int64_t` — see `MetricType.h:51`) and is computed in the constructor as `mask = ((idx_t)1 << nbits) - 1`, where `nbits = ceil(log2(n)) + 5`. For bloom filters sized for `n >= ~134M` ids, `nbits >= 32` and `mask` requires more than 32 bits to represent. The expression `i & mask` therefore yields a 64-bit `idx_t`. The previous code stored the result in a local `long im`:

- LP64 ABI (Linux, macOS x86_64/arm64): `long` is 64-bit — no truncation, behaves correctly.
- LLP64 ABI (Windows x86_64, MinGW): `long` is 32-bit — silently truncates the upper bits.

After truncation, `im >> 3` indexes the wrong bloom slot and `1 << (im & 7)` tests the wrong bit. This produces false negatives in the bloom filter, causing `is_member` to incorrectly return `false` for ids that are in the set, which silently drops legitimate matches during selection.

**Fix.** Change the local from `long im` to `idx_t im` so its type matches both operands of `i & mask`. This eliminates the platform-dependent truncation. As a small follow-on cleanup, change the early `return 0;` to `return false;` to match the function's `bool` return type (no behavior change — both compile to the same value).
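
A minimal sketch of the fixed lookup, assuming the member names used in this description (`mask`, `bloom`, `set`) rather than the exact faiss source:

```cpp
#include <cstdint>

#include <unordered_set>
#include <vector>

using idx_t = int64_t;

struct IDSelectorBatchSketch {
    std::unordered_set<idx_t> set; // exact id set
    std::vector<uint8_t> bloom;    // bloom filter bits
    idx_t mask = 0;                // (1 << nbits) - 1, may need > 32 bits

    bool is_member(idx_t i) const {
        idx_t im = i & mask; // was `long im`: truncated on LLP64 (Windows)
        if (!(bloom[im >> 3] & (1 << (im & 7)))) {
            return false; // bloom filter says definitely not present
        }
        return set.count(i) != 0; // confirm against the exact set
    }
};
```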

**Scope.** Intentionally narrow. Earlier iterations of this diff also widened `DirectMap::update_codes` from `int n` to `idx_t n` and added an `if (ii < 0) return false;` guard in `IDSelectorBitmap::is_member`. Both were reverted after review:

- The `DirectMap::update_codes` widening was cosmetic: its sole caller `IndexIVF::update_vectors` still takes `int n` (see `IndexIVF.h:357`), so widening the inner type cannot unlock any larger batch size. Lifting the 2^31 cap would require widening the public virtual `update_vectors`, all overrides, and the C API in `IndexIVFFlat_c.{h,cpp}` — out of scope here, and a separate diff if desired.
- The `IDSelectorBitmap` negative-id guard was redundant: per `[conv.integral]` the existing `uint64_t i = ii;` for negative `ii` produces a value in `[2^63, 2^64)`, so `i >> 3 >= 2^60`, which is unconditionally `>= n` for any physically realizable bitmap (`n` is bounded far below 2^60 by addressable memory). The pre-existing `(i >> 3) >= n` check already handles the case correctly.

Reviewed By: mnorris11

Differential Revision: D101801522

fbshipit-source-id: 719d6dcc26ece5faf0dfb927e4639e322cf1a6fd
…ch#5134)

Summary:
Pull Request resolved: facebookresearch#5134

Expand DD test coverage by applying for_all_simd_levels to existing test
classes that exercise DD-dispatched code paths but were previously pinned
to a single SIMD level.

This is the "mega decorator" diff from the DD test coverage gaps plan --
pure decorator additions, no new test logic. Follow-up diffs add new
test files and numerical cross-level assertions for gaps the decorator
alone cannot close.

Classes decorated (grouped by area):

* Binary Hamming non-IVF: TestRange and TestKnn in
  test_binary_hashindex.py; TestBinarySearchParams in
  test_binary_search_params.py; TestIndexBinaryFromFloat in
  test_index_binary_from_float.py; TestSpectralHash in
  test_index_accuracy.py.

* IVFPQ / search: EvalIVFPQAccuracy in test_index.py;
  TestSelector and TestSearchParams in test_search_params.py.

* Flat / refine / Panorama: TestIndexFlat, TestIndexFlatL2,
  TestIndexFlatL2Panorama, TestScalarQuantizer in test_index.py;
  TestDistanceComputer, TestIndexRefineSearchParams,
  TestIndexRefineRangeSearch in test_refine.py;
  TestIndexRefinePanorama, TestIndexFlatPanorama,
  TestIndexHNSWFlatPanorama, TestIndexIVFFlatPanorama in their
  respective files.

* Quantizer encode: TestResidualQuantizer,
  TestIndexResidualQuantizerSearch in test_residual_quantizer.py;
  TestComponents, TestLocalSearchQuantizer in
  test_local_search_quantizer.py.

* Fast scan: TestFastScanFiltering, TestBlockSkipConsistency,
  TestFastScanRangeSearchFilter in test_fastscan_filter.py.

* Broader index tests (search-exercising classes only):
  TestParameterSpace in test_autotune.py; TestSpectralHash in
  test_factory.py; TestMerge1, TestMerge2, TestRemoveFastScan in
  test_merge_index.py.

BUCK changes move decorated tests from the legacy-listing lists to the
simd_levels lists (which use supports_static_listing = False, required
for dynamic class name generation), and add the new entries
test_binary_search_params, test_fastscan_filter, test_refine_panorama,
test_hnsw_panorama.

___

overriding_review_checks_triggers_an_audit_and_retroactive_review
Oncall Short Name: fair_umami_cluster

Differential Revision: D101822335

fbshipit-source-id: 9a432c6d6bee201c3731713226405a8c8ecebbe6
Summary:
_Note: Should be merged before facebookresearch#4970 (IVFPQPanorama)._

## Changes
### Performance

This PR implements various optimizations to Panorama (L2Flat and IVFFlat).
1. Disaggregate distance computation from pruning decisions to avoid branches in distance computation hotpath.
2. Early batch processing termination when no points are remaining.
3. Manually unrolled distance kernel.
4. Template distance computation on level width for autovectorization.
5. `if constexpr (C::is_max)` instead of `C::cmp` for autovectorized pruning.
6. Byteset for vectorized compacting of active indices using `_pext_u64` (see the sketch after this list).
7. Template distance computation and pruning on first level (no `active_indices` indirection) to let it autovectorize.
8. Hoist buffer allocations into `IndexFlat`/`IVFFlatScannerPanorama`.
9. Expose `batch_size` as a parameter for IVFFlatPanorama (for consistency with `IndexFlatPanorama`, but also because a `batch_size` of 1024 can improve performance).
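
As a concrete illustration of item 6, here is a minimal sketch of byteset compaction, assuming each lane's byte is 0xFF (keep) or 0x00 (prune); names are illustrative, not the PR's actual kernel:

```cpp
#include <immintrin.h>

#include <cstdint>
#include <cstring>

// _pext_u64 gathers the bits of `identity` selected by the byte mask, so
// whole bytes of lane indices are packed to the front. Requires BMI2.
static inline int compact_active8(const uint8_t keep[8], uint8_t out[8]) {
    uint64_t mask;
    std::memcpy(&mask, keep, 8);
    const uint64_t identity = 0x0706050403020100ULL; // lane indices 0..7
    uint64_t packed = _pext_u64(identity, mask);
    std::memcpy(out, &packed, 8);
    return __builtin_popcountll(mask) / 8; // number of surviving lanes
}
```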

### Other

 - Define `kDefaultBatchSize` once in `Panorama.h` (previously defined in 5 separate locations).
 - Allow `bench_flat_l2_panorama.py` and `bench_ivf_flat_panorama.py` to accept `gist1M` or `sift1M` as dataset to bench on.

## Results

Together, these optimizations yield substantial additional speedups, especially on lower-dimensional datasets like SIFT (128d), by sharply reducing Panorama's overhead:

**GIST1M (IVF128, nlist=128, nlevels=16)**
| nprobe | Recall@10 | Old Speedup | New Speedup | _Additional_ Speedup |
|--------|-----------|-------------------------|-------------------------|--------------------|
|      1 | 0.1439    |                    3.92x |                    3.93x |               1.00x |
|      2 | 0.2605    |                    4.71x |                    5.19x |               1.10x |
|      4 | 0.4369    |                    5.53x |                    6.75x |               1.22x |
|      8 | 0.6470    |                    6.37x |                    8.21x |               1.29x |
|     16 | 0.8780    |                    7.30x |                    9.74x |               1.33x |
|     32 | 0.9764    |                    8.33x |                   11.29x |               1.36x |
|     64 | 0.9868    |                    9.38x |                   12.74x |               1.36x |

**SIFT1M (IVF128, nlist=128, nlevels=8)**
| nprobe | Recall@10 | Old Speedup | New Speedup | _Additional_ Speedup |
|--------|-----------|-------------------------|-------------------------|--------------------|
|      1 | 0.2678    |                    1.20x |                    1.81x |               1.52x |
|      2 | 0.4584    |                    1.38x |                    2.23x |               1.62x |
|      4 | 0.6855    |                    1.59x |                    2.70x |               1.70x |
|      8 | 0.8760    |                    1.83x |                    3.44x |               1.88x |
|     16 | 0.9679    |                    2.11x |                    4.72x |               2.24x |
|     32 | 0.9855    |                    2.44x |                    5.61x |               2.30x |
|     64 | 0.9861    |                    2.74x |                    6.39x |               2.33x |

### Raw Data

Collected by running the new benches on `main` and this branch. On main, you cannot specify `batch_size` so remove the `{1024}` from the factory string in the new benches to run them there. The results above are calculated from the following raw data as follows:
1. For each experiment (e.g., GIST old or SIFT new), compute the Panorama speedup at each `nprobe` as (original ms per query) / (Panorama ms per query).
2. For each pairing of (old) and (new) results, compute the additional speedup as (new speedup) / (old speedup).

#### Before (`main`)

GIST1M:
```
======IVF128,Flat
	nprobe   1, Recall@10: 0.145200, speed: 2.705442 ms/query, dims scanned: 100.00%
	nprobe   2, Recall@10: 0.260800, speed: 5.456891 ms/query, dims scanned: 100.00%
	nprobe   4, Recall@10: 0.441900, speed: 10.895120 ms/query, dims scanned: 100.00%
	nprobe   8, Recall@10: 0.648200, speed: 21.676788 ms/query, dims scanned: 100.00%
	nprobe  16, Recall@10: 0.878000, speed: 43.142261 ms/query, dims scanned: 100.00%
	nprobe  32, Recall@10: 0.975400, speed: 84.498397 ms/query, dims scanned: 100.00%
	nprobe  64, Recall@10: 0.986800, speed: 160.092644 ms/query, dims scanned: 100.00%
======PCA960,IVF128,FlatPanorama16
	nprobe   1, Recall@10: 0.143900, speed: 0.689507 ms/query, dims scanned: 12.96%
	nprobe   2, Recall@10: 0.260500, speed: 1.158416 ms/query, dims scanned: 11.18%
	nprobe   4, Recall@10: 0.436900, speed: 1.968814 ms/query, dims scanned: 9.90%
	nprobe   8, Recall@10: 0.647000, speed: 3.401469 ms/query, dims scanned: 8.91%
	nprobe  16, Recall@10: 0.878000, speed: 5.912757 ms/query, dims scanned: 8.10%
	nprobe  32, Recall@10: 0.976400, speed: 10.147847 ms/query, dims scanned: 7.44%
	nprobe  64, Recall@10: 0.986800, speed: 17.074573 ms/query, dims scanned: 6.93%
```

SIFT1M:

```
======IVF128,Flat
	nprobe   1, Recall@10: 0.267480, speed: 0.285990 ms/query, dims scanned: 100.00%
	nprobe   2, Recall@10: 0.457520, speed: 0.564067 ms/query, dims scanned: 100.00%
	nprobe   4, Recall@10: 0.685320, speed: 1.111833 ms/query, dims scanned: 100.00%
	nprobe   8, Recall@10: 0.877210, speed: 2.195088 ms/query, dims scanned: 100.00%
	nprobe  16, Recall@10: 0.967730, speed: 4.338444 ms/query, dims scanned: 100.00%
	nprobe  32, Recall@10: 0.985400, speed: 8.500538 ms/query, dims scanned: 100.00%
	nprobe  64, Recall@10: 0.986100, speed: 16.349893 ms/query, dims scanned: 100.00%
======PCA128,IVF128,FlatPanorama8
	nprobe   1, Recall@10: 0.267670, speed: 0.239243 ms/query, dims scanned: 27.97%
	nprobe   2, Recall@10: 0.458320, speed: 0.408590 ms/query, dims scanned: 24.42%
	nprobe   4, Recall@10: 0.685480, speed: 0.699694 ms/query, dims scanned: 21.50%
	nprobe   8, Recall@10: 0.875930, speed: 1.197310 ms/query, dims scanned: 19.06%
	nprobe  16, Recall@10: 0.967760, speed: 2.055968 ms/query, dims scanned: 16.98%
	nprobe  32, Recall@10: 0.985370, speed: 3.481555 ms/query, dims scanned: 15.26%
	nprobe  64, Recall@10: 0.985980, speed: 5.977346 ms/query, dims scanned: 14.02%
```

#### After (`optimize-pano`)

GIST1M:
```
======IVF128,Flat
	nprobe   1, Recall@10: 0.145200, speed: 2.625779 ms/query, dims scanned: 100.00%
	nprobe   2, Recall@10: 0.260800, speed: 5.285007 ms/query, dims scanned: 100.00%
	nprobe   4, Recall@10: 0.441900, speed: 10.555867 ms/query, dims scanned: 100.00%
	nprobe   8, Recall@10: 0.648200, speed: 21.012494 ms/query, dims scanned: 100.00%
	nprobe  16, Recall@10: 0.878000, speed: 41.794143 ms/query, dims scanned: 100.00%
	nprobe  32, Recall@10: 0.975400, speed: 81.865038 ms/query, dims scanned: 100.00%
	nprobe  64, Recall@10: 0.986800, speed: 155.067333 ms/query, dims scanned: 100.00%
======PCA960,IVF128,FlatPanorama16_1024
	nprobe   1, Recall@10: 0.143900, speed: 0.668800 ms/query, dims scanned: 20.33%
	nprobe   2, Recall@10: 0.260500, speed: 1.018440 ms/query, dims scanned: 14.81%
	nprobe   4, Recall@10: 0.436900, speed: 1.563622 ms/query, dims scanned: 11.72%
	nprobe   8, Recall@10: 0.647000, speed: 2.557981 ms/query, dims scanned: 9.82%
	nprobe  16, Recall@10: 0.878000, speed: 4.292616 ms/query, dims scanned: 8.56%
	nprobe  32, Recall@10: 0.976400, speed: 7.248832 ms/query, dims scanned: 7.68%
	nprobe  64, Recall@10: 0.986800, speed: 12.171319 ms/query, dims scanned: 7.06%
```

SIFT1M:

```
======IVF128,Flat
        nprobe   1, Recall@10: 0.267480, speed: 0.295904 ms/query, dims scanned: 100.00%
        nprobe   2, Recall@10: 0.457520, speed: 0.583204 ms/query, dims scanned: 100.00%
        nprobe   4, Recall@10: 0.685320, speed: 1.150055 ms/query, dims scanned: 100.00%
        nprobe   8, Recall@10: 0.877210, speed: 2.425575 ms/query, dims scanned: 100.00%
        nprobe  16, Recall@10: 0.967730, speed: 5.509365 ms/query, dims scanned: 100.00%
        nprobe  32, Recall@10: 0.985400, speed: 10.794491 ms/query, dims scanned: 100.00%
        nprobe  64, Recall@10: 0.986100, speed: 20.727924 ms/query, dims scanned: 100.00%
======PCA128,IVF128,FlatPanorama8_1024
        nprobe   1, Recall@10: 0.267750, speed: 0.163266 ms/query, dims scanned: 34.97%
        nprobe   2, Recall@10: 0.458370, speed: 0.261109 ms/query, dims scanned: 27.97%
        nprobe   4, Recall@10: 0.685540, speed: 0.425977 ms/query, dims scanned: 23.30%
        nprobe   8, Recall@10: 0.875990, speed: 0.704580 ms/query, dims scanned: 19.98%
        nprobe  16, Recall@10: 0.967860, speed: 1.167465 ms/query, dims scanned: 17.45%
        nprobe  32, Recall@10: 0.985470, speed: 1.925296 ms/query, dims scanned: 15.50%
        nprobe  64, Recall@10: 0.986080, speed: 3.245793 ms/query, dims scanned: 14.14%
```

Pull Request resolved: facebookresearch#5041

Reviewed By: alibeklfc

Differential Revision: D101753364

Pulled By: mnorris11

fbshipit-source-id: e6da1aa05e465e83632239bc69548bf8f5353d49
…acebookresearch#5138)

Summary:
Pull Request resolved: facebookresearch#5138

## Summary

This diff fixes several bugs and memory safety issues in VectorTransform.cpp:

### 1. Bug fix: Wrong beta parameter in PCAMatrix::train (sgemm_ call)
In the code path where n < d_in (Gram matrix approach), the sgemm_ call that computes
`PCAMat = xc * gram` incorrectly uses beta=1.0 instead of beta=0.0. This means the
computation is actually `PCAMat = xc * gram + PCAMat` instead of `PCAMat = xc * gram`.

On the first training call this works by accident because std::vector::resize
zero-initializes new elements. However, if PCAMatrix::train() is called a second time
(e.g., retraining with different data), PCAMat retains stale values from the previous
training, corrupting the PCA matrix and producing incorrect dimensionality reduction results.
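
A toy reference gemm (not the faiss call) to make the beta semantics explicit; with stale values left in C from a previous train() and beta = 1, the result silently accumulates, while beta = 0 overwrites C as retraining requires:

```cpp
#include <vector>

// C = alpha * A * B + beta * C, all matrices row-major.
void gemm_ref(int m, int n, int k, float alpha,
              const std::vector<float>& A,  // m x k
              const std::vector<float>& B,  // k x n
              float beta, std::vector<float>& C) {  // m x n
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```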

### 2. Memory leak fix: eig() function
Replaced raw `new double[]` with `std::vector<double>` for the LAPACK workspace buffer.
The old code would leak memory if dsyev_ threw an exception.

### 3. Memory leak fix: LinearTransform::transform_transpose()
Replaced raw `new float[]` with `std::vector<float>` for the bias-corrected buffer.
The old code would leak memory if sgemm_ threw an exception between allocation and
the manual delete[].

### 4. Missing error check: OPQ SVD workspace query
Added FAISS_THROW_IF_NOT_FMT check after the sgesvd_ workspace query in OPQMatrix::train().
Previously, if the workspace query failed, the returned workspace size would be garbage,
leading to either a crash or silent data corruption in the subsequent SVD computation.

Reviewed By: junjieqi, mnorris11

Differential Revision: D101975473

fbshipit-source-id: 57e74d8cc55d119bfee99f164caaf4d64b08a7ce
…ontrib.TestBigBatchSearch) (facebookresearch#5139)

Summary: Pull Request resolved: facebookresearch#5139

Reviewed By: junjieqi

Differential Revision: D101979704

fbshipit-source-id: b3b1575fd3431dca7d3b5e7ec86e4009810095fb
Summary:
Pull Request resolved: facebookresearch#5093

Fix remaining miscellaneous lint warnings across 10 files:
- `facebook-hte-MultTypeDeclaration`: Split mixed-type declaration in AutoTune.cpp
- `facebook-hte-IdenticalOperands`: Rename variable in build.cpp to avoid false positive
- `facebook-hte-BadImplicitCast`: Add explicit cast in Index.cpp
- `performance-inefficient-vector-operation`: Add reserve() in IndexBinaryIVF.cpp
- `performance-for-range-copy`: Use const reference in IndexBinaryHash.cpp range-for
- `facebook-hte-UnassignedReleasedUniquePointer`: Capture release() results in IVFlib.cpp, IndexPreTransform.cpp
- `facebook-hte-UnqualifiedCall-sqrt`: Use std::sqrt() in MatrixStats.cpp
- `facebook-unused-include-check`: Remove unused includes in IndexIVF.cpp, IndexNNDescent.cpp, IndexNSG.cpp
- `clang-diagnostic-switch-enum`: Add missing enum cases in IndexAdditiveQuantizer.cpp, IndexIVFAdditiveQuantizer.cpp

Reviewed By: pankajsingh88

Differential Revision: D100592786

fbshipit-source-id: 1d324e30d79e967c345737bae6991e9a443622ee
…search#5091)

Summary:
Pull Request resolved: facebookresearch#5091

Fix 172 `clang-diagnostic-shorten-64-to-32` lint warnings across 29 files by adding explicit `static_cast<int>()` or widening variable types where `size_t`/`idx_t` (64-bit) values were implicitly narrowed to `int`/`int32_t` (32-bit).

The fixes fall into two categories:
- **Explicit casts**: Where the receiving API requires `int` and the value is known to fit (e.g., vector dimensions, sub-quantizer counts, cluster counts, BLAS parameters)
- **Type widening**: Where the variable was unnecessarily narrow (e.g., `int nprobe` → `size_t nprobe`, `int list_no` → `size_t list_no`)

Reviewed By: limqiying

Differential Revision: D100588996

fbshipit-source-id: 153fe3d557a102c5adb7831915c1b3c8cecae22b
Summary:
Pull Request resolved: facebookresearch#5140

Full logs: P2284358998

## TL;DR: Fix vs. no-fix: is the improvement consistent? Yes; applying the fix is better than doing nothing.

  Net result: faster overall. 26/60 configs improve by >5%, only 4 regress by >5%.

  ┌───────────────────────────────┬─────────────┬────────────┐
  │       Worst regressions       │ ms increase │ % increase │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=100, bs=10000, np=16, k=100 │ +10.0 ms    │ +8.4%      │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=16, k=100  │ +6.3 ms     │ +7.7%      │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=64, k=100  │ +10.2 ms    │ +6.2%      │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=100, bs=1000, np=32, k=100  │ +11.8 ms    │ +5.2%      │
  └───────────────────────────────┴─────────────┴────────────┘

  ┌────────────────────────────────┬───────────┬──────────┐
  │       Best improvements        │ ms saved  │ % faster │
  ├────────────────────────────────┼───────────┼──────────┤
  │ M=50, bs=1000, np=16, k=100    │ -35.8 ms  │ -22.7%   │
  ├────────────────────────────────┼───────────┼──────────┤
  │ M=100, bs=10000, np=256, k=500 │ -359.8 ms │ -10.7%   │
  ├────────────────────────────────┼───────────┼──────────┤
  │ M=100, bs=1000, np=128, k=1000 │ -240.9 ms │ -10.4%   │
  └────────────────────────────────┴───────────┴──────────┘

  No-fix comparison: Faiss D101399711 (migration only, no fix) vs. baseline

  More regressions: 14/60 configs regress >5%, 2 exceed 10%.

  ┌───────────────────────────────┬─────────────┬────────────┐
  │       Worst regressions       │ ms increase │ % increase │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=32, k=1000 │ +110.5 ms   │ +10.6%     │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=16, k=1000 │ +68.3 ms    │ +10.1%     │
  ├───────────────────────────────┼─────────────┼────────────┤
  │ M=50, bs=10000, np=64, k=1000 │ +147.7 ms   │ +9.2%      │
  └───────────────────────────────┴─────────────┴────────────┘

## what about variance?

 Variance summary (see mnorris11 notes, I don't agree with everything)

  The search is mostly stable, with occasional single-run outliers.

  ┌─────────────────────┬───────────────────────────────────┐
  │       Metric        │               Value               │
  ├─────────────────────┼───────────────────────────────────┤
  │ Median CV           │ 3.3%                              │
  ├─────────────────────┼───────────────────────────────────┤
  │ 90th percentile CV  │ 9.8%                              │
  ├─────────────────────┼───────────────────────────────────┤
  │ Most stable configs │ bs=1000, nprobe≥64: CV 0.3-1.7%   │
  ├─────────────────────┼───────────────────────────────────┤
  │ Noisiest configs    │ bs=10000, nprobe=16: CV up to 39% │
  └─────────────────────┴───────────────────────────────────┘

  What drives variance: absolute runtime. Sub-200ms configs (low nprobe, k=100) are noisy because GPU
  scheduling jitter is proportionally large. Configs >1 second are rock-solid (CV < 2%).

  Outlier pattern: Every high-CV config is caused by a single extreme outlier (e.g., one run at 1124ms
   when 49 others are at 120-155ms), not general measurement noise. Using median instead of mean would
   remove these. The median values in the results file should be more reliable for these fast configs.

  Are the cross-trial differences real?

  - For stable configs (bs=1000, nprobe≥64, CV < 2%): the 8-11% improvements at high nprobe are
  definitely real — intra-trial noise is well below 2%, so a 10% shift is ~5 standard errors.
  - For noisy configs (bs=10000, nprobe=16, CV 5-15%): the 5-8% regressions are borderline — they're
  statistically detectable with 50 runs, but could partly reflect system-state differences between
  trials (GPU thermals, background load) rather than code changes. **[mnorris11 note: this explanation
  sounds like BS, there was nothing else going on in the GPU...]**
  - To be fully confident about the small regressions at low nprobe, you'd want same-machine A/B
  testing (build both versions, alternate runs within one script). **[mnorris11 note: this is nonsense,
  this is what we did?]**

Reviewed By: weidbd2025

Differential Revision: D101500046

fbshipit-source-id: 9a2759a64a3ba8e398b6b36ca40e31cf6aaa5ba0
Summary:
Pull Request resolved: facebookresearch#5127

Attempt to enable FAISS dynamic dispatch (DD) mode on Windows/MSVC.

Changes:
- CMakeLists.txt: Remove if(NOT WIN32) guard from DD section, add MSVC per-file
  SIMD flags (/arch:AVX2, /arch:AVX512, /bigobj) alongside existing GCC/Clang flags
- build-pull-request.yml: Add windows-x86_64-DD-cmake job that runs immediately
  (no dependency on linux-x86_64-cmake), builds with MSVC and FAISS_OPT_LEVEL=dd,
  runs C++ tests and Python tests

This diff is expected to fail on Windows due to MSVC requiring explicit template
specialization declarations (C++ §17.8.3) which GCC/Clang don't enforce. The CI
failure will surface the exact errors to guide the fix.

UPDATE: in fact, MSVC does not seem to require this. After fixing build scripts and a few classic Windows errors, the C++ and Python tests pass.

Reviewed By: algoriddle

Differential Revision: D101649751

fbshipit-source-id: 765ef8483d02652ce58625cd22baab8870acf718
…kresearch#5143)

Summary:
Pull Request resolved: facebookresearch#5143

This diff fixes four bugs in `IndexRefine.cpp` and adds a regression test.

**Bug 1 (critical): `sa_decode` reads wrong bytes**

`IndexRefine::sa_encode` writes each vector's codes as `[base_codes (cs1 bytes) | refine_codes (cs2 bytes)]` (lines 197-199). The encode writes base codes at offset 0 and refine codes at offset `cs1`:

```
memcpy(b, tmp1.get() + cs1 * i, cs1);        // base at b+0
memcpy(b + cs1, tmp2.get() + cs2 * i, cs2);  // refine at b+cs1
```

`sa_decode` must extract the refine portion to pass to `refine_index->sa_decode`. The old code read from `bytes + i * (cs1 + cs2)` (offset 0), which extracted the base codes instead of the refine codes. The fix adds `+ cs1` to skip past the base codes:

```
// Old (wrong): memcpy(..., bytes + i * (cs1 + cs2),       cs2);
// New (fixed): memcpy(..., bytes + i * (cs1 + cs2) + cs1, cs2);
```

This mirrors the write offset in `sa_encode` line 199. Without this fix, `sa_decode` silently produces wrong reconstructions by feeding base-index codes to the refine decoder.
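
A hedged sketch of the corrected extraction, with names (`cs1`, `cs2`) following this description rather than the exact source:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

using idx_t = int64_t;

// Copy the refine portion of each packed [base | refine] code into a
// contiguous buffer before handing it to the refine decoder.
void extract_refine_codes(
        idx_t n,
        size_t cs1,
        size_t cs2,
        const uint8_t* bytes,
        uint8_t* refine_codes) {
    for (idx_t i = 0; i < n; i++) {
        std::memcpy(refine_codes + i * cs2,
                    bytes + i * (cs1 + cs2) + cs1, // + cs1 skips base codes
                    cs2);
    }
}
```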

**Bug 2: `int` loop variable causes overflow with large inputs**

Three validation loops used `int i` as the loop counter:
```
for (int i = 0; i < n * k_base; i++)
```

Both `n` and `k_base` are `idx_t` (int64_t), so `n * k_base` can exceed `INT_MAX` (2^31 - 1). When the `int` counter reaches `INT_MAX`, incrementing it is signed integer overflow (undefined behavior). In practice this causes an infinite loop or out-of-bounds access. Changed to `idx_t i` in all three search methods: `IndexRefine::search`, `IndexRefineFlat::search`, and `IndexRefinePanorama::search`.

**Bug 3: Wrong class name in error message**

`IndexRefinePanorama::search` had the error message `"IndexRefineFlat params have incorrect type"` -- a copy-paste error from `IndexRefineFlat::search`. Fixed to `"IndexRefinePanorama params have incorrect type"`.

**Bug 4: Missing overflow guard on `n * k_base` allocation**

The product `n * k_base` is used in `new idx_t[n * k_base]` before the loop. If the product overflows int64_t, it could wrap to a small positive value, causing a too-small allocation followed by out-of-bounds writes from `base_index->search`. Added `FAISS_THROW_IF_NOT_MSG(n <= INT64_MAX / k_base, ...)` before the allocation in all three search methods. Division by zero is impossible because `k_base >= k > 0` is checked earlier.
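
A minimal sketch of the guard, assuming the names from this description; `k_base >= k > 0` is validated earlier, so the division is safe:

```cpp
#include <faiss/MetricType.h>
#include <faiss/impl/FaissAssert.h>

#include <limits>
#include <memory>

void alloc_base_results(faiss::idx_t n, faiss::idx_t k_base) {
    // Reject n before the product n * k_base can wrap int64_t.
    FAISS_THROW_IF_NOT_MSG(
            n <= std::numeric_limits<faiss::idx_t>::max() / k_base,
            "n * k_base would overflow int64_t");
    std::unique_ptr<faiss::idx_t[]> base_labels(
            new faiss::idx_t[n * k_base]);
    // ... base_index->search(...) fills base_labels ...
}
```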

**Cleanup: Redundant `sa_code_size()` call**

The allocation in `sa_decode` called `refine_index->sa_code_size()` a second time instead of using the already-computed `cs2`. Replaced with `cs2`.

Reviewed By: junjieqi

Differential Revision: D101903134

fbshipit-source-id: 37848d280d50447216ec3c76c598c3e212ea0971
- Fix double negatives: use FAISS_THROW_IF_MSG(is_static, ...) instead of
  dynamic_impl()
- Make LVQ default storage consistent: both default and parameterized
  constructors now use SVS_LVQ4x0 for IndexSVSIVFLVQ and IndexSVSVamanaLVQ
- Document intra_query_threads limitation: must be set before train() or
  deserialize_impl(); runtime changes not yet supported by SVS runtime API
- Fix is_lvq_leanvec_enabled() to check both IVFIndex and DynamicIVFIndex
  storage kind availability
Adds IVFIntraQueryThreadsSetBeforeTrain test that demonstrates the
supported usage pattern: set intra_query_threads before train(), not
after. Changes after index creation are silently ignored due to a
current SVS runtime API limitation.
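
A hedged usage sketch of that pattern; the constructor arguments and header path are illustrative assumptions, and only the ordering constraint is the point:

```cpp
#include <faiss/svs/IndexSVSIVFLVQ.h> // assumed header path

#include <cstddef>

void build(size_t n, const float* xb) {
    faiss::IndexSVSIVFLVQ index(/*d=*/128, /*nlist=*/1024); // assumed ctor
    index.intra_query_threads = 4; // honored: set before train()
    index.train(n, xb);
    index.add(n, xb);
    index.intra_query_threads = 8; // silently ignored after creation
}
```
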
Document that IVF search-time ID filtering is a pending item in the
SVS runtime — IVFIndex::search() does not yet accept an IDFilter
parameter (unlike VamanaIndex::search()). Once exposed, it can be
wired up using the same make_faiss_id_filter() pattern as Vamana.
ibhati pushed a commit that referenced this pull request May 7, 2026
…ult handlers (facebookresearch#5185)

Summary:
Pull Request resolved: facebookresearch#5185

Three sequential post-BLAS / end_multiple loops in faiss were leaving OMP threads idle while a single thread did all the work. Each is parallelized with `#pragma omp parallel for schedule(static)`, gated by an `if (...)` clause to avoid spawn-cost regressions on small workloads.

**Changes**

1. `exhaustive_L2sqr_blas_cmax` (AVX2 + ARM SVE): after `sgemm_` completes, the per-query result accumulation loop ran single-threaded while all OMP threads were idle. Each query `i` reads a distinct row of `ip_block` and writes to `dis_tab[i]/ids_tab[i]` — no cross-query dependencies. Added `#pragma omp parallel for schedule(static) if ((i1 - i0) >= 16)` to both ISA specializations.

2. `HeapBlockResultHandler::end_multiple`: `heap_reorder` is O(k log k) per query and was sequential. The original author left a `// maybe parallel for` comment. `add_results` in the same class already has `#pragma omp parallel for`; `end_multiple` was the only remaining sequential step. Gate: `if ((i1 - i0) * k >= 1024)`.

3. `ReservoirBlockResultHandler::end_multiple`: same pattern — reservoir `to_result` (partial sort, O(capacity)) was sequential despite `add_results` being parallelized. `// maybe parallel for` comment removed and replaced with the actual pragma. Gate: `if ((i1 - i0) * this->k >= 1024)`.

The `if (...)` thresholds were chosen from microbenchmark data: below the threshold, OMP fanout cost exceeds the work, producing 3-6× regressions on small batches. Above the threshold, parallelization yields 9-14× speedups at 16 threads. Data independence verified for all three: each loop iteration operates on a disjoint slice of `dis_tab`/`ids_tab` indexed by query `i`.
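
A minimal sketch of the gating pattern from change 2, assuming the flat `heap_dis_tab`/`heap_ids_tab` layout; the wrapper function is hypothetical, but `faiss::heap_reorder` is the real helper:

```cpp
#include <faiss/utils/Heap.h>

#include <cstddef>
#include <cstdint>

// The if() clause keeps the loop serial below the threshold, so small
// batches pay no thread-fanout cost; each iteration touches only its own
// k-sized slice, so iterations are independent.
template <class C>
void reorder_range(
        int64_t i0,
        int64_t i1,
        size_t k,
        typename C::T* heap_dis_tab,
        typename C::TI* heap_ids_tab) {
#pragma omp parallel for schedule(static) if ((i1 - i0) * k >= 1024)
    for (int64_t i = i0; i < i1; i++) {
        faiss::heap_reorder<C>(k, heap_dis_tab + i * k, heap_ids_tab + i * k);
    }
}
```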

**Benchmark results**

A local microbench (not landed) was used for A/B measurement. Host: Intel Sapphire Rapids, 28 physical cores, AVX-512. Pinned with `taskset -c 0-15` (OMP=16) and `taskset -c 0` (OMP=1). Median of 5 reps. Synthetic uniform-random distance distributions.

`HeapBlockResultHandler::end_multiple` (us, lower better):

| nq    | k    | parent t=1 | this t=1 | parent t=16 | this t=16 | speedup t=16  |
|------:|-----:|-----------:|---------:|------------:|----------:|--------------:|
| 64    | 10   | 9.2        | 7.2      | 8.1         | 8.3       | 0.98× (gated) |
| 64    | 100  | 340        | 345      | 318         | 67        | 4.79×         |
| 64    | 1000 | 5,796      | 5,700    | 5,886       | 501       | 11.76×        |
| 512   | 100  | 2,811      | 2,769    | 2,677       | 312       | 8.59×         |
| 512   | 1000 | 46,109     | 46,070   | 45,758      | 3,778     | 12.11×        |
| 4096  | 100  | 22,041     | 21,588   | 21,672      | 1,869     | 11.60×        |
| 4096  | 1000 | 369,069    | 376,541  | 372,481     | 25,442    | 14.64×        |

`ReservoirBlockResultHandler::end_multiple` (us):

| nq    | k    | parent t=16 | this t=16 | speedup       |
|------:|-----:|------------:|----------:|--------------:|
| 64    | 10   | 18.0        | 18.1      | 0.99× (gated) |
| 64    | 100  | 659         | 96        | 6.86×         |
| 64    | 1000 | 7,592       | 553       | 13.73×        |
| 512   | 100  | 5,498       | 490       | 11.21×        |
| 512   | 1000 | 59,548      | 4,677     | 12.73×        |
| 4096  | 100  | 44,064      | 3,230     | 13.64×        |
| 4096  | 1000 | 476,388     | 32,237    | 14.78×        |

`IndexFlatL2::search` end-to-end — drives `exhaustive_L2sqr_blas_cmax` (ms):

| nb    | nq    | k   | parent t=16 | this t=16 | speedup |
|------:|------:|----:|------------:|----------:|--------:|
| 1024  | 1024  | 10  | 1.71        | 1.45      | 1.18×   |
| 1024  | 4096  | 100 | 58.5        | 15.5      | 3.78×   |
| 4096  | 4096  | 100 | 76.9        | 39.4      | 1.95×   |

Single-threaded paths (OMP=1) are within ±5% of parent across all configurations — the `if (...)` clause makes the pragma a no-op below the threshold, eliminating overhead for serial callers.

Caveats: the `IndexFlatL2::search` numbers measure the full search path, so the speedup attributed to change #1 also includes contributions from change #2 (heap handler, also called by this path). The `end_multiple` numbers isolate the changed function via `PauseTiming`/`ResumeTiming` around setup. ARM SVE not measured directly (no Graviton host); the AVX2 numbers are the strongest available proxy.

Reviewed By: mnorris11

Differential Revision: D103830810

fbshipit-source-id: 8434fa6f16b78c5ff7b2244ac5d5fe9cc8c012a5