Cache-Aware Block-Transposed Chamfer/MaxSim Distance for f32 and f16 #863
suri-kumkaran wants to merge 18 commits into main from
Conversation
hildebrandmw
left a comment
There was a problem hiding this comment.
This is on the right track, but there are some claims made that I don't think are fully backed up yet.
First, this is not exactly type agnostic. While an implementation does exist for f32, one does not exist for another data type as proof of generality. It would be nice to see at least an f16 implementation, which requires functionality not present in this PR (i.e., lazily unpacking f16 panels into f32 panels before entering micropanel loops to hoist the conversion out of the micro-kernel).
It would also be nice to see this used to implement row-major x row-major kernels again with packing on each tile load. The packing algorithms need not be super optimized (SIMD shuffles can be added later), but it would be great to see that this is possible within the kernel abstraction.
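The row-major × row-major path is not in this PR, but the per-tile packing step it would need can be sketched without SIMD shuffles. A minimal, hypothetical `pack_panel` (the function name, signature, and GROUP handling are illustrative, not from the PR) that repacks a row-major slab into the depth-major order a block-transposed micro-kernel would consume:

```rust
// Hypothetical sketch: pack a GROUP-row panel of a row-major matrix
// (rows x k, f32) into block-transposed order: for each depth step `p`,
// GROUP consecutive values taken from GROUP different rows.
fn pack_panel(src: &[f32], k: usize, row0: usize, group: usize, dst: &mut Vec<f32>) {
    dst.clear();
    for p in 0..k {
        for r in 0..group {
            dst.push(src[(row0 + r) * k + p]);
        }
    }
}

fn main() {
    // 4 rows x 3 cols, row-major: element (r, c) = r*10 + c
    let src: Vec<f32> = (0..4)
        .flat_map(|r| (0..3).map(move |c| (r * 10 + c) as f32))
        .collect();
    let mut panel = Vec::new();
    pack_panel(&src, 3, 0, 4, &mut panel);
    // depth step 0 holds column 0 of rows 0..4
    assert_eq!(&panel[0..4], &[0.0, 10.0, 20.0, 30.0]);
    println!("{:?}", panel);
}
```

A SIMD shuffle version would only change the inner copy, not the layout contract, which is why the packing can stay unoptimized at first.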
Second, we really should make kernel implementations micro-architecture aware from the get-go. Passing around diskann_wide::arch::Current isn't super helpful as that type is always available. Rather, we should parameterize something (perhaps at the trait level) on the Architecture to enable a clean extension point for AVX-512, Neon, etc..
Finally, since the original experimentation for this contained implementations for 8-bit integers, showing that this abstraction layer works there too would make a strong case for the abstraction.
Thanks for the thorough and insightful feedback — addressed in the latest push.
- f16 implementation: Done.
- Architecture parameterization: …
- Row-major × row-major: Not in this PR, but the …
- 8-bit integers: Same story — an i8 kernel would dequantize in …

The goal of this PR is proving the abstraction with two concrete types (f32 identity + f16 lazy unpacking) sharing one tiling loop and micro-kernel. Remaining implementations are mechanical from here.
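The identity-versus-converting preparation idea can be illustrated outside the PR's actual `ConvertTo` trait. A toy sketch using `Cow`: `prepare_f32`, `prepare_i8`, and the `scale` parameter are hypothetical stand-ins, with i8 dequantization standing in for the lazy-unpack pattern, so one micro-kernel body consumes both inputs:

```rust
use std::borrow::Cow;

// f32 input: identity path, zero-cost borrow (no buffer, no copy).
fn prepare_f32(src: &[f32]) -> Cow<'_, [f32]> {
    Cow::Borrowed(src)
}

// i8 input: dequantize once per tile into an owned f32 buffer.
// `scale` is a hypothetical per-tile dequantization factor.
fn prepare_i8(src: &[i8], scale: f32) -> Cow<'static, [f32]> {
    Cow::Owned(src.iter().map(|&v| v as f32 * scale).collect())
}

// Stand-in for the shared micro-kernel body: it only ever sees f32.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let q = [1.0f32, 2.0, 3.0];
    let d_i8 = [2i8, 4, 6];
    let qa = prepare_f32(&q);
    let db = prepare_i8(&d_i8, 0.5); // dequantizes to [1.0, 2.0, 3.0]
    assert_eq!(dot(&qa, &db), 14.0);
    println!("{}", dot(&qa, &db));
}
```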
There was a problem hiding this comment.
Pull request overview
This PR introduces a new SIMD-accelerated, cache-tiled kernel framework for multi-vector MaxSim/Chamfer distance, targeting block-transposed query layouts and supporting both f32 and f16 (via staged f16→f32 preparation).
Changes:
- Added a new `distance::kernels` module with an unsafe `Kernel<A>` abstraction and a shared `tiled_reduce` 5-level cache tiling loop.
- Implemented `f32` (AVX2+FMA + scalar/Neon delegation) and `f16` (SIMD/scalar conversion + delegation to f32 micro-kernels) kernel families with correctness tests vs the fallback.
- Extended matrix and block-transposed utilities (`MatRef::as_matrix_view`, `BlockTransposedRef::available_rows`) and renamed the previous "simple" implementation to `fallback`.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| diskann-quantization/src/multi_vector/matrix.rs | Adds MatRef<Standard<T>>::as_slice() and as_matrix_view() plus a roundtrip test. |
| diskann-quantization/src/multi_vector/distance/mod.rs | Wires in fallback and new kernels module; updates re-exports and docs. |
| diskann-quantization/src/multi_vector/distance/fallback.rs | Renames Simple→Fallback kernel and disambiguates a test conversion. |
| diskann-quantization/src/multi_vector/block_transposed.rs | Adds available_rows() and validates it in tests. |
| diskann-quantization/src/multi_vector/distance/kernels/mod.rs | Introduces kernel framework module and cache-budget helpers. |
| diskann-quantization/src/multi_vector/distance/kernels/tiled_reduce.rs | Adds generic tiling loop + reduction helper trait + planner tests. |
| diskann-quantization/src/multi_vector/distance/kernels/f32/mod.rs | Adds f32 kernel entrypoint and MaxSim/Chamfer impls + tests. |
| diskann-quantization/src/multi_vector/distance/kernels/f32/v3.rs | Adds AVX2+FMA 16×4 microkernel and V4→V3 delegation. |
| diskann-quantization/src/multi_vector/distance/kernels/f32/scalar.rs | Adds scalar 8×2 microkernel and Neon→Scalar delegation. |
| diskann-quantization/src/multi_vector/distance/kernels/f16/mod.rs | Adds f16 kernel entrypoint and MaxSim/Chamfer impls + tests. |
| diskann-quantization/src/multi_vector/distance/kernels/f16/v3.rs | Adds SIMD f16→f32 prepare hooks and delegates to f32 V3 microkernel. |
| diskann-quantization/src/multi_vector/distance/kernels/f16/scalar.rs | Adds scalar f16→f32 prepare hooks and delegates to scalar f32 microkernel. |
arrayka
left a comment
There was a problem hiding this comment.
The PR description suggests that the existing simple kernel is slower than the proposed solution. Could you please support this claim with easy‑to‑reproduce benchmark results?
That would allow others to replicate the numbers to verify the performance claims.
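Absent the repository's own benchmark harness, a std-only timing sketch shows the shape such a reproduction could take. Sizes and data are arbitrary, and `maxsim_naive` is an illustrative flat nested loop, not the PR's fallback kernel:

```rust
use std::time::Instant;

// Naive flat MaxSim over a (qn x k) query and (dn x k) doc matrix:
// for each query row, take the max inner product over doc rows, then sum.
fn maxsim_naive(q: &[f32], d: &[f32], qn: usize, dn: usize, k: usize) -> f32 {
    (0..qn)
        .map(|i| {
            (0..dn)
                .map(|j| (0..k).map(|p| q[i * k + p] * d[j * k + p]).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    let (qn, dn, k) = (32, 128, 128);
    let q: Vec<f32> = (0..qn * k).map(|i| (i % 7) as f32).collect();
    let d: Vec<f32> = (0..dn * k).map(|i| (i % 5) as f32).collect();
    let t = Instant::now();
    let score = maxsim_naive(&q, &d, qn, dn, k);
    let dt = t.elapsed();
    assert!(score.is_finite());
    println!("score = {score}, {dt:?} for {} multiply-adds", qn * dn * k);
}
```

Running the same driver against both the fallback and the tiled kernel at a few realistic sizes (and printing elements per second) would make the speedup claim easy to check.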
…ides the GROUP const generic and arch token behind vtable
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Suryansh - this is getting close. The big thing I noticed is that the data preparation step is happening in the wrong place. It should be done at the tile level, not the panel level, to maximize the reuse of the preparation step.
```rust
a: *const Self::APrepared,
b: *const Self::BPrepared,
k: usize,
r: *mut f32,
```
There was a problem hiding this comment.
With this type of accumulator, are we over-fitting for Chamfer/MaxSim? What if we wanted to implement arg max for brute-force search? Also, in the case of u8/i8, we wouldn't want f32 as the result. We'd want u32/i32, right?
There was a problem hiding this comment.
As discussed offline, we’ll address this in a follow-up. I’ll spend some time thinking it through, outline an approach, and add the proposed direction to the PR description/comments.
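One possible shape for that follow-up, purely as a sketch (trait and type names here are hypothetical, not the PR's API): let the reduction name its own element and accumulator types, so arg-max and u32/i32 accumulation fall out of the same machinery as the f32 max-reduce.

```rust
// Hypothetical sketch: decouple the accumulator from f32 by letting the
// reduction declare its own element/accumulator types.
trait Reduction {
    type Elem;
    type Acc;
    fn identity() -> Self::Acc;
    fn step(acc: Self::Acc, value: Self::Elem, index: usize) -> Self::Acc;
}

// Chamfer/MaxSim-style max over f32 scores.
struct MaxVal;
impl Reduction for MaxVal {
    type Elem = f32;
    type Acc = f32;
    fn identity() -> f32 { f32::NEG_INFINITY }
    fn step(acc: f32, v: f32, _i: usize) -> f32 { acc.max(v) }
}

// Brute-force-search-style arg-max over integer scores (as an i8/u8
// kernel accumulating into i32 would want).
struct ArgMaxI32;
impl Reduction for ArgMaxI32 {
    type Elem = i32;
    type Acc = (i32, usize);
    fn identity() -> (i32, usize) { (i32::MIN, usize::MAX) }
    fn step(acc: (i32, usize), v: i32, i: usize) -> (i32, usize) {
        if v > acc.0 { (v, i) } else { acc }
    }
}

fn reduce<R: Reduction>(xs: &[R::Elem]) -> R::Acc
where
    R::Elem: Copy,
{
    xs.iter()
        .copied()
        .enumerate()
        .fold(R::identity(), |acc, (i, v)| R::step(acc, v, i))
}

fn main() {
    assert_eq!(reduce::<MaxVal>(&[1.0, 3.0, 2.0]), 3.0);
    assert_eq!(reduce::<ArgMaxI32>(&[5, 9, 7]), (9, 1));
    println!("ok");
}
```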
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Suryansh - this is getting there. In addition to the inline nits, I have some larger overall concerns:
Documentation: While documentation is good, verbose documentation that is redundant (e.g. repeats what the type signature already says), or that can be retrieved from rustdoc, the function name, or a quick glance at the code, is more harmful than helpful. Documentation of intent, invariants, surprises, etc. is great! But please look through the comments/docs and prune out the ones that are low signal.
Testing: Testing a lot of edge cases for matrix sizes is great. However, the test cases exercised here do not actually hit the main body of the loop nest — they all fall into the peeled section. This means that, as-is, one of the main loops in this PR is not being tested in any way. Please fix this. Unfortunately, matrices of a realistic dimension relative to the L1 and L2 cache sizes are needed to exercise these paths, which is slow and means Miri tests will be especially lethargic. The best way I see to remedy that is to make the cache sizes configurable or overridable (which we will need in any case). But this really needs coverage.
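One way to make the main loop testable without huge matrices is to thread the cache budgets through as a value. A toy sketch — `CacheBudget` and `tile_rows` are hypothetical names, with defaults mirroring the ~625 KB / ~36 KB figures in the PR description — where a test passes tiny budgets to force multiple tiles:

```rust
// Hypothetical sketch: a tile planner parameterized on cache budgets
// instead of hard-coded L1/L2 sizes, so tests can shrink them.
#[derive(Clone, Copy)]
struct CacheBudget {
    l2_bytes: usize,
    l1_bytes: usize,
}

impl Default for CacheBudget {
    // production-ish defaults, mirroring the PR's ~625 KB / ~36 KB figures
    fn default() -> Self {
        CacheBudget { l2_bytes: 625 * 1024, l1_bytes: 36 * 1024 }
    }
}

// Rows of `k` f32 values fitting in a budget, floored to a panel multiple.
fn tile_rows(budget_bytes: usize, k: usize, panel: usize) -> usize {
    let rows = budget_bytes / (k * 4);
    (rows / panel).max(1) * panel
}

fn main() {
    let k = 128;
    // Default budget: a modest test matrix fits in one tile, so only the
    // peeled path runs.
    assert!(tile_rows(CacheBudget::default().l2_bytes, k, 16) >= 64);
    // Shrunken budget: the tile is small, so realistic-but-tiny test
    // matrices now span multiple tiles and hit the main loop body.
    let tiny = CacheBudget { l2_bytes: 8 * 1024, l1_bytes: 1024 };
    assert_eq!(tile_rows(tiny.l2_bytes, k, 16), 16);
    let _ = tiny.l1_bytes; // the L1 budget would bound the inner tile the same way
    println!("ok");
}
```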
```rust
///
/// The blanket identity impl covers every layout converting to itself
/// with `Buffer = ()` and zero cost. Explicit impls handle f16→f32 via
/// [`SliceCast`].
```
There was a problem hiding this comment.
The terminology around row * k and being contiguous makes me worry a little bit about future scenarios for strided access. What are the plans for systematically and safely updating the code when that lands?
There was a problem hiding this comment.
Yes, let’s address it in a follow-up once the design for strided access is clearer. I’m still working out the right level at which to handle K-splitting and perform the conversion.
arkrishn94
left a comment
There was a problem hiding this comment.
Thanks Suryansh, I think this is almost ready to merge! Left some small comments, but beyond those, the things I wanted to highlight:
- Testing. From what I can tell, the `tiled_reduce` loop is not tested across architectures. If that's the case, we should definitely add that.
- On specializing the output type for the scratch to support other reductions like argmax and inputs like `i8`/`u8` — am I misunderstanding, or should this be easy by augmenting `Kernel` with an associated return type and wiring that through?
- This is probably because I don't understand all the details very well, but I still don't see how supporting metadata for quantized vectors will be wired through, in terms of where it will live and how it'll be accessed in the main micro-kernel for post-op processing.
```rust
///
/// * `a` must point to `A_PANEL * k` contiguous `APrepared` values.
/// * `b` must point to `B_PANEL * k` contiguous `BPrepared` values.
/// * `r` must point to at least `A_PANEL` writable `f32` values.
```
There was a problem hiding this comment.
And be valid for the lifetime of this function execution only?
There was a problem hiding this comment.
Lifetime-of-call validity is the implicit raw-pointer convention (stdlib's pointer APIs don't spell it out either) - the kernel doesn't store any of the pointers across the call. Left the contracts in their concise form to keep them scannable; happy to add it explicitly if you'd rather have the trait be paranoid-self-contained.
```rust
/// # Safety
///
/// * `src` must point to `rows * k` valid elements in `Self`'s layout.
/// * `buf` must come from [`new_buffer`](Self::new_buffer) with
```
There was a problem hiding this comment.
I'm guessing buf has to be created with the same k as used in convert?
There was a problem hiding this comment.
Correct - added to the safety contract: buf must come from new_buffer with the same k (and max_tile_rows >= rows). The f16→f32 impls allocate max_tile_rows * k and convert writes rows * k via &mut buf[..count], so a smaller k would short-write or panic on the slice bound. The blanket identity impl ignores both, so this is purely a contract for non-identity converters.
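The contract can be seen in a toy model (helper names are illustrative, and `u16` stands in for the f16 bit pattern): a buffer built with the same `k` accepts any `rows <= max_tile_rows`, while one built with a smaller `k` trips the slice bound instead of silently short-writing.

```rust
// Toy model of the contract: a converter buffer sized max_tile_rows * k,
// with convert writing rows * k values through &mut buf[..rows * k].
fn new_buffer(max_tile_rows: usize, k: usize) -> Vec<f32> {
    vec![0.0; max_tile_rows * k]
}

fn convert(src: &[u16], rows: usize, k: usize, buf: &mut [f32]) {
    let count = rows * k;
    // &mut buf[..count] panics if buf came from a smaller k
    let dst = &mut buf[..count];
    for (d, s) in dst.iter_mut().zip(&src[..count]) {
        *d = *s as f32; // stand-in for the real f16 -> f32 cast
    }
}

fn main() {
    let (max_tile_rows, k) = (8, 4);
    let src: Vec<u16> = (0..max_tile_rows * k).map(|i| i as u16).collect();

    let mut buf = new_buffer(max_tile_rows, k);
    convert(&src, 3, k, &mut buf); // rows <= max_tile_rows: fine
    assert_eq!(buf[11], 11.0);

    // Buffer created with the wrong (smaller) k: the slice bound catches it.
    let mut small = new_buffer(max_tile_rows, 2);
    let r = std::panic::catch_unwind(move || convert(&src, max_tile_rows, k, &mut small));
    assert!(r.is_err());
    println!("ok");
}
```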
```rust
//! | [`BlockTransposedRef`] | Immutable view of a block-transposed matrix |
//! | [`BlockTransposedMut`] | Mutable view of a block-transposed matrix   |
//! | [`QueryMatRef`]        | Query wrapper for asymmetric distances      |
//! | [`QueryComputer`]      | Architecture-dispatched SIMD query computer |
```
There was a problem hiding this comment.
nit: Can we separate the matrix types from the computer type in the documentation? It might be worth adding separate documentation for it here, since it's a core type in the new distance computation path.
There was a problem hiding this comment.
Good call - I'd lean toward keeping the table flat. It's meant as a fast inventory of what multi_vector re-exports, and MaxSim/Chamfer are equally first-class on the new distance path; pulling QueryComputer out into its own section while leaving them in the table would be inconsistent. The detailed docs already live on the type itself in query_computer/mod.rs (dispatch model, build cost, usage). Happy to expand that type-level doc if you feel anything's missing — just don't think the module-level overview is the right place for it. WDYT?
What
SIMD-accelerated MaxSim / Chamfer distance for f32 and f16 multi-vector queries, using a block-transposed layout with L2/L1 cache-aware tiling. Introduces `QueryComputer<T>` — a runtime-dispatched type that hides GROUP and architecture behind a vtable.
The focus is proving the `Kernel`/`tiled_reduce`/`ConvertTo` abstraction is solid and type-agnostic: f32 and f16 share the same tiling loop and micro-kernel body.
Why
The fallback kernel iterates query×doc in a flat nested loop, causing repeated cache evictions. Block-transposing the query and tiling both sides to fit in L2/L1 keeps hot data resident and feeds the FMA pipeline efficiently. f16 comes for free: `ConvertTo` converts f16→f32 once per tile, then the f32 micro-kernel runs unchanged.
Changed Files
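For reference, the semantics being accelerated can be stated in a few lines of naive Rust. This is illustrative, not the PR's code; the PR drives both the Chamfer and MaxSim entry points through the same max-reduce machinery:

```rust
// For one query row, the maximum inner product against any doc row.
fn max_ip(q_row: &[f32], docs: &[Vec<f32>]) -> f32 {
    docs.iter()
        .map(|d| q_row.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
        .fold(f32::NEG_INFINITY, f32::max)
}

// MaxSim: sum those per-row maxima over all query rows.
fn max_sim(queries: &[Vec<f32>], docs: &[Vec<f32>]) -> f32 {
    queries.iter().map(|q| max_ip(q, docs)).sum()
}

fn main() {
    let queries = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let docs = vec![vec![2.0, 1.0], vec![0.5, 3.0]];
    // row 0: max(2.0, 0.5) = 2.0; row 1: max(1.0, 3.0) = 3.0
    assert_eq!(max_sim(&queries, &docs), 5.0);
    println!("{}", max_sim(&queries, &docs));
}
```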
All paths relative to `diskann-quantization/src/multi_vector/`.

New — `distance/kernels/`:
- `mod.rs` — `Kernel<A>` unsafe trait (`Left`/`Right` layouts, `full_panel`/`partial_panel`), cache budget helpers.
- `layouts.rs` — `Layout` marker trait; `BlockTransposed`/`RowMajor` ZST markers; `DescribeLayout` bridge; `ConvertTo<A, To>` with blanket identity and f16→f32 specializations.
- `tiled_reduce.rs` — generic tiling loop (`K: Kernel`, `LA`/`LB: ConvertTo`); `FullReduce` tile planner; `Reduce` unroll trait.
- `f32/mod.rs` — `F32Kernel<GROUP>`, `max_ip_kernel`, `Target3` dispatch, tests.
- `f32/v3.rs` — AVX2+FMA 16×4 micro-kernel; V4→V3 delegation via `retarget()`.
- `f32/scalar.rs` — scalar 8×2 micro-kernel (`fma()`); Neon→Scalar delegation via `retarget()`.
- `f16.rs` — `F16Entry<GROUP>`, `Target3` dispatch, tests — drives `tiled_reduce` with `F32Kernel` + f16→f32 `ConvertTo`. No `Kernel` impl.

New — `distance/query_computer/`:
- `mod.rs` — `QueryComputer<T>` (`Box<dyn DynQueryComputer<T>>`); `chamfer`/`max_sim` methods; `Chamfer`/`MaxSim` trait impls; tests.
- `f32.rs` — `QueryComputer<f32>::new` via `dispatch1_no_features`; `BuildComputer` `Target1` impls.
- `f16.rs` — `QueryComputer<half::f16>::new` — same pattern, delegates through `F16Entry`.

Modified:
- `block_transposed.rs` — `padded_nrows()`.
- `matrix.rs` — `as_matrix_view()`.
- `multi_vector/mod.rs` — re-exports `QueryComputer`, `Chamfer`, `MaxSim`, `MaxSimError`, `QueryMatRef`.
- `distance/mod.rs` — `QueryComputer` re-export, doc example.
- `distance/max_sim.rs` — `MaxSim`/`Chamfer` types, `MaxSimError` enum.
- `distance/fallback.rs` — `FallbackKernel` (was `SimpleKernel`), `QueryMatRef`, fallback trait impls.

Renamed: `simple.rs` → `fallback.rs`

Design Decisions
Kernel trait
`Kernel<A>` declares `Left`/`Right` layout types and `full_panel`/`partial_panel`. The kernel receives already-converted pointers — it knows nothing about storage formats. V4→V3 and Neon→Scalar delegate via `retarget()`. The GROUP const generic (16 for V3/V4, 8 for Scalar/Neon) acts as a closed-world filter.

Layout markers and ConvertTo
`BlockTransposed<T, GROUP, PACK>` and `RowMajor<T>` are ZST markers. The `Layout` impl requires `T: Copy` (micro-kernels load via raw pointers); `Copy`/`Clone` on the markers themselves are unconditional (`PhantomData` wrappers). `DescribeLayout` bridges matrix types to markers for type inference. `ConvertTo<A, To>`: blanket identity (`Buffer = ()`, zero cost) + f16→f32 specializations (`Vec<f32>` buffer, SIMD-accelerated `SliceCast`). Conversion happens once per tile, not per panel. `SliceCast` dispatches through the runtime architecture token via `arch.run2()` — the same SIMD level used by the micro-kernel.

Tiling loop (reducing-GEMM)
5-level loop: L2 A-tiles → L1 B-tiles → A-panels → B-panels → micro-kernel. `ConvertTo::convert` runs at tile boundaries. Cache budgets derive from kernel layout element sizes (~625 KB L2, ~36 KB L1). Geometry: 16×4 (V3/V4) or 8×2 (Scalar/Neon). Source strides and kernel strides differ when conversion is active (f16→f32).

f16 path
`F16Entry<GROUP>` is a dispatch adapter, not a `Kernel` impl. It calls `tiled_reduce` with `F32Kernel` and f16→f32 `ConvertTo` impls. Zero SIMD code. Extends naturally to PQ/SQ/MinMax via new `ConvertTo` impls.

QueryComputer
`QueryComputer<T>` wraps `Box<dyn DynQueryComputer<T>>`. CPU detection happens once at construction via `dispatch1_no_features`; the hot path uses `Architecture::run3` with `#[target_feature]` — no re-dispatch. Turbofish: `QueryComputer::<f32>::new(q)`. Per-type `BuildComputer` dispatch lives in `f32.rs`/`f16.rs`; `mod.rs` is generic.

Suggested Review Order
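The 5-level loop nest described above can be sketched end-to-end in scalar code and checked against a naive reduction. Tile and panel sizes here are illustrative constants, not the PR's 16×4/8×2 geometry, and the odd matrix sizes exercise the partial-tile paths:

```rust
// Naive reference: flat MaxSim-style max-inner-product reduction.
fn naive(q: &[f32], d: &[f32], qn: usize, dn: usize, k: usize) -> f32 {
    (0..qn)
        .map(|i| {
            (0..dn)
                .map(|j| (0..k).map(|p| q[i * k + p] * d[j * k + p]).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

// Same reduction through a 5-level tiled loop nest.
fn tiled(q: &[f32], d: &[f32], qn: usize, dn: usize, k: usize) -> f32 {
    const A_TILE: usize = 4;  // "L2-resident" A rows (illustrative)
    const B_TILE: usize = 4;  // "L1-resident" B rows
    const A_PANEL: usize = 2; // micro-kernel geometry
    const B_PANEL: usize = 2;
    let mut best = vec![f32::NEG_INFINITY; qn];
    for a0 in (0..qn).step_by(A_TILE) {                 // 1: L2 A-tiles
        for b0 in (0..dn).step_by(B_TILE) {             // 2: L1 B-tiles
            for ap in (a0..(a0 + A_TILE).min(qn)).step_by(A_PANEL) {       // 3: A-panels
                for bp in (b0..(b0 + B_TILE).min(dn)).step_by(B_PANEL) {   // 4: B-panels
                    for i in ap..(ap + A_PANEL).min(qn) {                  // 5: micro-kernel
                        for j in bp..(bp + B_PANEL).min(dn) {
                            let ip: f32 = (0..k).map(|p| q[i * k + p] * d[j * k + p]).sum();
                            if ip > best[i] { best[i] = ip; }
                        }
                    }
                }
            }
        }
    }
    best.iter().sum()
}

fn main() {
    let (qn, dn, k) = (7, 9, 5); // odd sizes hit the partial paths
    let q: Vec<f32> = (0..qn * k).map(|i| ((i * 13) % 7) as f32 - 3.0).collect();
    let d: Vec<f32> = (0..dn * k).map(|i| ((i * 11) % 5) as f32 - 2.0).collect();
    assert_eq!(naive(&q, &d, qn, dn, k), tiled(&q, &d, qn, dn, k));
    println!("match");
}
```

Because every inner product is summed in the same order in both versions, the comparison is exact, which is the property the PR's correctness tests against the fallback rely on as well.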
- `distance/kernels/mod.rs` — `Kernel<A>` trait.
- `distance/kernels/layouts.rs` — markers, `ConvertTo`, blanket identity, f16→f32.
- `distance/kernels/tiled_reduce.rs` — tiling loop, source vs kernel strides.
- `distance/kernels/f32/v3.rs` → `f32/scalar.rs` — micro-kernels.
- `distance/kernels/f32/mod.rs` — `F32Kernel`, `max_ip_kernel`, dispatch.
- `distance/kernels/f16.rs` — `F16Entry`, no `Kernel` impl, no submodules.
- `distance/query_computer/mod.rs` — `QueryComputer<T>`, tests.
- `distance/query_computer/f32.rs` → `f16.rs` — per-type dispatch.
- `distance/max_sim.rs` → `distance/fallback.rs` — types, fallback kernel.
- `block_transposed.rs`, `matrix.rs`, `distance/mod.rs`, `multi_vector/mod.rs` — supporting.

Future Work
- Output generality: the `f32` scratch and max-reduce fit Chamfer/MaxSim but may over-fit. Brute-force search would need arg-max (tracking indices, not just values), and u8/i8 kernels would naturally accumulate into u32/i32 rather than f32.
- `Kernel` + `ConvertTo` for PQ, SQ, MinMax.
- `ConvertTo` transpose.