Merge #903: OneDNN BRGeMM Micro-Kernel Integration for BF16 MatMul #925

Merged: copybara-service[bot] merged 1 commit into dev from test_925970431 on Jun 3, 2026

@copybara-service

Merge #903: OneDNN BRGeMM Micro-Kernel Integration for BF16 MatMul

============= Description ==========

This PR integrates OneDNN BRGeMM (Batch-Reduced General Matrix Multiply) micro-kernels as an alternative compute path for BF16 MatMul on Intel Xeon platforms with AMX or AVX-512 BF16 support.

What

When enabled via the GEMMA_ONEDNN_BRGEMM compile-time flag, BF16×BF16 MatMul operations are dispatched to JIT-compiled BRGeMM kernels instead of the Highway SIMD path. This targets Gemma model workloads (FFW projections, attention) on Intel Xeon Scalable processors (Sapphire Rapids and Emerald Rapids, SPR/EMR). The flag is supported in both the CMake and Bazel builds.

How to Enable

# CMake
cmake -DGEMMA_ONEDNN_BRGEMM=ON ..

# Bazel
bazel build --define gemma_onednn_brgemm=1 ...

Runtime Fallback

When GEMMA_ONEDNN_BRGEMM is enabled at compile time, the BRGeMM path activates for BF16×BF16 operations whose dimensions meet AMX tile constraints (M, N, K ≥ 32 and K % 32 == 0). All other cases — non-BF16 types, smaller or non-aligned dimensions, mixed precision — fall through to the standard Highway SIMD MatMul path automatically.
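
The shape rule above can be written as a small predicate. This is only a sketch of the stated constraint (M, N, K ≥ 32 and K % 32 == 0): the names `ShapeFitsBrgemm`, `kMinDim`, and `kKAlign` are made up for illustration and are not taken from the PR, whose actual check lives in `UseOneDnnBrgemm()` and may test more than this.

```cpp
#include <cstddef>

// Hypothetical constants: both values come from the rule quoted in the
// description, but the names are assumptions for this sketch.
constexpr std::size_t kMinDim = 32;  // smallest M, N, or K accepted
constexpr std::size_t kKAlign = 32;  // K must be a multiple of this

// Returns true when the dimensions satisfy the stated AMX tile constraints.
// The element-type check (BF16 x BF16) is deliberately left out here.
constexpr bool ShapeFitsBrgemm(std::size_t M, std::size_t N, std::size_t K) {
  return M >= kMinDim && N >= kMinDim && K >= kMinDim && K % kKAlign == 0;
}

// Compile-time spot checks of the rule.
static_assert(ShapeFitsBrgemm(64, 128, 256), "aligned shape is accepted");
static_assert(!ShapeFitsBrgemm(16, 128, 256), "M below 32 is rejected");
static_assert(!ShapeFitsBrgemm(64, 128, 48), "K not a multiple of 32 is rejected");
```

Any shape for which such a predicate returns false would take the Highway SIMD path, as described above.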

Changes

| File | Description |
|---|---|
| `ops/brgemm.h` | Types, caches, thread-local buffers, `UseOneDnnBrgemm()`, autotuning candidates |
| `ops/brgemm-inl.h` | `DoMatMul_BRGeMM()`: kernel JIT/caching, B-packing with hugepages, tiled parallel execution |
| `ops/matmul-inl.h` | BRGeMM dispatch block in `MatMul()` guarded by `#if GEMMA_ONEDNN_BRGEMM` |
| `ops/matmul.h` | `#include "ops/brgemm.h"`, `brgemm_autotune` field in `MMPerKey` |
| `ops/bench_matmul.cc` | Check `brgemm_autotune.Best()` to avoid infinite loop when BRGeMM handles dispatch |
| `CMakeLists.txt` | `GEMMA_ONEDNN_BRGEMM` option, FetchContent for OneDNN v3.11, conditional target linking |
| `BUILD.bazel` | `config_setting` for `gemma_onednn_brgemm`, conditional OneDNN dep and defines for x86_64 |
| `MODULE.bazel` | OneDNN v3.11 `http_archive` dependency |
| `bazel/onednn.BUILD` | Bazel build rules for OneDNN |
| `util/zones.h` | `kBRGeMM` caller enum for thread pool dispatch |
| `util/zones.cc` | `CallerName` mapping for `kBRGeMM` |
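
To illustrate how a dispatch block guarded by `#if GEMMA_ONEDNN_BRGEMM` could combine with the warn-and-fall-back behaviour mentioned in the commit list, here is a rough sketch. It is not the code in `ops/matmul-inl.h`: `MatMulPath`, `ChoosePath`, `Dispatch`, `kBrgemmCompiledIn`, and the `kernel_ready` parameter are hypothetical names, and the real block presumably also consults the autotuner.

```cpp
// The macro name is from the PR; an undefined macro evaluates to 0 in #if,
// so the BRGeMM branch is compiled out by default.
#if GEMMA_ONEDNN_BRGEMM
constexpr bool kBrgemmCompiledIn = true;
#else
constexpr bool kBrgemmCompiledIn = false;
#endif

// Hypothetical tag for the two compute paths.
enum class MatMulPath { kHighway, kBrgemm };

// Hypothetical decision function. `kernel_ready` stands in for "the JIT
// kernel was created successfully"; when it is false the caller would log a
// warning and use the Highway path instead of exiting.
constexpr MatMulPath ChoosePath(bool flag_enabled, bool both_bf16,
                                bool shape_ok, bool kernel_ready) {
  return (flag_enabled && both_bf16 && shape_ok && kernel_ready)
             ? MatMulPath::kBrgemm
             : MatMulPath::kHighway;
}

// What a call site might look like once the flag is folded in.
constexpr MatMulPath Dispatch(bool both_bf16, bool shape_ok,
                              bool kernel_ready) {
  return ChoosePath(kBrgemmCompiledIn, both_bf16, shape_ok, kernel_ready);
}

static_assert(ChoosePath(true, true, true, true) == MatMulPath::kBrgemm,
              "all conditions met selects BRGeMM");
static_assert(ChoosePath(true, true, true, false) == MatMulPath::kHighway,
              "kernel creation failure falls back");
static_assert(ChoosePath(false, true, true, true) == MatMulPath::kHighway,
              "flag off always uses Highway");
```

Keeping the decision in one pure function makes the fallback rule easy to test separately from the kernel code, which is a design choice of this sketch rather than something the PR states.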

Testing

  • matmul_test passes with and without GEMMA_ONEDNN_BRGEMM (all original test shapes, types, and correctness checks preserved)
  • bench_matmul runs successfully with BRGeMM enabled
  • No changes to existing tests; zero impact when OneDNN is not enabled or on non-x86 platforms

============= Commits ==============

--
09ddbf4 by Bibek Bhattarai bibek.bhattarai@intel.com:

Tested and benchmarked OneDNN BRGeMM integration against dev branch

--
1308355 by Bibek Bhattarai bibek.bhattarai@intel.com:

fixing the copyright info

--
656444f by Bibek Bhattarai bibek.bhattarai@intel.com:

Removing OneTBB dependency

--
f8527a1 by Bibek Bhattarai bibek.bhattarai@intel.com:

Fixed the compile time flag to designate BRGEMM path

--
0dde315 by Bibek Bhattarai bibek.bhattarai@intel.com:

Adding the CMake-based build support for oneDNN BRGeMM

--
fd3b119 by Bibek Bhattarai bibek.bhattarai@intel.com:

fixed dtypes and syntax divergence from codebase

--
6640021 by Bibek Bhattarai bibek.bhattarai@intel.com:

changed lda and ldb to size_t. Added conversions inplace for brgemm and transform inits

--
9d6bbee by Bibek Bhattarai bibek.bhattarai@intel.com:

Replaced / and % with Divide and Remainder utils from hwy::Divisor

--
45708ea by Bibek Bhattarai bibek.bhattarai@intel.com:

Moved the BRGeMM Kernel inits to a separate HWY_NOINLINE helper function

--
acf7592 by Bibek Bhattarai bibek.bhattarai@intel.com:

Added HWY_WARN and fallback instead of exiting

--
7bdf4c6 by Bibek Bhattarai bibek.bhattarai@intel.com:

using hwy::AlignedVector instead of std::vector for scratch and tc_storage

====================================

Resolves #903.

@google-cla

google-cla Bot commented Jun 3, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

PiperOrigin-RevId: 925971432
copybara-service[bot] merged commit 3325f6f into dev on Jun 3, 2026
copybara-service[bot] deleted the test_925970431 branch on June 3, 2026 at 12:58