Enhancements
- Support Stream-K GEMM ops in Python API (#800)
- Support fast path for LinearCombination in xe_epilogue (#802)
- Support event-less launch when profiling is disabled in GemmUniversalAdapter (#803)
- Support subbyte reorder (#793)
- Add handler-less and event-less support in launch APIs (#794)
- Add SYCL subgroup lane index to canonical_lane_idx() (#816)
- Add memory-budget-based bounded buffer for EventManager (#795)
- Add more PyTorch GEMM configs (#789)
- Reverse Q scheduling order in FMHA tile scheduler (#814)
- Refine SLM r2s/s2r to reuse UniversalCopy without vectorization (#776)
- Refine barrier API (#810)
- Drop redundant __INTEL_LLVM_COMPILER checks (#801)
Bug Fixes
- Fix shapes parameter issue in the MoE grouped GEMMs (#820)
- Fix average runtime and GFLOPS calculation in example 10 (#818)
- Fix rem mask of SDPA (#813)
- Fix int32 overflow in MoE GEMM for large expert counts (#804)
- Fix sub-byte pointer arithmetic and zero buffer allocation in grouped GEMM (#790)
- Fix wrong constexpr/lifetime evaluation (#799)
Documentation
- Align build commands across README (#788)
- Update googlebenchmark to v1.9.5 (#797)
- Update PyTorch commit for cutlass-inductor workflow (#808)
- Update inductor workflow (#806)
PyPI wheels are available here.
See the CHANGELOG for details of all past releases and updates.