SYCL*TLA 0.9.1

Latest

taozha2 released this 12 Jun 06:43

90fd94d

Enhancements

Support Stream-K GEMM ops in Python API (#800)
Support fast path for LinearCombination in xe_epilogue (#802)
Support event-less launch when profiling is disabled in GemmUniversalAdapter (#803)
Support subbyte reorder (#793)
Add Handler-less and Event-less support in launch APIs (#794)
Add SYCL subgroup lane index to canonical_lane_idx() (#816)
Add memory-budget-based bounded buffer for EventManager (#795)
Add more PyTorch GEMM configs (#789)
Reverse Q scheduling order in FMHA tile scheduler (#814)
Refine SLM r2s/s2r to reuse UniversalCopy without vectorization (#776)
Refine barrier API (#810)
Drop redundant __INTEL_LLVM_COMPILER checks (#801)

Bug Fixes

Fix shapes parameter issue in the MoE grouped GEMMs (#820)
Fix average runtime and GFLOPS calculation in example 10 (#818)
Fix rem mask of SDPA (#813)
Fix int32 overflow in MoE GEMM for large expert counts (#804)
Fix sub-byte pointer arithmetic and zero buffer allocation in grouped GEMM (#790)
Fix wrong constexpr/lifetime evaluation (#799)
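
For context on the int32 overflow entry (#804), the following is a generic sketch of that class of bug, not the actual change in the PR. The names expert_offset and exceeds_int32 are hypothetical and do not come from SYCL*TLA, and the idea that per-expert offsets are computed as index times rows times columns is an assumption made for illustration. It shows why a product of 32-bit extents should be widened to 64 bits before multiplying.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of SYCL*TLA): element offset of one expert's
// matrix inside a single contiguous buffer, assuming every expert owns a
// rows x cols block.
std::int64_t expert_offset(std::int32_t expert_idx, std::int32_t rows, std::int32_t cols) {
    assert(expert_idx >= 0 && rows >= 0 && cols >= 0);
    // Widen the first operand so the whole product is evaluated in 64-bit.
    // Multiplying three int32 values directly would overflow (undefined
    // behavior for signed integers) once the result exceeds 2^31 - 1.
    return static_cast<std::int64_t>(expert_idx) * rows * cols;
}

// True when an offset can no longer be represented as a signed 32-bit index.
bool exceeds_int32(std::int64_t offset) {
    return offset > std::numeric_limits<std::int32_t>::max();
}
```

With illustrative sizes of 256 experts at 4096 x 4096 elements each, the offset of the last expert is 255 * 4096 * 4096 = 4278190080, which is above the int32 maximum of 2147483647, so a 32-bit index would be insufficient.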

Documentation

Align build commands across README (#788)
Update googlebenchmark to v1.9.5 (#797)
Update PyTorch commit for cutlass-inductor workflow (#808)
Update inductor workflow (#806)

PyPI wheels here

See the CHANGELOG for details of all past releases and updates.