SYCL*TLA 0.9.1

taozha2 released this 12 Jun 06:43
90fd94d

Enhancements

  • Support Stream-K GEMM ops in Python API (#800)
  • Support fast path for LinearCombination in xe_epilogue (#802)
  • Support event-less launch when profiling is disabled in GemmUniversalAdapter (#803)
  • Support sub-byte reorder (#793)
  • Add handler-less and event-less support in launch APIs (#794)
  • Add SYCL subgroup lane index to canonical_lane_idx() (#816)
  • Add memory-budget-based bounded buffer for EventManager (#795)
  • Add more PyTorch GEMM configs (#789)
  • Reverse Q scheduling order in FMHA tile scheduler (#814)
  • Refine SLM r2s/s2r to reuse UniversalCopy without vectorization (#776)
  • Refine barrier API (#810)
  • Drop redundant __INTEL_LLVM_COMPILER checks (#801)

Bug Fixes

  • Fix shapes parameter issue in the MoE grouped GEMMs (#820)
  • Fix average runtime and GFLOPS calculation in example 10 (#818)
  • Fix rem mask of SDPA (#813)
  • Fix int32 overflow in MoE GEMM for large expert counts (#804)
  • Fix sub-byte pointer arithmetic and zero buffer allocation in grouped GEMM (#790)
  • Fix wrong constexpr/lifetime evaluation (#799)
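
For context on the runtime and GFLOPS fix (#818), the sketch below shows the conventional way these two numbers are derived for a GEMM benchmark. It is not the code from example 10: the function names and parameters are hypothetical, and it assumes the usual 2·M·N·K operation count for a GEMM, with the total time accumulated over all timed iterations.

```python
def average_runtime_ms(total_runtime_ms: float, iterations: int) -> float:
    """Average per-iteration runtime, given the total over all timed iterations."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return total_runtime_ms / iterations


def gemm_gflops(m: int, n: int, k: int, avg_runtime_ms: float, batch: int = 1) -> float:
    """GEMM throughput in GFLOPS, assuming 2*M*N*K operations per problem
    (one multiply and one add per inner-product term)."""
    if avg_runtime_ms <= 0:
        raise ValueError("avg_runtime_ms must be positive")
    flops = 2 * m * n * k * batch           # Python ints do not overflow
    seconds = avg_runtime_ms * 1e-3         # milliseconds -> seconds
    return flops / seconds / 1e9            # FLOP/s -> GFLOP/s


if __name__ == "__main__":
    avg = average_runtime_ms(total_runtime_ms=100.0, iterations=10)
    print(f"avg runtime: {avg:.3f} ms")
    print(f"throughput:  {gemm_gflops(1024, 1024, 1024, avg):.2f} GFLOPS")
```

A common mistake in this kind of calculation is dividing the operation count by the total time rather than the per-iteration average, or mixing milliseconds and seconds; keeping the two steps in separate functions makes the units explicit.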

Documentation

  • Align build commands across README (#788)
  • Update googlebenchmark to v1.9.5 (#797)
  • Update PyTorch commit for cutlass-inductor workflow (#808)
  • Update inductor workflow (#806)

PyPI wheels are available here.

See the CHANGELOG for details of all past releases and updates.