v2.0.0

@LoserCheems LoserCheems released this 23 Mar 03:33
· 812 commits to main since this release
969a280

What's Changed

  • Improve numerical stability in sparse attention with sink auxiliary logits by @LoserCheems in #220
  • [PERFORMANCE OPTIMIZATION] Flash Sparse Attention by @LoserCheems in #221
  • [BUG FIX] Refactor block min/max calculations by @LoserCheems in #223
  • [BUG FIX] Improve packed GQA handling by @LoserCheems in #224
  • Add utility functions for device management and input validation by @LoserCheems in #225
  • [PERFORMANCE OPTIMIZATION] Triton Sparse Base Forward Kernel with Gate-Based Sparsity by @LoserCheems in #226
  • [FEATURE] Enhance forward combine kernel and split attention by @LoserCheems in #227
  • Improves softmax stability with log2 scaling by @LoserCheems in #228
  • Renames variables and refactors functions for clarity by @LoserCheems in #229
  • Improve performance and configuration for SM90 forward path by @LoserCheems in #231
  • Refactor rescaling logic in online_softmax and rescale_o functions by @LoserCheems in #232
  • [BUG FIX] Improve forward kernel configuration and validation by @LoserCheems in #233
  • Refactor qheads_per_kvhead calculations for clarity by @LoserCheems in #234
  • [FEATURE SUPPORT] Add Triton backward support by @LoserCheems in #235
  • [FEATURE SUPPORT] Add Configurable Sparse Gate Modes and Adaptive Thresholding in Triton Forward Kernel by @LoserCheems in #236
  • Refactor log_sigmoid function for improved performance and accuracy by @LoserCheems in #237
  • [FEATURE SUPPORT] Add Configurable Sparse Gate Modes and Adaptive Thresholding in Triton Backward Kernel by @LoserCheems in #238
  • Enhance forward kernel for block range and masking logic by @LoserCheems in #239
  • Refactor backward kernels for clarity and optimization by @LoserCheems in #240
  • [BUG FIX] Update launch configuration for RTX Pro 6000 by @LoserCheems in #241
  • Add benchmark functions for Triton attention operations by @LoserCheems in #242
  • [FEATURE SUPPORT] Enable Softmax-Threshold Block Skipping in Triton Dense/Sparse Forward Attention by @LoserCheems in #243
  • [BUG FIX] Improve clarity and accuracy in gating mechanisms by @LoserCheems in #244
  • [BUG FIX] Update stride parameters for consistency by @LoserCheems in #245
  • Add softmax threshold parameter for enhanced flexibility by @LoserCheems in #246
  • [FEATURE] Implement dense attention with masking support by @LoserCheems in #247
  • Enhance sparse attention implementation and documentation by @LoserCheems in #248
  • [FEATURE] Implement gated attention mechanism and enhance performance by @LoserCheems in #249
  • Update project structure and dependencies by @LoserCheems in #250
  • [BUG FIX] Improve error reporting and occupancy in benchmarks by @LoserCheems in #251
  • Update repository URLs and improve documentation by @LoserCheems in #252
  • Refactor benchmark tests to simplify tensor initialization by @LoserCheems in #253
  • Refactor test utilities and add CUDA tensor operation tests by @LoserCheems in #254
  • Refactor masking logic in backward kernel functions by @LoserCheems in #255
  • Refactor GitHub Actions workflows for package building and publishing by @LoserCheems in #256

Full Changelog: v1.2.4...v2.0.0
