v2.0.0
What's Changed
- Improve numerical stability in sparse attention with sink auxiliary logits by @LoserCheems in #220
- [PERFORMANCE OPTIMIZATION] Flash Sparse Attention by @LoserCheems in #221
- [BUG FIX] Refactor block min/max calculations by @LoserCheems in #223
- [BUG FIX] Improve packed GQA handling by @LoserCheems in #224
- Add utility functions for device management and input validation by @LoserCheems in #225
- [PERFORMANCE OPTIMIZATION] Triton Sparse Base Forward Kernel with Gate-Based Sparsity by @LoserCheems in #226
- [FEATURE] Enhance forward combine kernel and split attention by @LoserCheems in #227
- Improve softmax stability with log2 scaling by @LoserCheems in #228
- Rename variables and refactor functions for clarity by @LoserCheems in #229
- Improve performance and configuration for SM90 forward path by @LoserCheems in #231
- Refactor rescaling logic in online_softmax and rescale_o functions by @LoserCheems in #232
- [BUG FIX] Improve forward kernel configuration and validation by @LoserCheems in #233
- Refactor qheads_per_kvhead calculations for clarity by @LoserCheems in #234
- [FEATURE SUPPORT] Add Triton backward support by @LoserCheems in #235
- [FEATURE SUPPORT] Add Configurable Sparse Gate Modes and Adaptive Thresholding in Triton Forward Kernel by @LoserCheems in #236
- Refactor log_sigmoid function for improved performance and accuracy by @LoserCheems in #237
- [FEATURE SUPPORT] Add Configurable Sparse Gate Modes and Adaptive Thresholding in Triton Backward Kernel by @LoserCheems in #238
- Enhance forward kernel for block range and masking logic by @LoserCheems in #239
- Refactor backward kernels for clarity and optimization by @LoserCheems in #240
- [BUG FIX] Update launch configuration for RTX Pro 6000 by @LoserCheems in #241
- Add benchmark functions for Triton attention operations by @LoserCheems in #242
- [FEATURE SUPPORT] Enable Softmax-Threshold Block Skipping in Triton Dense/Sparse Forward Attention by @LoserCheems in #243
- [BUG FIX] Improve clarity and accuracy in gating mechanisms by @LoserCheems in #244
- [BUG FIX] Update stride parameters for consistency by @LoserCheems in #245
- Add softmax threshold parameter for enhanced flexibility by @LoserCheems in #246
- [FEATURE] Implement dense attention with masking support by @LoserCheems in #247
- Enhance sparse attention implementation and documentation by @LoserCheems in #248
- [FEATURE] Implement gated attention mechanism and enhance performance by @LoserCheems in #249
- Update project structure and dependencies by @LoserCheems in #250
- [BUG FIX] Improve error reporting and occupancy in benchmarks by @LoserCheems in #251
- Update repository URLs and improve documentation by @LoserCheems in #252
- Refactor benchmark tests to simplify tensor initialization by @LoserCheems in #253
- Refactor test utilities and add CUDA tensor operation tests by @LoserCheems in #254
- Refactor masking logic in backward kernel functions by @LoserCheems in #255
- Refactor GitHub Actions workflows for package building and publishing by @LoserCheems in #256
Full Changelog: v1.2.4...v2.0.0
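
Note on the softmax entries (#228, #232): the sketch below illustrates the general idea of an online softmax computed in base 2, where scores are pre-multiplied by log2(e) so that every exponential is a power of two, and earlier partial results are rescaled whenever a new running maximum appears. It is a plain-Python illustration under stated assumptions, not the kernel code from this release: the names `online_softmax_weighted_sum`, `reference_weighted_sum`, and `block_size` are hypothetical, values are scalars rather than head-dimension vectors, and all scores are assumed finite.

```python
import math

LOG2_E = math.log2(math.e)  # exp(x) == 2 ** (x * LOG2_E)


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Softmax-weighted sum of `values`, streamed block by block in base 2.

    Hypothetical illustration; assumes every score is finite.
    """
    running_max = -math.inf  # largest scaled score seen so far
    running_sum = 0.0        # sum of 2 ** (scaled_score - running_max)
    acc = 0.0                # un-normalized weighted sum of values
    for start in range(0, len(scores), block_size):
        block = [s * LOG2_E for s in scores[start:start + block_size]]
        new_max = max(running_max, max(block))
        # Rescale what was accumulated under the old maximum.
        # On the first block nothing has been accumulated yet.
        if running_max == -math.inf:
            correction = 0.0
        else:
            correction = 2.0 ** (running_max - new_max)
        running_sum *= correction
        acc *= correction
        for s, v in zip(block, values[start:start + block_size]):
            p = 2.0 ** (s - new_max)  # exponent is <= 0, so no overflow
            running_sum += p
            acc += p * v
        running_max = new_max
    return acc / running_sum


def reference_weighted_sum(scores, values):
    """Straightforward base-e softmax over the full row, for comparison."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


if __name__ == "__main__":
    scores = [1.0, 3.0, -2.0, 0.5, 4.0]
    values = [10.0, 20.0, 30.0, 40.0, 50.0]
    print(online_softmax_weighted_sum(scores, values))
    print(reference_weighted_sum(scores, values))
```

The two printed numbers should agree up to floating-point rounding. Working in base 2 is commonly chosen because GPUs expose a fast `exp2` instruction, so the log2(e) factor can be folded into the softmax scale once instead of being applied inside every exponential.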