Release v2.0.2 · HKUSTDial/flash-sparse-attention

What's Changed

Refactor cache management in cache_utils.py by @LoserCheems in #278
[BUG FIX] Update softmax threshold parameter in sparse attention functions by @LoserCheems in #279
[PERFORMANCE OPTIMIZATION] Optimize alpha/delta layout to eliminate transpose copy in gated attention by @LoserCheems in #280
[BUG FIX] Varlen Decode Reference Bug Fix by @LoserCheems in #281
Fix stride bug by @LoserCheems in #282
Refactor activation functions for optimization by @LoserCheems in #285
[FEATURE] Optimize kernel representations for Triton by @LoserCheems in #286
Enhance attention kernel performance and quantization support by @LoserCheems in #287
Refactor attention functions and add autotuning support in Triton by @LoserCheems in #288
Add optional output and logsumexp tensors to attention functions by @LoserCheems in #289
Quant backward by @LoserCheems in #290
Update autotune cache system and local attention by @LoserCheems in #293
Refactor MASK_CAUSAL parameter to use IS_CAUSAL by @LoserCheems in #295
Refactor flash_dec_combine to flash_fwd_combine in forward attention … by @LoserCheems in #296
Fix _reference_scores launch config lookup for sparse/gated and GQA by @LoserCheems in #300
Fix window_sizes parameter handling and falsy-value default bugs by @LoserCheems in #302
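
For readers unfamiliar with the "falsy-value default" class of bug named in #302, here is a generic sketch of the pattern. It is not taken from this repository: the function names and `DEFAULT_WINDOW` are hypothetical, and the actual fix in #302 may look different.

```python
from typing import Optional

DEFAULT_WINDOW = 128  # hypothetical default, chosen only for this illustration


def resolve_window_buggy(window: Optional[int]) -> int:
    # `or` falls back to the default for every falsy value, so an
    # explicit 0 is silently replaced along with None.
    return window or DEFAULT_WINDOW


def resolve_window_fixed(window: Optional[int]) -> int:
    # Only a missing argument (None) triggers the default.
    return DEFAULT_WINDOW if window is None else window
```

With the buggy version a caller cannot pass `0` deliberately, whereas the `is None` check keeps `0` and still applies the default when the argument is omitted.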

Full Changelog: v2.0.1...v2.0.2
