v2.0.2
What's Changed
- Refactor cache management in cache_utils.py by @LoserCheems in #278
- [BUG FIX] Update softmax threshold parameter in sparse attention functions by @LoserCheems in #279
- [PERFORMANCE OPTIMIZATION] Optimize alpha/delta layout to eliminate transpose copy in gated attention by @LoserCheems in #280
- [BUG FIX] Varlen Decode Reference Bug Fix by @LoserCheems in #281
- Fix stride bug by @LoserCheems in #282
- Refactor activation functions for optimization by @LoserCheems in #285
- [FEATURE] Optimize kernel representations for Triton by @LoserCheems in #286
- Enhance attention kernel performance and quantization support by @LoserCheems in #287
- Refactor attention functions and add autotuning support in Triton by @LoserCheems in #288
- Add optional output and logsumexp tensors to attention functions by @LoserCheems in #289
- Quant backward by @LoserCheems in #290
- Update autotune cache system and local attention by @LoserCheems in #293
- Refactor MASK_CAUSAL parameter to use IS_CAUSAL by @LoserCheems in #295
- Refactor flash_dec_combine to flash_fwd_combine in forward attention … by @LoserCheems in #296
- Fix _reference_scores launch config lookup for sparse/gated and GQA by @LoserCheems in #300
- Fix window_sizes parameter handling and falsy-value default bugs by @LoserCheems in #302
Full Changelog: v2.0.1...v2.0.2