Feature request
Transformers currently supports Flash Attention 2 and 3, but not Flash Attention 4. Users with compatible hardware and the latest flash-attn package cannot leverage FA4's improvements. Let's add support for it as well.
Motivation
Flash Attention 4 is now available in the flash-attn package, bringing significant improvements over FA2/FA3:
- 30-50x faster compilation, thanks to JIT compilation from Python
- Eliminates binary wheel distribution issues - no more platform-specific wheels needed
- Optimized for modern GPUs - specifically tuned for Hopper (H100/H200) and Blackwell architectures
- Cleaner implementation - leverages CuTe's high-level DSL abstractions
Reference: https://x.com/StasBekman/status/1993060880150675700
Your contribution
Add comprehensive FA4 support to transformers:
- Detection: Add an `is_flash_attn_4_available()` function to check for FA4 and the hardware requirements (SM 8.0+); see the sketch below
- Import logic: Handle the `flash_attn.cute` import path
- API compatibility: Use runtime introspection to handle API differences between FA4 and FA2/FA3
- Auto-selection: Include FA4 in automatic attention implementation selection with highest priority
- Registration: Register `flash_attention_4` in `AttentionInterface` (also sketched below)
- Testing: Add test suite and validation scripts
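
As a starting point, here is a minimal sketch of the detection helper and the `AttentionInterface` registration. `AttentionInterface.register` and the custom-attention callback signature follow the existing transformers attention interface; the `flash_attn.cute` import path is taken from the FA4 package layout mentioned above, but the exact kernel entry point (`flash_attn_func` below) and its signature are assumptions that would need to be verified against the installed flash-attn release. Padding/varlen handling, dropout, and sliding-window support are omitted here.

```python
from typing import Optional
import importlib.util

import torch
from transformers import AttentionInterface


def is_flash_attn_4_available() -> bool:
    """Best-effort check for a usable FA4 (CuTe DSL) install on SM 8.0+ hardware."""
    try:
        # FA4 is expected to live under a CuTe-based submodule of flash-attn.
        if importlib.util.find_spec("flash_attn") is None:
            return False
        if importlib.util.find_spec("flash_attn.cute") is None:
            return False
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8  # SM 8.0+ gate proposed above


def flash_attention_4_forward(
    module: torch.nn.Module,
    query: torch.Tensor,  # (batch, num_heads, seq_len, head_dim)
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: Optional[float] = None,
    dropout: float = 0.0,
    **kwargs,
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    # Hypothetical FA4 entry point -- verify the real module path and signature.
    from flash_attn.cute import flash_attn_func  # assumption

    # Flash-attention kernels take (batch, seq_len, num_heads, head_dim).
    q = query.transpose(1, 2)
    k = key.transpose(1, 2)
    v = value.transpose(1, 2)

    # Padding/varlen handling and dropout are omitted from this sketch.
    out = flash_attn_func(
        q, k, v,
        softmax_scale=scaling,
        causal=getattr(module, "is_causal", True),
    )
    # The attention interface expects (attn_output, attn_weights); flash-attention
    # kernels never materialize the weights, so return None for them.
    return out, None


if is_flash_attn_4_available():
    AttentionInterface.register("flash_attention_4", flash_attention_4_forward)
```

Once registered, models could opt in with `attn_implementation="flash_attention_4"` in `from_pretrained`, and the auto-selection logic could prefer it over `flash_attention_2`/`flash_attention_3` whenever `is_flash_attn_4_available()` returns True.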