Description
Optimize kernel performance for configurations with a small batch size (B), number of heads (H), or sequence length (S).
Context
Current kernels are tuned primarily for large-scale configurations. Many real-world inference and fine-tuning workloads run with small B, H, or S values, where kernel launch overhead and underutilization of streaming multiprocessors (SMs) become significant. Targeted optimizations for these cases can improve performance in production serving scenarios.
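To make the underutilization concrete, here is a rough back-of-the-envelope sketch. It assumes a launch scheme with one thread block per (batch, head, sequence tile), a tile of 128 rows, and a GPU with 108 SMs (the A100 count) running one block per SM at a time. The function name `sm_utilization` and all defaults are hypothetical and chosen for illustration; the actual kernels may map work to blocks differently.

```python
import math


def sm_utilization(batch, heads, seq_len, tile=128, num_sms=108, blocks_per_sm=1):
    """Estimate how well one kernel launch fills the GPU.

    Assumption (not taken from the kernels in this repo): the grid has one
    thread block per (batch, head, sequence tile).
    Returns (number of thread blocks, fraction of block slots kept busy).
    """
    blocks = batch * heads * math.ceil(seq_len / tile)
    slots = num_sms * blocks_per_sm          # blocks that can run concurrently
    waves = math.ceil(blocks / slots)        # full or partial passes over the GPU
    return blocks, blocks / (waves * slots)


if __name__ == "__main__":
    # Compare a small serving-style shape with a large training-style shape.
    for b, h, s in [(1, 8, 128), (2, 16, 512), (32, 32, 4096)]:
        blocks, util = sm_utilization(b, h, s)
        print(f"B={b} H={h} S={s}: {blocks} blocks, {util:.1%} of SM slots used")
```

Under these assumptions, B=1, H=8, S=128 launches only 8 thread blocks, so roughly 7% of the 108 SMs have any work, and fixed launch overhead is amortized over very little computation. This is the regime where a different work decomposition (for example, splitting along another dimension) may be worth exploring.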
Tasks