Small B/H/S optimizations #11

@icavan

Description

Optimize kernel performance for small batch size (B), number of heads (H), and sequence length (S) configurations.

Context

Current kernels are tuned primarily for large configurations. Many real-world inference and fine-tuning workloads run with small B, H, or S values, where the launch grid can contain too few thread blocks to occupy every SM and kernel launch overhead becomes a significant share of total runtime. Targeted optimizations for these cases should improve latency and throughput in production serving scenarios.
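To make the SM underutilization concern concrete, here is a rough back-of-the-envelope sketch. It assumes a hypothetical launch scheme with one thread block per (batch, head, query tile), a tile of 128 rows, one resident block per SM, and a 108-SM GPU (the SM count of an A100). None of these values come from this repository's kernels; they are placeholders for illustration only.

```python
import math


def grid_blocks(batch, heads, seq_len, tile_m=128):
    """Thread blocks launched, assuming one block per (batch, head, query tile)."""
    return batch * heads * math.ceil(seq_len / tile_m)


def sm_utilization(batch, heads, seq_len, num_sms=108, tile_m=128):
    """Fraction of SM slots doing work, assuming one resident block per SM.

    Blocks are scheduled in waves of `num_sms`; a partially filled wave
    leaves the remaining SMs idle.
    """
    blocks = grid_blocks(batch, heads, seq_len, tile_m)
    waves = math.ceil(blocks / num_sms)
    return blocks / (waves * num_sms)


if __name__ == "__main__":
    # Small configuration: only 8 blocks for 108 SMs.
    print(f"B=1  H=8  S=128  -> {sm_utilization(1, 8, 128):.3f}")
    # Large configuration: thousands of blocks, nearly full waves.
    print(f"B=32 H=16 S=2048 -> {sm_utilization(32, 16, 2048):.3f}")
```

Under these assumptions, B=1, H=8, S=128 launches 8 blocks and leaves most SMs idle, while B=32, H=16, S=2048 launches 8192 blocks and fills almost every wave. Real occupancy also depends on registers, shared memory, and how many blocks fit per SM, so profiling is still needed to confirm the picture.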

Tasks

  • Profile kernel performance on small B/H/S configurations
  • Identify bottlenecks (launch overhead, SM underutilization, etc.)
  • Implement specialized kernel variants or tiling strategies for small configurations
  • Add benchmarks covering small B/H/S cases
  • Validate correctness of the optimized kernels against the existing implementation
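The benchmarking and validation tasks above could start from a harness like the following. This is a minimal sketch, not the project's benchmark suite: the sweep values, `placeholder_kernel`, and the tolerances are all hypothetical, and the timer measures host wall-clock time only. For a real GPU kernel, the device would need to be synchronized before each timer read.

```python
import itertools
import statistics
import time


def small_configs(batches=(1, 2, 4), heads=(1, 4, 8), seq_lens=(16, 64, 128)):
    """Cartesian product of small (B, H, S) values to sweep (hypothetical ranges)."""
    return list(itertools.product(batches, heads, seq_lens))


def benchmark(fn, config, warmup=3, iters=10):
    """Median wall-clock seconds of fn(*config) after a few warmup calls."""
    for _ in range(warmup):
        fn(*config)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*config)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def allclose(actual, expected, rtol=1e-3, atol=1e-5):
    """Elementwise check: |a - e| <= atol + rtol * |e| for flat sequences."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


def placeholder_kernel(batch, heads, seq_len):
    """Stand-in for a real kernel call; returns a flat list of floats."""
    return [float(i) for i in range(batch * heads * seq_len)]


if __name__ == "__main__":
    for cfg in small_configs():
        # Compare the candidate against a reference (here, the same stand-in).
        assert allclose(placeholder_kernel(*cfg), placeholder_kernel(*cfg))
        print(cfg, f"{benchmark(placeholder_kernel, cfg) * 1e6:.1f} us")
```

Using the median rather than the mean keeps a single slow iteration from skewing results, which matters for small configurations where each run is short.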
