Small B/H/S optimizations

### Description

Optimize kernel performance for small batch size (B), number of heads (H), and sequence length (S) configurations.

### Context

Current kernels are primarily tuned for large-scale configurations. Many real-world inference and fine-tuning workloads run with small B, H, or S values, where kernel launch overhead and underutilization of streaming multiprocessors (SMs) become significant. Targeted optimizations for these cases can improve performance in production serving scenarios.
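To make the SM underutilization concern concrete, here is a rough back-of-the-envelope sketch. It assumes a launch grid of one thread block per (batch, head, sequence tile) and at most one resident block per SM; neither assumption comes from this issue, and the real kernels may map work differently. The function name, the tile size of 128 rows, and the default of 108 SMs (the count on an NVIDIA A100) are illustrative choices only.

```python
import math


def estimate_sm_utilization(batch, heads, seq_len, tile_rows=128, num_sms=108):
    """Return (thread_blocks, average fraction of SMs busy across waves).

    Assumptions (hypothetical, not taken from the kernels in this repo):
      - one thread block per (batch, head, sequence tile of `tile_rows` rows)
      - at most one resident thread block per SM
    """
    blocks = batch * heads * math.ceil(seq_len / tile_rows)
    # A "wave" is one round of blocks filling the available SMs.
    waves = math.ceil(blocks / num_sms)
    utilization = blocks / (waves * num_sms)
    return blocks, utilization


if __name__ == "__main__":
    for b, h, s in [(1, 8, 128), (4, 8, 256), (32, 32, 2048)]:
        blocks, util = estimate_sm_utilization(b, h, s)
        print(f"B={b} H={h} S={s}: {blocks} blocks, ~{util:.1%} of SMs busy")
```

Under these assumptions, B=1, H=8, S=128 launches only 8 blocks, so most of the 108 SMs sit idle, while B=32, H=32, S=2048 nearly fills every wave. Splitting work along an additional dimension is one possible direction for the small cases, but which strategy fits these kernels is left to the profiling task.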

### Tasks

- [ ] Profile kernel performance on small B/H/S configurations
- [ ] Identify bottlenecks (launch overhead, SM underutilization, etc.)
- [ ] Implement specialized kernel variants or tiling strategies for small configurations
- [ ] Add benchmarks covering small B/H/S cases
- [ ] Validate correctness
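
As a starting point for the profiling and benchmark tasks above, a sweep over small configurations could look roughly like the sketch below. The specific B/H/S values are placeholders (the issue does not define which sizes count as "small"), and `small_configs` and `time_config` are hypothetical helpers, not existing functions in this repository. The timer measures host wall-clock time, so a GPU kernel wrapper passed as `fn` would need to synchronize the device before returning for the numbers to be meaningful.

```python
import itertools
import statistics
import time

# Placeholder sweep values; adjust to the sizes seen in real workloads.
SMALL_B = (1, 2, 4)
SMALL_H = (1, 4, 8)
SMALL_S = (16, 64, 256)


def small_configs():
    """Enumerate every (B, H, S) combination in the sweep."""
    return list(itertools.product(SMALL_B, SMALL_H, SMALL_S))


def time_config(fn, config, warmup=3, iters=10):
    """Return the median wall-clock seconds of fn(*config).

    `fn` is any callable taking (B, H, S). Warmup runs are discarded so that
    one-time costs (compilation, caching) do not skew the result.
    """
    for _ in range(warmup):
        fn(*config)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*config)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload so the script runs without a GPU.
    def dummy_kernel(b, h, s):
        return sum(range(b * h * s))

    for cfg in small_configs():
        print(cfg, f"{time_config(dummy_kernel, cfg) * 1e6:.1f} us")
```

The median is used rather than the mean because very short runs are sensitive to outliers from scheduling noise.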
