Fix FlashInfer when not using prefix caching#633

Merged
tgaddair merged 6 commits into main from fix-flashinfer-graph on Oct 4, 2024
Conversation


@tgaddair tgaddair commented Oct 4, 2024

Fixes #625.

Additionally:

  • Removes the unsupported bgmv rank 96 from CUDA graph traces
  • Skips the batch total-tokens check when prefix caching is enabled, relying on the block allocator instead
  • Fixes FP8 support for Qwen2
  • Fixes FlashInfer / prefix caching support with CUDA graphs
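For context on the linked error ("paged_kv_indices must be a 1D tensor"): FlashInfer describes the paged KV cache with a flat 1D `paged_kv_indices` tensor plus a CSR-style `paged_kv_indptr`, so per-sequence page tables must be flattened before the attention call. The helper below is a hypothetical illustration of that layout, not the code from this PR:

```python
def flatten_page_tables(page_tables):
    """Flatten per-sequence KV-cache page tables into (indices, indptr).

    indices: 1D concatenation of every sequence's page ids.
    indptr:  indptr[i]:indptr[i+1] spans sequence i's pages in `indices`.
    """
    indices = []
    indptr = [0]
    for pages in page_tables:
        indices.extend(pages)
        indptr.append(len(indices))
    return indices, indptr

# Example: three sequences holding 2, 1, and 3 KV-cache pages each.
indices, indptr = flatten_page_tables([[0, 1], [4], [2, 5, 7]])
# indices == [0, 1, 4, 2, 5, 7]; indptr == [0, 2, 3, 6]
```

Passing a nested (2D) structure in place of the flat `indices` list is the kind of shape mismatch the error message reports.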

@tgaddair tgaddair merged commit d513ee8 into main Oct 4, 2024
@tgaddair tgaddair deleted the fix-flashinfer-graph branch October 4, 2024 21:50


Development

Successfully merging this pull request may close these issues.

flashinfer backend raises RuntimeError: paged_kv_indices must be a 1D tensor
