Fix FlashInfer when not using prefix caching#633

Merged
tgaddair merged 6 commits into main from fix-flashinfer-graph on Oct 4, 2024
Conversation


@tgaddair tgaddair commented Oct 4, 2024

Fixes #625.

Additionally:

  • Removes the unsupported bgmv rank 96 from CUDA graph traces
  • Skips the batch total-tokens check when prefix caching is enabled, relying on the block allocator instead
  • Fixes FP8 support for Qwen2
  • Fixes FlashInfer / prefix caching support with CUDA graphs
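For context on the linked error ("paged_kv_indices must be a 1D tensor"): FlashInfer describes the paged KV cache with a flat 1D `paged_kv_indices` tensor plus a CSR-style `paged_kv_indptr`, so per-sequence page tables must be flattened before the attention call. The helper below is a hypothetical illustration of that layout, not the code from this PR:

```python
def flatten_page_tables(page_tables):
    """Flatten per-sequence KV-cache page tables into (indices, indptr).

    indices: 1D concatenation of every sequence's page ids.
    indptr:  indptr[i]:indptr[i+1] spans sequence i's pages in `indices`.
    """
    indices = []
    indptr = [0]
    for pages in page_tables:
        indices.extend(pages)
        indptr.append(len(indices))
    return indices, indptr

# Example: three sequences holding 2, 1, and 3 KV-cache pages each.
indices, indptr = flatten_page_tables([[0, 1], [4], [2, 5, 7]])
# indices == [0, 1, 4, 2, 5, 7]; indptr == [0, 2, 3, 6]
```

Passing a nested (2D) structure in place of the flat `indices` list is the kind of shape mismatch the error message reports.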

@tgaddair tgaddair merged commit d513ee8 into main Oct 4, 2024
@tgaddair tgaddair deleted the fix-flashinfer-graph branch October 4, 2024 21:50


Development

Successfully merging this pull request may close these issues.

flashinfer backend raises RuntimeError: paged_kv_indices must be a 1D tensor
