Replies: 1 comment
I also found a potential correctness issue in Both kernels hardcode the FP8 E4M3 max value for clamping: output_vec[j] = fmaxf(-448.0f, fminf(output_vec[j], 448.0f)); However, the output type I have a fix ready and will submit a PR shortly.
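To illustrate the kind of fix this could mean, here is a minimal host-side C++ sketch, not the actual patch. It assumes the concern is that the kernels' output type may be an FP8 format other than E4M3 (for example E5M2, whose largest finite value is 57344 rather than 448). The tag types `Fp8E4M3Tag` and `Fp8E5M2Tag`, the trait `Fp8Limits`, and the helper `clamp_to_fp8_range` are hypothetical names invented for this example and do not come from FlashInfer.

```cpp
#include <cmath>

// Hypothetical tag types standing in for the real FP8 storage types
// (assumption: the kernel is templated on its output type).
struct Fp8E4M3Tag {};
struct Fp8E5M2Tag {};

// Trait mapping an output type to its largest finite value.
// Left undefined for unsupported types so misuse fails at compile time.
template <typename T>
struct Fp8Limits;

template <>
struct Fp8Limits<Fp8E4M3Tag> {
  static constexpr float kMax = 448.0f;  // largest finite E4M3 value
};

template <>
struct Fp8Limits<Fp8E5M2Tag> {
  static constexpr float kMax = 57344.0f;  // largest finite E5M2 value
};

// Clamp x to the representable range of OutT instead of a hardcoded 448.
template <typename OutT>
inline float clamp_to_fp8_range(float x) {
  constexpr float kMax = Fp8Limits<OutT>::kMax;
  return std::fmax(-kMax, std::fmin(x, kMax));
}
```

With this shape, `clamp_to_fp8_range<Fp8E5M2Tag>(1000.0f)` leaves the value untouched, while the E4M3 instantiation still saturates it to 448. In a real kernel the same trait could be specialized on the actual FP8 types and the helper marked as a device function.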
Hi, I'm interested in contributing to FlashInfer. I found a few TODOs/FIXMEs in the codebase and wanted to confirm they're still relevant before starting work:
- include/flashinfer/norm.cuh: // TODO(kaixih): add support for fp8 quantization if needed
- include/flashinfer/attention/persistent.cuh: // FIXME: fix the offset calculation
- include/flashinfer/sampling.cuh: // TODO: compare the speed of log2 and log

For context, I have experience with FP8 RMSNorm CUDA kernel development (merged PRs to vLLM) and would be happy to take on any of these if they're still needed.
Would appreciate any guidance on which of these would be most valuable to the project. Thanks!