
Conversation

@Valentine233 Valentine233 commented Sep 3, 2025

This PR adds support for the FP8 KV cache during the prefill phase and ensures correct functionality.
For performance, we do not see any gain yet because the FP8 conversion-related instructions are not yet available.
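For context, here is a rough sketch of the approach, not the PR's exact code: FP8 Q/K register fragments are widened to half precision before the MMA, since there is no native FP8 path on current hardware. It assumes upstream CUTLASS/cute helpers (`make_fragment_like`, `cute::transform`, `cute::gemm`); the fragment names and the conversion lambda are placeholders.

```cpp
// Illustrative sketch only. Fragment names and the conversion lambda are
// placeholders; the real kernel lives in the files changed by this PR.
auto to_half = [](auto const& v) {
  // Software FP8 -> FP16 widening (the slow part until native instructions exist).
  return static_cast<half_t>(static_cast<float>(v));
};

if constexpr (is_fp8_v<ElementQ> && is_fp8_v<ElementK>) {
  auto tCrQ_fp16 = make_fragment_like<half_t>(tCrQ);   // same layout, FP16 storage
  auto tCrK_fp16 = make_fragment_like<half_t>(tCrK);
  cute::transform(tCrQ, tCrQ_fp16, to_half);            // element-wise convert in registers
  cute::transform(tCrK, tCrK_fp16, to_half);
  cute::gemm(tiled_mma, accum, tCrQ_fp16, tCrK_fp16, frag_src);
} else {
  cute::gemm(tiled_mma, accum, tCrQ, tCrK, frag_src);
}
```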

@Valentine233 Valentine233 marked this pull request as draft September 3, 2025 06:01
@Valentine233 Valentine233 marked this pull request as ready for review September 9, 2025 08:46
@Valentine233
Author

@tdeng5 Hi, I see that you have canceled the CI runs. Could you please explain the reason?

@rolandschulz

Our CI was overloaded. I have just restarted the runs.

@@ -0,0 +1,122 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 Codeplay Software Ltd. All rights reserved.

Please revise the copyright to include Intel.

Author

Modified, thanks!

copy(gmem_tiled_copy_k, tKgK(_,_,_,k_tile), tKrK);
cute::gemm(tiled_mma, accum, tCrQ, tCrK, frag_src);
if constexpr (is_fp8_v<ElementQ> && is_fp8_v<ElementK>) {
auto tCrQ_ = make_fragment_like<half_t>(tCrQ);

Could you please use more descriptive variable names? It seems that all FP8-related tensors are suffixed with "_"; is there a reason for that? Could you use a descriptive suffix instead?

Author

Thanks for the suggestion. The suffix has been changed to _fp16 to indicate that the dtype is FP16.

@pengzhao-intel

Any performance data for this PR?

@Valentine233
Author

Any performance data for this PR?

Thanks for the review!
Currently, we see a performance regression with the FP8 KV cache compared to the BF16 one.
This is caused by the slow conversion between FP8 and FP16: the FP8 conversion-related instructions are only available on the Xe3 architecture. We will continue to tune the performance once these instructions are available. For now, only the functionality is ready.
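To make the cost concrete, here is a minimal scalar sketch (not from the PR, function name hypothetical) of what a software FP8 (E4M3) to FP16 conversion has to do per element when no native conversion instruction is available:

```cpp
#include <cstdint>

// Minimal illustrative sketch, not the PR's code: convert one FP8 E4M3FN value
// to FP16 bits purely with integer bit manipulation.
uint16_t fp8_e4m3_to_fp16_bits(uint8_t x) {
  uint16_t sign = static_cast<uint16_t>(x & 0x80) << 8;  // move sign to bit 15
  uint8_t  exp  = (x >> 3) & 0x0F;                       // 4-bit exponent, bias 7
  uint8_t  man  = x & 0x07;                              // 3-bit mantissa

  if (exp == 0x0F && man == 0x07)                        // E4M3FN NaN encoding
    return sign | 0x7E00;                                // a quiet FP16 NaN
  if (exp == 0) {
    if (man == 0) return sign;                           // signed zero
    // Subnormal: value = man * 2^-9; renormalize into an FP16 normal number.
    int shift = 0;
    while ((man & 0x04) == 0) { man <<= 1; ++shift; }
    return sign | static_cast<uint16_t>((8 - shift) << 10)   // FP16 exponent field
                | static_cast<uint16_t>((man & 0x03) << 8);  // remaining mantissa bits
  }
  // Normal: re-bias the exponent (7 -> 15) and widen the mantissa (3 -> 10 bits).
  return sign | static_cast<uint16_t>((exp + 8) << 10)
              | (static_cast<uint16_t>(man) << 7);
}
```

A native conversion instruction replaces all of this with a single operation, which is why the performance tuning is deferred until those instructions can be used.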

@Valentine233
Author

Hi @rolandschulz, could you help re-trigger the canceled CI? Thanks!

@Valentine233
Author

@pengzhao-intel @Antonyvance Could you please help review this again?

@rolandschulz

The BMG machine is offline at the moment.

@Valentine233
Author

Hi @rolandschulz, has the BMG machine come back?

@rolandschulz

No, but we are merging PRs with just PVC passing, so this is only blocked by reviews, not by the BMG CI.

@Valentine233
Author

No, but we are merging PRs with just PVC passing, so this is only blocked by reviews, not by the BMG CI.

@rolandschulz, I saw some CI issues on BMG before, so it would be better to have the BMG CI pass for this PR.
Please help re-trigger the CI when the machine is back. Thanks!

@Valentine233
Author

@pengzhao-intel @Antonyvance Could you please review again?

@Valentine233
Author

Hi @rolandschulz, is the BMG machine back?
