Unable to get KV cache from GPU when KV cache is quantized to q8_0 #34555
Unanswered
KarSri7694 asked this question in Q&A
Replies: 0 comments
Hello everyone,
I am currently building a prototype pipeline for the GUI Agent project (GSoC Project 1) and have run into a blocker I'd appreciate some guidance on.
As part of the pipeline, I need to capture a snapshot of the KV cache before the image input is processed. To handle longer contexts more efficiently, I attempted to quantize the KV cache to q8_0.
Issue
With q8_0 quantization enabled, get_state and set_state operations on the KV cache tensors fail with the following error:
It appears that KV cache quantization and the state snapshot API are mutually exclusive in the current implementation: get_state and set_state do not work when the KV cache is quantized.
What Works
Setting the KV cache precision to f16 resolves the issue: get_state and set_state operate as expected under that configuration.
Question
Is there a known workaround for capturing a snapshot of a quantized (q8_0) KV cache, or is falling back to f16 precision currently the only viable option? Any pointers to relevant internals or planned support for this would be greatly appreciated.
Thank you in advance for your help!
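For context, here is a minimal sketch of what I mean by snapshotting a quantized cache. It assumes q8_0 refers to the common block layout (32 values per block, one scale per block, int8 quants). The function names and the list-based cache are hypothetical and purely illustrative; they are not the runtime's get_state/set_state API, and the real format stores the scale as f16, which this sketch skips.

```python
import copy

BLOCK = 32  # assumed block size for a q8_0-style layout


def quantize_q8_0(values):
    """Quantize floats into a list of (scale, int8 quants) blocks."""
    if len(values) % BLOCK != 0:
        raise ValueError("length must be a multiple of BLOCK")
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        # One scale per block; an all-zero block keeps scale 0.
        scale = amax / 127.0 if amax > 0 else 0.0
        quants = [
            max(-127, min(127, round(v / scale))) if scale else 0
            for v in chunk
        ]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Rebuild approximate floats from (scale, quants) blocks."""
    return [scale * q for scale, quants in blocks for q in quants]


def snapshot(blocks):
    """Hypothetical snapshot: copy scales and quants together."""
    return copy.deepcopy(blocks)


def restore(saved):
    """Hypothetical restore: return an independent copy of the snapshot."""
    return copy.deepcopy(saved)


# Stand-in for cache contents before the image input is processed.
values = [i / 10 - 3.2 for i in range(2 * BLOCK)]
cache = quantize_q8_0(values)
saved = snapshot(cache)

# Simulate the cache being overwritten by later tokens, then roll back.
cache[0] = (0.0, [0] * BLOCK)
cache = restore(saved)
```

The point of the sketch is only that a quantized block carries two pieces of data (scale and quants), so a snapshot has to preserve both without converting back to f16. Whether the state API can expose the cache in that form is exactly what I am asking about.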