There has been a lot of progress in low-precision inference (int3/4/5) with LLMs such as LLaMA using GPTQ and llama.cpp. While we are adding quantization support to StableHLO, it would be good to ensure we can express a GPTQ-quantized LLaMA via StableHLO.
It would be awesome to add a small example of importing LLaMA from HuggingFace and exporting GPTQ-quantized StableHLO IR.
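For context on what the IR needs to express: GPTQ stores weights as low-bit integers with one scale and zero point per group of contiguous weights. The sketch below, a hypothetical illustration in NumPy, shows only that per-group int4 storage scheme and a round-to-nearest quantizer; real GPTQ additionally applies Hessian-based error compensation when choosing the rounded values. The function names and the group size of 128 are assumptions, not part of GPTQ or StableHLO.

```python
import numpy as np

def quantize_int4_per_group(w, group_size=128):
    """Illustrative per-group asymmetric int4 quantization.

    Only the storage scheme GPTQ targets is shown here; GPTQ itself
    uses Hessian-based error correction instead of plain rounding.
    """
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # int4 unsigned range: 0..15
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(groups / scale) + zero_point, 0, 15).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from int4 codes."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4096).astype(np.float32)
q, s, z = quantize_int4_per_group(w)
w_hat = dequantize(q, s, z).reshape(-1)
print(np.abs(w - w_hat).max())  # reconstruction error is bounded by the group scale
```

A StableHLO representation would need to carry exactly these three pieces of information (int4 codes, per-group scales, per-group zero points) on the weight operand.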
We have started gathering what's needed in StableHLO to represent quantization in cutting-edge LLMs. The discussion in #1535 (comment) is quite useful in that direction. As a starting point, we are exploring some of the int4 LLaMA / Vicuna torch IR while working with @powderluv and @qedawkins.
Note that GPTQ is definitely within the StableHLO Quantizer team's scope of coverage, along with other algorithms. We will update the ticket once the plans are finalized.
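To make the representation question concrete, here is a hypothetical sketch of what an int4-quantized weight operand could look like in StableHLO, using the MLIR `!quant.uniform` element type. The function name, tensor shapes, and the scale/zero-point values are made up for illustration; note that a per-tensor quantized type like this cannot by itself express GPTQ's per-group scales, which is part of what needs to be worked out.

```mlir
// Hypothetical: a dot with an int4-quantized weight (per-tensor scale shown;
// GPTQ would need per-group/blockwise quantization parameters instead).
func.func @proj(%x: tensor<1x4096xf32>) -> tensor<1x4096xf32> {
  %w = stablehlo.constant dense<...> : tensor<4096x4096x!quant.uniform<i4:f32, 1.5e-2:0>>
  ...
}
```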