[Quantization] AutoGPTQ refactor and matmul combination support
This PR refactors the AutoGPTQ integration to better align with the framework design. It also adds support for AutoGPTQ quantization with matmul combination in MLC LLM.

With this PR, you can compile Llama2 with AutoGPTQ quantization using the following command:

```shell
python -m mlc_llm.build --model=Llama-2-7b-chat-hf --quantization autogptq_llama_q4f16_1 --target cuda
```

**Note that the first run may take around 10 minutes for the AutoGPTQ quantization computation; subsequent runs will be much quicker.**

AutoGPTQ quantization requires the Python `auto_gptq` package, version 0.2.0 or later.

Co-authored-by: Lei Wang <LeiWang1999@users.noreply.github.com>
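As a minimal setup sketch (assuming a standard pip environment; the exact install method for your setup may differ), the prerequisite can be satisfied with:

```shell
# Install the auto_gptq package needed by the autogptq_llama_q4f16_1 quantization.
# The version constraint follows the requirement stated above.
pip install "auto_gptq>=0.2.0"
```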