
[Quantization] AutoGPTQ refactor and matmul combination support #694

Merged · 1 commit · Aug 25, 2023

Commits on Aug 25, 2023

  1. [Quantization] AutoGPTQ refactor and matmul combination support

    This PR refactors the AutoGPTQ integration to better align with the
    framework design. It also adds support for AutoGPTQ quantization in
    MLC LLM with matmul combination.
    
    With this PR, you can compile Llama2 with AutoGPTQ quantization using
    the following command:
    ```bash
    python -m mlc_llm.build --model=Llama-2-7b-chat-hf --quantization autogptq_llama_q4f16_1 --target cuda
    ```
    **Note that the first run may take around 10 minutes for the AutoGPTQ
    quantization computation; subsequent runs will be much quicker.**
    AutoGPTQ quantization requires the Python `auto_gptq` package at
    version 0.2.0 or later.
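    
    If the package is missing, a minimal install sketch follows (assuming
    `auto-gptq` as the PyPI package name; the exact wheel you need may
    depend on your CUDA setup):
    ```bash
    # Install AutoGPTQ from PyPI; this PR requires version 0.2.0 or later.
    pip install "auto-gptq>=0.2.0"
    ```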
    
    Co-authored-by: Lei Wang <LeiWang1999@users.noreply.github.com>
    MasterJH5574 and LeiWang1999 committed Aug 25, 2023
    Full commit SHA: 823d481