
[Quantization] AutoGPTQ refactor and matmul combination support #694

Merged 1 commit into mlc-ai:main on Aug 25, 2023

Conversation

@LeiWang1999 (Contributor) commented on Aug 8, 2023

This PR refactors the AutoGPTQ integration to better align with the framework design. It also adds support for AutoGPTQ quantization in MLC LLM with matmul combination.

With this PR, you will be able to compile Llama2 using the following command:

python -m mlc_llm.build --model=Llama-2-7b-chat-hf --quantization autogptq_llama_q4f16_1 --target cuda

to use the AutoGPTQ quantization. Note that the first run may take around 10 minutes for the AutoGPTQ quantization computation; subsequent runs will be much quicker. AutoGPTQ quantization requires the Python `auto_gptq` package, version 0.2.0 or later.
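For an end-to-end run, here is a minimal sketch (assuming a CUDA machine and that the dependency is published on PyPI as `auto-gptq`, the distribution name for the `auto_gptq` module): install the quantization dependency at the required version, then invoke the build command from above.

```shell
# Minimal sketch: install the AutoGPTQ dependency (>= 0.2.0 as required above),
# then build Llama-2-7b-chat-hf with the autogptq_llama_q4f16_1 quantization for CUDA.
pip install "auto-gptq>=0.2.0"
python -m mlc_llm.build --model=Llama-2-7b-chat-hf --quantization autogptq_llama_q4f16_1 --target cuda
```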

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

@LeiWang1999 (Contributor, Author) commented:

Please cc @MasterJH5574

Review thread on mlc_llm/core.py (outdated, resolved)
@MasterJH5574 changed the title from "[Param Manager] Combined Matmul Support for auto-gptq Quant Spec" to "[Quantization] AutoGPTQ refactor and matmul combination support" on Aug 15, 2023
@MasterJH5574 force-pushed the lei/gptq-combined branch 2 times, most recently from 8e0400a to b5c1162, on August 16, 2023
@LeiWang1999 (Contributor, Author) commented:

LGTM, thanks for your hard work on this PR @MasterJH5574!

@MasterJH5574 merged commit 5fe6344 into mlc-ai:main on Aug 25, 2023
@Lurrobert commented on Nov 3, 2023

Do you also think the `auto_gptq` module should be added to the requirements?

Unfortunately, it is not working on macOS: AutoGPTQ/AutoGPTQ#299
