Add support for CPU offloading for quantizing bigger models on smaller GPUs #22
Conversation
…o_quantize_model_weight, real_quantize_model_weight
awq/entry.py (Outdated)

```python
# Init model on GPUs:
kwargs = {"device_map": "balanced", "torch_dtype": torch.float16}
# Init model on CPU:
kwargs = {"torch_dtype": torch.float16, "low_cpu_mem_usage": True}
```
`low_cpu_mem_usage` is needed here, as loading HF models on CPU with fp16 dtype is slow; more details here: https://huggingface.co/mosaicml/mpt-7b-instruct/discussions/6#6470dec93df93fddece5fcde
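For context, a minimal sketch of loading the model on CPU with these kwargs, assuming the standard `transformers` API (the actual code in `awq/entry.py` may differ; `trust_remote_code=True` and the model path are illustrative assumptions for MPT-style checkpoints):

```python
import torch
from transformers import AutoModelForCausalLM

# Load on CPU in fp16. low_cpu_mem_usage=True loads checkpoint weights directly
# instead of first building a randomly initialized model and then copying the
# state dict, which is what makes the default CPU load path slow.
kwargs = {"torch_dtype": torch.float16, "low_cpu_mem_usage": True}
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct",  # example model path
    trust_remote_code=True,
    **kwargs,
)
```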
awq/entry.py (Outdated)

```python
# Move the model to GPU (as much as possible) for LM evaluation
kwargs = {
    "torch_dtype": torch.float16,
    "device_map": "auto",
```
Let me know if you would want me to change `device_map` to `balanced` instead of `auto`.
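For reference, a hedged sketch of what this loading path looks like with `device_map` plus a `max_memory` budget (values and model path are illustrative; the PR presumably builds the budget from the `--max_memory` flag):

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative budget: at most 9 GiB on GPU 0, overflow layers offloaded to CPU RAM.
max_memory = {0: "9GiB", "cpu": "99GiB"}

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct",   # example model path
    torch_dtype=torch.float16,
    device_map="auto",            # "balanced" would instead spread layers evenly across GPUs
    max_memory=max_memory,
    trust_remote_code=True,
)
```

With `device_map="auto"`, accelerate fills the GPU budget first and offloads the remainder to CPU, which is the behavior wanted for LM evaluation on a single small GPU.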
Hi @abhinavkulkarni, Thank you for your work on this PR!
Have you encountered these issues as well? If I've misunderstood or missed anything, I'd appreciate your correction!
Force-pushed from 843d0e0 to 6d618d0
Hey @Sakits, thanks for your reply. I have made the necessary changes. Here's the output I got on an RTX 3060 (12GB VRAM):

```
# LM Eval: Original model
$ python -m awq.entry --model_path mosaicml/mpt-7b-instruct \
    --max_memory 0:9GiB cpu:99GiB \
    --tasks wikitext

Time: 05:51
Results:
|  Task  |Version|    Metric     | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|10.8864|   |      |
|        |       |byte_perplexity| 1.5628|   |      |
|        |       |bits_per_byte  | 0.6441|   |      |

# LM Eval: Fake quantization
$ python -m awq.entry --model_path mosaicml/mpt-7b-instruct \
    --max_memory 0:9GiB cpu:99GiB \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/mpt-7b-instruct-w4-g128.pt \
    --q_backend fake

Time: 05:53
Results:
|  Task  |Version|    Metric     | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|11.2684|   |      |
|        |       |byte_perplexity| 1.5729|   |      |
|        |       |bits_per_byte  | 0.6534|   |      |

# LM Eval: Real quantization
$ python -m awq.entry --model_path mosaicml/mpt-7b-instruct \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/mpt-7b-instruct-w4-g128-awq.pt

Time: 06:52
Results:
|  Task  |Version|    Metric     | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|11.2696|   |      |
|        |       |byte_perplexity| 1.5729|   |      |
|        |       |bits_per_byte  | 0.6535|   |      |
```
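Aside: a hypothetical helper showing how `--max_memory 0:9GiB cpu:99GiB` could be turned into the dictionary that `from_pretrained` expects (the actual parsing in `awq/entry.py` may look different):

```python
def parse_max_memory(entries):
    """Turn ['0:9GiB', 'cpu:99GiB'] into {0: '9GiB', 'cpu': '99GiB'}."""
    max_memory = {}
    for entry in entries:
        device, _, limit = entry.partition(":")
        # Integer keys are GPU indices; non-numeric keys (e.g. "cpu") stay as strings.
        max_memory[int(device) if device.isdigit() else device] = limit
    return max_memory

print(parse_max_memory(["0:9GiB", "cpu:99GiB"]))  # {0: '9GiB', 'cpu': '99GiB'}
```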
Force-pushed from 6d618d0 to cb71dfe
Force-pushed from 308d689 to d93bfb8
Force-pushed from d93bfb8 to e04d0ec
Hi @abhinavkulkarni, I've reviewed your changes and everything looks great!
Hi,

This PR has the following changes:

- `dev/more_models` to do quantization layer by layer on GPU (see the sketch below)
- `device_map="auto"` and `max_memory` kwargs to do LM Evaluation on smaller GPUs if needed

Thanks!