
Add support for CPU offloading for quantizing bigger models on smaller GPUs #22

Merged: 6 commits merged into mit-han-lab:dev/more_models on Jul 4, 2023

Conversation

abhinavkulkarni

Hi,

This PR has the following changes:

  1. Further refinements to the dev/more_models branch to do quantization layer by layer on the GPU
  2. Using the device_map="auto" and max_memory kwargs to run LM evaluation on smaller GPUs if needed (see the sketch below)

Thanks!
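As a rough illustration of item 2 (a minimal sketch, not the exact code in awq/entry.py; the model path and memory limits are placeholders), loading an HF model with device_map="auto" plus a max_memory budget lets accelerate keep as many layers as fit on the GPU and offload the rest to CPU:

import torch
from transformers import AutoModelForCausalLM

# Cap GPU 0 at 9 GiB and allow up to 99 GiB of CPU RAM for offloaded layers
# (values mirror the --max_memory flag used later in this thread).
max_memory = {0: "9GiB", "cpu": "99GiB"}

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct",   # placeholder model path
    torch_dtype=torch.float16,
    device_map="auto",            # let accelerate split layers across GPU and CPU
    max_memory=max_memory,
    trust_remote_code=True,       # MPT checkpoints ship custom modeling code
)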

Abhinav Kulkarni added 3 commits July 1, 2023 12:23
awq/entry.py Outdated
# Init model on GPUs:
kwargs = {"device_map": "balanced", "torch_dtype": torch.float16}
# Init model on CPU:
kwargs = {"torch_dtype": torch.float16, "low_cpu_mem_usage": True}
abhinavkulkarni (Author):

low_cpu_mem_usage is needed here because loading HF models on CPU with the fp16 dtype is otherwise slow; more details here: https://huggingface.co/mosaicml/mpt-7b-instruct/discussions/6#6470dec93df93fddece5fcde
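A minimal sketch of the CPU-init path, assuming the standard transformers API (not the exact awq/entry.py code; the model path is a placeholder):

import torch
from transformers import AutoModelForCausalLM

# low_cpu_mem_usage=True builds the model with empty (meta) weights and then
# loads the checkpoint directly, skipping the costly random initialization
# that makes plain fp16 loading on CPU slow.
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct",   # placeholder model path
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)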

awq/entry.py Outdated
# Move the model to GPU (as much as possible) for LM evaluation
kwargs = {
    "torch_dtype": torch.float16,
    "device_map": "auto",
abhinavkulkarni (Author):

Let me know if you'd like me to change device_map to "balanced" instead of "auto".

Sakits (Collaborator) commented Jul 3, 2023

Hi @abhinavkulkarni,

Thank you for your work on this PR!
However, when I was reviewing and testing it, I encountered the following two issues:

  1. I received the following error when running the original model directly: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
    This seems to be because the code that moves the model to the GPU sits inside the quantization if branch; it should probably live outside that branch (see the sketch below).

  2. When running the evaluation with fake quantization, the results I obtained were identical to those of the original model. This raises the question of whether the state_dict=model.state_dict() on line 167 in entry.py is functioning as intended. Additionally, when applying AWQ to models that use GeLU, we add an extra scaling node, which may not be well supported in this context.

Have you encountered these issues as well? If I've misunderstood or missed anything, I'd appreciate your correction!
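To make point 1 concrete, here is a hypothetical, simplified sketch of the control-flow issue (function names and the toy model are placeholders, not the real helpers in awq/entry.py):

import torch
import torch.nn as nn

def fake_quantize(model: nn.Module) -> nn.Module:
    return model  # stand-in for the AWQ fake-quantization step

def run_eval_buggy(quantize: bool) -> None:
    model = nn.Linear(8, 8, dtype=torch.float16)
    if quantize:
        model = fake_quantize(model)
        model = model.cuda()  # BUG: only the quantized path reaches the GPU
    x = torch.randn(1, 8, dtype=torch.float16, device="cuda")
    model(x)  # original model stays on CPU -> "cuda:0 and cpu" RuntimeError

def run_eval_fixed(quantize: bool) -> None:
    model = nn.Linear(8, 8, dtype=torch.float16)
    if quantize:
        model = fake_quantize(model)
    model = model.cuda()  # move to GPU regardless of whether we quantized
    x = torch.randn(1, 8, dtype=torch.float16, device="cuda")
    model(x)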

abhinavkulkarni (Author) commented Jul 4, 2023

Hey @Sakits,

Thanks for your reply. I have made the necessary changes. Here's the output I got on RTX 3060 (12GB VRAM):

# LM Eval: Original model
$ python -m awq.entry --model_path mosaicml/mpt-7b-instruct \
        --max_memory 0:9GiB cpu:99GiB \
        --tasks wikitext

Time: 05:51
Results:
|  Task  |Version|    Metric     | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|10.8864|   |      |
|        |       |byte_perplexity| 1.5628|   |      |
|        |       |bits_per_byte  | 0.6441|   |      |

# LM Eval: Fake quantization
$ python -m awq.entry --model_path mosaicml/mpt-7b-instruct \
        --max_memory 0:9GiB cpu:99GiB \
        --tasks wikitext \
        --w_bit 4 --q_group_size 128 \
        --load_awq awq_cache/mpt-7b-instruct-w4-g128.pt \
        --q_backend fake

Time: 05:53
Results:
|  Task  |Version|    Metric     | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|11.2684|   |      |
|        |       |byte_perplexity| 1.5729|   |      |
|        |       |bits_per_byte  | 0.6534|   |      |

# LM Eval: Real quantization
$ python -m awq.entry --model_path mosaicml/mpt-7b-instruct \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/mpt-7b-instruct-w4-g128-awq.pt

Time: 06:52
Results:
|  Task  |Version|    Metric     | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|11.2696|   |      |
|        |       |byte_perplexity| 1.5729|   |      |
|        |       |bits_per_byte  | 0.6535|   |      |

@abhinavkulkarni abhinavkulkarni force-pushed the dev/more_models branch 2 times, most recently from 308d689 to d93bfb8 Compare July 4, 2023 05:39
@Sakits Sakits merged commit ab536fb into mit-han-lab:dev/more_models Jul 4, 2023
Sakits (Collaborator) commented Jul 4, 2023

Hi @abhinavkulkarni,

I've reviewed your changes and everything looks great!
Thank you very much for your valuable contributions to the AWQ project!

@abhinavkulkarni abhinavkulkarni deleted the dev/more_models branch July 5, 2023 06:49
@abhinavkulkarni abhinavkulkarni restored the dev/more_models branch July 9, 2023 14:43