
Can the BNB quantization process be on GPU? #30770

Open
4 tasks
mxjmtxrm opened this issue May 13, 2024 · 2 comments
Comments

@mxjmtxrm

System Info

  • transformers version: 4.41.0.dev0
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.0a0+81ea7a4 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@SunMarc and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I noticed that when the quantization config is not None and is_deepspeed_zero3_enabled() is True, the device map is set to 'cpu', so the quantization process runs on CPU.
Why is this? Can the quantization be run on the GPUs instead?
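The fallback described above can be sketched as follows. This is a minimal illustration of the reported behavior, not the actual transformers source; `pick_device_map` is a hypothetical helper.

```python
# Sketch (assumption) of the device-map fallback described in this issue:
# when a quantization config is combined with DeepSpeed ZeRO-3 init,
# the model is reportedly placed on CPU instead of the GPUs.
def pick_device_map(quantization_config, zero3_enabled, requested="auto"):
    """Mimics the reported behavior; not the real transformers code path."""
    if quantization_config is not None and zero3_enabled:
        return "cpu"  # quantization then happens on CPU
    return requested

print(pick_device_map({"load_in_4bit": True}, zero3_enabled=True))   # cpu
print(pick_device_map({"load_in_4bit": True}, zero3_enabled=False))  # auto
```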

Expected behavior

--

@younesbelkada
Contributor

Hi @mxjmtxrm
Thanks for the issue! Do you have a small reproducer of the issue so we can better picture what is going on?

@mxjmtxrm
Author

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_storage=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-chat-hf',
    torch_dtype=torch.float16,
    trust_remote_code=True,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)

The command is:

accelerate launch --config_file "configs/deepspeed_config_z3.yaml" test.py

And the deepspeed_config_z3.yaml is

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The GPU memory usage during from_pretrained grows very slowly, as the quantization process is running on CPU.
The same happens with other quantization methods, like EETQ and AWQ.
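One way to confirm where the weights were actually materialized is to inspect the parameter devices after loading. `weight_devices` below is a hypothetical helper, demonstrated on a small CPU module standing in for the quantized model:

```python
import torch

def weight_devices(model):
    """Return the set of device types holding the model's parameters/buffers."""
    devs = {p.device.type for p in model.parameters()}
    devs |= {b.device.type for b in model.buffers()}
    return devs

# A tiny CPU module stands in for the quantized model here;
# on a real run, {'cpu'} would indicate everything landed on CPU.
tiny = torch.nn.Linear(4, 4)
print(weight_devices(tiny))  # {'cpu'}
```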
