ValueError: Trying to set a tensor of shape torch.Size([43176, 8192]) in "weight" (which has shape torch.Size([0])), this look incorrect. #1890

Closed
KarasZhang opened this issue Jun 27, 2024 · 15 comments

Comments

@KarasZhang

KarasZhang commented Jun 27, 2024

System Info

bitsandbytes==0.43.1
peft==0.11.0
accelerate==0.31.0
transformers==4.38.2
trl==0.9.4

Who can help?

@BenjaminBossan @sayakpaul

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Hello,
I have 8 NVIDIA H100 GPUs and am trying to do some training with QLoRA and DeepSpeed ZeRO-3.
I'm using the code from examples/sft/train.py but have had no luck.

script:
!accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml" train.py \
--seed 100 \
--model_name_or_path "tokyotech-llm/Swallow-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 4096 \
--num_train_epochs 3 \
--logging_steps 1 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "mistral-sft-lora-multigpu" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--gradient_checkpointing True \
--report_to "tensorboard" \
--use_reentrant False \
--dataset_text_field "content" \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization True \
--use_nested_quant True \
--bnb_4bit_compute_dtype "bfloat16" \
--use_flash_attn False

I got the error below:
ValueError: Trying to set a tensor of shape torch.Size([43176, 8192]) in "weight" (which has shape torch.Size([0])), this look incorrect.

Expected behavior

I'm not sure, but train.py should work without flash attention.

@BenjaminBossan
Member

This error commonly occurs when trying to set the data of a weight that has been sharded to a different device. However, based on the information you have given, it's impossible to say why that happens. If possible, please share the full error message, the content of train.py (or, even better, a minimal code example), and the content of deepspeed_config_z3_qlora.yaml.
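
For illustration, here is a minimal sketch of that failure mode (not code from your script, just accelerate's shape check in set_module_tensor_to_device): under ZeRO-3 initialization each rank holds only an empty placeholder for every parameter, so copying a full checkpoint tensor into it trips exactly this error.

import torch
from accelerate.utils import set_module_tensor_to_device

module = torch.nn.Module()
# What ZeRO-3 init leaves behind on each rank: an empty, shape-[0] placeholder.
module.weight = torch.nn.Parameter(torch.empty(0))

# The full tensor coming out of the checkpoint shard.
full_weight = torch.empty(43176, 8192)

# Raises the same ValueError as in the logs above (shape [43176, 8192] vs. shape [0]).
set_module_tensor_to_device(module, "weight", "cpu", value=full_weight)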

@KarasZhang
Author

@BenjaminBossan
Thanks for your reply.
Here are the error logs:

 (app-root) accelerate launch --config_file configs/deepspeed_config_z3_qlora.yaml  qlora_train.py \
--seed 100 \
--model_name_or_path "tokyotech-llm/Swallow-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 4096 \
--num_train_epochs 3 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "llama-sft-qlora-dsz3" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 4 \
--gradient_checkpointing True \
--report_to "tensorboard" \
--use_reentrant True \
--dataset_text_field "content" \
--use_flash_attn False \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization True \
--use_nested_quant True \
--bnb_4bit_compute_dtype "bfloat16" \
--bnb_4bit_quant_storage_dtype "bfloat16"

[2024-06-27 09:21:01,622] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2024-06-27 09:21:10,921] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 09:21:11,747] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 09:21:11,969] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 09:21:12,053] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 09:21:12,249] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 09:21:12,446] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 09:21:12,460] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 09:21:12,463] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-27 09:21:12,733] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 09:21:13,594] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 09:21:13,752] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 09:21:13,981] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 09:21:14,084] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 09:21:14,430] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 09:21:14,430] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-27 09:21:14,467] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-27 09:21:14,508] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/app-root/lib64/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
`low_cpu_mem_usage` was None, now set to True since model is quantized.
[2024-06-27 09:21:24,141] [INFO] [partition_parameters.py:343:__exit__] finished initializing model - num_params = 723, num_elems = 69.16B
Loading checkpoint shards:   0%|                                                                                               | 0/15 [00:00<?, ?it/s]/opt/app-root/lib64/python3.9/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:   0%|                                                                                               | 0/15 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/opt/app-root/src/ds_qlora/qlora_train.py", line 168, in <module>
    main(model_args, data_args, training_args)
  File "/opt/app-root/src/ds_qlora/qlora_train.py", line 113, in main
    model, peft_config, tokenizer = create_and_prepare_model(model_args, data_args, training_args)
  File "/opt/app-root/src/ds_qlora/qlora_utils.py", line 131, in create_and_prepare_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/app-root/lib64/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/app-root/lib64/python3.9/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/app-root/lib64/python3.9/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/app-root/lib64/python3.9/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/app-root/lib64/python3.9/site-packages/accelerate/utils/modeling.py", line 358, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([43176, 8192]) in "weight" (which has shape torch.Size([0])), this look incorrect.

(every rank prints the same traceback and ValueError)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 31388 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 31389 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 31390 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 31391 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 31392 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 31393 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 31394 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 31387) of binary: /opt/app-root/bin/python3.9
Traceback (most recent call last):
  File "/opt/app-root/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/app-root/lib64/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/app-root/lib64/python3.9/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    deepspeed_launcher(args)
  File "/opt/app-root/lib64/python3.9/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
    distrib_run.run(args)
  File "/opt/app-root/lib64/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/app-root/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/app-root/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-27_09:21:28
  host      : xxxxxx
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 31387)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

and the train.py:

import os
import sys
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments, set_seed
from trl import SFTTrainer
from qlora_utils import create_and_prepare_model, create_datasets

# Define and parse arguments.
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    chat_template_format: Optional[str] = field(
        default="none",
        metadata={
            "help": "chatml|zephyr|none. Pass `none` if the dataset is already formatted with the chat template."
        },
    )
    lora_alpha: Optional[int] = field(default=16)
    lora_dropout: Optional[float] = field(default=0.1)
    lora_r: Optional[int] = field(default=64)
    lora_target_modules: Optional[str] = field(
        default="q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj",
        metadata={"help": "comma separated list of target modules to apply LoRA layers to"},
    )
    use_nested_quant: Optional[bool] = field(
        default=False,
        metadata={"help": "Activate nested quantization for 4bit base models"},
    )
    bnb_4bit_compute_dtype: Optional[str] = field(
        default="float16",
        metadata={"help": "Compute dtype for 4bit base models"},
    )
    bnb_4bit_quant_storage_dtype: Optional[str] = field(
        default="uint8",
        metadata={"help": "Quantization storage dtype for 4bit base models"},
    )
    bnb_4bit_quant_type: Optional[str] = field(
        default="nf4",
        metadata={"help": "Quantization type fp4 or nf4"},
    )
    use_flash_attn: Optional[bool] = field(
        default=False,
        metadata={"help": "Enables Flash attention for training."},
    )
    use_peft_lora: Optional[bool] = field(
        default=False,
        metadata={"help": "Enables PEFT LoRA for training."},
    )
    use_8bit_quantization: Optional[bool] = field(
        default=False,
        metadata={"help": "Enables loading model in 8bit."},
    )
    use_4bit_quantization: Optional[bool] = field(
        default=False,
        metadata={"help": "Enables loading model in 4bit."},
    )
    use_reentrant: Optional[bool] = field(
        default=False,
        metadata={"help": "Gradient Checkpointing param. Refer the related docs"},
    )
    use_unsloth: Optional[bool] = field(
        default=False,
        metadata={"help": "Enables UnSloth for training."},
    )


@dataclass
class DataTrainingArguments:
    dataset_name: Optional[str] = field(
        default="timdettmers/openassistant-guanaco",
        metadata={"help": "The preference dataset to use."},
    )
    packing: Optional[bool] = field(
        default=False,
        metadata={"help": "Use packing dataset creating."},
    )
    dataset_text_field: str = field(default="text", metadata={"help": "Dataset field to use as input text."})
    max_seq_length: Optional[int] = field(default=512)
    append_concat_token: Optional[bool] = field(
        default=False,
        metadata={"help": "If True, appends `eos_token_id` at the end of each sample being packed."},
    )
    add_special_tokens: Optional[bool] = field(
        default=False,
        metadata={"help": "If True, tokenizers adds special tokens to each sample being packed."},
    )
    splits: Optional[str] = field(
        default="train,test",
        metadata={"help": "Comma separate list of the splits to use from the dataset."},
    )


def main(model_args, data_args, training_args):
    # Set seed for reproducibility
    set_seed(training_args.seed)

    # model
    model, peft_config, tokenizer = create_and_prepare_model(model_args, data_args, training_args)

    # gradient ckpt
    model.config.use_cache = not training_args.gradient_checkpointing
    training_args.gradient_checkpointing = training_args.gradient_checkpointing and not model_args.use_unsloth
    if training_args.gradient_checkpointing:
        training_args.gradient_checkpointing_kwargs = {"use_reentrant": model_args.use_reentrant}

    # datasets
    train_dataset, eval_dataset = create_datasets(
        tokenizer,
        data_args,
        training_args,
        apply_chat_template=model_args.chat_template_format != "none",
    )

    # trainer
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=peft_config,
        packing=data_args.packing,
        dataset_kwargs={
            "append_concat_token": data_args.append_concat_token,
            "add_special_tokens": data_args.add_special_tokens,
        },
        dataset_text_field=data_args.dataset_text_field,
        max_seq_length=data_args.max_seq_length,
    )
    trainer.accelerator.print(f"{trainer.model}")
    trainer.model.print_trainable_parameters()

    # train
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    trainer.train(resume_from_checkpoint=checkpoint)

    # saving final model
    if trainer.is_fsdp_enabled:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
    trainer.save_model()


if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    main(model_args, data_args, training_args)

together with the utils.py:

import os
from enum import Enum

import torch
from datasets import DatasetDict, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

from peft import LoraConfig


DEFAULT_CHATML_CHAT_TEMPLATE = "{% for message in messages %}\n{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% if loop.last and add_generation_prompt %}{{'<|im_start|>assistant\n' }}{% endif %}{% endfor %}"
DEFAULT_ZEPHYR_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"


class ZephyrSpecialTokens(str, Enum):
    user = "<|user|>"
    assistant = "<|assistant|>"
    system = "<|system|>"
    eos_token = "</s>"
    bos_token = "<s>"
    pad_token = "<pad>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]


class ChatmlSpecialTokens(str, Enum):
    user = "<|im_start|>user"
    assistant = "<|im_start|>assistant"
    system = "<|im_start|>system"
    eos_token = "<|im_end|>"
    bos_token = "<s>"
    pad_token = "<pad>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]


def create_datasets(tokenizer, data_args, training_args, apply_chat_template=False):
    def preprocess(samples):
        batch = []
        for conversation in samples["messages"]:
            batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
        return {"content": batch}

    raw_datasets = DatasetDict()
    for split in data_args.splits.split(","):
        try:
            # Try first if dataset on a Hub repo
            dataset = load_dataset(data_args.dataset_name, split=split)
        except DatasetGenerationError:
            # If not, check local dataset
            dataset = load_from_disk(os.path.join(data_args.dataset_name, split))

        if "train" in split:
            raw_datasets["train"] = dataset
        elif "test" in split:
            raw_datasets["test"] = dataset
        else:
            raise ValueError(f"Split type {split} not recognized as one of test or train.")

    if apply_chat_template:
        raw_datasets = raw_datasets.map(
            preprocess,
            batched=True,
            remove_columns=raw_datasets["train"].column_names,
        )

    train_data = raw_datasets["train"]
    valid_data = raw_datasets["test"]
    print(f"Size of the train set: {len(train_data)}. Size of the validation set: {len(valid_data)}")
    print(f"A sample of train dataset: {train_data[0]}")

    return train_data, valid_data


def create_and_prepare_model(args, data_args, training_args):
    if args.use_unsloth:
        from unsloth import FastLanguageModel
    bnb_config = None
    quant_storage_dtype = None

    if (
        torch.distributed.is_available()
        and torch.distributed.is_initialized()
        and torch.distributed.get_world_size() > 1
        and args.use_unsloth
    ):
        raise NotImplementedError("Unsloth is not supported in distributed training")

    if args.use_4bit_quantization:
        compute_dtype = getattr(torch, args.bnb_4bit_compute_dtype)
        quant_storage_dtype = getattr(torch, args.bnb_4bit_quant_storage_dtype)

        bnb_config = BitsAndBytesConfig(
            load_in_4bit=args.use_4bit_quantization,
            bnb_4bit_quant_type=args.bnb_4bit_quant_type,
            bnb_4bit_compute_dtype=compute_dtype,
            bnb_4bit_use_double_quant=args.use_nested_quant,
            bnb_4bit_quant_storage=quant_storage_dtype,
        )

        if compute_dtype == torch.float16 and args.use_4bit_quantization:
            major, _ = torch.cuda.get_device_capability()
            if major >= 8:
                print("=" * 80)
                print("Your GPU supports bfloat16, you can accelerate training with the argument --bf16")
                print("=" * 80)
        elif args.use_8bit_quantization:
            bnb_config = BitsAndBytesConfig(load_in_8bit=args.use_8bit_quantization)

    if args.use_unsloth:
        # Load model
        model, _ = FastLanguageModel.from_pretrained(
            model_name=args.model_name_or_path,
            max_seq_length=data_args.max_seq_length,
            dtype=None,
            load_in_4bit=args.use_4bit_quantization,
        )
    else:
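        # Keeping torch_dtype equal to the (floating point) 4-bit quant storage dtype
        # lets DeepSpeed ZeRO-3 / FSDP shard the packed 4-bit weights like ordinary
        # tensors of that dtype.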
        torch_dtype = (
            quant_storage_dtype if quant_storage_dtype and quant_storage_dtype.is_floating_point else torch.float32
        )
        model = AutoModelForCausalLM.from_pretrained(
            args.model_name_or_path,
            quantization_config=bnb_config,
            trust_remote_code=True,
            attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
            torch_dtype=torch_dtype,
        )

    peft_config = None
    chat_template = None
    if args.use_peft_lora and not args.use_unsloth:
        peft_config = LoraConfig(
            lora_alpha=args.lora_alpha,
            lora_dropout=args.lora_dropout,
            r=args.lora_r,
            bias="none",
            task_type="CAUSAL_LM",
            target_modules=args.lora_target_modules.split(",")
            if args.lora_target_modules != "all-linear"
            else args.lora_target_modules,
        )

    special_tokens = None
    chat_template = None
    if args.chat_template_format == "chatml":
        special_tokens = ChatmlSpecialTokens
        chat_template = DEFAULT_CHATML_CHAT_TEMPLATE
    elif args.chat_template_format == "zephyr":
        special_tokens = ZephyrSpecialTokens
        chat_template = DEFAULT_ZEPHYR_CHAT_TEMPLATE

    if special_tokens is not None:
        tokenizer = AutoTokenizer.from_pretrained(
            args.model_name_or_path,
            pad_token=special_tokens.pad_token.value,
            bos_token=special_tokens.bos_token.value,
            eos_token=special_tokens.eos_token.value,
            additional_special_tokens=special_tokens.list(),
            trust_remote_code=True,
        )
        tokenizer.chat_template = chat_template
        # make embedding resizing configurable?
        model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
    else:
        tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
        tokenizer.pad_token = tokenizer.eos_token

    if args.use_unsloth:
        # Do model patching and add fast LoRA weights
        model = FastLanguageModel.get_peft_model(
            model,
            lora_alpha=args.lora_alpha,
            lora_dropout=args.lora_dropout,
            r=args.lora_r,
            target_modules=args.lora_target_modules.split(",")
            if args.lora_target_modules != "all-linear"
            else args.lora_target_modules,
            use_gradient_checkpointing=training_args.gradient_checkpointing,
            random_state=training_args.seed,
            max_seq_length=data_args.max_seq_length,
        )

    return model, peft_config, tokenizer

the deepspeed_config_z3_qlora.yaml:

compute_environment: LOCAL_MACHINE                                                          
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

I thought there might be something different between Swallow-70b-hf and Llama 2, so I tried another 70B model, nvidia/Llama3-ChatQA-1.5-70B, and got the same errors.

@KarasZhang
Author

KarasZhang commented Jun 27, 2024

BTW, I ran this code with python rather than accelerate launch, and I could load the weights, but got an OOM error during trainer.train().

@BenjaminBossan
Member

Thanks for providing more context. So this is basically the examples/sft script from PEFT. Could you also show the output of accelerate env?

From the error message, we can tell that the error occurs when the base model is being loaded, i.e. before there is any involvement of PEFT. Unfortunately, I don't have a setup available to test your code but I assume that there is some kind of misconfiguration between DS, accelerate, and your system. To start to debug this, I would probably create a new script with just the model loading (i.e. line 131ff in utils.py) and make that work.

> BTW, I ran this code with python rather than accelerate launch, and I could load the weights, but got an OOM error during trainer.train().

That's expected, as running it with Python does not shard the model with DS, so there is no sharding error but your memory will be exhausted.
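
If it helps, a rough way to check whether ZeRO-3 sharding actually kicked in is to count how many parameter elements each process really holds after loading (a sketch that assumes the `model` object returned in your utils.py, not code from this thread):

# Under `accelerate launch` with zero3_init_flag: true, ZeRO-3 leaves empty
# placeholders on each rank, so this count should be close to 0. Under plain
# `python`, the full ~69B elements are materialized in a single process.
local_elems = sum(p.numel() for p in model.parameters())
print(f"locally materialized parameter elements: {local_elems:,}")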

@KarasZhang
Author

@BenjaminBossan
It seems that the Accelerate default config is missing, but I do remember that I ran accelerate config and answered some questions... Is something wrong here?

(app-root) accelerate env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.31.0
- Platform: Linux-5.14.0-284.55.1.el9_2.x86_64-x86_64-with-glibc2.34
- `accelerate` bash location: /opt/app-root/bin/accelerate
- Python version: 3.9.18
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 2015.58 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
        Not found

@BenjaminBossan
Member

> It seems that the Accelerate default config is missing, but I do remember that I ran accelerate config and answered some questions... Is something wrong here?

Hmm, I'm not sure what could have happened here. Please try creating it again and then check whether it can be found. Also, something worth trying would be to update PyTorch to a more recent version, if possible. How many GPUs do you have?

@KarasZhang
Author

> Please try creating it again and then check whether it can be found.

OK, I will take a look at the accelerate docs again and see whether I missed something.

> Also, something worth trying would be to update PyTorch to a more recent version, if possible.

Well, I'm not sure whether I can update PyTorch because of the CUDA driver version.

> How many GPUs do you have?

I have 8 H100 GPUs, and I have already tried full-parameter tuning of a 70B model with DeepSpeed ZeRO-3 and offload, which took me 130 hours and 1.7 TB of RAM. So I'm interested in how well QLoRA works.

@KarasZhang
Author

@BenjaminBossan
By running accelerate config and answering several questions, I got the Accelerate default config shown below. But no luck: the same error again.

 (app-root) accelerate env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.31.0
- Platform: Linux-5.14.0-284.55.1.el9_2.x86_64-x86_64-with-glibc2.34
- `accelerate` bash location: /opt/app-root/bin/accelerate
- Python version: 3.9.18
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 2015.58 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: True
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'deepspeed_config_file': 'ds_config_zero3.yaml', 'deepspeed_moe_layer_cls_names': '', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

@KarasZhang
Author

KarasZhang commented Jun 27, 2024

@BenjaminBossan
I wrote a test.py as below, and the error no longer happens. Instead, I got an OOM error...
It seems that the weights are loaded onto the first GPU only. Did I miss anything in the code?

import os
import torch

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

model_name = "tokyotech-llm/Swallow-70b-hf"

compute_dtype = getattr(torch, "bfloat16")
quant_storage_dtype = getattr(torch, "bfloat16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=quant_storage_dtype,
)

torch_dtype = quant_storage_dtype

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
)

The logs:

 (app-root) accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml"  test.py
[2024-06-27 15:01:33,077] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/opt/app-root/lib64/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards:   0%|                                                                                               | 0/15 [00:00<?, ?it/s]/opt/app-root/lib64/python3.9/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:  20%|█████████████████▍                                                                     | 3/15 [00:18<01:12,  6.05s/it]
Traceback (most recent call last):
  File "/opt/app-root/src/ds_qlora/test.py", line 30, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/app-root/lib64/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/app-root/lib64/python3.9/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/app-root/lib64/python3.9/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/app-root/lib64/python3.9/site-packages/transformers/modeling_utils.py", line 807, in _load_state_dict_into_meta_model
    hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
  File "/opt/app-root/lib64/python3.9/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 213, in create_quantized_param
    new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(target_device)
  File "/opt/app-root/lib64/python3.9/site-packages/bitsandbytes/nn/modules.py", line 324, in to
    return self._quantize(device)
  File "/opt/app-root/lib64/python3.9/site-packages/bitsandbytes/nn/modules.py", line 289, in _quantize
    w_4bit, quant_state = bnb.functional.quantize_4bit(
  File "/opt/app-root/lib64/python3.9/site-packages/bitsandbytes/functional.py", line 1169, in quantize_4bit
    out = torch.zeros(((n + 1) // mod, 1), dtype=quant_storage, device=A.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 79.10 GiB total capacity; 8.17 GiB already allocated; 109.06 MiB free; 8.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(the other ranks fail with the same CUDA out-of-memory traceback at the same point, differing only in the amount of memory already allocated on GPU 0)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39570 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39571 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39572 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39573 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39574 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39575 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 39576 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 39569) of binary: /opt/app-root/bin/python3.9
Traceback (most recent call last):
  File "/opt/app-root/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/app-root/lib64/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/app-root/lib64/python3.9/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    deepspeed_launcher(args)
  File "/opt/app-root/lib64/python3.9/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
    distrib_run.run(args)
  File "/opt/app-root/lib64/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/app-root/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/app-root/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-27_15:02:03
  host      : pls-eas-llm-finetuning-fy24up2-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 39569)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@BenjaminBossan
Member

Instead, I got a OOM error...

Hmm, that's not good. If you monitor the GPU memory usage, can you see if the model is sharded across GPUs or are all the weights loaded to a single one? One more thing you could do is try to upgrade transformers to a newer version.
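If it helps, a quick sanity check (just a sketch, assuming torch and accelerate are importable in your script) is to print the device and allocated memory per rank right after from_pretrained; with proper sharding, every rank should report a non-trivial amount, not just rank 0:

import torch
from accelerate import PartialState

state = PartialState()  # picks up the rank/device that accelerate launch assigned
print(
    f"rank {state.process_index} -> {state.device}, "
    f"allocated {torch.cuda.memory_allocated(state.device) / 2**20:.0f} MiB"
)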

@KarasZhang
Author

KarasZhang commented Jul 1, 2024

@BenjaminBossan

If you monitor the GPU memory usage, can you see if the model is sharded across GPUs or are all the weights loaded to a single one?

Well, nvidia-smi shows that the code is trying to load all the weights onto the 1st GPU, while the other GPUs use only 3MiB / 81559MiB.
(screenshot of nvidia-smi output)

The funny thing is that when I run python test.py, I can load the whole 70b model on ONE GPU! But I get an OOM error with accelerate launch --num_processes 8 --config_file configs/deepspeed_config_z3_qlora.yaml test.py...
Is something wrong with accelerate?
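For context, the loading step that the traceback points at is roughly this (a simplified sketch; the real test.py has a few more options around it):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tokyotech-llm/Swallow-70b-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)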

@BenjaminBossan
Member

It could be accelerate, bnb, or deepspeed. Maybe something is not configured correctly. What does the DS config file contain? And does it work when you remove bnb?

@KarasZhang
Author

Here is the deepspeed_config_z3_qlora.yaml

compute_environment: LOCAL_MACHINE                                                          
debug: false
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

And does it work when you remove bnb?

With accelerate launch --num_processes 8 --config_file configs/deepspeed_config_z3_qlora.yaml test.py, all 8 GPUs show only 3 MiB used (3MiB / 81559MiB) while loading the 70b model...
Does that mean nothing is loaded to the GPUs?
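A rough way to check this, I guess, is to count parameters whose local tensor is empty right after from_pretrained, since ZeRO-3 init leaves the local shapes as torch.Size([0]) (the same shape the error in the title complains about). Just a sketch:

empty = sum(1 for p in model.parameters() if p.numel() == 0)
total = sum(1 for _ in model.parameters())
print(f"{empty}/{total} parameters have an empty local shape on this rank")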

@BenjaminBossan
Member

Hmm, I don't see anything wrong with the config, but I'm not a DeepSpeed expert.

With accelerate launch --num_processes 8 --config_file configs/deepspeed_config_z3_qlora.yaml test.py, all 8 GPUs show only 3 MiB used (3MiB / 81559MiB) while loading the 70b model...
Does that mean nothing is loaded to the GPUs?

All but GPU 0, which is trying to load the whole model according to your screenshot. Another thing you could try is to use the exact same settings but without bitsandbytes and see if it changes anything. If not, I would probably open an issue on accelerate.
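Something like this would be the comparison I have in mind (a minimal sketch: same checkpoint, same launch command, just no quantization config):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tokyotech-llm/Swallow-70b-hf",
    torch_dtype=torch.bfloat16,
)

If that gets sharded across the 8 GPUs as expected, the problem is most likely in the interaction between bitsandbytes and the ZeRO-3 loading path rather than in accelerate itself.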

@KarasZhang
Author

@BenjaminBossan
Finally, I got this to work! I just updated torch to 2.3.1, downgraded accelerate from 0.31.0 to 0.28.0, downgraded bitsandbytes from 0.43.1 to 0.43.0, and moved transformers from 4.38.2 to 4.39.2.
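For anyone else running into this, the working combination can be pinned with something like:

pip install torch==2.3.1 accelerate==0.28.0 bitsandbytes==0.43.0 transformers==4.39.2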
Many thanks, Ben.
