
"RuntimeError: The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1" (DPO + LoRA) #57

Open
ohmeow opened this issue Nov 28, 2023 · 7 comments
Labels
bug Something isn't working

Comments

ohmeow commented Nov 28, 2023

So I'm attempting to run the DPO LoRA script and I'm getting this error:

RuntimeError: The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1

... when model.merge_and_unload() runs here:

base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, **model_kwargs)
model = PeftModel.from_pretrained(base_model, model_args.model_name_or_path, revision=model_args.model_revision)
model.eval()
model = model.merge_and_unload()  # <- error is raised here while merging the LoRA adapter

Any ideas?

ohmeow (Author) commented Nov 29, 2023

NOTE: This only occurs if I'm using the DeepSpeed accelerate config with num_processes > 1.
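
For reference, a minimal sketch of the kind of DeepSpeed accelerate config this refers to. The field names are standard accelerate/DeepSpeed config keys; the exact values in the repo's deepspeed_zero3.yaml may differ:

    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      offload_optimizer_device: none
      offload_param_device: none
      zero3_init_flag: true        # shards the base model weights at load time
      zero3_save_16bit_model: true
      zero_stage: 3
    distributed_type: DEEPSPEED
    mixed_precision: bf16
    num_machines: 1
    num_processes: 2               # > 1 is what triggers the error above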

ohmeow (Author) commented Nov 29, 2023

So I think the solution is to add accelerator.wait_for_everyone() before you instantiate the DPOTrainer.

If someone can confirm that feel free to close this out. If not, lmk :)
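
A minimal sketch of where that call would sit, assuming the training script already creates an Accelerator and loads/merges the model as shown in this issue (variable names are illustrative, not the exact script):

    from accelerate import Accelerator
    from trl import DPOTrainer

    accelerator = Accelerator()

    # ... load the base model and merge the LoRA adapter on each process ...

    # block until every process has finished merging before building the trainer
    accelerator.wait_for_everyone()

    trainer = DPOTrainer(
        model,
        ref_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
    )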

ohmeow (Author) commented Dec 1, 2023

I think the problem might be related to using deepspeed on my local DL rig with 2x3090s. Just switched to the multi-gpu.yaml file and the script ran no problem.
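
For context, a DDP launch along these lines would look roughly like the following; the config and recipe paths are assumptions based on the repo layout and may differ:

    # plain multi-GPU (DDP) launch: no ZeRO-3 sharding, so the adapter merge succeeds
    ACCELERATE_LOG_LEVEL=info accelerate launch \
      --config_file recipes/accelerate_configs/multi_gpu.yaml \
      --num_processes=2 \
      scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_qlora.yaml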

lewtun (Member) commented Dec 4, 2023

Hi @ohmeow, as discussed here, I think the issue indeed arises when trying to do the following:

  • Use DeepSpeed's zero.init() to shard the base model weights directly on GPU via this flag in the accelerate config
  • Try to merge the adapter weights on the sharded base model

I don't think we saw this issue in the original release of the code because we made a goof on the device_map for LoRA training that was later fixed in #51.

If you have enough VRAM, you should be able to work around this by setting zero3_init_flag: false in the accelerate config.
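
Concretely, the workaround is a one-line change in the DeepSpeed section of the accelerate config (sketch; other fields stay as in the original config):

    deepspeed_config:
      zero3_init_flag: false   # load full base model weights on each GPU so the LoRA merge sees real tensors
      zero_stage: 3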

I'm discussing this with the peft team and hopefully can find a more stable solution!

lewtun added the bug label on Dec 4, 2023
ohmeow (Author) commented Dec 4, 2023

The only way I was able to get training to proceed was by adding device_map=get_kbit_device_map() to the model_kwargs when loading an adapter model.

    if is_adapter_model(model, model_args.model_revision):
        # load the model, merge the adapter weights and unload the adapter
        # Note: to run QLoRA, you will need to merge the base model separately, as the merged model is in 16-bit
        logger.info(f"Merging peft adapters for {model_args.model_name_or_path=}")

        peft_config = PeftConfig.from_pretrained(model_args.model_name_or_path, revision=model_args.model_revision)

        model_kwargs = dict(
            revision=model_args.base_model_revision,
            trust_remote_code=model_args.trust_remote_code,
            use_flash_attention_2=model_args.use_flash_attention_2,
            torch_dtype=torch_dtype,
            use_cache=False if training_args.gradient_checkpointing else True,
            device_map=get_kbit_device_map(),
        )

        base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, **model_kwargs)
        model = PeftModel.from_pretrained(base_model, model_args.model_name_or_path, revision=model_args.model_revision)
        model.eval()
        model = model.merge_and_unload()
        model_kwargs = None

    if model_args.use_peft is True:
        ref_model = None
        ref_model_kwargs = None
    else:
        ref_model = model
        ref_model_kwargs = model_kwargs

    accelerator.wait_for_everyone()

With this I can get everything running on my 2x3090s using the multi-gpu.yaml. GPU utilization looks even across both cards.
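
For what it's worth, the helper essentially pins the entire model to the local process's GPU; a rough sketch of my reading of it (not its exact source) is:

    import torch
    from accelerate import PartialState

    def get_kbit_device_map():
        # map the whole model ("") onto this process's local GPU, so each rank loads
        # an unsharded copy of the base model and the LoRA merge sees full weights
        if torch.cuda.is_available():
            return {"": PartialState().local_process_index}
        return None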

The deepspeed config works as well but for some reason fails when pushing the model to the hub. I imagine this has something to do with my machine and/or with using 3090s.

Randl (Contributor) commented Dec 10, 2023

Can confirm that setting zero3_init_flag: False helps.

hhhhuiyuan commented

> I think the problem might be related to using deepspeed on my local DL rig with 2x3090s. Just switched to the multi-gpu.yaml file and the script ran no problem.

Having the same issue here, but weirdly, the DPO script cannot run even with multi-gpu.yaml on my machine. Could you please share your multi-gpu.yaml file? In my understanding, multi-gpu.yaml is for data parallelism, so it should not have a problem merging the QLoRA adapter.
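
For reference, a typical accelerate DDP config in the spirit of multi-gpu.yaml looks like this; values are illustrative and num_processes should match your GPU count:

    compute_environment: LOCAL_MACHINE
    distributed_type: MULTI_GPU
    gpu_ids: all
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 2
    use_cpu: false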
