
"RuntimeError: The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1" (DPO + LoRA) #57

Open
ohmeow opened this issue Nov 28, 2023 · 7 comments
Labels
bug Something isn't working

Comments

ohmeow commented Nov 28, 2023

So I'm attempting to run the DPO LoRA script and I'm getting this error:

RuntimeError: The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1

... when model.merge_and_unload() runs here:

base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, **model_kwargs)
model = PeftModel.from_pretrained(base_model, model_args.model_name_or_path, revision=model_args.model_revision)
model.eval()
model = model.merge_and_unload()  # <- error is raised here while merging the LoRA adapter

Any ideas?

ohmeow (Author) commented Nov 29, 2023

NOTE: This only occurs if I'm using the DeepSpeed accelerate config with num_processes > 1.
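
For reference, a minimal sketch of the kind of DeepSpeed accelerate config this refers to. The field names are standard accelerate/DeepSpeed config keys; the exact values in the repo's deepspeed_zero3.yaml may differ:

    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      offload_optimizer_device: none
      offload_param_device: none
      zero3_init_flag: true        # shards the base model weights at load time
      zero3_save_16bit_model: true
      zero_stage: 3
    distributed_type: DEEPSPEED
    mixed_precision: bf16
    num_machines: 1
    num_processes: 2               # > 1 is what triggers the error above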

ohmeow (Author) commented Nov 29, 2023

So I think the solution is to add accelerator.wait_for_everyone() before you instantiate the DPOTrainer.

If someone can confirm that feel free to close this out. If not, lmk :)
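
A minimal sketch of where that call would sit, assuming the training script already creates an Accelerator and loads/merges the model as shown in this issue (variable names are illustrative, not the exact script):

    from accelerate import Accelerator
    from trl import DPOTrainer

    accelerator = Accelerator()

    # ... load the base model and merge the LoRA adapter on each process ...

    # block until every process has finished merging before building the trainer
    accelerator.wait_for_everyone()

    trainer = DPOTrainer(
        model,
        ref_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
    )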

ohmeow (Author) commented Dec 1, 2023

I think the problem might be related to using deepspeed on my local DL rig with 2x3090s. Just switched to the multi-gpu.yaml file and the script ran no problem.
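
For context, a DDP launch along these lines would look roughly like the following; the config and recipe paths are assumptions based on the repo layout and may differ:

    # plain multi-GPU (DDP) launch: no ZeRO-3 sharding, so the adapter merge succeeds
    ACCELERATE_LOG_LEVEL=info accelerate launch \
      --config_file recipes/accelerate_configs/multi_gpu.yaml \
      --num_processes=2 \
      scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_qlora.yaml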

lewtun (Member) commented Dec 4, 2023

Hi @ohmeow, as discussed here, I think the issue indeed arises when trying to do the following:

  • Use DeepSpeed's zero.init() to shard the base model weights directly on GPU via this flag in the accelerate config
  • Try to merge the adapter weights on the sharded base model

I don't think we saw this issue in the original release of the code because we made a goof on the device_map for LoRA training that was later fixed in #51.

If you have enough VRAM, you should be able to work around this by setting zero3_init_flag: false in the accelerate config.
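
Concretely, the workaround is a one-line change in the DeepSpeed section of the accelerate config (sketch; other fields stay as in the original config):

    deepspeed_config:
      zero3_init_flag: false   # load full base model weights on each GPU so the LoRA merge sees real tensors
      zero_stage: 3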

I'm discussing this with the peft team and hopefully can find a more stable solution!

lewtun added the bug label on Dec 4, 2023
ohmeow (Author) commented Dec 4, 2023

The only way I was able to get training to proceed was by adding device_map=get_kbit_device_map() to the model_kwargs when loading an adapter model.

    if is_adapter_model(model, model_args.model_revision):
        # load the model, merge the adapter weights and unload the adapter
        # Note: to run QLoRA, you will need to merge the base model separately, as the merged model is in 16-bit
        logger.info(f"Merging peft adapters for {model_args.model_name_or_path=}")

        peft_config = PeftConfig.from_pretrained(model_args.model_name_or_path, revision=model_args.model_revision)

        model_kwargs = dict(
            revision=model_args.base_model_revision,
            trust_remote_code=model_args.trust_remote_code,
            use_flash_attention_2=model_args.use_flash_attention_2,
            torch_dtype=torch_dtype,
            use_cache=False if training_args.gradient_checkpointing else True,
            device_map=get_kbit_device_map(),
        )

        base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, **model_kwargs)
        model = PeftModel.from_pretrained(base_model, model_args.model_name_or_path, revision=model_args.model_revision)
        model.eval()
        model = model.merge_and_unload()
        model_kwargs = None

    if model_args.use_peft is True:
        ref_model = None
        ref_model_kwargs = None
    else:
        ref_model = model
        ref_model_kwargs = model_kwargs

    accelerator.wait_for_everyone()

With this I can get everything running on my 2x3090s using the multi-gpu.yaml. GPU utilization looks even across both cards.
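
For what it's worth, the helper essentially pins the entire model to the local process's GPU; a rough sketch of my reading of it (not its exact source) is:

    import torch
    from accelerate import PartialState

    def get_kbit_device_map():
        # map the whole model ("") onto this process's local GPU, so each rank loads
        # an unsharded copy of the base model and the LoRA merge sees full weights
        if torch.cuda.is_available():
            return {"": PartialState().local_process_index}
        return None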

The deepspeed config works as well but for some reason fails when pushing the model to the hub. I imagine this has something to do with my machine and/or with using 3090s.

Randl (Contributor) commented Dec 10, 2023

Can confirm that setting zero3_init_flag: False helps.

hhhhuiyuan commented

> I think the problem might be related to using deepspeed on my local DL rig with 2x3090s. Just switched to the multi-gpu.yaml file and the script ran no problem.

Having the same issue here, but weirdly, the DPO script cannot run even with multi-gpu.yaml on my machine. Could you please share your multi-gpu.yaml file? In my understanding, multi-gpu.yaml is for data parallelism, so it should not have a problem merging the QLoRA adapter.
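
For reference, a typical accelerate DDP config in the spirit of multi-gpu.yaml looks like this; values are illustrative and num_processes should match your GPU count:

    compute_environment: LOCAL_MACHINE
    distributed_type: MULTI_GPU
    gpu_ids: all
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 2
    use_cpu: false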
