
Problem training with FSDP #18

Open
agokrani opened this issue Jan 3, 2024 · 2 comments

agokrani commented Jan 3, 2024

When I am trying to train a model with FSDP, I am getting the following error:

*** TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union

It happens on this specific line:
trainer.model = trainer.model_wrapped = FSDP(trainer.model, **kwargs)

After a bit of debugging, it seems to have something to do with the auto_wrap_policy. I am not really sure how to solve this. Do you have any suggestions? It was working fine until a few days ago.
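For context, the kwargs on that line are built along these lines in my setup (a rough sketch; transformer_auto_wrap_policy and the LlamaDecoderLayer placeholder are assumptions, not necessarily what the script actually uses):

import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer  # placeholder layer class

# The policy tells FSDP which submodule classes to wrap as separate shards.
kwargs = dict(
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    ),
)
# trainer as in the script; this is the line that raises the TypeError.
trainer.model = trainer.model_wrapped = FSDP(trainer.model, **kwargs)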

we1k (Contributor) commented Jan 10, 2024

I have encountered the same problem; however, it seems more like a problem with FSDP wrapping a PEFT-wrapped model. Running run_fsdp works fine for me, but adding a LoRA config leads to the same error. @pacman100 could you please look at this problem?
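For reference, this is roughly the setup that triggers it for me (a sketch; the base model name and target_modules are placeholders, not the repo's actual config):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)  # the trainer then tries to wrap this PeftModel with FSDP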

we1k (Contributor) commented Jan 10, 2024

I found out this is caused by the module_wrap_policy function in FSDP(trainer.model).
[screenshot]

The PEFT-wrapped model passes None as the module class variable.
[screenshot]
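A minimal sketch of why that produces exactly this TypeError, assuming the policy is the transformer/module wrap policy from torch.distributed.fsdp.wrap (PyTorch 2.x argument names):

import functools
import torch.nn as nn
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# If the transformer layer class lookup on the PEFT-wrapped model fails,
# None ends up in the class set handed to the policy.
policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={None},  # stand-in for the failed lookup
)

# FSDP evaluates the policy on every submodule, roughly like this; the
# underlying isinstance(module, (None,)) check raises
# "TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union"
policy(nn.Linear(2, 2), recurse=False, nonwrapped_numel=0)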

After I manually filter out the None module_class, another error occurs:

 File "train.py", line 197, in main
    trainer.model = trainer.model_wrapped = FSDP(trainer.model, **kwargs)
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 487, in __init__
    _init_param_handle_from_module(
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 519, in _init_param_handle_from_module
    _init_param_handle_from_params(state, managed_params, fully_sharded_module)
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 531, in _init_param_handle_from_params
    handle = FlatParamHandle(
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 537, in __init__
    self._init_flat_param_and_metadata(
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 585, in _init_flat_param_and_metadata
    ) = self._validate_tensors_to_flatten(params)
  File "/home/lzw/miniconda3/envs/Bert/lib/python3.8/site-packages/torch/distributed/fsdp/flat_param.py", line 731, in _validate_tensors_to_flatten
    raise ValueError(
ValueError: Must flatten tensors with uniform `requires_grad` when `use_orig_params=False`

I'm not familiar with FSDP, so I still don't know how to figure this out.
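For what it's worth, the second error is FSDP's default flattening requiring every parameter in a wrapped unit to share the same requires_grad, which a LoRA model (frozen base weights plus trainable adapters) violates. A common workaround on PyTorch 2.x is use_orig_params=True; a toy sketch (ToyLoraLayer is illustrative, not this repo's model):

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Toy stand-in for a PEFT model: a frozen base layer next to a trainable
# adapter, i.e. mixed requires_grad inside one flattened unit.
class ToyLoraLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(16, 16)
        self.base.requires_grad_(False)
        self.adapter = nn.Linear(16, 16)

    def forward(self, x):
        return self.base(x) + self.adapter(x)

# Needs an initialized process group (e.g. launched with torchrun).
# With the default use_orig_params=False this reproduces the
# "Must flatten tensors with uniform `requires_grad`" error; setting it to
# True keeps the original parameters and accepts the mix.
model = FSDP(ToyLoraLayer(), use_orig_params=True)

In train.py that would presumably mean adding use_orig_params=True to the kwargs passed to FSDP(trainer.model, **kwargs), or the equivalent option in the Trainer/accelerate FSDP config, but I have not verified that against this repo.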
