Information

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hey folks,
The issue seems to be simple enough. I tried to train an FSDP model using a multi-node setup, with transformers + accelerate. I'm launching my training script (which doesn't seem relevant to the issue for now, so I'll skip it for brevity) using accelerate launch --config-file config.yaml ... python train.py. The config.yaml looks like this:
The issue, though, arises when using transformers' built-in Trainer class and trying to save the model at the end of training. Calling Trainer.save_model(), however, doesn't seem to save any file in save_dir (as a matter of fact, it doesn't even create the directory).
The problem seems to be in Trainer.save_model(), which tests for FSDP and then tests for FULL_STATE_DICT. So this leads to the obvious question: what happens to SHARDED_STATE_DICT models? I didn't manage to find anything about this in the docs.
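To paraphrase the suspected control flow (a simplified standalone sketch, not the actual transformers source — the function name and boolean return here are made up for illustration):

```python
# Simplified sketch of the suspected branching in Trainer.save_model():
# weights are only written when FSDP is off, or when FSDP uses FULL_STATE_DICT.
def save_model_sketch(is_fsdp_enabled: bool, state_dict_type: str) -> bool:
    """Return True if a consolidated weights file would be written to save_dir."""
    if not is_fsdp_enabled:
        return True  # regular (non-FSDP) path saves normally
    if state_dict_type == "FULL_STATE_DICT":
        return True  # full state dict is gathered and saved
    # SHARDED_STATE_DICT falls through: nothing is saved, no directory created
    return False

# The behavior reported here:
assert save_model_sketch(True, "FULL_STATE_DICT") is True
assert save_model_sketch(True, "SHARDED_STATE_DICT") is False  # the problem
```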
Am I missing something? Are you supposed to change the state dict to a FULL_STATE_DICT before saving?
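If so, I'd guess the config-level equivalent would be something like the fragment below (untested; fsdp_state_dict_type is the accelerate config key for this setting, and the programmatic variant would presumably go through the accelerate FSDP plugin's set_state_dict_type before the final save):

```yaml
fsdp_config:
  fsdp_state_dict_type: FULL_STATE_DICT  # instead of SHARDED_STATE_DICT
```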
Note that checkpoints do indeed seem to save normally (they are sharded, so every node saves its part, which is expected). It is just the final save_model call that seems to falter.
Thanks a lot!
Expected behavior
The model should be saved in safetensors format, with all the weights, inside save_dir, as described above, regardless of whether accelerate uses FULL_STATE_DICT or SHARDED_STATE_DICT with FSDP.
System Info
transformers version: 4.39.2

Who can help?

@pacman100