
Trainer/accelerate doesn't save model when using FSDP with SHARDED_STATE_DICT #30491

Open
2 of 4 tasks
alexghergh opened this issue Apr 25, 2024 · 1 comment
@alexghergh

System Info

  • transformers version: 4.39.2
  • Platform: Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.13
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (using transformers Trainer)
  • Using distributed or parallel set-up in script?: Yes (FSDP, using Trainer + config.yaml file for Accelerate, over a distributed multi-node setup)

Who can help?

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hey folks,

The issue seems simple enough. I tried to train an FSDP model on a multi-node setup with transformers + accelerate. I'm launching my training script (which doesn't seem relevant to the issue for now, so I'll skip it for brevity) with accelerate launch --config_file config.yaml ... train.py. The config.yaml looks like this:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The issue, though, shows up when using transformers' built-in Trainer class and trying to save the model at the end of training. Running:

trainer = Trainer(...)

trainer.train(...)

trainer.save_model(save_dir)

This, however, doesn't seem to save any file in save_dir (as a matter of fact, it doesn't even create the directory).

The problem seems to be in Trainer.save_model(), which checks for FSDP and then checks for FULL_STATE_DICT. This leads to the obvious question: what happens to SHARDED_STATE_DICT models? I didn't manage to find anything about this in the docs.
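For reference, the relevant branch looks roughly like this (paraphrased from my reading of the 4.39 source, not an exact copy):

# Rough paraphrase of the FSDP branch in Trainer.save_model(), as I read it
if self.is_fsdp_enabled:
    if "FULL_STATE_DICT" in str(self.accelerator.state.fsdp_plugin.state_dict_type):
        # gather the full state dict across ranks and write it out
        state_dict = self.accelerator.get_state_dict(self.model)
        if self.args.should_save:
            self._save(output_dir, state_dict=state_dict)
    # there is no corresponding branch for SHARDED_STATE_DICT, so nothing is written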

Am I missing something? Are you supposed to change the state dict to a FULL_STATE_DICT before saving?
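For what it's worth, the workaround I'm currently trying (untested so far; it relies on accelerate's FullyShardedDataParallelPlugin.set_state_dict_type, and I'm not sure it's the intended approach) is to switch the plugin to FULL_STATE_DICT right before the final save:

# Possible workaround: flip the FSDP plugin to FULL_STATE_DICT just before the
# final save, so save_model() takes the gather-and-save branch above.
if trainer.is_fsdp_enabled:
    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

trainer.save_model(save_dir)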

Note that checkpoints do indeed seem to save normally (they are sharded, so it seems that every node saves its part, which is expected). It is just the final save_model call that seems to falter.

Thanks a lot!

Expected behavior

The model should be saved in a safetensors format, with all the weights, inside save_dir, as described above, regardless of whether accelerate uses FULL_STATE_DICT or SHARDED_STATE_DICT with FSDP.

@amyeroberts
Collaborator

cc @muellerzr @SunMarc
