
Trainer/accelerate doesn't save model when using FSDP with SHARDED_STATE_DICT #30491

Open
2 of 4 tasks
alexghergh opened this issue Apr 25, 2024 · 1 comment
@alexghergh

System Info

  • transformers version: 4.39.2
  • Platform: Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.13
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (using transformers Trainer)
  • Using distributed or parallel set-up in script?: Yes (FSDP, using Trainer + config.yaml file for Accelerate, over a distributed multi-node setup)

Who can help?

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hey folks,

The issue seems simple enough. I tried to train an FSDP model on a multi-node setup with transformers + accelerate. I'm launching my training script (which doesn't seem relevant to the issue for now, so I'll skip it for brevity) with accelerate launch --config_file config.yaml ... train.py. The config.yaml looks like this:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The issue, though, shows up when using transformers' built-in Trainer class and trying to save the model at the end of training. Running:

trainer = Trainer(...)

trainer.train(...)

trainer.save_model(save_dir)

This, however, doesn't seem to save any file in save_dir (as a matter of fact, it doesn't even create the directory).

The problem seems to be in Trainer.save_model(), which checks for FSDP and then checks for FULL_STATE_DICT. This leads to the obvious question: what happens to SHARDED_STATE_DICT models? I didn't manage to find anything about this in the docs.
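For reference, the relevant branch looks roughly like this (paraphrased from my reading of the 4.39 source, not an exact copy):

# Rough paraphrase of the FSDP branch in Trainer.save_model(), as I read it
if self.is_fsdp_enabled:
    if "FULL_STATE_DICT" in str(self.accelerator.state.fsdp_plugin.state_dict_type):
        # gather the full state dict across ranks and write it out
        state_dict = self.accelerator.get_state_dict(self.model)
        if self.args.should_save:
            self._save(output_dir, state_dict=state_dict)
    # there is no corresponding branch for SHARDED_STATE_DICT, so nothing is written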

Am I missing something? Are you supposed to change the state dict to a FULL_STATE_DICT before saving?
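For what it's worth, the workaround I'm currently trying (untested so far; it relies on accelerate's FullyShardedDataParallelPlugin.set_state_dict_type, and I'm not sure it's the intended approach) is to switch the plugin to FULL_STATE_DICT right before the final save:

# Possible workaround: flip the FSDP plugin to FULL_STATE_DICT just before the
# final save, so save_model() takes the gather-and-save branch above.
if trainer.is_fsdp_enabled:
    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

trainer.save_model(save_dir)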

Note that checkpoints do indeed seem to save normally (they are sharded, so it seems that every node saves its part, which is expected). It is just the final save_model call that seems to falter.

Thanks a lot!

Expected behavior

The model should be saved in a safetensors format, with all the weights, inside save_dir, as described above, regardless of whether accelerate uses FULL_STATE_DICT or SHARDED_STATE_DICT with FSDP.

@amyeroberts
Collaborator

cc @muellerzr @SunMarc
