
seems to be a bug related to saving model #24130

Closed · 2 of 4 tasks
jeffchy opened this issue Jun 9, 2023 · 2 comments · Fixed by #24134

Comments


jeffchy commented Jun 9, 2023

System Info

I use pytorch==2.0 FSDP full-shard.
With transformers==4.29.1 and accelerate==0.19.0, things work well:

[INFO|trainer.py:2904] 2023-06-09 10:35:25,236 >> Saving model checkpoint to ../outputs/tigerbot-7b/full/2023-06-09-10-33-49/ckpt/checkpoint-4
[INFO|configuration_utils.py:458] 2023-06-09 10:35:25,237 >> Configuration saved in ../outputs/tigerbot-7b/full/2023-06-09-10-33-49/ckpt/checkpoint-4/config.json
[INFO|configuration_utils.py:364] 2023-06-09 10:35:25,237 >> Configuration saved in ../outputs/tigerbot-7b/full/2023-06-09-10-33-49/ckpt/checkpoint-4/generation_config.json

When I switch to transformers==4.30 and accelerate==0.20.0, I get the following error when saving the model:

│    285                                                                                           │
│    286 class _open_zipfile_writer_file(_opener):                                                 │
│    287 │   def __init__(self, name) -> None:                                                     │
│ ❱  288 │   │   super().__init__(torch._C.PyTorchFileWriter(str(name)))                           │
│    289 │                                                                                         │
│    290 │   def __exit__(self, *args) -> None:                                                    │
│    291 │   │   self.file_like.write_end_of_file()                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Parent directory ../outputs/tigerbot-7b/full/2023-06-09-10-21-30/ckpt/checkpoint-4 does not exist.

It seems that when saving an FSDP model, transformers/accelerate no longer creates the parent folder 'xxxx/checkpoint-4' for me. When I downgrade transformers and accelerate, it works, and when I manually create 'xxx/checkpoint-4' before saving, it also works.
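
For reference, the underlying failure reproduces without Trainer or FSDP: torch.save() does not create missing parent directories. A minimal sketch of the failure and of the manual-mkdir workaround described above (the paths here are illustrative, not the actual output directory):

import os
import torch

# Fails if the parent directory does not exist:
# torch.save({"weight": torch.zeros(1)}, "does_not_exist/checkpoint-4/pytorch_model.bin")
# RuntimeError: Parent directory does_not_exist/checkpoint-4 does not exist.

# Works: create the checkpoint directory first, then save.
os.makedirs("does_not_exist/checkpoint-4", exist_ok=True)
torch.save({"weight": torch.zeros(1)}, "does_not_exist/checkpoint-4/pytorch_model.bin")

This suggests the regression is that the saving path for FSDP checkpoints stopped calling os.makedirs() on the checkpoint directory before handing the path to torch.save().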

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

pytorch==2.0
transformers==4.30.0
accelerate==0.20.3

Trainer using FSDP full-shard, modified from the train_clm.py example

CUDA_VISIBLE_DEVICES=0,1,2,3,7 $BASE_ENV/torchrun --nproc_per_node 5 --nnodes=1 --node_rank=0 --master_port $MASTER_PORT main_sft.py \
    --model_name_or_path $MODEL \
    --model_type $MODEL_TYPE \
    --dataset_config_file config/data/tiger.yaml \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --output_dir $OUTPUT_DIR \
    --fp16 \
    --cutoff_len 2048 \
    --save_steps 500 \
    --logging_steps 50 \
    --max_steps 6000 \
    --eval_steps 500 \
    --warmup_steps 5 \
    --gradient_accumulation_steps 32 \
    --lr_scheduler_type "cosine" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'BloomBlock' \
    --gradient_checkpointing True \
    --overwrite_cache \
    --learning_rate 1e-5 \
    | tee $LOG_DIR/train.log \
    2> $LOG_DIR/train.err

Expected behavior

With transformers==4.29.1 and accelerate==0.19.0, the checkpoint directory is created and the model is saved successfully (see the logs under System Info above); the same should happen with transformers==4.30 and accelerate==0.20.0 instead of the RuntimeError shown above.

@amyeroberts (Collaborator) commented:

cc @pacman100

@pacman100 (Contributor) commented:

Hello @jeffchy, thank you for the thorough issue. Can you please confirm whether the above PR resolves your issue?
