Change in accelerate is breaking a test in the Diffusers CI #2135

DN6 · 2023-11-09T10:19:04Z

System Info

- `Accelerate` version: 0.25.0.dev0
- Platform: Linux-5.10.0-26-cloud-amd64-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 29.39 GB
- GPU type: Tesla V100-SXM2-16GB
- `Accelerate` default config:
        Not found

Information

The official example scripts
My own modified scripts

Tasks

One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)

Reproduction

A recent change in accelerate is breaking one of the example scripts in the diffusers CI.
https://github.com/huggingface/diffusers/actions/runs/6809493785/job/18515900191?pr=5713

The issue seems to have started with commit e638b1e.

To Reproduce

git clone https://github.com/huggingface/diffusers.git && cd diffusers

pip install -e .
pip install git+https://github.com/huggingface/accelerate.git

Run training example

python examples/custom_diffusion/train_custom_diffusion.py \
--pretrained_model_name_or_path="hf-internal-testing/tiny-stable-diffusion-pipe" \
--instance_data_dir=docs/source/en/imgs \
--instance_prompt="<new1>" \
--resolution=64 \
--train_batch_size=1 \
--modifier_token="<new1>" \
--dataloader_num_workers=0 \
--max_train_steps=9 \
--checkpointing_steps=2 \
--no_safe_serialization

Check if checkpoints exist

ls examples/custom_diffusion/custom-diffusion-model
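The same check can be scripted. The following is a minimal sketch, assuming the checkpoints are saved as `checkpoint-<step>` directories (as the `--resume_from_checkpoint=checkpoint-8` argument suggests). The helper name `list_checkpoints` is hypothetical and not part of diffusers or accelerate.

```python
import os
import re


def list_checkpoints(output_dir: str) -> list:
    """Return 'checkpoint-<step>' directory names in output_dir, sorted by step.

    Hypothetical helper for illustration only.
    """
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        # Keep only directories whose name matches checkpoint-<integer>.
        if match and os.path.isdir(os.path.join(output_dir, name)):
            found.append((int(match.group(1)), name))
    # Sort numerically by step so checkpoint-10 comes after checkpoint-8.
    return [name for _, name in sorted(found)]


if __name__ == "__main__":
    # Path taken from the `ls` command above.
    out_dir = "examples/custom_diffusion/custom-diffusion-model"
    if os.path.isdir(out_dir):
        print(list_checkpoints(out_dir))
```

With `--max_train_steps=9` and `--checkpointing_steps=2`, one would expect checkpoints for steps 2, 4, 6 and 8 to be listed.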

Resume from checkpoint (this breaks)

python examples/custom_diffusion/train_custom_diffusion.py \
--pretrained_model_name_or_path="hf-internal-testing/tiny-stable-diffusion-pipe" \
--instance_data_dir=docs/source/en/imgs \
--instance_prompt="<new1>" \
--resolution=64 \
--train_batch_size=1 \
--modifier_token="<new1>" \
--dataloader_num_workers=0 \
--max_train_steps=9 \
--checkpointing_steps=2 \
--no_safe_serialization \
--resume_from_checkpoint=checkpoint-8

Traceback

Traceback (most recent call last):
  File "examples/custom_diffusion/train_custom_diffusion.py", line 1340, in <module>
    main(args)
  File "examples/custom_diffusion/train_custom_diffusion.py", line 1081, in main
    accelerator.load_state(os.path.join(args.output_dir, path))
  File "/opt/venv/lib/python3.8/site-packages/accelerate/accelerator.py", line 2984, in load_state
    load_custom_state(obj, input_dir, index)
  File "/opt/venv/lib/python3.8/site-packages/accelerate/checkpointing.py", line 275, in load_custom_state
    obj.load_state_dict(torch.load(load_location, map_location="cpu"))
  File "/opt/venv/lib/python3.8/site-packages/torch/serialization.py", line 1028, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/venv/lib/python3.8/site-packages/torch/serialization.py", line 1246, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'm'.
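A note on reading this error: the last frames show `torch.load` falling back to `_legacy_load`, which passes the file to the pickle module. `invalid load key, 'm'` means the unpickler hit the byte `m`, which is not a pickle opcode. Since this happens on the first pickle read, it was most likely the first byte of the file, so the file on disk is probably not in the format the loader expects. The sketch below reproduces the message with the standard library only. The helper names `sniff_format` and `unpickle_error` are hypothetical, and the payload bytes are made up rather than taken from the real checkpoint.

```python
import pickle


def sniff_format(head: bytes) -> str:
    """Guess a container format from a file's leading bytes (hypothetical helper)."""
    if head.startswith(b"PK\x03\x04"):
        return "zip"      # zip archives, e.g. the zip-based torch.save format
    if head.startswith(b"\x80"):
        return "pickle"   # pickle protocol 2+ streams begin with the PROTO opcode
    return "unknown"


def unpickle_error(data: bytes) -> str:
    """Return the UnpicklingError message for data, or '' if it unpickles cleanly."""
    try:
        pickle.loads(data)
    except pickle.UnpicklingError as exc:
        return str(exc)
    return ""


# Made-up bytes that start with 'm'; NOT the contents of the real checkpoint file.
payload = b"m" + b"\x00" * 7
print(sniff_format(payload))
print(unpickle_error(payload))
```

Running the same kind of check on the first few bytes of the custom state file in `checkpoint-8` (opened in binary mode) may help confirm which format it was actually written in.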

Expected behavior

The script should resume training from the checkpoint without raising an unpickling error.


muellerzr self-assigned this Nov 9, 2023

muellerzr mentioned this issue Nov 9, 2023 in "Leave native save as False" #2138 (merged)

muellerzr closed this as completed in #2138 Nov 9, 2023
