Change in accelerate is breaking a test in the Diffusers CI #2135

Closed
DN6 opened this issue Nov 9, 2023 · 0 comments · Fixed by #2138
DN6 commented Nov 9, 2023

System Info

- `Accelerate` version: 0.25.0.dev0
- Platform: Linux-5.10.0-26-cloud-amd64-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 29.39 GB
- GPU type: Tesla V100-SXM2-16GB
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

A recent change in accelerate is breaking one of the example scripts in the Diffusers CI.
https://github.com/huggingface/diffusers/actions/runs/6809493785/job/18515900191?pr=5713

The issue appears to have been introduced by commit e638b1e.

To Reproduce

git clone https://github.com/huggingface/diffusers.git && cd diffusers
pip install -e .
pip install git+https://github.com/huggingface/accelerate.git

Run training example

python examples/custom_diffusion/train_custom_diffusion.py \
--pretrained_model_name_or_path="hf-internal-testing/tiny-stable-diffusion-pipe" \
--instance_data_dir=docs/source/en/imgs \
--instance_prompt="<new1>" \
--resolution=64 \
--train_batch_size=1 \
--modifier_token="<new1>" \
--dataloader_num_workers=0 \
--max_train_steps=9 \
--checkpointing_steps=2 \
--no_safe_serialization

Check if checkpoints exist

ls examples/custom_diffusion/custom-diffusion-model

Resume from checkpoint (this breaks)

python examples/custom_diffusion/train_custom_diffusion.py \
--pretrained_model_name_or_path="hf-internal-testing/tiny-stable-diffusion-pipe" \
--instance_data_dir=docs/source/en/imgs \
--instance_prompt="<new1>" \
--resolution=64 \
--train_batch_size=1 \
--modifier_token="<new1>" \
--dataloader_num_workers=0 \
--max_train_steps=9 \
--checkpointing_steps=2 \
--no_safe_serialization \
--resume_from_checkpoint=checkpoint-8

Traceback

Traceback (most recent call last):
  File "examples/custom_diffusion/train_custom_diffusion.py", line 1340, in <module>
    main(args)
  File "examples/custom_diffusion/train_custom_diffusion.py", line 1081, in main
    accelerator.load_state(os.path.join(args.output_dir, path))
  File "/opt/venv/lib/python3.8/site-packages/accelerate/accelerator.py", line 2984, in load_state
    load_custom_state(obj, input_dir, index)
  File "/opt/venv/lib/python3.8/site-packages/accelerate/checkpointing.py", line 275, in load_custom_state
    obj.load_state_dict(torch.load(load_location, map_location="cpu"))
  File "/opt/venv/lib/python3.8/site-packages/torch/serialization.py", line 1028, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/venv/lib/python3.8/site-packages/torch/serialization.py", line 1246, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'm'.
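The traceback shows torch's legacy loader delegating to pickle.load, which rejects any file that is not a pickle stream; the 'm' in the error message is simply the first byte of the file it was handed. A minimal stdlib-only sketch (illustrative, not accelerate's code) reproduces the same error class:

```python
import io
import pickle

# Feed pickle.load a payload that was not written by pickle (or torch.save) --
# e.g. a checkpoint saved in a different serialization format. The unpickler
# reports the first byte of the stream as an "invalid load key".
payload = io.BytesIO(b"model bytes that are not a pickle stream")
try:
    pickle.load(payload)
except pickle.UnpicklingError as exc:
    print(exc)  # invalid load key, 'm'.
```

This is consistent with a save/load mismatch: the custom-state file on disk is no longer in the format `load_custom_state` expects `torch.load` to read.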

Expected behavior

The script should resume training from the checkpoint without raising an unpickling error.
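One defensive option (a hypothetical helper, not part of accelerate or the proposed fix) is to sniff a checkpoint file's leading bytes and pick the matching loader, rather than letting torch.load fail mid-restore. The magic numbers below are standard: zip archives (modern torch.save) start with "PK\x03\x04", legacy torch.save output begins with the pickle PROTO opcode 0x80, and a safetensors file starts with an 8-byte little-endian header length followed by a JSON header.

```python
def sniff_checkpoint_format(data: bytes) -> str:
    """Guess the on-disk serialization format from a file's leading bytes."""
    if data[:4] == b"PK\x03\x04":
        # Zip archive: the format modern torch.save produces.
        return "torch_zip"
    if data[:1] == b"\x80":
        # Pickle PROTO opcode: legacy torch.save / plain pickle stream.
        return "pickle"
    if len(data) >= 9 and data[8:9] == b"{":
        # safetensors: u64 header length, then a JSON header starting with '{'.
        return "safetensors"
    return "unknown"
```

With such a check, a loader could raise a clear "checkpoint format mismatch" error (or dispatch to the right deserializer) instead of the opaque `invalid load key` failure above.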

@muellerzr muellerzr self-assigned this Nov 9, 2023