- `Accelerate` version: 0.25.0.dev0
- Platform: Linux-5.10.0-26-cloud-amd64-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 29.39 GB
- GPU type: Tesla V100-SXM2-16GB
- `Accelerate` default config:
    Not found
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Traceback (most recent call last):
  File "examples/custom_diffusion/train_custom_diffusion.py", line 1340, in <module>
    main(args)
  File "examples/custom_diffusion/train_custom_diffusion.py", line 1081, in main
    accelerator.load_state(os.path.join(args.output_dir, path))
  File "/opt/venv/lib/python3.8/site-packages/accelerate/accelerator.py", line 2984, in load_state
    load_custom_state(obj, input_dir, index)
  File "/opt/venv/lib/python3.8/site-packages/accelerate/checkpointing.py", line 275, in load_custom_state
    obj.load_state_dict(torch.load(load_location, map_location="cpu"))
  File "/opt/venv/lib/python3.8/site-packages/torch/serialization.py", line 1028, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/venv/lib/python3.8/site-packages/torch/serialization.py", line 1246, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'm'.
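An `invalid load key` error like the one above generally means the first byte of the file is not a valid pickle opcode, so the file was probably not written in a format `torch.load` can read. The following is a minimal diagnostic sketch, not part of Accelerate or diffusers: `sniff_checkpoint_format` and `try_unpickle` are hypothetical helper names, and the checks only look at leading bytes (zip archive, pickle `PROTO` opcode, or a safetensors-style length-prefixed JSON header). It makes no claim about what the suspected commit actually changed.

```python
import pickle
import struct
import zipfile


def sniff_checkpoint_format(path):
    """Best-effort guess at how a checkpoint file was serialized (hypothetical helper)."""
    # torch.save has written zip archives by default since PyTorch 1.6.
    if zipfile.is_zipfile(path):
        return "zip"
    with open(path, "rb") as f:
        head = f.read(9)
    # Pickle protocol 2 and later begins with the PROTO opcode, 0x80.
    if head[:1] == b"\x80":
        return "pickle"
    # safetensors layout: 8-byte little-endian header length, then a JSON header.
    if len(head) == 9:
        header_len = struct.unpack("<Q", head[:8])[0]
        if header_len > 0 and head[8:9] == b"{":
            return "safetensors-like"
    return "unknown"


def try_unpickle(path):
    """Return None if the file unpickles, otherwise the UnpicklingError message."""
    try:
        with open(path, "rb") as f:
            pickle.load(f)
    except pickle.UnpicklingError as exc:
        return str(exc)
    return None
```

Running `sniff_checkpoint_format` on the file that `load_custom_state` tries to open in the checkpoint directory should show whether the saved format and the loader disagree. Only call `try_unpickle` on files you trust, since unpickling can execute arbitrary code.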
Expected behavior
Script to resume training from checkpoint without unpickling error.
Reproduction
A recent change in accelerate is breaking one of the example scripts in the diffusers CI:
https://github.com/huggingface/diffusers/actions/runs/6809493785/job/18515900191?pr=5713
The issue seems to have started with commit e638b1e.
To Reproduce
1. Run the training example
2. Check that checkpoints exist
3. Resume from a checkpoint (this breaks)