There is a bug in examples/unconditional_image_generation/train_unconditional.py: when the script runs on multiple NPUs or GPUs with the checkpoints_total_limit argument set, it saves the wrong number of checkpoints.
For example, here is the command I ran:
examples/unconditional_image_generation/train_unconditional.py
--dataset_name hf-internal-testing/dummy_image_class_data
--model_config_name_or_path diffusers/ddpm_dummy
--resolution 64
--output_dir {tmpdir}
--train_batch_size 1
--num_epochs 1
--gradient_accumulation_steps 1
--ddpm_num_inference_steps 2
--learning_rate 1e-3
--lr_warmup_steps 5
--checkpointing_steps=2
--checkpoints_total_limit=2
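The wrong behavior only shows up with more than one process. For reference, a sketch of a two-device launch through accelerate, assuming a default accelerate config (adjust for your setup):

```bash
accelerate launch --multi_gpu --num_processes 2 \
  examples/unconditional_image_generation/train_unconditional.py \
  ...  # same flags as above
```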
- expected saved checkpoints: [checkpoint-4, checkpoint-6]
- actual saved checkpoints: [checkpoint-6]
It seems that the accelerator.is_main_process check is not placed on the right line, which causes the program to remove checkpoints incorrectly when running on multiple NPUs or GPUs: every process runs the checkpoint-pruning logic, so a checkpoint that should be kept can be deleted by a non-main process that lists the output directory after the newest checkpoint has been saved.
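To make the failure concrete, here is a minimal sketch of the race, assuming the pruning block is reached by every process (prune_checkpoints is a hypothetical helper, not the script's actual code):

```python
import os
import shutil

# Simplified sketch of the faulty flow (illustration only; the real script's
# code differs in details). When the pruning is not guarded by
# accelerator.is_main_process, every process executes it:
def prune_checkpoints(output_dir: str, total_limit: int) -> None:
    checkpoints = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
    checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))
    # Keep room for the checkpoint about to be saved.
    if len(checkpoints) >= total_limit:
        for stale in checkpoints[: len(checkpoints) - total_limit + 1]:
            shutil.rmtree(os.path.join(output_dir, stale))

# With two ranks, total_limit == 2, and checkpointing at global_step == 6:
#   rank 0 lists [checkpoint-2, checkpoint-4], removes checkpoint-2, saves checkpoint-6;
#   rank 1 prunes a moment later, lists [checkpoint-4, checkpoint-6],
#   and removes checkpoint-4 -- leaving only [checkpoint-6].
```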
You can fix this easily by changing the code as shown below, and I can also submit a PR quickly if needed.
if accelerator.is_main_process:
    if global_step % args.checkpointing_steps == 0:
        .......
        shutil.rmtree(removing_checkpoint)

        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
        accelerator.save_state(save_path)
        logger.info(f"Saved state to {save_path}")