Bug in checkpoints_total_limit when running on multiple GPUs #5338

@jiaqiw09

Description

There is a bug in examples/unconditional_image_generation/train_unconditional.py. When the script is run on multiple NPUs or GPUs with the checkpoints_total_limit argument, it saves the wrong number of checkpoints.

For example, here is the command I ran:

examples/unconditional_image_generation/train_unconditional.py
--dataset_name hf-internal-testing/dummy_image_class_data
--model_config_name_or_path diffusers/ddpm_dummy
--resolution 64
--output_dir {tmpdir}
--train_batch_size 1
--num_epochs 1
--gradient_accumulation_steps 1
--ddpm_num_inference_steps 2
--learning_rate 1e-3
--lr_warmup_steps 5
--checkpointing_steps=2
--checkpoints_total_limit=2
  • expected saved checkpoints: [checkpoint-4, checkpoint-6]
  • actual saved checkpoints: [checkpoint-6]
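
For reference, the saved checkpoints can be checked after training with a quick listing like the one below. This is only a sketch; tmpdir is a placeholder for the directory passed as --output_dir above.

import os

tmpdir = "output"  # placeholder for the directory passed as --output_dir above

checkpoints = sorted(
    (d for d in os.listdir(tmpdir) if d.startswith("checkpoint")),
    key=lambda name: int(name.split("-")[1]),
)
print(checkpoints)
# expected: ['checkpoint-4', 'checkpoint-6']
# observed on multiple NPUs/GPUs: ['checkpoint-6']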

It seems that the accelerator.is_main_process check is not placed on the right line, which causes the program to remove checkpoints incorrectly when running on multiple NPUs or GPUs.

This can easily be fixed by changing the code as shown below. I can also submit a PR quickly if needed.

if accelerator.is_main_process:
    if global_step % args.checkpointing_steps == 0:
        .......
                    shutil.rmtree(removing_checkpoint)

        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
        accelerator.save_state(save_path)
        logger.info(f"Saved state to {save_path}")
