Bug in checkpoints_total_limit when running on multiple GPUs #5338

@jiaqiw09

Description

There is a bug in examples/unconditional_image_generation/train_unconditional.py. When the script is run on multiple NPUs or GPUs with the checkpoints_total_limit argument, it saves the wrong number of checkpoints.

For example, here is the command I ran:

examples/unconditional_image_generation/train_unconditional.py
--dataset_name hf-internal-testing/dummy_image_class_data
--model_config_name_or_path diffusers/ddpm_dummy
--resolution 64
--output_dir {tmpdir}
--train_batch_size 1
--num_epochs 1
--gradient_accumulation_steps 1
--ddpm_num_inference_steps 2
--learning_rate 1e-3
--lr_warmup_steps 5
--checkpointing_steps=2
--checkpoints_total_limit=2
  • expected saved checkpoints: [checkpoint-4, checkpoint-6]
  • actual saved checkpoints: [checkpoint-6]
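
For reference, the saved checkpoints can be checked after training with a quick listing like the one below. This is only a sketch; tmpdir is a placeholder for the directory passed as --output_dir above.

import os

tmpdir = "output"  # placeholder for the directory passed as --output_dir above

checkpoints = sorted(
    (d for d in os.listdir(tmpdir) if d.startswith("checkpoint")),
    key=lambda name: int(name.split("-")[1]),
)
print(checkpoints)
# expected: ['checkpoint-4', 'checkpoint-6']
# observed on multiple NPUs/GPUs: ['checkpoint-6']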

It seems that the accelerator.is_main_process check is not placed on the right line, which causes the program to remove checkpoints incorrectly when running on multiple NPUs or GPUs.

This can easily be fixed by changing the code as shown below. I can also submit a PR quickly if needed.

if accelerator.is_main_process:
    if global_step % args.checkpointing_steps == 0:
        .......
                    shutil.rmtree(removing_checkpoint)

        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
        accelerator.save_state(save_path)
        logger.info(f"Saved state to {save_path}")
