
train_instruct_pix2pix.py example fails when training with multiple GPUs #2966

@whbzju

Description


Describe the bug

When I try to run accelerate launch train_instruct_pix2pix.py with multiple GPUs, it reports the error below:

Traceback (most recent call last):
File "train_instruct_pix2pix.py", line 1002, in
main()
File "train_instruct_pix2pix.py", line 722, in main
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1138, in prepare
result = tuple(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1139, in
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 990, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1218, in prepare_model
Traceback (most recent call last):
File "train_instruct_pix2pix.py", line 1002, in
model = torch.nn.parallel.DistributedDataParallel(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [2]: params[0] in this process with sizes [320, 4, 3, 3] appears not to match sizes of the same param in process 0.
main()
File "train_instruct_pix2pix.py", line 722, in main
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1138, in prepare
result = tuple(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1139, in
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 990, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1218, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [1]: params[0] in this process with sizes [320, 4, 3, 3] appears not to match sizes of the same param in process 0.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89468 closing signal SIGTERM


I think the problem may be caused by this code:

if accelerator.is_main_process:
    logger.info("Initializing the InstructPix2Pix UNet from the pretrained UNet.")
    in_channels = 8
    out_channels = unet.conv_in.out_channels
    unet.register_to_config(in_channels=in_channels)

    with torch.no_grad():
        new_conv_in = nn.Conv2d(
            in_channels, out_channels, unet.conv_in.kernel_size, unet.conv_in.stride, unet.conv_in.padding
        )
        new_conv_in.weight.zero_()
        new_conv_in.weight[:, :4, :, :].copy_(unet.conv_in.weight)
        unet.conv_in = new_conv_in

Only the main process changes the UNet's conv_in to accept 8 input channels while the other processes keep the original 4, so DDP fails when it verifies that parameter shapes match across processes. A possible fix is sketched below.
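As a rough, untested sketch of a fix: run the conv_in expansion on every process (before accelerator.prepare) so all ranks end up with identical parameter shapes, and keep only the logging behind the is_main_process check:

if accelerator.is_main_process:
    logger.info("Initializing the InstructPix2Pix UNet from the pretrained UNet.")

# Run on every process so all ranks end up with an 8-channel conv_in.
in_channels = 8
out_channels = unet.conv_in.out_channels
unet.register_to_config(in_channels=in_channels)

with torch.no_grad():
    new_conv_in = nn.Conv2d(
        in_channels, out_channels, unet.conv_in.kernel_size, unet.conv_in.stride, unet.conv_in.padding
    )
    new_conv_in.weight.zero_()
    # Copy the pretrained weights into the first 4 input channels; the extra
    # channels for the conditioning image stay zero-initialized.
    new_conv_in.weight[:, :4, :, :].copy_(unet.conv_in.weight)
    unet.conv_in = new_conv_in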

Reproduction

accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_ID \
  --enable_xformers_memory_efficient_attention \
  --resolution=256 --random_flip \
  --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=15000 \
  --checkpointing_steps=5000 --checkpoints_total_limit=1 \
  --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
  --conditioning_dropout_prob=0.05 \
  --mixed_precision=fp16 \
  --val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
  --validation_prompt="make the mountains snowy" \
  --seed=42 \
  --report_to=wandb

Logs

No response

System Info

diffusers: 0.15.0.dev0
python: 3.8
torch: 2.0.0
accelerate: 0.18.0.dev0
ubuntu 20.04
