Describe the bug
When I try to run accelerate launch train_instruct_pix2pix.py with multiple GPUs, it reports the error below:
Traceback (most recent call last):
  File "train_instruct_pix2pix.py", line 1002, in <module>
    main()
  File "train_instruct_pix2pix.py", line 722, in main
    unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1138, in prepare
    result = tuple(
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1139, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 990, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1218, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [2]: params[0] in this process with sizes [320, 4, 3, 3] appears not to match sizes of the same param in process 0.

Process 1 raises an identical traceback, ending with:

RuntimeError: [1]: params[0] in this process with sizes [320, 4, 3, 3] appears not to match sizes of the same param in process 0.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89468 closing signal SIGTERM
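The [2] and [1] prefixes in the RuntimeError are the ranks whose first parameter (shape [320, 4, 3, 3]) no longer matches rank 0. As a sanity check, one can print each rank's conv_in shape right before accelerator.prepare is called; this is only a sketch, reusing the accelerator and unet objects already defined in train_instruct_pix2pix.py:

    # Sketch: confirm the per-rank shape divergence just before accelerator.prepare(...)
    # `accelerator` and `unet` are the objects the training script already creates.
    print(
        f"rank {accelerator.process_index}: "
        f"conv_in.weight shape = {tuple(unet.conv_in.weight.shape)}"
    )

With the code quoted below in place, rank 0 prints (320, 8, 3, 3) while every other rank prints (320, 4, 3, 3).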
I think the problem is caused by this code:
if accelerator.is_main_process:
    logger.info("Initializing the InstructPix2Pix UNet from the pretrained UNet.")
    in_channels = 8
    out_channels = unet.conv_in.out_channels
    unet.register_to_config(in_channels=in_channels)

    with torch.no_grad():
        new_conv_in = nn.Conv2d(
            in_channels, out_channels, unet.conv_in.kernel_size, unet.conv_in.stride, unet.conv_in.padding
        )
        new_conv_in.weight.zero_()
        new_conv_in.weight[:, :4, :, :].copy_(unet.conv_in.weight)
        unet.conv_in = new_conv_in
Only the main process changes the UNet's conv_in to 8 input channels while the other processes keep 4, so the parameter shapes differ across ranks and DDP's shape verification fails.
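A minimal sketch of a possible fix, assuming nothing else in the script changes: keep only the log line behind the is_main_process guard and run the conv_in replacement on every process, so all ranks hold identically shaped parameters when DistributedDataParallel verifies them:

    # Sketch of a fix: only the logging stays rank-0-only.
    if accelerator.is_main_process:
        logger.info("Initializing the InstructPix2Pix UNet from the pretrained UNet.")

    # Run the conv_in surgery on every rank, not just the main process,
    # so DDP sees the same parameter shapes everywhere.
    in_channels = 8
    out_channels = unet.conv_in.out_channels
    unet.register_to_config(in_channels=in_channels)

    with torch.no_grad():
        new_conv_in = nn.Conv2d(
            in_channels, out_channels, unet.conv_in.kernel_size, unet.conv_in.stride, unet.conv_in.padding
        )
        new_conv_in.weight.zero_()
        # Copy the pretrained 4-channel weights; the 4 new input channels start at zero.
        new_conv_in.weight[:, :4, :, :].copy_(unet.conv_in.weight)
        unet.conv_in = new_conv_in

Since every rank starts from the same pretrained checkpoint and the extra channels are zero-initialized, this should also keep the weights identical across processes, not just the shapes.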
Reproduction
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_ID \
  --enable_xformers_memory_efficient_attention \
  --resolution=256 --random_flip \
  --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=15000 \
  --checkpointing_steps=5000 --checkpoints_total_limit=1 \
  --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
  --conditioning_dropout_prob=0.05 \
  --mixed_precision=fp16 \
  --val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
  --validation_prompt="make the mountains snowy" \
  --seed=42 \
  --report_to=wandb
Logs
No response
System Info
diffusers: 0.15.0.dev0
Python: 3.8
torch: 2.0.0
accelerate: 0.18.0.dev0
OS: Ubuntu 20.04