Describe the bug
When training a Flux ControlNet with train_controlnet_flux.py on a multi-GPU server, with training restricted to a single GPU, setting --num_single_layers=0 raises the following error:
[rank0]: Parameter indices which did not receive grad for rank 0: 64 65 72 73 74 75
Reproduction
accelerate launch --gpu_ids='0,' --num_processes=1 --num_machines=1 --main_process_port 28700 train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-schnell" \
  --dataset_name="lucataco/fill1k" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="logs" \
  --mixed_precision="bf16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --max_train_steps=15000 \
  --validation_steps=100 \
  --checkpointing_steps=200 \
  --validation_image "./example_images/conditioning_image_1.png" "./example_images/conditioning_image_2.png" \
  --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --report_to="tensorboard" \
  --num_double_layers=2 \
  --num_single_layers=0 \
  --seed=42 \
  --enable_model_cpu_offload \
  --use_8bit_adam \
  --use_adafactor \
  --gradient_checkpointing
Logs
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
[rank0]: making sure all `forward` function outputs participate in calculating loss.
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 64 65 72 73 74 75
[rank0]: In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
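To make the indices in the log above (64 65 72 73 74 75) easier to act on, here is a rough sketch of how they could be mapped back to parameter names. It rests on an assumption I have not verified against this exact PyTorch version: that DDP numbers only the parameters with requires_grad=True, in the order returned by named_parameters(). The helper name names_for_ddp_indices and the stand-in parameter names are hypothetical and only there for illustration.

```python
from types import SimpleNamespace


def names_for_ddp_indices(named_parameters, indices):
    """Map DDP parameter indices to parameter names.

    Assumption: DDP numbers only parameters with requires_grad=True,
    in the order yielded by module.named_parameters().
    """
    trainable = [name for name, p in named_parameters if p.requires_grad]
    # Skip indices that fall outside the trainable list instead of raising.
    return [trainable[i] for i in indices if i < len(trainable)]


# Stand-in for model.named_parameters(); these names are made up.
fake_params = [
    ("block.weight", SimpleNamespace(requires_grad=True)),
    ("frozen.weight", SimpleNamespace(requires_grad=False)),
    ("block.bias", SimpleNamespace(requires_grad=True)),
    ("head.weight", SimpleNamespace(requires_grad=True)),
]

print(names_for_ddp_indices(fake_params, [1, 2]))
```

In the training script, the same helper could presumably be called with the ControlNet module's named_parameters() and the indices from the error message, to see which layers are left without gradients when num_single_layers=0.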
System Info
- 🤗 Diffusers version: 0.31.0.dev0
- Platform: Linux-5.14.0-427.33.1.el9_4.x86_64-x86_64-with-glibc2.34
- Running on Google Colab?: No
- Python version: 3.12.4
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.24.7
- Transformers version: 4.45.0
- Accelerate version: 0.33.0
- PEFT version: 0.12.0
- Bitsandbytes version: 0.44.1
- Safetensors version: 0.4.4
- xFormers version: 0.0.28
- Accelerator: NVIDIA RTX A6000, 49140 MiB
NVIDIA RTX A6000, 49140 MiB
NVIDIA RTX A6000, 49140 MiB
- Using GPU in script?:
- Using distributed or parallel set-up in script?: Yes
Who can help?
@sayakpaul