
Error when setting num_single_layers=0 while training flux-controlnet on a multi-GPU server using a single GPU #9630

@wangherr


Describe the bug

While training flux-controlnet on a multi-GPU server and restricting the training to a single GPU, setting num_single_layers=0 leads to an error:

[rank0]: Parameter indices which did not receive grad for rank 0: 64 65 72 73 74 75

Reproduction

accelerate launch --gpu_ids='0,' --num_processes=1 --num_machines=1 --main_process_port 28700 train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-schnell" \
  --dataset_name="lucataco/fill1k" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="logs" \
  --mixed_precision="bf16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --max_train_steps=15000 \
  --validation_steps=100 \
  --checkpointing_steps=200 \
  --validation_image "./example_images/conditioning_image_1.png" "./example_images/conditioning_image_2.png" \
  --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --report_to="tensorboard" \
  --num_double_layers=2 \
  --num_single_layers=0 \
  --seed=42 \
  --enable_model_cpu_offload \
  --use_8bit_adam \
  --use_adafactor \
  --gradient_checkpointing

Logs

[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
[rank0]: making sure all `forward` function outputs participate in calculating loss.
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 64 65 72 73 74 75
[rank0]:  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
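A likely cause, given the traceback: with --num_single_layers=0 the ControlNet may still register parameters for the single-block path that then never participate in the forward pass, which is exactly what DDP's gradient-reduction check rejects. One possible workaround, sketched below under the assumption that the script's Accelerator can be given accelerate's DistributedDataParallelKwargs (a real accelerate class): enable find_unused_parameters whenever a whole block family is configured away. The helper function name is hypothetical, not part of train_controlnet_flux.py.

```python
# Hypothetical helper (not part of train_controlnet_flux.py): decide whether
# DDP must tolerate parameters that never receive gradients. With
# --num_single_layers=0, registered sub-modules may be skipped in forward(),
# which is the situation DDP's reduction check complains about.

def needs_find_unused_parameters(num_double_layers: int, num_single_layers: int) -> bool:
    """True when a whole block family is configured away."""
    return num_double_layers == 0 or num_single_layers == 0

# Sketch of wiring this into accelerate (assumes accelerate >= 0.33 and an
# argparse namespace `args`, as in the training script):
#
#   from accelerate import Accelerator
#   from accelerate.utils import DistributedDataParallelKwargs
#
#   ddp_kwargs = DistributedDataParallelKwargs(
#       find_unused_parameters=needs_find_unused_parameters(
#           args.num_double_layers, args.num_single_layers
#       )
#   )
#   accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

if __name__ == "__main__":
    print(needs_find_unused_parameters(2, 0))  # the failing configuration -> True
    print(needs_find_unused_parameters(2, 2))  # -> False
```

Note that find_unused_parameters=True adds per-iteration overhead, so gating it on the layer counts (rather than always enabling it) keeps the default fast path intact.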

System Info

  • 🤗 Diffusers version: 0.31.0.dev0
  • Platform: Linux-5.14.0-427.33.1.el9_4.x86_64-x86_64-with-glibc2.34
  • Running on Google Colab?: No
  • Python version: 3.12.4
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.24.7
  • Transformers version: 4.45.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • Bitsandbytes version: 0.44.1
  • Safetensors version: 0.4.4
  • xFormers version: 0.0.28
  • Accelerator: NVIDIA RTX A6000, 49140 MiB
    NVIDIA RTX A6000, 49140 MiB
    NVIDIA RTX A6000, 49140 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: Yes

Who can help?

@sayakpaul

Metadata


Assignees

No one assigned

    Labels

    bug (Something isn't working), stale (Issues that haven't received updates)


    Development

    No branches or pull requests
