
[Community] Help us fix the LR schedulers when num_train_epochs is passed in a distributed training env #8384

Context

Refer to #8312 for the full context. The changes introduced in that PR should be propagated to the following scripts, too (a sketch of the core change follows this list):

  • advanced_diffusion_training

    • train_dreambooth_lora_sd15_advanced.py
    • train_dreambooth_lora_sdxl_advanced.py
  • consistency_distillation

    • train_lcm_distill_lora_sdxl.py
  • controlnet

    • train_controlnet.py
    • train_controlnet_sdxl.py
  • custom_diffusion

    • train_custom_diffusion.py
  • dreambooth

    • train_dreambooth.py
    • train_dreambooth_lora.py
    • train_dreambooth_lora_sdxl.py
  • instruct_pix2pix

    • train_instruct_pix2pix.py
    • train_instruct_pix2pix_sdxl.py
  • kandinsky2_2/text_to_image

    • train_text_to_image_decoder.py
    • train_text_to_image_prior.py
    • train_text_to_image_lora_decoder.py
    • train_text_to_image_lora_prior.py
  • t2i_adapter

    • train_t2i_adapter_sdxl.py
  • text_to_image

    • train_text_to_image.py
    • train_text_to_image_sdxl.py
    • train_text_to_image_lora.py
    • train_text_to_image_lora_sdxl.py
  • textual_inversion

    • textual_inversion.py
    • textual_inversion_sdxl.py
  • unconditional_image_generation

    • train_unconditional.py
  • wuerstchen

    • text_to_image/train_text_to_image_prior.py
    • text_to_image/train_text_to_image_lora_prior.py
  • research_projects (low-priority)

    • consistency_training/train_cm_ct_unconditional.py
    • diffusion_dpo/train_diffusion_dpo.py
    • diffusion_dpo/train_diffusion_dpo_sdxl.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora.py
    • dreambooth_inpaint/train_dreambooth_inpaint.py
    • dreambooth_inpaint/train_dreambooth_inpaint_lora.py
    • instructpix2pix_lora/train_instruct_pix2pix_lora.py
    • intel_opts/textual_inversion/textual_inversion_bf16.py
    • intel_opts/textual_inversion_dfq/textual_inversion.py
    • lora/train_text_to_image_lora.py
    • multi_subject_dreambooth/train_multi_subject_dreambooth.py
    • multi_token_textual_inversion/textual_inversion.py
    • onnxruntime/text_to_image/train_text_to_image.py
    • onnxruntime/textual_inversion/textual_inversion.py
    • onnxruntime/unconditional_image_generation/train_unconditional.py
    • realfill/train_realfill.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora_sdxl.py
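
For reference, here is a rough sketch of the change that #8312 made to examples/text_to_image/train_text_to_image.py. It reuses the script's existing names (args, accelerator, train_dataloader, optimizer, get_scheduler, logger) and is an outline to orient you, not a verbatim copy; please diff against the merged PR before porting it.

```python
import math

# Sketch of the scheduler math from #8312 (not a verbatim copy).
# The key point: when only --num_train_epochs is passed, the step budget handed
# to get_scheduler() must be derived from the *sharded* dataloader length,
# because accelerator.prepare() shrinks the dataloader by num_processes and the
# prepared scheduler advances num_processes times per optimizer step.
num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes
if args.max_train_steps is None:
    len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)
    num_update_steps_per_epoch = math.ceil(
        len_train_dataloader_after_sharding / args.gradient_accumulation_steps
    )
    num_training_steps_for_scheduler = (
        args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes
    )
else:
    num_training_steps_for_scheduler = args.max_train_steps * accelerator.num_processes

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps_for_scheduler,
    num_training_steps=num_training_steps_for_scheduler,
)

# ... later, after accelerator.prepare(...): recompute the true number of update
# steps per epoch from the now-sharded dataloader and warn on a mismatch.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    if num_training_steps_for_scheduler != args.max_train_steps * accelerator.num_processes:
        logger.warning(
            f"The length of the train_dataloader after accelerator.prepare ({len(train_dataloader)}) "
            f"does not match what was assumed when the LR scheduler was created "
            f"({len_train_dataloader_after_sharding}), so the scheduler may not behave as expected."
        )
```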

The following scripts do not have the argument --num_train_epochs:

  • amused
    • train_amused.py
  • research_projects
    • multi_subject_dreambooth_inpainting/train_multi_subject_dreambooth_inpainting.py

So, they don't need to be updated.

Then we have the following scripts that don't use accelerator to prepare the datasets. In these, distributed dataset sharding is done by WebDataset rather than Accelerate (see the sketch after this list), so we can skip them for now:

  • consistency_distillation
    • train_lcm_distill_sd_wds.py
    • train_lcm_distill_sdxl_wds.py
    • train_lcm_distill_lora_sd_wds.py
    • train_lcm_distill_lora_sdxl_wds.py
  • research_projects
    • controlnet/train_controlnet_webdataset.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora_wds.py

Steps to follow when opening PRs

  • Target one AND only one training script in a single PR.
  • When you open a PR, please mention this issue.
  • Mention @sayakpaul and @geniuspatrick for a review.
  • Accompany your PR with a minimal training command using the num_train_epochs CLI arg (an example follows this list).
  • Enjoy!
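
For example, a minimal command for the text_to_image script could look like the following (model, dataset, and output dir are placeholders; adjust --num_processes to your setup):

```bash
accelerate launch --num_processes=2 train_text_to_image.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --resolution=512 \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --output_dir="sd-text2image-epochs-test"
```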