
[Community] Help us fix the LR schedulers when num_train_epochs is passed in a distributed training env #8384

Context

Refer to #8312 for the full context. The changes introduced in that PR should be propagated to the following scripts, too (a sketch of the core change follows this list):

  • advanced_diffusion_training

    • train_dreambooth_lora_sd15_advanced.py
    • train_dreambooth_lora_sdxl_advanced.py
  • consistency_distillation

    • train_lcm_distill_lora_sdxl.py
  • controlnet

    • train_controlnet.py
    • train_controlnet_sdxl.py
  • custom_diffusion

    • train_custom_diffusion.py
  • dreambooth

    • train_dreambooth.py
    • train_dreambooth_lora.py
    • train_dreambooth_lora_sdxl.py
  • instruct_pix2pix

    • train_instruct_pix2pix.py
    • train_instruct_pix2pix_sdxl.py
  • kandinsky2_2/text_to_image

    • train_text_to_image_decoder.py
    • train_text_to_image_prior.py
    • train_text_to_image_lora_decoder.py
    • train_text_to_image_lora_prior.py
  • t2i_adapter

    • train_t2i_adapter_sdxl.py
  • text_to_image

    • train_text_to_image.py
    • train_text_to_image_sdxl.py
    • train_text_to_image_lora.py
    • train_text_to_image_lora_sdxl.py
  • textual_inversion

    • textual_inversion.py
    • textual_inversion_sdxl.py
  • unconditional_image_generation

    • train_unconditional.py
  • wuerstchen

    • text_to_image/train_text_to_image_prior.py
    • text_to_image/train_text_to_image_lora_prior.py
  • research_projects (low-priority)

    • consistency_training/train_cm_ct_unconditional.py
    • diffusion_dpo/train_diffusion_dpo.py
    • diffusion_dpo/train_diffusion_dpo_sdxl.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora.py
    • dreambooth_inpaint/train_dreambooth_inpaint.py
    • dreambooth_inpaint/train_dreambooth_inpaint_lora.py
    • instructpix2pix_lora/train_instruct_pix2pix_lora.py
    • intel_opts/textual_inversion/textual_inversion_bf16.py
    • intel_opts/textual_inversion_dfq/textual_inversion.py
    • lora/train_text_to_image_lora.py
    • multi_subject_dreambooth/train_multi_subject_dreambooth.py
    • multi_token_textual_inversion/textual_inversion.py
    • onnxruntime/text_to_image/train_text_to_image.py
    • onnxruntime/textual_inversion/textual_inversion.py
    • onnxruntime/unconditional_image_generation/train_unconditional.py
    • realfill/train_realfill.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora_sdxl.py
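
For reference, here is a rough sketch of the change that #8312 made to examples/text_to_image/train_text_to_image.py. It reuses the script's existing names (args, accelerator, train_dataloader, optimizer, get_scheduler, logger) and is an outline to orient you, not a verbatim copy; please diff against the merged PR before porting it.

```python
import math

# Sketch of the scheduler math from #8312 (not a verbatim copy).
# The key point: when only --num_train_epochs is passed, the step budget handed
# to get_scheduler() must be derived from the *sharded* dataloader length,
# because accelerator.prepare() shrinks the dataloader by num_processes and the
# prepared scheduler advances num_processes times per optimizer step.
num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes
if args.max_train_steps is None:
    len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)
    num_update_steps_per_epoch = math.ceil(
        len_train_dataloader_after_sharding / args.gradient_accumulation_steps
    )
    num_training_steps_for_scheduler = (
        args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes
    )
else:
    num_training_steps_for_scheduler = args.max_train_steps * accelerator.num_processes

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps_for_scheduler,
    num_training_steps=num_training_steps_for_scheduler,
)

# ... later, after accelerator.prepare(...): recompute the true number of update
# steps per epoch from the now-sharded dataloader and warn on a mismatch.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    if num_training_steps_for_scheduler != args.max_train_steps * accelerator.num_processes:
        logger.warning(
            f"The length of the train_dataloader after accelerator.prepare ({len(train_dataloader)}) "
            f"does not match what was assumed when the LR scheduler was created "
            f"({len_train_dataloader_after_sharding}), so the scheduler may not behave as expected."
        )
```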

The following scripts do not have the argument --num_train_epochs:

  • amused
    • train_amused.py
  • research_projects
    • multi_subject_dreambooth_inpainting/train_multi_subject_dreambooth_inpainting.py

So, they don't need to be updated.

Then we have the following scripts that don't use accelerator to prepare the datasets. In these, distributed dataset sharding is done by WebDataset rather than Accelerate (see the sketch after this list), so we can skip them for now:

  • consistency_distillation
    • train_lcm_distill_sd_wds.py
    • train_lcm_distill_sdxl_wds.py
    • train_lcm_distill_lora_sd_wds.py
    • train_lcm_distill_lora_sdxl_wds.py
  • research_projects
    • controlnet/train_controlnet_webdataset.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora_wds.py

Steps to follow when opening PRs

  • Target one AND only one training script in a single PR.
  • When you open a PR, please mention this issue.
  • Mention @sayakpaul and @geniuspatrick for a review.
  • Accompany your PR with a minimal training command using the num_train_epochs CLI arg (an example follows this list).
  • Enjoy!
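
For example, a minimal command for the text_to_image script could look like the following (model, dataset, and output dir are placeholders; adjust --num_processes to your setup):

```bash
accelerate launch --num_processes=2 train_text_to_image.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --resolution=512 \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --output_dir="sd-text2image-epochs-test"
```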