
fix(training): lr scheduler doesn't work properly in distributed scenarios #8312

Merged: 1 commit merged into huggingface:main from train_lr on May 30, 2024

Conversation

@geniuspatrick (Contributor) commented on May 29, 2024

What does this PR do?

TL;DR

In a distributed training scenario, passing the argument --num_train_epochs to any of the training scripts disrupts the functioning of the learning rate scheduler. Essentially, the learning rate decays num_processes times slower than expected. Related issues #8236, #3954, and PR #3983 shed further light on this.

Explanation

In our training setup, we utilize accelerator instead of PyTorch's native DistributedSampler when creating the train_dataloader. This means we create the train_dataloader directly as if for standalone training and subsequently employ accelerator.prepare to shard the samples across different processes.

When the training scripts refer to a step (e.g. lr_warmup_steps, max_train_steps), they mean an optimizer step; each such step consumes num_processes * gradient_accumulation_steps batches of data. In the scripts, the learning rate scheduler is created before accelerator.prepare is called, i.e. at a point where the train_dataloader's batches have not yet been sharded across processes.

To calculate num_update_steps_per_epoch accurately, we need the length of the train_dataloader after distributed sharding. How do we obtain this? Typically, accelerator.prepare replaces train_dataloader.batch_sampler with a BatchSamplerShard, and the length of the sharded train_dataloader (still a DataLoader instance) becomes the length of that BatchSamplerShard. From this we derive a formula for estimating the length of the sharded train_dataloader, which holds for the current training scripts (where accelerator.prepare is called with no extra arguments).
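In code, this estimate boils down to something like the following (a minimal sketch; estimate_sharded_dataloader_length is a hypothetical helper name, not a function in the scripts):

import math

def estimate_sharded_dataloader_length(num_batches_standalone: int, num_processes: int) -> int:
    # accelerator.prepare wraps the train_dataloader's batch_sampler in BatchSamplerShard,
    # so each process sees roughly 1/num_processes of the standalone batches
    # (rounded up, since by default short shards are padded rather than dropped).
    return math.ceil(num_batches_standalone / num_processes)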

By accelerate's design, the prepared scheduler calls step() on the wrapped, unprepared scheduler num_processes times at each optimizer step (i.e. once gradient accumulation is completed). This necessitates dividing num_*_steps_for_scheduler by gradient_accumulation_steps and multiplying it by num_processes, as sketched below.
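Putting both observations together, the step counts handed to get_scheduler are derived from the sharded dataloader length and then scaled by num_processes. A minimal sketch building on the helper above (scheduler_step_counts is a hypothetical name, not the exact code in the script):

import math

def scheduler_step_counts(
    num_batches_standalone: int,
    num_processes: int,
    gradient_accumulation_steps: int,
    num_train_epochs: int,
    lr_warmup_steps: int,
):
    # Returns (num_warmup_steps, num_training_steps) to pass to get_scheduler.
    # Batches each process sees once accelerator.prepare has sharded the dataloader.
    batches_per_process = estimate_sharded_dataloader_length(num_batches_standalone, num_processes)
    # Optimizer (update) steps per epoch, accounting for gradient accumulation.
    update_steps_per_epoch = math.ceil(batches_per_process / gradient_accumulation_steps)
    # The prepared scheduler steps the wrapped scheduler num_processes times per
    # optimizer step, so both budgets are scaled up by num_processes.
    num_training_steps = num_train_epochs * update_steps_per_epoch * num_processes
    num_warmup_steps = lr_warmup_steps * num_processes
    return num_warmup_steps, num_training_steps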

Feeling a bit confused? Not to worry, let's visualize it.

Experiments

We use Fine-tuning for text2image with LoRA as an example. Below are the training commands:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/naruto-blip-captions"

# Example of --num_train_epochs
accelerate launch examples/text_to_image/train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=128 --random_flip --max_train_samples=171 \
  --train_batch_size=4 \
  --num_train_epochs=6 \
  --learning_rate=1e-04 --lr_scheduler="cosine_with_restarts" --lr_warmup_steps=3 \
  --gradient_accumulation_steps=5 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora-epoch"

# Example of --max_train_steps
accelerate launch examples/text_to_image/train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=128 --random_flip --max_train_samples=171 \
  --train_batch_size=4 \
  --max_train_steps=30 \
  --learning_rate=1e-04 --lr_scheduler="cosine_with_restarts" --lr_warmup_steps=3 \
  --gradient_accumulation_steps=5 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora-step"

The hyper-parameters are:

  • Batch size: 4
  • Number of processes (n_gpus): 2
  • Dataset length: 171
  • Gradient accumulation steps: 5

Thus,

len_dataloader_standalone  = ceil(171/4) = 43
len_dataloader_distributed = ceil(43/2)  = 22
num_update_steps_per_epoch = ceil(22/5)  = 5

So --num_train_epochs=6 is equivalent to --max_train_steps=30 (6 epochs × 5 update steps per epoch).
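Plugging the experiment's numbers into the sketch above (using the hypothetical scheduler_step_counts helper) gives:

>>> scheduler_step_counts(
...     num_batches_standalone=43,
...     num_processes=2,
...     gradient_accumulation_steps=5,
...     num_train_epochs=6,
...     lr_warmup_steps=3,
... )
(6, 60)

That is, get_scheduler is given 6 warmup and 60 training scheduler steps; the prepared scheduler consumes them at num_processes = 2 per optimizer step, which recovers the intended 3 warmup steps and 30 total update steps.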

Additionally, introducing the argument num_cycles=2 to the function get_scheduler exacerbates the error.

Before the PR

--num_train_epochs
[Screenshot 2024-05-29 18:32:08]

--max_train_steps
[Screenshot 2024-05-29 18:34:14]

After the PR

--num_train_epochs
[Screenshot 2024-05-29 18:34:26]

--max_train_steps
[Screenshot 2024-05-29 18:34:40]

Fixes #8236

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul @eliphatfs

@geniuspatrick changed the title from "fix(training): lr scheduler doesn't work properly" to "[WIP] fix(training): lr scheduler doesn't work properly" on May 29, 2024
@sayakpaul (Member) left a comment:

I just have minor comments but this is very very nicely done. Thanks so much!

Review thread on examples/text_to_image/train_text_to_image_lora.py (comment on lines 687 to 689):
"The length of the 'train_dataloader' after 'accelerator.prepare' does not match "
"the length that was expected when the learning rate scheduler was created. "
"This inconsistency may result in the learning rate scheduler not functioning properly."
Member:
Should we also include the values of "The length of the 'train_dataloader'" and "the length that was expected when the learning rate scheduler was created"?

Contributor:
Do we have drop_last settings or similar that may cause this to happen?

@geniuspatrick (Contributor Author) replied:
> Should we also include the values of "The length of the 'train_dataloader'" and "the length that was expected when the learning rate scheduler was created"?

Yep, the values are included!

@geniuspatrick (Contributor Author) replied:
> Do we have drop_last settings or similar that may cause this to happen?

For all of the current training scripts, the answer is no. Our estimate of the length of the sharded dataloader is always correct, so this warning message is never triggered.
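(For reference, a minimal sketch of the kind of check being discussed, with both lengths interpolated into the warning; names are illustrative rather than the exact code:)

def warn_if_dataloader_length_changed(train_dataloader, expected_num_batches, logger):
    # expected_num_batches: the length estimated before accelerator.prepare,
    # i.e. what the learning rate scheduler's step budget was based on.
    if len(train_dataloader) != expected_num_batches:
        logger.warning(
            f"The length of the 'train_dataloader' after 'accelerator.prepare' ({len(train_dataloader)}) does not match "
            f"the length that was expected when the learning rate scheduler was created ({expected_num_batches}). "
            "This inconsistency may result in the learning rate scheduler not functioning properly."
        )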

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@geniuspatrick force-pushed the train_lr branch 3 times, most recently from 4b00cbf to f6742ea, on May 30, 2024 at 05:08
@geniuspatrick changed the title from "[WIP] fix(training): lr scheduler doesn't work properly" to "[WIP] fix(training): lr scheduler doesn't work properly in distributed scenarios" on May 30, 2024
@sayakpaul (Member):
Sorry for pinging late but @geniuspatrick could we keep the changes in this PR to a bare minimum i.e., targeting a single script only and then opening the rest to the community? That will be easier to manage IMO.

@geniuspatrick (Contributor Author):
> Sorry for pinging late but @geniuspatrick could we keep the changes in this PR to a bare minimum i.e., targeting a single script only and then opening the rest to the community? That will be easier to manage IMO.

OK, I'll change the script examples/text_to_image/train_text_to_image_lora.py only.

@geniuspatrick force-pushed the train_lr branch 3 times, most recently from 60f9f39 to 92f7262, on May 30, 2024 at 08:16
@geniuspatrick changed the title from "[WIP] fix(training): lr scheduler doesn't work properly in distributed scenarios" to "fix(training): lr scheduler doesn't work properly in distributed scenarios" on May 30, 2024
@geniuspatrick (Contributor Author):
Hi @sayakpaul, I think it's ready now. Any suggestions?

@sayakpaul (Member) left a comment:

Thanks a ton!

@sayakpaul merged commit 3511a96 into huggingface:main on May 30, 2024. 8 checks passed.
@geniuspatrick (Contributor Author):
Hi @sayakpaul, here's a TODO list for follow-up contributions from the community.

What should be changed

  • advanced_diffusion_training
    • train_dreambooth_lora_sd15_advanced.py
    • train_dreambooth_lora_sdxl_advanced.py
  • consistency_distillation
    • train_lcm_distill_lora_sdxl.py
  • controlnet
    • train_controlnet.py
    • train_controlnet_sdxl.py
  • custom_diffusion
    • train_custom_diffusion.py
  • dreambooth
    • train_dreambooth.py
    • train_dreambooth_lora.py
    • train_dreambooth_lora_sdxl.py
  • instruct_pix2pix
    • train_instruct_pix2pix.py
    • train_instruct_pix2pix_sdxl.py
  • kandinsky2_2/text_to_image
    • train_text_to_image_decoder.py
    • train_text_to_image_prior.py
    • train_text_to_image_lora_decoder.py
    • train_text_to_image_lora_prior.py
  • research_projects
    • consistency_training/train_cm_ct_unconditional.py
    • diffusion_dpo/train_diffusion_dpo.py
    • diffusion_dpo/train_diffusion_dpo_sdxl.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora.py
    • dreambooth_inpaint/train_dreambooth_inpaint.py
    • dreambooth_inpaint/train_dreambooth_inpaint_lora.py
    • instructpix2pix_lora/train_instruct_pix2pix_lora.py
    • intel_opts/textual_inversion/textual_inversion_bf16.py
    • intel_opts/textual_inversion_dfq/textual_inversion.py
    • lora/train_text_to_image_lora.py
    • multi_subject_dreambooth/train_multi_subject_dreambooth.py
    • multi_token_textual_inversion/textual_inversion.py
    • onnxruntime/text_to_image/train_text_to_image.py
    • onnxruntime/textual_inversion/textual_inversion.py
    • onnxruntime/unconditional_image_generation/train_unconditional.py
    • realfill/train_realfill.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora_sdxl.py
  • t2i_adapter
    • train_t2i_adapter_sdxl.py
  • text_to_image
    • train_text_to_image.py
    • train_text_to_image_sdxl.py
    • train_text_to_image_lora.py
    • train_text_to_image_lora_sdxl.py
  • textual_inversion
    • textual_inversion.py
    • textual_inversion_sdxl.py
  • unconditional_image_generation
    • train_unconditional.py
  • wuerstchen
    • text_to_image/train_text_to_image_prior.py
    • text_to_image/train_text_to_image_lora_prior.py

What should NOT be changed

Category 1

The script does not have the argument --num_train_epochs.

  • amused
    • train_amused.py
  • research_projects
    • multi_subject_dreambooth_inpainting/train_multi_subject_dreambooth_inpainting.py

Category 2

Distributed dataset sharding is done by WebDataset, not accelerator.

  • consistency_distillation
    • train_lcm_distill_sd_wds.py
    • train_lcm_distill_sdxl_wds.py
    • train_lcm_distill_lora_sd_wds.py
    • train_lcm_distill_lora_sdxl_wds.py
  • research_projects
    • controlnet/train_controlnet_webdataset.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora_wds.py

BTW, if you need extra hands, I'd be happy to help!

@sayakpaul (Member):
Great! Thank you so much, @geniuspatrick!

I will create an issue similar to #6545 so that the community can easily pick them up.
