Hi,
I have been working on the training scripts for multiple models (T2I, IP2P) and found that the logic for calculating `step` and `epoch` when resuming training differs across scripts:
- In the `train_text_to_image.py` script (link)
- In the `train_instruct_pix2pix.py` script (link)
In a similar issue, some changes were made for the progress bar inconsistency, but I am a bit confused about the following things:
The multiplication by `args.gradient_accumulation_steps` in the `train_instruct_pix2pix.py` script.
In general, what does `global_step` indicate and how is it updated? In both scripts I can see the following code, but I couldn't understand it from the `accelerate` documentation:
```python
if accelerator.sync_gradients:
    if args.use_ema:
        # update the EMA copy of the UNet after each optimizer step
        ema_unet.step(unet.parameters())
    progress_bar.update(1)
    global_step += 1
    accelerator.log({"train_loss": train_loss}, step=global_step)
    train_loss = 0.0
```
If we are using multiple GPUs with gradient accumulation, at what event is `global_step` updated? Is it updated independently by each GPU (since the code is not wrapped in `accelerator.is_main_process`)? And how does accumulation affect the tracking here?
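For context, this snippet sits inside a training loop shaped roughly like the following in both scripts (heavily trimmed; `compute_loss` is a placeholder of mine, not a function from the scripts):

```python
for epoch in range(first_epoch, args.num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        with accelerator.accumulate(unet):
            loss = compute_loss(batch)  # placeholder for the actual diffusion loss
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # the snippet in question runs here, once per micro-batch, but its
        # body only executes on the micro-batch where gradients were synced
        if accelerator.sync_gradients:
            progress_bar.update(1)
            global_step += 1
```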
> The multiplication by `args.gradient_accumulation_steps` in the `train_instruct_pix2pix.py` script.
Why should it not be the case? The calculation is based on steps, and without the GA steps it would be improper, no?
Cc'ing @muellerzr for further clarification in light of `accelerate`.
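Concretely, the resume arithmetic under discussion looks roughly like this (simplified; variable names follow the scripts, but check the linked code for the exact form):

```python
# global_step saved in the checkpoint name counts *optimizer* updates.
global_step = int(path.split("-")[1])

# Resuming has to skip *dataloader* (micro-batch) steps, and there are
# gradient_accumulation_steps of those per optimizer update:
resume_global_step = global_step * args.gradient_accumulation_steps
first_epoch = global_step // num_update_steps_per_epoch  # optimizer updates per epoch
resume_step = resume_global_step % (num_update_steps_per_epoch * args.gradient_accumulation_steps)

# Example: gradient_accumulation_steps=4, num_update_steps_per_epoch=250,
# resuming from checkpoint-1000 -> first_epoch = 4 and 1000 * 4 = 4000
# micro-batches were already consumed. Without the multiplication, only
# 1000 micro-batches would be skipped, i.e. we would resume too early.
```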
I am not sure about the calculation but do find it different in these two scripts. Is one of them outdated or wrong?
To deduce this calculation, I tried to understand how `global_step` is updated, but couldn't. In general, it is incremented by 1 when `accelerator.sync_gradients` is true. The following code is used to update `global_step`:
```python
if accelerator.sync_gradients:
    if args.use_ema:
        ema_unet.step(unet.parameters())
    progress_bar.update(1)
    global_step += 1
    accelerator.log({"train_loss": train_loss}, step=global_step)
    train_loss = 0.0
```
What does this code imply?
Is this counter updated by each GPU (multi-process scenario) or not? Does the `sync_gradients` flag take care of gradient accumulation or not? Only with that settled can I deduce the calculation.
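For reference, here is a minimal, self-contained experiment I would use to probe this (toy model and data, not taken from either script). It suggests `sync_gradients` flips to `True` once per accumulation cycle on every process, so the counter advances in lockstep across GPUs:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model = torch.nn.Linear(8, 1)  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

global_step = 0
for step, (x, y) in enumerate(dataloader):
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        optimizer.step()  # a no-op on micro-batches where gradients are not synced
        optimizer.zero_grad()

    # sync_gradients is True only on the micro-batch where gradients are
    # actually synchronized and the optimizer steps (every 4th step here).
    # Every process evaluates the same condition, so global_step advances
    # identically on all GPUs without needing accelerator.is_main_process.
    if accelerator.sync_gradients:
        global_step += 1
        accelerator.print(f"dataloader step {step}: global_step -> {global_step}")
```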