Make the Scheduler adjust the steps taken relative to the gradient accumulation steps #1187
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for working on this. Just not sure about the API (adding yet another argument to the Accelerator) for this.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@sgugger good for review, a few notes:
Thanks for iterating! I agree that the default should be True for adjust_scheduler. The gradient accumulation API is quite new anyway, so it's okay if it's adjusted a bit like this.
src/accelerate/utils/dataclasses.py (outdated)
num_steps: int = field(default=None, metadata={"help": "The number of steps to accumulate gradients for."})
adjust_scheduler: bool = field(
    default=False,
This should be True by default I think.
Let the AcceleratedScheduler handle gradient accumulation steppage

What does this add?
This PR adjusts the logic in the AcceleratedScheduler to take into account gradient accumulation steps.

Who is it for?
Closes #1170
Closes #1160
Why is it needed?
Currently there is no behavior that automatically "cuts" the LR scheduler for a user when they pass in gradient_accumulation_steps, so unless they are careful and adjust their LR scheduler beforehand, they are not actually stepping the LR scheduler properly.
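A toy illustration of the problem (plain PyTorch, not accelerate code; the batch count, accumulation factor, and LambdaLR schedule are made up): if the scheduler only advances on real optimizer updates, an accumulation factor of 4 means only a quarter of the planned scheduler steps ever happen.

```python
import torch

num_batches, accumulation_steps = 100, 4

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
# Schedule sized for one step per batch, decaying linearly to zero.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: 1.0 - step / num_batches
)

for batch in range(num_batches):
    loss = model(torch.randn(2)).sum() / accumulation_steps
    loss.backward()
    if (batch + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()  # only stepped on real optimizer updates

print(scheduler.last_epoch)     # 25 -- only a quarter of the planned steps
print(scheduler.get_last_lr())  # [0.75] -- the LR never decays to zero
```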
What parts of the API does this impact?

User-facing:
A new GradientAccumulationPlugin is being added which will handle gradient_accumulation_steps and optionally disable the extra steppage involved with the scheduler when performing gradient accumulation.
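A minimal sketch of the intended usage, assuming the num_steps and adjust_scheduler fields shown in the diff above and a gradient_accumulation_plugin argument on the Accelerator (both taken from this PR, so treat the names as illustrative rather than final):

```python
from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

# Sketch only: the field names come from the diff above; the keyword
# argument on Accelerator is assumed here for illustration.
plugin = GradientAccumulationPlugin(num_steps=2, adjust_scheduler=True)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)
```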
Internal structure:
AcceleratedScheduler's step function will now run n * num_processes, where n == gradient_accumulation_steps, to account for the difference (a simplified sketch of this stepping behavior follows below).

To test performance, I ran the equivalent training with gradient_accumulation_steps==2 on a batch size of 16 against a regular batch size of 32, with negligible performance differences (a difference of 0.5% accuracy and 0.3 F1), likely due to batch norm layers.
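A simplified sketch (not the actual accelerate source) of the stepping behavior described under "Internal structure": once the optimizer has really stepped, the wrapped scheduler is advanced gradient_accumulation_steps * num_processes times so it keeps pace with a schedule written for one step per batch on a single process.

```python
class SchedulerWrapperSketch:
    """Illustrative stand-in for the AcceleratedScheduler stepping logic."""

    def __init__(self, scheduler, num_processes, gradient_accumulation_steps):
        self.scheduler = scheduler
        self.num_processes = num_processes
        self.gradient_accumulation_steps = gradient_accumulation_steps

    def step(self, optimizer_stepped: bool):
        # During accumulation the optimizer did not update, so the schedule
        # should not advance on this batch.
        if not optimizer_stepped:
            return
        # One real update stands in for gradient_accumulation_steps batches on
        # each of num_processes workers, so advance the wrapped scheduler that
        # many times to keep it on its original per-batch schedule.
        for _ in range(self.gradient_accumulation_steps * self.num_processes):
            self.scheduler.step()
```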
Usage Example:
When building the Accelerator, pass in adjust_scheduler_to_accumulation (default False) to enable this behavior:
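A sketch of that usage (adjust_scheduler_to_accumulation is the argument name used in this PR description at this point and was still under discussion in review; the model, optimizer, scheduler, and data setup are purely illustrative):

```python
import torch
from accelerate import Accelerator

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 8), batch_size=16)

# `adjust_scheduler_to_accumulation` is the flag proposed in this PR's
# description; treat it as illustrative rather than the final API.
accelerator = Accelerator(
    gradient_accumulation_steps=2,
    adjust_scheduler_to_accumulation=True,
)
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

for batch in dataloader:
    with accelerator.accumulate(model):
        loss = model(batch).sum()
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()  # the prepared scheduler accounts for accumulation
        optimizer.zero_grad()
```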