Auto Grad Accum #54

Closed
ravi-mosaicml opened this issue Oct 29, 2021 · 1 comment · Fixed by #485
Labels: enhancement (New (engineering) enhancements, such as features or API changes.)

@ravi-mosaicml (Contributor)

🚀 Feature Request

The trainer can automatically determine the appropriate grad_accum to use based on hardware properties.

Motivation

It is cumbersome to manually specify grad_accum for every hardware configuration and model.

Implementation

while True:
    try:
        train_model()
        break  # success: the current grad_accum fits in memory
    except RuntimeError as e:
        if "out of memory" not in str(e):  # only retry on CUDA OOM
            raise
        state.grad_accum += 1
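For context on what grad_accum controls: each optimizer-step batch is split into grad_accum microbatches, gradients are accumulated across them, and the optimizer steps once, so larger values trade throughput for lower peak memory. A minimal PyTorch sketch of that mechanism, with illustrative names (train_step, loss_fn); this is not Composer's actual implementation:

import torch

def train_step(model: torch.nn.Module,
               optimizer: torch.optim.Optimizer,
               loss_fn,
               inputs: torch.Tensor,
               targets: torch.Tensor,
               grad_accum: int) -> None:
    # Split the batch into grad_accum microbatches and accumulate
    # gradients across them before taking a single optimizer step.
    optimizer.zero_grad()
    for micro_in, micro_tgt in zip(inputs.chunk(grad_accum),
                                   targets.chunk(grad_accum)):
        loss = loss_fn(model(micro_in), micro_tgt)
        # Scale so the accumulated gradient matches the full-batch average
        # (exact when the microbatches are equally sized).
        (loss / grad_accum).backward()
    optimizer.step()

Raising grad_accum shrinks each microbatch, which is why bumping it on OOM, as in the loop above, can rescue a run at the cost of speed.
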
ravi-mosaicml added the enhancement label Oct 29, 2021
@jbloxham (Contributor) commented Jan 6, 2022

Been thinking about this one a lot lately - a few open questions I see:

  1. In terms of implementation, does an auto-tuner have to be a wrapper around the composer library? In the event the call to train_model fails, I imagine we would have to pretty much just reinitialize everything in order to try again. (A possible shape for such a wrapper is sketched after this list.)

  2. What if the model's memory usage changes significantly over time? Some reasons this could happen include batch skips due to bad loss scaling, algorithms that increase the effective model size over time, and even checkpointing.

  3. Auto-tuning will presumably take some time, and it would be ideal not to have to repeat this process every time a new run is launched. Is there a way we can persist discovered grad_accum settings across runs with our current infrastructure? (The sketch below includes one naive file-based option.)
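
A rough sketch of how points 1 and 3 could fit together: a thin wrapper that rebuilds the trainer from scratch after each OOM (point 1) and caches the discovered value in a local JSON file keyed by model/hardware (one naive answer to point 3). Everything here is hypothetical: make_trainer, run_key, and the cache path are illustrative names, and detecting CUDA OOM by matching the RuntimeError message is an assumption about how the failure surfaces:

import json
import os

import torch

CACHE_PATH = "grad_accum_cache.json"  # hypothetical cache location

def fit_with_auto_grad_accum(make_trainer, run_key: str, max_grad_accum: int = 128) -> int:
    # Reuse a previously discovered setting for this model/hardware combo, if any.
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    grad_accum = cache.get(run_key, 1)

    while grad_accum <= max_grad_accum:
        trainer = make_trainer(grad_accum)  # reinitialize everything from scratch (point 1)
        try:
            trainer.fit()
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            torch.cuda.empty_cache()  # release cached blocks before retrying
            grad_accum += 1
            continue
        cache[run_key] = grad_accum  # persist the working value for future runs (point 3)
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
        return grad_accum

    raise RuntimeError("no grad_accum <= %d fits in memory" % max_grad_accum)

A cache like this would of course need to be invalidated whenever the model, batch size, or hardware changes, which is part of what makes question 3 hard.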

ravi-mosaicml added this to the Backlog milestone Feb 15, 2022
ravi-mosaicml removed this from the Backlog milestone Feb 28, 2022
hanlint linked a pull request Mar 4, 2022 that will close this issue