Been thinking about this one a lot lately - a few open questions I see:
In terms of implementation, does an auto-tuner have to be a wrapper around the composer library? If the call to train_model fails, I imagine we would have to reinitialize just about everything in order to try again.
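To make that concrete, here is a minimal sketch of the "wrapper" option. It assumes a hypothetical `build_trainer` factory (not an actual composer API) that constructs a fresh model, dataloaders, and trainer for a given grad_accum, and a `.fit()` call that raises a CUDA out-of-memory `RuntimeError` when the setting doesn't fit:

```python
import torch

def train_with_retries(build_trainer, max_grad_accum=128):
    """Rebuild everything and retry with a larger grad_accum whenever the
    training run dies with a CUDA out-of-memory error."""
    grad_accum = 1
    while grad_accum <= max_grad_accum:
        # Re-initialize model, optimizer, dataloaders, and trainer from scratch
        # on every attempt, since the failed run's state cannot be reused.
        trainer = build_trainer(grad_accum=grad_accum)
        try:
            trainer.fit()
            return grad_accum
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            del trainer
            torch.cuda.empty_cache()
            grad_accum *= 2
    raise RuntimeError(f"no grad_accum <= {max_grad_accum} fit in memory")
```

The painful part is exactly the reinitialization: every retry pays the full setup cost again.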
What if the model's memory usage changes significantly over time? This could happen because of batch skips due to bad loss scaling, algorithms that grow the effective model size as training progresses, or even checkpointing.
Auto-tuning will presumably take some time, and it would be ideal not to have to repeat the process every time a new run is launched. Is there a way we can persist discovered grad_accum settings across runs with our current infrastructure?
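One low-tech possibility, sketched here with a hypothetical cache file and key (nothing in the current infrastructure provides this): persist the discovered value on disk, keyed by the factors that determine per-device memory use.

```python
import json
import os

import torch

CACHE_PATH = os.path.expanduser("~/.cache/grad_accum_cache.json")  # hypothetical location

def cache_key(model_name: str, per_device_batch_size: int, world_size: int) -> str:
    """Key on the factors that determine per-device memory use."""
    device = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"
    return f"{model_name}|bs={per_device_batch_size}|ws={world_size}|{device}"

def load_grad_accum(key: str):
    """Return a previously discovered grad_accum, or None on a cache miss."""
    if not os.path.exists(CACHE_PATH):
        return None
    with open(CACHE_PATH) as f:
        return json.load(f).get(key)

def save_grad_accum(key: str, grad_accum: int) -> None:
    """Record a discovered grad_accum so future runs can skip auto-tuning."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    cache[key] = grad_accum
    os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)
```

A shared filesystem or run-metadata store would work the same way; the open question is which factors need to be in the key.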
🚀 Feature Request
The trainer can automatically determine the appropriate grad_accum to use based on hardware properties.
Motivation
It is cumbersome to manually specify grad_accum for every hardware configuration and model.
Implementation
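No concrete design is specified yet. As one possible shape, here is a sketch under the assumption that the trainer can run a throwaway forward/backward pass on a sample batch before training starts (this is not a description of composer's internals): double grad_accum until one full step fits in memory.

```python
import torch

def find_grad_accum(model, optimizer, sample_batch, max_grad_accum=128):
    """Probe for the smallest grad_accum at which one full step fits in memory."""
    grad_accum = 1
    while grad_accum <= max_grad_accum:
        try:
            optimizer.zero_grad(set_to_none=True)
            for micro_batch in sample_batch.chunk(grad_accum):
                loss = model(micro_batch).sum()  # stand-in loss for the probe
                (loss / grad_accum).backward()
            optimizer.zero_grad(set_to_none=True)  # discard the probe's gradients
            return grad_accum
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            torch.cuda.empty_cache()
            grad_accum *= 2
    raise RuntimeError("sample batch does not fit even at max_grad_accum")
```

A purely static estimate from hardware properties (total GPU memory vs. parameter and activation counts) would avoid the probe entirely, but seems harder to get right for arbitrary models and algorithms.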