Been thinking about this one a lot lately - a few open questions I see:
In terms of implementation, does an auto-tuner have to be a wrapper around the composer library? If the call to train_model fails, I imagine we would have to reinitialize just about everything in order to try again.
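To make that concrete, here is a minimal sketch of the "wrapper" option. It assumes a hypothetical `build_trainer` factory (not an actual composer API) that constructs a fresh model, dataloaders, and trainer for a given grad_accum, and a `.fit()` call that raises a CUDA out-of-memory `RuntimeError` when the setting doesn't fit:

```python
import torch

def train_with_retries(build_trainer, max_grad_accum=128):
    """Rebuild everything and retry with a larger grad_accum whenever the
    training run dies with a CUDA out-of-memory error."""
    grad_accum = 1
    while grad_accum <= max_grad_accum:
        # Re-initialize model, optimizer, dataloaders, and trainer from scratch
        # on every attempt, since the failed run's state cannot be reused.
        trainer = build_trainer(grad_accum=grad_accum)
        try:
            trainer.fit()
            return grad_accum
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            del trainer
            torch.cuda.empty_cache()
            grad_accum *= 2
    raise RuntimeError(f"no grad_accum <= {max_grad_accum} fit in memory")
```

The painful part is exactly the reinitialization: every retry pays the full setup cost again.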
What if the model's memory usage changes significantly over time? This could happen because of batch skips due to bad loss scaling, algorithms that grow the effective model size as training progresses, or even checkpointing.
Auto-tuning will presumably take some time, and it would be ideal not to have to repeat the process every time a new run is launched. Is there a way we can persist discovered grad_accum settings across runs with our current infrastructure?
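One low-tech possibility, sketched here with a hypothetical cache file and key (nothing in the current infrastructure provides this): persist the discovered value on disk, keyed by the factors that determine per-device memory use.

```python
import json
import os

import torch

CACHE_PATH = os.path.expanduser("~/.cache/grad_accum_cache.json")  # hypothetical location

def cache_key(model_name: str, per_device_batch_size: int, world_size: int) -> str:
    """Key on the factors that determine per-device memory use."""
    device = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"
    return f"{model_name}|bs={per_device_batch_size}|ws={world_size}|{device}"

def load_grad_accum(key: str):
    """Return a previously discovered grad_accum, or None on a cache miss."""
    if not os.path.exists(CACHE_PATH):
        return None
    with open(CACHE_PATH) as f:
        return json.load(f).get(key)

def save_grad_accum(key: str, grad_accum: int) -> None:
    """Record a discovered grad_accum so future runs can skip auto-tuning."""
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    cache[key] = grad_accum
    os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)
```

A shared filesystem or run-metadata store would work the same way; the open question is which factors need to be in the key.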
🚀 Feature Request
The trainer can automatically determine the appropriate grad_accum to use based on hardware properties.
Motivation
It is cumbersome to manually specify grad_accum for every hardware configuration and model.
Implementation
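No concrete design is specified yet. As one possible shape, here is a sketch under the assumption that the trainer can run a throwaway forward/backward pass on a sample batch before training starts (this is not a description of composer's internals): double grad_accum until one full step fits in memory.

```python
import torch

def find_grad_accum(model, optimizer, sample_batch, max_grad_accum=128):
    """Probe for the smallest grad_accum at which one full step fits in memory."""
    grad_accum = 1
    while grad_accum <= max_grad_accum:
        try:
            optimizer.zero_grad(set_to_none=True)
            for micro_batch in sample_batch.chunk(grad_accum):
                loss = model(micro_batch).sum()  # stand-in loss for the probe
                (loss / grad_accum).backward()
            optimizer.zero_grad(set_to_none=True)  # discard the probe's gradients
            return grad_accum
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            torch.cuda.empty_cache()
            grad_accum *= 2
    raise RuntimeError("sample batch does not fit even at max_grad_accum")
```

A purely static estimate from hardware properties (total GPU memory vs. parameter and activation counts) would avoid the probe entirely, but seems harder to get right for arbitrary models and algorithms.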