[Feature Request] add layer-wise optimizers #29732
Comments
Just putting myself out here--I'd be happy to hear requests/requirements for design/support from the PyTorch side!
Hi!
@younesbelkada Yes, it'll be interesting to discuss the current known pain points/requirements. Some questions that have already come up are around DDP (which keeps buckets of gradients around so they can be accumulated once all the data has been processed) and gradient accumulation (which keeps gradients across forward-backward iterations until an optimizer step). In both of these cases, layer-wise optimizers will not save memory. A side note for GaLore, though: since its gradients should be smaller, the buffers that previously held full-sized gradients should be allowed to shrink as well. Beyond those two cases, we should theoretically be able to compose this technique with FSDP and any other use case where gradients can be freed right after the optimizer update. I'm curious where people have run into problems.
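To make the hook-based approach concrete, here is a minimal sketch of a layer-wise optimizer fused into the backward pass using PyTorch's `Tensor.register_post_accumulate_grad_hook` (available since PyTorch 2.1). The toy model, the choice of AdamW, and the hyperparameters are illustrative assumptions, not something specified in this issue.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

# One small optimizer per parameter, so each update can run (and the gradient
# can be freed) as soon as that parameter's gradient has been accumulated.
optimizer_dict = {
    p: torch.optim.AdamW([p], lr=1e-4)
    for p in model.parameters()
    if p.requires_grad
}

def optimizer_hook(param: torch.Tensor) -> None:
    optimizer_dict[param].step()
    optimizer_dict[param].zero_grad()  # frees the grad (set_to_none=True by default)

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)

# No global optimizer.step()/zero_grad() in the training loop: the hooks run
# each parameter's update during backward().
inputs = torch.randn(8, 512)
loss = model(inputs).sum()
loss.backward()
```

Because each parameter's gradient is consumed and freed inside its hook, peak memory no longer needs to hold all gradients at once, which is exactly the property that DDP bucketing and gradient accumulation break, as noted above.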
@janeyx99 @peterjc123 I guess this is similar to what BAdam is doing: #30308
Feature request
Context: #29588 (comment)
Motivation
Layer-wise optimizers are not GaLore-specific: we could apply them to generic optimizers to save memory. For example, pairing the 8-bit Adam optimizer with layer-wise optimization sounds like a pretty good option to me.
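As a hedged sketch of that pairing (not an existing transformers feature), the same per-parameter hook pattern could hold bitsandbytes' 8-bit Adam state for each layer. The `bnb.optim.Adam8bit` usage, the toy model, and the learning rate are assumptions here, and bitsandbytes typically requires a CUDA device.

```python
import torch
import bitsandbytes as bnb  # assumes a CUDA-enabled bitsandbytes install

model = torch.nn.Linear(4096, 4096).cuda()

# 8-bit Adam state per parameter, stepped from a post-accumulate-grad hook.
optimizer_dict = {
    p: bnb.optim.Adam8bit([p], lr=1e-4)
    for p in model.parameters()
    if p.requires_grad
}

def optimizer_hook(param: torch.Tensor) -> None:
    optimizer_dict[param].step()
    optimizer_dict[param].zero_grad()  # free the gradient right after the update

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)
```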
Your contribution
I'll give it a try when it is supported.