
[Feature Request] add layer-wise optimizers #29732

Open
peterjc123 opened this issue Mar 19, 2024 · 5 comments
Labels
Feature request, optimization

Comments

@peterjc123
Contributor

peterjc123 commented Mar 19, 2024

Feature request

Context: #29588 (comment)

Motivation

Layer-wise optimizers are not GaLore-specific. We could apply them to generic optimizers to save memory. For example, the 8-bit Adam optimizer paired with layer-wise optimization sounds like a pretty good option to me.

Your contribution

I'll give it a try once it is supported.

@amyeroberts added the Feature request and optimization labels Mar 19, 2024
@amyeroberts
Collaborator

cc @younesbelkada

@janeyx99

Just putting myself out here--I'd be happy to hear requests/requirements for design/support from the PyTorch side!

@younesbelkada
Contributor

younesbelkada commented Mar 21, 2024

Hi!
Thanks so much @janeyx99 ! 🙏
Currently our approach is to use register_post_accumulate_grad_hook: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1300-L1309, together with dummy optimizers that are no-ops during step() (idea borrowed from @hiyouga). Per my understanding, this does not play well with some training schemes such as distributed training, so we would probably need some help on the PyTorch side, if possible, to support that (cc also @jiaweizzhao, as we've been discussing this as well)
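
For reference, here is a minimal sketch of that pattern outside of Trainer (not the exact Trainer implementation; the model and learning rates below are just placeholders): one small optimizer per parameter, stepped from a register_post_accumulate_grad_hook so each gradient can be freed immediately, plus a dummy optimizer whose step() is a no-op.

```python
import torch

# Placeholder model; in practice this is the model being trained.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.Linear(128, 10))

# One small optimizer per parameter (this could just as well be 8-bit Adam
# or a GaLore optimizer instead of AdamW).
per_param_optimizers = {
    p: torch.optim.AdamW([p], lr=1e-3)
    for p in model.parameters()
    if p.requires_grad
}

def optimizer_hook(param):
    # Called right after this parameter's gradient has been accumulated:
    # apply the update and free the gradient immediately.
    per_param_optimizers[param].step()
    per_param_optimizers[param].zero_grad(set_to_none=True)

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)

class DummyOptimizer(torch.optim.Optimizer):
    # Handed to the training loop so it still has an optimizer to call;
    # the real updates happen inside the hooks above.
    def __init__(self, params, lr=0.0):
        super().__init__(params, defaults={"lr": lr})

    def step(self, closure=None):
        return None

    def zero_grad(self, set_to_none=True):
        pass

optimizer = DummyOptimizer(model.parameters())
```

The zero_grad(set_to_none=True) inside the hook is what releases each gradient right after its own update, which is where the memory saving comes from.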

@janeyx99

@younesbelkada Yes, it'll be interesting to discuss the current known pain points/requirements. Some questions that have already come up are around DDP (which keeps buckets of gradients around to accumulate once all the data has been processed) and gradient accumulation (which saves the gradients across iterations of fwd-bwd until an optim step). In both these cases, layer-wise optimizers will not save memory. Though, as a side note for GaLore: since the gradients should be smaller, the buffers that previously held the full-sized grads should be able to shrink too.

Beyond those two instances, we should theoretically be able to compose this with FSDP and all other use cases where the gradients are allowed to be freed right after the optimizer update. I'm curious where people have run into problems with:

  • checkpointing (saving + loading)
  • general composition with existing trainers + configs.

@acwme111

acwme111 commented May 7, 2024

@janeyx99 @peterjc123 I guess this is similar to what Badam is doing: #30308
