
[Feature Request] add layer-wise optimizers #29732

Open
peterjc123 opened this issue Mar 19, 2024 · 5 comments
Labels
Feature request, optimization

Comments

@peterjc123
Contributor

peterjc123 commented Mar 19, 2024

Feature request

Context: #29588 (comment)

Motivation

Layer-wise optimizers are not GaLore-specific. We could apply them to generic optimizers to save memory. For example, the 8-bit Adam optimizer paired with layer-wise optimization sounds like a pretty good option to me.

Your contribution

I'll give it a try once it is supported.

@amyeroberts added the Feature request and optimization labels Mar 19, 2024
@amyeroberts
Collaborator

cc @younesbelkada

@janeyx99

Just putting myself out here--I'd be happy to hear requests/requirements for design/support from the PyTorch side!

@younesbelkada
Contributor

younesbelkada commented Mar 21, 2024

Hi!
Thanks so much @janeyx99 ! 🙏
Currently our approach is to use register_post_accumulate_grad_hook: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1300-L1309, together with dummy optimizers that are no-ops during step() (idea borrowed from @hiyouga). Per my understanding, this does not play well with some training schemes such as distributed training, so we would probably need some help on the PyTorch side, if possible, to support that (cc also @jiaweizzhao, as we've been discussing this as well)
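
For reference, here is a minimal sketch of that pattern outside of Trainer (not the exact Trainer implementation; the model and learning rates below are just placeholders): one small optimizer per parameter, stepped from a register_post_accumulate_grad_hook so each gradient can be freed immediately, plus a dummy optimizer whose step() is a no-op.

```python
import torch

# Placeholder model; in practice this is the model being trained.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.Linear(128, 10))

# One small optimizer per parameter (this could just as well be 8-bit Adam
# or a GaLore optimizer instead of AdamW).
per_param_optimizers = {
    p: torch.optim.AdamW([p], lr=1e-3)
    for p in model.parameters()
    if p.requires_grad
}

def optimizer_hook(param):
    # Called right after this parameter's gradient has been accumulated:
    # apply the update and free the gradient immediately.
    per_param_optimizers[param].step()
    per_param_optimizers[param].zero_grad(set_to_none=True)

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)

class DummyOptimizer(torch.optim.Optimizer):
    # Handed to the training loop so it still has an optimizer to call;
    # the real updates happen inside the hooks above.
    def __init__(self, params, lr=0.0):
        super().__init__(params, defaults={"lr": lr})

    def step(self, closure=None):
        return None

    def zero_grad(self, set_to_none=True):
        pass

optimizer = DummyOptimizer(model.parameters())
```

The zero_grad(set_to_none=True) inside the hook is what releases each gradient right after its own update, which is where the memory saving comes from.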

@janeyx99

@younesbelkada Yes, it'll be interesting to discuss the current known pain points/requirements. Some questions that have already come up are around DDP (which keeps buckets of gradients around to accumulate once all the data has been processed) and gradient accumulation (which saves the gradients across iterations of fwd-bwd until an optim step). In both these cases, layer-wise optimizers will not save memory. Though, as a side note for GaLore: since the gradients should be smaller, the buffers that previously held the full-sized grads should be able to shrink too.

Beyond those two instances, we should theoretically be able to compose this with FSDP and all other use cases where the gradients are allowed to be freed right after the optimizer update. I'm curious where people have run into problems with:

  • checkpointing (saving + loading)
  • general composition with existing trainers + configs.

@acwme111

acwme111 commented May 7, 2024

@janeyx99 @peterjc123 I guess this is similar to what Badam is doing: #30308
