
Question Regarding Optimizer States Management in Lisa Implementation #729

Closed
AlvL1225 opened this issue Apr 1, 2024 · 5 comments

AlvL1225 commented Apr 1, 2024

I'm currently delving into the implementation details of Lisa and have encountered a point of confusion regarding the management of optimizer states, specifically in comparison to the traditional AdamW optimizer.

In the typical AdamW setup, for each model parameter the optimizer maintains first- and second-moment estimates (m1, m2) and, in mixed-precision training, an fp32 master weight copy. These states are among the primary consumers of GPU memory during training.
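For concreteness, here is a back-of-envelope sketch of the memory these states occupy, assuming the common fp32 accounting of 4 bytes each for m1, m2, and an optional master weight copy per parameter (a convention, not a claim about any particular framework):

```python
# Rough memory estimate for AdamW optimizer states, assuming the common
# fp32 accounting: 4 bytes each for m1, m2, and (optionally) a master
# weight copy per parameter. Illustrative only.

def adamw_state_bytes(num_params, keep_master_copy=True):
    per_param = 4 + 4              # m1 (fp32) + m2 (fp32)
    if keep_master_copy:
        per_param += 4             # fp32 master weight copy
    return num_params * per_param

# For a 7B-parameter model this comes to roughly 78 GiB of state.
print(adamw_state_bytes(7_000_000_000) / 2**30)
```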

My question pertains to how this scenario is handled within the Lisa framework. Suppose we activate layer 0, thereby retaining the AdamW states (m1 and m2) for layer 0 in memory. What occurs upon switching to layer 1? Are the AdamW states (m1 and m2) for layer 0 discarded from memory, or is there a different mechanism in place to manage these states as we switch between layers?

This clarification will greatly aid in my understanding of Lisa's memory management and optimization processes.

Thank you for your time and assistance.

research4pan (Contributor) commented:

Thanks for your interest in LMFlow! Currently, the optimizer states are treated as discarded at the end of each LISA interval for the intermediate layers. Maintaining those m1 and m2 states would incur large memory consumption, essentially making LISA the same as full-parameter training.

The suggestion of maintaining them in a smarter way is a great one! I think there could be engineering mechanisms to occasionally offload them to CPU or disk, but this feature is still under implementation and not yet integrated. Hope this information is helpful 😄
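A minimal plain-Python sketch of the behavior described above (the class and method names are illustrative, not LMFlow's actual API): when a new layer is activated, the m1/m2 states of the previously active layer are simply dropped rather than retained or offloaded.

```python
# Plain-Python sketch of discarding optimizer states on a layer switch.
# `LisaStateManager` and `activate_layer` are illustrative names, not
# LMFlow's actual API.

def fresh_adamw_state(num_params):
    """Zero-initialized first/second moment estimates for one layer."""
    return {"m1": [0.0] * num_params, "m2": [0.0] * num_params, "step": 0}

class LisaStateManager:
    def __init__(self):
        self.states = {}   # layer index -> AdamW state (active layers only)

    def activate_layer(self, layer, num_params):
        # Drop states of previously active layers: keeping them all would
        # grow memory toward full-parameter AdamW training.
        self.states = {layer: fresh_adamw_state(num_params)}

mgr = LisaStateManager()
mgr.activate_layer(0, num_params=4)   # interval 1: layer 0 holds m1/m2
mgr.activate_layer(1, num_params=4)   # interval 2: layer 0's states are discarded
```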


AlvL1225 commented Apr 1, 2024


Thank you for your prompt response!

I have another question regarding the implementation of AdamW in PyTorch. Specifically, does the native PyTorch implementation of AdamW accommodate dynamic adjustments, such as disregarding the states of frozen layers and initializing states for newly activated layers? Alternatively, have you customized the AdamW class to support these functionalities for Lisa?
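For reference, PyTorch's native AdamW does initialize per-parameter state lazily: state for a parameter is created only on the first `step()` in which that parameter actually has a gradient, so frozen layers never accrue m1/m2. A plain-Python sketch of that lazy-initialization behavior (a model of the behavior, not the real optimizer):

```python
# Plain-Python sketch of torch.optim.AdamW's lazy state initialization:
# per-parameter state is created only on the first step() in which that
# parameter has a gradient; parameters whose grad is None are skipped.
# (This models the behavior; it is not the real optimizer.)

class LazyAdamWSketch:
    def __init__(self, params):
        self.params = params   # list of dicts: {"grad": float or None}
        self.state = {}        # param index -> {"m1", "m2"}

    def step(self):
        for i, p in enumerate(self.params):
            if p["grad"] is None:          # frozen layer: no state created
                continue
            if i not in self.state:        # lazy init on first real gradient
                self.state[i] = {"m1": 0.0, "m2": 0.0}
            # (moment/weight updates omitted for brevity)

params = [{"grad": 0.5}, {"grad": None}]   # param 0 active, param 1 frozen
opt = LazyAdamWSketch(params)
opt.step()
```

Note that once created, per-parameter state persists in the native optimizer even if the parameter is later frozen; nothing discards it automatically, which is why LISA-style schemes must handle that themselves.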

research4pan (Contributor) commented:

Thanks for your comments! It is a great question. In our paper, we avoided this risk by running each LISA interval separately, loading and saving the model each time; this made the implementation easier for early-stage experiments.

We haven't examined our current implementation in LMFlow on this point yet, but we have been monitoring memory consumption and it is much lower than full-parameter training, so we conjecture that this part is not a serious problem. If it is, LISA's memory consumption in LMFlow can be further reduced, and that would be great 😄
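The "run each interval separately via loading & saving" scheme mentioned above can be sketched as follows (plain Python with a JSON file standing in for a checkpoint; `run_interval` and the file layout are hypothetical, for illustration only). The key property is that the optimizer is rebuilt with fresh states at the start of every interval, so nothing from the previous interval's m1/m2 survives:

```python
import json, os, tempfile

def run_interval(ckpt_path, active_layer):
    """One LISA interval: load the checkpoint, train with a freshly built
    optimizer, save the checkpoint. Optimizer state never crosses intervals."""
    with open(ckpt_path) as f:
        model = json.load(f)
    # Fresh AdamW-style state for the active layer only.
    optimizer_state = {active_layer: {"m1": 0.0, "m2": 0.0}}
    model["weights"][active_layer] += 0.1      # stand-in for a training step
    with open(ckpt_path, "w") as f:
        json.dump(model, f)
    return optimizer_state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
with open(ckpt, "w") as f:
    json.dump({"weights": [0.0, 0.0]}, f)

state_0 = run_interval(ckpt, active_layer=0)   # interval 1: layer 0
state_1 = run_interval(ckpt, active_layer=1)   # interval 2: state_0 is gone
```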


AlvL1225 commented Apr 1, 2024

Gotcha, thank you very much!

AlvL1225 closed this as completed Apr 1, 2024
eric8607242 commented:

Hi @research4pan,
Thanks for your great work.

I have a question regarding the implementation detail where the optimizer state of frozen layers is discarded in LMFlow.
I've been trying to locate this in the code, but I couldn't find a corresponding implementation in https://github.com/OptimalScale/LMFlow/blob/main/src/lmflow/pipeline/finetuner.py#L301.

Your help in figuring this out would be greatly appreciated.
Thanks for your time!
