
[REQUEST] efficiently deal with frozen weights during training #2615

Closed
stas00 opened this issue Dec 15, 2022 · 10 comments · Fixed by #2653
Labels
enhancement New feature or request

Comments

@stas00
Contributor

stas00 commented Dec 15, 2022

Is your feature request related to a problem? Please describe.

This is a need for HF accelerate and m4 projects.

We have more and more situations where a large part of the model that's being trained is frozen.

At the moment the only way to emulate frozen weights in deepspeed is to create a param group with lr=0. This works, but a huge amount of memory is wasted on allocating optimizer states for parameters that aren't being trained (gradients too?), and the corresponding compute is wasted as well.

Describe the solution you'd like

Ideally, parameters with requires_grad == False shouldn't be included in the optimizer at all and shouldn't allocate any optimizer-state or gradient memory.
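
As a rough sketch of the desired behavior in plain PyTorch (the toy model and learning rate below are just for illustration):

import torch
from torch import nn

# Toy setup: a frozen "backbone" layer plus a small trainable head.
model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 10))
for p in model[0].parameters():
    p.requires_grad = False  # freeze the first layer

# Desired behavior: only trainable parameters go into the optimizer,
# so no optimizer state or gradient memory is allocated for frozen weights.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)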

Thank you!

@tjruwase

@pacman100 - would you like to add/expand anything to this request?

@stas00 stas00 added the enhancement New feature or request label Dec 15, 2022
@pacman100
Contributor

pacman100 commented Dec 20, 2022

The current workaround is to create a param_group containing all the frozen parameters and set its lr=0; an example is shown below:

# Workaround: put the frozen parameters in a separate param_group with lr=0.
if getattr(accelerator.state, "deepspeed_plugin", None) is not None:
    optimizer_grouped_parameters = [
        {
            # parameters that should actually be trained
            "params": [p for p in model.parameters() if p.requires_grad],
            "lr": lr,
        },
        {
            # "frozen" parameters, emulated via a zero learning rate
            "params": [p for p in model.parameters() if not p.requires_grad],
            "lr": 0,
        },
    ]

    # Re-enable grads on every parameter after the grouping, since the
    # frozen weights are emulated via lr=0 rather than requires_grad=False.
    for param in model.parameters():
        param.requires_grad = True
    optimizer = torch.optim.AdamW(optimizer_grouped_parameters)

Below are screenshots of memory usage when using DS ZeRO Stage 3 with CPU offloading (GPU: 65 GB, CPU: 327 GB 🤯) vs. plain PyTorch (GPU: 56 GB, CPU: 3 GB). The model is MT0-XXL (13B params), fine-tuned with LoRA for parameter-efficient fine-tuning.

Plain PyTorch
[screenshot of memory usage]

DS ZeRO-3 with CPU offloading
[screenshot of memory usage]

Request: efficiently deal with frozen weights during training so that large models can be properly offloaded to CPU or sharded across GPUs, with optimizer state stored only for the trainable parameters. For example, with plain PyTorch the mt0-xxl (13B params) model takes up 56 GB on GPU; it would be really helpful if CPU offloading allowed training to work on a 16 GB or 24 GB GPU.
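
To give a sense of scale, here is a rough back-of-the-envelope sketch. It assumes Adam-style fp32 states plus an fp32 master copy (roughly 12 bytes of optimizer state per parameter); the trainable-parameter count is an illustrative LoRA-sized number, and actual usage depends on the optimizer, precision, and ZeRO stage:

# Rough estimate of optimizer-state memory for Adam-style optimizers:
# fp32 master copy + momentum + variance ≈ 12 bytes per parameter.
BYTES_PER_PARAM = 12

total_params = 13e9       # a ~13B-parameter model such as MT0-XXL
trainable_params = 20e6   # an illustrative LoRA-sized set of trainable weights

def optim_state_gb(n_params: float) -> float:
    return n_params * BYTES_PER_PARAM / 1024**3

print(f"state for all parameters:       {optim_state_gb(total_params):7.1f} GB")
print(f"state for trainable parameters: {optim_state_gb(trainable_params):7.1f} GB")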

@tjruwase tjruwase self-assigned this Dec 22, 2022
@tjruwase
Contributor

@pacman100, this is a great demo of the issue. Could you please share the repro steps?

@tjruwase
Contributor

@stas00, @pacman100 please try out the linked PR.

@pacman100
Contributor

pacman100 commented Dec 28, 2022

Hello @tjruwase,

I tried the PR and it works great! Thanks a lot 😄, the DeepSpeed team rocks ✨🚀. DeepSpeed ZeRO-3 with CPU offloading now works as expected: GPU memory usage is down from 65 GB to 22 GB and CPU usage is down from 327 GB to 52 GB (before vs. after the PR). For comparison, plain PyTorch takes 56 GB GPU and 3 GB CPU memory, so the GPU memory savings are 2.54x (PyTorch/DeepSpeed), and training now fits on consumer GPUs with 24 GB of memory.

[screenshot of memory usage after the PR]

@yuvalkirstain

yuvalkirstain commented Jan 16, 2023

@stas00 @pacman100 is this already integrated into accelerate?
If not, can one somehow use this feature while using accelerate (e.g., by pip installing from the linked PR)?

@stas00
Contributor Author

stas00 commented Jan 16, 2023

With this PR, Accelerate doesn't need to be touched. That said, Accelerate could be changed to prepare the param_groups itself, removing the params that require no grads, so that deepspeed wouldn't have to do it. I wonder whether that is really the proper way to go, rather than relying on the scalability frameworks.
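
For reference, a minimal sketch of what such framework-side filtering could look like (a hypothetical helper, not an existing Accelerate API):

def strip_frozen_params(param_groups):
    # Hypothetical helper: drop parameters with requires_grad=False from each
    # param_group before the optimizer is built, so no optimizer state is
    # ever allocated for them.
    filtered = []
    for group in param_groups:
        params = [p for p in group["params"] if p.requires_grad]
        if params:
            filtered.append({**group, "params": params})
    return filtered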

So the fastest way to try it is to install deepspeed from #2653.

@pacman100
Contributor

Hello @yuvalkirstain, yes, you can get this working just by installing DeepSpeed from the corresponding PR.

Hello @stas00, I believe the PR will be really beneficial, as it doesn't require users to make all those changes and behaves like the plain PyTorch case, making their lives simpler.

@stas00
Contributor Author

stas00 commented Jan 16, 2023

Just to clarify: I'm not arguing against merging the DS PR. I was just thinking that perhaps HF Trainer and Accelerate should do this in the first place, before creating the optimizer.

Is there any reason why one would want frozen params in the optimizer?

@tjruwase
Contributor

We plan to merge this PR as we have had internal requests as well.

@mayank31398
Contributor

Awesome catch. I was able to get around this behaviour by doing a separate deepspeed.initialize() for BLOOM and for the prompt vocab used in prompt tuning, but that made the code handling a bit harder.
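
For anyone curious, a rough sketch of that kind of two-engine workaround might look like the following (module names, sizes, and config are illustrative, not taken from the actual code, and this is meant to run under the deepspeed launcher):

import deepspeed
from torch import nn

# Illustrative stand-in for the trainable part (e.g. the prompt embeddings
# used for prompt tuning); the frozen backbone (e.g. BLOOM) would be set up
# separately and not handed to this engine at all.
prompt_embeddings = nn.Embedding(num_embeddings=100, embedding_dim=4096)

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},
}

# Only the trainable module is given to DeepSpeed, so optimizer state is
# allocated just for its parameters.
engine, optimizer, _, _ = deepspeed.initialize(
    model=prompt_embeddings,
    model_parameters=prompt_embeddings.parameters(),
    config=ds_config,
)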

Thanks a lot for this one :)
