
[REQUEST] efficiently deal with frozen weights during training #2615

Closed
stas00 opened this issue Dec 15, 2022 · 10 comments · Fixed by #2653
Labels
enhancement New feature or request

Comments

@stas00
Contributor

stas00 commented Dec 15, 2022

Is your feature request related to a problem? Please describe.

This is a need for HF accelerate and m4 projects.

We have more and more situations where a large part of the model that's being trained is frozen.

At the moment the only way to emulate frozen weights in deepspeed is to create a param group with lr=0. This works, but a huge amount of memory is wasted on allocating optimizer states for parameters that aren't being trained (gradients too?), and the corresponding compute is wasted as well.

Describe the solution you'd like

Ideally, parameters with requires_grad == False shouldn't be included in the optimizer at all and shouldn't allocate any optimizer-state or gradient memory.
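
As a rough sketch of the desired behavior in plain PyTorch (the toy model and learning rate below are just for illustration):

import torch
from torch import nn

# Toy setup: a frozen "backbone" layer plus a small trainable head.
model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 10))
for p in model[0].parameters():
    p.requires_grad = False  # freeze the first layer

# Desired behavior: only trainable parameters go into the optimizer,
# so no optimizer state or gradient memory is allocated for frozen weights.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)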

Thank you!

@tjruwase

@pacman100 - would you like to add/expand anything to this request?

@stas00 stas00 added the enhancement New feature or request label Dec 15, 2022
@pacman100
Contributor

pacman100 commented Dec 20, 2022

The current workaround is to create a param_group containing all the frozen parameters and set its lr=0; an example is shown below:

# Workaround: put the frozen parameters in a separate param_group with lr=0.
if getattr(accelerator.state, "deepspeed_plugin", None) is not None:
    optimizer_grouped_parameters = [
        {
            # parameters that should actually be trained
            "params": [p for p in model.parameters() if p.requires_grad],
            "lr": lr,
        },
        {
            # "frozen" parameters, emulated via a zero learning rate
            "params": [p for p in model.parameters() if not p.requires_grad],
            "lr": 0,
        },
    ]

    # Re-enable grads on every parameter after the grouping, since the
    # frozen weights are emulated via lr=0 rather than requires_grad=False.
    for param in model.parameters():
        param.requires_grad = True
    optimizer = torch.optim.AdamW(optimizer_grouped_parameters)

Below are screenshots of memory usage when using DS ZeRO Stage 3 with CPU offloading (GPU: 65 GB, CPU: 327 GB 🤯) vs. plain PyTorch (GPU: 56 GB, CPU: 3 GB). The model is MT0-XXL (13B params), fine-tuned with LoRA for parameter-efficient fine-tuning.

Plain PyTorch
[screenshot of memory usage]

DS ZeRO-3 with CPU offloading
[screenshot of memory usage]

Request: efficiently deal with frozen weights during training so that large models can be properly offloaded to CPU or sharded across GPUs, with optimizer state stored only for the trainable parameters. For example, with plain PyTorch the mt0-xxl (13B params) model takes up 56 GB on GPU; it would be really helpful if CPU offloading allowed training to work on a 16 GB or 24 GB GPU.
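
To give a sense of scale, here is a rough back-of-the-envelope sketch. It assumes Adam-style fp32 states plus an fp32 master copy (roughly 12 bytes of optimizer state per parameter); the trainable-parameter count is an illustrative LoRA-sized number, and actual usage depends on the optimizer, precision, and ZeRO stage:

# Rough estimate of optimizer-state memory for Adam-style optimizers:
# fp32 master copy + momentum + variance ≈ 12 bytes per parameter.
BYTES_PER_PARAM = 12

total_params = 13e9       # a ~13B-parameter model such as MT0-XXL
trainable_params = 20e6   # an illustrative LoRA-sized set of trainable weights

def optim_state_gb(n_params: float) -> float:
    return n_params * BYTES_PER_PARAM / 1024**3

print(f"state for all parameters:       {optim_state_gb(total_params):7.1f} GB")
print(f"state for trainable parameters: {optim_state_gb(trainable_params):7.1f} GB")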

@tjruwase tjruwase self-assigned this Dec 22, 2022
@tjruwase
Contributor

@pacman100, this is a great demo of the issue. Could you please share the repro steps?

@tjruwase
Contributor

@stas00, @pacman100 please try out the linked PR.

@pacman100
Contributor

pacman100 commented Dec 28, 2022

Hello @tjruwase,

I tried the PR and it works great! Thanks a lot 😄, the DeepSpeed team rocks ✨🚀. DeepSpeed ZeRO-3 with CPU offloading now works as expected: GPU memory usage is down from 65 GB to 22 GB and CPU usage is down from 327 GB to 52 GB (before vs. after the PR). For comparison, plain PyTorch takes 56 GB GPU and 3 GB CPU memory, so the GPU memory savings are 2.54x (PyTorch/DeepSpeed), and training now fits on consumer GPUs with 24 GB of memory.

[screenshot of memory usage after the PR]

@yuvalkirstain

yuvalkirstain commented Jan 16, 2023

@stas00 @pacman100 is this already integrated into accelerate?
If not, can one somehow use this feature while using accelerate (e.g., by pip installing from the linked PR)?

@stas00
Contributor Author

stas00 commented Jan 16, 2023

With this PR, Accelerate doesn't need to be touched. That said, Accelerate could be changed to prepare the param_groups itself, removing the params that require no grads, so that deepspeed wouldn't have to do it. I wonder whether that is really the proper way to go, rather than relying on the scalability frameworks.
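
For reference, a minimal sketch of what such framework-side filtering could look like (a hypothetical helper, not an existing Accelerate API):

def strip_frozen_params(param_groups):
    # Hypothetical helper: drop parameters with requires_grad=False from each
    # param_group before the optimizer is built, so no optimizer state is
    # ever allocated for them.
    filtered = []
    for group in param_groups:
        params = [p for p in group["params"] if p.requires_grad]
        if params:
            filtered.append({**group, "params": params})
    return filtered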

So the fastest way to try it is to install deepspeed from #2653.

@pacman100
Contributor

Hello @yuvalkirstain, yes, you can get this working just by installing DeepSpeed from the corresponding PR.

Hello @stas00, I believe the PR will be really beneficial, as it doesn't require users to make all those changes and behaves like the plain PyTorch case, making their lives simpler.

@stas00
Contributor Author

stas00 commented Jan 16, 2023

Just to clarify: I'm not arguing against merging the DS PR. I was just thinking that perhaps HF Trainer and Accelerate should do this in the first place, before creating the optimizer.

Is there any reason why one would want frozen params in the optimizer?

@tjruwase
Contributor

We plan to merge this PR as we have had internal requests as well.

@mayank31398
Contributor

Awesome catch. I was able to get around this behaviour by doing a separate deepspeed.initialize() for BLOOM and for the prompt vocab used in prompt tuning, but that made the code handling a bit harder.
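
For anyone curious, a rough sketch of that kind of two-engine workaround might look like the following (module names, sizes, and config are illustrative, not taken from the actual code, and this is meant to run under the deepspeed launcher):

import deepspeed
from torch import nn

# Illustrative stand-in for the trainable part (e.g. the prompt embeddings
# used for prompt tuning); the frozen backbone (e.g. BLOOM) would be set up
# separately and not handed to this engine at all.
prompt_embeddings = nn.Embedding(num_embeddings=100, embedding_dim=4096)

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},
}

# Only the trainable module is given to DeepSpeed, so optimizer state is
# allocated just for its parameters.
engine, optimizer, _, _ = deepspeed.initialize(
    model=prompt_embeddings,
    model_parameters=prompt_embeddings.parameters(),
    config=ds_config,
)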

Thanks a lot for this one :)
