FEAT / Optim: Add GaLore optimizer #29588
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
See OpenAccess-AI-Collective/axolotl#1370 (comment) for intermediate results.
OptimizerNames.GALORE_ADAMW_8BIT,
OptimizerNames.GALORE_ADAFACTOR,
]:
from galore_torch import GaLoreAdafactor, GaLoreAdamW, GaLoreAdamW8bit
We need an import check here, no? 😄
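For illustration, one way such a guard could look (a minimal sketch using `importlib`; not necessarily how it ended up in the PR):

```python
# Sketch of an availability check before importing the optional dependency.
# `find_spec` avoids importing the package just to test for its presence.
import importlib.util

if importlib.util.find_spec("galore_torch") is None:
    raise ImportError(
        "You need to install `galore-torch` to use GaLore optimizers: "
        "`pip install galore-torch`"
    )

from galore_torch import GaLoreAdafactor, GaLoreAdamW, GaLoreAdamW8bit
```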
Yeah indeed, many things can be optimized; for now it's a really rough draft, I will focus on polishing everything next!
Great, ping me when you're ready for a full review :)
Thanks! Overall this looks good to me, nice! Excited to see this in transformers :)
Left some suggestions.
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Pretty cool work, thanks for adding this Younes. After this, we should also add this to PEFT ;)
src/transformers/training_args.py
Outdated
@@ -696,6 +699,11 @@ class TrainingArguments:
for instruction fine-tuning. Check out the [original paper](https://arxiv.org/abs/2310.05914) and the
[original code](https://github.com/neelsjain/NEFTune). Support transformers `PreTrainedModel` and also
`PeftModel` from peft.
galore_target_modules (`List[str]`, *optional*):
It would be cool if we could use the same mechanism as for `target_modules` in PEFT:
- allow a str for a regex match
- if a list of str, not only match exact names, but also match if the module name ends with the target module name
- passing `all-linear` to match all linear layers

But I understand that we don't want to copy the whole mechanism to transformers. Maybe we can think of a way to factor this code out in the future, so that we can re-use it in multiple places.
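For illustration, a matching helper along those lines might look like this (a minimal sketch of the rules described above, not the PEFT or transformers implementation; the function name is made up):

```python
import re
from typing import List, Union

import torch.nn as nn


def matches_target(module_name: str, module: nn.Module,
                   target: Union[str, List[str]]) -> bool:
    """Sketch of PEFT-style target-module matching."""
    if target == "all-linear":
        # match every linear layer, regardless of its name
        return isinstance(module, nn.Linear)
    if isinstance(target, str):
        # a single string is treated as a regex over the full module name
        return re.fullmatch(target, module_name) is not None
    # a list of strings matches exact names or name suffixes
    return any(module_name == t or module_name.endswith(f".{t}") for t in target)
```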
This should now be supported!
galore_params = []
for module_name, module in model.named_modules():
    if not isinstance(module, nn.Linear):
I wonder if we should raise an error if the target module name matches but the layer type is not `Linear`. Let's assume that a user matches N linear layers and accidentally 1 other type like `Embedding`; currently the embedding layer would be ignored but the user doesn't get any error message or warning.
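Building on the quoted loop, such a warning could look roughly like this (a sketch; it reuses the hypothetical `matches_target` helper from the earlier snippet and assumes `model` and `optim_target_modules` are already defined):

```python
import logging

import torch.nn as nn

logger = logging.getLogger(__name__)

galore_params = []
for module_name, module in model.named_modules():
    if not matches_target(module_name, module, optim_target_modules):
        continue
    if not isinstance(module, nn.Linear):
        # Matched by name but not a Linear layer: GaLore would silently skip it,
        # so warn instead of ignoring it.
        logger.warning(
            f"{module_name} matched `optim_target_modules` but is a "
            f"{type(module).__name__}, not nn.Linear, so GaLore will not be applied to it."
        )
        continue
    galore_params.append(module.weight)
```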
This sounds like a good idea!
Yes! I think we should emit a warning instead, so that it stays possible to pass simple regexes such as `.*.attn.*` - lmk wdyt!
Thanks for adding!
Just a comment on the training args
Is it possible to integrate per-layer weight updates as described in the GaLore paper? It seems to reduce memory usage significantly, but the original implementation is somewhat complex.
Thank you @younesbelkada for adding GaLore 🔥! Others have added most of the comments that I agree with.
Sure, per-layer weight update is crucial to the current implementation of GaLore (it may not be in the future). Without it, a 7B model would require an additional ~14GB of GPU RAM for the gradients, making full-parameter fine-tuning infeasible on a 24GB GPU. You may refer to our discussion in jiaweizzhao/GaLore#6 for more empirical results.
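For context, per-layer updates are typically wired up with one optimizer per parameter, stepped from gradient hooks, roughly like this (a minimal sketch following the pattern in the GaLore README, assuming PyTorch >= 2.1 for `register_post_accumulate_grad_hook` and a `model` already defined; not the exact code that landed in this PR):

```python
import torch
from galore_torch import GaLoreAdamW

# One optimizer per parameter: each is stepped as soon as that parameter's
# gradient has been accumulated, so gradients never need to be held for the
# whole model at the same time.
optimizer_dict = {}
for p in model.parameters():
    if not p.requires_grad:
        continue
    if p.ndim == 2:  # project only 2D (linear) weights, as in the GaLore paper
        groups = [{"params": [p], "rank": 128, "update_proj_gap": 200, "scale": 0.25}]
    else:
        groups = [{"params": [p]}]
    optimizer_dict[p] = GaLoreAdamW(groups, lr=1e-5)

def optimizer_hook(param: torch.Tensor) -> None:
    # Step and free this parameter's gradient right after it is ready.
    optimizer_dict[param].step()
    optimizer_dict[param].zero_grad()

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)
```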
docs/source/en/trainer.md
Outdated
First make sure to install GaLore official repository:

```bash
pip install git+https://github.com/jiaweizzhao/GaLore
```
GaLore has released an official package: pip install galore-torch
https://github.com/jiaweizzhao/GaLore?tab=readme-ov-file#install-galore-optimizer
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
Thanks for all the work adding this!
Just some small nits. It would be great if you could address the warning for the attention layers before merge
docs/source/en/trainer.md
Outdated
```bash
pip install galore-torch
```

Then simply add one of `["galore_adamw", "galore_adafactor", "galore_adamw_8bit"]` in `optim` together with `optim_target_modules`, which can be a list of strings, regew or full path corresponding to the target module names you want to adapt. Below is an end-to-end example script (make sure to `pip install trl datasets`):
Suggested change:
Then simply add one of `["galore_adamw", "galore_adafactor", "galore_adamw_8bit"]` in `optim` together with `optim_target_modules`, which can be a list of strings, regew or full path corresponding to the target module names you want to adapt. Below is an end-to-end example script (make sure to `pip install trl datasets`):
Then simply add one of `["galore_adamw", "galore_adafactor", "galore_adamw_8bit"]` in `optim` together with `optim_target_modules`, which can be a list of strings, regex or full path corresponding to the target module names you want to adapt. Below is an end-to-end example script (make sure to `pip install trl datasets`):
Argh, sadly it has already been taken care of by 898a3c5
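For reference, an end-to-end script along the lines of what the docs hunk above describes might look roughly like this (a sketch, not the exact example from the docs; the model and dataset are illustrative, and the trl API shown is the one current at the time of this PR):

```python
import datasets
import trl
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# Illustrative model/dataset; swap in whatever you actually want to fine-tune.
model_id = "mistralai/Mistral-7B-v0.1"
train_dataset = datasets.load_dataset("imdb", split="train")

args = TrainingArguments(
    output_dir="./galore-test",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
)

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```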
```python
trainer.train()
```

Note layerwise optimization is a bit experimental and does not support DDP (Distributed Data Parallel), thus you can run the training script only on a single GPU. Please see [this appropriate section](https://github.com/jiaweizzhao/GaLore?tab=readme-ov-file#train-7b-model-with-a-single-gpu-with-24gb-memory) for more details. Other features such as gradient clipping, DeepSpeed, etc. might not be supported out of the box. Please [raise an issue on GitHub](https://github.com/huggingface/transformers/issues) if you encounter such an issue.
nice note :)
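For readers following along: the layer-wise variant mentioned in that note is selected via a `_layerwise` suffix on the optimizer name, roughly like this (a sketch; the exact option string is taken from my reading of the PR, so treat it as an assumption):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./galore-layerwise-test",
    optim="galore_adamw_layerwise",  # layer-wise GaLore AdamW, single-GPU only
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
)
```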
src/transformers/training_args.py
Outdated
@@ -1354,6 +1366,13 @@ class TrainingArguments:
},
)

optim_target_modules: Optional[List[str]] = field(
nit
optim_target_modules: Optional[List[str]] = field(
optim_target_modules: Optional[Union[str, List[str]]] = field(
FYI that suggestion broke the CI for some reason; 73dcabb fixed it. I think `Optional` and `Union` are not compatible somehow.
It progresses to start training now, but I've been getting an odd error and I'm not sure if it's a transformers, axolotl, or galore_torch issue.
So setting
I believe this is the correct way according to transformers/src/transformers/trainer.py Lines 1047 to 1051 in 838b87a
@younesbelkada the bug is in the
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Nice catch @winglian, in 57e7096 I should have properly taken care of everything and I can confirm it works fine. The correct way of passing GaLore args is the following:

```python
args = TrainingArguments(
    tmpdir,
    learning_rate=1e-9,
    logging_steps=5,
    optim="galore_adamw",
    optim_args="rank=64, update_proj_gap=100, scale=0.10",
    optim_target_modules=[r".*attn.*", r".*mlp.*"],
)
```
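For anyone wondering how that `optim_args` string is consumed: it's a comma-separated list of `key=value` pairs that gets split into optimizer kwargs, conceptually along these lines (a sketch of the parsing, not the exact Trainer code; the actual code also casts the values to the right types before handing them to the optimizer):

```python
optim_args = "rank=64, update_proj_gap=100, scale=0.10"

# Split "key=value, key=value" into a dict of kwargs; values stay strings here.
galore_kwargs = {}
for mapping in optim_args.replace(" ", "").split(","):
    key, value = mapping.split("=")
    galore_kwargs[key] = value

print(galore_kwargs)  # {'rank': '64', 'update_proj_gap': '100', 'scale': '0.10'}
```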
Hi, when the official version of GaLore is released, will it be integrated into Transformers?
Is there any way to get this optimizer fully offloaded to the CPU, similar to how DeepSpeed does it with a large 32-bit Adam optimizer? That way the entire system can work together with little interference over the PCIe bus (compared to going into shared memory), optimizing GPU memory to the max and allowing the GPU to handle the highest batch sizes possible while the CPU deals with the optimizer.
If you use ZeRO stage 2/3 (offloading all optimizer state to CPU) as done in DeepSpeed/FSDP, then you will not have any VRAM usage from optimizer state. There's not much point in using GaLore if you're already offloading the optimizer state to the CPU, because GaLore only reduces the optimizer state. The layerwise backprop isn't part of GaLore; it can be done with any optimizer.
Wait, you can do this without having to use Adam 32-bit on CPU? So I could do Adam 8-bit fully offloaded to the CPU? Last time I tried this it didn't work, but I must not have done it the way you are saying, because I'm trying to get the CPU to do the work, not just hold the memory in system RAM (the way paged Adam works). Apologies for this being off-topic. To make up for it, I started testing GaLore with Unsloth/HF trainer (on Windows): I can confirm it works, and I'm getting great results. The loss is what I would normally get, with the added bonus that it's letting me do a batch size of 8 where previously with paged Adam 8-bit I was at a batch size of 4 training a 7B model.
Could you try doing this the way it's referenced in TimDettmers/bitsandbytes#89 (comment)? I'm not sure why it wouldn't work (sorry to go off-topic on a PR with many people and commits on it).
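To put rough numbers on "GaLore only reduces the optimizer state" (a back-of-envelope estimate under simple assumptions: fp32 Adam moments, all parameters sitting in projected 2D weights, and ignoring the small projection matrices):

```python
params = 7e9          # 7B-parameter model
adam_moments = 2      # exp_avg and exp_avg_sq
bytes_per_value = 4   # fp32

full_state = params * adam_moments * bytes_per_value
print(f"full-rank Adam state: ~{full_state / 1e9:.0f} GB")  # ~56 GB

# GaLore keeps the moments in a rank-r subspace of each m x n gradient,
# so for the projected weights the state shrinks roughly by r / m.
rank, hidden_dim = 128, 4096
print(f"rough GaLore state: ~{full_state * rank / hidden_dim / 1e9:.1f} GB")  # ~1.8 GB
```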
* add galore v1
* add import
* add tests and doc
* fix doctest
* forward contrib credits from discussions
* forward contrib credits from discussions
* Apply suggestions from code review
  Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* fix failing tests'
* switch to `optim_target_modules` and clarify docs
* more clarification
* enhance lookup logic
* update a test to add peak memory
* add regex, all-linear and single string support
* add layer-wise optimization through DummyOptimizers and LRSchedulers
* forward contrib credits from discussions and original idea
* add a section about DDP not supported in layerwise
* Update src/transformers/trainer.py
  Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* fix self
* check only if layer_wise
* Update src/transformers/training_args.py
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* oops
* make use of intervals
* clarify comment
* add matching tests
* GaLoRe -> GaLore
* move to `get_scheduler`
* add note on docs
* add a warning
* adapt a bit the docs
* update docstring
* support original API
* Update docs/source/en/trainer.md
* slightly refactor
* Update docs/source/en/trainer.md
  Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
* Update src/transformers/training_args.py
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* fix args parsing and add tests
* remove warning for regex
* fix type hint
* add note about extra args
* make `is_regex` return optional

---------

Co-authored-by: Maxime <maximegmd @users.noreply.github.com>
Co-authored-by: Wing Lian <winglian @users.noreply.github.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: hiyouga <hiyouga@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
What does this PR do?
As per title, adds the GaLore optimizer from https://github.com/jiaweizzhao/GaLore
Fixes: #29512
This is how I am currently testing the API:
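(The original snippet is not preserved in this excerpt; a minimal sketch of such a smoke test, modeled on the `TrainingArguments` example discussed above, might look like the following. The tiny model and dataset names are purely illustrative.)

```python
# Not the author's original snippet; just an illustrative way to exercise the
# new GaLore options end to end on a tiny Llama-style model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_id = "hf-internal-testing/tiny-random-LlamaForCausalLM"  # illustrative tiny model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

dataset = load_dataset("imdb", split="train[:64]")

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=128, padding="max_length")
    out["labels"] = out["input_ids"].copy()  # causal LM: labels mirror the inputs
    return out

dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="./galore-smoke-test",
    max_steps=10,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_args="rank=64, update_proj_gap=100, scale=0.10",
    optim_target_modules=[r".*attn.*", r".*mlp.*"],
)

Trainer(model=model, args=args, train_dataset=dataset).train()
```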
cc @pacman100 @muellerzr