Gradient checkpointing should have no functional impact #26221
Comments
No answer or reaction yet, but not stale either.
Gentle ping @muellerzr @pacman100
@pacman100, @muellerzr
@pacman100, @muellerzr, @younesbelkada. Anything I can do here to help you acknowledge the ticket? If I hear nothing, I will let it auto-close.
Hello @marianokamp, Thank you for your patience. As I don't have a clear minimal reproducer here, I ran the experiments below and see no difference in performance with and without gradient checkpointing.
import argparse
import os

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
    LoraConfig,
    PeftType,
    PrefixTuningConfig,
    PromptEncoderConfig,
)

import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
from tqdm import tqdm

+ set_seed(100)

# model_name_or_path and peft_config are defined earlier in the full script (not shown in this excerpt)
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, return_dict=True)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
model  # display the model (notebook-style cell output)

+ model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
Observations: No performance gap between runs with gradient checkpointing and without gradient checkpointing.
Thanks @pacman100. I got it now: a minimal example is needed. I will try to create one over the weekend.
@pacman100. Hi Sourab, thanks for investing the time! You didn't say otherwise, so I take it as confirmed that gradient checkpointing should have no functional impact on the model, correct?
I now have a minimal sample notebook that shows the issue.
Background: The original code is from an article that illustrates, for educational purposes, what a simple LoRA implementation looks like. It is plain Python code and worked fine until I tried gradient checkpointing for the second article. I am not aware of specific expectations that the transformers lib has on user code, but there are two things my example does that may be worth pointing out as not entirely middle-of-the-road: (a) freezing modules and (b) overwriting the forward function of the module to be adapted, so that it points to the adapter implementation in the forward pass. Both work fine without gradient checkpointing, but maybe they are problematic with gradient checkpointing? The code is in the example I linked above, but for easier consumption I reproduce the method here:

# imports added here for completeness; the notebook defines these earlier
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def adapt_model(model):
    class MinimalLoRAAdapter(nn.Module):
        def __init__(self, adaptee):
            super().__init__()
            self.adaptee = adaptee
            self.orig_forward = adaptee.forward
            adaptee.forward = self.forward  # <-----------------
            r = 1
            adaptee.lora_A = nn.Parameter(
                torch.randn(adaptee.in_features, r) / math.sqrt(adaptee.in_features)
            )
            adaptee.lora_B = nn.Parameter(torch.zeros(r, adaptee.out_features))

        def forward(self, x, *args, **kwargs):
            return (
                self.orig_forward(x, *args, **kwargs)  # <-----------------
                + F.dropout(x, 0.1) @ self.adaptee.lora_A @ self.adaptee.lora_B
            )

    # freeze all layers, incl. embeddings, except for the classifier
    for m in model.roberta.modules():
        m.requires_grad_(False)  # <-----------------

    # adapt the linear modules in the transformer layers
    for m in model.roberta.encoder.modules():
        if isinstance(m, nn.Linear):
            MinimalLoRAAdapter(m)

Here is an excerpt from the output; the full output is in the linked notebook (check eval_accuracy):
I tried the above with both GPU and CPU and observe the same behavior. Hope that helps to narrow it down.
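To make the suspected interaction concrete, here is a minimal, self-contained sketch (an illustration, not code from the notebook or the articles) of how PyTorch's reentrant checkpointing behaves when every tensor input to the checkpointed segment is frozen, as happens with frozen embeddings feeding a frozen-plus-adapter layer:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

frozen = nn.Linear(4, 4).requires_grad_(False)  # stands in for a frozen base-model layer
lora_A = nn.Parameter(torch.randn(4, 1) / 2.0)  # trainable adapter factors
lora_B = nn.Parameter(torch.zeros(1, 4))
head = nn.Linear(4, 2)                          # stands in for the trainable classifier head

def block(x):
    return frozen(x) + x @ lora_A @ lora_B

x = torch.randn(2, 4)  # activations from frozen embeddings: requires_grad=False

# Reentrant checkpointing warns "None of the inputs have requires_grad=True..."
# and the adapter inside the checkpointed segment receives no gradient.
out = checkpoint(block, x, use_reentrant=True)
head(out).sum().backward()
print(lora_A.grad)  # None

# The non-reentrant variant builds the graph normally; gradients reach the adapter.
out = checkpoint(block, x, use_reentrant=False)
head(out).sum().backward()
print(lora_A.grad is not None)  # True

If this is what happens inside the model here, the adapter parameters get no gradients under the reentrant strategy and only the unfrozen head trains, which would show up exactly as a change in eval accuracy.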
Gentle ping @pacman100
Hello @marianokamp, Thank you for the minimal reproducer via the notebook. I ran it using the latest versions with the below changes:

+ gradient_checkpointing_kwargs = None
  if cp_enabled:
-     model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
+     gradient_checkpointing_kwargs = {"use_reentrant": False}
  training_args = TrainingArguments(
      gradient_checkpointing=cp_enabled,
+     gradient_checkpointing_kwargs=gradient_checkpointing_kwargs,
      ...

The issue you are facing with gradient checkpointing with LoRA is as follows:

Output with the above changes:

Code:
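Spelled out as a plain snippet, the diff above amounts to letting the Trainer own gradient checkpointing instead of calling gradient_checkpointing_enable yourself (a sketch, assuming a transformers release whose TrainingArguments supports gradient_checkpointing_kwargs; cp_enabled mirrors the notebook's flag and output_dir is a placeholder):

from transformers import TrainingArguments

cp_enabled = True  # the notebook's flag: toggles checkpointing per run

# Let the Trainer enable checkpointing, rather than calling
# model.gradient_checkpointing_enable() manually.
gradient_checkpointing_kwargs = None
if cp_enabled:
    gradient_checkpointing_kwargs = {"use_reentrant": False}

training_args = TrainingArguments(
    output_dir="out",  # placeholder
    gradient_checkpointing=cp_enabled,
    gradient_checkpointing_kwargs=gradient_checkpointing_kwargs,
)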
@pacman100, thanks for your help and for walking me through the solution in detail. I am still a bit confused by the API, but I understand the steps you showed me, and following them fixed my issue in my original, non-minimal code. All clear for me now. Much appreciated, Sourab!
System Info
Latest release and Python 3.10.
accelerate-0.21.0 aiohttp-3.8.5 aiosignal-1.3.1 async-timeout-4.0.3 bitsandbytes-0.41.0 datasets-2.14.5 evaluate-0.4.0 frozenlist-1.4.0 huggingface-hub-0.17.1 multidict-6.0.4 peft-0.4.0 pynvml-11.5.0 regex-2023.8.8 responses-0.18.0 safetensors-0.3.3 sagemaker-inference-1.10.0 tensorboardX-2.6.2.2 tokenizers-0.13.3 transformers-4.33.2 xxhash-3.3.0 yarl-1.9.2
Who can help?
@pacman100, @muellerzr
Reproduction
Hi @pacman100, @muellerzr.
I was wondering about the memory use of LoRA. Specifically, what happens if I adapt modules that are higher in the network, closer to the head (top), versus lower, closer to the embeddings (bottom)?
Given that the number of parameters to train remains the same in both cases, the memory usage should be the same, except that to calculate the gradients for (bottom) we would need to keep more activations around from the forward pass. If that were the case, then turning on gradient checkpointing should make (top) and (bottom) use the same amount of memory, since we discard the activations and recalculate them during the backward pass. That is correct, no, @younesbelkada?
Trying this out, I see the expected behavior. However, the accuracy also changed.
My understanding is that with gradient checkpointing we would need less memory and more time, but the functional aspects, here model performance, should be unchanged. Hence the issue.
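For reference, that expectation can be stated as a tiny, self-contained check (my illustration, not code from the training setup): checkpointing should change memory and compute, but not outputs or gradients.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 8))
x = torch.randn(4, 8, requires_grad=True)

# Plain forward/backward: all activations kept.
out_plain = block(x)
out_plain.sum().backward()
grad_plain = x.grad.clone()

# Checkpointed forward/backward: activations discarded and recomputed.
x.grad = None
out_cp = checkpoint(block, x, use_reentrant=False)
out_cp.sum().backward()

print(torch.allclose(out_plain, out_cp))   # True: identical outputs
print(torch.allclose(grad_plain, x.grad))  # True: identical gradients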
Details
Below you can see, on the x-axis, at which layer of a 12-layer RoBERTa Base the adapters were applied. As you can see, memory use for (bottom: lower layer numbers, closer to the embeddings) is higher than for (top: higher layer numbers, closer to the head) when not using gradient checkpointing, and the two are the same when using gradient checkpointing.
However, looking at model performance, we see a difference of 0.1 between using and not using checkpointing.
Not that it matters, but this uses the glue/sst-2 dataset. I am not changing anything except passing 0 or 1 to the Trainer's gradient_checkpointing argument (and 0 or 1 to a helper that empties the CUDA cache every 30 seconds).
Expected behavior
No functional change when using gradient_checkpointing.