
LoRA finetuning gradients are scaled by an unknown constant factor #1893

Closed
2 of 4 tasks
goliaro opened this issue Jun 28, 2024 · 2 comments

Comments


goliaro commented Jun 28, 2024

System Info

torch: 2.3.0+cu121
transformers: 4.41.2
peft: 0.11.1
datasets: 2.20.0

Who can help?

@BenjaminBossan @sayakpaul

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

You can run the following Colab notebook: https://colab.research.google.com/drive/1lgFyKZaZ3ySXWRcfImsry92X7dhrVgZz?usp=sharing

There are two sections in the linked Colab notebook.

  • "Run finetuning" contains the code to fine-tune for two steps and save the weights and gradients to a file.
  • "Check optimizer" loads the saved weights/gradients from file and compares the updated weights with the expected values, printing the constant mismatch factor when there is one.

Expected behavior

I'm trying to integrate the peft library into our framework, but I am running into unexplained behavior when performing LoRA finetuning. I've noticed that an unidentified constant factor scales the gradients before they are used to update the weights at each optimization step.

For example, when using the SGD optimizer with parameters {lr: 1.0, maximize: False, momentum: 0, nesterov: False, weight_decay: 0.0} and a constant learning rate scheduler, you would expect the weights to be updated as follows at each step:

updated_weight = original_weight - lr * weight_gradient

However, the weights are instead updated as follows (note the constant factor c):

updated_weight = original_weight - lr * c * weight_gradient

Where does c come from, and what is its formula? With rank=lora_alpha=16, I'd expect a scaling of $16/16=1.0$. I have already looked through the code and printed any scaling constants, such as this one: https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L122, which is always 1.0 as expected. I have also checked that the learning rate at each optimizer step is 1.0, as I've set it.
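
For reference, here is one way to read those scaling constants off the wrapped model (a minimal sketch, not the exact code from my notebook; it assumes peft's LoraLayer class and the default adapter name):

from peft.tuners.lora import LoraLayer

# Print the scaling factor (lora_alpha / r) stored on each injected LoRA layer.
# With r = lora_alpha = 16 this prints 1.0 for every layer, as noted above.
for name, module in model.named_modules():
    if isinstance(module, LoraLayer):
        print(name, module.scaling.get("default"))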

@BenjaminBossan
Member

There is really a lot going on in your notebook, so it's hard for me to tell where this constant is coming from. Therefore, I created the simplest possible version of this problem, and it revealed that for SGD with lr=1, the parameter update is exactly equal to the gradient:

import copy

import torch
from torch import nn
from peft import get_peft_model, LoraConfig


class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(10, 5)

    def forward(self, x):
        return self.lin(x)

torch.manual_seed(0)
x = torch.randn(8, 10)

# without LoRA
torch.manual_seed(0)
model = MyModule()

sd = copy.deepcopy(model).state_dict()
sgd = torch.optim.SGD(model.parameters(), lr=1, momentum=0, maximize=False, nesterov=False, weight_decay=0)

# train
sgd.zero_grad()
out = model(x)
loss = out.sum()
loss.backward()
sgd.step()

# compare
sd2 = model.state_dict()
grad = model.lin.weight.grad
torch.testing.assert_close(sd['lin.weight'] - sd2['lin.weight'], grad)

# with LoRA
torch.manual_seed(0)
model = MyModule()
config = LoraConfig(target_modules=["lin"], init_lora_weights=False)
model = get_peft_model(model, config)

sd = copy.deepcopy(model).state_dict()
sgd = torch.optim.SGD(model.parameters(), lr=1, momentum=0, maximize=False, nesterov=False, weight_decay=0)

# train
sgd.zero_grad()
out = model(x)
loss = out.sum()
loss.backward()
sgd.step()

# compare
sd2 = model.state_dict()
assert model.base_model.lin.base_layer.weight.grad is None

grad = model.base_model.lin.lora_A["default"].weight.grad
torch.testing.assert_close(sd['base_model.model.lin.lora_A.default.weight'] - sd2['base_model.model.lin.lora_A.default.weight'], grad)

grad = model.base_model.lin.lora_B["default"].weight.grad
torch.testing.assert_close(sd['base_model.model.lin.lora_B.default.weight'] - sd2['base_model.model.lin.lora_B.default.weight'], grad)

This doesn't solve your issue, but it shows that the discrepancy is unlikely to come from the LoRA implementation. To investigate further, here are some ideas:

  • apply your script to a much simpler model
  • don't use Trainer but a simple training loop
  • apply the same analysis but without using LoRA (full fine-tuning)


goliaro commented Jun 29, 2024

Thanks for your input! I was able to find the cause of the bug. I was registering a parameter hook as below, but for some reason, the gradient observed by such a hook is different from the one in the state dictionary. All checks pass when using the state dictionary!

# save_gradient_hook(module) returns my callback that records the gradient seen by the hook
for name, module in model.named_modules():
    for param_name, param in module.named_parameters(recurse=False):
        if param.requires_grad:
            param.register_hook(save_gradient_hook(module))
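
In case it helps others: a hook-free way to record the gradients is to copy param.grad right before optimizer.step(), so the saved values are exactly what the optimizer uses. A minimal sketch with a plain training loop (model, x, and optimizer are illustrative placeholders):

saved_grads = {}

# Forward/backward as usual, then snapshot .grad before the optimizer consumes it.
out = model(x)
loss = out.sum()
loss.backward()
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is not None:
        saved_grads[name] = param.grad.detach().clone()
optimizer.step()
optimizer.zero_grad()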

goliaro closed this as completed Jun 29, 2024