Fix resuming PeftModel checkpoints in Trainer #24274
Conversation
Fixes an error that occurred when resuming a PeftModel from a training checkpoint. The error was caused because `PeftModel.save_pretrained` saves only adapter-related data, while `_load_from_checkpoint` was expecting a saved torch model. This PR fixes the issue and allows the adapter checkpoint to be loaded. Resolves: huggingface#24252
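For context, a rough sketch of the kind of branch this adds inside `_load_from_checkpoint` (illustrative only, not the exact diff; the variable and helper names are taken from the review excerpts quoted below):

```python
# Sketch of a fragment from inside Trainer._load_from_checkpoint
# (in the real code this is an elif branch): if the model is a PeftModel,
# load the adapter weights instead of expecting a full torch checkpoint
# (pytorch_model.bin) in the folder.
if is_peft_available() and isinstance(model, PeftModel):
    if hasattr(model, "active_adapter") and hasattr(model, "load_adapter"):
        if os.path.exists(os.path.join(resume_from_checkpoint, ADAPTER_WEIGHTS_NAME)):
            model.load_adapter(resume_from_checkpoint, model.active_adapter)
```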
Thanks for your great work
Managed to reproduce the issue, and your proposed fix resolves it.
Handy snippet:
```python
import os

from transformers import TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")

output_dir = "test"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=5,
    save_steps=1,
    save_strategy="steps",
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    peft_config=peft_config,
)

trainer.save_model(os.path.join(output_dir, "checkpoint-1"))
trainer.train(resume_from_checkpoint=True)
```
Thanks a lot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this, just have one comment.
src/transformers/trainer.py
Outdated
```python
# Load_adapter has no return value present, modify it when appropriate.
from torch.nn.modules.module import _IncompatibleKeys

load_result = _IncompatibleKeys([], [])
```
Maybe just set it to `None` and adapt `_issue_warnings_after_load` to return early if the `load_result` is `None`, because this is a bit long for nothing.
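A minimal sketch of that suggestion (assuming `_issue_warnings_after_load` keeps its current signature):

```python
class Trainer:
    def _issue_warnings_after_load(self, load_result):
        # Suggested early return: adapter-only checkpoints would pass
        # load_result=None, and there are no missing/unexpected keys to report.
        if load_result is None:
            return
        ...  # existing warning logic for missing/unexpected keys
```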
Or maybe completely remove that line, since `load_result` is not used at all after: https://github.com/llohann-speranca/transformers/blob/dae3c92d48ad8941bb502baefc06d6284498714b/src/transformers/trainer.py#L2079
Although this continues training, it doesn't retain the old state for me. Can someone look into this?
This doesn't seem to work like the older resume-from-checkpoint flow that exists for PyTorch models. Inside `trainer.train` we need to pass the `resume_from_checkpoint` parameter as our last checkpoint path, but when passing the path it says it can't find a valid checkpoint. Could someone please post a code snippet showing how to use resume-from-checkpoint for PEFT models?
@pacman100 Requesting a review and merge of the changes, thanks!
As shared in the snippet above, to make …
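(A minimal sketch of the resume flow, reusing the `SFTTrainer` setup from the snippet above:)

```python
# Resume from the latest checkpoint under training_args.output_dir:
trainer.train(resume_from_checkpoint=True)

# Or point at a specific checkpoint folder (path is illustrative):
trainer.train(resume_from_checkpoint=os.path.join(output_dir, "checkpoint-1"))
```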
Hi @llohann-speranca
Again, thanks for your great work on this. This seems like a rather important fix that might unblock a lot of users. If that's OK with you, I can quickly take over the PR and address the last comment so that we can merge it. What do you think?
Hi @younesbelkada, can you look at my issue with the code and please address it?
Yes @adityaaryan77, sure, please have a look at my comment on the PEFT issue and discuss there.
Thank you for your response. Please look at the code snippet below:

```python
# -*- coding: utf-8 -*-
# Untitled345.ipynb, automatically generated by Colaboratory.

# !pip install datasets transformers peft evaluate
# !git clone https://github.com/llohann-speranca/transformers.git -b fix-resume-checkpoint-for-peftmodel
# !cp -r /content/transformers /usr/local/lib/python3.10/dist-packages/transformers

import transformers

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2

args = TrainingArguments(
    ...
)

class SavePeftModelCallback(TrainerCallback):
    ...

def compute_metrics(eval_pred):
    ...

trainer.save_model("/content/bert-large-uncased-finetuned1-cola/checkpoint-1")
trainer.train(resume_from_checkpoint='/content/bert-large-uncased-finetuned1-cola/checkpoint-1070/adapter_model')
```

Inside `resume_from_checkpoint` I have tried several options. Everywhere I'm getting the same message: "Can't find a valid checkpoint at …".
Hi @younesbelkada. Sure! I have been very busy and still have to learn how to deal with PRs. Sorry about that.
@llohann-speranca thanks!
Thanks a lot for your response. If you don't mind, could you please try executing any of the Hugging Face example code with PEFT inserted into the Trainer and resume-from-checkpoint? Then you might be able to replicate.
@techthiyanes I think it works as expected with this PR, as explained in #24274 (review). I have tried the attached snippet, which was not working before the PR, and this PR properly fixes it by loading the checkpoint. If you want, you can try to replicate using a smaller example (for example imdb, as attached) and let me know if you still face an issue by opening a new ticket.
Sure, thanks a lot. Let me try the above snippet with classification models, then let you know.
Still able to replicate the issue. I've raised a separate issue about it.
Thanks for fixing!
Just some small comments; I'm not super familiar with Trainer or PEFT. However, as @sgugger has already reviewed with just a small (resolved) comment, I think this should be OK to merge once the others have been addressed.
src/transformers/trainer.py
Outdated
```python
elif is_peft_available() and isinstance(model, PeftModel):
    # If train a model using PEFT & LoRA, assume that adapter have been saved properly.
    if hasattr(model, "active_adapter") and hasattr(model, "load_adapter"):
        if os.path.exists(resume_from_checkpoint) or os.path.exists(resume_from_checkpoint):
```
In this `if x or y` check, `x` and `y` are the same.
Nice catch!
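A sketch of what the corrected check would presumably look like, using the two adapter paths the diff defines a little further down:

```python
# Sketch: probe both the pickle and the safetensors adapter files.
if os.path.exists(adapter_weights_file) or os.path.exists(adapter_safe_weights_file):
    ...
```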
src/transformers/trainer.py
Outdated
"The intermediate checkpoints of PEFT may not be saved correctly, " | ||
f"using `TrainerCallback` to save {ADAPTER_WEIGHTS_NAME} in corresponding folders, " | ||
"here are some examples https://github.com/huggingface/peft/issues/96" |
I'm not able to parse this warning sentence. Is it saying I should use `TrainerCallback` to resolve this issue, that using `TrainerCallback` caused this issue, or that `TrainerCallback` will be used as a result of this warning?
Yeah, agreed. As this was a copy-paste, I have modified the original version of the warning as well.
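Something along these lines for the reworded warning (a sketch; the final wording in the PR may differ):

```python
logger.warning(
    "The intermediate checkpoints of PEFT may not be saved correctly, "
    f"consider using a custom callback to save {ADAPTER_WEIGHTS_NAME} in "
    "corresponding saving folders. Here are some examples: "
    "https://github.com/huggingface/peft/issues/96"
)
```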
Thanks for iterating!
```python
adapter_weights_file = os.path.join(resume_from_checkpoint, ADAPTER_WEIGHTS_NAME)
adapter_safe_weights_file = os.path.join(resume_from_checkpoint, ADAPTER_SAFE_WEIGHTS_NAME)
```
Just for my own understanding, why don't we need the equivalent `adapter_weights_index_file` or `adapter_safe_weights_index_file` here?
Sure! I think index files are used for sharded checkpoints only. In PEFT the saved checkpoints are always extremely light (on the order of a few MBs), even for very large models, so we never save sharded checkpoints and therefore never save index files!
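In other words, a small self-contained illustration (the constant values match what transformers uses for adapter files, but treat the helper itself as hypothetical):

```python
import os

ADAPTER_WEIGHTS_NAME = "adapter_model.bin"
ADAPTER_SAFE_WEIGHTS_NAME = "adapter_model.safetensors"

def has_adapter_checkpoint(folder: str) -> bool:
    # Adapters are tiny and never sharded, so there are no *.index.json
    # files to look for, only these two candidate weight files.
    return any(
        os.path.exists(os.path.join(folder, name))
        for name in (ADAPTER_WEIGHTS_NAME, ADAPTER_SAFE_WEIGHTS_NAME)
    )
```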
Thanks for explaining!
Hi, nice job! Is this new feature available if I …
Thank you so much @llohann-speranca and @younesbelkada for adding this 🤗! @beyondguo, please install from source as this isn't yet part of the release.
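(If it helps, installing from source in a notebook usually looks like the cell below; the exact command is an assumption on my part, adjust as needed.)

```python
# Colab/Jupyter cell; drop the leading "!" in a plain shell.
!pip install git+https://github.com/huggingface/transformers.git
```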
Thanks for iterating! Will we perform inference in the same manner? Specifically, …
I am trying to resume in this way:

```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM

config = PeftConfig.from_pretrained("huggingface_path_TO_MY_PREVIOUSLY_TRAINED_LORA")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")  # underlying model
model.enable_input_require_grads()  # to make training possible
model = PeftModel.from_pretrained(model, "huggingface_path_TO_MY_PREVIOUSLY_TRAINED_LORA")
model.print_trainable_parameters()
# trainable params: 0 || all params: 792,587,264 || trainable%: 0.0
```

Then the usual code:

```python
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100

# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
)

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir = "./t5-large-r32-lora-JOIN-FIX-RESUME"
# batch_size = 8

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    save_strategy="steps",
    save_steps=500,
    gradient_accumulation_steps=4,
    learning_rate=1e-3,  # higher learning rate
    weight_decay=0.01,
    num_train_epochs=2,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=100,
    push_to_hub=True,
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=TRAINING,
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
```
@younesbelkada, will you please confirm whether what I am doing is right?
I've encountered the same issue. Have you resolved it? @AayushSameerShah
@XM-Dong Nah... it seems like LoRA needs some "special script" :(
Hi, I'm also wondering how I can get these changes. Are they in a new transformers version or PEFT version? My current versions are:
What does this PR do?
Fixes an error that occurred when the Trainer tries to resume a PeftModel from a training checkpoint. The error was caused because `PeftModel.save_pretrained` saves only adapter-related data, while `_load_from_checkpoint` expects a saved torch model. This PR fixes the issue and allows the adapter checkpoint to be loaded.
Fixes #24252
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. (Peft Model not resuming from Checkpoint #24252)
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@younesbelkada