PEFT Models are not resuming from checkpoint as expected. #24354
Hi @techthiyanes

Without `resume_from_checkpoint`:

```python
import os

from transformers import TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")

output_dir = "test"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=5,
    save_steps=1,
    save_strategy="steps",
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    peft_config=peft_config,
)

trainer.train()
trainer.save_model(os.path.join(output_dir, "checkpoint-1"))
trainer.train()
```

With `resume_from_checkpoint`:

```python
import os

from transformers import TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")

output_dir = "test"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=5,
    save_steps=1,
    save_strategy="steps",
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    peft_config=peft_config,
)

trainer.train()
trainer.save_model(os.path.join(output_dir, "checkpoint-1"))
trainer.train(resume_from_checkpoint=True)
```

Can you elaborate more on:

Thanks!
So far I'm able to replicate the issue. Steps I have followed:

1. Install the libraries.
2. Clone the PEFT resume-from-checkpoint branch.
3. Replace the folder where the transformers library is installed.
4. Restart the runtime.
5. Run the code snippet above.

@younesbelkada @llohann-speranca I guess you would have run the snippet against the modified trainer code that resides internally. Could you please try running the code downloaded from git on that specific branch? Thanks a lot for your effort in validating this.
Hi @techthiyanes

You can install transformers from that branch with:

```
pip install git+https://github.com/llohann-speranca/transformers.git@fix-resume-checkpoint-for-peftmodel
```

Line 1991 of your traceback doesn't match line 1991 of the fork (https://github.com/llohann-speranca/transformers/blob/e01a4aa77073b847b9451c92c2df718a67960df1/src/transformers/trainer.py#L1991), so I believe you did not install transformers correctly from that branch.
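One quick way to confirm which copy of a library is actually being imported, and whether a traceback line number matches the installed source, is a sketch like the following. The helper name `describe_install` is ours for illustration, not a transformers API:

```python
# Sketch: report which file a module is imported from, and show the
# source lines around a given line number (e.g. line 1991 of a
# traceback). The helper name is illustrative, not part of transformers.
import importlib


def describe_install(module_name: str, line_no: int, context: int = 2):
    """Return the module's file path and the source lines around line_no."""
    mod = importlib.import_module(module_name)
    path = mod.__file__
    with open(path) as f:
        lines = f.readlines()
    # Convert the 1-indexed line number to a 0-indexed slice with context.
    start = max(line_no - 1 - context, 0)
    return path, lines[start : line_no + context]
```

For this issue one would call, e.g., `describe_install("transformers.trainer", 1991)` and compare the returned path and lines against the fork.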
Thanks a lot for finding and fixing this issue.
System Info
`transformers`: 4.30
Who can help?
@llohann-speranca @younesbelkada
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Please try the below code snippet as an example:

For the above code snippet I have pulled @llohann-speranca's resume-from-checkpoint repo and replaced the installed transformers repo with it.

The initial `trainer.train()` call works without any issues.

As mentioned, I have overridden the model by using `trainer.save_model(<path of saved model>)`.

For resuming from checkpoint, I have set the number of epochs much higher than before.

When calling `trainer.train(resume_from_checkpoint=True)`, it reports that it can't find a valid checkpoint.

Likewise, when calling `trainer.train(resume_from_checkpoint=<path of saved model>)`, it reports that it can't find a valid checkpoint.

The same issue persists with the source-installed version of transformers as well.
Expected behavior
Training should resume from the saved checkpoint.