PEFT Models are not resuming from checkpoint as expected. #24354

Closed

techthiyanes opened this issue Jun 19, 2023 · 4 comments

techthiyanes commented Jun 19, 2023

System Info

transformers: 4.30

Who can help?

@llohann-speranca @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Please try the code snippet below:

import os
from transformers import TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")
output_dir = "test"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=5,
    save_steps=1,
    save_strategy='steps'
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    peft_config=peft_config
)
trainer.train()
trainer.save_model(os.path.join(output_dir, "checkpoint-1"))
trainer.train()

For the above code snippet, I pulled @llohann-speranca's resume-from-checkpoint branch and replaced the installed transformers package with it.

The initial trainer.train() call works without any issues.
As mentioned, I then saved the model with trainer.save_model(path of saved model).

To resume from checkpoint, I increased the number of epochs well beyond the previous run.
When passing trainer.train(resume_from_checkpoint=True), it reports that it can't find a valid checkpoint.
When passing trainer.train(resume_from_checkpoint=path of saved model), it also reports that it can't find a valid checkpoint.

The same issue persists with transformers installed from source as well.
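As a quick way to see what the Trainer would treat as a valid checkpoint here, a minimal sketch (assuming the same output_dir = "test" as in the snippet above) using get_last_checkpoint, the helper the Trainer consults when resume_from_checkpoint=True:

```python
# Illustrative check, not part of the original report: list the output directory
# and ask get_last_checkpoint what it finds. A return value of None means no
# checkpoint-* folder was detected, which leads to the "can't find a valid
# checkpoint" error.
import os
from transformers.trainer_utils import get_last_checkpoint

output_dir = "test"  # same directory as in the snippet above
print(os.listdir(output_dir))           # expected: checkpoint-1 ... checkpoint-5
print(get_last_checkpoint(output_dir))  # None => Trainer sees no valid checkpoint
```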

Expected behavior

The model should be resumed from checkpoint.


younesbelkada commented Jun 19, 2023

Hi @techthiyanes
Thank you very much for double checking. Here are the snippets I ran, and they work fine on my end using the branch you mentioned:

With `resume_from_checkpoint`
import os
from transformers import TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")
output_dir = "test"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=5,
    save_steps=1,
    save_strategy='steps'
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    peft_config=peft_config
)
trainer.train()
trainer.save_model(os.path.join(output_dir, "checkpoint-1"))
trainer.train(resume_from_checkpoint=True)
Without `resume_from_checkpoint`
import os
from transformers import TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")
output_dir = "test"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=5,
    save_steps=1,
    save_strategy='steps'
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    peft_config=peft_config
)
trainer.train()
trainer.save_model(os.path.join(output_dir, "checkpoint-1"))
trainer.train()

Can you elaborate more on:

To resume from checkpoint, I increased the number of epochs well beyond the previous run.
When passing trainer.train(resume_from_checkpoint=True), it reports that it can't find a valid checkpoint.
When passing trainer.train(resume_from_checkpoint=path of saved model), it also reports that it can't find a valid checkpoint.

Thanks!

techthiyanes (Author) commented:

```python
trainer.train(resume_from_checkpoint=True)
```

I'm still able to reproduce the issue.

Steps I have followed:

Libraries installed:
!pip install datasets peft evaluate
!pip install git+https://github.com/huggingface/transformers

Clone the PEFT resume-from-checkpoint branch:
!git clone https://github.com/llohann-speranca/transformers.git -b fix-resume-checkpoint-for-peftmodel

Replace the folder where the transformers library is installed:
!cp -r /content/transformers /usr/local/lib/python3.10/dist-packages/transformers

Restart the runtime.
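Before rerunning, a minimal sanity check (my illustration, assuming the copy-over-site-packages approach above) to confirm which transformers installation Python actually imports:

```python
# Illustrative check: if __file__ still points at the stock pip install rather
# than the copied fork, the branch's fix is not actually being used.
import transformers

print(transformers.__version__)
print(transformers.__file__)
```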

Then run the code snippet below:

import os
from transformers import TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")
output_dir = "test"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=5,
    save_steps=1,
    save_strategy='steps'
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    peft_config=peft_config
)
trainer.train()
trainer.save_model(os.path.join(output_dir, "checkpoint-1"))
trainer.train(resume_from_checkpoint=True)

[screenshot of the traceback]

@younesbelkada @llohann-speranca

I guess you ran the snippet against a modified trainer that was already installed locally.

Could you please try running the code with transformers downloaded from git on that specific branch?

Thanks a lot for your effort in validating this.

younesbelkada (Contributor) commented:

Hi @techthiyanes
Can you try installing transformers with the following command?

pip install git+https://github.com/llohann-speranca/transformers.git@fix-resume-checkpoint-for-peftmodel

Line 1991 of your traceback doesn't match line 1991 of the fork: https://github.com/llohann-speranca/transformers/blob/e01a4aa77073b847b9451c92c2df718a67960df1/src/transformers/trainer.py#L1991, so I believe you did not install transformers from that branch correctly.
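As a practical note (my addition, not from the original comment), in a Colab-style environment it may help to remove the previously copied package first so the fork is installed cleanly, using the same branch URL quoted above:

!pip uninstall -y transformers
!pip install git+https://github.com/llohann-speranca/transformers.git@fix-resume-checkpoint-for-peftmodel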

techthiyanes (Author) commented:

pip install git+https://github.com/llohann-speranca/transformers.git@fix-resume-checkpoint-for-peftmodel

Thanks a lot for finding and fixing this issue.
Now I am able to resume from checkpoint. It works for classification and seq2seq models as well.
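For completeness, a minimal usage sketch (continuing from the reproduction snippet above; the checkpoint-5 directory name assumes the max_steps=5, save_steps=1 setup) of resuming from an explicit checkpoint path:

```python
# Illustrative only: `trainer` and `output_dir` are the objects defined in the
# reproduction snippet above; checkpoint-5 is the last step saved with
# max_steps=5 and save_steps=1.
import os

trainer.train(resume_from_checkpoint=os.path.join(output_dir, "checkpoint-5"))
```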
