PEFT Models are not resuming from checkpoint as expected. #24354
Hi @techthiyanes

Without `resume_from_checkpoint`:

```python
import os

from transformers import TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")

output_dir = "test"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=5,
    save_steps=1,
    save_strategy="steps",
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    peft_config=peft_config,
)

trainer.train()
trainer.save_model(os.path.join(output_dir, "checkpoint-1"))
trainer.train()
```

With `resume_from_checkpoint`:

```python
import os

from transformers import TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")

output_dir = "test"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=5,
    save_steps=1,
    save_strategy="steps",
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    peft_config=peft_config,
)

trainer.train()
trainer.save_model(os.path.join(output_dir, "checkpoint-1"))
trainer.train(resume_from_checkpoint=True)
```

Can you elaborate more on:

Thanks!
So far I'm able to replicate the issue. Steps I have followed:

1. Install the libraries.
2. Clone the PEFT resume-from-checkpoint branch.
3. Replace the folder where the transformers library is installed.
4. Restart the runtime.
5. Run the code snippet above.

@younesbelkada @llohann-speranca I guess you would have run the snippet against the modified trainer code that resides internally. Could you please try running the code downloaded from git on that specific branch? Thanks a lot for your effort in validating this.
Hi @techthiyanes

You can install transformers from that branch with:

```
pip install git+https://github.com/llohann-speranca/transformers.git@fix-resume-checkpoint-for-peftmodel
```

Line 1991 of your traceback doesn't match line 1991 of the fork (https://github.com/llohann-speranca/transformers/blob/e01a4aa77073b847b9451c92c2df718a67960df1/src/transformers/trainer.py#L1991), so I believe you did not install transformers correctly from that branch.
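One quick way to confirm which copy of a library is actually being imported, and whether a traceback line number matches the installed source, is a sketch like the following. The helper name `describe_install` is ours for illustration, not a transformers API:

```python
# Sketch: report which file a module is imported from, and show the
# source lines around a given line number (e.g. line 1991 of a
# traceback). The helper name is illustrative, not part of transformers.
import importlib


def describe_install(module_name: str, line_no: int, context: int = 2):
    """Return the module's file path and the source lines around line_no."""
    mod = importlib.import_module(module_name)
    path = mod.__file__
    with open(path) as f:
        lines = f.readlines()
    # Convert the 1-indexed line number to a 0-indexed slice with context.
    start = max(line_no - 1 - context, 0)
    return path, lines[start : line_no + context]
```

For this issue one would call, e.g., `describe_install("transformers.trainer", 1991)` and compare the returned path and lines against the fork.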
Thanks a lot for finding and fixing this issue.
System Info
`transformers`: 4.30
Who can help?
@llohann-speranca @younesbelkada
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Please try the below code snippet as an example:

For the above code snippet I have pulled @llohann-speranca's resume-from-checkpoint repo and replaced the installed transformers repo with it.

The initial `trainer.train()` call works without any issues.

As mentioned, I have overridden the model by using `trainer.save_model(<path of saved model>)`.

For resuming from checkpoint, I have set the number of epochs much higher than before.

When calling `trainer.train(resume_from_checkpoint=True)`, it reports that it can't find a valid checkpoint.

Likewise, when calling `trainer.train(resume_from_checkpoint=<path of saved model>)`, it reports that it can't find a valid checkpoint.

The same issue persists with the source-installed version of transformers as well.
Expected behavior
Training should resume from the saved checkpoint.