Incorrect Saving Peft Models using HuggingFace Trainer #96
Comments
Another way to save storage, without any code modifications in HF, is to create a callback:
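The original callback is not shown here; a minimal sketch, assuming it saves the adapter once at the end of training (the class name and exact behavior are assumptions, per the follow-up below):

```python
import os

from transformers import TrainerCallback


class SavePeftAtTrainEndCallback(TrainerCallback):  # hypothetical reconstruction, not the original code
    def on_train_end(self, args, state, control, **kwargs):
        # Save only the PEFT adapter weights, not the full base model
        peft_model_path = os.path.join(args.output_dir, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)
```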
That is a really clean way to save the PEFT checkpoints; I think that should serve the purpose.
Very nice solution, @agemagician. I have adapted it to the use case of saving at steps/epochs instead of at the end of training:

```python
import os

from transformers import (
    TrainerCallback,
    TrainerControl,
    TrainerState,
    TrainingArguments,
)
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(
            args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}"
        )

        # Save only the adapter weights inside the checkpoint folder
        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        # Remove the full model weights that the Trainer already wrote
        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)

        return control
```
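For reference, such a callback would be attached when constructing the Trainer. A hypothetical setup, where peft_model, training_args, and train_dataset are placeholders for your own objects:

```python
from transformers import Trainer

# Hypothetical usage; peft_model, training_args, and train_dataset are placeholders.
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[SavePeftModelCallback()],
)
trainer.train()
```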
How do you resume_from_checkpoint with this approach? Removing pytorch_model.bin causes "ValueError: Can't find a valid checkpoint".
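One possible workaround, not confirmed in this thread: skip resume_from_checkpoint and reload the saved adapter manually before continuing training. A sketch, assuming a checkpoint path like the one produced above:

```python
# A possible workaround (an assumption, not an answer given in this thread):
# rebuild the base model and load the saved adapter on top of it.
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
model = PeftModel.from_pretrained(
    base_model, "output_dir/checkpoint-500/adapter_model"  # hypothetical path
)
```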
Is there any way to save only the PEFT model with such a callback? As far as I understand the callbacks, they save the whole model and later remove the main model. The problem is that when training in 8-bit mode, this leads to a crash because of OOM.
You can do so by subclassing the Trainer class and overriding the _save_checkpoint method:

```python
import os

import numpy as np
from transformers import (
    Trainer,
    TrainerCallback,
    TrainerControl,
    TrainerState,
    TrainingArguments,
)
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class PeftTrainer(Trainer):
    def _save_checkpoint(self, _, trial, metrics=None):
        """Don't save the base model, optimizer, etc.,
        but create the checkpoint folder (needed for saving the adapter)."""
        checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
        run_dir = self._get_output_dir(trial=trial)
        output_dir = os.path.join(run_dir, checkpoint_folder)

        # Keep the best-checkpoint bookkeeping from the original _save_checkpoint
        if metrics is not None and self.args.metric_for_best_model is not None:
            metric_to_check = self.args.metric_for_best_model
            if not metric_to_check.startswith("eval_"):
                metric_to_check = f"eval_{metric_to_check}"
            metric_value = metrics[metric_to_check]

            operator = np.greater if self.args.greater_is_better else np.less
            if (self.state.best_metric is None or self.state.best_model_checkpoint is None
                    or operator(metric_value, self.state.best_metric)):
                self.state.best_metric = metric_value
                self.state.best_model_checkpoint = output_dir

        os.makedirs(output_dir, exist_ok=True)

        if self.args.should_save:
            self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)


class PeftSavingCallback(TrainerCallback):
    """Correctly save the PEFT model and not the full model."""

    def _save(self, model, folder):
        peft_model_path = os.path.join(folder, "adapter_model")
        model.save_pretrained(peft_model_path)

    def on_train_end(self, args: TrainingArguments, state: TrainerState,
                     control: TrainerControl, **kwargs):
        """Save the final best model adapter."""
        self._save(kwargs["model"], state.best_model_checkpoint)

    def on_epoch_end(self, args: TrainingArguments, state: TrainerState,
                     control: TrainerControl, **kwargs):
        """Save intermediate model adapters in case of interrupted training."""
        folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")
        self._save(kwargs["model"], folder)
```
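A hypothetical way to wire these two classes together (variable names and arguments are illustrative):

```python
# Hypothetical usage of the PeftTrainer and PeftSavingCallback defined above.
trainer = PeftTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[PeftSavingCallback()],
)
trainer.train()
```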
Hello there, may I ask how you defined model_init in this use case? I am not sure of the appropriate way to do it with LoRA (compared to the usual way outlined in this tutorial). Many thanks. @agemagician @pie3636
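For what it's worth, one possible way to define model_init for hyperparameter_search with LoRA (a sketch with assumed hyperparameters, not an answer given in this thread) is to return a freshly wrapped PEFT model on every call, so each trial starts from the same base weights:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM


def model_init():
    # Rebuild the model from scratch for every trial; r, lora_alpha, and
    # lora_dropout are illustrative values, not ones recommended in this thread.
    base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.1
    )
    return get_peft_model(base_model, lora_config)
```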
Hello,
Thanks a lot for the great project.
I am fine-tuning Flan-T5-XXL using HuggingFace Seq2SeqTrainer and hyperparameter_search.
However, the Trainer doesn't store PEFT models correctly, because they are not of the "PreTrainedModel" type.
It stores the whole PyTorch model, including Flan-T5-XXL, which is around 42 GB.
I have dug into the code, and I made a hacky solution inside "trainer.py" for now:
Do you have a better solution for saving the "Peft models" correctly using HuggingFace Seq2SeqTrainer and hyperparameter_search?