
Conversation

kiritoxkiriko
Contributor

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

With trl >= 0.20.0, saving the model when training PPO with DDP raises AttributeError: 'DistributedDataParallel' object has no attribute 'config'.

#5287 fixed this error when saving the model after training, but the same issue still triggers when saving a checkpoint during training.

Inside _save_checkpoint, trl calls create_model_card to build a model card. That function reads self.model.config, which does not exist on a DDP-wrapped model, so we need to unwrap the model first. See:

# part of trl's create_model_card function
  def create_model_card(
      self,
      model_name: Optional[str] = None,
      dataset_name: Optional[str] = None,
      tags: Union[str, list[str], None] = None,
  ):
      """
      Creates a draft of a model card using the information available to the `Trainer`.

      Args:
          model_name (`str` or `None`, *optional*, defaults to `None`):
              Name of the model.
          dataset_name (`str` or `None`, *optional*, defaults to `None`):
              Name of the dataset used for training.
          tags (`str`, `list[str]` or `None`, *optional*, defaults to `None`):
              Tags to be associated with the model card.
      """
      if not self.is_world_process_zero():
          return

      if hasattr(self.model.config, "_name_or_path") and not os.path.isdir(self.model.config._name_or_path):
          base_model = self.model.config._name_or_path
      else:
          base_model = None
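
For context, a DDP wrapper only resolves its own registered members (parameters, buffers, submodules); arbitrary Python attributes of the wrapped module, such as config, stay reachable only through .module. A self-contained illustration with a toy module and a single-process gloo group (the config dict here just stands in for a transformers config):

  import os

  import torch.distributed as dist
  import torch.nn as nn
  from torch.nn.parallel import DistributedDataParallel as DDP

  class TinyModel(nn.Module):
      def __init__(self):
          super().__init__()
          self.linear = nn.Linear(2, 2)
          self.config = {"_name_or_path": "tiny"}  # stands in for a transformers config

  # single-process "distributed" setup so DDP can be constructed on CPU
  os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
  os.environ.setdefault("MASTER_PORT", "29500")
  dist.init_process_group("gloo", rank=0, world_size=1)

  wrapped = DDP(TinyModel())
  print(hasattr(wrapped, "config"))  # False: nn.Module.__getattr__ raises AttributeError
  print(wrapped.module.config)       # works: the underlying module still has it

  dist.destroy_process_group()
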

This PR fixes the problem by unwrapping the DDP model before calling _save_checkpoint.
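
A minimal sketch of the idea, assuming the trainer subclass keeps the DDP wrapper in self.model during training, as the traceback suggests. HFPPOTrainer below is just trl's PPOTrainer; swift's real class layers additional mixins on top, so treat the structure as illustrative rather than the merged code:

  from torch.nn.parallel import DistributedDataParallel
  from trl import PPOTrainer as HFPPOTrainer

  class PPOTrainer(HFPPOTrainer):
      def _save_checkpoint(self, *args, **kwargs):
          # trl's _save_checkpoint calls create_model_card, which reads
          # `self.model.config`; swap in the underlying module so that
          # attribute lookup succeeds.
          wrapped_model = self.model
          if isinstance(wrapped_model, DistributedDataParallel):
              self.model = wrapped_model.module
          result = super()._save_checkpoint(*args, **kwargs)
          # Put the DDP wrapper back so training continues unchanged.
          self.model = wrapped_model
          return result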

Experiment results

Traceback before the fix, hit when saving a checkpoint during training:

[rank0]:   File "/app/.venv/lib/python3.10/site-packages/swift/llm/train/sft.py", line 235, in train
[rank0]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank0]:   File "/app/.venv/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/ppo_trainer.py", line 62, in train
[rank0]:     super().train()
[rank0]:   File "/app/.venv/lib/python3.10/site-packages/swift/trainers/mixin.py", line 676, in train
[rank0]:     res = super().train(*args, **kwargs)
[rank0]:   File "/app/.venv/lib/python3.10/site-packages/trl/trainer/ppo_trainer.py", line 651, in train
[rank0]:     self._save_checkpoint(model, trial=None)
[rank0]:   File "/app/.venv/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/ppo_trainer.py", line 66, in _save_checkpoint
[rank0]:     return super()._save_checkpoint(*args, **kwargs)
[rank0]:   File "/app/.venv/lib/python3.10/site-packages/swift/trainers/mixin.py", line 458, in _save_checkpoint
[rank0]:     result = super()._save_checkpoint(*args, **kwargs)
[rank0]:   File "/app/.venv/lib/python3.10/site-packages/trl/trainer/ppo_trainer.py", line 757, in _save_checkpoint
[rank0]:     self.create_model_card(model_name=model_name)
[rank0]:   File "/app/.venv/lib/python3.10/site-packages/trl/trainer/ppo_trainer.py", line 780, in create_model_card
[rank0]:     if hasattr(self.model.config, "_name_or_path") and not os.path.isdir(self.model.config._name_or_path):
[rank0]:   File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: 'DistributedDataParallel' object has no attribute 'config'



@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @kiritoxkiriko, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the PPO trainer when utilizing Distributed Data Parallel (DDP) for training. It resolves an AttributeError that previously occurred during model saving, which was caused by the trl library's internal create_model_card function attempting to access configuration directly from a DDP-wrapped model. The implemented solution ensures that the model is correctly unwrapped before saving, thereby preventing crashes and enabling seamless checkpointing during distributed PPO training.

Highlights

  • Bug Fix for DDP PPO Training: This pull request resolves an AttributeError that occurs when saving a model during PPO training with Distributed Data Parallel (DDP) enabled, specifically when trl version is 0.20.0 or higher.
  • Root Cause Identification: The error stems from trl's create_model_card function attempting to access self.model.config directly on a DistributedDataParallel object, which does not expose the config attribute directly.
  • Model Unwrapping Mechanism: The fix involves temporarily unwrapping the DDP model before calling the superclass's _save_checkpoint method, allowing trl to access the underlying model's configuration, and then restoring the DDP-wrapped model afterwards.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes an AttributeError during PPO training with DDP by unwrapping the model before saving a checkpoint. The overall change is good. I've provided one suggestion to improve the robustness of the implementation by using a try...finally block. This ensures the model's state is correctly restored even if an error occurs during the checkpointing process.
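
Concretely, the suggestion is to restore the wrapper in a finally block, so that an exception raised mid-checkpoint cannot leave self.model permanently unwrapped; a sketch reusing the hypothetical subclass from above:

  from torch.nn.parallel import DistributedDataParallel
  from trl import PPOTrainer as HFPPOTrainer

  class PPOTrainer(HFPPOTrainer):
      def _save_checkpoint(self, *args, **kwargs):
          wrapped_model = self.model
          if isinstance(wrapped_model, DistributedDataParallel):
              self.model = wrapped_model.module
          try:
              return super()._save_checkpoint(*args, **kwargs)
          finally:
              # runs whether or not checkpointing raised
              self.model = wrapped_model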

@hjh0119
Collaborator

hjh0119 commented Sep 16, 2025

thanks for your contribution

please pass the lint test

kiritoxkiriko and others added 2 commits September 19, 2025 15:43
use more robust error check

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@kiritoxkiriko kiritoxkiriko force-pushed the fix/ppo-bug branch 2 times, most recently from 095442c to e70b3a5 on September 19, 2025 08:18
@kiritoxkiriko
Contributor Author

> thanks for your contribution
>
> please pass the lint test

passed

@hjh0119 hjh0119 merged commit c25d275 into modelscope:main Sep 19, 2025
1 of 5 checks passed