[core] Fix DeepSpeed zero-3 issue #182

Merged
younesbelkada merged 8 commits into huggingface:main from ds-fix-issue
Mar 28, 2023

Conversation

younesbelkada
Collaborator

What does this PR do?

This is an attempt to fix #171

For now the sentiment script hangs, so I need to investigate further.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Feb 28, 2023

The documentation is not available anymore as the PR was closed or merged.

Contributor

@pacman100 pacman100 left a comment

Thank you @younesbelkada for fixing the TRL+DS integration. I left a comment. The sentiment-pipeline-related changes have been shared offline.

) = self.accelerator.prepare(
self.model, self.ref_model, self.optimizer, self.data_collator, self.dataloader, self.lr_scheduler
# Safety checkers for DS integration
is_deepspeed_zero_3 = (
Contributor

This change is independent of the DS stage; it should be applied for all DS stages.
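A minimal sketch of what a stage-independent check could look like. It is not the code from this PR: `uses_deepspeed` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for `accelerator.state` so the snippet runs without Accelerate or DeepSpeed installed. It assumes the state object exposes a `deepspeed_plugin` attribute only when DeepSpeed is in use.

```python
from types import SimpleNamespace


def uses_deepspeed(state) -> bool:
    """Return True when a DeepSpeed plugin is attached, whatever its ZeRO stage.

    Hypothetical helper: `state` stands in for `accelerator.state`.
    """
    return getattr(state, "deepspeed_plugin", None) is not None


# Stand-ins for accelerator.state in three setups (assumed shapes, not real objects).
plain_state = SimpleNamespace()  # no DeepSpeed configured
zero2_state = SimpleNamespace(deepspeed_plugin=SimpleNamespace(zero_stage=2))
zero3_state = SimpleNamespace(deepspeed_plugin=SimpleNamespace(zero_stage=3))

for name, state in [("plain", plain_state), ("zero2", zero2_state), ("zero3", zero3_state)]:
    print(name, uses_deepspeed(state))
```

The check keys on the presence of the plugin rather than on `zero_stage == 3`, which is the distinction the review comment draws.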

trl/trainer/ppo_trainer.py (outdated)
Comment on lines +291 to +292
if self.accelerator.state.deepspeed_plugin.zero_stage == 3:
self.model.train()
Collaborator Author

Based on the offline discussion I had with @pacman100, I confirm this hack is needed to make DS3 (DeepSpeed ZeRO stage 3) work.
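For illustration, the hack in the two lines above can be sketched as a guarded helper. This is an assumption-laden sketch, not the merged code: `is_zero_stage_3`, `force_train_mode_for_zero3`, and `ToyModel` are hypothetical names, and `SimpleNamespace` stands in for `accelerator.state`. The `getattr` guard is an addition here so the check also works when no DeepSpeed plugin is present; the PR lines read `deepspeed_plugin.zero_stage` directly.

```python
from types import SimpleNamespace


class ToyModel:
    """Stand-in for a torch.nn.Module; only tracks the train/eval flag."""

    def __init__(self):
        self.training = False

    def train(self):
        self.training = True
        return self


def is_zero_stage_3(state) -> bool:
    """True only when a DeepSpeed plugin exists and reports ZeRO stage 3."""
    plugin = getattr(state, "deepspeed_plugin", None)
    return plugin is not None and getattr(plugin, "zero_stage", None) == 3


def force_train_mode_for_zero3(model, state):
    """Put the model in train mode when running under ZeRO stage 3."""
    if is_zero_stage_3(state):
        model.train()
    return model


# Assumed state shapes for a ZeRO-3 run and a run without DeepSpeed.
zero3_model = force_train_mode_for_zero3(
    ToyModel(), SimpleNamespace(deepspeed_plugin=SimpleNamespace(zero_stage=3))
)
plain_model = force_train_mode_for_zero3(ToyModel(), SimpleNamespace())
print(zero3_model.training, plain_model.training)
```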

@younesbelkada younesbelkada requested review from pacman100 and lvwerra and removed request for pacman100 March 27, 2023 15:21
Member

@lvwerra lvwerra left a comment

One small comment, otherwise looks good!

trl/trainer/ppo_trainer.py (outdated)
@younesbelkada younesbelkada merged commit 2672a94 into huggingface:main Mar 28, 2023
@younesbelkada younesbelkada deleted the ds-fix-issue branch March 28, 2023 11:44
Development

Successfully merging this pull request may close these issues.

[bug] Fix DeepSpeed zero-3 issue
4 participants