
fix bnb multi gpu training #2714

Merged — 6 commits, Apr 26, 2024
Conversation

@SunMarc (Member) commented Apr 26, 2024

What does this PR do?

This PR fixes a bug that users hit when performing multi-GPU training with bnb and the first GPU is not used. In an earlier PR, we did not allow multi-GPU training at all, and all the code after that check assumes a single-GPU setup. Since then, we added the possibility of multi-GPU training using naive PP. However, the logic after the multi-GPU check stayed the same, as you can see on main. A quick fix is to just add an elif statement.
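For context, here is a simplified sketch of the shape of that check and the fix. The attribute names (`is_loaded_in_4bit`, `is_loaded_in_8bit`, `hf_device_map`) follow accelerate, but the standalone helper, its arguments, and the error messages are illustrative rather than the actual implementation:

```python
import torch

def check_quantized_device_map(model, accelerator_device, is_distributed):
    """Sketch of the quantized-model device check in Accelerator.prepare_model."""
    quantized = getattr(model, "is_loaded_in_8bit", False) or getattr(model, "is_loaded_in_4bit", False)
    if quantized and getattr(model, "hf_device_map", False):
        model_devices = set(model.hf_device_map.values())
        if len(model_devices) > 1 and is_distributed:
            # Splitting a quantized model across GPUs is only supported for
            # naive pipeline parallelism, not distributed (e.g. DDP) training.
            raise ValueError("You can't train a quantized model split across devices in a distributed setup.")
        elif len(model_devices) == 1:
            # The fix: before this PR, the block below ran unconditionally, so
            # the multi-GPU naive-PP case wrongly fell into this single-GPU
            # sanity check and raised.
            current_device = list(model_devices)[0]
            index = current_device.index if isinstance(current_device, torch.device) else current_device
            if torch.device(index) != accelerator_device and index != 0:
                raise ValueError("The model is on a different device than the one used for training.")
```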

Fixes #2713 #2429

@younesbelkada (Contributor) left a comment

Makes sense, thanks @SunMarc!

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@muellerzr (Collaborator) left a comment

Good catch! 🚀

@SunMarc SunMarc merged commit cd7df41 into huggingface:main Apr 26, 2024
23 checks passed
jambayk added a commit to microsoft/Olive that referenced this pull request Apr 27, 2024
## Describe your changes
There is a bug in accelerate where it assumes that bnb (loaded_in_4bit) models on multiple GPUs cannot be trained, but this is not the case. On AML compute targets with multiple GPUs, the model doesn't have any weights on device 0, which is the default device for the accelerator, so it triggers the bug at
https://github.com/huggingface/accelerate/blob/e82de1215ae701b6bf567eb705615c656e7f55c7/src/accelerate/accelerator.py#L1374.
This has been fixed on main by huggingface/accelerate#2714 but is not in a release yet.

We work around this bug for current stable releases of accelerate by setting `ACCELERATE_TORCH_DEVICE` to the first device the model is on.
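A minimal sketch of that workaround, assuming a bnb-quantized model loaded with a `device_map` (the model name and device map below are illustrative):

```python
import os
from transformers import AutoModelForCausalLM

# Illustrative: a 4-bit bnb model whose weights live on GPU 1, not GPU 0.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", load_in_4bit=True, device_map={"": 1}
)

# Point accelerate's default device at the device the model actually occupies.
# ACCELERATE_TORCH_DEVICE is read when accelerate picks its device, so this
# must be set before the Accelerator (or a Trainer) is created.
first_device = next(model.parameters()).device  # e.g. cuda:1
os.environ["ACCELERATE_TORCH_DEVICE"] = str(first_device)
```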
## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to
update [example
documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md)
in a follow-up PR.

## (Optional) Issue link