
stage3: efficient compute of scaled_global_grad_norm #5256

Merged 4 commits into microsoft:master on Apr 14, 2024

Conversation

@nelyahu (Contributor) commented Mar 11, 2024

Use torch.norm instead of an inefficient for loop to compute the scaled global gradient norm.
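As a minimal sketch of the idea behind this change (not the actual DeepSpeed diff; the function names and values below are illustrative), reducing the per-group gradient norms with a single `torch.norm` over a stacked tensor replaces a Python-level loop of small host-side operations with one tensor op:

```python
import torch

def global_norm_loop(norm_list):
    # Baseline: accumulate the sum of squares in a Python for loop,
    # paying one small host-side operation per partition norm.
    total = 0.0
    for norm in norm_list:
        total += norm.item() ** 2
    return total ** 0.5

def global_norm_vectorized(norm_list):
    # Vectorized form: stack the per-group norms into one tensor and let
    # torch.norm reduce them in a single call.
    return torch.norm(torch.stack(norm_list), p=2)

# Illustrative per-parameter-group gradient norms.
norms = [torch.tensor(0.5), torch.tensor(1.2), torch.tensor(0.3)]
print(global_norm_loop(norms))        # ~1.3342
print(global_norm_vectorized(norms))  # same value, computed as one tensor op
```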
@tjruwase (Contributor) commented
@nelyahu, please help resolve the formatting issue.

@nelyahu (Contributor, Author) commented Apr 14, 2024

> @nelyahu, please help resolve the formatting issue.

@tjruwase Done.

@tjruwase added this pull request to the merge queue Apr 14, 2024
Merged via the queue into microsoft:master with commit 54c0687 Apr 14, 2024
14 checks passed
lekurile added a commit that referenced this pull request Apr 24, 2024
github-merge-queue bot pushed a commit that referenced this pull request Apr 25, 2024
Revert "stage3: efficient compute of scaled_global_grad_norm (#5256)" (#5461)

This reverts commit 54c0687 because #5256 causes bugs when the ZeRO3 + ZeRO Offload features are enabled.

The bug was discovered through failures in the DS Chat CI workflow.
Failing tests across CI runs:
| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |

Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```

It seems that `torch.stack()` or `torch.norm()` has issues when the offload
feature is enabled and tensors are split between CPU and GPU; however, this is
only an initial guess and requires further investigation.

@nelyahu Since you are the original author of the PR, if you have some
bandwidth, any help here is greatly appreciated!

After reverting this commit, all tests pass in the DS Chat CI workflow:

https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763

@tjruwase for context.
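For context on the revert, here is a hypothetical reproduction of the suspected failure mode (not taken from the DeepSpeed code; it assumes a CUDA device is available): `torch.stack` refuses to combine tensors that live on different devices, which is what can happen when ZeRO Offload keeps some gradient state on the CPU. Moving the norms onto a common device before stacking is one possible way to keep a vectorized path:

```python
import torch

# Hypothetical repro: with ZeRO Offload, some per-group norms may live on the
# CPU while others live on the GPU.
norms = [torch.tensor(0.5, device="cuda"), torch.tensor(1.2, device="cpu")]

try:
    torch.norm(torch.stack(norms), p=2)
except RuntimeError as err:
    # Raises: "Expected all tensors to be on the same device, ..."
    print(err)

# One possible workaround: move every norm onto a single device first.
aligned = [n.to("cuda") for n in norms]
print(torch.norm(torch.stack(aligned), p=2))
```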
nelyahu added a commit to nelyahu/DeepSpeed that referenced this pull request May 1, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
@nelyahu deleted the stage_3_scaled_global_norm_calc branch June 9, 2024 10:06
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024