
stage3: efficient compute of scaled_global_grad_norm #5256

Merged 4 commits into microsoft:master on Apr 14, 2024

Conversation

@nelyahu (Contributor) commented Mar 11, 2024

Use torch.norm instead of an inefficient for loop to compute the scaled global gradient norm.
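As a minimal sketch of the idea behind this change (not the actual DeepSpeed diff; the function names and values below are illustrative), reducing the per-group gradient norms with a single `torch.norm` over a stacked tensor replaces a Python-level loop of small host-side operations with one tensor op:

```python
import torch

def global_norm_loop(norm_list):
    # Baseline: accumulate the sum of squares in a Python for loop,
    # paying one small host-side operation per partition norm.
    total = 0.0
    for norm in norm_list:
        total += norm.item() ** 2
    return total ** 0.5

def global_norm_vectorized(norm_list):
    # Vectorized form: stack the per-group norms into one tensor and let
    # torch.norm reduce them in a single call.
    return torch.norm(torch.stack(norm_list), p=2)

# Illustrative per-parameter-group gradient norms.
norms = [torch.tensor(0.5), torch.tensor(1.2), torch.tensor(0.3)]
print(global_norm_loop(norms))        # ~1.3342
print(global_norm_vectorized(norms))  # same value, computed as one tensor op
```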
@tjruwase (Contributor) commented
@nelyahu, please help resolve the formatting issue.

@nelyahu (Contributor, Author) commented Apr 14, 2024

> @nelyahu, please help resolve the formatting issue.

@tjruwase Done.

@tjruwase added this pull request to the merge queue Apr 14, 2024
Merged via the queue into microsoft:master with commit 54c0687 Apr 14, 2024
14 checks passed
lekurile added a commit that referenced this pull request Apr 24, 2024
github-merge-queue bot pushed a commit that referenced this pull request Apr 25, 2024
Revert "stage3: efficient compute of scaled_global_grad_norm (#5256)" (#5461)

This reverts commit 54c0687 because #5256 causes bugs when the ZeRO3 + ZeRO Offload features are enabled.

The bug was discovered through failures in the DS Chat CI workflow.
Failing tests across CI runs:
| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |

Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```

It seems that `torch.stack()` or `torch.norm()` has issues when the offload
feature is enabled and tensors are split between CPU and GPU; however, this is
only an initial guess and requires further investigation.

@nelyahu Since you are the original author of the PR, if you have some
bandwidth, any help here is greatly appreciated!

After reverting this commit, all tests pass in the DS Chat CI workflow:

https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763

@tjruwase for context.
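For context on the revert, here is a hypothetical reproduction of the suspected failure mode (not taken from the DeepSpeed code; it assumes a CUDA device is available): `torch.stack` refuses to combine tensors that live on different devices, which is what can happen when ZeRO Offload keeps some gradient state on the CPU. Moving the norms onto a common device before stacking is one possible way to keep a vectorized path:

```python
import torch

# Hypothetical repro: with ZeRO Offload, some per-group norms may live on the
# CPU while others live on the GPU.
norms = [torch.tensor(0.5, device="cuda"), torch.tensor(1.2, device="cpu")]

try:
    torch.norm(torch.stack(norms), p=2)
except RuntimeError as err:
    # Raises: "Expected all tensors to be on the same device, ..."
    print(err)

# One possible workaround: move every norm onto a single device first.
aligned = [n.to("cuda") for n in norms]
print(torch.norm(torch.stack(aligned), p=2))
```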
nelyahu added a commit to nelyahu/DeepSpeed that referenced this pull request May 1, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
@nelyahu deleted the stage_3_scaled_global_norm_calc branch June 9, 2024 10:06
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024