
Fix the sequence-parallelism for the dense model architecture #4530

Merged
merged 8 commits into master on Oct 25, 2023

Conversation

RezaYazdaniAminabadi
Contributor

This PR fixes convergence issues when SP > 1.
We observed that the gradient norms were lower when using SP=2 for a dense model. Upon further investigation, we found that the gradients were being scaled by the total world size, whereas they should have been summed across the SP ranks and averaged over the DP world. Here is the initial curve comparing the grad norm of SP=1 (grey) vs. SP=2 (green):
[Figure: grad-norm curves, SP=1 (grey) vs. SP=2 (green)]
After fixing the gradient scaling to use the correct scale, we get parity on the grad norm at first; however, it gradually increases over time and results in inferior LM validation loss (orange: SP=1, grey: SP=2):
[Figure: grad-norm and LM-validation curves, SP=1 (orange) vs. SP=2 (grey)]
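The scaling mismatch described above can be illustrated with a small numeric sketch (plain Python, no torch; the rank layout and variable names here are illustrative, not DeepSpeed's actual code):

```python
# Numeric sketch of the scaling bug: with DP x SP ranks, each SP shard
# holds the gradient for a different slice of the sequence, so shard
# contributions must be SUMMED; only the DP dimension should be averaged.
DP, SP = 2, 2  # data-parallel replicas x sequence-parallel shards

# grads[d][s]: gradient contribution held by SP shard s of DP replica d
grads = [[1.0, 3.0], [2.0, 2.0]]

total = sum(g for replica in grads for g in replica)

# Buggy: divide the global sum by the total world size (DP * SP),
# which under-scales the gradient by a factor of SP.
buggy = total / (DP * SP)                            # 2.0

# Fixed: sum within each SP group, then average across DP replicas.
per_replica = [sum(replica) for replica in grads]    # [4.0, 4.0]
fixed = sum(per_replica) / DP                        # 4.0
```

Here the buggy reduction is smaller than the correct one by exactly the SP factor, matching the lower grad norms seen in the first curve.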

Fortunately, we were able to fix this by increasing the precision of the gradients before summing them. The following curves show the LM validation loss for the different stages of debugging the SP convergence issue (orange: SP=1, grey: SP=2 with bf16 gradients, blue: SP=2 with fp32 gradients):
[Figure: LM validation loss, SP=1 (orange), SP=2 bf16 gradients (grey), SP=2 fp32 gradients (blue)]
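To see why summing gradients in bf16 drifts, here is a small emulation of bf16 rounding via bit truncation (plain Python; `to_bf16` is a stand-in helper, and Python floats are fp64, which is precise enough to play the role of the higher-precision accumulator):

```python
import struct

def to_bf16(x: float) -> float:
    # Round a float32 to bfloat16 (round-to-nearest-even on the low
    # 16 bits) and re-expand to a Python float. Emulation only.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

acc_hi = 0.0    # high-precision accumulator (Python float, i.e. fp64)
acc_bf16 = 0.0  # accumulator rounded to bf16 after every addition
for _ in range(10_000):
    acc_hi += 1e-3
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(1e-3))

# acc_hi lands near 10.0, while acc_bf16 stalls far below it: once the
# accumulator grows, a 1e-3 addend falls under half a bf16 ULP and is
# rounded away entirely.
```

Small per-step contributions being swallowed like this is the kind of error that compounds over training steps; keeping the reduction in fp32 avoids it.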
cc: @samadejacobs @tohtana

@RezaYazdaniAminabadi RezaYazdaniAminabadi changed the title Fix the sequence-parallelism for the dense models Fix the sequence-parallelism for the dense model architecture Oct 17, 2023
@samadejacobs samadejacobs requested review from samadejacobs and removed request for samyam October 19, 2023 19:54
@tjruwase tjruwase requested a review from tohtana October 21, 2023 22:16
@mrwyattii mrwyattii merged commit ec029e7 into master Oct 25, 2023
15 checks passed
@@ -1395,15 +1397,15 @@ def allreduce_bucket(self, bucket, rank=None, log=None):

tensor_to_allreduce = tensor

-            if pg_correctness_test:
+            if pg_correctness_test or self.sequence_parallel_size > 1:
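As a rough sketch of what the changed guard enables: when SP > 1 (or during process-group correctness tests), the bucket is communicated in fp32 rather than its bf16 storage dtype. The helper below is a stand-in for illustration, not DeepSpeed's internals:

```python
# Stand-in for the communication-dtype selection behind the guard
# above; not DeepSpeed's actual code.
def pick_comm_dtype(storage_dtype: str,
                    sequence_parallel_size: int,
                    pg_correctness_test: bool = False) -> str:
    # Mirror the changed condition: communicate in fp32 whenever
    # sequence parallelism is active, since bf16 summation across
    # SP ranks was observed to drift.
    if pg_correctness_test or sequence_parallel_size > 1:
        return "fp32"
    return storage_dtype
```

For example, `pick_comm_dtype("bf16", 2)` returns `"fp32"`, while `pick_comm_dtype("bf16", 1)` keeps the bf16 storage dtype.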

`self.sequence_parallel_size > 1` is now redundant given the ds_config flag, right?

baodii pushed a commit to baodii/DeepSpeed that referenced this pull request Nov 7, 2023
…oft#4530)

Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
…oft#4530)

Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

5 participants