
Fix the sequence-parallelism for the dense model architecture #4530

Merged
merged 8 commits into master on Oct 25, 2023

Conversation

RezaYazdaniAminabadi
Contributor

This PR fixes convergence issues when SP > 1.
We observed that the gradient norms were lower when using SP=2 for a dense model. Upon further investigation, we found that the gradients were being scaled by the total world size, whereas they should have been summed across the SP ranks and averaged over the DP world. Here is the initial curve comparing the grad norm of SP=1 (grey) vs. SP=2 (green):
[Figure: grad-norm curves, SP=1 (grey) vs. SP=2 (green)]
After fixing the gradient scaling to use the correct scale, we get parity on the grad norm at first; however, it gradually increases over time and results in inferior LM validation loss (orange: SP=1, grey: SP=2):
[Figure: grad-norm and LM-validation curves, SP=1 (orange) vs. SP=2 (grey)]
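The scaling mismatch described above can be illustrated with a small numeric sketch (plain Python, no torch; the rank layout and variable names here are illustrative, not DeepSpeed's actual code):

```python
# Numeric sketch of the scaling bug: with DP x SP ranks, each SP shard
# holds the gradient for a different slice of the sequence, so shard
# contributions must be SUMMED; only the DP dimension should be averaged.
DP, SP = 2, 2  # data-parallel replicas x sequence-parallel shards

# grads[d][s]: gradient contribution held by SP shard s of DP replica d
grads = [[1.0, 3.0], [2.0, 2.0]]

total = sum(g for replica in grads for g in replica)

# Buggy: divide the global sum by the total world size (DP * SP),
# which under-scales the gradient by a factor of SP.
buggy = total / (DP * SP)                            # 2.0

# Fixed: sum within each SP group, then average across DP replicas.
per_replica = [sum(replica) for replica in grads]    # [4.0, 4.0]
fixed = sum(per_replica) / DP                        # 4.0
```

Here the buggy reduction is smaller than the correct one by exactly the SP factor, matching the lower grad norms seen in the first curve.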

Fortunately, we were able to fix this by increasing the precision of the gradients before summing them. The following curves show the LM validation loss for the different stages of debugging the SP convergence issue (orange: SP=1, grey: SP=2 with bf16 gradients, blue: SP=2 with fp32 gradients):
[Figure: LM validation loss, SP=1 (orange), SP=2 bf16 gradients (grey), SP=2 fp32 gradients (blue)]
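To see why summing gradients in bf16 drifts, here is a small emulation of bf16 rounding via bit truncation (plain Python; `to_bf16` is a stand-in helper, and Python floats are fp64, which is precise enough to play the role of the higher-precision accumulator):

```python
import struct

def to_bf16(x: float) -> float:
    # Round a float32 to bfloat16 (round-to-nearest-even on the low
    # 16 bits) and re-expand to a Python float. Emulation only.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

acc_hi = 0.0    # high-precision accumulator (Python float, i.e. fp64)
acc_bf16 = 0.0  # accumulator rounded to bf16 after every addition
for _ in range(10_000):
    acc_hi += 1e-3
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(1e-3))

# acc_hi lands near 10.0, while acc_bf16 stalls far below it: once the
# accumulator grows, a 1e-3 addend falls under half a bf16 ULP and is
# rounded away entirely.
```

Small per-step contributions being swallowed like this is the kind of error that compounds over training steps; keeping the reduction in fp32 avoids it.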
cc: @samadejacobs @tohtana

@RezaYazdaniAminabadi RezaYazdaniAminabadi changed the title Fix the sequence-parallelism for the dense models Fix the sequence-parallelism for the dense model architecture Oct 17, 2023
@samadejacobs samadejacobs requested review from samadejacobs and removed request for samyam October 19, 2023 19:54
@tjruwase tjruwase requested a review from tohtana October 21, 2023 22:16
@mrwyattii mrwyattii merged commit ec029e7 into master Oct 25, 2023
15 checks passed
@@ -1395,15 +1397,15 @@ def allreduce_bucket(self, bucket, rank=None, log=None):

tensor_to_allreduce = tensor

-            if pg_correctness_test:
+            if pg_correctness_test or self.sequence_parallel_size > 1:
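As a rough sketch of what the changed guard enables: when SP > 1 (or during process-group correctness tests), the bucket is communicated in fp32 rather than its bf16 storage dtype. The helper below is a stand-in for illustration, not DeepSpeed's internals:

```python
# Stand-in for the communication-dtype selection behind the guard
# above; not DeepSpeed's actual code.
def pick_comm_dtype(storage_dtype: str,
                    sequence_parallel_size: int,
                    pg_correctness_test: bool = False) -> str:
    # Mirror the changed condition: communicate in fp32 whenever
    # sequence parallelism is active, since bf16 summation across
    # SP ranks was observed to drift.
    if pg_correctness_test or sequence_parallel_size > 1:
        return "fp32"
    return storage_dtype
```

For example, `pick_comm_dtype("bf16", 2)` returns `"fp32"`, while `pick_comm_dtype("bf16", 1)` keeps the bf16 storage dtype.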

`self.sequence_parallel_size > 1` is now redundant given the ds_config flag, right?

baodii pushed a commit to baodii/DeepSpeed that referenced this pull request Nov 7, 2023
…oft#4530)

Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
…oft#4530)

Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

5 participants