
[BUG] 'type:transformer' partitioning doesn't ensure non-zero parameters on each pipeline rank. #5078

Open
siddharth9820 opened this issue Feb 5, 2024 · 5 comments


siddharth9820 commented Feb 5, 2024

Running Megatron-DeepSpeed with pipelining calls PipeModule with the `type:transformer` partitioning method, which leads to this line of code:

    self.parts = ds_utils.partition_balanced(weights=binary_weights, num_parts=num_stages)

I tried running this with a model with 42 layers, tensor parallelism = 4, and pipeline parallelism = 16. Pipeline ranks 15 and 16 were assigned 0 layers. Something needs to change to guarantee that a non-zero number of layers is assigned to each rank.
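To make the failure mode concrete, here is a minimal, self-contained sketch (not DeepSpeed's actual `partition_balanced`, and `greedy_bottleneck_partition` is a hypothetical helper) of a contiguous partitioner that minimizes the largest part but can leave trailing pipeline stages empty. Uniform layer weights are assumed, with no single weight above the bound:

```python
import math

def greedy_bottleneck_partition(weights, num_parts):
    """Greedily pack each contiguous part up to the optimal bottleneck
    B = ceil(total / num_parts). The maximum part weight is minimized,
    but trailing parts can end up with zero items."""
    bound = math.ceil(sum(weights) / num_parts)
    bounds = [0]
    idx = 0
    for _ in range(num_parts):
        load = 0
        while idx < len(weights) and load + weights[idx] <= bound:
            load += weights[idx]
            idx += 1
        bounds.append(idx)
    return bounds

# 42 uniform transformer layers over 16 pipeline stages: stages 0-13 get
# 3 layers each, and the last two stages get nothing -- the failure mode
# reported above.
bounds = greedy_bottleneck_partition([1] * 42, 16)
sizes = [bounds[i + 1] - bounds[i] for i in range(16)]
```

The point is that a partitioner whose only objective is the maximum part weight has no incentive to keep every part non-empty.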


tjruwase commented Feb 5, 2024

@siddharth9820, thanks for reporting this error. I am curious whether this is a recent regression caused by the PR below, which changed the balancing algorithm:
#4312

Can you please try earlier DeepSpeed versions (v0.13.0 or v0.12.6), or revert that PR?

siddharth9820 (Contributor, Author) commented Feb 5, 2024

@tjruwase I am able to reproduce the error outside of Megatron-DeepSpeed as well:

[screenshot: standalone reproduction showing empty pipeline stages]

I'll try the other versions too. Thanks for the pointer.

About potential fixes: could you first assign 1 layer to each rank, and then run this function on the remaining n - m layers and m ranks? But that wouldn't be an ideal fix if the weights aren't uniform.
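As a rough illustration of that idea under the assumption of uniform layer weights, a contiguous partitioner could reserve one layer per rank up front and spread only the surplus. `partition_nonempty` is a hypothetical helper, not DeepSpeed code:

```python
def partition_nonempty(num_layers, num_parts):
    """Reserve one layer per pipeline rank, then distribute the remaining
    n - m layers as evenly as possible, so no rank is ever empty.
    Assumes uniform layer weights."""
    assert num_layers >= num_parts, "need at least one layer per rank"
    surplus = num_layers - num_parts
    bounds = [0]
    idx = 0
    for p in range(num_parts):
        # one reserved layer plus this rank's share of the surplus
        extra = surplus * (p + 1) // num_parts - surplus * p // num_parts
        idx += 1 + extra
        bounds.append(idx)
    return bounds

# The case from this issue: every rank now gets 2 or 3 layers, none empty.
bounds = partition_nonempty(42, 16)
sizes = [bounds[i + 1] - bounds[i] for i in range(16)]
```

This only guarantees a minimum of one layer per rank; with non-uniform weights, the surplus would need to be balanced by weight rather than by count.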

@siddharth9820 siddharth9820 changed the title 'type:transformer' partitioning doesn't ensure non-zero parameters on each pipeline rank. [BUG] 'type:transformer' partitioning doesn't ensure non-zero parameters on each pipeline rank. Feb 5, 2024

tjruwase commented Feb 5, 2024

@siddharth9820, thanks for the update. This seems like an implementation bug, as I find it hard to believe that both the new and old algorithms fail on these seemingly practical cases:

  1. Old algorithm - Fast Optimal Load Balancing Algorithms for 1D Partitioning
  2. New algorithm - https://www8.cs.umu.se/kurser/TDBAfl/VT06/algorithms/BOOK/BOOK2/NODE45.HTM
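For reference, the "new algorithm" linked above is the textbook linear-partition dynamic program. A simplified sketch (not DeepSpeed's actual implementation) shows that its objective, minimizing the largest contiguous part sum, is indifferent to empty parts: for 42 uniform layers and 16 parts the optimal bottleneck is 3, which 14 parts of 3 layers already achieve, so an optimal solution is free to leave two parts empty.

```python
def linear_partition_bottleneck(weights, k):
    """O(n^2 * k) DP for the linear partition problem: split a sequence
    into at most k contiguous parts, minimizing the largest part sum.
    Empty parts are permitted and cost nothing."""
    n = len(weights)
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    INF = float("inf")
    # dp[i][j]: minimal bottleneck for the first i items split into j parts
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(1, k + 1):
            for t in range(i + 1):  # items t..i-1 form part j (may be empty)
                if dp[t][j - 1] < INF:
                    cand = max(dp[t][j - 1], prefix[i] - prefix[t])
                    if cand < dp[i][j]:
                        dp[i][j] = cand
    return dp[n][k]
```

Nothing in this objective penalizes an empty part, which is consistent with the behavior reported in this issue.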


tjruwase commented Feb 5, 2024

> Could you first assign 1 layer to each rank, and then run this function on the remaining n - m layers and m ranks? But that wouldn't be an ideal fix if the weights aren't uniform.

Yes, it does not seem like this approach would be balanced. I think it will only increase the minimum from zero to one. Right?

siddharth9820 (Contributor, Author) commented:

Yes, it won't be balanced, but at least it will "run" with Megatron-DeepSpeed. With the current approach, I was getting "empty parameter" errors during optimizer initialization. I believe this was happening on the second-to-last pipeline rank, since it became parameterless.
