[BUG] Uneven work distribution caused by get_shard_size changes #5515

Open
oelayan7 opened this issue May 9, 2024 · 9 comments

Comments

@oelayan7
Contributor

oelayan7 commented May 9, 2024

def get_shard_size(total_size, mp_size, name=None, rank=None):

This change introduces a new way for AutoTP to handle sharding when split_shape isn't divisible by num_kv_heads. The same way of sharding is applied to mlp, lm_head, and embed_out as well.

In the new approach the total size is split into chunks of 64, and any remainder always takes up a full block of 64, which causes new allocations.
In addition, those extra allocations are always assigned to devices based on their ranks, which means that devices with lower ranks always receive the extra allocations; this ends up filling their memory and causing allocation failures.
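For illustration, here is a minimal sketch of the sharding rule as described above (this is not the actual DeepSpeed implementation, just a model of the behaviour; the helper name is made up, and the 11008/TP=3 numbers come from the llama-7b MLP example discussed later in this thread):

```python
# Sketch of the 64-granularity sharding rule described above (illustrative only).
# The leftover blocks always land on the lowest ranks.
def sketch_shard_size(total_size, mp_size, rank, block=64):
    num_blocks = total_size // block           # assume total_size is a multiple of 64
    base, extra = divmod(num_blocks, mp_size)  # blocks left over after an even split
    # ranks 0 .. extra-1 each receive one additional block of 64 elements
    return (base + (1 if rank < extra else 0)) * block

# e.g. llama-7b MLP first layer, weight shape 11008, TP=3:
print([sketch_shard_size(11008, 3, r) for r in range(3)])  # [3712, 3648, 3648]
```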

We saw such failures with Llama-2-7B and Llama-2-70B, where many of these extra allocations happen in the MLP layers and ended up causing allocation failures.

I am not sure what the real motivation for this change is, or what the reason was for choosing that exact number. Were there any experiments, with numbers, results, and conclusions, behind picking it?

It would also be better if this number were configurable (via a JSON config or an environment variable).

The real issue is finding a better way to evenly distribute those allocations, because in the current implementation all of them go to the devices with the lowest ranks (0, 1, 2, and so on).

@nelyahu
Contributor

nelyahu commented May 9, 2024

Tagging original author of the PR @Yejing-Lai
CC: @tjruwase

@Yejing-Lai
Contributor

Hi @oelayan7. Can you share your allocation failure log or screenshot?

The logic of this change is meant to balance the sharding.
For uneven sharding in attention, we need to depend on num_kv_heads. However, for uneven sharding in MLP/lm_head, we can use a near-even division.

DNN libraries favor tensor sizes with power-of-two granularity, so we picked 64 as a common granularity size.

For more details, please see the comments on PR #4697.

@oelayan7
Contributor Author

The issue isn't with the near-even division, it is more with the modulo operation: it depends on the rank of the device, and the devices with low ranks (for example rank 0) will always get the extra blocks and will allocate much more than the others.

@Yejing-Lai
Contributor

Could you give me an example?

For the attention layer, we need to rely on kv_heads to split the qkv weight, so during uneven sharding there will always be more blocks on the first few ranks. For example, for llama-7b with kv_heads=32 and TP=3, the kv_heads are divided into [11, 11, 10], so a weight of shape 4096 is finally divided into [1408, 1408, 1280].

The fundamental reason for the uneven attention division is that the kv corresponding to each q must be on the same rank, i.e. the weight sharding has to follow kv_heads.
Regarding this issue, we are also looking for better ways to perform uneven sharding. If you have any suggestions, we can discuss them together. Thanks!

For the MLP and lm_head layers, we divide the weight as evenly as possible while keeping the better-performing granularity (block_size=64, so the difference between ranks will never be greater than this value). This is a trade-off.
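Spelling out the arithmetic of that attention example (illustrative only; head_dim = 4096 / 32 = 128 is the llama-7b value):

```python
# llama-7b attention split from the example above: kv_heads=32, TP=3.
kv_heads, mp_size, hidden = 32, 3, 4096
head_dim = hidden // kv_heads                      # 128
heads_per_rank = [kv_heads // mp_size + (1 if r < kv_heads % mp_size else 0)
                  for r in range(mp_size)]         # [11, 11, 10]
print([h * head_dim for h in heads_per_rank])      # [1408, 1408, 1280]
```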

@oelayan7
Contributor Author

oelayan7 commented May 15, 2024

@Yejing-Lai
What is the "DNN library"? Why should everyone work this way just because one library favors it? Maybe it is better to make it configurable and generic through the accelerator class.

Even though I don't think it is, let's assume the split is as even as possible: what do you mean by better performance? Can you back up that claim with numbers and the models that were tried? What exactly were the experiments, and what was the performance gain?

@Yejing-Lai
Contributor

Which uneven sharding method do you expect? Can you give an example? Or you could submit a PR to add another path to support uneven sharding.

Maybe we can make the uneven sharding better. Many thanks!

@oelayan7
Contributor Author

I'm not expecting any particular method, I'm just trying to understand the current one.
On which hardware did you test the change in the PR? What was the performance gain?

@nelyahu
Contributor

nelyahu commented May 19, 2024

Hi @tjruwase,
PR #4697 was merged a while back without a concrete performance motivation.
In addition, it increases the imbalance of the weight sharding, since it enforces blocks of 64 elements under the claim that this is faster, but real numbers were never provided (neither here nor in the original PR).
We suggest reverting this change and re-introducing it with more explanation, or perhaps with an option to configure it per accelerator, where the default should remain as it was before.
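As an illustration of what such a configuration option could look like (the environment variable name and helper below are hypothetical, not an existing DeepSpeed API):

```python
import os

# Hypothetical knob: read the sharding granularity from an environment variable,
# defaulting to 1, i.e. a plain near-even split with no 64-element blocks.
SHARD_GRANULARITY = int(os.environ.get("AUTOTP_SHARD_GRANULARITY", "1"))

def shard_size_with_granularity(total_size, mp_size, rank, block=SHARD_GRANULARITY):
    num_blocks, tail = divmod(total_size, block)
    base, extra = divmod(num_blocks, mp_size)
    size = (base + (1 if rank < extra else 0)) * block
    return size + (tail if rank == 0 else 0)  # any sub-block tail goes to rank 0 in this sketch
```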

@Yejing-Lai
Contributor

Hi,
We can add an option to configure the block size, with a default block size of 1. This will change the way the MLP layers are sharded.
For example, sharding the llama-7b MLP first-layer weight of shape 11008 with TP=3:
block size=64 shards it into [3712, 3648, 3648]
block size=1 shards it into [3670, 3669, 3669]

However, this modification will not change the unbalanced attention situation (just like the example above, where the llama-7b attention shards are [1408, 1408, 1280]).
For attention: may I know whether it would improve your extra-allocation situation if each rank got the larger shard in turn?
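For reference, the two MLP splits above and a sketch of the "larger shard in turn" idea (the rotation is only an illustration of that suggestion, not existing DeepSpeed behaviour):

```python
def split(total_size, mp_size, block):
    base, extra = divmod(total_size // block, mp_size)
    return [(base + (1 if r < extra else 0)) * block for r in range(mp_size)]

print(split(11008, 3, 64))  # [3712, 3648, 3648]
print(split(11008, 3, 1))   # [3670, 3669, 3669]

# Rotate which ranks receive the extra blocks, e.g. by layer index,
# so the surplus is not always concentrated on rank 0.
def split_rotated(total_size, mp_size, block, layer_idx):
    sizes = split(total_size, mp_size, block)
    shift = layer_idx % mp_size
    return sizes[shift:] + sizes[:shift]

print(split_rotated(11008, 3, 64, layer_idx=1))  # [3648, 3648, 3712]
```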
