[BUG] Uneven work distribution caused by get_shard_size changes #5515

Open
oelayan7 opened this issue May 9, 2024 · 9 comments

Comments

@oelayan7
Contributor

oelayan7 commented May 9, 2024

def get_shard_size(total_size, mp_size, name=None, rank=None):

This change introduces a new way for AutoTP to handle sharding when split_shape isn't divisible by num_kv_heads. The same way of sharding is applied to mlp, lm_head, and embed_out as well.

In the new approach the total size is split into chunks of 64, and any remainder always takes up a full block of 64, which causes new allocations.
In addition, those extra allocations are always assigned to devices based on their ranks, which means that devices with lower ranks always receive the extra allocations; this ends up filling their memory and causing allocation failures.
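For illustration, here is a minimal sketch of the sharding rule as described above (this is not the actual DeepSpeed implementation, just a model of the behaviour; the helper name is made up, and the 11008/TP=3 numbers come from the llama-7b MLP example discussed later in this thread):

```python
# Sketch of the 64-granularity sharding rule described above (illustrative only).
# The leftover blocks always land on the lowest ranks.
def sketch_shard_size(total_size, mp_size, rank, block=64):
    num_blocks = total_size // block           # assume total_size is a multiple of 64
    base, extra = divmod(num_blocks, mp_size)  # blocks left over after an even split
    # ranks 0 .. extra-1 each receive one additional block of 64 elements
    return (base + (1 if rank < extra else 0)) * block

# e.g. llama-7b MLP first layer, weight shape 11008, TP=3:
print([sketch_shard_size(11008, 3, r) for r in range(3)])  # [3712, 3648, 3648]
```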

We saw such failures with Llama-2-7B and Llama-2-70B, where many of these extra allocations happen in the MLP layers and ended up causing allocation failures.

I am not sure what the real motivation for this change is, or what the reason was for choosing that exact number. Were there any experiments, with numbers, results, and conclusions, behind picking it?

It would also be better if this number were configurable (via a JSON config or an environment variable).

The real issue is finding a better way to evenly distribute those allocations, because in the current implementation all of them go to the devices with the lowest ranks (0, 1, 2, and so on).

@nelyahu
Contributor

nelyahu commented May 9, 2024

Tagging original author of the PR @Yejing-Lai
CC: @tjruwase

@Yejing-Lai
Contributor

Hi @oelayan7. Can you share your allocation failure log or screenshot?

The logic of this change is meant to balance the sharding.
For uneven sharding in attention, we need to depend on num_kv_heads. However, for uneven sharding in MLP/lm_head, we can use a near-even division.

DNN libraries favor tensor sizes with power-of-two granularity, so we picked 64 as a common granularity size.

For more details, please see the comments on PR #4697.

@oelayan7
Contributor Author

The issue isn't with the near-even division, it is more with the modulo operation: it depends on the rank of the device, and the devices with low ranks (for example rank 0) will always get the extra blocks and will allocate much more than the others.

@Yejing-Lai
Contributor

Could you give me an example?

For the attention layer, we need to rely on kv_heads to split the qkv weight, so during uneven sharding there will always be more blocks on the first few ranks. For example, for llama-7b with kv_heads=32 and TP=3, the kv_heads are divided into [11, 11, 10], so a weight of shape 4096 is finally divided into [1408, 1408, 1280].

The fundamental reason for the uneven attention division is that the kv corresponding to each q must be on the same rank, i.e. the weight sharding has to follow kv_heads.
Regarding this issue, we are also looking for better ways to perform uneven sharding. If you have any suggestions, we can discuss them together. Thanks!

For the MLP and lm_head layers, we divide the weight as evenly as possible while keeping the better-performing granularity (block_size=64, so the difference between ranks will never be greater than this value). This is a trade-off.
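Spelling out the arithmetic of that attention example (illustrative only; head_dim = 4096 / 32 = 128 is the llama-7b value):

```python
# llama-7b attention split from the example above: kv_heads=32, TP=3.
kv_heads, mp_size, hidden = 32, 3, 4096
head_dim = hidden // kv_heads                      # 128
heads_per_rank = [kv_heads // mp_size + (1 if r < kv_heads % mp_size else 0)
                  for r in range(mp_size)]         # [11, 11, 10]
print([h * head_dim for h in heads_per_rank])      # [1408, 1408, 1280]
```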

@oelayan7
Contributor Author

oelayan7 commented May 15, 2024

@Yejing-Lai
What is the "DNN library"? Why should everyone work this way just because one library favors it? Maybe it is better to make it configurable and generic through the accelerator class.

Even though I don't think it is, let's assume the split is as even as possible: what do you mean by better performance? Can you back up that claim with numbers and the models that were tried? What exactly were the experiments, and what was the performance gain?

@Yejing-Lai
Contributor

Which uneven sharding method do you expect? Can you give an example? Or you could submit a PR to add another path to support uneven sharding.

Maybe we can make the uneven sharding better. Many thanks!

@oelayan7
Contributor Author

I'm not expecting any particular method, I'm just trying to understand the current one.
On which hardware did you test the change in the PR? What was the performance gain?

@nelyahu
Contributor

nelyahu commented May 19, 2024

Hi @tjruwase,
PR #4697 was merged a while back without a concrete performance motivation.
In addition, it increases the imbalance of the weight sharding, since it enforces blocks of 64 elements under the claim that this is faster, but real numbers were never provided (neither here nor in the original PR).
We suggest reverting this change and re-introducing it with more explanation, or perhaps with an option to configure it per accelerator, where the default should remain as it was before.
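As an illustration of what such a configuration option could look like (the environment variable name and helper below are hypothetical, not an existing DeepSpeed API):

```python
import os

# Hypothetical knob: read the sharding granularity from an environment variable,
# defaulting to 1, i.e. a plain near-even split with no 64-element blocks.
SHARD_GRANULARITY = int(os.environ.get("AUTOTP_SHARD_GRANULARITY", "1"))

def shard_size_with_granularity(total_size, mp_size, rank, block=SHARD_GRANULARITY):
    num_blocks, tail = divmod(total_size, block)
    base, extra = divmod(num_blocks, mp_size)
    size = (base + (1 if rank < extra else 0)) * block
    return size + (tail if rank == 0 else 0)  # any sub-block tail goes to rank 0 in this sketch
```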

@Yejing-Lai
Contributor

Hi,
We can add an option to configure the block size, with a default block size of 1. This will change the way the MLP layers are sharded.
For example, sharding the llama-7b MLP first-layer weight of shape 11008 with TP=3:
block size=64 shards it into [3712, 3648, 3648]
block size=1 shards it into [3670, 3669, 3669]

However, this modification will not change the unbalanced attention situation (just like the example above, where the llama-7b attention shards are [1408, 1408, 1280]).
For attention: may I know whether it would improve your extra-allocation situation if each rank got the larger shard in turn?
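For reference, the two MLP splits above and a sketch of the "larger shard in turn" idea (the rotation is only an illustration of that suggestion, not existing DeepSpeed behaviour):

```python
def split(total_size, mp_size, block):
    base, extra = divmod(total_size // block, mp_size)
    return [(base + (1 if r < extra else 0)) * block for r in range(mp_size)]

print(split(11008, 3, 64))  # [3712, 3648, 3648]
print(split(11008, 3, 1))   # [3670, 3669, 3669]

# Rotate which ranks receive the extra blocks, e.g. by layer index,
# so the surplus is not always concentrated on rank 0.
def split_rotated(total_size, mp_size, block, layer_idx):
    sizes = split(total_size, mp_size, block)
    shift = layer_idx % mp_size
    return sizes[shift:] + sizes[:shift]

print(split_rotated(11008, 3, 64, layer_idx=1))  # [3648, 3648, 3712]
```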
