Is simple_all_reduce also required for capacity_factor > 0 cases? #173

Closed
Fragile-azalea opened this issue Jul 31, 2022 · 6 comments
Labels: bug (Something isn't working)

Comments

@Fragile-azalea

My code seems to hang when the workloads on two GPUs are unbalanced (i.e., scores.size(0) differs between GPUs, for example at the end of a dataset). This in turn makes the capacity computed at Line 178 unequal across GPUs. Is simple_all_reduce also required for capacity_factor > 0 cases?

if capacity_factor > 0:
    capacity = top_k * int(capacity_factor * samples_per_expert)
else:
    capacity = torch.max(torch.cat(locations_s, dim=0))
    capacity = int(simple_all_reduce(capacity, group=group, op=torch.distributed.ReduceOp.MAX)) + 1
    if capacity_factor < 0:
        capacity = min(capacity, top_k * int(-capacity_factor * ((int(scores.size(0)) + num_global_experts - 1) // num_global_experts)))
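For reference, a minimal sketch of the kind of synchronization being asked about, written with plain torch.distributed rather than Tutel's simple_all_reduce wrapper (the helper name and where it would be called are assumptions, not part of Tutel):

import torch
import torch.distributed as dist

def sync_capacity_across_ranks(capacity, group=None):
    # Take the maximum capacity over all ranks so every GPU allocates the
    # same expert buffer size even when local token counts differ.
    # NCCL requires a CUDA tensor; gloo accepts a CPU tensor.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    t = torch.tensor(capacity, dtype=torch.int64, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MAX, group=group)
    return int(t.item())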

@ghostplant
Contributor

ghostplant commented Aug 1, 2022

No, only capacity_factor <= 0 uses that, but your point is a good one. Currently, the local batch size on each GPU is allowed to change across forward steps, but it must stay identical to the other GPUs' within each corresponding step. For your case, I think we need an enhancement to handle what you need. Thanks!

@ghostplant
Contributor

@Fragile-azalea BTW, for your requirement (unbalanced input tokens), a tiny all_reduce in each step is unavoidable; local padding could also work, but it is usually slower for the subsequent compute and all_to_all.

Since most typical training never has this requirement, we'll just add an extra flag for it, disabled by default. Would that be okay for you?
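For context, a rough sketch of the local-padding alternative mentioned above, i.e. padding every rank's input up to the globally largest token count before dispatch (the helper is an illustration, not Tutel's API, and note it still needs the tiny all_reduce to learn the global maximum):

import torch
import torch.distributed as dist
import torch.nn.functional as F

def pad_to_global_max(x, group=None):
    # x: [local_tokens, model_dim]; pad the token dimension up to the largest
    # local_tokens across all ranks so every GPU dispatches the same shape.
    max_tokens = torch.tensor(x.size(0), dtype=torch.int64, device=x.device)
    dist.all_reduce(max_tokens, op=dist.ReduceOp.MAX, group=group)
    pad_rows = int(max_tokens.item()) - x.size(0)
    return F.pad(x, (0, 0, 0, pad_rows)) if pad_rows > 0 else x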

ghostplant added the bug (Something isn't working) label Aug 1, 2022
@Fragile-azalea
Author

Fragile-azalea commented Aug 1, 2022

> No, only capacity_factor <= 0 uses that, but your point is a good one. Currently, the local batch size on each GPU is allowed to change across forward steps, but it must stay identical to the other GPUs' within each corresponding step. For your case, I think we need an enhancement to handle what you need. Thanks!

Thank you for your quick response. If I want the effect of capacity_factor = 2.0 computed from the largest input token count, could I set capacity_factor = -2.0 to achieve the expected result?

@ghostplant
Contributor

@Fragile-azalea Only capacity_factor = 0 guarantees that this problem cannot occur. All other capacity_factor values depend, always or conditionally, on the local scores.size(0), so the resulting capacity may differ across GPUs in your case.
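To make the point concrete, a small illustration with hypothetical numbers, following the capacity formula quoted at the top of the issue (samples_per_expert is assumed here to be the ceiling of scores.size(0) / num_global_experts, as in the negative-factor branch):

# Hypothetical end-of-dataset scenario: two GPUs with unequal local batches.
top_k, capacity_factor, num_global_experts = 2, 2.0, 8

for local_tokens in (1024, 640):   # scores.size(0) on GPU 0 vs. GPU 1
    samples_per_expert = (local_tokens + num_global_experts - 1) // num_global_experts
    capacity = top_k * int(capacity_factor * samples_per_expert)
    print(local_tokens, capacity)  # 1024 -> 512, 640 -> 320: buffer shapes disagree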

ghostplant added commits to ghostplant/tutel that referenced this issue Aug 1, 2022
ghostplant added a commit that referenced this issue Aug 1, 2022
@ghostplant
Contributor

Hi, the latest commit allows feeding inequivalent numbers of input tokens to different GPUs; just explicitly pass inequivalent_tokens=True to the forward function (it is False by default). You can either set it always, or set it only when you cannot guarantee equal token counts, e.g. for the last batch of an epoch.

def forward(self, ..):
  ..
  y = self._moe_layer(x, inequivalent_tokens=True)
  ..
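For example, one way to use it is to enable the flag only for the step that might be short (a sketch mirroring the snippet above; is_last_batch is an illustrative parameter, not part of Tutel):

def forward(self, x, is_last_batch=False):
    # Pay for the extra per-step all_reduce only when local token counts
    # may differ across GPUs, e.g. the final, possibly short batch of an epoch.
    y = self._moe_layer(x, inequivalent_tokens=is_last_batch)
    return y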

@Fragile-azalea
Author

It seems to work well now! Thank you!
