Is simple_all_reduce also required for capacity_factor > 0 cases? #173

Closed
Fragile-azalea opened this issue Jul 31, 2022 · 6 comments
Labels: bug (Something isn't working)

Comments

@Fragile-azalea

My code seems to hang when the workloads on two GPUs are unbalanced (i.e., scores.size(0) differs between GPUs, for example at the end of a dataset). This in turn makes the capacity computed at Line 178 unequal across GPUs. Is simple_all_reduce also required for capacity_factor > 0 cases?

if capacity_factor > 0:
    capacity = top_k * int(capacity_factor * samples_per_expert)
else:
    capacity = torch.max(torch.cat(locations_s, dim=0))
    capacity = int(simple_all_reduce(capacity, group=group, op=torch.distributed.ReduceOp.MAX)) + 1
    if capacity_factor < 0:
        capacity = min(capacity, top_k * int(-capacity_factor * ((int(scores.size(0)) + num_global_experts - 1) // num_global_experts)))
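For reference, a minimal sketch of the kind of synchronization being asked about, written with plain torch.distributed rather than Tutel's simple_all_reduce wrapper (the helper name and where it would be called are assumptions, not part of Tutel):

import torch
import torch.distributed as dist

def sync_capacity_across_ranks(capacity, group=None):
    # Take the maximum capacity over all ranks so every GPU allocates the
    # same expert buffer size even when local token counts differ.
    # NCCL requires a CUDA tensor; gloo accepts a CPU tensor.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    t = torch.tensor(capacity, dtype=torch.int64, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MAX, group=group)
    return int(t.item())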

@ghostplant
Contributor

ghostplant commented Aug 1, 2022

No, only capacity_factor <= 0 uses that, but your point is a good one. Currently, the local batch size on each GPU is allowed to change across forward steps, but it must stay identical to the other GPUs' within each corresponding step. For your case, I think we need an enhancement to handle what you need. Thanks!

@ghostplant
Contributor

@Fragile-azalea BTW, for your requirement (unbalanced input tokens), a tiny all_reduce in each step is unavoidable; local padding could also work, but it is usually slower for the subsequent compute and all_to_all.

Since most typical training never has this requirement, we'll just add an extra flag for it, disabled by default. Would that be okay for you?
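For context, a rough sketch of the local-padding alternative mentioned above, i.e. padding every rank's input up to the globally largest token count before dispatch (the helper is an illustration, not Tutel's API, and note it still needs the tiny all_reduce to learn the global maximum):

import torch
import torch.distributed as dist
import torch.nn.functional as F

def pad_to_global_max(x, group=None):
    # x: [local_tokens, model_dim]; pad the token dimension up to the largest
    # local_tokens across all ranks so every GPU dispatches the same shape.
    max_tokens = torch.tensor(x.size(0), dtype=torch.int64, device=x.device)
    dist.all_reduce(max_tokens, op=dist.ReduceOp.MAX, group=group)
    pad_rows = int(max_tokens.item()) - x.size(0)
    return F.pad(x, (0, 0, 0, pad_rows)) if pad_rows > 0 else x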

ghostplant added the bug (Something isn't working) label Aug 1, 2022
@Fragile-azalea
Author

Fragile-azalea commented Aug 1, 2022

> No, only capacity_factor <= 0 uses that, but your point is a good one. Currently, the local batch size on each GPU is allowed to change across forward steps, but it must stay identical to the other GPUs' within each corresponding step. For your case, I think we need an enhancement to handle what you need. Thanks!

Thank you for your quick response. If I want the effect of capacity_factor = 2.0 computed from the largest input token count, could I set capacity_factor = -2.0 to achieve the expected result?

@ghostplant
Contributor

@Fragile-azalea Only capacity_factor = 0 guarantees that this problem cannot occur. All other capacity_factor values depend, always or conditionally, on the local scores.size(0), so the resulting capacity may differ across GPUs in your case.
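To make the point concrete, a small illustration with hypothetical numbers, following the capacity formula quoted at the top of the issue (samples_per_expert is assumed here to be the ceiling of scores.size(0) / num_global_experts, as in the negative-factor branch):

# Hypothetical end-of-dataset scenario: two GPUs with unequal local batches.
top_k, capacity_factor, num_global_experts = 2, 2.0, 8

for local_tokens in (1024, 640):   # scores.size(0) on GPU 0 vs. GPU 1
    samples_per_expert = (local_tokens + num_global_experts - 1) // num_global_experts
    capacity = top_k * int(capacity_factor * samples_per_expert)
    print(local_tokens, capacity)  # 1024 -> 512, 640 -> 320: buffer shapes disagree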

ghostplant added commits to ghostplant/tutel that referenced this issue Aug 1, 2022
ghostplant added a commit that referenced this issue Aug 1, 2022
@ghostplant
Contributor

Hi, the latest commit allows feeding inequivalent numbers of input tokens to different GPUs; just explicitly pass inequivalent_tokens=True to the forward function (it is False by default). You can either set it always, or set it only when you cannot guarantee equal token counts, e.g. for the last batch of an epoch.

def forward(self, ..):
  ..
  y = self._moe_layer(x, inequivalent_tokens=True)
  ..
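For example, one way to use it is to enable the flag only for the step that might be short (a sketch mirroring the snippet above; is_last_batch is an illustrative parameter, not part of Tutel):

def forward(self, x, is_last_batch=False):
    # Pay for the extra per-step all_reduce only when local token counts
    # may differ across GPUs, e.g. the final, possibly short batch of an epoch.
    y = self._moe_layer(x, inequivalent_tokens=is_last_batch)
    return y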

@Fragile-azalea
Author

It seems to work well now! Thank you!
