Is simple_all_reduce also required for capacity_factor > 0 cases? #173
My code seems to hang when unbalanced workloads exist on two different GPUs (i.e. scores.size(0) is unequal across GPUs, such as at the end of a dataset). This in turn makes the capacity computed at Line 178 differ across GPUs. Is simple_all_reduce also required for capacity_factor > 0 cases?

tutel/tutel/impls/fast_dispatch.py, Lines 177 to 183 in ceba363

Comments
Nope, only …
@Fragile-azalea BTW, for your requirement (unbalanced input tokens), a tiny all_reduce in each step is unavoidable; local padding could also work, but it is usually slower for the compute and all_to_all that follow. Since most typical training never has this requirement, we'll just add an extra flag for it, disabled by default. Would that be okay for you?
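For illustration, here is a minimal sketch (not Tutel's actual implementation) of what such a tiny per-step all_reduce could look like: each rank reduces its local token count with MAX so that every rank derives the same capacity before dispatching. The function name, its arguments, and the capacity formula below are assumptions made for this example.

```python
import torch
import torch.distributed as dist

def agree_on_capacity(local_num_tokens: int, capacity_factor: float,
                      num_global_experts: int, top_k: int, group=None) -> int:
    # Hypothetical helper (assumes the default process group is initialized).
    # A single-element tensor keeps the collective tiny (a few bytes per step).
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    t = torch.tensor([local_num_tokens], dtype=torch.int64, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MAX, group=group)  # global max token count
    max_tokens = int(t.item())
    # Assumed capacity formula, for illustration only; the authoritative one
    # is in tutel/impls/fast_dispatch.py.
    return top_k * int(capacity_factor * ((max_tokens + num_global_experts - 1)
                                          // num_global_experts))
```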
Thank you for your quick response. If I want to set capacity_factor = 2.0 based on the largest input token count, could I set capacity_factor = -2.0 to achieve the expected result?
@Fragile-azalea Only capacity_factor = 0 can guarantee that no such problem exists. Other capacity_factor values are always or conditionally related to the local …
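To see why a non-zero capacity_factor can be tied to the local input, here is a hedged illustration; the formula is an assumption sketched for intuition only, and the authoritative computation is the referenced Lines 177 to 183 of fast_dispatch.py. If capacity is derived from the local number of tokens, two ranks with different scores.size(0) derive different capacities, so the subsequent collectives exchange tensors of mismatched shape, which can hang.

```python
def local_capacity(local_num_tokens, capacity_factor=2.0, num_global_experts=8, top_k=2):
    # Assumed shape of the capacity formula, for illustration only.
    return top_k * int(capacity_factor * ((local_num_tokens + num_global_experts - 1)
                                          // num_global_experts))

print(local_capacity(1024))  # rank 0: full batch          -> 512
print(local_capacity(640))   # rank 1: smaller last batch  -> 320 (shape mismatch)
```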
Hi, the latest commit allows inequivalent input tokens to be fed to different GPUs, just by explicitly specifying:

```python
def forward(self, ..):
    ..
    y = self._moe_layer(x, inequivalent_tokens=True)
    ..
```
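For completeness, a hedged end-to-end sketch of how this flag could be used. The moe_layer construction below follows the general pattern from Tutel's README, but the exact argument names and values are illustrative and may differ across Tutel versions; only inequivalent_tokens=True is taken from the comment above.

```python
import torch
from tutel import moe as tutel_moe

class MoEModel(torch.nn.Module):
    def __init__(self, model_dim=1024, hidden_size=2048, num_local_experts=2):
        super().__init__()
        # Illustrative construction; check the README of your Tutel version.
        self._moe_layer = tutel_moe.moe_layer(
            gate_type={'type': 'top', 'k': 2, 'capacity_factor': 2.0},
            model_dim=model_dim,
            experts={'type': 'ffn', 'count_per_node': num_local_experts,
                     'hidden_size_per_expert': hidden_size},
        )

    def forward(self, x):
        # Ranks may feed different numbers of tokens (e.g. a smaller last
        # batch) without hanging the dispatch collectives.
        return self._moe_layer(x, inequivalent_tokens=True)
```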
It seems to work well now! Thank you! |