Commit
MOE: Fix capacity when using TP for non-MoE
When non-expert layers use TP and the expert layers do not, duplicate tokens sent to the experts are dropped. This is done by slicing the tokens tensor so that each TP rank forwards only 1/TP of the tokens to its experts. For the slicing to be even, the expert capacity must be divisible by the TP degree, so round the capacity up accordingly.

Signed-off-by: Moshe Island <misland@habana.ai>
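A minimal sketch of the fix described above: rounding the capacity up to the nearest multiple of the TP degree so the tokens tensor splits into equal shards. The function and variable names here are illustrative, not the actual identifiers from the patch.

```python
import math

def round_capacity_to_tp(capacity: int, tp_size: int) -> int:
    # Round capacity up to the nearest multiple of tp_size so the
    # tokens tensor [capacity, hidden] can be sliced into tp_size
    # equal chunks, one per TP rank, with no remainder.
    return math.ceil(capacity / tp_size) * tp_size

def rank_slice(capacity: int, tp_size: int, tp_rank: int) -> range:
    # After rounding, each rank handles exactly capacity // tp_size
    # tokens; return the index range this rank is responsible for.
    per_rank = capacity // tp_size
    return range(tp_rank * per_rank, (tp_rank + 1) * per_rank)
```

For example, with a raw capacity of 10 and TP=4, the capacity is padded to 12 so each of the four ranks handles exactly 3 tokens.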