Non-surface function utilities only work for contiguous input data #218

Open
lyd126 opened this issue Nov 6, 2023 · 12 comments

@lyd126

lyd126 commented Nov 6, 2023

According to the paper, when the number of experts is set to 1, the gating score (scores = F.softmax(logits_w_noise, dim=1)) should always equal 1. Consequently, the output variable "y" (y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype), in 'moe_layer.py' at line 304) should be equal to the input variable "x". However, in my experiment, "x" and "y" are sometimes different. The difference first appears at "ctx.config.func_fwd(g, i, l, reshaped_input, dispatched_input, extra=[ctx.config.indices_[0].size(0), ctx.config.aligned_dim, ctx.config.capacity])" in fast_dispatch.py, line 28, and the root source is "tutel_custom_kernel.invoke(inputs, extra, blocks, ctx)" in jit_compiler.py, line 33. How can I fix this problem?
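
For reference, a minimal standalone check of the expectation above (plain PyTorch, not Tutel internals; tensor shapes are made up for illustration): with a single expert, the gate logits have shape [tokens, 1], so the softmax over the expert dimension is identically 1 for every token.

import torch
import torch.nn.functional as F

logits_w_noise = torch.randn(6, 1)          # hypothetical gate logits: 6 tokens, 1 expert
scores = F.softmax(logits_w_noise, dim=1)   # softmax over the (single) expert dimension
print(torch.allclose(scores, torch.ones_like(scores)))  # prints: True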

@ghostplant
Contributor

Can you explain why you expect "x == y" for y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype)?

@lyd126
Author

lyd126 commented Nov 6, 2023

Thank you for your reply. To clarify, I am running the example code
(# Input Example:
import torch
x = torch.ones([6, 1024], device='cuda:0') ......)
with the number of experts set to 1 and an identity activation function (i.e., no activation):
moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 1, 'capacity_factor': 0, 'gate_noise': 1.0},
    experts={'type': 'ffn',
             'count_per_node': 1,
             'hidden_size_per_expert': 32,
             'output_dim': 32,
             'activation_fn': lambda x: x},
    model_dim=x.shape[-1],
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
)
With this configuration, I observe that 'scores' always equals 1 and the output variable 'y' is the same as the input variable 'x' (y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype)). However, when I replace the fully connected layer in my own model with the exact same MoE layer, keeping the same input and weights as in the example code, the resulting "y" is entirely different from the example's. I have identified that the discrepancy occurs specifically at this line: "y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype)". I believe the two models should produce the same output, yet the results are completely different, and I am unsure where the issue lies. Please refer to the attached file for specific details.
Issue_detial.docx
I would be extremely grateful for your response.

@ghostplant
Contributor

Can you set gate_noise = 0 for both and check if they produce the same results?

@lyd126
Author

lyd126 commented Nov 6, 2023

Thank you very much for your prompt reply. I set gate_noise to 0 as you suggested, but the result is the same as before: the two outputs still differ. (When the number of experts equals 1, the softmax result equals 1, so perhaps the gate modification does not affect the final result in this case?)

@ghostplant
Contributor

ghostplant commented Nov 6, 2023

OK, can you help provide the following? For both setups, please add the code below after y = fast_encode(..):

In example code:

...
torch.save([x, crit, y], 'test_cast_example.py')

In model code:

...
torch.save([x, crit, y], 'test_cast_model.py')

It'll help us reproduce and look into what happens in your case. BTW, I assume you use the default setting of self.is_postscore, which equals True, right?
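
For reference, one way to compare the two dumps offline would be (a sketch, assuming both files were written by the torch.save calls above):

import torch

x_e, crit_e, y_e = torch.load('test_cast_example.py')
x_m, crit_m, y_m = torch.load('test_cast_model.py')

print(torch.allclose(x_e, x_m))  # inputs should match if both setups are identical
print(torch.allclose(y_e, y_m))  # outputs reportedly differ in the model case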

@lyd126
Author

lyd126 commented Nov 7, 2023

Thank you again for your reply. I saved the results as you suggested; please see the attachments. As you said, self.is_postscore always equals True. In addition, I would also like to ask what self.is_postscore does.
test_cast_example.zip
test_cast_model.zip

@ghostplant
Contributor

Hi, the current fast_encode(x, ...) requires x to be contiguous, which your model case does not satisfy, so you can get the correct result by calling fast_encode(x.contiguous(), ...). If you use MoELayer directly, the input is cast to contiguous outside the call (https://github.com/microsoft/tutel/blob/main/tutel/impls/moe_layer.py#L247), so it does not run into this problem.

Thanks for this finding, since you use the internal function utilities directly. You are welcome to create a PR that casts the input to contiguous inside this function, so the assumption is always guaranteed.
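
As a standalone illustration of the contiguity pitfall (plain PyTorch, unrelated to Tutel's kernels): a strided view such as a transpose shares storage with the original tensor, so a kernel that assumes compact row-major memory reads the wrong elements unless .contiguous() materializes a packed copy first.

import torch

a = torch.arange(12, dtype=torch.float32).reshape(3, 4)
b = a.t()                     # non-contiguous view: only the strides are swapped
print(b.is_contiguous())      # False
c = b.contiguous()            # packed row-major copy holding the same values
print(c.is_contiguous())      # True
print(torch.equal(b, c))      # True: same values, different memory layout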

@ghostplant ghostplant changed the title Something strange when expert=1 Non-surface function utilities only work for contiguous input data Nov 7, 2023
@lyd126
Author

lyd126 commented Nov 7, 2023

I made the changes you mentioned and the problem is solved perfectly. Also, I'd like to ask: in the top-k = 1 case, if I want to ignore the scores, i.e. y = expert1(x) + expert2(x) + ... + expertn(x) instead of y = score1*expert1(x) + score2*expert2(x) + ... + scoren*expertn(x), how should I set this up?

@ghostplant
Contributor

For now, the score tensor is applied to either x or y, which is specified by is_postscore. Do you want to never use the score tensor at all? If so, the gating section becomes useless.

To force that, please do: scores = torch.ones_like(scores); moe.top_k_routing(scores, top_k=k); ..

@lyd126
Author

lyd126 commented Nov 7, 2023

I hope to use the scores to determine which expert handles the input, i.e. y = expert_n(x) with n = argmax(softmax(score1, score2, ...)), but I want to ignore the score weighting, i.e. y = expert_n(x) instead of y = score_n * expert_n(x). Is this possible?

@ghostplant
Contributor

ghostplant commented Nov 7, 2023

For your purpose, I think you need to delete *self.gates_ from L125 and L129, and rebuild from source.
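
For illustration only, here is a conceptual sketch (hypothetical, plain PyTorch, not the Tutel implementation) of the requested behavior, where the gate only selects the expert and the score is not multiplied into the output:

import torch
import torch.nn.functional as F

def route_top1_unweighted(x, gate, experts):
    # x: [tokens, model_dim]; experts are assumed to map model_dim -> model_dim
    scores = F.softmax(gate(x), dim=-1)   # [tokens, num_experts]
    idx = scores.argmax(dim=-1)           # winning expert per token
    y = torch.empty_like(x)
    for n, expert in enumerate(experts):
        mask = (idx == n)
        if mask.any():
            y[mask] = expert(x[mask])     # note: no score multiplication here
    return y

# Hypothetical usage:
experts = torch.nn.ModuleList([torch.nn.Linear(32, 32) for _ in range(4)])
gate = torch.nn.Linear(32, 4)
x = torch.randn(6, 32)
y = route_top1_unweighted(x, gate, experts)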

@lyd126
Author

lyd126 commented Nov 7, 2023

Thank you very much, the problem has been solved perfectly~~
