Non-surface function utilities only work for contiguous input data #218

Open
lyd126 opened this issue Nov 6, 2023 · 12 comments

@lyd126

lyd126 commented Nov 6, 2023

According to the paper, when the number of experts is set to 1, the gating score (scores = F.softmax(logits_w_noise, dim=1)) should always equal 1. Consequently, the output variable "y" (y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype), in 'moe_layer.py' at line 304) should be equal to the input variable "x". However, in my experiment, "x" and "y" are sometimes different. The difference first appears at "ctx.config.func_fwd(g, i, l, reshaped_input, dispatched_input, extra=[ctx.config.indices_[0].size(0), ctx.config.aligned_dim, ctx.config.capacity])" in fast_dispatch.py, line 28, and the root source is "tutel_custom_kernel.invoke(inputs, extra, blocks, ctx)" in jit_compiler.py, line 33. How can I fix this problem?
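
For reference, a minimal standalone check of the expectation above (plain PyTorch, not Tutel internals; tensor shapes are made up for illustration): with a single expert, the gate logits have shape [tokens, 1], so the softmax over the expert dimension is identically 1 for every token.

import torch
import torch.nn.functional as F

logits_w_noise = torch.randn(6, 1)          # hypothetical gate logits: 6 tokens, 1 expert
scores = F.softmax(logits_w_noise, dim=1)   # softmax over the (single) expert dimension
print(torch.allclose(scores, torch.ones_like(scores)))  # prints: True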

@ghostplant
Contributor

Can you explain why you expect "x == y" for y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype)?

@lyd126
Author

lyd126 commented Nov 6, 2023

Thank you for your reply. To clarify, I am running the example code
(# Input Example:
import torch
x = torch.ones([6, 1024], device='cuda:0') ......)
with the number of experts set to 1 and an identity activation function (i.e., no activation):
moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 1, 'capacity_factor': 0, 'gate_noise': 1.0},
    experts={'type': 'ffn',
             'count_per_node': 1,
             'hidden_size_per_expert': 32,
             'output_dim': 32,
             'activation_fn': lambda x: x},
    model_dim=x.shape[-1],
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
)
With this configuration, I observe that 'scores' always equals 1 and the output variable 'y' is the same as the input variable 'x' (y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype)). However, when I replace the fully connected layer in my own model with the exact same MoE layer, keeping the same input and weights as in the example code, the resulting "y" is entirely different from the example's. I have identified that the discrepancy occurs specifically at this line: "y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype)". I believe the two models should produce the same output, yet the results are completely different, and I am unsure where the issue lies. Please refer to the attached file for specific details.
Issue_detial.docx
I would be extremely grateful for your response.

@ghostplant
Contributor

Can you set gate_noise = 0 for both and check if they produce the same results?

@lyd126
Author

lyd126 commented Nov 6, 2023

Thank you very much for your prompt reply. I set gate_noise to 0 as you suggested, but the result is the same as before: the two outputs still differ. (When the number of experts equals 1, the softmax result equals 1, so perhaps the gate modification does not affect the final result in this case?)

@ghostplant
Contributor

ghostplant commented Nov 6, 2023

OK, can you help provide the following? For both setups, please add the code below after y = fast_encode(..):

In example code:

...
torch.save([x, crit, y], 'test_cast_example.py')

In model code:

...
torch.save([x, crit, y], 'test_cast_model.py')

It'll help us reproduce and look into what happens in your case. BTW, I assume you use the default setting of self.is_postscore, which equals True, right?
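
For reference, one way to compare the two dumps offline would be (a sketch, assuming both files were written by the torch.save calls above):

import torch

x_e, crit_e, y_e = torch.load('test_cast_example.py')
x_m, crit_m, y_m = torch.load('test_cast_model.py')

print(torch.allclose(x_e, x_m))  # inputs should match if both setups are identical
print(torch.allclose(y_e, y_m))  # outputs reportedly differ in the model case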

@lyd126
Author

lyd126 commented Nov 7, 2023

Thank you again for your reply. I saved the results as you suggested; please see the attachments. As you said, self.is_postscore always equals True. In addition, I would also like to ask what self.is_postscore does.
test_cast_example.zip
test_cast_model.zip

@ghostplant
Contributor

Hi, the current fast_encode(x, ...) requires x to be contiguous, which your model case does not satisfy, so you can get the correct result by calling fast_encode(x.contiguous(), ...). If you use MoELayer directly, the input is cast to contiguous outside the call (https://github.com/microsoft/tutel/blob/main/tutel/impls/moe_layer.py#L247), so it does not run into this problem.

Thanks for this finding, since you use the internal function utilities directly. You are welcome to create a PR that casts the input to contiguous inside this function, so the assumption is always guaranteed.
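
As a standalone illustration of the contiguity pitfall (plain PyTorch, unrelated to Tutel's kernels): a strided view such as a transpose shares storage with the original tensor, so a kernel that assumes compact row-major memory reads the wrong elements unless .contiguous() materializes a packed copy first.

import torch

a = torch.arange(12, dtype=torch.float32).reshape(3, 4)
b = a.t()                     # non-contiguous view: only the strides are swapped
print(b.is_contiguous())      # False
c = b.contiguous()            # packed row-major copy holding the same values
print(c.is_contiguous())      # True
print(torch.equal(b, c))      # True: same values, different memory layout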

@ghostplant ghostplant changed the title Something strange when expert=1 Non-surface function utilities only work for contiguous input data Nov 7, 2023
@lyd126
Author

lyd126 commented Nov 7, 2023

I made the changes you mentioned and the problem is solved perfectly. Also, I'd like to ask: in the top-k = 1 case, if I want to ignore the scores, i.e. y = expert1(x) + expert2(x) + ... + expertn(x) instead of y = score1*expert1(x) + score2*expert2(x) + ... + scoren*expertn(x), how should I set this up?

@ghostplant
Contributor

For now, the score tensor is applied to either x or y, which is specified by is_postscore. Do you want to never use the score tensor at all? If so, the gating section becomes useless.

To force that, please do: scores = torch.ones_like(scores); moe.top_k_routing(scores, top_k=k); ..

@lyd126
Author

lyd126 commented Nov 7, 2023

I hope to use the scores to determine which expert handles the input, i.e. y = expert_n(x) with n = argmax(softmax(score1, score2, ...)), but I want to ignore the score weighting, i.e. y = expert_n(x) instead of y = score_n * expert_n(x). Is this possible?

@ghostplant
Contributor

ghostplant commented Nov 7, 2023

For your purpose, I think you need to delete *self.gates_ from L125 and L129, and rebuild from source.
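
For illustration only, here is a conceptual sketch (hypothetical, plain PyTorch, not the Tutel implementation) of the requested behavior, where the gate only selects the expert and the score is not multiplied into the output:

import torch
import torch.nn.functional as F

def route_top1_unweighted(x, gate, experts):
    # x: [tokens, model_dim]; experts are assumed to map model_dim -> model_dim
    scores = F.softmax(gate(x), dim=-1)   # [tokens, num_experts]
    idx = scores.argmax(dim=-1)           # winning expert per token
    y = torch.empty_like(x)
    for n, expert in enumerate(experts):
        mask = (idx == n)
        if mask.any():
            y[mask] = expert(x[mask])     # note: no score multiplication here
    return y

# Hypothetical usage:
experts = torch.nn.ModuleList([torch.nn.Linear(32, 32) for _ in range(4)])
gate = torch.nn.Linear(32, 4)
x = torch.randn(6, 32)
y = route_top1_unweighted(x, gate, experts)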

@lyd126
Author

lyd126 commented Nov 7, 2023

Thank you very much, the problem has been solved perfectly~~
