No mask is set for Attn during training. #67
Comments
The mask is used in: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Translator.py#L130
It seems that this mask is not applied during training.
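For reference, a minimal sketch of what applying a padding mask before the attention softmax during training could look like. The function name, tensor shapes, and `masked_fill` approach are illustrative assumptions, not OpenNMT-py's actual API:

```python
import torch
import torch.nn.functional as F

def attention_weights_with_padding_mask(scores, src_lengths):
    """Mask out padding positions before the softmax so they get zero weight.

    scores:      (batch, tgt_len, src_len) raw attention scores
    src_lengths: (batch,) true (unpadded) source lengths

    Hypothetical helper for illustration only.
    """
    batch_size, _, src_len = scores.size()
    # pad_mask[b, j] is True where source position j is padding in example b.
    positions = torch.arange(src_len, device=scores.device).unsqueeze(0)  # (1, src_len)
    pad_mask = positions >= src_lengths.unsqueeze(1)                      # (batch, src_len)
    # Set padded positions to -inf so the softmax assigns them exactly zero weight.
    scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
    return F.softmax(scores, dim=-1)
```

During translation the linked Translator.py code achieves the same effect; during training an equivalent mask could be built from each batch's source lengths as above.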
@magic282 There's no masking during training in the implementation. I'm not sure whether it would make a huge difference.
I don't think it does, but I also haven't run any comparison tests.
I assumed that since the sentences are sorted by length, with small enough batches and large enough datasets, training batches would be fully filled out? Now I'm not sure anymore...
@vene But with option -extra-shuffle, I guess things will be different.
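To illustrate the intuition in the two comments above: when sentences are sorted by length, each batch spans a narrow range of lengths and needs little padding, whereas batching in a shuffled order brings the padding back. A toy count, using made-up sentence lengths rather than anything from OpenNMT-py:

```python
import random

def padding_tokens(lengths, batch_size):
    """Count padding tokens when consecutive sentences form batches,
    each batch padded to its longest sentence. (Toy illustration only.)"""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - l for l in batch)
    return total

random.seed(0)
lengths = [random.randint(5, 50) for _ in range(10000)]  # made-up sentence lengths

print(padding_tokens(sorted(lengths), batch_size=64))  # sorted by length: very little padding
print(padding_tokens(lengths, batch_size=64))          # shuffled order: far more padding
```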
Anecdotally speaking, I ran an informal comparison and it made almost no difference, since, as @vene said, my dataset was large enough and the batch size was small enough that the majority of batches had no padding.
Thanks for checking @nelson-liu, that makes sense! I wonder if skipping the masking really saves a lot of time during training. With
Old thread; if someone is motivated to implement this, just reopen.
In Decoder.forward, no mask is set for the attention model before the attention computation. The softmax will have 0 (the padding value) as input, and the output will be exp(0) / sum_i exp(x_i) != 0.
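To make the arithmetic concrete, here is a small self-contained check (PyTorch, made-up score values) showing that unmasked padding positions receive nonzero attention weight, while masking them with -inf before the softmax drives their weight to exactly zero:

```python
import torch
import torch.nn.functional as F

# Raw scores for a source of length 5 whose last two positions are padding (score 0).
scores = torch.tensor([2.0, 1.0, 0.5, 0.0, 0.0])

print(F.softmax(scores, dim=-1))
# Without a mask, each padding position gets exp(0) / sum_i exp(x_i) > 0,
# so the context vector mixes in padding embeddings.

pad_mask = torch.tensor([False, False, False, True, True])
masked = scores.masked_fill(pad_mask, float("-inf"))
print(F.softmax(masked, dim=-1))
# With -inf masking, the padding positions get exactly zero weight.
```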