
Add mask is set for Attn during training. #67

Closed

yuntang opened this issue Jun 15, 2017 · 9 comments

Comments

@yuntang

yuntang commented Jun 15, 2017

In Decoder.forward, no mask is set for the attention model before the attention computation. The softmax will have 0 (the padding value) as input, and the output will be exp(0) / sum_i exp(x_i), which is not 0.
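For reference, a minimal PyTorch sketch of the usual fix (made-up function names, not the repository's Decoder.forward): fill the padded source positions with -inf before the softmax so they receive exactly zero attention weight.

```python
import torch
import torch.nn.functional as F

def masked_attention(scores, src_lengths):
    # scores: (batch, tgt_len, src_len) raw attention scores
    # src_lengths: (batch,) true source lengths before padding
    src_len = scores.size(-1)
    # True where the source position is padding
    pad_mask = torch.arange(src_len).unsqueeze(0) >= src_lengths.unsqueeze(1)  # (batch, src_len)
    scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
    return F.softmax(scores, dim=-1)  # padded positions now get exactly 0 weight

scores = torch.zeros(2, 1, 4)    # all-equal scores, as in the comment above
lengths = torch.tensor([4, 2])   # second sentence has 2 padded positions
print(masked_attention(scores, lengths))
# Without the masked_fill, the second row would be [0.25, 0.25, 0.25, 0.25],
# i.e. half of its attention mass would land on padding.
```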

@donglixp
Contributor

@magic282
Contributor

It seems that this mask application step is not used during training.

@donglixp
Contributor

@magic282 There's no mask applied during training in this implementation. I'm not sure whether it would make a huge difference.

@jekbradbury

I don't think it does, but I also haven't run any comparison tests.

@vene

vene commented Jun 27, 2017

I assumed that since the sentences are sorted by length, with small enough batches and large enough datasets, training batches would be fully filled out? Now I'm not sure anymore...

@magic282
Contributor

@vene But with the -extra-shuffle option, I guess things will be different.

@nelson-liu

Anecdotally speaking, I ran an informal comparison and it made almost no difference; as @vene said, my dataset was large enough and the batch size small enough that the majority of batches had no padding.

@vene

vene commented Jun 29, 2017

Thanks for checking @nelson-liu, that makes sense!

I wonder if skipping the masking really saves much time during training. With -extra-shuffle it does indeed seem like a bug, as @magic282 points out. And even with sorted batches and a huge number of sentences in each length bin, there will be some unfortunate batches with one sentence of length d+1 and N-1 sentences of length d, so in those cases the code does not correctly reflect the intended model.
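A quick way to check whether this matters for a given setup (a sketch with a hypothetical helper, not anything in the repository) is to measure the fraction of padded positions the batching actually produces:

```python
def padding_fraction(batches):
    # batches: iterable of lists of sentence lengths, one list per minibatch
    padded = total = 0
    for lengths in batches:
        max_len = max(lengths)
        total += max_len * len(lengths)          # tokens including padding
        padded += sum(max_len - l for l in lengths)  # padding tokens only
    return padded / total

# Length-sorted batching: lengths inside a batch differ only slightly,
# so padding is rare (vene's point).
sorted_batches = [[10, 10, 10, 11], [11, 11, 12, 12]]
# Shuffled batching (e.g. with -extra-shuffle): lengths mix freely, so padding
# grows and the unmasked softmax leaks attention onto it (magic282's point).
shuffled_batches = [[10, 12, 11, 10], [11, 12, 10, 11]]
print(padding_fraction(sorted_batches), padding_fraction(shuffled_batches))
```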

@srush changed the title from "No mask is set for Attn for nmt" to "Add mask is set for Attn during training." Jul 12, 2017
@vince62s
Member

vince62s commented Sep 3, 2018

Old thread; if someone is motivated to implement this, just reopen.

@vince62s closed this as completed Sep 3, 2018