mask for decoder #6

Closed
XiaoLiuAI opened this issue Aug 18, 2018 · 6 comments
@XiaoLiuAI

Hello, I suspect that the mask you use for the decoder is not correct.
In the decoder, the mask you use is a matrix whose upper right triangle (including the diagonal) is filled with ones:

mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)

In [4]: np.cumsum(np.eye(5), 1)
Out[4]:
array([[1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1.],
       [0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])

That means that when you compute self attention, the first word takes the entire output sequence into account through W_mask · V (the masked attention weights applied to V). That is not correct during training, and the problem could also affect prediction.
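
A minimal numpy sketch of my concern (my own toy code, assuming the additive -1e10 masking used in the attention block quoted below): with this upper triangular mask, the first position keeps nonzero weights on every later position after the softmax.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

len_s = 5
logits = np.zeros((len_s, len_s))                  # stand-in for the q·k^T scores
upper = np.cumsum(np.eye(len_s), 1)                # the mask I describe above
weights = softmax(logits + (-1e10) * (1 - upper))  # additive masking as in the code

print(weights[0])            # the first word attends to every position: [0.2 0.2 0.2 0.2 0.2]
print(weights[-1].round(2))  # the last word only sees itself: [0. 0. 0. 0. 1.]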

@XiaoLiuAI
Author

Hi, thank you for your response, but I still want to make sure that I understand correctly. Let me paste the attention block below:

attn = Lambda(lambda x: K.batch_dot(x[0], x[1], axes=[2, 2]) / self.temper)([q, k])
if mask is not None:
    mmask = Lambda(lambda x: (-1e+10) * (1 - x))(mask)
    attn = Add()([attn, mmask])
attn = Activation('softmax')(attn)
attn = self.dropout(attn)
output = Lambda(lambda x: K.batch_dot(x[0], x[1]))([attn, v])

Does the Activation layer ensure that the mask is "column-based"? It applies softmax over the last dimension, which is the column axis of the attention matrix, right?
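
For my own understanding I checked it like this (a quick sketch with tf.keras, not code from this repo): Activation('softmax') indeed normalizes over the last axis, i.e. over the keys for each query row.

import numpy as np
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Activation

# (batch, len_q, len_k) attention scores with arbitrary values
attn = K.constant(np.arange(12, dtype='float32').reshape(1, 3, 4))
out = K.eval(Activation('softmax')(attn))

print(out.sum(axis=-1))  # each query row sums to (approximately) 1, so softmax runs over the key axis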

@XiaoLiuAI
Author

What if I build a matrix whose upper right part is zero and multiply it element-wise with the attention matrix? For example:

mask = tf.matrix_band_part(tf.ones((q.shape[1], k.shape[1])), -1, 0)
...
attn = Multiply()([attn, mask])

Would that have an equivalent effect?
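
For reference, here is what that matrix_band_part call keeps (my own numpy check; np.tril is the equivalent of band_part with num_lower=-1, num_upper=0):

import numpy as np

len_q = len_k = 5
# keep everything on and below the diagonal: a lower triangular matrix of ones
mask = np.tril(np.ones((len_q, len_k)))
print(mask)
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]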

@lsdefine
Owner

Sorry, my previous answer was wrong; I have now found the right answer.
An experiment with 1 - mask + eye (i.e. the upper triangular version of the mask) shows that the training and dev accuracies quickly go to near 100%, but the model cannot process any user input. That means such a model is using future information.
The source of the confusion is that axis 1 is not the column axis here, because there is a leading "Batch" axis.

>>> K.eval(GetSubMask(q))   # mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)
array([[[1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1.]]], dtype=float32)
>>> np.cumsum(np.eye(5), 1)   # Your question
array([[1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1.],
       [0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])
>>> np.cumsum(np.eye(5), 0)   # If no "Batch" axis, the cum axis is 0
array([[1., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0.],
       [1., 1., 1., 1., 0.],
       [1., 1., 1., 1., 1.]])
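
A compact numpy check of the same point (my own sketch): with a leading batch axis, axis 1 is the query/row axis, so the cumsum of the identity along axis 1 gives exactly the lower triangular mask we want.

import numpy as np

with_batch = np.cumsum(np.eye(5)[None, ...], axis=1)  # shape (1, 5, 5), like GetSubMask(q)
no_batch   = np.cumsum(np.eye(5), axis=1)             # shape (5, 5), as in your question

print(np.array_equal(with_batch[0], np.tril(np.ones((5, 5)))))  # True  (lower triangular)
print(np.array_equal(no_batch,      np.triu(np.ones((5, 5)))))  # True  (upper triangular)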

@lsdefine
Owner

We do indeed need a lower (left) triangular mask, as expected.

@XiaoLiuAI
Author

Thanks, that is clear.

@lsdefine
Owner

lsdefine commented Aug 21, 2018 via email
