
tf.sign(tf.abs(tf.reduce_sum vs tf.sign(tf.reduce_sum(tf.abs( for generating masks? #14

Closed
pmixer opened this issue Sep 15, 2020 · 4 comments


@pmixer
Contributor

pmixer commented Sep 15, 2020

key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1))) # (N, T_k)

Hi Guys,
I'm reading the code to port the implementation to PyTorch for personal use. The code is well written and documented, thanks for the great work :)

Moreover, since the self-attention module is borrowed from another project, some details may not be 100% right according to my observation (aside from magic numbers like -2^32+1 used to force softmax to output ~0 for masked entries, which hurts readability). As an example, for query and key mask generation the code uses a tf.sign + tf.abs + tf.reduce_sum combination, but the order looks slightly wrong. Since we want to mask queries/keys that are all zeros along the channel/embedding dimension, the right way should be to first apply abs, then reduce_sum, and finally sign; the current implementation instead applies reduce_sum first, then abs, then sign. The two orders produce the same result in most cases, because a high-dimensional fp32 vector rarely sums to exactly zero, but the current order is still wrong and can produce incorrect masks in corner cases.
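As a quick illustration of that corner case, here is a minimal sketch (the toy values below are made up by me; only the tf.sign/tf.abs/tf.reduce_sum combination is from the original code):

import tensorflow as tf

# Toy batch of key embeddings (made-up values): the second key is a padding
# vector of all zeros, the third is a real key whose entries sum to zero.
keys = tf.constant([[[1.0, 2.0, 3.0],
                     [0.0, 0.0, 0.0],
                     [1.0, -1.0, 0.0]]])  # (N=1, T_k=3, C=3)

# Current order: reduce_sum -> abs -> sign.
mask_current = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1)))  # [[1., 0., 0.]]

# Proposed order: abs -> reduce_sum -> sign.
mask_fixed = tf.sign(tf.reduce_sum(tf.abs(keys), axis=-1))    # [[1., 0., 1.]]

# The current order wrongly masks the third key because its entries happen
# to cancel out, even though it is not a padding vector.
print(mask_current.numpy(), mask_fixed.numpy())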

I just want to check my assumption as stated above; please respond if you happen to have time @kang205 @JiachengLi1995, thanks!

Regards,
Zan

@kang205
Owner

kang205 commented Sep 15, 2020

Thanks for looking into the details of the implementation. It seems the self-attention modules in the original repo (https://github.com/Kyubyong/transformer) have been updated, and the code looks much simpler and cleaner now. I remember using that version for SASRec in my own experiments (e.g. replacing the code in modules.py), and it seemed to work as well. I'm not sure if they fixed this issue; if so, we may update the GitHub code to use the newer self-attention implementation.

@pmixer
Contributor Author

pmixer commented Sep 15, 2020


Thank you Wang-Cheng! It's just a small issue. And yes, they have updated the transformer repo and commented out this part (they realized our life would be easier if the masks are passed in as arguments and obtained at the sequence level via src_masks = tf.math.equal(x, 0), rather than at the sequence-embedding level, lol). I'll try to get it confirmed by cc'ing this thread to the transformer repo you mentioned and will keep you updated on any feedback.
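For context, a minimal sketch of that ID-level masking style (the padding id 0 and the toy sequences below are assumptions on my side; only src_masks = tf.math.equal(x, 0) comes from the upstream repo):

import tensorflow as tf

# Made-up token-id sequences, assuming 0 is the padding id.
x = tf.constant([[5, 3, 0, 0],
                 [7, 0, 0, 0]])   # (N, T)

# Newer Kyubyong/transformer style: build the mask from the ids directly,
# before the embedding lookup, instead of from the embedded vectors.
src_masks = tf.math.equal(x, 0)   # True where the position is padding
# [[False, False,  True,  True],
#  [False,  True,  True,  True]]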

I created this issue just FYI; there's no need to update the repo for experiment purposes, but it's better to let peers know about this detail.

Have a good day :)

@pmixer
Contributor Author

pmixer commented Sep 15, 2020


Oh, I just found that they did correct it before commenting out this part, please see:

https://github.com/Kyubyong/transformer/blob/fb023bb097e08d53baf25b46a9da490beba51a21/modules.py#L134

As I expected, abs should be applied before reduce_sum. No need to ping the folks in that repo now :)
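For completeness, a minimal sketch of how the corrected mask feeds into the -2^32+1 trick mentioned above (shapes and variable names here are illustrative, not copied from modules.py):

import tensorflow as tf

scores = tf.random.normal([1, 3, 3])      # (N, T_q, T_k) attention logits
keys = tf.constant([[[1.0, 2.0, 3.0],
                     [0.0, 0.0, 0.0],
                     [1.0, -1.0, 0.0]]])  # (N, T_k, C), second key is padding

# Corrected order: abs before reduce_sum.
key_masks = tf.sign(tf.reduce_sum(tf.abs(keys), axis=-1))    # (N, T_k)
key_masks = tf.tile(key_masks[:, tf.newaxis, :], [1, 3, 1])  # broadcast over T_q

# Masked positions get a huge negative logit so softmax assigns them ~0 weight.
paddings = tf.ones_like(scores) * (-2 ** 32 + 1)
scores = tf.where(tf.equal(key_masks, 0), paddings, scores)
attn = tf.nn.softmax(scores, axis=-1)     # padded keys receive ~0 attention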

@kang205
Owner

kang205 commented Sep 15, 2020

Thanks for checking this!
