Why doesn't AutoregressiveWrapper sum the encoder aux loss? #9
Comments
@tomweingarten Hi Tom! I can't believe you spotted this! The reason is an outstanding bug that I couldn't solve. The bug occurs in a specific edge case when reversibility is turned on for the decoder (the encoder is fine). For some reason, summing the auxiliary losses from both the encoder and the decoder breaks things. Feel free to leave this open until I (or maybe you?) resolve it!
@tomweingarten Are you using reversibility in your decoder? If not, I could push a new version where both auxiliary losses are summed when the decoder is not using reversibility. I found that the networks converge even without the commitment loss, so I thought I would just omit the encoder commitment loss.
@tomweingarten I just pushed a new version where the encoder auxiliary loss is added in the case that the decoder is not reversible. I just realized, since I recently added mixture of experts (which needs its own auxiliary loss), that I may have to disable reversibility in the decoder altogether until this is fixed.
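The conditional summing described above can be sketched roughly like this. Note that `compute_loss` and its arguments are hypothetical names for illustration, not the actual wrapper's API:

```python
def compute_loss(ce_loss, enc_aux_loss, dec_aux_loss, decoder_reversible):
    """Hypothetical sketch: add the decoder auxiliary loss to the
    cross-entropy loss unconditionally, but include the encoder
    auxiliary loss only when the decoder is NOT reversible (the
    unresolved edge case discussed above)."""
    loss = ce_loss + dec_aux_loss
    if not decoder_reversible:
        # Safe to also include the encoder commitment / MoE aux loss
        loss = loss + enc_aux_loss
    return loss
```

The same pattern would apply whether the losses are floats or scalar tensors, since only addition is involved.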
also, what are you training RT on? 🤔
@tomweingarten I discovered a semi-reasonable solution to the problem! 93ec372
closing because solution is found, even if not the cleanest |
Thanks for the fix and sorry for my slow reply! I'll email you with some more details. I'm mostly interested in biological uses for Transformers, but for fun I've been playing around on a timeseries with mixed inputs and outputs.
@tomweingarten yea! I think sparse attention is perfect for bioinformatics!
Sorry if this is a dumb question, but I couldn't find a good explanation. The auxiliary loss of the decoder is summed with the cross-entropy loss and returned for back-propagation. The auxiliary loss of the encoder is just thrown away. What's the rationale for that? Thanks!