
Why doesn't AutoregressiveWrapper sum the encoder aux loss? #9

Closed

tomweingarten opened this issue Jul 18, 2020 · 8 comments

@tomweingarten
Contributor

Sorry if this is a dumb question, but I couldn't find a good explanation. The auxiliary loss of the decoder is summed with the cross-entropy loss and returned for back-propagation, but the auxiliary loss of the encoder is just thrown away. What's the rationale for that? Thanks!
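For readers skimming the thread, here is a minimal sketch of the pattern in question, assuming hypothetical `enc`/`dec` modules that each return `(output, aux_loss)`; this is illustrative only, not the repo's actual `AutoregressiveWrapper` code:

```python
import torch.nn.functional as F

def training_step(enc, dec, src, tgt_in, tgt_out):
    # hypothetical modules: each returns (output, auxiliary_loss)
    context, enc_aux_loss = enc(src)
    logits, dec_aux_loss = dec(tgt_in, context=context)

    # logits: (batch, seq, vocab) -> cross_entropy wants (batch, vocab, seq)
    ce_loss = F.cross_entropy(logits.transpose(1, 2), tgt_out)

    # the question: dec_aux_loss is summed in, but enc_aux_loss
    # was being discarded instead of added here as well
    return ce_loss + dec_aux_loss + enc_aux_loss
```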

@lucidrains
Owner

@tomweingarten Hi Tom! I cannot believe you spotted this! The reason is that there is an outstanding bug I couldn't solve. It occurs in a specific edge case when reversibility is turned on for the decoder (the encoder is fine): for some reason, having auxiliary losses summed from both the encoder and the decoder breaks things. Feel free to leave this open until I (or maybe you?) resolve it!

@lucidrains
Owner

lucidrains commented Jul 18, 2020

@tomweingarten Are you using reversibility in your decoder? If not, I could push a new version where both auxiliary losses are summed whenever the decoder is not using reversibility. I found that the networks converge even without the commitment loss, so I thought I would just omit the encoder's commitment loss.

@lucidrains
Owner

@tomweingarten I just pushed a new version where the encoder auxiliary loss is added in the case that the decoder is not reversible. I also just realized, since I recently added mixture of experts (which needs an auxiliary loss), that I may have to disable reversibility in the decoder altogether until this is fixed.
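A minimal sketch of the conditional workaround described above; the function and argument names are assumptions for illustration, not the code from the actual commit:

```python
import torch

def total_loss(ce_loss: torch.Tensor,
               dec_aux_loss: torch.Tensor,
               enc_aux_loss: torch.Tensor,
               decoder_reversible: bool) -> torch.Tensor:
    # the decoder's auxiliary loss is always included
    loss = ce_loss + dec_aux_loss
    # the encoder's auxiliary loss is only folded in when the
    # decoder is not reversible, sidestepping the reversibility bug
    if not decoder_reversible:
        loss = loss + enc_aux_loss
    return loss
```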

@lucidrains
Owner

also, what are you training RT on? 🤔

@lucidrains
Owner

@tomweingarten I discovered a semi-reasonable solution to the problem! 93ec372

@lucidrains
Owner

closing because a solution was found, even if it's not the cleanest

@tomweingarten
Contributor Author

Thanks for the fix and sorry for my slow reply! I'll email you with some more details. I'm mostly interested in biological uses for Transformers but for fun I've been playing around on a timeseries with mixed inputs and outputs.

@lucidrains
Owner

@tomweingarten yea! I think sparse attention is perfect for bioinformatics!
