Why doesn't AutoregressiveWrapper sum the encoder aux loss? #9
Comments
@tomweingarten Hi Tom! I can't believe you spotted this! The reason is an outstanding bug that I couldn't solve. The bug occurs in a specific edge case when reversibility is turned on for the decoder (the encoder is fine). For some reason, summing the auxiliary losses from both the encoder and the decoder breaks things. Feel free to leave this open until I (or maybe you?) resolve it!
@tomweingarten Are you using reversibility in your decoder? If not, I could push a new version where both auxiliary losses are summed when the decoder is not using reversibility. I found that the networks converge even without the commitment loss, so I thought I would just omit the encoder commitment loss.
@tomweingarten I just pushed a new version where the encoder auxiliary loss is added in the case that the decoder is not reversible. I just realized, since I recently added mixture of experts (which needs its own auxiliary loss), that I may have to disable reversibility in the decoder altogether until this is fixed.
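The conditional summing described above can be sketched roughly like this. Note that `compute_loss` and its arguments are hypothetical names for illustration, not the actual wrapper's API:

```python
def compute_loss(ce_loss, enc_aux_loss, dec_aux_loss, decoder_reversible):
    """Hypothetical sketch: add the decoder auxiliary loss to the
    cross-entropy loss unconditionally, but include the encoder
    auxiliary loss only when the decoder is NOT reversible (the
    unresolved edge case discussed above)."""
    loss = ce_loss + dec_aux_loss
    if not decoder_reversible:
        # Safe to also include the encoder commitment / MoE aux loss
        loss = loss + enc_aux_loss
    return loss
```

The same pattern would apply whether the losses are floats or scalar tensors, since only addition is involved.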
also, what are you training RT on? 🤔
@tomweingarten I discovered a semi-reasonable solution to the problem! 93ec372
closing because solution is found, even if not the cleanest |
Thanks for the fix and sorry for my slow reply! I'll email you with some more details. I'm mostly interested in biological uses for Transformers, but for fun I've been playing around on a timeseries with mixed inputs and outputs.
@tomweingarten yea! I think sparse attention is perfect for bioinformatics!
Sorry if this is a dumb question, but I couldn't find a good explanation. The auxiliary loss of the decoder is summed with the cross-entropy loss and returned for back-propagation. The auxiliary loss of the encoder is just thrown away. What's the rationale for that? Thanks!