
Some questions about dropout #25

Closed · tatp22 opened this issue Sep 13, 2020 · 6 comments

tatp22 (Contributor) commented Sep 13, 2020

Hi again @lucidrains, I have a few quick questions about dropout in the Sinkhorn Transformer. I was using my Linformer implementation (which, as you know, is based off of this repo), and it was overfitting my dataset. That got me looking at how dropout is handled in your implementation, and I wanted to ask whether some of the design choices here were intentional:

  1. In the original Transformer, dropout was performed after each sublayer, before the residual connection. I noticed that you only have this after the SinkhornSelfAttention class, but not after the FeedForward class. Is this intentional?
  2. Speaking of the FeedForward class, you insert dropout after the first linear layer. I couldn't find this anywhere in the literature; were you able to find a reference for why it is effective? I put it into my implementation and it seems to help, but I just don't know where the idea came from.
  3. On a similar note, do you know why the dots tensor in the self-attention classes is dropped out? Again, I put it in my Linformer and it seems to work, but I can't find a reference for it in the literature.
  4. Finally, the original Transformer also applied dropout to the input embeddings, like so (from the SinkhornTransformerLM class):
    def forward(self, x, **kwargs):
        _, t, device = *x.shape, x.device
        assert t <= self.max_seq_len, f'sequence length {t} is greater than maximum sequence length {self.max_seq_len}'

        x = self.to_token_emb(x)
        x = self.axial_pos_emb(x) + x
        """ Dropout would go here"""
        x = self.sinkhorn_transformer(x, **kwargs)
        return self.to_logits(x)

Should the embeddings be dropped out here as well? (All four placements are sketched below.)
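
For concreteness, here is a minimal PyTorch sketch of the four placements in question. The class and argument names are illustrative (it is not code from either repo), and layer normalization is omitted for brevity:

    # Minimal sketch of the four dropout placements discussed above.
    # Names are illustrative, not taken from either repo; layer norm omitted.
    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        def __init__(self, dim, mult=4, dropout=0.1):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, dim * mult),
                nn.GELU(),
                nn.Dropout(dropout),  # (2) dropout after the first linear layer
                nn.Linear(dim * mult, dim),
            )

        def forward(self, x):
            return self.net(x)

    class Attention(nn.Module):
        def __init__(self, dim, heads=8, dropout=0.1):
            super().__init__()
            self.heads = heads
            self.scale = (dim // heads) ** -0.5
            self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
            self.attn_dropout = nn.Dropout(dropout)  # (3) dropout on the attention weights ("dots")
            self.to_out = nn.Linear(dim, dim)

        def forward(self, x):
            b, n, d = x.shape
            h = self.heads
            q, k, v = (t.reshape(b, n, h, d // h).transpose(1, 2)
                       for t in self.to_qkv(x).chunk(3, dim=-1))
            dots = (q @ k.transpose(-2, -1)) * self.scale
            attn = self.attn_dropout(dots.softmax(dim=-1))
            out = (attn @ v).transpose(1, 2).reshape(b, n, d)
            return self.to_out(out)

    class Residual(nn.Module):
        def __init__(self, fn, dropout=0.1):
            super().__init__()
            self.fn = fn
            self.dropout = nn.Dropout(dropout)  # (1) dropout on the sublayer output, before the residual add

        def forward(self, x):
            return x + self.dropout(self.fn(x))

    class TinyTransformerLM(nn.Module):
        def __init__(self, num_tokens, dim, depth, max_seq_len, dropout=0.1):
            super().__init__()
            self.token_emb = nn.Embedding(num_tokens, dim)
            self.pos_emb = nn.Embedding(max_seq_len, dim)
            self.emb_dropout = nn.Dropout(dropout)  # (4) dropout on token + positional embeddings
            self.layers = nn.ModuleList([nn.ModuleList([
                Residual(Attention(dim, dropout=dropout), dropout),
                Residual(FeedForward(dim, dropout=dropout), dropout),
            ]) for _ in range(depth)])
            self.to_logits = nn.Linear(dim, num_tokens)

        def forward(self, x):
            _, t = x.shape
            x = self.token_emb(x) + self.pos_emb(torch.arange(t, device=x.device))
            x = self.emb_dropout(x)
            for attn, ff in self.layers:
                x = ff(attn(x))
            return self.to_logits(x)

(If I recall correctly, placements 2 and 3 also show up in the Annotated Transformer's position-wise feed-forward and attention functions, so there is at least some precedent for them.)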

I have now updated my repo so that all four of these dropout options exist. I'll let you know whether this helps with the overfitting.

Thank you for your time!

lucidrains (Owner) commented

@tatp22 Hey Peter! I followed most of the dropout placements from the Annotated Transformer example at https://nlp.seas.harvard.edu/2018/04/03/attention.html. You are right, I have also encountered dropout applied right after the embedding. However, given GPT-3 and the findings around the double descent curve, I think the most effective way to counter overfitting is still more data, pretraining, and a larger model.

Do let me know what your findings with dropout placement are in the small-data regime. :)

lucidrains (Owner) commented

@tatp22 Any new follow-ups with the Linformer? The authors told me they were planning an autoregressive version. Does it work well for your problems?

tatp22 (Contributor, Author) commented Sep 16, 2020

Yeah, that may be the case. I think that simply adding more parameters and more data is probably the way to go, since I really couldn't find anything else wrong with my Linformer...

And no, there aren't really any updates from my end on the Linformer. Maybe when there's an autoregressive version, I will implement that, but for now, I really don't see what else I could add to it, other than what's on there already...

As for how it works on my problems, it performs similarly to this repo, actually. That's probably a good sign; I expect most of these sparse attention mechanisms to perform similarly on benchmarks, so the fact that these two are comparable is reassuring. I can send you my comparison if you'd like to see the fine details.

lucidrains (Owner) commented

@tatp22 Thanks for letting me know! What sequence lengths are you working at? (There hasn't been much data on the Linformer at lengths greater than 4096, so any additional data point is appreciated.)

tatp22 (Contributor, Author) commented Sep 16, 2020

@lucidrains Right now I'm only working at sequence lengths of 2048, and I'm planning to scale up to 8096 soonish. To be honest, I did try a (practical) sequence length of 250k+ (with a k of 150), and it did end up competing with the baselines, but I am not pursuing those experiments further at the moment.

In my experience, though, the Linformer works very well in practice (relative to the standard Transformer), even when k is very small. However, one thing to watch out for is that the parameter count can seriously explode at longer sequence lengths (compared to standard attention, and even compared to this repo); a rough illustration is sketched below.
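
As a back-of-the-envelope illustration (assuming one (seq_len × k) down-projection shared by keys and values per layer, and an illustrative depth of 12; the exact count depends on the parameter-sharing scheme you pick):

    # Rough count of the Linformer's extra projection parameters.
    # Assumes one (seq_len x k) projection shared by keys and values per layer;
    # depth=12 is illustrative, and exact counts depend on the sharing scheme.
    def linformer_projection_params(seq_len, k, depth, share_kv=True):
        per_layer = seq_len * k * (1 if share_kv else 2)
        return per_layer * depth

    for n in (2048, 8192, 250_000):
        extra = linformer_projection_params(seq_len=n, k=150, depth=12)
        print(f"seq_len={n:>7}: ~{extra / 1e6:.1f}M extra projection parameters")

Standard attention adds none of these weights; its cost shows up in activation memory instead, which is why the gap in parameter count widens as the sequence length grows.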

Personally, I have a feeling that attention in general does not need to be quadratic (in time and space), and there may well be better architectures that are faster and more memory efficient. Unfortunately, I am not in a position to investigate this right now due to time constraints.

lucidrains (Owner) commented

@tatp22 Nice! Thank you for sharing your experience :)
