Hilach/dropout #7
@hilach70 yes, you are right! I've modified it accordingly in the latest commit!
@lucidrains @hila-chefer I spent some time pondering these dropouts as well; my first training run without dropout overfit early despite heavy image augs. I've been training since yesterday with dropouts closer to what was just added here. However, I haven't added the post pos/cls_token embedding dropout yet, because the language in the paper is ambiguous even though BERT models do typically have that dropout layer. The paper read to me as 'is applied after every dense layer EXCEPT for 1) the qkv-projections AND 2) directly after adding positional- to patch embeddings'... but it could also be 'applied after every dense layer except for the qkv-projections, and also applied directly after adding positional- to patch embeddings' (the two readings are sketched below). If either of you is training, I'd be curious to know how it works out. I'll share training results/hparams when I get something working well.
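To make the two readings concrete, here's a rough sketch of where the embedding dropout would or wouldn't go (the module and `dropout_after_pos` flag are illustrative names of mine, not vit-pytorch's actual code):

```python
import torch
import torch.nn as nn

class PatchAndPosEmbed(nn.Module):
    """Hypothetical embedding stage; toggles the post-pos-embedding dropout."""
    def __init__(self, dim=128, num_patches=64, emb_dropout=0.1,
                 dropout_after_pos=True):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        # Reading 2 (BERT-style): dropout directly after adding positional
        # embeddings. Reading 1 excludes this spot: dropout_after_pos=False.
        self.dropout = nn.Dropout(emb_dropout) if dropout_after_pos else nn.Identity()

    def forward(self, x):
        # x: (batch, num_patches, dim) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat((cls, x), dim=1)
        x = x + self.pos_embedding
        return self.dropout(x)
```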
@rwightman that's a VERY good point about the positional embedding. It's really ambiguous, and you've managed to convince me that your interpretation may be the correct one!
@hila-chefer I'm also training on ImageNet, which I don't think is that huge for these models. At 200 of 300 epochs with cosine decay it was pretty much done and validation loss was rising. I was training a smaller-than-base model with fewer layers and a smaller MLP (roughly 50M params instead of the base's 87M).
@rwightman I see; I agree it's not big for this model. They use JFT-300M, so obviously it's small in comparison; I thought you may have tried a small dataset to get started... I think 300 epochs sounds reasonable, since this is the number of epochs they report in their paper for ImageNet.
@lucidrains @rwightman |
@hila-chefer I've got a training run that's going pretty well right now; the command line with hparams is at the issue linked below. It's a model that's smaller than the paper's 'base' to ease training times on my 2x GPU setup.
@rwightman Thanks! :) |
Thanks for the useful resource @lucidrains!
I saw this part from the paper:
Dropout, when used, is applied after every dense layer except for the qkv-projections and directly after adding positional- to patch embeddings.
So 3 things regarding the dropouts you added:
Kindly let me know if you have any comments/disagreements.
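For reference, here's a minimal sketch of the quoted placement rule, i.e. dropout after every dense layer except the qkv-projections (class names are illustrative, not the actual modules in this repo):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),   # after the first dense layer
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),   # after the second dense layer
        )

    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads=8, dropout=0.1):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        # no dropout after the qkv projection, per the quoted sentence
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Sequential(
            nn.Linear(dim, dim),
            nn.Dropout(dropout),   # after the output dense layer
        )

    def forward(self, x):
        b, n, d, h = *x.shape, self.heads
        q, k, v = (t.reshape(b, n, h, d // h).transpose(1, 2)
                   for t in self.to_qkv(x).chunk(3, dim=-1))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```

Note that the attention-weight dropout some implementations apply to the softmax output is a separate knob; the quoted sentence only covers dense layers and the embedding addition.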