
Hilach/dropout #7

Closed · wants to merge 4 commits

Conversation


@hila-chefer commented Oct 14, 2020

Thanks for the useful resource @lucidrains!
I saw this part from the paper:
Dropout, when used, is applied after every dense layer except for the qkv-projections and directly after adding positional- to patch embeddings.
So 3 things regarding the dropouts you added:

  1. I think there's a dropout missing right after adding the positional embedding to the patch embeddings.
  2. I don't see why there should be a dropout after the dot product in the attention (line 57); it's not a dense layer, and it isn't mentioned in the description above.
  3. Considering they refer to the qkv-projections as dense layers, I think they mean that all linear layers are dense layers, so I added dropouts after each linear layer (see the sketch below).
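
For concreteness, here's a minimal sketch of the placements described above (class and parameter names are illustrative, not the exact ones from this repo):

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    # Point 1: dropout directly after adding the positional embedding.
    def __init__(self, num_tokens, dim, emb_dropout=0.1):
        super().__init__()
        self.pos_embedding = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.dropout = nn.Dropout(emb_dropout)

    def forward(self, x):  # x: (batch, num_tokens, dim)
        return self.dropout(x + self.pos_embedding)

class FeedForward(nn.Module):
    # Point 3: dropout after every linear (dense) layer in the MLP.
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    # Points 2 and 3: no dropout on the qkv projection or on the
    # attention matrix itself, but dropout after the output projection.
    def __init__(self, dim, heads=8, dropout=0.1):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)  # no dropout here
        self.to_out = nn.Sequential(nn.Linear(dim, dim), nn.Dropout(dropout))

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2) for t in qkv)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        # note: no dropout applied to `attn` (point 2)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```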

Kindly let me know if you have any comments/disagreements.

@lucidrains
Owner

@hilach70 Yes, you are right! I've modified it accordingly in the latest commit!

@lucidrains closed this Oct 14, 2020
@rwightman

@lucidrains @hila-chefer I spent some time pondering these dropouts as well; my first training run without dropout overfit early despite heavy image augs... I've been training since yesterday with dropouts closer to what was just added here.

However, I didn't add the post pos/cls_token embedding dropout yet, because the language in the paper was ambiguous, even though BERT models typically do have that dropout layer.

The paper read to me as 'is applied after every dense layer EXCEPT for 1) the qkv-projections AND 2) directly after adding positional- to patch embeddings'... but it could also be 'applied after every dense layer except for the qkv-projections, and also applied directly after adding positional- to patch embeddings'.
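
In code, the two readings differ only in whether the embedding sum passes through a dropout layer. A hypothetical sketch (illustrative tensor sizes, not the repo's actual code):

```python
import torch
import torch.nn as nn

emb_dropout = nn.Dropout(0.1)
patch_emb = torch.randn(1, 65, 128)  # (batch, tokens, dim)
pos_emb = torch.randn(1, 65, 128)

# Reading 1: the exception clause also covers the embedding sum,
# so there is NO dropout right after the addition:
x = patch_emb + pos_emb

# Reading 2: dropout IS applied directly after the addition
# (as in BERT's embedding layer):
x = emb_dropout(patch_emb + pos_emb)
```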

If either of you is training, I'd be curious to know how it works out. I'll share training results/hparams when I get something working well.

@hila-chefer
Author

hila-chefer commented Oct 14, 2020

@rwightman That's a VERY good point about the positional embedding; it really is ambiguous, and you've managed to convince me that your interpretation may be the correct one!
I'm actually training on ImageNet right now and the model doesn't overfit at all, but on the other hand it's a big dataset.
Which dataset are you using?

@rwightman

@hila-chefer I'm also training on ImageNet, which I don't think is that huge for these models. At 200 of 300 epochs with cosine decay it was pretty much done and validation loss was rising. I was training a smaller-than-base model with fewer layers and a smaller MLP (roughly 50M params instead of the base's 87M).

@hila-chefer
Author

@rwightman I see; I agree it's not big for this model. They use JFT-300M, so ImageNet is obviously small in comparison; I thought you might have tried a small dataset to get started... I think 300 epochs sounds reasonable, since that's the number of epochs they report in the paper for ImageNet.

@hila-chefer
Author

@lucidrains @rwightman
It turns out my training saturates at around 50% validation accuracy. Can either of you share your hyperparameter choices and results after 200-300 epochs?
Thanks!

@rwightman

rwightman commented Oct 18, 2020

@hila-chefer I've got a training run that's going pretty well right now. The command line with hparams is at the issue linked below. It's a model smaller than the paper's 'base' to ease training times on my 2x GPU setup.

huggingface/pytorch-image-models#252

@hila-chefer
Author

@rwightman Thanks! :)

@lucidrains mentioned this pull request Mar 30, 2022