Hilach/dropout #7
@hilach70 yes, you are right! I've modified it accordingly in the latest commit!
@lucidrains @hila-chefer I spent some time pondering these dropouts as well; my first training run without dropout overfit early despite heavy image augs. I've been training since yesterday with dropouts closer to what was just added here. However, I haven't added the post pos/cls_token embedding dropout yet, because the language in the paper is ambiguous even though BERT models do typically have that dropout layer. The paper read to me as 'is applied after every dense layer EXCEPT for 1) the qkv-projections AND 2) directly after adding positional- to patch embeddings'... but it could also be 'applied after every dense layer except for the qkv-projections, and also applied directly after adding positional- to patch embeddings' (the two readings are sketched below). If either of you is training, I'd be curious to know how it works out. I'll share training results/hparams when I get something working well.
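To make the two readings concrete, here's a rough sketch of where the embedding dropout would or wouldn't go (the module and `dropout_after_pos` flag are illustrative names of mine, not vit-pytorch's actual code):

```python
import torch
import torch.nn as nn

class PatchAndPosEmbed(nn.Module):
    """Hypothetical embedding stage; toggles the post-pos-embedding dropout."""
    def __init__(self, dim=128, num_patches=64, emb_dropout=0.1,
                 dropout_after_pos=True):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        # Reading 2 (BERT-style): dropout directly after adding positional
        # embeddings. Reading 1 excludes this spot: dropout_after_pos=False.
        self.dropout = nn.Dropout(emb_dropout) if dropout_after_pos else nn.Identity()

    def forward(self, x):
        # x: (batch, num_patches, dim) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat((cls, x), dim=1)
        x = x + self.pos_embedding
        return self.dropout(x)
```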
@rwightman that's a VERY good point about the positional embedding. It's really ambiguous, and you've managed to convince me that your interpretation may be the correct one!
@hila-chefer I'm also training on ImageNet, which I don't think is that huge for these models. At 200 of 300 epochs with cosine decay it was pretty much done and validation loss was rising. I was training a smaller-than-base model with fewer layers and a smaller MLP (roughly 50M params instead of the base's 87M).
@rwightman I see; I agree it's not big for this model. They use JFT-300M, so obviously it's small in comparison; I thought you may have tried a small dataset to get started... I think 300 epochs sounds reasonable, since this is the number of epochs they report in their paper for ImageNet.
@lucidrains @rwightman |
@hila-chefer I've got a training run that's going pretty well right now; the command line with hparams is at the issue linked below. It's a model that's smaller than the paper's 'base' to ease training times on my 2x GPU setup.
@rwightman Thanks! :) |
Thanks for the useful resource @lucidrains!
I saw this part from the paper:
Dropout, when used, is applied after every dense layer except for the qkv-projections and directly after adding positional- to patch embeddings.
So 3 things regarding the dropouts you added:
Kindly let me know if you have any comments/disagreements.
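For reference, here's a minimal sketch of the quoted placement rule, i.e. dropout after every dense layer except the qkv-projections (class names are illustrative, not the actual modules in this repo):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),   # after the first dense layer
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),   # after the second dense layer
        )

    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads=8, dropout=0.1):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        # no dropout after the qkv projection, per the quoted sentence
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Sequential(
            nn.Linear(dim, dim),
            nn.Dropout(dropout),   # after the output dense layer
        )

    def forward(self, x):
        b, n, d, h = *x.shape, self.heads
        q, k, v = (t.reshape(b, n, h, d // h).transpose(1, 2)
                   for t in self.to_qkv(x).chunk(3, dim=-1))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```

Note that the attention-weight dropout some implementations apply to the softmax output is a separate knob; the quoted sentence only covers dense layers and the embedding addition.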