
Question: Scaling down number of random features depending on number of heads? #4

Closed
Parskatt opened this issue Oct 19, 2020 · 4 comments


@Parskatt

The theory in the paper gives a result with guarantees for nb_features = O(dim * log(dim)).
When using multiple heads, e.g. dim = 512 and heads = 8, each head has a lower dimensionality. Is it then reasonable to scale down to nb_features = O((dim/heads) * log(dim/heads))? Or is the variance too high when the number of features gets that low? Do you have any intuition for this? I'm feeling a bit unsure.
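
For concreteness, a minimal sketch of the scaling I have in mind (the helper name is mine, and the O(·) constant is just taken to be 1):

```python
import math

def nb_features_per_head(dim: int, heads: int) -> int:
    # per-head dimension, e.g. dim = 512, heads = 8 -> d_head = 64
    d_head = dim // heads
    # O(d * log(d)) rule applied to the per-head dimension (constant taken as 1)
    return math.ceil(d_head * math.log(d_head))

print(nb_features_per_head(512, 8))  # 64 * ln(64) ≈ 266.2 -> 267
```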

@lucidrains (Owner)

@Parskatt I think it makes sense for this to be dim / heads. For a standard 1024 dim and 8 heads, that's 128 dims per head, so 128 * log(128) ≈ 621 with the natural log.
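
A quick sketch of plugging that in (assuming the repo's SelfAttention accepts dim_head and nb_features kwargs; check the version you're on):

```python
import math
import torch
from performer_pytorch import SelfAttention

dim, heads = 1024, 8
d_head = dim // heads                               # 128 dims per head
nb_features = math.ceil(d_head * math.log(d_head))  # 128 * ln(128) ≈ 621.1 -> 622

# assumption: dim_head and nb_features are accepted kwargs in your version
attn = SelfAttention(
    dim = dim,
    heads = heads,
    dim_head = d_head,
    nb_features = nb_features,
    causal = False
)

x = torch.randn(1, 2048, dim)
out = attn(x)  # (1, 2048, 1024)
```

If nb_features is left unset, the library may fall back to a default derived from the head dimension, so overriding it is mainly worthwhile when experimenting with this tradeoff.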

@lucidrains (Owner)

@Parskatt I was told this hyperparameter is pretty critical to good performance

@lucidrains (Owner)

@Parskatt Let us know what you find in your experiments!

@Parskatt (Author)

I will continue my experiments and let you know later :)
