The theory in the paper gives guarantees for nb_features = O(dim * log(dim)).
When using multiple heads, e.g. dim = 512 and heads = 8, the dimensionality per head is correspondingly lower. Is it then reasonable to scale the number of features as nb_features = O((dim/heads) * log(dim/heads))? Or does the variance get too high when the number of features becomes that small? Do you have any intuition for this? I'm feeling a bit unsure.
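For concreteness, here is a minimal sketch of the two scalings being compared. The theory only specifies the asymptotic order, so the constant factor (taken as 1 here) and the log base (natural log here) are assumptions for illustration:

```python
import math

def nb_features(d: int) -> int:
    # Heuristic instance of O(d * log(d)): constant factor assumed to be 1,
    # natural logarithm assumed. The theory does not pin these down.
    return int(d * math.log(d))

dim, heads = 512, 8
d_head = dim // heads  # 64 dims per head

print(nb_features(dim))     # scaling on the full model dimension → 3194
print(nb_features(d_head))  # scaling on the per-head dimension  → 266
```

The per-head scaling gives roughly a 12x reduction in feature count here, which is what makes the variance question relevant.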