The theory in the paper gives guarantees for nb_features = O(dim * log(dim)).
When using multiple heads, e.g. dim = 512 and heads = 8, the dimensionality per head is correspondingly lower. Is it then reasonable to scale the number of features as nb_features = O((dim/heads) * log(dim/heads))? Or does the variance get too high when the number of features becomes that small? Do you have any intuition for this? I'm feeling a bit unsure.
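For concreteness, here is a minimal sketch of the two scalings being compared. The theory only specifies the asymptotic order, so the constant factor (taken as 1 here) and the log base (natural log here) are assumptions for illustration:

```python
import math

def nb_features(d: int) -> int:
    # Heuristic instance of O(d * log(d)): constant factor assumed to be 1,
    # natural logarithm assumed. The theory does not pin these down.
    return int(d * math.log(d))

dim, heads = 512, 8
d_head = dim // heads  # 64 dims per head

print(nb_features(dim))     # scaling on the full model dimension → 3194
print(nb_features(d_head))  # scaling on the per-head dimension  → 266
```

The per-head scaling gives roughly a 12x reduction in feature count here, which is what makes the variance question relevant.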