Hi, I checked your code at ViT-pytorch/models/modeling.py, line 154 in 878ebc5. Your implementation is Attention(LayerNorm(x)) + x, but the original Transformer is LayerNorm(x + Attention(x)). Is this an error, or is it deliberately implemented like this?
As described in the paper On Layer Normalization in the Transformer Architecture, the two placements of layer normalization in a Transformer block are known as pre-LN and post-LN. For example, the Transformer-encoder-based BERT uses post-LN, while the Vision Transformer uses pre-LN.
In conclusion, the implementation is correct: it deliberately follows the pre-LN convention used by ViT.
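For reference, here is a minimal PyTorch sketch contrasting the two arrangements. This is not the repository's actual code; it uses `torch.nn.MultiheadAttention` as a stand-in for the repo's attention module, and the class names are made up for illustration:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN: x + Attention(LayerNorm(x)), as in ViT (and this repo)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)                 # normalize first
        attn_out, _ = self.attn(h, h, h) # self-attention on normalized input
        return x + attn_out              # residual adds the *unnormalized* x

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm(x + Attention(x)), as in the original Transformer."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x) # self-attention on raw input
        return self.norm(x + attn_out)   # normalize after the residual sum

x = torch.randn(2, 16, 64)               # (batch, tokens, dim)
print(PreLNBlock(64, 4)(x).shape)         # torch.Size([2, 16, 64])
print(PostLNBlock(64, 4)(x).shape)        # torch.Size([2, 16, 64])
```

Note that in the pre-LN form the residual path is an identity through the whole network; the cited paper argues this is why pre-LN Transformers train stably without the learning-rate warm-up that post-LN typically requires.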