How does the model reflect 'bidirectional'? #319
Comments
Here is some explanation: #83
You can find the description in the paper https://arxiv.org/abs/1810.04805:
@libertatis The self-attention in the Transformer calculates the weighted value vector at a position t by multiplying the query at t with the key at each position, and so on. I think it attends to both the left and right words. Why do the BERT authors say that it's uni-directional?
@JaneShenYY Yes, the self-attention layer attends to all the words to its left and right, and to itself too. This is the part of the network that makes it bidirectional.
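The difference being discussed, unmasked (bidirectional) attention versus a causal mask, can be sketched in plain numpy. This is a toy single-head sketch under simplifying assumptions (queries, keys, and values are all the raw embeddings; real models add learned Q/K/V projections, multiple heads, feed-forward blocks, and layer norm):

```python
import numpy as np

def self_attention(x, causal=False):
    """Toy scaled dot-product self-attention.

    causal=False: every position attends to all positions (BERT-style encoder).
    causal=True:  each position sees only itself and tokens to its left (GPT-style).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # queries = keys = x for simplicity
    if causal:
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)  # block attention to the right
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8-dim embeddings

_, w_bi = self_attention(x, causal=False)
_, w_uni = self_attention(x, causal=True)

print(w_bi[0])   # position 0 puts weight on every position, including the right
print(w_uni[0])  # position 0 can only attend to itself: [1., 0., 0., 0.]
```

With the causal mask, the first token's attention row collapses onto itself, which is why a left-to-right model's representations cannot use right context.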
@hsm207 If all the tokens can look at the left context, the right context, and themselves, why is it that the [CLS] token carries the sentence representation?
@chikubee The [CLS] token is prepended to every input sentence. So, on the first layer, the representation of [CLS] is a function of the [CLS] token itself and all the other tokens to its right. This pattern repeats until you reach the last transformer layer. I hope you can see that the [CLS] token has multiple opportunities to look at the input sentence's left and right context, since the token representations it depends on are themselves looking left and right. This means that the [CLS] token representation at the final layer can be considered a rich representation of the input sentence.

The [CLS] token carries the sentence representation in sentence classification tasks because it is the token whose representation is fine-tuned to the task at hand. We don't pick any other token as the sentence representation because the same token has a different representation depending on its location. For example, the representation of the word "the" in "the cat in the hat" is different from its representation in "I like the cat". We also don't pick the n-th token as the representation because that wouldn't handle cases where the input sentence's length is less than n.

So, to make things easy for us, let's just tack a dummy token (which we will call [CLS]) onto every input sentence. This way, we can be sure that we always have a token whose representation is a function of the other tokens in the input sentence and not of its position.

I hope this clarifies. Let me know if you have further questions.
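The claim that the final-layer [CLS] vector depends on the whole sentence, right context included, can be checked with a minimal numpy sketch. This is not BERT's actual architecture (no learned projections, multi-head attention, MLPs, or positional embeddings; the [CLS] embedding is an arbitrary fixed vector), just a stack of toy bidirectional attention layers:

```python
import numpy as np

def attn_layer(x):
    # One toy bidirectional self-attention layer (no projections, no MLP).
    d = x.shape[-1]
    s = x @ x.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def encode(tokens, n_layers=3):
    # Prepend a fixed [CLS] embedding, run the layer stack,
    # and return the final-layer [CLS] vector.
    cls = np.ones((1, tokens.shape[1]))
    x = np.vstack([cls, tokens])
    for _ in range(n_layers):
        x = attn_layer(x)
    return x[0]

rng = np.random.default_rng(0)
sent = rng.normal(size=(5, 8))  # 5 token embeddings, 8 dims each

rep_a = encode(sent)
sent_b = sent.copy()
sent_b[-1] += 1.0               # perturb only the LAST token
rep_b = encode(sent_b)

# The [CLS] representation changes, so it depends on tokens to its right.
print(np.linalg.norm(rep_a - rep_b) > 0)  # True
```

Perturbing the rightmost token changes the [CLS] output, which is exactly the bidirectional flow of information described above; in a causally masked stack, position 0 would be unaffected.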
This was a really good explanation @hsm207, it clarifies a lot. Thanks again.
Could you give some examples of your text classification use case and how you are checking for similar sentences to interpret the false positives? BERTScore is meant to evaluate the quality of machine-generated text, e.g. machine translation or image captioning. I don't see how this metric is applicable to text classification, since the outputs are class labels, not sequences of tokens.
@hsm207 It's a great explanation of why [CLS] learns the sentence representation, thanks a lot.
BERT is 'Bidirectional Encoder Representations from Transformers', so how does the transformer reflect 'bidirectional', and why doesn't GPT?