
How does the model reflect 'bidirectional'? #319

Open
HaodaY opened this issue Dec 28, 2018 · 9 comments

Comments

HaodaY commented Dec 28, 2018

BERT stands for 'Bidirectional Encoder Representations from Transformers', so how does the Transformer reflect 'bidirectional', and why doesn't GPT?

xwzhong commented Dec 28, 2018

Here is some explanation: #83

libertatis commented Feb 25, 2019

You can find the description in the paper https://arxiv.org/abs/1810.04805: "We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation."
So the BERT Transformer is the "Transformer encoder" with bidirectional self-attention, where every token can attend to context on both its left and right, while the GPT Transformer is the "Transformer decoder" with constrained self-attention, where every token can only attend to context on its left.
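
To make the difference concrete, here is a minimal NumPy sketch of the two attention masks (an illustration only, not this repository's actual code):

```python
import numpy as np

seq_len = 5

# BERT-style (Transformer encoder): every token may attend to every
# position, so the attention mask is all ones.
bidirectional_mask = np.ones((seq_len, seq_len))

# GPT-style (Transformer decoder): token i may only attend to positions
# j <= i, so the mask is lower-triangular (a "causal" mask).
causal_mask = np.tril(np.ones((seq_len, seq_len)))

print(causal_mask)
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]
```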

@JaneShenYY

@libertatis The self-attention in the Transformer calculates the weighted value vector at a position t by multiplying query t with the key at each position, and so on. I think it attends to both the left and right words, so why do the BERT authors say it's unidirectional?
Although the BERT paper claims the model is bidirectional, which part does this job, and how is bidirectionality implemented in the self-attention?

hsm207 commented Mar 12, 2019

@JaneShenYY Yes, the self-attention layer attends to all the words to its left and right, and to itself too. This is the part of the network that makes it bidirectional.
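
For illustration, here is a minimal single-head self-attention sketch in NumPy (the query/key/value projections are omitted for brevity). Without a mask, each output row is a weighted sum over all positions, left and right:

```python
import numpy as np

def self_attention(x, mask=None):
    # x: (seq_len, d) token vectors; Q/K/V projections omitted for brevity.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 1, scores, -1e9)   # block masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ x                               # each output mixes all allowed positions

x = np.random.randn(4, 8)
out_bert = self_attention(x)                               # bidirectional
out_gpt = self_attention(x, np.tril(np.ones((4, 4))))      # left-context only
```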

@chikubee

@hsm207 If all the tokens can look at the left context, the right context, and themselves, why is it the [CLS] token that carries the sentence representation?
How does it learn that representation? Any insights on the internal workings?
Thanks in advance.

hsm207 commented Aug 16, 2019

@chikubee The [CLS] token is prepended to every input sentence. So, in the first layer, the representation of [CLS] is a function of the [CLS] token itself and all other tokens to its right. This pattern repeats until you reach the last Transformer layer. I hope you can see that the [CLS] token has multiple opportunities to look at the input sentence left and right, since the token representations it depends on are themselves looking at the sentence left and right. This means that the [CLS] token representation at the final layer can be considered a rich representation of the input sentence.

The [CLS] token carries the sentence representation in sentence classification tasks because it is the token whose representation is fine-tuned to the task at hand. We don't pick any other token as the sentence representation because the same token has a different representation depending on its location. For example, the representation of the word "the" in "the cat in the hat" is different than in "I like the cat". We also don't pick the n-th token as the representation because that won't handle cases where the input sentence's length is less than n.

So, to make things easy for us, let's just tack on a dummy token (which we will call [CLS]) to every input sentence. This way, we can be sure that we always have a token whose representation is a function of the other tokens in the input sentence and not of its position.

I hope this clarifies. Let me know if you have further questions.
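
As a concrete illustration, here is how the final-layer [CLS] vector can be pulled out with the Hugging Face `transformers` library (an assumption for the sketch; this repo itself is TensorFlow, but the idea is the same):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer prepends [CLS] (and appends [SEP]) automatically.
inputs = tokenizer("the cat in the hat", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# [CLS] sits at position 0 of the last layer; in classification tasks this
# is the vector that gets fine-tuned as the sentence representation.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```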

chikubee commented Aug 17, 2019

This was a really good explanation, @hsm207; it clarifies a lot, thanks.
What I still fail to understand is whether it is really a good representation of the sentence: when I try to check for similar sentences to interpret the false positives in a text classification task, the results tell otherwise in some cases.
Can you share some insights on sentence similarity?
Or is the correct way to go about it token-level cross-computation, as in BERTScore?

Thanks again.

hsm207 commented Sep 2, 2019

@chikubee

Could you give some examples of your text classification use case and of how you are checking for similar sentences to interpret the false positives?

BERTScore is meant to evaluate the quality of machine-generated text, e.g. machine translation or image captioning. I don't see how this metric is applicable to text classification, since the outputs are class labels, not sequences of tokens.
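
For what it's worth, a common (if rough) way to probe sentence similarity with BERT is cosine similarity between [CLS] vectors; here is a minimal sketch reusing the hypothetical `model`/`tokenizer` from the earlier example. Note that [CLS] vectors from a model that has not been fine-tuned for similarity are often poor similarity features, which may be related to the results chikubee observed:

```python
import torch
import torch.nn.functional as F

def cls_vector(sentence):
    # Reuses `model` and `tokenizer` from the sketch above (an assumption).
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0, :]

a = cls_vector("the cat in the hat")
b = cls_vector("I like the cat")
print(F.cosine_similarity(a, b).item())
```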

wj-Mcat commented May 9, 2021

@hsm207 It's a great explanation of why [CLS] learns the sentence representation, thanks a lot.
