
How does the model reflect 'bidirectional'? #319

Open
HaodaY opened this issue Dec 28, 2018 · 9 comments

Comments

HaodaY commented Dec 28, 2018

BERT stands for 'Bidirectional Encoder Representations from Transformers', so how does the Transformer reflect 'bidirectional', and why doesn't GPT?

xwzhong commented Dec 28, 2018

Here is some explanation: #83

libertatis commented Feb 25, 2019

You can find the description in the paper https://arxiv.org/abs/1810.04805: "We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation."
So the BERT Transformer is the "Transformer encoder" with bidirectional self-attention, where every token can attend to context on both its left and right, while the GPT Transformer is the "Transformer decoder" with constrained self-attention, where every token can only attend to context on its left.
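
To make the difference concrete, here is a minimal NumPy sketch of the two attention masks (an illustration only, not this repository's actual code):

```python
import numpy as np

seq_len = 5

# BERT-style (Transformer encoder): every token may attend to every
# position, so the attention mask is all ones.
bidirectional_mask = np.ones((seq_len, seq_len))

# GPT-style (Transformer decoder): token i may only attend to positions
# j <= i, so the mask is lower-triangular (a "causal" mask).
causal_mask = np.tril(np.ones((seq_len, seq_len)))

print(causal_mask)
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]
```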

@JaneShenYY

@libertatis The self-attention in the Transformer calculates the weighted value vector at a position t by multiplying query t with the key at each position, and so on. I think it attends to both the left and right words, so why do the BERT authors say it's unidirectional?
Although the BERT paper claims the model is bidirectional, which part does this job, and how is bidirectionality implemented in the self-attention?

hsm207 commented Mar 12, 2019

@JaneShenYY Yes, the self-attention layer attends to all the words to its left and right, and to itself too. This is the part of the network that makes it bidirectional.
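
For illustration, here is a minimal single-head self-attention sketch in NumPy (the query/key/value projections are omitted for brevity). Without a mask, each output row is a weighted sum over all positions, left and right:

```python
import numpy as np

def self_attention(x, mask=None):
    # x: (seq_len, d) token vectors; Q/K/V projections omitted for brevity.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 1, scores, -1e9)   # block masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ x                               # each output mixes all allowed positions

x = np.random.randn(4, 8)
out_bert = self_attention(x)                               # bidirectional
out_gpt = self_attention(x, np.tril(np.ones((4, 4))))      # left-context only
```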

@chikubee

@hsm207 If all the tokens can look at the left context, the right context, and themselves, why is it the [CLS] token that carries the sentence representation?
How does it learn that representation? Any insights on the internal workings?
Thanks in advance.

hsm207 commented Aug 16, 2019

@chikubee The [CLS] token is prepended to every input sentence. So, in the first layer, the representation of [CLS] is a function of the [CLS] token itself and all other tokens to its right. This pattern repeats until you reach the last Transformer layer. I hope you can see that the [CLS] token has multiple opportunities to look at the input sentence left and right, since the token representations it depends on are themselves looking at the sentence left and right. This means that the [CLS] token representation at the final layer can be considered a rich representation of the input sentence.

The [CLS] token carries the sentence representation in sentence classification tasks because it is the token whose representation is fine-tuned to the task at hand. We don't pick any other token as the sentence representation because the same token has a different representation depending on its location. For example, the representation of the word "the" in "the cat in the hat" is different than in "I like the cat". We also don't pick the n-th token as the representation because that won't handle cases where the input sentence's length is less than n.

So, to make things easy for us, let's just tack on a dummy token (which we will call [CLS]) to every input sentence. This way, we can be sure that we always have a token whose representation is a function of the other tokens in the input sentence and not of its position.

I hope this clarifies. Let me know if you have further questions.
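
As a concrete illustration, here is how the final-layer [CLS] vector can be pulled out with the Hugging Face `transformers` library (an assumption for the sketch; this repo itself is TensorFlow, but the idea is the same):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer prepends [CLS] (and appends [SEP]) automatically.
inputs = tokenizer("the cat in the hat", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# [CLS] sits at position 0 of the last layer; in classification tasks this
# is the vector that gets fine-tuned as the sentence representation.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```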

chikubee commented Aug 17, 2019

This was a really good explanation, @hsm207; it clarifies a lot, thanks.
What I still fail to understand is whether it is really a good representation of the sentence: when I try to check for similar sentences to interpret the false positives in a text classification task, the results tell otherwise in some cases.
Can you share some insights on sentence similarity?
Or is the correct way to go about it token-level cross-computation, as in BERTScore?

Thanks again.

hsm207 commented Sep 2, 2019

@chikubee

Could you give some examples of your text classification use case and of how you are checking for similar sentences to interpret the false positives?

BERTScore is meant to evaluate the quality of machine-generated text, e.g. machine translation or image captioning. I don't see how this metric is applicable to text classification, since the outputs are class labels, not sequences of tokens.
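
For what it's worth, a common (if rough) way to probe sentence similarity with BERT is cosine similarity between [CLS] vectors; here is a minimal sketch reusing the hypothetical `model`/`tokenizer` from the earlier example. Note that [CLS] vectors from a model that has not been fine-tuned for similarity are often poor similarity features, which may be related to the results chikubee observed:

```python
import torch
import torch.nn.functional as F

def cls_vector(sentence):
    # Reuses `model` and `tokenizer` from the sketch above (an assumption).
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0, :]

a = cls_vector("the cat in the hat")
b = cls_vector("I like the cat")
print(F.cosine_similarity(a, b).item())
```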

wj-Mcat commented May 9, 2021

@hsm207 It's a great explanation of why [CLS] learns the sentence representation, thanks a lot.
