
Why is there an extra dense layer in the pooler? #43

Closed
chsasank opened this issue Nov 4, 2018 · 1 comment

Comments


chsasank commented Nov 4, 2018

I'm referring to this line

In the paper, you state

In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding. We denote this vector as C ∈ R^H. The only new parameters added during fine-tuning are for a classification layer W ∈ R^{K × H}, where K is the number of classifier labels.

But here, you have an H × H dense layer, which contradicts the above. Even more perplexing to me is that the activation of this layer is tanh! I'm surprised all the models worked with tanh instead of ReLU activation.

I suspect that I'm missing something here. Thanks for your patience.
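
For readers without the repository open: the pooler in question takes the final hidden state of the first ([CLS]) token and passes it through an H × H dense layer with a tanh activation. The following is a minimal NumPy sketch of that step, with illustrative shapes and hypothetical variable names rather than the repository's actual TensorFlow code:

```python
import numpy as np

H = 768  # hidden size (illustrative value)

# Final Transformer hidden states for one sequence: shape [seq_len, H].
sequence_output = np.random.randn(128, H).astype(np.float32)

# C: the final hidden state of the first ([CLS]) token.
cls_vector = sequence_output[0]  # shape [H]

# The pooler being asked about: an extra H x H dense layer with tanh activation.
W_pool = 0.02 * np.random.randn(H, H).astype(np.float32)
b_pool = np.zeros(H, dtype=np.float32)
pooled_output = np.tanh(W_pool @ cls_vector + b_pool)  # shape [H]
```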

jacobdevlin-google (Contributor) commented

Yeah, it was an oversight that we didn't mention it in the paper (we'll mention it in the updated version), but we have an extra projection layer for the classifier and the LM before feeding the output into the classification layer.

However, these layers are both pre-trained with the rest of the network and are included in the pre-trained checkpoint. So the part about "the only new parameters added during fine-tuning" is correct; it's just not correct to say "output of the Transformer", it's really "output of the Transformer fed through one additional non-linear transformation".

The tanh() thing was done early to try to make it more interpretable, but it probably doesn't matter either way.
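
To make the distinction concrete, here is a sketch continuing the one above (again with hypothetical names, not the repository's actual code): the pooler's weights are restored from the pre-trained checkpoint, and only the K × H classification layer is newly initialized during fine-tuning.

```python
K = 2  # number of classifier labels (illustrative value)

# W_pool, b_pool, and pooled_output come from the pre-trained checkpoint /
# the pooling step sketched above; they are NOT new fine-tuning parameters.

# The only new parameters added during fine-tuning: W in R^{K x H} (plus a bias).
W_cls = 0.02 * np.random.randn(K, H).astype(np.float32)
b_cls = np.zeros(K, dtype=np.float32)
logits = W_cls @ pooled_output + b_cls  # shape [K]
```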
