
Why is there an extra dense layer in the pooler? #43

Closed
chsasank opened this issue Nov 4, 2018 · 1 comment

Comments


chsasank commented Nov 4, 2018

I'm referring to this line

In the paper, you state

In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding. We denote this vector as C ∈ R^H. The only new parameters added during fine-tuning are for a classification layer W ∈ R^{K × H}, where K is the number of classifier labels.

But here, you have an H × H dense layer, which contradicts the above. Even more perplexing to me is that the activation of this layer is tanh! I'm surprised all the models worked with tanh instead of ReLU activation.

I suspect that I'm missing something here. Thanks for your patience.
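
For readers without the repository open: the pooler in question takes the final hidden state of the first ([CLS]) token and passes it through an H × H dense layer with a tanh activation. The following is a minimal NumPy sketch of that step, with illustrative shapes and hypothetical variable names rather than the repository's actual TensorFlow code:

```python
import numpy as np

H = 768  # hidden size (illustrative value)

# Final Transformer hidden states for one sequence: shape [seq_len, H].
sequence_output = np.random.randn(128, H).astype(np.float32)

# C: the final hidden state of the first ([CLS]) token.
cls_vector = sequence_output[0]  # shape [H]

# The pooler being asked about: an extra H x H dense layer with tanh activation.
W_pool = 0.02 * np.random.randn(H, H).astype(np.float32)
b_pool = np.zeros(H, dtype=np.float32)
pooled_output = np.tanh(W_pool @ cls_vector + b_pool)  # shape [H]
```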

jacobdevlin-google (Contributor) commented

Yeah, it was an oversight that we didn't mention it in the paper (we'll mention it in the updated version), but we have an extra projection layer for the classifier and the LM before feeding the output into the classification layer.

However, these layers are both pre-trained with the rest of the network and are included in the pre-trained checkpoint. So the part about "the only new parameters added during fine-tuning" is correct; it's just not correct to say "output of the Transformer", it's really "output of the Transformer fed through one additional non-linear transformation".

The tanh() thing was done early to try to make it more interpretable, but it probably doesn't matter either way.
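
To make the distinction concrete, here is a sketch continuing the one above (again with hypothetical names, not the repository's actual code): the pooler's weights are restored from the pre-trained checkpoint, and only the K × H classification layer is newly initialized during fine-tuning.

```python
K = 2  # number of classifier labels (illustrative value)

# W_pool, b_pool, and pooled_output come from the pre-trained checkpoint /
# the pooling step sketched above; they are NOT new fine-tuning parameters.

# The only new parameters added during fine-tuning: W in R^{K x H} (plus a bias).
W_cls = 0.02 * np.random.randn(K, H).astype(np.float32)
b_cls = np.zeros(K, dtype=np.float32)
logits = W_cls @ pooled_output + b_cls  # shape [K]
```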
