Documentation and source for RobertaClassificationHead #8776

Closed · mnschmit opened this issue Nov 25, 2020 · 4 comments

@mnschmit (Contributor)

The docstring for RobertaForSequenceClassification says

RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks

Looking at the code, this does not seem correct. Here, the RoBERTa output is fed into an instance of the class RobertaClassificationHead, which feeds the pooled output into a multi-layer feed-forward network with one hidden layer and a tanh activation. So this is more than just a simple linear layer.
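
For reference, here is a simplified sketch of the head as I read it (my paraphrase of modeling_roberta.py, so the exact names and dropout placement may differ slightly):

```python
import torch
import torch.nn as nn

class RobertaClassificationHead(nn.Module):
    """Sketch of the sentence-level classification head."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features):
        x = features[:, 0, :]   # hidden state of the <s> token (equivalent to [CLS])
        x = self.dropout(x)
        x = self.dense(x)       # hidden layer ...
        x = torch.tanh(x)       # ... with tanh activation
        x = self.dropout(x)
        x = self.out_proj(x)    # final linear projection to num_labels
        return x
```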

I have two questions:

  1. Should the documentation reflect this different classification head for RoBERTa?
  2. Where does this classification head originally come from? I could not find a citable source where such a "deep" classification head is used. The original RoBERTa paper only seems to state that their task-specific fine-tuning procedure is the same as BERT's (which is only a linear layer).

I would be glad if someone could shed light on this.

@NielsRogge (Contributor) commented Nov 25, 2020

> which feeds the pooled output into a multi-layer feed-forward network with one hidden layer and a tanh activation. So this is more than just a simple linear layer.

Actually, the final hidden representation of the [CLS] token (or the <s> token in the case of RoBERTa) is not the pooled output. Applying the feed-forward layer with tanh activation to this hidden representation is what gives you the pooled output (a vector of size 768 for the base-sized model). After this, a linear layer called out_proj projects the pooled output of size 768 to a vector of size num_labels. So the documentation is still correct.
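
To make the shapes concrete, here is a minimal, self-contained illustration (dummy tensors and standalone layers for illustration only, not the actual model code; sizes are for the base model):

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_size, num_labels = 2, 16, 768, 3

# pretend this is the output of RobertaModel: last hidden states of all tokens
sequence_output = torch.randn(batch_size, seq_len, hidden_size)

dense = nn.Linear(hidden_size, hidden_size)
out_proj = nn.Linear(hidden_size, num_labels)

cls_hidden = sequence_output[:, 0, :]    # (2, 768) -- hidden state of <s>/[CLS], not yet the pooled output
pooled = torch.tanh(dense(cls_hidden))   # (2, 768) -- this is the pooled output
logits = out_proj(pooled)                # (2, 3)   -- the linear layer the docstring refers to
```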

For the second question: BERT actually does the same thing; it is just implemented differently. In modeling_bert.py, the linear classification layer is applied on top of the pooled_output of BertModel. That pooled output already has the feed-forward layer + tanh activation applied to the [CLS] token's hidden representation, as you can see here. In modeling_roberta.py, it is implemented differently: they start from the sequence_output (a tensor containing the final hidden representations of all tokens in the sequence), take the hidden representation of the <s> token via [:, 0, :], then apply the feed-forward layer + tanh, and finally the linear projection layer.
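
For reference, BERT's pooler looks roughly like this (my paraphrase of modeling_bert.py, so names may differ slightly):

```python
import torch.nn as nn

class BertPooler(nn.Module):
    """Sketch of BERT's pooler: dense + tanh on the [CLS] hidden state."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        first_token_tensor = hidden_states[:, 0]      # [CLS] hidden state
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

# BertForSequenceClassification then applies (roughly) a single linear
# classifier on top of this pooled output:
#     logits = classifier(dropout(pooled_output))
```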

So your confusion probably comes from the different ways this is implemented in BERT vs. RoBERTa, and from the meaning of pooled_output. Some people use "pooled output" to denote the final hidden representation of the [CLS] token, but in HuggingFace transformers it always refers to the output of a linear layer + tanh applied on top of that vector.

@mnschmit (Contributor, Author)

Thank you very much for the explanation @NielsRogge !
My confusion indeed comes from the different implementations and the meaning of "pooled output".

So this makes it consistent within the HuggingFace transformers library. But do you know where it originally comes from (now I am interested for both models)? Why is the [CLS] token representation transformed by a linear layer with a tanh activation? I couldn't find any reference to tanh in the original BERT paper. What they describe in Section 4.1, for example, sounds to me like there is only one linear layer on top of the [CLS] token representation. Is this a HuggingFace invention then? They don't seem to mention it in their arXiv paper either.

@NielsRogge (Contributor)

Interesting question! It turns out this has already been asked before here, and the answer by the author is here.

@mnschmit (Contributor, Author)

Thank you again @NielsRogge !
I had only searched for issues with RoBERTa. Now it makes sense!
