Documentation and source for RobertaClassificationHead #8776

Closed · mnschmit opened this issue Nov 25, 2020 · 4 comments

@mnschmit (Contributor)

The docstring for RobertaForSequenceClassification says

RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks

Looking at the code, this does not seem correct. Here, the RoBERTa output is fed into an instance of the class RobertaClassificationHead, which feeds the pooled output into a multi-layer feed-forward network with one hidden layer and a tanh activation. So this is more than just a simple linear layer.
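
For reference, here is a simplified sketch of the head as I read it (my paraphrase of modeling_roberta.py, so the exact names and dropout placement may differ slightly):

```python
import torch
import torch.nn as nn

class RobertaClassificationHead(nn.Module):
    """Sketch of the sentence-level classification head."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features):
        x = features[:, 0, :]   # hidden state of the <s> token (equivalent to [CLS])
        x = self.dropout(x)
        x = self.dense(x)       # hidden layer ...
        x = torch.tanh(x)       # ... with tanh activation
        x = self.dropout(x)
        x = self.out_proj(x)    # final linear projection to num_labels
        return x
```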

I have two questions:

  1. Should the documentation reflect this different classification head for RoBERTa?
  2. Where does this classification head originally come from? I could not find a citable source where such a "deep" classification head is used. The original RoBERTa paper only seems to state that their task-specific fine-tuning procedure is the same as BERT's (which is only a linear layer).

I would be glad if someone could shed light on this.

@NielsRogge (Contributor) commented Nov 25, 2020

> which feeds the pooled output into a multi-layer feed-forward network with one hidden layer and a tanh activation. So this is more than just a simple linear layer.

Actually, the final hidden representation of the [CLS] token (or the <s> token in the case of RoBERTa) is not the pooled output. Applying the feed-forward layer with tanh activation to this hidden representation is what gives you the pooled output (a vector of size 768 for the base-sized model). After this, a linear layer called out_proj projects the pooled output of size 768 to a vector of size num_labels. So the documentation is still correct.
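
To make the shapes concrete, here is a minimal, self-contained illustration (dummy tensors and standalone layers for illustration only, not the actual model code; sizes are for the base model):

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_size, num_labels = 2, 16, 768, 3

# pretend this is the output of RobertaModel: last hidden states of all tokens
sequence_output = torch.randn(batch_size, seq_len, hidden_size)

dense = nn.Linear(hidden_size, hidden_size)
out_proj = nn.Linear(hidden_size, num_labels)

cls_hidden = sequence_output[:, 0, :]    # (2, 768) -- hidden state of <s>/[CLS], not yet the pooled output
pooled = torch.tanh(dense(cls_hidden))   # (2, 768) -- this is the pooled output
logits = out_proj(pooled)                # (2, 3)   -- the linear layer the docstring refers to
```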

For the second question: BERT actually does the same thing; it is just implemented differently. In modeling_bert.py, the linear classification layer is applied on top of the pooled_output of BertModel. That pooled output already has the feed-forward layer + tanh activation applied to the [CLS] token's hidden representation, as you can see here. In modeling_roberta.py, it is implemented differently: they start from the sequence_output (a tensor containing the final hidden representations of all tokens in the sequence), take the hidden representation of the <s> token via [:, 0, :], then apply the feed-forward layer + tanh, and finally the linear projection layer.
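
For reference, BERT's pooler looks roughly like this (my paraphrase of modeling_bert.py, so names may differ slightly):

```python
import torch.nn as nn

class BertPooler(nn.Module):
    """Sketch of BERT's pooler: dense + tanh on the [CLS] hidden state."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        first_token_tensor = hidden_states[:, 0]      # [CLS] hidden state
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

# BertForSequenceClassification then applies (roughly) a single linear
# classifier on top of this pooled output:
#     logits = classifier(dropout(pooled_output))
```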

So your confusion probably comes from the different ways this is implemented in BERT vs. RoBERTa, and from the meaning of pooled_output. Some people use "pooled output" to denote the final hidden representation of the [CLS] token, but in HuggingFace transformers it always refers to the output of a linear layer + tanh applied on top of that vector.

@mnschmit (Contributor, Author)

Thank you very much for the explanation @NielsRogge !
My confusion indeed comes from the different implementations and the meaning of "pooled output".

So this makes it consistent within the HuggingFace transformers library. But do you know where it originally comes from (now I am interested for both models)? Why is the [CLS] token representation transformed by a linear layer with a tanh activation? I couldn't find any reference to tanh in the original BERT paper. What they describe in Section 4.1, for example, sounds to me like there is only one linear layer on top of the [CLS] token representation. Is this a HuggingFace invention then? They don't seem to mention it in their arXiv paper either.

@NielsRogge (Contributor)

Interesting question! It turns out this has already been asked before here, and the answer by the author is here.

@mnschmit (Contributor, Author)

Thank you again @NielsRogge !
I had only searched for issues with RoBERTa. Now it makes sense!
