
Huggingface Bert vs. Fast Transformer full attention #100

Closed
lipmem opened this issue Sep 2, 2021 · 9 comments

@lipmem commented Sep 2, 2021

First of all thank you for this amazing work!

In my research I am comparing different encoders for relation extraction.
What I noticed is that the transformer implementation of this repo with full attention performs worse (in terms of F1 score) than the Huggingface Bert implementation. I use a Huggingface Bert without pretrained weights.
My expectation is that this setup should perform the same as an untrained Bert from Huggingface.

TransformerEncoderBuilder.from_kwargs(
            n_layers=12,
            n_heads=12,
            query_dimensions=64,
            value_dimensions=64,
            feed_forward_dimensions=3072,
            attention_type="full",
            activation="gelu"
        ).get()

Is my expectation correct? Why does it perform worse?

@angeloskath (Collaborator)

Hi,

Sure, the performance should be similar. You can also check that, given the same weights, the two implementations actually return exactly the same results. We have a test for this: def test_huggin_bert(self).
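
For example, an equivalence check along these lines should print True once the weights have been copied (hf_encoder, ft_encoder and the weight copying itself are placeholders here; the test has the exact code):

# Rough sketch of the check. Call .eval() first so dropout does not make the
# outputs of the two forward passes differ.
import torch

hf_encoder.eval()
ft_encoder.eval()

x = torch.randn(2, 128, 768)   # (batch, sequence, hidden) dummy embeddings
hf_out = hf_encoder(x)[0]      # HuggingFace BertEncoder returns a tuple/ModelOutput
ft_out = ft_encoder(x)         # fast-transformers encoder with full attention
print(torch.allclose(hf_out, ft_out, atol=1e-5))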

Let me know if you are still experiencing problems.

Best,
Angelos

@lipmem (Author) commented Sep 10, 2021

Hi Angelos,
thank you for your answer. Unfortunately, I still face the problem.
I copied the source of the Huggingface Bert implementation and only replaced the encoder with your encoder, set to 'full' attention.
Like this:

# Old
self.encoder = BertEncoder(config) 
# New
self.encoder = TransformerEncoderBuilder.from_kwargs(
           n_layers=12,
           n_heads=12,
           query_dimensions=64,
           value_dimensions=64,
           feed_forward_dimensions=3072,
           attention_type="full",
           final_normalization=False,
           activation="gelu"
       ).get()

The encoder output is then used to do multiclass classification for relation extraction.
What happens is that the evaluation F1 score is significantly worse with the fast-transformers full-attention encoder.
[W&B chart, 10/09/2021: evaluation F1 over training for the two runs]
I have no explanation. Do you have any clue about this?

@angeloskath (Collaborator)

Just to make sure: if you run the test from my previous comment, does it pass? If it does, then there is probably a misconfiguration somewhere, because that test means the models are exactly equivalent, so they should train exactly the same.

If I had to guess, judging by the names in the plot, the lighter one maybe uses linear attention instead of full?

Cheers,
Angelos

@lipmem (Author) commented Sep 16, 2021

Sorry for the late reply.
The test passes on the machine I use for training.

About the plot: the lighter one is the fast-transformers implementation set to full attention and the darker one is the Huggingface Bert implementation without pretrained weights.
From my understanding, the fast-transformers curve should be much closer to the Huggingface Bert one.

@angeloskath (Collaborator)

Well, if the test passes then the networks are identical! That means they should also train in exactly the same way, so I assume there is probably a bug in your configuration somewhere and the two networks are actually not the same.

To check, you should be able to copy the weights from the BertEncoder using the same code as in the test, and you should then get exactly the same evaluation scores.
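
Something along these lines should work as a starting point (the attribute names are assumptions based on the usual layouts of transformers.BertEncoder and the fast-transformers TransformerEncoder; the unit test has the authoritative mapping):

# Sketch of copying BertEncoder weights into the fast-transformers encoder,
# layer by layer. norm1/norm2 map to the attention and output LayerNorms.
def copy_bert_weights(hf_encoder, ft_encoder):
    for hf_layer, ft_layer in zip(hf_encoder.layer, ft_encoder.layers):
        att = ft_layer.attention
        att.query_projection.load_state_dict(hf_layer.attention.self.query.state_dict())
        att.key_projection.load_state_dict(hf_layer.attention.self.key.state_dict())
        att.value_projection.load_state_dict(hf_layer.attention.self.value.state_dict())
        att.out_projection.load_state_dict(hf_layer.attention.output.dense.state_dict())
        ft_layer.norm1.load_state_dict(hf_layer.attention.output.LayerNorm.state_dict())
        ft_layer.linear1.load_state_dict(hf_layer.intermediate.dense.state_dict())
        ft_layer.linear2.load_state_dict(hf_layer.output.dense.state_dict())
        ft_layer.norm2.load_state_dict(hf_layer.output.LayerNorm.state_dict())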

@huu4ontocord

Is it the same as this issue: #103

@lipmem (Author) commented Sep 27, 2021

I replicated the test with my classification network. First I trained the variant with the Huggingface Bert, then I copied all the weights to the variant with your full attention. Additionally, I set norm1.eps and norm2.eps to 1e-12.
I passed a batch from my dataset into the Huggingface variant and compared its output to that of the full-attention variant, and the outputs are exactly the same (your unit test is only approximately equal because the norm eps values differ).
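
For reference, the eps adjustment looks roughly like this (assuming the builder-produced encoder exposes its layers as .layers):

# Align the LayerNorm epsilon with HuggingFace BERT's default of 1e-12.
for layer in self.encoder.layers:
    layer.norm1.eps = 1e-12
    layer.norm2.eps = 1e-12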

Despite the fact that the two networks produce the same output with the same weights, the full-attention variant learns considerably worse (F1=16.54) than the Huggingface Bert (F1=43.18).

I am still clueless why it behaves like this.

@danieltudosiu

Sorry to bump this issue, but are there any updates on this?

@lipmem (Author) commented Nov 16, 2021

I resolved the issue by finding a mistake in my configuration. Thanks for the help here.

@lipmem closed this as completed Nov 16, 2021