
Huggingface Bert vs. Fast Transformer full attention #100

Closed
lipmem opened this issue Sep 2, 2021 · 9 comments

@lipmem commented Sep 2, 2021

First of all thank you for this amazing work!

In my research I am comparing different encoders for relation extraction.
What I noticed is that the transformer implementation of this repo with full attention performs worse (in terms of F1 score) than the Huggingface Bert implementation. I use a Huggingface Bert without pretrained weights.
My expectation is that this setup should perform the same as an untrained Bert from Huggingface.

TransformerEncoderBuilder.from_kwargs(
            n_layers=12,
            n_heads=12,
            query_dimensions=64,
            value_dimensions=64,
            feed_forward_dimensions=3072,
            attention_type="full",
            activation="gelu"
        ).get()

Is my expectation correct? Why does it perform worse?

@angeloskath (Collaborator)

Hi,

Sure, the performance should be similar. You can also check that, given the same weights, the two implementations actually return exactly the same results. We have a test for this: def test_huggin_bert(self).
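
For example, an equivalence check along these lines should print True once the weights have been copied (hf_encoder, ft_encoder and the weight copying itself are placeholders here; the test has the exact code):

# Rough sketch of the check. Call .eval() first so dropout does not make the
# outputs of the two forward passes differ.
import torch

hf_encoder.eval()
ft_encoder.eval()

x = torch.randn(2, 128, 768)   # (batch, sequence, hidden) dummy embeddings
hf_out = hf_encoder(x)[0]      # HuggingFace BertEncoder returns a tuple/ModelOutput
ft_out = ft_encoder(x)         # fast-transformers encoder with full attention
print(torch.allclose(hf_out, ft_out, atol=1e-5))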

Let me know if you are still experiencing problems.

Best,
Angelos

@lipmem (Author) commented Sep 10, 2021

Hi Angelos,
thank you for your answer. Unfortunately, I still face the problem.
I copied the source of the Huggingface Bert implementation and only replaced the encoder with your encoder, set to 'full' attention.
Like this:

# Old
self.encoder = BertEncoder(config) 
# New
self.encoder = TransformerEncoderBuilder.from_kwargs(
           n_layers=12,
           n_heads=12,
           query_dimensions=64,
           value_dimensions=64,
           feed_forward_dimensions=3072,
           attention_type="full",
           final_normalization=False,
           activation="gelu"
       ).get()

The encoder output is then used to do multiclass classification for relation extraction.
What happens is that the evaluation F1 score is significantly worse with the fast-transformers full-attention encoder.
[W&B chart, 10/09/2021: evaluation F1 over training for the two runs]
I have no explanation. Do you have any clue about this?

@angeloskath (Collaborator)

Just to make sure: if you run the test from my previous comment, does it pass? If it does, then there is probably a misconfiguration somewhere, because that test means the models are exactly equivalent, so they should train exactly the same.

If I had to guess, judging by the names in the plot, the lighter one maybe uses linear attention instead of full?

Cheers,
Angelos

@lipmem (Author) commented Sep 16, 2021

Sorry for the late reply.
The test passes on the machine I use for training.

About the plot: the lighter one is the fast-transformers implementation set to full attention and the darker one is the Huggingface Bert implementation without pretrained weights.
From my understanding, the fast-transformers curve should be much closer to the Huggingface Bert one.

@angeloskath (Collaborator)

Well, if the test passes then the networks are identical! That means they should also train in exactly the same way, so I assume there is probably a bug in your configuration somewhere and the two networks are actually not the same.

To check, you should be able to copy the weights from the BertEncoder using the same code as in the test, and you should then get exactly the same evaluation scores.
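
Something along these lines should work as a starting point (the attribute names are assumptions based on the usual layouts of transformers.BertEncoder and the fast-transformers TransformerEncoder; the unit test has the authoritative mapping):

# Sketch of copying BertEncoder weights into the fast-transformers encoder,
# layer by layer. norm1/norm2 map to the attention and output LayerNorms.
def copy_bert_weights(hf_encoder, ft_encoder):
    for hf_layer, ft_layer in zip(hf_encoder.layer, ft_encoder.layers):
        att = ft_layer.attention
        att.query_projection.load_state_dict(hf_layer.attention.self.query.state_dict())
        att.key_projection.load_state_dict(hf_layer.attention.self.key.state_dict())
        att.value_projection.load_state_dict(hf_layer.attention.self.value.state_dict())
        att.out_projection.load_state_dict(hf_layer.attention.output.dense.state_dict())
        ft_layer.norm1.load_state_dict(hf_layer.attention.output.LayerNorm.state_dict())
        ft_layer.linear1.load_state_dict(hf_layer.intermediate.dense.state_dict())
        ft_layer.linear2.load_state_dict(hf_layer.output.dense.state_dict())
        ft_layer.norm2.load_state_dict(hf_layer.output.LayerNorm.state_dict())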

@huu4ontocord

Is it the same as this issue: #103

@lipmem (Author) commented Sep 27, 2021

I replicated the test with my classification network. First I trained the variant with the Huggingface Bert, then I copied all the weights to the variant with your full attention. Additionally, I set norm1.eps and norm2.eps to 1e-12.
I passed a batch from my dataset into the Huggingface variant and compared its output to that of the full-attention variant, and the outputs are exactly the same (your unit test is only approximately equal because the norm eps values differ).
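
For reference, the eps adjustment looks roughly like this (assuming the builder-produced encoder exposes its layers as .layers):

# Align the LayerNorm epsilon with HuggingFace BERT's default of 1e-12.
for layer in self.encoder.layers:
    layer.norm1.eps = 1e-12
    layer.norm2.eps = 1e-12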

Despite the fact that the two networks produce the same output with the same weights, the full-attention variant learns considerably worse (F1=16.54) than the Huggingface Bert (F1=43.18).

I am still clueless why it behaves like this.

@danieltudosiu

Sorry to bump this issue, but are there any updates on this?

@lipmem (Author) commented Nov 16, 2021

I resolved the issue by finding a mistake in my configuration. Thanks for the help here.

@lipmem closed this as completed Nov 16, 2021