Huggingface Bert vs. Fast Transformer full attention #100
Comments
Hi, sure, the performance should be similar. You can also check that, given the same weights, the two implementations return exactly the same results. We have a test for this at
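For reference, a minimal sketch of what such an equivalence check can look like (this is illustrative only, not the repository's actual test; it assumes both encoders map a `(batch, seq_len, d_model)` tensor to a tensor of the same shape and already hold identical weights):

```python
import torch

def outputs_match(encoder_a, encoder_b, d_model=768, seq_len=32, batch=2, atol=1e-5):
    """Feed the same random input through both encoders and compare the outputs.

    Only meaningful once the two encoders hold identical weights.
    """
    encoder_a.eval()
    encoder_b.eval()
    x = torch.randn(batch, seq_len, d_model)
    with torch.no_grad():
        out_a = encoder_a(x)
        out_b = encoder_b(x)
    return torch.allclose(out_a, out_b, atol=atol)
```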
Let me know if you are still experiencing problems. Best,
Just to make sure: if you run the test from my previous comment, does it pass? If it does, then there is probably a misconfiguration, because that test means the models are exactly equivalent and should therefore train exactly the same. If I had to guess, judging by the names in the plot, the lighter one maybe uses linear attention instead of full? Cheers,
Sorry for the late reply. About the plot: the lighter one is the fast-transformers implementation set to full attention and the darker one is the huggingface bert implementation without pretrained weights.
Well, if the test passes then the networks are identical! This means that they should also train in exactly the same way. So I assume there is probably a bug in your configuration somewhere and the two networks are actually not the same. In order to check, you should be able to copy the weights from the
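A rough sketch of what such a weight copy could look like, assuming you build an explicit mapping between the two `state_dict()`s yourself (the example key names in the docstring are placeholders; the real names come from printing `state_dict().keys()` of both models):

```python
import torch

def copy_weights(src_model, dst_model, name_map):
    """Copy parameters from src_model into dst_model.

    name_map maps destination parameter names to source parameter names,
    e.g. {"layers.0.attention.query_projection.weight":
          "encoder.layer.0.attention.self.query.weight", ...}.
    The exact keys depend on the two models and must be checked by hand.
    """
    src_state = src_model.state_dict()
    dst_state = dst_model.state_dict()
    copied = {}
    for dst_name, src_name in name_map.items():
        tensor = src_state[src_name]
        assert tensor.shape == dst_state[dst_name].shape, (dst_name, src_name)
        copied[dst_name] = tensor.clone()
    # strict=False leaves any unmapped destination parameters untouched
    dst_model.load_state_dict(copied, strict=False)
```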
Is it the same as this issue: #103
I replicated the test with my classification network. First I trained the variant with the Huggingface BERT. Then I copied all the weights to the variant with your full attention. Additionally I set the

Despite the fact that the two networks produce the same output with the same weights, the full attention variant learns considerably worse (F1=16.54) than the Huggingface BERT (F1=43.18). I am still clueless why it behaves like this.
Sorry to bump this issue, but are there any updates on this? |
I resolved the issue by finding a mistake in my configuration. Thanks for the help here. |
First of all thank you for this amazing work!
In my research I am comparing different encoders for relation extraction.
What I noticed is that the transformer implementation of this repo with full attention performs worse (regarding F1 score) than the huggingface bert implementation. I use an unpretrained huggingface bert.
My expectation is that this setup should perform the same as an untrained bert from huggingface.
Is my expectation correct? Why does it perform worse?
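For context, my comparison setup looks roughly like the sketch below. The builder arguments are my best guess at a configuration matching BERT-base and may differ between library versions:

```python
import torch
from transformers import BertConfig, BertModel
from fast_transformers.builders import TransformerEncoderBuilder

# Randomly initialised (non-pretrained) Huggingface BERT
bert = BertModel(BertConfig(hidden_size=768,
                            num_hidden_layers=12,
                            num_attention_heads=12,
                            intermediate_size=3072))

# fast-transformers encoder with softmax ("full") attention and matching sizes
builder = TransformerEncoderBuilder.from_kwargs(
    n_layers=12,
    n_heads=12,
    query_dimensions=64,            # hidden_size / num_attention_heads
    value_dimensions=64,
    feed_forward_dimensions=3072,
    attention_type="full",
    activation="gelu",              # BERT uses GELU; the builder default is ReLU
)
encoder = builder.get()
```

Note that `BertModel` also contains the token/position embeddings and the pooler, while the fast-transformers encoder expects already-embedded inputs, so the apples-to-apples comparison is between BERT's encoder stack and this encoder.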