
Is there any sample code for fine-tuning BERT on sequence labeling tasks, e.g., NER on CoNLL-2003? #1216

Closed
tuvuumass opened this issue Sep 6, 2019 · 10 comments

@tuvuumass
Contributor

tuvuumass commented Sep 6, 2019

❓ Questions & Help

Is there any sample code for fine-tuning BERT on sequence labeling tasks, e.g., NER on CoNLL-2003, using BertForTokenClassification?
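
For context, here is a minimal sketch (assuming the pytorch-transformers 1.x API) of what a single fine-tuning step with BertForTokenClassification could look like; the label list, example sentence, and the naive copying of each word's tag to every sub-token are illustrative placeholders, not an official example:

```python
import torch
from pytorch_transformers import AdamW, BertForTokenClassification, BertTokenizer

# Illustrative CoNLL-2003 label set (placeholder ordering).
label_list = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(label_list))
optimizer = AdamW(model.parameters(), lr=5e-5)

# One toy training example: word-level tags naively copied to every sub-token.
words = ["John", "lives", "in", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]

tokens, label_ids = ["[CLS]"], [label_list.index("O")]
for word, tag in zip(words, tags):
    sub_tokens = tokenizer.tokenize(word)
    tokens.extend(sub_tokens)
    label_ids.extend([label_list.index(tag)] * len(sub_tokens))
tokens.append("[SEP]")
label_ids.append(label_list.index("O"))

input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
labels = torch.tensor([label_ids])

model.train()
loss, scores = model(input_ids, labels=labels)[:2]  # loss is computed over all sub-tokens here
loss.backward()
optimizer.step()
optimizer.zero_grad()
```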

@stefan-it
Collaborator

Hi @tuvuumass,

Issue #64 is a good start for sequence labeling tasks. It also points to some repositories that show how to fine-tune BERT with PyTorch-Transformers (with focus on NER).

Nevertheless, it would be awesome to get some kind of fine-tuning examples (reference implementation) integrated into this outstanding PyTorch-Transformers library 🤗 Maybe run_glue.py could be a good start 🤔

@tuvuumass
Contributor Author

tuvuumass commented Sep 7, 2019

Thanks, @stefan-it. I found #64 too, but it seems that none of the repositories linked there could replicate BERT's reported results (i.e., 96.6 dev F1 and 92.8 test F1 for BERT large, 96.4 dev F1 and 92.4 test F1 for BERT base). Yes, I agree that it would be great to have a fine-tuning example for sequence labeling tasks.

@thomwolf
Member

thomwolf commented Sep 8, 2019

Yes, I think it would be nice to have a clean example showing how the model can be trained and used on a token classification task like NER.

We won’t have the bandwidth/use case to do that internally, but if someone in the community has a (preferably self-contained) script they can share, we'd be happy to welcome a PR and include it in the repo.

Maybe you have something Stefan?

@stefan-it
Collaborator

Update on that:

I used the data preprocessing functions and forward implementation from @kamalkraj's BERT-NER, ported it from pytorch-pretrained-bert to pytorch-transformers, and integrated it into a copy of run_glue 😅

Fine-tuning is working - evaluation on the dev set (using a BERT base, cased model):

           precision    recall  f1-score   support

      PER     0.9713    0.9745    0.9729      1842
     MISC     0.8993    0.9197    0.9094       922
      LOC     0.9769    0.9679    0.9724      1837
      ORG     0.9218    0.9403    0.9310      1341

micro avg     0.9503    0.9562    0.9533      5942
macro avg     0.9507    0.9562    0.9534      5942

Evaluation on test set:

09/09/2019 23:20:02 - INFO - __main__ -   
           precision    recall  f1-score   support

      LOC     0.9309    0.9287    0.9298      1668
     MISC     0.7937    0.8276    0.8103       702
      PER     0.9614    0.9549    0.9581      1617
      ORG     0.8806    0.9145    0.8972      1661

micro avg     0.9066    0.9194    0.9130      5648
macro avg     0.9078    0.9194    0.9135      5648

I trained for 5 epochs using the default parameters from run_glue. Each epoch took ~5 minutes on an RTX 2080 Ti.

However, it's an early implementation and maybe (with a little help from @kamalkraj) we can integrate it here 🤗
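
For reference, per-entity tables like the ones above are what seqeval's classification_report prints (seqeval is, as far as I can tell, the metrics package used by the BERT-NER repos linked in this thread); a tiny, self-contained example with placeholder tag sequences:

```python
from seqeval.metrics import classification_report, f1_score

# Toy gold/predicted tag sequences standing in for real dev-set output.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "B-MISC"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "O"]]

print(classification_report(y_true, y_pred, digits=4))
print("micro F1: {:.4f}".format(f1_score(y_true, y_pred)))
```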

@olix20

olix20 commented Sep 10, 2019

@stefan-it could you please share your fork? Thanks :)

@stefan-it
Collaborator

@olix20 Here's the first draft of an implementation:

https://gist.github.com/stefan-it/feb6c35bde049b2c19d8dda06fa0a465

(Just a gist at the moment) :)

@stecklin

After working with BERT-NER for a few days now, I tried to come up with a script that could be integrated here.
Compared to that repo and @stefan-it's gist, I tried to do the following:

  • Use the default BertForTokenClassification class instead of modifying the forward pass in a subclass. For that to work, I changed the way label ids are stored: the first sub-token of each word gets the real label id and the remaining sub-tokens get padding ids. The padding ids are then ignored by the cross-entropy loss function, instead of picking out only the desired tokens in a for loop before the loss computation (see the sketch after this list).
  • Log metrics to tensorboard.
  • Remove unnecessary parts copied over from glue (e.g. DataProcessor class).
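
A minimal sketch of that label-alignment idea, under my own naming (IGNORE_ID, label_map, and align_labels are illustrative, not taken from the script described above): the first sub-token of each word keeps the real label id, and the remaining sub-tokens get an id that CrossEntropyLoss skips via ignore_index.

```python
import torch
from pytorch_transformers import BertTokenizer

IGNORE_ID = -100  # the default ignore_index of torch.nn.CrossEntropyLoss
label_map = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4}

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def align_labels(words, tags):
    """Real label id on the first sub-token of each word, IGNORE_ID on the rest."""
    tokens, label_ids = [], []
    for word, tag in zip(words, tags):
        sub_tokens = tokenizer.tokenize(word)
        tokens.extend(sub_tokens)
        label_ids.extend([label_map[tag]] + [IGNORE_ID] * (len(sub_tokens) - 1))
    return tokens, label_ids

tokens, label_ids = align_labels(["Johanson", "visited", "Heidelberg"], ["B-PER", "O", "B-LOC"])
print(list(zip(tokens, label_ids)))

# Positions carrying IGNORE_ID are then simply skipped by the loss:
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=IGNORE_ID)
```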

@kamalkraj
Contributor

BERT-NER using TensorFlow 2.0
https://github.com/kamalkraj/BERT-NER-TF

@stale

stale bot commented Dec 21, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Dec 21, 2019
stale bot closed this as completed Dec 28, 2019
@Aj-232425

Similarly, can we use CoNLL-format data to fine-tune BERT for relation extraction?
