
What should be the label of sub-word units in Token Classification with Bert #323

Closed

ereday opened this issue Feb 26, 2019 · 4 comments

ereday commented Feb 26, 2019

Hi,

I'm trying to use BERT for a token-level tagging problem such as NER in German.

This is what I've done so far for input preparation:

from pytorch_pretrained_bert.tokenization import BertTokenizer, WordpieceTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)

sentences=  ["Bis 2013 steigen die Mittel aus dem EU-Budget auf rund 120 Millionen Euro ."]
labels = [["O","O","O","O","O","O","O","B-ORGpart","O","O","O","O","B-OTH","O"]]
tokens = tokenizer.tokenize(sentences[0])

When I check the tokens I see that there are now 20 tokens instead of the expected 14, because of the sub-word units.

>>> tokens
['Bis', '2013', 'st', '##eig', '##en', 'die', 'Mittel', 'aus', 'dem', 'EU', '##-', '##B', '##ud', '##get', 'auf', 'rund', '120', 'Millionen', 'Euro', '.']

My question is: how should I modify the labels array? Should I label each sub-word unit with the label of the original word, or should I do something else? As a second question, which of the examples in the repository can be used as example code for this purpose? run_classifier.py? run_squad.py?

UPDATE

OK, according to the paper it should be handled as follows (from Section 4.3 of the BERT paper):

To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the classifier. [...] Where no prediction is made for X. Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test.

Then, for the above example, the correct input-output pair is:

['Bis', '2013', 'st', '##eig', '##en', 'die', 'Mittel', 'aus', 'dem', 'EU', '##-', '##B', '##ud', '##get', 'auf', 'rund', '120', 'Millionen', 'Euro', '.']
['O', 'O', 'O', 'X', 'X', 'O', 'O', 'O', 'O', 'B-ORGpart', 'X', 'X', 'X', 'X', 'O', 'O', 'O', 'O', 'B-OTH', 'O']
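
For reference, the alignment above can be built by tokenizing word by word (a minimal sketch; align_labels is just an illustrative helper, not something from this repository):

def align_labels(words, word_labels, tokenizer):
    # Tokenize word by word; the word's label goes to the first sub-token,
    # every remaining sub-token gets the placeholder "X".
    tokens, aligned = [], []
    for word, label in zip(words, word_labels):
        sub_tokens = tokenizer.tokenize(word)
        if not sub_tokens:
            continue
        tokens.extend(sub_tokens)
        aligned.extend([label] + ["X"] * (len(sub_tokens) - 1))
    return tokens, aligned

tokens, aligned = align_labels(sentences[0].split(), labels[0], tokenizer)
# tokens  -> ['Bis', '2013', 'st', '##eig', '##en', ..., 'Euro', '.']
# aligned -> ['O',   'O',    'O',  'X',     'X',    ..., 'B-OTH', 'O']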

Then my question evolves into: "How can the sub-tokens be masked during training & testing?"

@eric-yates

I have a similar problem. I labeled the tokens as "X" and then got an error relating to NUM_LABELS. BERT appears to have treated the "X" as a third label, while I had only specified two labels.

@bheinzerling

You do not need to introduce an additional tag. This is explained here:

#64 (comment)
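
Roughly, the idea is to keep num_labels equal to the number of real tags and give the non-first sub-tokens a label id that the loss ignores. A minimal PyTorch sketch (illustrative only, not necessarily the exact code discussed in #64; label_map is a made-up subset of the full tag set, and sentences, labels, tokenizer are from the first comment):

label_map = {"O": 0, "B-ORGpart": 1, "B-OTH": 2}   # only the real tags
IGNORE_ID = -100   # torch.nn.CrossEntropyLoss ignores this index by default

label_ids = []
for word, label in zip(sentences[0].split(), labels[0]):
    sub_tokens = tokenizer.tokenize(word)
    # real label id for the first sub-token, IGNORE_ID for the rest
    label_ids.extend([label_map[label]] + [IGNORE_ID] * (len(sub_tokens) - 1))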

@thomwolf
Member

thomwolf commented Mar 6, 2019

Yes, I've left #64 open to discuss all these questions. Feel free to read the discussion there and ask questions if needed. Closing this issue.

@ritwikmishra

@ereday AFAIK, to answer your question "How can the sub-tokens be masked during training & testing?":
There is no need for masking. The sub-word tokens (except for the first sub-token of each word) are simply not used for the predictions.
Please tell me if I am wrong.
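
For illustration, a minimal sketch of that read-out at prediction time (it reuses tokens and tokenizer from the first comment; NUM_LABELS and the rest are illustrative, this is not the repository's example script, and [CLS]/[SEP] handling is left out for brevity):

import torch
from pytorch_pretrained_bert import BertForTokenClassification

NUM_LABELS = 3   # size of the real tag set (illustrative)
model = BertForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=NUM_LABELS)
model.eval()

# Every sub-token id is fed to the model.
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
with torch.no_grad():
    logits = model(input_ids)          # shape: (1, seq_len, NUM_LABELS)

# Read one prediction per word: the position of each word's first sub-token.
first_subtoken = [i for i, t in enumerate(tokens) if not t.startswith("##")]
pred_ids = logits[0, first_subtoken].argmax(-1)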
