
What should be the label of sub-word units in Token Classification with Bert #323

Closed

ereday opened this issue Feb 26, 2019 · 4 comments

ereday commented Feb 26, 2019

Hi,

I'm trying to use BERT for a token-level tagging problem such as NER in German.

This is what I've done so far for input preparation:

from pytorch_pretrained_bert.tokenization import BertTokenizer, WordpieceTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)

sentences=  ["Bis 2013 steigen die Mittel aus dem EU-Budget auf rund 120 Millionen Euro ."]
labels = [["O","O","O","O","O","O","O","B-ORGpart","O","O","O","O","B-OTH","O"]]
tokens = tokenizer.tokenize(sentences[0])

When I check the tokens I see that there are now 20 tokens instead of the expected 14, because of the sub-word units.

>>> tokens
['Bis', '2013', 'st', '##eig', '##en', 'die', 'Mittel', 'aus', 'dem', 'EU', '##-', '##B', '##ud', '##get', 'auf', 'rund', '120', 'Millionen', 'Euro', '.']

My question is: how should I modify the labels array? Should I label each sub-word unit with the label of the original word, or should I do something else? As a second question, which of the examples in the repository can be used as example code for this purpose? run_classifier.py? run_squad.py?

UPDATE

OK, according to the paper it should be handled as follows (from Section 4.3 of the BERT paper):

To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the classifier. [...] Where no prediction is made for X. Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test.

Then, for the above example, the correct input-output pair is:

['Bis', '2013', 'st', '##eig', '##en', 'die', 'Mittel', 'aus', 'dem', 'EU', '##-', '##B', '##ud', '##get', 'auf', 'rund', '120', 'Millionen', 'Euro', '.']
['O', 'O', 'O', 'X', 'X', 'O', 'O', 'O', 'O', 'B-ORGpart', 'X', 'X', 'X', 'X', 'O', 'O', 'O', 'O', 'B-OTH', 'O']
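
For reference, the alignment above can be built by tokenizing word by word (a minimal sketch; align_labels is just an illustrative helper, not something from this repository):

def align_labels(words, word_labels, tokenizer):
    # Tokenize word by word; the word's label goes to the first sub-token,
    # every remaining sub-token gets the placeholder "X".
    tokens, aligned = [], []
    for word, label in zip(words, word_labels):
        sub_tokens = tokenizer.tokenize(word)
        if not sub_tokens:
            continue
        tokens.extend(sub_tokens)
        aligned.extend([label] + ["X"] * (len(sub_tokens) - 1))
    return tokens, aligned

tokens, aligned = align_labels(sentences[0].split(), labels[0], tokenizer)
# tokens  -> ['Bis', '2013', 'st', '##eig', '##en', ..., 'Euro', '.']
# aligned -> ['O',   'O',    'O',  'X',     'X',    ..., 'B-OTH', 'O']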

Then my question evolves into: "How can the sub-tokens be masked during training & testing?"

@eric-yates

I have a similar problem. I labeled the tokens as "X" and then got an error relating to NUM_LABELS. BERT appears to have treated the "X" as a third label, while I had only specified two labels.

@bheinzerling

You do not need to introduce an additional tag. This is explained here:

#64 (comment)
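
Roughly, the idea is to keep num_labels equal to the number of real tags and give the non-first sub-tokens a label id that the loss ignores. A minimal PyTorch sketch (illustrative only, not necessarily the exact code discussed in #64; label_map is a made-up subset of the full tag set, and sentences, labels, tokenizer are from the first comment):

label_map = {"O": 0, "B-ORGpart": 1, "B-OTH": 2}   # only the real tags
IGNORE_ID = -100   # torch.nn.CrossEntropyLoss ignores this index by default

label_ids = []
for word, label in zip(sentences[0].split(), labels[0]):
    sub_tokens = tokenizer.tokenize(word)
    # real label id for the first sub-token, IGNORE_ID for the rest
    label_ids.extend([label_map[label]] + [IGNORE_ID] * (len(sub_tokens) - 1))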

@thomwolf
Member

thomwolf commented Mar 6, 2019

Yes, I've left #64 open to discuss all these questions. Feel free to read the discussion there and ask questions if needed. Closing this issue.

@ritwikmishra

@ereday AFAIK, to answer your question "How can the sub-tokens be masked during training & testing?":
There is no need for masking. The sub-word tokens (except for the first sub-token of each word) are simply not used for the predictions.
Please tell me if I am wrong.
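
For illustration, a minimal sketch of that read-out at prediction time (it reuses tokens and tokenizer from the first comment; NUM_LABELS and the rest are illustrative, this is not the repository's example script, and [CLS]/[SEP] handling is left out for brevity):

import torch
from pytorch_pretrained_bert import BertForTokenClassification

NUM_LABELS = 3   # size of the real tag set (illustrative)
model = BertForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=NUM_LABELS)
model.eval()

# Every sub-token id is fed to the model.
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
with torch.no_grad():
    logits = model(input_ids)          # shape: (1, seq_len, NUM_LABELS)

# Read one prediction per word: the position of each word's first sub-token.
first_subtoken = [i for i, t in enumerate(tokens) if not t.startswith("##")]
pred_ids = logits[0, first_subtoken].argmax(-1)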
