I'm trying to use BERT for a token-level tagging problem such as NER in German.
This is what I've done so far for input preparation:
from pytorch_pretrained_bert.tokenization import BertTokenizer, WordpieceTokenizer
# cased multilingual checkpoint (covers German), so keep do_lower_case=False
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)
sentences = ["Bis 2013 steigen die Mittel aus dem EU-Budget auf rund 120 Millionen Euro ."]
labels = [["O","O","O","O","O","O","O","B-ORGpart","O","O","O","O","B-OTH","O"]]
# WordPiece-tokenize the already CoNLL-tokenized sentence
tokens = tokenizer.tokenize(sentences[0])
When I check the tokens, I see that there are now 18 tokens instead of the expected 14, because of the sub-word units.
My question is: how should I modify the labels array? Should I label each sub-word unit with the label of the original word, or should I do something else? As a second question, which of the examples in the repository can be used as example code for this purpose? run_classifier.py? run_squad.py?
UPDATE
OK, according to the paper, it should be handled as follows (from Section 4.3 of the BERT paper):
To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the classifier. For example:
Jim Hen ##son was a puppet ##eer
I-PER I-PER X O O O X
Where no prediction is made for X. Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test.
Then, for the above example, the correct input/output pair keeps the original label on the first sub-token of each word and uses X for every remaining sub-token (a sketch of the alignment follows below). My question then evolves to: "How can the sub-tokens be masked during training & testing?"
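As a rough sketch of that alignment (my own helper, not taken from the repository's example scripts; the function name align_labels_with_wordpieces and the "X" placeholder label are my choices), continuing from the snippet above:

# Expand word-level labels to WordPiece tokens: first sub-token keeps the
# original tag, every remaining sub-token gets the dummy label "X".
def align_labels_with_wordpieces(words, word_labels, tokenizer):
    tokens, token_labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenizer.tokenize(word)   # e.g. "EU-Budget" splits into several pieces
        tokens.extend(pieces)
        token_labels.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, token_labels

words = sentences[0].split()
tokens, token_labels = align_labels_with_wordpieces(words, labels[0], tokenizer)
# len(tokens) == len(token_labels); only the first sub-token of each word keeps its tag.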
I have a similar problem. I labeled the sub-tokens as "X" and then got an error relating to NUM_LABELS: the model appeared to treat "X" as a third label, while I had only specified two labels.
@ereday AFAIK, to answer your question "How can the sub-tokens be masked during training & testing?":
There is no need for masking. The hidden states of the sub-word tokens (except for the first one of each word) are simply not fed to the classifier.
Please tell me if I am wrong.
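One way to realise that in code (just a rough sketch of my own, not how the repository's example scripts do it) is to keep feeding every sub-token through BERT but map the "X" positions to the loss function's ignore_index, so they never contribute to training:

import torch
import torch.nn as nn

label_map = {"O": 0, "B-ORGpart": 1, "B-OTH": 2}   # hypothetical label set for the example sentence
IGNORE = -100                                      # ignore_index used by CrossEntropyLoss

# token_labels comes from the alignment sketch above; "X" positions are dropped from the loss
label_ids = torch.tensor([[label_map.get(l, IGNORE) for l in token_labels]])
logits = torch.randn(1, label_ids.size(1), len(label_map))   # stand-in for the per-token BERT outputs

loss_fct = nn.CrossEntropyLoss(ignore_index=IGNORE)
loss = loss_fct(logits.view(-1, len(label_map)), label_ids.view(-1))
# At test time, predictions are likewise read off only the first sub-token of each word.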