"bert_chinese.bin" model gives wrong results #52
Comments
This bug is not limited to the Chinese model; all other custom models have it too.
Hi. For BERT models we support only TextToIds at the moment; TextToWords currently works only for regular tokenization models. We plan to fix this, but I don't have an ETA yet. For now, if you need tokens, you can call TextToIds and then do a reverse lookup to get the token text. Note also that TextToWords does not tell you whether a token is word-internal or not, so all of this has to be fixed before TextToWords can work correctly on word-piece models.

Another workaround is to use offsets: we have just added a TextToIdsWithOffsets API. It returns the IDs and the corresponding UTF-8 offsets for each token, including UNKs. However, this code has only just been added, and I need more time for testing and for updating and publishing the Python package. ETA: end of April.
@SergeiAlonichau I cannot find how to do a reverse lookup in your code. How do you retrieve the text that matches a token ID?
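For illustration, the reverse lookup suggested above could be done outside BlingFire with the vocabulary file the model was built from. This is only a minimal sketch: it assumes the IDs returned by TextToIds are line indices into a BERT-style `vocab.txt` (one token per line), which the thread does not confirm for every model. The helper names `load_vocab` and `ids_to_tokens` are hypothetical and not part of BlingFire.

```python
from typing import Iterable, List


def load_vocab(lines: Iterable[str]) -> List[str]:
    """Build an ID -> token table.

    Assumption: one token per line, and token ID == zero-based line index,
    as in a BERT-style vocab.txt.
    """
    return [line.rstrip("\r\n") for line in lines]


def ids_to_tokens(ids: Iterable[int], vocab: List[str],
                  unk_token: str = "[UNK]") -> List[str]:
    """Map each ID back to its token text; out-of-range IDs become unk_token."""
    return [vocab[i] if 0 <= i < len(vocab) else unk_token for i in ids]


# Tiny in-memory vocabulary standing in for a real vocab.txt (hypothetical data).
demo_vocab = load_vocab(["[PAD]", "[UNK]", "hello", "##world"])

# In practice the IDs would come from the tokenizer; here they are hard-coded.
demo_tokens = ids_to_tokens([2, 3, 99], demo_vocab)
print(demo_tokens)
```

With a real model, the vocabulary would be read with something like `load_vocab(open("vocab.txt", encoding="utf-8"))`, and the IDs would come from the tokenizer's TextToIds call. If the output is padded to a fixed length, the padding IDs would need to be dropped before the lookup.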
When tokenizing with the BlingFire model,
The above code prints:
The Hugging Face tokenizer, by contrast, gives the correct result: