
"bert_chinese.bin" model gives wrong results #52

Closed

hitvoice opened this issue Apr 2, 2020 · 3 comments

Comments


hitvoice commented Apr 2, 2020

When tokenizing with the BlingFire model:

import os
import blingfire

# Load the Chinese BERT tokenization model shipped with the blingfire package
h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), 'bert_chinese.bin'))
print(blingfire.text_to_words_with_model(h, '搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。'))

The above code prints:

搭 搭 船 船 前 前 往 往 惠 惠 阳 阳 , , 在 在 进 进 入 入 该 该 区 区 域 域 后 后 , , 他 他 们

While the Hugging Face tokenizer gives the correct result:

from transformers import BertTokenizer
tknz = BertTokenizer.from_pretrained('bert_vocab_chinese.txt')
print(' '.join(tknz.tokenize('搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。')))
搭 船 前 往 惠 阳 , 在 进 入 该 区 域 后 , 他 们 随 即 被 守 军 拦 下 并 关 押 于 当 地 军 营 。

xirect commented Apr 9, 2020

This bug is not limited to the Chinese model; all the other custom models have it as well:
bert_base_tok.bin
bert_base_cased_tok.bin
bert_chinese.bin
bert_multi_cased.bin

>>> text = 'This is the Bling-Fire tokenizer. 2007年9月日历表_2007年9月农历阳历一览表-万年历'
>>> text_to_words_with_model(h, text)
'This This is is the the Bling B ling - - Fire Fire tokenizer tok eni zer . . 2007 2007 年 年 9 9 月 月 日 日 历 历 表 表 _ _ 2007 2007 年 年 9 9 月 月 农 农 历 历 阳 阳 历 历 一 一 览 览 表 表 - - 万 万 年 年'

SergeiAlonichau (Member) commented

Hi,

For BERT models we support only TextToIds at the moment.

TextToWords works for regular tokenization models right now.

We plan to fix it, but I don't have an ETA right now.

At the moment, if you need tokens you can call TextToIds and then do a reverse lookup in the vocabulary to get the token text (note that TextToWords also does not tell you whether a token is a word-internal piece or not). All of this should be fixed before TextToWords can work correctly with word-piece models.
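
For illustration, here is a minimal sketch of that reverse-lookup workaround, reusing the bert_vocab_chinese.txt file from the first comment and assuming the standard BERT vocab format (one token per line, line number == token ID); the text_to_ids binding and the 128/100 values for max length and [UNK] id are assumptions, not part of the reply above:

import os
import blingfire

# Load the packaged Chinese BERT model, as in the original report.
h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), 'bert_chinese.bin'))

# Build an id -> token map from the vocab file (file name taken from the
# Hugging Face snippet above; standard BERT vocab format is assumed).
with open('bert_vocab_chinese.txt', encoding='utf-8') as f:
    id2tok = {i: line.rstrip('\n') for i, line in enumerate(f)}

text = '搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。'
ids = blingfire.text_to_ids(h, text, 128, 100)  # 128 = assumed max length, 100 = BERT [UNK]
# Drop the trailing 0-padding (id 0 is [PAD] in BERT vocabs) and look up each id.
print(' '.join(id2tok[i] for i in ids if i != 0))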

Another workaround is to use offsets: we have just added a TextToIdsWithOffsets API. It returns the IDs and the corresponding UTF-8 offsets for each token, including UNKs. However, this code has just been added, and I need more time for testing, updating, and publishing the Python package. ETA: end of April.
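
Once that package is out, the offsets route could look roughly like the sketch below; the binding name utf8text_to_ids_with_offsets and its (ids, starts, ends) return shape are assumptions inferred from the description above, since the API had only just been added:

import os
import blingfire

h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), 'bert_chinese.bin'))

text = '搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。'
utf8 = text.encode('utf-8')  # the returned offsets index into the UTF-8 bytes

# Assumed binding and signature; 128 = assumed max length, 100 = BERT [UNK].
ids, starts, ends = blingfire.utf8text_to_ids_with_offsets(h, utf8, 128, 100)

# Recover each token's surface text by slicing the original bytes with its
# offsets; skip empty spans, which correspond to padding.
tokens = [utf8[s:e].decode('utf-8', errors='replace') for s, e in zip(starts, ends) if e > s]
print(' '.join(tokens))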


xirect commented Apr 21, 2020

@SergeiAlonichau I cannot find how to do the reverse lookup in your code. How do you retrieve the text matching a token ID?
