
"bert_chinese.bin" model gives wrong results #52

Closed

hitvoice opened this issue Apr 2, 2020 · 3 comments

Comments


hitvoice commented Apr 2, 2020

When tokenizing with the BlingFire model:

import os
import blingfire

# Load the Chinese BERT tokenization model shipped with the blingfire package
h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), 'bert_chinese.bin'))
print(blingfire.text_to_words_with_model(h, '搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。'))

The above code prints:

搭 搭 船 船 前 前 往 往 惠 惠 阳 阳 , , 在 在 进 进 入 入 该 该 区 区 域 域 后 后 , , 他 他 们

While the Hugging Face tokenizer gives the correct result:

from transformers import BertTokenizer
tknz = BertTokenizer.from_pretrained('bert_vocab_chinese.txt')
print(' '.join(tknz.tokenize('搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。')))
搭 船 前 往 惠 阳 , 在 进 入 该 区 域 后 , 他 们 随 即 被 守 军 拦 下 并 关 押 于 当 地 军 营 。

xirect commented Apr 9, 2020

This bug is not limited to the Chinese model; all the other custom models have it as well:
bert_base_tok.bin
bert_base_cased_tok.bin
bert_chinese.bin
bert_multi_cased.bin

>>> text = 'This is the Bling-Fire tokenizer. 2007年9月日历表_2007年9月农历阳历一览表-万年历'
>>> text_to_words_with_model(h, text)
'This This is is the the Bling B ling - - Fire Fire tokenizer tok eni zer . . 2007 2007 年 年 9 9 月 月 日 日 历 历 表 表 _ _ 2007 2007 年 年 9 9 月 月 农 农 历 历 阳 阳 历 历 一 一 览 览 表 表 - - 万 万 年 年'

SergeiAlonichau (Member) commented

Hi,

For BERT models we support only TextToIds at the moment.

TextToWords works for regular tokenization models right now.

We plan to fix it, but I don't have an ETA right now.

At the moment, if you need tokens you can call TextToIds and then do a reverse lookup in the vocabulary to get the token text (note that TextToWords also does not tell you whether a token is a word-internal piece or not). All of this should be fixed before TextToWords can work correctly with word-piece models.
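
For illustration, here is a minimal sketch of that reverse-lookup workaround, reusing the bert_vocab_chinese.txt file from the first comment and assuming the standard BERT vocab format (one token per line, line number == token ID); the text_to_ids binding and the 128/100 values for max length and [UNK] id are assumptions, not part of the reply above:

import os
import blingfire

# Load the packaged Chinese BERT model, as in the original report.
h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), 'bert_chinese.bin'))

# Build an id -> token map from the vocab file (file name taken from the
# Hugging Face snippet above; standard BERT vocab format is assumed).
with open('bert_vocab_chinese.txt', encoding='utf-8') as f:
    id2tok = {i: line.rstrip('\n') for i, line in enumerate(f)}

text = '搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。'
ids = blingfire.text_to_ids(h, text, 128, 100)  # 128 = assumed max length, 100 = BERT [UNK]
# Drop the trailing 0-padding (id 0 is [PAD] in BERT vocabs) and look up each id.
print(' '.join(id2tok[i] for i in ids if i != 0))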

Another workaround is to use offsets: we have just added a TextToIdsWithOffsets API. It returns the IDs and the corresponding UTF-8 offsets for each token, including UNKs. However, this code has just been added, and I need more time for testing, updating, and publishing the Python package. ETA: end of April.
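
Once that package is out, the offsets route could look roughly like the sketch below; the binding name utf8text_to_ids_with_offsets and its (ids, starts, ends) return shape are assumptions inferred from the description above, since the API had only just been added:

import os
import blingfire

h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), 'bert_chinese.bin'))

text = '搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。'
utf8 = text.encode('utf-8')  # the returned offsets index into the UTF-8 bytes

# Assumed binding and signature; 128 = assumed max length, 100 = BERT [UNK].
ids, starts, ends = blingfire.utf8text_to_ids_with_offsets(h, utf8, 128, 100)

# Recover each token's surface text by slicing the original bytes with its
# offsets; skip empty spans, which correspond to padding.
tokens = [utf8[s:e].decode('utf-8', errors='replace') for s, e in zip(starts, ends) if e > s]
print(' '.join(tokens))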


xirect commented Apr 21, 2020

@SergeiAlonichau I cannot find how to do the reverse lookup in your code. How do you retrieve the text matching a token ID?
