
How to detokenize a BertTokenizer output? #36

Closed
bprabhakar opened this issue Nov 19, 2018 · 5 comments

Comments

@bprabhakar

I was wondering if there's a proper way of detokenizing the output tokens, i.e., reconstructing the sentence from the tokens, given that WordPiece tokenization introduces lots of ## prefixes.

@artemisart

artemisart commented Nov 19, 2018

You can remove ' ##', but you cannot know whether there was a space around punctuation tokens or whether words were originally uppercase.

@thomwolf
Member

Yes. I don't plan to include a reverse conversion of tokens in the tokenizer.
For an example of how to keep track of the original character positions, please read the run_squad.py example.

@leitro

leitro commented Feb 19, 2019

In my case, I do:

tokens = ['[UNK]', '[CLS]', '[SEP]', 'want', '##ed', 'wa', 'un', 'runn', '##ing', ',']
text = ' '.join(tokens)
fine_text = text.replace(' ##', '')
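
With the tokens above, this gives:

print(fine_text)  # [UNK] [CLS] [SEP] wanted wa un running ,

Note the stray space that remains before the comma; as mentioned above, spacing around punctuation cannot be recovered from the tokens alone.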

@igrinis

igrinis commented Nov 12, 2019

An apostrophe is treated as a punctuation mark, but it is often an integral part of a word. Regular .tokenize() always turns an apostrophe into a standalone token, so the information about which word it belongs to is lost. If the original sentence contains apostrophes, it is impossible to recreate it from its tokens (for example, when an apostrophe is the last symbol of a word, convert_tokens_to_string() will join it with the following one). To overcome this, one can check the surroundings of each apostrophe and add ## immediately after tokenization. For example:

sent = "The Smiths' used their son's car" 
tokens = tokenizer.tokenize(sent)

Now, if you fix the tokens to look like this:

original =>['the', 'smith', '##s', "'", 'used', 'their', 'son', "'", 's', 'car']
fixed => ['the', 'smith', '##s', "##'", 'used', 'their', 'son', "##'", '##s', 'car']

you will be able to restore the original words.
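
A minimal sketch of that post-processing step could look like the following (fix_apostrophes is a hypothetical helper, not part of the library; it assumes an uncased tokenizer so that the lowercased sentence can be scanned for each word piece):

def fix_apostrophes(sent, tokens):
    # Prefix '##' to an apostrophe (and to the piece right after it) whenever
    # the apostrophe was attached to the preceding word in the original sentence.
    fixed = []
    cursor = 0
    lowered = sent.lower()
    for tok in tokens:
        piece = tok[2:] if tok.startswith('##') else tok
        start = lowered.find(piece, cursor)
        if start == -1:  # e.g. [UNK]: leave the token unchanged
            fixed.append(tok)
            continue
        attached = start > 0 and not lowered[start - 1].isspace()
        if piece == "'" and attached and not tok.startswith('##'):
            fixed.append("##'")        # smiths' -> 'smith', '##s', "##'"
        elif attached and not tok.startswith('##') and fixed and fixed[-1].endswith("'"):
            fixed.append('##' + tok)   # son's -> ..., "##'", '##s'
        else:
            fixed.append(tok)
        cursor = start + len(piece)
    return fixed

fix_apostrophes("The Smiths' used their son's car",
                ['the', 'smith', '##s', "'", 'used', 'their', 'son', "'", 's', 'car'])
# -> ['the', 'smith', '##s', "##'", 'used', 'their', 'son', "##'", '##s', 'car']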

@pertschuk

pertschuk commented Dec 11, 2019

@thomwolf could you point to the specific section of run_squad.py that handles this? I'm having trouble finding it.

EDIT: is it this bit from processors/squad.py?

tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
for (i, token) in enumerate(example.doc_tokens):
    orig_to_tok_index.append(len(all_doc_tokens))
    sub_tokens = tokenizer.tokenize(token)
    for sub_token in sub_tokens:
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)
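
These two lists map between word pieces and the original whitespace-delimited tokens, so a predicted sub-token span can be mapped back to the original text roughly like this (a sketch using the variables above; pred_start and pred_end stand for hypothetical sub-token indices of a predicted answer span):

orig_start = tok_to_orig_index[pred_start]
orig_end = tok_to_orig_index[pred_end]
orig_text = ' '.join(example.doc_tokens[orig_start:orig_end + 1])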
