
How to detokenize a BertTokenizer output? #36

Closed
bprabhakar opened this issue Nov 19, 2018 · 5 comments

Comments

@bprabhakar

I was wondering if there's a proper way of detokenizing the output tokens, i.e., reconstructing the sentence from the tokens, given that WordPiece tokenization introduces lots of ## prefixes.

@artemisart

artemisart commented Nov 19, 2018

You can remove ' ##', but you cannot know whether there was a space around punctuation tokens or whether words were originally uppercase.

@thomwolf
Member

Yes. I don't plan to include a reverse conversion of tokens in the tokenizer.
For an example of how to keep track of the original character positions, please read the run_squad.py example.

@leitro

leitro commented Feb 19, 2019

In my case, I do:

tokens = ['[UNK]', '[CLS]', '[SEP]', 'want', '##ed', 'wa', 'un', 'runn', '##ing', ',']
text = ' '.join(tokens)
fine_text = text.replace(' ##', '')
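
With the tokens above, this gives:

print(fine_text)  # [UNK] [CLS] [SEP] wanted wa un running ,

Note the stray space that remains before the comma; as mentioned above, spacing around punctuation cannot be recovered from the tokens alone.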

@igrinis

igrinis commented Nov 12, 2019

An apostrophe is treated as a punctuation mark, but it is often an integral part of a word. Regular .tokenize() always turns an apostrophe into a standalone token, so the information about which word it belongs to is lost. If the original sentence contains apostrophes, it is impossible to recreate it from its tokens (for example, when an apostrophe is the last symbol of a word, convert_tokens_to_string() will join it with the following one). To overcome this, one can check the surroundings of each apostrophe and add ## immediately after tokenization. For example:

sent = "The Smiths' used their son's car" 
tokens = tokenizer.tokenize(sent)

Now, if you fix the tokens to look like this:

original =>['the', 'smith', '##s', "'", 'used', 'their', 'son', "'", 's', 'car']
fixed => ['the', 'smith', '##s', "##'", 'used', 'their', 'son', "##'", '##s', 'car']

you will be able to restore the original words.
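
A minimal sketch of that post-processing step could look like the following (fix_apostrophes is a hypothetical helper, not part of the library; it assumes an uncased tokenizer so that the lowercased sentence can be scanned for each word piece):

def fix_apostrophes(sent, tokens):
    # Prefix '##' to an apostrophe (and to the piece right after it) whenever
    # the apostrophe was attached to the preceding word in the original sentence.
    fixed = []
    cursor = 0
    lowered = sent.lower()
    for tok in tokens:
        piece = tok[2:] if tok.startswith('##') else tok
        start = lowered.find(piece, cursor)
        if start == -1:  # e.g. [UNK]: leave the token unchanged
            fixed.append(tok)
            continue
        attached = start > 0 and not lowered[start - 1].isspace()
        if piece == "'" and attached and not tok.startswith('##'):
            fixed.append("##'")        # smiths' -> 'smith', '##s', "##'"
        elif attached and not tok.startswith('##') and fixed and fixed[-1].endswith("'"):
            fixed.append('##' + tok)   # son's -> ..., "##'", '##s'
        else:
            fixed.append(tok)
        cursor = start + len(piece)
    return fixed

fix_apostrophes("The Smiths' used their son's car",
                ['the', 'smith', '##s', "'", 'used', 'their', 'son', "'", 's', 'car'])
# -> ['the', 'smith', '##s', "##'", 'used', 'their', 'son', "##'", '##s', 'car']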

@pertschuk

pertschuk commented Dec 11, 2019

@thomwolf could you point to the specific section of run_squad.py that handles this? I'm having trouble finding it.

EDIT: is it this bit from processors/squad.py?

tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
for (i, token) in enumerate(example.doc_tokens):
    orig_to_tok_index.append(len(all_doc_tokens))
    sub_tokens = tokenizer.tokenize(token)
    for sub_token in sub_tokens:
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)
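
These two lists map between word pieces and the original whitespace-delimited tokens, so a predicted sub-token span can be mapped back to the original text roughly like this (a sketch using the variables above; pred_start and pred_end stand for hypothetical sub-token indices of a predicted answer span):

orig_start = tok_to_orig_index[pred_start]
orig_end = tok_to_orig_index[pred_end]
orig_text = ' '.join(example.doc_tokens[orig_start:orig_end + 1])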
