WordPiece Tokenizer Clarification #763

Open
buckhx opened this issue Jul 16, 2019 · 0 comments

From my understanding, the WordPiece tokenizer adheres to the following algorithm:

  1. For each whitespace-separated token:
  2. Find the longest prefix of the token that is in the vocabulary; if there is no match, add [UNK] to the output.
  3. Otherwise, add the matched prefix to the output.
  4. Strip that prefix from the token and prepend "##" to the remainder.
  5. Repeat from step 2 until the token is exhausted or no match is found.

From this, it is unclear to me why "unwantedX" -> "[UNK]" instead of returning some sub-tokens, given that "un" is in the vocabulary of the following test:

https://github.com/google-research/bert/blob/master/tokenization_test.py#L91

Specifically, the following Go code expands unwantedX into sub-tokens (using the vocab given in the Python test) instead of producing [UNK]:

// Want: [[UNK] runn ##ing]
// Got:  [un ##want ##ed [UNK] runn ##ing]
// (uses "fmt"; tokenizeWhitespace, maxWordChars, unknownToken and
// wt.Vocab.LongestSubstring are defined elsewhere in my implementation)
func (wt WordPiece) Tokenize(text string) []string {
        var toks []string
        for _, tok := range tokenizeWhitespace(text) {
                // overly long words become [UNK] outright
                if len(tok) > maxWordChars {
                        toks = append(toks, unknownToken)
                        continue
                }
                // greedily peel off the longest vocab match, emitting
                // each piece as soon as it is found
                for len(tok) > 0 && tok != "##" {
                        sub := wt.Vocab.LongestSubstring(tok)
                        if sub == "" {
                                // only the unmatched remainder becomes [UNK];
                                // the pieces already emitted are kept
                                toks = append(toks, unknownToken)
                                break
                        }
                        toks = append(toks, sub)
                        tok = fmt.Sprintf("##%s", tok[len(sub):])
                }
        }
        return toks
}

Treating unwantedX as an unknown token appears to be the expected behavior, but I am trying to understand why that is the case and how it should be reflected in an implementation.
