WordPiece Tokenizer Clarification #763

Open
buckhx opened this issue Jul 16, 2019 · 0 comments

From my understanding, the WordPiece tokenizer adheres to the following algorithm:

  1. For each whitespace-separated token:
  2. Find the longest prefix of the token that is in the vocabulary; if there is no match, add [UNK] to the output.
  3. Otherwise, add the matched prefix to the output.
  4. Strip that prefix from the token and prepend "##" to the remainder.
  5. Repeat from step 2 until the token is exhausted or no match is found.

From this, it is unclear to me why "unwantedX" -> "[UNK]" instead of returning some sub-tokens, given that "un" is in the vocabulary of the following test:

https://github.com/google-research/bert/blob/master/tokenization_test.py#L91

Specifically, the following Go code expands unwantedX into sub-tokens (using the vocab given in the Python test) instead of producing [UNK]:

// Want: [[UNK] runn ##ing]
// Got:  [un ##want ##ed [UNK] runn ##ing]
// (uses "fmt"; tokenizeWhitespace, maxWordChars, unknownToken and
// wt.Vocab.LongestSubstring are defined elsewhere in my implementation)
func (wt WordPiece) Tokenize(text string) []string {
        var toks []string
        for _, tok := range tokenizeWhitespace(text) {
                // overly long words become [UNK] outright
                if len(tok) > maxWordChars {
                        toks = append(toks, unknownToken)
                        continue
                }
                // greedily peel off the longest vocab match, emitting
                // each piece as soon as it is found
                for len(tok) > 0 && tok != "##" {
                        sub := wt.Vocab.LongestSubstring(tok)
                        if sub == "" {
                                // only the unmatched remainder becomes [UNK];
                                // the pieces already emitted are kept
                                toks = append(toks, unknownToken)
                                break
                        }
                        toks = append(toks, sub)
                        tok = fmt.Sprintf("##%s", tok[len(sub):])
                }
        }
        return toks
}

Treating unwantedX as an unknown token appears to be the expected behavior, but I am trying to understand why that is the case and how it should be reflected in an implementation.
