Specifically, the following Go code expands unwantedX into subword pieces (rather than emitting [UNK]) when run against the vocab given in the Python test:
// Want: [[UNK] runn ##ing]
// Got:  [un ##want ##ed [UNK] runn ##ing]
func (wt WordPiece) Tokenize(text string) []string {
	var toks []string
	for _, tok := range tokenizeWhitespace(text) {
		if len(tok) > maxWordChars {
			toks = append(toks, unknownToken)
			continue
		}
		// Greedily peel off the longest matching vocab entry, prefixing the
		// remainder with "##" so continuation pieces can match.
		for len(tok) > 0 && tok != "##" {
			sub := wt.Vocab.LongestSubstring(tok)
			if sub == "" {
				// Pieces matched so far were already appended above, so they
				// leak into the output alongside the [UNK].
				toks = append(toks, unknownToken)
				break
			}
			toks = append(toks, sub)
			tok = "##" + tok[len(sub):]
		}
	}
	return toks
}
Treating unwantedX as an unknown token appears to be the expected behavior, but I am trying to understand why that is the case and how it impacts the implementation.
From my understanding, the WordPiece tokenizer adheres to the following algorithm: greedily match the longest vocab entry that is a prefix of the remaining word (prefixing continuation pieces with ##), and if at any point no entry matches, emit [UNK] for the entire word.
From this, it is unclear to me why "unwantedX" -> "[UNK]" instead of a sequence of subword tokens, given that "un" is in the vocab used by this test:
https://github.com/google-research/bert/blob/master/tokenization_test.py#L91
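To make the greedy step concrete, here is a small standalone trace (the `longestMatch` helper and vocab literal are illustrative, not the repo's API) showing where "unwantedX" dead-ends:

```go
package main

import "fmt"

// longestMatch returns the longest vocab entry that is a prefix of s, or "".
func longestMatch(vocab map[string]bool, s string) string {
	for end := len(s); end > 0; end-- {
		if vocab[s[:end]] {
			return s[:end]
		}
	}
	return ""
}

func main() {
	vocab := map[string]bool{"want": true, "##want": true, "##ed": true,
		"wa": true, "un": true, "runn": true, "##ing": true}
	tok := "unwantedX"
	for tok != "" && tok != "##" {
		sub := longestMatch(vocab, tok)
		fmt.Printf("%s -> %q\n", tok, sub)
		if sub == "" {
			break // nothing matches "##X": the whole word must become [UNK]
		}
		tok = "##" + tok[len(sub):]
	}
	// Trace:
	// unwantedX -> "un"
	// ##wantedX -> "##want"
	// ##edX -> "##ed"
	// ##X -> ""
}
```

The dead end at "##X" is why the reference implementation discards the already-matched pieces and emits a single [UNK] for the word.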