Duplicate tokens in BPE vocabulary #881

Open
astanic opened this issue Jun 9, 2023 · 1 comment

astanic commented Jun 9, 2023

Hi,

we're observing an "issue" with the SentencePiece tokenizer where multiple tokens decode to identical strings.

We generated a vocabulary of size 32768 from the wikitext-103 dataset with the following code:

import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    " --input=wikitext-103-raw/wiki.train.raw" +
    " --model_prefix=wiki_32768" +
    " --model_type=bpe" +
    " --vocab_size=32768" +
    " --hard_vocab_limit=True" +
    " --input_sentence_size=2000000" +
    " --unk_id=1" +
    " --bos_id=-1" +
    " --eos_id=-1" +
    " --pad_id=0" +
    " --max_sentencepiece_length=99" +
    " --split_by_unicode_script=True" +
    " --split_by_number=True" +
    " --split_by_whitespace=True" +
    " --add_dummy_prefix=False" +
    " --byte_fallback=True" +
    " --remove_extra_whitespaces=False" +
    " --allow_whitespace_only_pieces=True" +
    " --normalization_rule_name=identity" +
    " --user_defined_symbols=<D>" +
    " --split_digits=True" +
    " --vocabulary_output_piece_score=False")

After this we run the following inspection:

for i in range(32768):
    print(i, bytes(sp.Decode([i]), 'utf-8'))

Here we observe 2 "issues":

  1. 80 pairs of tokens decode to the same single-character string (see the full list below).
    These tokens group into 2 sets of nearby IDs: the first set has low IDs (near the start of the vocab) and the second very high IDs (near the end of the vocab).
    In the tokenized dataset, the tokenizer always "decides" to use only one token of each pair, usually the one with the higher ID.
  2. 128 tokens decode to the same b'\xef\xbf\xbd' symbol, which appears to be the Unicode replacement character. We don't understand why it shows up at all, since we use --split_by_unicode_script=True. Or are those tokens due to the --byte_fallback=True option?

Are both of these known and expected behaviors?
To us it seems counterintuitive and wasteful for the vocabulary to spend 2 token encodings on the same symbol.

Thanks!
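
(The duplicate pairs listed below can be reproduced with a grouping pass along these lines; this is just a sketch and assumes the sp processor loaded above.)

from collections import defaultdict

# Group token IDs by their decoded string and print every group with more than one member.
decoded_to_ids = defaultdict(list)
for i in range(sp.get_piece_size()):
    decoded_to_ids[sp.Decode([i])].append(i)

for text, ids in decoded_to_ids.items():
    if len(ids) > 1:
        print(f"Duplicate tokens (sp.Decode={bytes(text, 'utf-8')!r}) {ids[0]} with {ids[1:]}")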


List of duplicate tokens:
b'X' stands for the decoded string (the result of sp.Decode([i])).
A with [B] means that the sp.Decode result for tokens A and B is identical.

Duplicate tokens (sp.Decode=b' ') 35 with [32685]
Duplicate tokens (sp.Decode=b'!') 36 with [32760]
Duplicate tokens (sp.Decode=b'"') 37 with [32710]
Duplicate tokens (sp.Decode=b'#') 38 with [32759]
Duplicate tokens (sp.Decode=b"'") 42 with [32711]
Duplicate tokens (sp.Decode=b'(') 43 with [32743]
Duplicate tokens (sp.Decode=b')') 44 with [32742]
Duplicate tokens (sp.Decode=b'*') 45 with [32725]
Duplicate tokens (sp.Decode=b',') 47 with [32706]
Duplicate tokens (sp.Decode=b'-') 48 with [32723]
Duplicate tokens (sp.Decode=b'.') 49 with [32705]
Duplicate tokens (sp.Decode=b'/') 50 with [32761]
Duplicate tokens (sp.Decode=b'0') 51 with [32733]
Duplicate tokens (sp.Decode=b'1') 52 with [32718]
Duplicate tokens (sp.Decode=b'2') 53 with [32728]
Duplicate tokens (sp.Decode=b'3') 54 with [32746]
Duplicate tokens (sp.Decode=b'4') 55 with [32749]
Duplicate tokens (sp.Decode=b'5') 56 with [32750]
Duplicate tokens (sp.Decode=b'6') 57 with [32752]
Duplicate tokens (sp.Decode=b'7') 58 with [32753]
Duplicate tokens (sp.Decode=b'8') 59 with [32751]
Duplicate tokens (sp.Decode=b'9') 60 with [32741]
Duplicate tokens (sp.Decode=b':') 61 with [32737]
Duplicate tokens (sp.Decode=b';') 62 with [32754]
Duplicate tokens (sp.Decode=b'?') 66 with [32738]
Duplicate tokens (sp.Decode=b'A') 68 with [32714]
Duplicate tokens (sp.Decode=b'B') 69 with [32722]
Duplicate tokens (sp.Decode=b'C') 70 with [32719]
Duplicate tokens (sp.Decode=b'D') 71 with [32730]
Duplicate tokens (sp.Decode=b'E') 72 with [32726]
Duplicate tokens (sp.Decode=b'F') 73 with [32734]
Duplicate tokens (sp.Decode=b'G') 74 with [32735]
Duplicate tokens (sp.Decode=b'H') 75 with [32716]
Duplicate tokens (sp.Decode=b'I') 76 with [32712]
Duplicate tokens (sp.Decode=b'J') 77 with [32744]
Duplicate tokens (sp.Decode=b'K') 78 with [32755]
Duplicate tokens (sp.Decode=b'L') 79 with [32729]
Duplicate tokens (sp.Decode=b'M') 80 with [32720]
Duplicate tokens (sp.Decode=b'N') 81 with [32731]
Duplicate tokens (sp.Decode=b'O') 82 with [32739]
Duplicate tokens (sp.Decode=b'P') 83 with [32727]
Duplicate tokens (sp.Decode=b'Q') 84 with [32763]
Duplicate tokens (sp.Decode=b'R') 85 with [32732]
Duplicate tokens (sp.Decode=b'S') 86 with [32715]
Duplicate tokens (sp.Decode=b'T') 87 with [32713]
Duplicate tokens (sp.Decode=b'U') 88 with [32757]
Duplicate tokens (sp.Decode=b'V') 89 with [32758]
Duplicate tokens (sp.Decode=b'W') 90 with [32724]
Duplicate tokens (sp.Decode=b'Y') 92 with [32748]
Duplicate tokens (sp.Decode=b'Z') 93 with [32764]
Duplicate tokens (sp.Decode=b'[') 94 with [32767]
Duplicate tokens (sp.Decode=b']') 96 with [32766]
Duplicate tokens (sp.Decode=b'_') 98 with [32717]
Duplicate tokens (sp.Decode=b'a') 100 with [32688]
Duplicate tokens (sp.Decode=b'b') 101 with [32707]
Duplicate tokens (sp.Decode=b'c') 102 with [32698]
Duplicate tokens (sp.Decode=b'd') 103 with [32696]
Duplicate tokens (sp.Decode=b'e') 104 with [32686]
Duplicate tokens (sp.Decode=b'f') 105 with [32701]
Duplicate tokens (sp.Decode=b'g') 106 with [32700]
Duplicate tokens (sp.Decode=b'h') 107 with [32694]
Duplicate tokens (sp.Decode=b'i') 108 with [32691]
Duplicate tokens (sp.Decode=b'j') 109 with [32736]
Duplicate tokens (sp.Decode=b'k') 110 with [32709]
Duplicate tokens (sp.Decode=b'l') 111 with [32695]
Duplicate tokens (sp.Decode=b'm') 112 with [32699]
Duplicate tokens (sp.Decode=b'n') 113 with [32690]
Duplicate tokens (sp.Decode=b'o') 114 with [32689]
Duplicate tokens (sp.Decode=b'p') 115 with [32704]
Duplicate tokens (sp.Decode=b'q') 116 with [32745]
Duplicate tokens (sp.Decode=b'r') 117 with [32693]
Duplicate tokens (sp.Decode=b's') 118 with [32692]
Duplicate tokens (sp.Decode=b't') 119 with [32687]
Duplicate tokens (sp.Decode=b'u') 120 with [32697]
Duplicate tokens (sp.Decode=b'v') 121 with [32708]
Duplicate tokens (sp.Decode=b'w') 122 with [32702]
Duplicate tokens (sp.Decode=b'x') 123 with [32721]
Duplicate tokens (sp.Decode=b'y') 124 with [32703]
Duplicate tokens (sp.Decode=b'z') 125 with [32740]
Duplicate tokens (sp.Decode=b'|') 127 with [32762]
Duplicate tokens (sp.Decode=b'\xef\xbf\xbd') 131 with [132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258]
chris-ha458 (Contributor) commented

--byte_fallback=True"

This might be the problem.
In that case, those tokens are not identical.
One would encode a single character within the ascii range, the other might have the same byte representation but only used for unicode fall back.
The way sentencepiece works, the fallback tokens would not have a surface representation, so it would not be tokenized the same.
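
One way to check this (a sketch, assuming the sp processor loaded above and a recent sentencepiece exposing id_to_piece / is_byte) is to compare the pieces behind one of the duplicated pairs. If the byte-fallback explanation holds, one ID should map to a byte piece such as <0x41> and the other to the plain learned piece, even though both decode to the same string.

# IDs 68 and 32714 both decode to 'A' according to the list above.
for i in (68, 32714):
    print(i, repr(sp.id_to_piece(i)), repr(sp.Decode([i])), sp.is_byte(i))
# Expected, if the explanation holds: one line shows a piece like '<0x41>' with
# is_byte == True, the other the plain piece 'A' with is_byte == False.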
