Tokenizer ignoring multiple spaces #40

Closed
jorgemcgomes opened this issue Jun 7, 2023 · 11 comments

@jorgemcgomes

It appears the tokenizer is ignoring more than one consecutive space.
This behaviour is not observed with the original LLaMA tokenizer. See the examples below.

Is this some issue with the configuration of the HF tokenizer? Or has the model really been trained like this?
This seems like a very big deal for everything concerning code understanding/generation.

OpenLLaMA

from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained('openlm-research/open_llama_3b', use_fast=False)

>>> tokenizer("hello world")
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}
>>> tokenizer("hello     world")
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}

>>> tokenizer("hello\nworld")
{'input_ids': [1, 27701, 13, 7904], 'attention_mask': [1, 1, 1, 1]}

>>> tokenizer("hello\n world")
{'input_ids': [1, 27701, 13, 924], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello\n       world")
{'input_ids': [1, 27701, 13, 924], 'attention_mask': [1, 1, 1, 1]}

# line breaks seem fine
>>> tokenizer("hello\nworld")
{'input_ids': [1, 27701, 13, 7904], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello\n\nworld")
{'input_ids': [1, 27701, 13, 13, 7904], 'attention_mask': [1, 1, 1, 1, 1]}
>>> tokenizer("hello\n\n\nworld")
{'input_ids': [1, 27701, 13, 13, 13, 7904], 'attention_mask': [1, 1, 1, 1, 1, 1]}

Original LLaMA

tokenizer = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf', use_fast=False)
>>> tokenizer("hello world")
{'input_ids': [0, 22172, 3186], 'attention_mask': [1, 1, 1]}
>>> tokenizer("hello  world")
{'input_ids': [0, 22172, 29871, 3186], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello   world")
{'input_ids': [0, 22172, 259, 3186], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello    world")
{'input_ids': [0, 22172, 1678, 3186], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello     world")
{'input_ids': [0, 22172, 268, 3186], 'attention_mask': [1, 1, 1, 1]}
@danielhanchen

Interestingly, all the old model checkpoints also have the same issue if one uses use_fast = False. use_fast = True succeeds, albeit multiple spaces are tokenized independently (are multiple spaces supposed to be tokenized independently though?)

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b_350bt_preview', use_fast = False)
tokenizer("hello    world")

returns
{'input_ids': [0, 27701, 924], 'attention_mask': [1, 1, 1]}

tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b_600bt_preview', use_fast = False)
tokenizer("hello     world")

returns
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}

tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b_600bt_preview', use_fast = True)
tokenizer("hello     world")

returns
{'input_ids': [1, 27701, 31822, 31822, 31822, 31822, 924], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

At first I thought my contributions, which were to enable use_fast = True (https://huggingface.co/openlm-research/open_llama_3b_600bt_preview/discussions/3) (https://huggingface.co/openlm-research/open_llama_7b_700bt_preview/discussions/2), might have caused the error, but I did not contribute to open_llama_3b_350bt_preview and the error persists there as well.

I already opened a PR to each of the 3B, 7B and 13B models which allows use_fast = True to load in seconds rather than 5 minutes (since HF otherwise converts the slow tokenizer to a fast one under the hood on every load), and that should coincidentally also solve the space tokenization issue. A sketch of doing that conversion locally once is included after the links below.

https://huggingface.co/openlm-research/open_llama_13b_600bt/discussions/1
https://huggingface.co/openlm-research/open_llama_3b/discussions/2
https://huggingface.co/openlm-research/open_llama_7b/discussions/1
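
As a rough sketch of what this amounts to (the local path here is illustrative, not from the repos): loading once with use_fast = True triggers the slow-to-fast conversion, and saving the result writes a tokenizer.json that later loads in seconds.

from transformers import AutoTokenizer

# One-time conversion: the slow SentencePiece tokenizer is converted to a fast one here (this is the slow step).
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast = True)

# Save the converted tokenizer (writes tokenizer.json alongside the other files).
tokenizer.save_pretrained("./open_llama_3b_fast")

# Subsequent loads pick up tokenizer.json directly and are quick.
tokenizer = AutoTokenizer.from_pretrained("./open_llama_3b_fast")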

But my main question is still whether spaces are supposed to be tokenized independently, i.e. are 2 spaces just 2 space tokens and 3 spaces 3 individual space tokens, rather than the original LLaMA behaviour where 2 spaces map to token id X, 3 spaces to token id Y, and so on?
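
One quick way to check (model names as above; exact outputs may vary) is to map the ids back to token strings and see whether a run of spaces comes out as one piece or several:

from transformers import AutoTokenizer, LlamaTokenizer

tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b_600bt_preview', use_fast = True)
# With the OpenLLaMA vocab, expect roughly one ▁ piece per space.
print(tokenizer.convert_ids_to_tokens(tokenizer("hello     world")["input_ids"]))

tokenizer = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf', use_fast=False)
# With the original LLaMA vocab, expect most of the run to merge into a single multi-space piece.
print(tokenizer.convert_ids_to_tokens(tokenizer("hello     world")["input_ids"]))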

PS: in the meantime, you can use my tokenizers, which already implement use_fast = True:

tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b")
tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_7b")
tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_13b_600bt")

@jorgemcgomes
Author

jorgemcgomes commented Jun 8, 2023

Thanks @danielhanchen. I was just trying your tokenizers, and they do "solve" the spaces issue.

As for the space merging, I think it depends on whether the vocab has tokens for multiple spaces or not (and up to how many).

Testing the original LLaMA tokenizer, we can see it does have them for up to 16 (!) spaces.
In the examples below, note that ▁ is not an underscore; it is the Unicode character U+2581 used by the tokenizer to represent a space.

from transformers import LlamaTokenizer
tok_original = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf', use_fast=True)

>>> tok_original.get_vocab()["▁"]
29871
>>> tok_original.get_vocab()["▁▁"]
259
>>> tok_original.get_vocab()["▁▁▁"]
1678
>>> tok_original.get_vocab()["▁▁▁▁"]
268
>>> tok_original.get_vocab()["▁▁▁▁▁"]
418
>>> tok_original.get_vocab()["▁▁▁▁▁▁"]
539
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁"]
4706
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁"]
308
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁"]
3986
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁"]
965
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁"]
9651
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁"]
632
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁"]
795
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁▁"]
1669
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁"]
18884
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁"]
462
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁'

The OpenLLaMA tokenizer only has the single-space token though:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast=False)

>>> tok.get_vocab()["▁"]
31822
>>> tok.get_vocab()["▁▁"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '▁▁'

I think it is smart to have multiple spaces tokenized as a single token. When it comes to code data, for example, it represents an enormous saving of tokens; just think of all the tokens spent encoding simple indentation...
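
As a rough illustration (model names as used in this thread; token counts are indicative only):

from transformers import LlamaTokenizer, AutoTokenizer

indent = " " * 8 + "return x"  # a typical two-level indent

tok_original = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf', use_fast=False)
tok_single_space = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b")

# The multi-space vocab should cover the whole indent with one token,
# while a single-space vocab spends roughly one token per space.
print(len(tok_original(indent)["input_ids"]))
print(len(tok_single_space(indent)["input_ids"]))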

If OpenLLaMA was indeed trained like this, that's very unfortunate.

@danielhanchen

Coolies on trying out my temporary tokenizers! :)

Interesting find on LLaMA's support for up to 16 spaces! I think OpenLLaMA did do the individual digit splitting correctly, just maybe not the spaces.

Quote from https://arxiv.org/pdf/2302.13971.pdf:

Tokenizer. We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters.

The original LLaMA paper doesn't really mention spaces, so presumably they're just treated like other tokens.

@joytianya

joytianya commented Jun 9, 2023

When fine-tuning on code data downstream with https://github.com/young-geng/EasyLM/tree/main, there are significant issues. Spaces are usually used for indentation, and the result is that the indentation disappears.
Is there any way to solve this?

The code comes out without indentation, for example (a round-trip that reproduces this is sketched after the snippet):

def bubble_sort(arr):
 n = len(arr)
 for i in range(n-1):
 for j in range(n-i-1):
 if arr[j] > arr[j+1]:
 arr[j], arr[j+1] = arr[j+1], arr[j]
 return arr
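
A minimal round-trip with the (pre-fix) slow tokenizer reproduces this; the snippet is a sketch and assumes the repo name used earlier in the thread:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast=False)

code = "def bubble_sort(arr):\n    n = len(arr)\n    for i in range(n-1):\n"
ids = tok(code)["input_ids"]
# The four-space indents come back as at most a single leading space,
# which is how the code above ends up effectively unindented.
print(tok.decode(ids, skip_special_tokens=True))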

@danielhanchen

@joytianya I coincidentally opened 3 PRs to fix the 3B, 7B and 13B tokenizers :) If you're in a rush, you can temporarily use my tokenizers, which are the ones I pushed to the OpenLLaMA team's repos:

tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b")
tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_7b")
tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_13b_600bt")

@young-geng
Contributor

young-geng commented Jun 9, 2023

This is indeed a mistake on our side, as we have misconfigured the tokenizer to remove repeated spaces. I've updated that configuration and now the tokenizer should preserve all spaces. Please try it out.
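
For anyone curious where such a setting can live, one place to look (a sketch, assuming the tokenizer.model file has been downloaded locally) is the remove_extra_whitespaces flag in the SentencePiece model's normalizer spec:

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())

# True here means the normalizer collapses runs of whitespace before tokenization.
print(m.normalizer_spec.remove_extra_whitespaces)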

@belladoreai

belladoreai commented Jun 9, 2023

@danielhanchen What are the differences between the 3B, 7B, and 13B tokenizers? I ask because I've been working for a few days to create a client-side JavaScript tokenizer for LLaMA, and I used the 13B tokenizer as a reference. I assumed that the tokenizer is the same for these different LLaMA versions, but maybe it's not?

@codesoap

codesoap commented Jun 9, 2023

When I compare the three tokenizers, they seem to be the same:

$ curl -L https://huggingface.co/openlm-research/open_llama_3b/resolve/main/tokenizer.model -o tokenizer.model.3b
$ curl -L https://huggingface.co/openlm-research/open_llama_7b/resolve/main/tokenizer.model -o tokenizer.model.7b
$ curl -L https://huggingface.co/openlm-research/open_llama_13b_600bt/resolve/main/tokenizer.model -o tokenizer.model.13b

$ sha256 tokenizer.model.*
SHA256 (tokenizer.model.13b) = 81c4a3c9a9bbad64636d93660b6982940cec979a398f42684ba7194d118a3f21
SHA256 (tokenizer.model.3b) = 81c4a3c9a9bbad64636d93660b6982940cec979a398f42684ba7194d118a3f21
SHA256 (tokenizer.model.7b) = 81c4a3c9a9bbad64636d93660b6982940cec979a398f42684ba7194d118a3f21

@danielhanchen

danielhanchen commented Jun 9, 2023

@belladoreai Yep, as @codesoap showed, it seems like the OpenLLaMA team most likely trained one tokenizer on the entire 1T-token RedPajama dataset and then reused it for all 3 models.

But anyway, it seems like @young-geng has successfully fixed the tokenizers - I just checked all 3 (3B, 7B, 13B).

For example:

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_13b_600bt", pad_token = "</s>", use_fast = False)
tokenizer("Hello 1  2   3    4")

successfully returns:
{'input_ids': [1, 16644, 31822, 31853, 31822, 31822, 31855, 31822, 31822, 31822, 31878, 31822, 31822, 31822, 31822, 31882], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

I also updated the use_fast = True alternatives on my tokenizer-only repos, which enable Hugging Face's fast batched tokenization for those who need it (a short batched-call sketch follows the links):
https://huggingface.co/danielhanchen/open_llama_3b
https://huggingface.co/danielhanchen/open_llama_7b
https://huggingface.co/danielhanchen/open_llama_13b_600bt
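
A minimal batched call with one of the fast tokenizers might look like this (repo name as above; the pad token follows the earlier example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b", pad_token = "</s>")
batch = tok(["hello     world", "def f():\n    return 1"], padding=True)
# Each entry is padded to the longest sequence in the batch.
print(batch["input_ids"])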

@young-geng
Contributor

We've just released a 7B v2 model with a better tokenizer, pretrained with a lot of code data. Check it out!

@danielhanchen

@young-geng Congrats on the 7B v2 release! I can see multiple spaces are now tokenized properly! Good work!
