
Llama3 models causing TypeError: not a string error in LlamaTokenizer #30607

Closed
2 of 4 tasks
KeitaW opened this issue May 2, 2024 · 4 comments
Comments


KeitaW commented May 2, 2024

System Info

  • transformers version: 4.40.1
  • Platform: Linux-5.15.0-1052-aws-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0a0+6ddf5cf85e.nv24.04 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Initializing a tokenizer for a Llama 3 model with LlamaTokenizer raises the following error:

TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 LlamaTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

File /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2090, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2088         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
   2089 print("resolved_vocab_files", resolved_vocab_files)
-> 2090 return cls._from_pretrained(
   2091     resolved_vocab_files,
   2092     pretrained_model_name_or_path,
   2093     init_configuration,
   2094     *init_inputs,
   2095     token=token,
   2096     cache_dir=cache_dir,
   2097     local_files_only=local_files_only,
   2098     _commit_hash=commit_hash,
   2099     _is_local=is_local,
   2100     trust_remote_code=trust_remote_code,
   2101     **kwargs,
   2102 )

File /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2316, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2314     print("init_kwargs")
   2315     print(init_kwargs)
-> 2316     tokenizer = cls(*init_inputs, **init_kwargs)
   2317 except OSError:
   2318     raise OSError(
   2319         "Unable to load vocabulary from file. "
   2320         "Please check that the provided vocabulary is accessible and not corrupted."
   2321     )

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py:169, in LlamaTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, use_default_system_prompt, spaces_between_special_tokens, legacy, add_prefix_space, **kwargs)
    167 self.add_eos_token = add_eos_token
    168 self.use_default_system_prompt = use_default_system_prompt
--> 169 self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
    170 self.add_prefix_space = add_prefix_space
    172 super().__init__(
    173     bos_token=bos_token,
    174     eos_token=eos_token,
   (...)
    185     **kwargs,
    186 )

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py:198, in LlamaTokenizer.get_spm_processor(self, from_slow)
    196 if self.legacy or from_slow:  # no dependency on protobuf
    197     print("legacy")
--> 198     tokenizer.Load(self.vocab_file)
    199     return tokenizer
    201 with open(self.vocab_file, "rb") as f:

File /usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py:963, in SentencePieceProcessor.Load(self, model_file, model_proto)
    961 if model_proto:
    962   return self.LoadFromSerializedProto(model_proto)
--> 963 return self.LoadFromFile(model_file)

File /usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py:317, in SentencePieceProcessor.LoadFromFile(self, arg)
    315 def LoadFromFile(self, arg):
    316     print("debug: arg is ", arg) 
--> 317     return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

while AutoTokenizer returns a transformers.tokenization_utils_fast.PreTrainedTokenizerFast:

AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
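For context, the traceback above ends with sentencepiece rejecting the argument passed to LoadFromFile. A likely mechanism (this is a hedged, pure-Python illustration with hypothetical helpers, not the transformers source): Llama 3 repos ship only a tokenizers-style tokenizer.json and no sentencepiece tokenizer.model, so the slow LlamaTokenizer resolves its vocab_file to None and sentencepiece raises "TypeError: not a string" when handed that None:

```python
# Illustration only: mimics how a missing sentencepiece file can surface as
# "TypeError: not a string". The helpers and filenames here are hypothetical.
def resolve_vocab_file(repo_files: dict) -> "str | None":
    # Llama 2 repos ship "tokenizer.model" (sentencepiece); Llama 3 repos
    # ship only "tokenizer.json" (tokenizers/BPE), so this returns None.
    return repo_files.get("tokenizer.model")

def load_slow_tokenizer(repo_files: dict) -> str:
    vocab_file = resolve_vocab_file(repo_files)
    if not isinstance(vocab_file, str):
        # sentencepiece's binding rejects the non-string it receives
        raise TypeError("not a string")
    return f"loaded sentencepiece model from {vocab_file}"
```

With a Llama 2-style file set (`{"tokenizer.model": "spm.model"}`) the load succeeds; with a Llama 3-style set (`{"tokenizer.json": "t.json"}`) it raises the same TypeError seen above.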

Expected behavior

Both LlamaTokenizer and AutoTokenizer return a working tokenizer, as they do for Llama 2 models:

In [17]: t = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
In [18]: type(t)
Out[18]: transformers.models.llama.tokenization_llama.LlamaTokenizer
In [19]: t = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
In [20]: type(t)
Out[20]: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast
@ArthurZucker
Collaborator

? I don't understand why you expect both to return the same type? One is the slow tokenizer, relying on the sentencepiece backend, while the other is the fast one, which relies on the tokenizers backend 😉

@ArthurZucker
Collaborator


Llama 3 uses a different tokenizer and should only be initialized with AutoTokenizer
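That advice can be sketched as a small selection rule (a hypothetical helper for illustration; AutoTokenizer actually resolves the class via the repo's tokenizer_config.json, not via this logic):

```python
def pick_tokenizer_class(repo_files: set) -> str:
    # Hypothetical sketch of the advice above: prefer the fast
    # (tokenizers-backend) class whenever tokenizer.json exists, and use
    # the slow sentencepiece-backed class only when tokenizer.model does.
    if "tokenizer.json" in repo_files:
        return "LlamaTokenizerFast"  # what AutoTokenizer yields for Llama 3
    if "tokenizer.model" in repo_files:
        return "LlamaTokenizer"
    raise FileNotFoundError("no recognized tokenizer files in repo")
```

A Llama 3 repo (tokenizer.json only) maps to the fast class; forcing LlamaTokenizer on it skips that rule, which is what the reproduction above does.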

@KeitaW
Author


KeitaW commented May 2, 2024

Thank you very much @ArthurZucker for the quick response! I had wrongly assumed that LlamaTokenizer covers both Llama 2 and Llama 3.

@KeitaW KeitaW closed this as completed May 2, 2024
@ArthurZucker
Collaborator


No worries, I think we might not have been as clear as possible on this!
