
Llama3 models causing TypeError: not a string error in LlamaTokenizer #30607

Closed
2 of 4 tasks
KeitaW opened this issue May 2, 2024 · 4 comments
Comments


KeitaW commented May 2, 2024

System Info

  • transformers version: 4.40.1
  • Platform: Linux-5.15.0-1052-aws-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0a0+6ddf5cf85e.nv24.04 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Initializing a tokenizer for a Llama 3 model with LlamaTokenizer raises the following error:

TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 LlamaTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

File /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2090, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2088         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
   2089 print("resolved_vocab_files", resolved_vocab_files)
-> 2090 return cls._from_pretrained(
   2091     resolved_vocab_files,
   2092     pretrained_model_name_or_path,
   2093     init_configuration,
   2094     *init_inputs,
   2095     token=token,
   2096     cache_dir=cache_dir,
   2097     local_files_only=local_files_only,
   2098     _commit_hash=commit_hash,
   2099     _is_local=is_local,
   2100     trust_remote_code=trust_remote_code,
   2101     **kwargs,
   2102 )

File /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2316, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2314     print("init_kwargs")
   2315     print(init_kwargs)
-> 2316     tokenizer = cls(*init_inputs, **init_kwargs)
   2317 except OSError:
   2318     raise OSError(
   2319         "Unable to load vocabulary from file. "
   2320         "Please check that the provided vocabulary is accessible and not corrupted."
   2321     )

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py:169, in LlamaTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, use_default_system_prompt, spaces_between_special_tokens, legacy, add_prefix_space, **kwargs)
    167 self.add_eos_token = add_eos_token
    168 self.use_default_system_prompt = use_default_system_prompt
--> 169 self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
    170 self.add_prefix_space = add_prefix_space
    172 super().__init__(
    173     bos_token=bos_token,
    174     eos_token=eos_token,
   (...)
    185     **kwargs,
    186 )

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py:198, in LlamaTokenizer.get_spm_processor(self, from_slow)
    196 if self.legacy or from_slow:  # no dependency on protobuf
    197     print("legacy")
--> 198     tokenizer.Load(self.vocab_file)
    199     return tokenizer
    201 with open(self.vocab_file, "rb") as f:

File /usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py:963, in SentencePieceProcessor.Load(self, model_file, model_proto)
    961 if model_proto:
    962   return self.LoadFromSerializedProto(model_proto)
--> 963 return self.LoadFromFile(model_file)

File /usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py:317, in SentencePieceProcessor.LoadFromFile(self, arg)
    315 def LoadFromFile(self, arg):
    316     print("debug: arg is ", arg) 
--> 317     return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

while AutoTokenizer returns a transformers.tokenization_utils_fast.PreTrainedTokenizerFast:

AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
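For context, the traceback above ends with sentencepiece rejecting the argument passed to LoadFromFile. A likely mechanism (this is a hedged, pure-Python illustration with hypothetical helpers, not the transformers source): Llama 3 repos ship only a tokenizers-style tokenizer.json and no sentencepiece tokenizer.model, so the slow LlamaTokenizer resolves its vocab_file to None and sentencepiece raises "TypeError: not a string" when handed that None:

```python
# Illustration only: mimics how a missing sentencepiece file can surface as
# "TypeError: not a string". The helpers and filenames here are hypothetical.
def resolve_vocab_file(repo_files: dict) -> "str | None":
    # Llama 2 repos ship "tokenizer.model" (sentencepiece); Llama 3 repos
    # ship only "tokenizer.json" (tokenizers/BPE), so this returns None.
    return repo_files.get("tokenizer.model")

def load_slow_tokenizer(repo_files: dict) -> str:
    vocab_file = resolve_vocab_file(repo_files)
    if not isinstance(vocab_file, str):
        # sentencepiece's binding rejects the non-string it receives
        raise TypeError("not a string")
    return f"loaded sentencepiece model from {vocab_file}"
```

With a Llama 2-style file set (`{"tokenizer.model": "spm.model"}`) the load succeeds; with a Llama 3-style set (`{"tokenizer.json": "t.json"}`) it raises the same TypeError seen above.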

Expected behavior

Both LlamaTokenizer and AutoTokenizer return a working tokenizer, as they do for Llama 2 models:

In [17]: t = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
In [18]: type(t)
Out[18]: transformers.models.llama.tokenization_llama.LlamaTokenizer
In [19]: t = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
In [20]: type(t)
Out[20]: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast
@ArthurZucker
Collaborator

? I don't understand why you expect both to return the same type? One is the slow tokenizer, relying on the sentencepiece backend, while the other is the fast one, which relies on the tokenizers backend 😉

@ArthurZucker
Collaborator


Llama 3 uses a different tokenizer and should only be initialized with AutoTokenizer
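That advice can be sketched as a small selection rule (a hypothetical helper for illustration; AutoTokenizer actually resolves the class via the repo's tokenizer_config.json, not via this logic):

```python
def pick_tokenizer_class(repo_files: set) -> str:
    # Hypothetical sketch of the advice above: prefer the fast
    # (tokenizers-backend) class whenever tokenizer.json exists, and use
    # the slow sentencepiece-backed class only when tokenizer.model does.
    if "tokenizer.json" in repo_files:
        return "LlamaTokenizerFast"  # what AutoTokenizer yields for Llama 3
    if "tokenizer.model" in repo_files:
        return "LlamaTokenizer"
    raise FileNotFoundError("no recognized tokenizer files in repo")
```

A Llama 3 repo (tokenizer.json only) maps to the fast class; forcing LlamaTokenizer on it skips that rule, which is what the reproduction above does.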

@KeitaW
Author


KeitaW commented May 2, 2024

Thank you very much @ArthurZucker for the quick response! I had wrongly assumed that LlamaTokenizer covers both Llama 2 and Llama 3.

@KeitaW KeitaW closed this as completed May 2, 2024
@ArthurZucker
Collaborator


No worries, I think we might not have been as clear as possible on this!
