
[Bug] Special tokens are still mismatched. #715

Closed
Li-Qingyun opened this issue Mar 5, 2024 · 2 comments

Li-Qingyun commented Mar 5, 2024

Describe the bug

This commit added additional_special_tokens, which seems to result in a mismatch between the tokenizer length and the vocabulary size in my transformers==4.31.0 environment (which is admittedly < 4.34).

  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|action_start|>",
    "<|action_end|>",
    "<|interpreter|>",
    "<|plugin|>"
  ],
ipdb> tokenizer
InternLM2Tokenizer(name_or_path='internlm/internlm2-chat-7b', vocab_size=92544, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|action_start|>', '<|action_end|>', '<|interpreter|>', '<|plugin|>']}, clean_up_tokenization_spaces=False)
ipdb> len(tokenizer)
92550

It seems that the additional special tokens are assigned brand-new ids, which do not match the input_embeddings. However, this PR seems to resolve the bug in 4.33.2, as described in this issue.
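
For reference, the mismatch can be reproduced with roughly the following check (just a sketch; repo id as above, and trust_remote_code=True is assumed so the remote tokenizer/config code is used):

from transformers import AutoConfig, AutoTokenizer

repo = "internlm/internlm2-chat-7b"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)

# On transformers==4.31.0 this prints 92550 vs 92544.
print(len(tokenizer), config.vocab_size)

# Any special token whose id lands at or beyond vocab_size would index
# past the model's input embedding matrix.
for tok in tokenizer.additional_special_tokens:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    if tok_id >= config.vocab_size:
        print(f"{tok!r} -> {tok_id} is outside the embedding table")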

Environment

I'm still not sure. It seems that transformers==4.31.0 requires passing revision="f7dc28191037a297c086b5b70c6a226e2134e46d" to from_pretrained.

ipdb> tokenizer
InternLM2Tokenizer(name_or_path='internlm/internlm2-chat-7b', vocab_size=92544, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False)
ipdb> len(tokenizer)
92544
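
In other words, pinning the older revision makes the lengths line up again on 4.31.0; roughly (a sketch, revision hash as quoted above):

from transformers import AutoTokenizer

# Workaround sketch for transformers==4.31.0: pin the repo revision that
# predates the additional_special_tokens change (hash quoted above).
tokenizer = AutoTokenizer.from_pretrained(
    "internlm/internlm2-chat-7b",
    revision="f7dc28191037a297c086b5b70c6a226e2134e46d",
    trust_remote_code=True,
)
print(len(tokenizer))  # 92544, matching vocab_size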

Other information

No response

RangiLyu (Collaborator) commented Mar 6, 2024

Please install the correct transformers version as described in the readme and requirements file:
https://github.com/InternLM/InternLM#usages

transformers>=4.34

Lower versions of transformers cannot correctly pick up the ids set in added_tokens_decoder.
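
For example, on transformers>=4.34 you can inspect the declared ids directly and confirm they stay inside the vocabulary (a rough sketch, not part of the repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

# >=4.34 exposes the mapping from tokenizer_config.json as Dict[int, AddedToken];
# the special-token ids are expected to stay below vocab_size (92544).
for tok_id, tok in sorted(tokenizer.added_tokens_decoder.items()):
    print(tok_id, tok.content, tok_id < tokenizer.vocab_size)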

@Li-Qingyun (Author)

Yes, the root cause is added_tokens_decoder. I resolved the problem by modifying tokenization_internlm2.py:

class InternLM2Tokenizer(PreTrainedTokenizer):
    """
    Construct an InternLM2 tokenizer. Based on byte-level Byte-Pair-Encoding.
    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    model_input_names = ["input_ids", "attention_mask"]
    _auto_class = "AutoTokenizer"

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token="</s>",
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        add_bos_token=True,
        add_eos_token=False,
        decode_with_prefix_space=False,
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        self.vocab_file = vocab_file
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
        self.decode_with_prefix_space = decode_with_prefix_space
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)
        self._no_prefix_space_tokens = None
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )
        
        # If an `added_tokens_decoder` is passed, we are loading from a saved tokenizer, so we overwrite it.
        # Modified from https://github.com/huggingface/transformers/blob/132852203a02e320049457316a63cffb64968aa1/src/transformers/tokenization_utils.py#L358-L360
        # Mapping each declared id back to its token string lets encode/decode reuse the ids
        # from tokenizer_config.json on transformers<4.34 instead of appending new ones.
        added_tokens_decoder = {int(k): v["content"] for k, v in kwargs.pop("added_tokens_decoder", {}).items()}
        added_tokens_encoder = {token: token_id for token_id, token in added_tokens_decoder.items()}
        self.added_tokens_decoder = added_tokens_decoder
        self.added_tokens_encoder = added_tokens_encoder
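
A quick sanity check after the patch (just a sketch; the expected behaviour is that the special tokens now resolve to their declared in-vocabulary ids rather than freshly appended ones):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

for tok in ["<|im_start|>", "<|im_end|>", "<|action_start|>",
            "<|action_end|>", "<|interpreter|>", "<|plugin|>"]:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    # With the patch, each id comes from added_tokens_decoder and should stay
    # below vocab_size (92544), i.e. inside the model's embedding matrix.
    print(tok, tok_id, tok_id < tokenizer.vocab_size)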

Thanks for the reply!
