
[Bug] Special tokens are still mismatched. #715

Closed
Li-Qingyun opened this issue Mar 5, 2024 · 2 comments

Li-Qingyun commented Mar 5, 2024

Describe the bug

This commit added additional_special_tokens, which seems to result in a mismatch between the tokenizer length and the vocabulary size in my transformers==4.31.0 environment (which is admittedly < 4.34).

  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|action_start|>",
    "<|action_end|>",
    "<|interpreter|>",
    "<|plugin|>"
  ],
ipdb> tokenizer
InternLM2Tokenizer(name_or_path='internlm/internlm2-chat-7b', vocab_size=92544, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|action_start|>', '<|action_end|>', '<|interpreter|>', '<|plugin|>']}, clean_up_tokenization_spaces=False)
ipdb> len(tokenizer)
92550

It seems that the additional special tokens are assigned brand-new ids, which do not match the input_embeddings. However, this PR seems to resolve the bug in 4.33.2, as described in this issue.
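
For reference, the mismatch can be reproduced with roughly the following check (just a sketch; repo id as above, and trust_remote_code=True is assumed so the remote tokenizer/config code is used):

from transformers import AutoConfig, AutoTokenizer

repo = "internlm/internlm2-chat-7b"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)

# On transformers==4.31.0 this prints 92550 vs 92544.
print(len(tokenizer), config.vocab_size)

# Any special token whose id lands at or beyond vocab_size would index
# past the model's input embedding matrix.
for tok in tokenizer.additional_special_tokens:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    if tok_id >= config.vocab_size:
        print(f"{tok!r} -> {tok_id} is outside the embedding table")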

Environment

I'm still not sure. It seems that transformers==4.31.0 requires passing revision="f7dc28191037a297c086b5b70c6a226e2134e46d" to from_pretrained.

ipdb> tokenizer
InternLM2Tokenizer(name_or_path='internlm/internlm2-chat-7b', vocab_size=92544, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False)
ipdb> len(tokenizer)
92544
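
In other words, pinning the older revision makes the lengths line up again on 4.31.0; roughly (a sketch, revision hash as quoted above):

from transformers import AutoTokenizer

# Workaround sketch for transformers==4.31.0: pin the repo revision that
# predates the additional_special_tokens change (hash quoted above).
tokenizer = AutoTokenizer.from_pretrained(
    "internlm/internlm2-chat-7b",
    revision="f7dc28191037a297c086b5b70c6a226e2134e46d",
    trust_remote_code=True,
)
print(len(tokenizer))  # 92544, matching vocab_size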

Other information

No response

RangiLyu (Collaborator) commented Mar 6, 2024

Please install the correct transformers version as described in the readme and requirements file:
https://github.com/InternLM/InternLM#usages

transformers>=4.34

Lower versions of transformers cannot correctly pick up the ids set in added_tokens_decoder.
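
For example, on transformers>=4.34 you can inspect the declared ids directly and confirm they stay inside the vocabulary (a rough sketch, not part of the repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

# >=4.34 exposes the mapping from tokenizer_config.json as Dict[int, AddedToken];
# the special-token ids are expected to stay below vocab_size (92544).
for tok_id, tok in sorted(tokenizer.added_tokens_decoder.items()):
    print(tok_id, tok.content, tok_id < tokenizer.vocab_size)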

@Li-Qingyun (Author)

Yes, the root cause is added_tokens_decoder. I resolved the problem by modifying tokenization_internlm2.py:

class InternLM2Tokenizer(PreTrainedTokenizer):
    """
    Construct an InternLM2 tokenizer. Based on byte-level Byte-Pair-Encoding.
    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    model_input_names = ["input_ids", "attention_mask"]
    _auto_class = "AutoTokenizer"

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token="</s>",
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        add_bos_token=True,
        add_eos_token=False,
        decode_with_prefix_space=False,
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        self.vocab_file = vocab_file
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
        self.decode_with_prefix_space = decode_with_prefix_space
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)
        self._no_prefix_space_tokens = None
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )
        
        # If an `added_tokens_decoder` is passed, we are loading from a saved tokenizer, so we overwrite it.
        # Modified from https://github.com/huggingface/transformers/blob/132852203a02e320049457316a63cffb64968aa1/src/transformers/tokenization_utils.py#L358-L360
        # Mapping each declared id back to its token string lets encode/decode reuse the ids
        # from tokenizer_config.json on transformers<4.34 instead of appending new ones.
        added_tokens_decoder = {int(k): v["content"] for k, v in kwargs.pop("added_tokens_decoder", {}).items()}
        added_tokens_encoder = {token: token_id for token_id, token in added_tokens_decoder.items()}
        self.added_tokens_decoder = added_tokens_decoder
        self.added_tokens_encoder = added_tokens_encoder
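
A quick sanity check after the patch (just a sketch; the expected behaviour is that the special tokens now resolve to their declared in-vocabulary ids rather than freshly appended ones):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

for tok in ["<|im_start|>", "<|im_end|>", "<|action_start|>",
            "<|action_end|>", "<|interpreter|>", "<|plugin|>"]:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    # With the patch, each id comes from added_tokens_decoder and should stay
    # below vocab_size (92544), i.e. inside the model's embedding matrix.
    print(tok, tok_id, tok_id < tokenizer.vocab_size)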

Thanks for the reply!
