
Special tokens not tokenized properly #12168

Closed
manueltonneau opened this issue Jun 15, 2021 · 7 comments

Comments

@manueltonneau

Environment info

  • transformers version: 4.5.1
  • Python version: 3.8.5
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@LysandreJik

Information

Hi,

I have recently further pre-trained a RoBERTa model with fairseq, using a custom vocabulary trained with the tokenizers module. After converting the fairseq model to PyTorch, I uploaded all my model-related files here.

When loading the tokenizer, I noticed that the special tokens are not tokenized properly.

To reproduce

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272') 
tokenizer.tokenize('<mask>') 
Out[7]: ['<mask>']
tokenizer.tokenize('<hashtag>') 
Out[8]: ['hashtag']
tokenizer.encode('<hashtag>')
Out[3]: [0, 23958, 2]

Expected behavior

Since <hashtag> is a special token in the vocabulary with ID 7 (see here), the last output should be [0, 7, 2]. <hashtag>, including the '<>' brackets, should also be recognized as a single token.
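
A quick way to double-check what the loaded vocabulary assigns to the token (a minimal sketch; convert_tokens_to_ids does a direct vocab lookup without running the tokenization pipeline):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')

# Direct vocab lookup: returns 7 if '<hashtag>' is stored as a single vocabulary entry
print(tokenizer.convert_tokens_to_ids('<hashtag>'))

# Full pipeline with BOS/EOS added: currently gives [0, 23958, 2] instead of the expected [0, 7, 2]
print(tokenizer.encode('<hashtag>'))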

Potential explanation

When looking at the files from a similar model, it seems that the vocab is in txt format and they also have the bpe.codes file, which I don't have. Could that be the issue? And if so, how do I convert my files to this format?

For vocab.txt, I have already found your lengthy explanation here, thanks for this.
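
As far as I understand, the vocab.json and merges.txt written by the tokenizers module can be loaded directly by the RoBERTa tokenizer classes, so converting to the vocab.txt/bpe.codes layout should not be necessary. A minimal sketch, assuming both files sit in a local directory called output_dir:

from transformers import RobertaTokenizerFast

# vocab.json and merges.txt are the files that tokenizer.model.save() writes for a byte-level BPE
tokenizer = RobertaTokenizerFast(
    vocab_file='output_dir/vocab.json',
    merges_file='output_dir/merges.txt',
    additional_special_tokens=['<hashtag>'],  # custom special tokens still need to be declared
)
print(tokenizer.tokenize('<hashtag>'))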

@LysandreJik
Member

Hello! What is your tokenizer? Is it a WordPiece-based tokenizer, or a Byte-level BPE-based tokenizer like the original one from RoBERTa?

@manueltonneau
Author

Hi @LysandreJik, thanks for your reply and sorry that I'm just seeing this now. My tokenizer is a byte-level BPE-based tokenizer.
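
One way to confirm this from the loaded tokenizer itself (a small sketch; backend_tokenizer is only available on the fast, tokenizers-backed classes):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')
print(type(tokenizer).__name__)  # e.g. RobertaTokenizerFast
print(tokenizer.is_fast)         # True for a tokenizers-backed (fast) tokenizer
print(type(tokenizer.backend_tokenizer.model).__name__)  # e.g. BPE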

@manueltonneau
Author

Hi @LysandreJik, let me know if you have a solution for this or if you need more info, thanks a lot in advance :)

@NielsRogge
Contributor

NielsRogge commented Jul 2, 2021

Hi,

How did you add the additional special tokens? Did you start from a pre-trained RoBERTa, then add the additional special tokens, and further pre-train on a corpus?

Did you add these additional special tokens using the tokenizers library? Normally, one can add additional tokens as follows (based on huggingface/tokenizers#247 (comment)):

special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

However, printing the following:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272') 
print(tokenizer.additional_special_tokens)

Returns []. So you can solve it by doing:

special_tokens_dict = {'additional_special_tokens': ['<hashtag>']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

When I then test your example:

tokenizer.tokenize('<hashtag>')

I get: ['<hashtag>'].

And when doing:

tokenizer.convert_tokens_to_ids(tokenizer.tokenize("<hashtag>", add_special_tokens=True))

I get: [0, 7, 2].
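
To make this fix persistent, the updated special-token declaration can be written back next to the vocab files with save_pretrained (a minimal sketch; 'twibert-fixed' is just a placeholder output directory):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')
tokenizer.add_special_tokens({'additional_special_tokens': ['<hashtag>']})

# save_pretrained writes tokenizer_config.json and special_tokens_map.json alongside the vocab files,
# so the declaration is picked up again by the next from_pretrained call
tokenizer.save_pretrained('twibert-fixed')

reloaded = AutoTokenizer.from_pretrained('twibert-fixed')
print(reloaded.tokenize('<hashtag>'))  # expected: ['<hashtag>']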

@manueltonneau
Author

Awesome @NielsRogge, thanks a lot! Will test this and get back to you/close if solved.

@manueltonneau
Author

How did you add the additional special tokens? Did you start from a pre-trained RoBERTa, then add the additional special tokens, and further pre-train on a corpus?

I created a new vocab with the tokenizers module, to which I added new special tokens. Here is the code I used:

import os
import time

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train (args is the script's argument namespace, not shown here)
trainer = trainers.BpeTrainer(vocab_size=args.vocab_size, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
    "@USER",
    "HTTPURL",
    "<hashtag>",
    "</hashtag>",
], show_progress=True)
files = [os.path.join(args.corpus_dir, filename) for filename in os.listdir(args.corpus_dir)]
i = 0
start_time = time.time()
for file in files:
    print(f'Starting training on {file}')
    tokenizer.train([file], trainer=trainer)
    i = i + 1
    print(f'{i} files done out of {len(files)} files')
    print(f'Time elapsed: {time.time() - start_time} seconds')

# And save it (this writes only vocab.json and merges.txt for the BPE model)
output_dir = f'/scratch/mt4493/twitter_labor/twitter-labor-data/data/pretraining/US/vocab_files/{args.vocab_size}/{args.vocab_name}'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
tokenizer.model.save(output_dir)
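
A note on the last line: tokenizer.model.save(output_dir) only writes the BPE model files (vocab.json and merges.txt), so the special-token declarations passed to the trainer are not stored with them, which is presumably why additional_special_tokens came back empty on the transformers side. A minimal sketch of an alternative, assuming the same output_dir; the PreTrainedTokenizerFast loading step is an assumption, not part of the original setup:

# Save the whole tokenization pipeline (pre-tokenizer, decoder, added special tokens, ...)
tokenizer.save(os.path.join(output_dir, 'tokenizer.json'))

# On the transformers side, load that file and declare the special tokens explicitly
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file=os.path.join(output_dir, 'tokenizer.json'),
    bos_token='<s>',
    eos_token='</s>',
    unk_token='<unk>',
    pad_token='<pad>',
    mask_token='<mask>',
    additional_special_tokens=['@USER', 'HTTPURL', '<hashtag>', '</hashtag>'],
)
print(hf_tokenizer.tokenize('<hashtag>'))  # expected: ['<hashtag>']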

@manueltonneau
Author

Works fine, thanks again!
