
Special tokens not tokenized properly #12168

Closed
manueltonneau opened this issue Jun 15, 2021 · 7 comments

Comments

@manueltonneau

Environment info

  • transformers version: 4.5.1
  • Python version: 3.8.5
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@LysandreJik

Information

Hi,

I have recently further pre-trained a RoBERTa model with fairseq, using a custom vocabulary trained with the tokenizers module. After converting the fairseq model to PyTorch, I uploaded all my model-related files here.

When loading the tokenizer, I noticed that the special tokens are not tokenized properly.

To reproduce

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272') 
tokenizer.tokenize('<mask>') 
Out[7]: ['<mask>']
tokenizer.tokenize('<hashtag>') 
Out[8]: ['hashtag']
tokenizer.encode('<hashtag>')
Out[3]: [0, 23958, 2]

Expected behavior

Since <hashtag> is a special token in the vocabulary with ID 7 (see here), the last output should be [0, 7, 2]. <hashtag>, including the '<>' brackets, should also be recognized as a single token.
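
A quick way to double-check what the loaded vocabulary assigns to the token (a minimal sketch; convert_tokens_to_ids does a direct vocab lookup without running the tokenization pipeline):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')

# Direct vocab lookup: returns 7 if '<hashtag>' is stored as a single vocabulary entry
print(tokenizer.convert_tokens_to_ids('<hashtag>'))

# Full pipeline with BOS/EOS added: currently gives [0, 23958, 2] instead of the expected [0, 7, 2]
print(tokenizer.encode('<hashtag>'))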

Potential explanation

When looking at the files from a similar model, it seems that the vocab is in txt format and they also have the bpe.codes file, which I don't have. Could that be the issue? And if so, how do I convert my files to this format?

For vocab.txt, I have already found your lengthy explanation here, thanks for this.
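
As far as I understand, the vocab.json and merges.txt written by the tokenizers module can be loaded directly by the RoBERTa tokenizer classes, so converting to the vocab.txt/bpe.codes layout should not be necessary. A minimal sketch, assuming both files sit in a local directory called output_dir:

from transformers import RobertaTokenizerFast

# vocab.json and merges.txt are the files that tokenizer.model.save() writes for a byte-level BPE
tokenizer = RobertaTokenizerFast(
    vocab_file='output_dir/vocab.json',
    merges_file='output_dir/merges.txt',
    additional_special_tokens=['<hashtag>'],  # custom special tokens still need to be declared
)
print(tokenizer.tokenize('<hashtag>'))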

@LysandreJik
Member

Hello! What is your tokenizer? Is it a WordPiece-based tokenizer, or a Byte-level BPE-based tokenizer like the original one from RoBERTa?

@manueltonneau
Author

Hi @LysandreJik, thanks for your reply and sorry that I'm just seeing this now. My tokenizer is a byte-level BPE-based tokenizer.
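
One way to confirm this from the loaded tokenizer itself (a small sketch; backend_tokenizer is only available on the fast, tokenizers-backed classes):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')
print(type(tokenizer).__name__)  # e.g. RobertaTokenizerFast
print(tokenizer.is_fast)         # True for a tokenizers-backed (fast) tokenizer
print(type(tokenizer.backend_tokenizer.model).__name__)  # e.g. BPE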

@manueltonneau
Author

Hi @LysandreJik, let me know if you have a solution for this or if you need more info, thanks a lot in advance :)

@NielsRogge
Contributor

NielsRogge commented Jul 2, 2021

Hi,

How did you add the additional special tokens? Did you start from a pre-trained RoBERTa, then add the additional special tokens, and further pre-train on a corpus?

Did you add these additional special tokens using the tokenizers library? Normally, one can add additional tokens as follows (based on huggingface/tokenizers#247 (comment)):

special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

However, printing the following:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272') 
print(tokenizer.additional_special_tokens)

Returns []. So you can solve it by doing:

special_tokens_dict = {'additional_special_tokens': ['<hashtag>']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

When I then test your example:

tokenizer.tokenize('<hashtag>')

I get: ['<hashtag>'].

And when doing:

tokenizer.convert_tokens_to_ids(tokenizer.tokenize("<hashtag>", add_special_tokens=True))

I get: [0, 7, 2].
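
To make this fix persistent, the updated special-token declaration can be written back next to the vocab files with save_pretrained (a minimal sketch; 'twibert-fixed' is just a placeholder output directory):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')
tokenizer.add_special_tokens({'additional_special_tokens': ['<hashtag>']})

# save_pretrained writes tokenizer_config.json and special_tokens_map.json alongside the vocab files,
# so the declaration is picked up again by the next from_pretrained call
tokenizer.save_pretrained('twibert-fixed')

reloaded = AutoTokenizer.from_pretrained('twibert-fixed')
print(reloaded.tokenize('<hashtag>'))  # expected: ['<hashtag>']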

@manueltonneau
Author

Awesome @NielsRogge, thanks a lot! Will test this and get back to you/close if solved.

@manueltonneau
Author

How did you add the additional special tokens? Did you start from a pre-trained RoBERTa, then add the additional special tokens, and further pre-train on a corpus?

I created a new vocab with the tokenizers module, to which I added new special tokens. Here is the code I used:

import os
import time

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train (args is the script's argument namespace, not shown here)
trainer = trainers.BpeTrainer(vocab_size=args.vocab_size, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
    "@USER",
    "HTTPURL",
    "<hashtag>",
    "</hashtag>",
], show_progress=True)
files = [os.path.join(args.corpus_dir, filename) for filename in os.listdir(args.corpus_dir)]
i = 0
start_time = time.time()
for file in files:
    print(f'Starting training on {file}')
    tokenizer.train([file], trainer=trainer)
    i = i + 1
    print(f'{i} files done out of {len(files)} files')
    print(f'Time elapsed: {time.time() - start_time} seconds')

# And save it (this writes only vocab.json and merges.txt for the BPE model)
output_dir = f'/scratch/mt4493/twitter_labor/twitter-labor-data/data/pretraining/US/vocab_files/{args.vocab_size}/{args.vocab_name}'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
tokenizer.model.save(output_dir)
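
A note on the last line: tokenizer.model.save(output_dir) only writes the BPE model files (vocab.json and merges.txt), so the special-token declarations passed to the trainer are not stored with them, which is presumably why additional_special_tokens came back empty on the transformers side. A minimal sketch of an alternative, assuming the same output_dir; the PreTrainedTokenizerFast loading step is an assumption, not part of the original setup:

# Save the whole tokenization pipeline (pre-tokenizer, decoder, added special tokens, ...)
tokenizer.save(os.path.join(output_dir, 'tokenizer.json'))

# On the transformers side, load that file and declare the special tokens explicitly
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file=os.path.join(output_dir, 'tokenizer.json'),
    bos_token='<s>',
    eos_token='</s>',
    unk_token='<unk>',
    pad_token='<pad>',
    mask_token='<mask>',
    additional_special_tokens=['@USER', 'HTTPURL', '<hashtag>', '</hashtag>'],
)
print(hf_tokenizer.tokenize('<hashtag>'))  # expected: ['<hashtag>']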

@manueltonneau
Author

Works fine, thanks again!
