
Tokenization issue with RoBERTa and DistilRoBERTa. #3867

Closed
2 of 4 tasks
vincentwen1995 opened this issue Apr 20, 2020 · 2 comments
Labels
Core: Tokenization · wontfix

Comments

vincentwen1995 commented Apr 20, 2020

🐛 Bug

Information

Model I am using (Bert, XLNet ...):
RoBERTa (roberta-base), DistilRoBERTa (distilroberta-base)
Language I am using the model on (English, Chinese ...):
English
The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

I am trying to encode embeddings for sentences, and I found a tokenization issue with a certain type of sentence that ends with ").". I noticed that the tokenizer cannot split ')' from '.', which further causes issues with the sentence length.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

Dataset: SemEval 2016 Task 5, SB1 EN-REST

To reproduce

Steps to reproduce the behavior:

See the following code:

import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer

text = '(Besides that there should be more restaurants like it around the city).'
for model_name in ['roberta-base', 'distilroberta-base']:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    token_dict = tokenizer.encode_plus(text, None, return_tensors='pt')
    
    print('model_name: {}'.format(model_name))
    print("Token (str): {}".format(
        tokenizer.convert_ids_to_tokens(token_dict['input_ids'][0])))
    print("Token (int): {}".format(token_dict['input_ids']))
    print("Type: {}".format(
        token_dict['token_type_ids']))
    print('Output Embeddings: {}\n'.format(
        model(token_dict['input_ids'])[0].shape))

Expected behavior

Expected output:

model_name: roberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[    0,    36, 41107,    14,    89,   197,    28,    55,  4329,   101,
            24,   198,     5,   343,    43,   479,     2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])

model_name: distilroberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[    0,    36, 41107,    14,    89,   197,    28,    55,  4329,   101,
            24,   198,     5,   343,    43,   479,     2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])

Basically, the expected behavior is for ')' and '.' to be tokenized separately. Furthermore, I am also curious about what these 'Ġ' characters in the RoBERTa encoding are. I checked the vocabulary and found both plain words and words starting with this 'Ġ' character, so I am a bit confused.

Environment info

  • transformers version: 2.5.1
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False

AdityaSoni19031997 commented Apr 21, 2020

Furthermore, I am also curious about what these 'Ġ' characters are in the RoBERTa encoding?

It's a feature of byte-level BPE: 'Ġ' is an encoded space character.
Ref: bart-fairseq, openai-gpt
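To make the "encoded space" point concrete: byte-level BPE first maps every raw byte to a printable unicode character so that spaces and control bytes stay visible in the vocabulary. A minimal sketch of that mapping, adapted from the `bytes_to_unicode` helper used in the GPT-2 codebase (which RoBERTa's tokenizer reuses):

```python
def bytes_to_unicode():
    """Map each byte 0-255 to a printable unicode character.

    Printable bytes map to themselves; the rest (including the space,
    0x20) are shifted into an unused unicode range so every byte gets
    a visible glyph. The space becomes chr(0x20 + 256) == 'Ġ'.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable byte: assign it the next codepoint above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # prints 'Ġ' -- the prefix seen on RoBERTa tokens
print(mapping[ord("A")])  # prints 'A' -- printable bytes are unchanged
```

So a vocabulary entry like 'Ġcity' simply means "city preceded by a space", while 'city' with no prefix means the word was glued to the previous token.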

@BramVanroy BramVanroy added the Core: Tokenization Internals of the library; Tokenization. label Apr 21, 2020

stale bot commented Jun 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
