
Tokenization issue with RoBERTa and DistilRoBERTa. #3867

Closed
2 of 4 tasks
vincentwen1995 opened this issue Apr 20, 2020 · 2 comments
Labels
Core: Tokenization · wontfix

Comments

vincentwen1995 commented Apr 20, 2020

🐛 Bug

Information

Model I am using (Bert, XLNet ...):
RoBERTa (roberta-base), DistilRoBERTa (distilroberta-base)
Language I am using the model on (English, Chinese ...):
English
The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

I am trying to encode embeddings for sentences, and I found a tokenization issue with a certain type of sentence that ends with ").". I noticed that the tokenizer cannot split ')' from '.', which further causes issues with the sentence length.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

Dataset: SemEval 2016 Task 5, SB1 EN-REST

To reproduce

Steps to reproduce the behavior:

See the following code:

import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer

text = '(Besides that there should be more restaurants like it around the city).'
for model_name in ['roberta-base', 'distilroberta-base']:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    token_dict = tokenizer.encode_plus(text, None, return_tensors='pt')
    
    print('model_name: {}'.format(model_name))
    print("Token (str): {}".format(
        tokenizer.convert_ids_to_tokens(token_dict['input_ids'][0])))
    print("Token (int): {}".format(token_dict['input_ids']))
    print("Type: {}".format(
        token_dict['token_type_ids']))
    print('Output Embeddings: {}\n'.format(
        model(token_dict['input_ids'])[0].shape))

Expected behavior

Expected output:

model_name: roberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[    0,    36, 41107,    14,    89,   197,    28,    55,  4329,   101,
            24,   198,     5,   343,    43,   479,     2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])

model_name: distilroberta-base
Token (str): ['<s>', 'Ġ(', 'Besides', 'Ġthat', 'Ġthere', 'Ġshould', 'Ġbe', 'Ġmore', 'Ġrestaurants', 'Ġlike', 'Ġit', 'Ġaround', 'Ġthe', 'Ġcity', ')', 'Ġ.', '</s>']
Token (int): tensor([[    0,    36, 41107,    14,    89,   197,    28,    55,  4329,   101,
            24,   198,     5,   343,    43,   479,     2]])
Type: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Output Embeddings: torch.Size([1, 17, 768])

Basically, the expected behavior is for ')' and '.' to be tokenized separately. Furthermore, I am also curious about what these 'Ġ' characters in the RoBERTa encoding are. I checked the vocabulary and found both plain words and words starting with this 'Ġ' character, so I am a bit confused.

Environment info

  • transformers version: 2.5.1
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False

AdityaSoni19031997 commented Apr 21, 2020

Furthermore, I am also curious about what these 'Ġ' characters are in the RoBERTa encoding?

It's a feature of byte-level BPE: 'Ġ' is an encoded space character.
Ref: bart-fairseq, openai-gpt
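To make the "encoded space" point concrete: byte-level BPE first maps every raw byte to a printable unicode character so that spaces and control bytes stay visible in the vocabulary. A minimal sketch of that mapping, adapted from the `bytes_to_unicode` helper used in the GPT-2 codebase (which RoBERTa's tokenizer reuses):

```python
def bytes_to_unicode():
    """Map each byte 0-255 to a printable unicode character.

    Printable bytes map to themselves; the rest (including the space,
    0x20) are shifted into an unused unicode range so every byte gets
    a visible glyph. The space becomes chr(0x20 + 256) == 'Ġ'.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable byte: assign it the next codepoint above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # prints 'Ġ' -- the prefix seen on RoBERTa tokens
print(mapping[ord("A")])  # prints 'A' -- printable bytes are unchanged
```

So a vocabulary entry like 'Ġcity' simply means "city preceded by a space", while 'city' with no prefix means the word was glued to the previous token.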

@BramVanroy BramVanroy added the Core: Tokenization Internals of the library; Tokenization. label Apr 21, 2020

stale bot commented Jun 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
