🐛 Bug

Information
Model I am using (Bert, XLNet ...):
RoBERTa (roberta-base), DistilRoBERTa (distilroberta-base)
Language I am using the model on (English, Chinese ...):
English
The problem arises when using:

- [ ] the official example scripts: (give details below)
- [x] my own modified scripts: (give details below)
I am trying to encode sentence embeddings, and I found a tokenization issue with a certain type of sentence that ends with ').'. The tokenizer does not split ')' from '.', which in turn causes issues with the sentence length.
The task I am working on is:

- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (give details below)
Dataset: SemEval 2016 Task 5, SB1 EN-REST
To reproduce
Steps to reproduce the behavior:
See the following code:

```python
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer

text = '(Besides that there should be more restaurants like it around the city).'

for model_name in ['roberta-base', 'distilroberta-base']:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    token_dict = tokenizer.encode_plus(text, None, return_tensors='pt')
    print('model_name: {}'.format(model_name))
    print("Token (str): {}".format(
        tokenizer.convert_ids_to_tokens(token_dict['input_ids'][0])))
    print("Token (int): {}".format(token_dict['input_ids']))
    print("Type: {}".format(token_dict['token_type_ids']))
    print('Output Embeddings: {}\n'.format(
        model(token_dict['input_ids'])[0].shape))
```
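Until the tokenizer splits these characters as expected, one minimal workaround sketch (plain string preprocessing before tokenization, not a transformers API) is to insert a space between ')' and a trailing '.', so the byte-level BPE sees two separate pieces:

```python
import re

# Hypothetical preprocessing step: rewrite ').' as ') .' in the raw string
# before tokenizing. This only touches the input text, not the tokenizer.
text = '(Besides that there should be more restaurants like it around the city).'
spaced = re.sub(r'\)\.', ') .', text)
print(spaced)  # ends with "city) ."
```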
Expected behavior

Basically, the expected behavior is to tokenize ')' and '.' separately. Furthermore, I am also curious about the 'Ġ' characters in the RoBERTa encoding: I checked the vocabulary and found both normal words and words starting with this 'Ġ' character, so I am a bit confused.
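On the prefix-character question: RoBERTa reuses GPT-2's byte-level BPE, which maps every raw byte to a visible unicode character so that tokens never contain literal whitespace. The 33 bytes 0x00–0x20 (including the space byte) are shifted up by 256, so a leading space is rendered as 'Ġ' (U+0120). A vocabulary entry such as a 'Ġ'-prefixed word means that word preceded by a space, while the same word without the prefix continues the previous token. A minimal sketch of that mapping:

```python
# The space byte 0x20 is the 33rd (index 32) of the bytes GPT-2's byte
# encoder treats as non-printable (0x00-0x20), and those are remapped to
# 256 + index, so a leading space becomes chr(288) == 'Ġ' (U+0120).
space_marker = chr(256 + 32)
print(space_marker, hex(ord(space_marker)))  # Ġ 0x120
```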
Environment info
- `transformers` version: 2.5.1
- Platform: Windows-10-10.0.18362-SP0
- Python version: 3.7.6
- PyTorch version (GPU?): 1.4.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: False
- Using distributed or parallel set-up in script?: False