bert-base-multilingual-uncased vocabulary not consecutive #990

@ntubertchen

🐛 Bug

When I save the bert-base-multilingual-uncased vocabulary, I get the warning "Saving vocabulary to ./vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted".

I ran the same commands on two different machines and got the same warning both times.

from pytorch_transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased', do_lower_case=True)
tokenizer.save_vocabulary('./')

I ran it on

  • OS:
  • Python version: python3.5
  • PyTorch version: pytorch1.0.1.post2
  • PyTorch Transformers version (or branch): 1.0
  • Using GPU? Yes
  • Distributed or parallel setup? No
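
The warning says the saved indices do not run 0, 1, 2, ... without gaps, so one way to narrow the problem down is to list the indices that no token maps to. The following is a minimal sketch, not the library's own check: `find_index_gaps` and `toy_vocab` are hypothetical names introduced for illustration, and the sketch assumes the vocabulary is a plain dict mapping each token to an integer index (the shape of `tokenizer.vocab` on `BertTokenizer`).

```python
def find_index_gaps(vocab):
    """Return every index in range(max_index + 1) that no token maps to."""
    if not vocab:
        return []
    used = set(vocab.values())
    return [i for i in range(max(used) + 1) if i not in used]


# Toy vocabulary (made-up data, not taken from the real model): index 2 is
# unused, which is the kind of gap the warning is about.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 3, "world": 4}
print(find_index_gaps(toy_vocab))
```

With the real tokenizer loaded, calling `find_index_gaps(tokenizer.vocab)` should show which positions are skipped; comparing those positions against the downloaded vocabulary file may then indicate whether the file itself or the loading step is responsible.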
