Compatibility with torchtext #69

joeddav · 2020-01-15T00:13:19Z

Normally when using a custom tokenizer with torchtext fields, you can pass the tokenizer function to the Field constructor and then build a vocab attribute which keeps track of the stoi mapping.

TEXT = Field(sequential=True, tokenize=my_tokenizer_fn)
TEXT.build_vocab(train_data) # builds the stoi/itos mapping
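For intuition about what a data-derived stoi/itos mapping looks like, here is a toy sketch. It is not torchtext's actual implementation: `build_toy_vocab`, the special tokens, and the `pretrained_ids` table are all made up for illustration. It assigns ids by descending token frequency, which is why ids derived from the training data generally won't line up with ids that a pretrained tokenizer has already fixed.

```python
from collections import Counter


def build_toy_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Toy frequency-based vocab builder (illustrative, not torchtext's code)."""
    counts = Counter(
        tok for toks in token_lists for tok in toks if tok not in specials
    )
    # Specials come first, then tokens by descending frequency,
    # with ties broken alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


corpus = [["the", "cat", "sat"], ["the", "dog"]]
stoi, itos = build_toy_vocab(corpus)

# Hypothetical fixed ids, as a pretrained tokenizer might ship them.
pretrained_ids = {"[UNK]": 0, "the": 5, "cat": 9, "dog": 7, "sat": 12}

# Tokens present in both mappings whose ids disagree.
mismatched = [
    tok for tok, idx in pretrained_ids.items()
    if tok in stoi and stoi[tok] != idx
]
```

In this toy case every shared token ends up in `mismatched`, since the two id assignments were produced independently.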

Since 🤗 tokenizers build their own vocab mappings, what's the best way to use them with torchtext, for example to use one of their datasets? If you just did the above, the TEXT.vocab mappings wouldn't match the tokenizer mappings. Unfortunately I haven't seen a simple way of using custom mappings in torchtext. The best solution I've found so far is to follow the above procedure and then manually override the TEXT vocab with the tokenizer one. So that would look something like this:

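from collections import defaultdict

from torchtext.data import Field  # torchtext.legacy.data in torchtext 0.9-0.11
from torchtext.datasets import WikiText2
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(...)
# Tokenize with the 🤗 tokenizer so torchtext sees the same tokens.
tokenizer_fn = lambda string: tokenizer.encode(string).tokens
TEXT = Field(sequential=True, tokenize=tokenizer_fn)
train, valid, test = WikiText2.splits(TEXT)
TEXT.build_vocab(train)  # creates TEXT.vocab, which is overridden below

def set_vocab_mapping(vocab, tokenizer, unk_token='[UNK]'):
    # Unknown tokens fall back to the tokenizer's own UNK id.
    stoi = defaultdict(lambda: tokenizer.token_to_id(unk_token))
    itos = []
    # Copy the tokenizer's id <-> token mapping in id order.
    for i in range(tokenizer._tokenizer.get_vocab_size()):
        token = tokenizer.id_to_token(i)
        stoi[token] = i
        itos.append(token)
    vocab.stoi = stoi
    vocab.itos = itos

set_vocab_mapping(TEXT.vocab, tokenizer)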

Is there a more straightforward way to do this? If not, it might be handy to have a helper function and/or example for others to reference since torchtext is so ubiquitous.
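As a rough idea of what such a helper and its sanity check could look like, here is a minimal sketch that runs without torchtext or tokenizers installed. `StubTokenizer` and the `SimpleNamespace` vocab are hypothetical stand-ins: the stub exposes `get_vocab_size`, `id_to_token`, and `token_to_id` directly, whereas the snippet above reaches `get_vocab_size` through `tokenizer._tokenizer`, so treat the exact call path on a real tokenizer as an assumption to verify.

```python
from collections import defaultdict
from types import SimpleNamespace


class StubTokenizer:
    """Hypothetical stand-in exposing only the methods the helper relies on."""

    def __init__(self, tokens):
        self._itos = list(tokens)
        self._stoi = {tok: i for i, tok in enumerate(self._itos)}

    def get_vocab_size(self):
        return len(self._itos)

    def id_to_token(self, i):
        return self._itos[i]

    def token_to_id(self, token):
        # Returns None for tokens outside the vocabulary.
        return self._stoi.get(token)


def set_vocab_mapping(vocab, tokenizer, unk_token="[UNK]"):
    """Overwrite vocab.stoi / vocab.itos with the tokenizer's own mapping."""
    unk_id = tokenizer.token_to_id(unk_token)
    if unk_id is None:
        raise ValueError(f"{unk_token!r} is not in the tokenizer vocabulary")
    itos = [tokenizer.id_to_token(i) for i in range(tokenizer.get_vocab_size())]
    # Look up the UNK id once, then fall back to it for unseen tokens.
    stoi = defaultdict(lambda: unk_id)
    stoi.update({tok: i for i, tok in enumerate(itos)})
    vocab.stoi, vocab.itos = stoi, itos


tokenizer = StubTokenizer(["[PAD]", "[UNK]", "hello", "world"])
vocab = SimpleNamespace(stoi={}, itos=[])  # stands in for TEXT.vocab
set_vocab_mapping(vocab, tokenizer)

# Sanity check: every id round-trips through both mappings.
round_trip_ok = all(vocab.stoi[vocab.itos[i]] == i for i in range(len(vocab.itos)))
```

Resolving the UNK id up front makes a missing `unk_token` fail immediately instead of silently mapping unknown tokens to `None`.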


xxllp · 2020-01-15T01:02:49Z

I'd also like to use it with TensorFlow Text.

github-actions · 2024-06-03T01:51:51Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

This was referenced Jan 17, 2020

Ending Notes Suggestions iamtrask/Grokking-Deep-Learning#30

Open

[Feature Request] drawio-observer-iframe jgraph/drawio#715

Closed

huggingface deleted a comment from prateekrastogi Feb 20, 2020

github-actions bot added the Stale label Jun 3, 2024

github-actions bot closed this as not planned Jun 9, 2024
