Compatibility with torchtext #69

Closed
joeddav opened this issue Jan 15, 2020 · 2 comments
joeddav commented Jan 15, 2020

Normally, when using a custom tokenizer with torchtext fields, you pass the tokenizer function to the Field constructor and then build the vocab attribute, which keeps track of the stoi (string-to-index) mapping.

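from torchtext.data import Field  # torchtext < 0.9; later releases moved it to torchtext.legacy.data

TEXT = Field(sequential=True, tokenize=my_tokenizer_fn)
TEXT.build_vocab(train_data)  # builds the stoi/itos mapping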
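For intuition, here is a rough plain-Python approximation of the kind of mapping build_vocab produces: special tokens first, then tokens ordered by frequency. It is only a sketch, not torchtext's actual implementation; the function name `build_frequency_vocab` and the toy corpus are made up for illustration, and the `<unk>`/`<pad>` specials are assumed to match the Field defaults.

```python
from collections import Counter


def build_frequency_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Hypothetical helper: specials first, then tokens by descending count."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort alphabetically, then by descending frequency; the second sort is
    # stable, so tokens with equal counts stay in alphabetical order.
    ordered = sorted(counts.items(), key=lambda kv: kv[0])
    ordered.sort(key=lambda kv: kv[1], reverse=True)
    itos = list(specials) + [tok for tok, _ in ordered if tok not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


# Toy, already-tokenized corpus (illustrative data only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
stoi, itos = build_frequency_vocab(corpus)
print(itos)
```

The ids here depend entirely on corpus frequencies, so there is no reason for them to line up with ids assigned by an external tokenizer.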

Since 🤗 tokenizers build their own vocab mappings, what's the best way to use them with torchtext, for example to use one of torchtext's datasets? If you just did the above, the TEXT.vocab mappings wouldn't match the tokenizer's mappings. Unfortunately, I haven't seen a simple way of using custom mappings in torchtext. The best solution I've found so far is to follow the above procedure and then manually override the TEXT vocab with the tokenizer's. That would look something like this:

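from collections import defaultdict

from tokenizers import BertWordPieceTokenizer
from torchtext.data import Field  # torchtext < 0.9; later releases moved it to torchtext.legacy.data
from torchtext.datasets import WikiText2

tokenizer = BertWordPieceTokenizer(...)
tokenizer_fn = lambda string: tokenizer.encode(string).tokens
TEXT = Field(sequential=True, tokenize=tokenizer_fn)
train, valid, test = WikiText2.splits(TEXT)
TEXT.build_vocab(train)

def set_vocab_mapping(vocab, tokenizer, unk_token='[UNK]'):
    # Overwrite the torchtext vocab so its ids match the tokenizer's ids.
    unk_id = tokenizer.token_to_id(unk_token)
    stoi = defaultdict(lambda: unk_id)  # unseen tokens fall back to the unk id
    itos = []
    # Use the public get_vocab_size() rather than the private _tokenizer attribute.
    for i in range(tokenizer.get_vocab_size()):
        token = tokenizer.id_to_token(i)
        stoi[token] = i
        itos.append(token)
    vocab.stoi = stoi
    vocab.itos = itos

set_vocab_mapping(TEXT.vocab, tokenizer)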

Is there a more straightforward way to do this? If not, it might be handy to have a helper function and/or example for others to reference since torchtext is so ubiquitous.
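One way to sanity-check an override like this is to run it against small stand-in objects and confirm that every id agrees afterwards. The sketch below is only an illustration under stated assumptions: `ToyTokenizer` and `ToyVocab` are hypothetical stand-ins (not part of tokenizers or torchtext) that expose just `get_vocab_size`, `id_to_token`, `token_to_id` and the `stoi`/`itos` attributes, and `mappings_agree` is a made-up helper name.

```python
from collections import defaultdict


class ToyTokenizer:
    """Hypothetical stand-in exposing only the methods the helper relies on."""

    def __init__(self, tokens):
        self._itos = list(tokens)
        self._stoi = {tok: idx for idx, tok in enumerate(self._itos)}

    def get_vocab_size(self):
        return len(self._itos)

    def id_to_token(self, idx):
        return self._itos[idx]

    def token_to_id(self, token):
        return self._stoi.get(token)  # None when the token is unknown


class ToyVocab:
    """Hypothetical stand-in for an object carrying stoi/itos attributes."""

    stoi = None
    itos = None


def set_vocab_mapping(vocab, tokenizer, unk_token="[UNK]"):
    """Copy the tokenizer's id mapping onto the vocab object."""
    unk_id = tokenizer.token_to_id(unk_token)
    if unk_id is None:
        raise ValueError(f"{unk_token!r} is not in the tokenizer vocabulary")
    stoi = defaultdict(lambda: unk_id)  # unseen tokens map to the unk id
    itos = []
    for idx in range(tokenizer.get_vocab_size()):
        token = tokenizer.id_to_token(idx)
        stoi[token] = idx
        itos.append(token)
    vocab.stoi = stoi
    vocab.itos = itos


def mappings_agree(vocab, tokenizer):
    """Return True if every token in vocab.itos has the tokenizer's id."""
    return all(vocab.stoi[tok] == tokenizer.token_to_id(tok) for tok in vocab.itos)


toy_tokenizer = ToyTokenizer(["[PAD]", "[UNK]", "the", "cat"])
toy_vocab = ToyVocab()
set_vocab_mapping(toy_vocab, toy_tokenizer)
print(mappings_agree(toy_vocab, toy_tokenizer))
```

Resolving the unk id once up front, and failing loudly if it is missing, avoids silently mapping unknown tokens to None; whether that is the right behavior for a real pipeline is a design choice.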


xxllp commented Jan 15, 2020

I also want to use it with TensorFlow Text.


github-actions bot commented Jun 3, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Jun 3, 2024
github-actions bot closed this as not planned Jun 9, 2024