
Unseen Vocab #63

Closed · siddsach opened this issue Nov 28, 2018 · 3 comments
siddsach commented Nov 28, 2018

Thank you so much for this well-documented and easy-to-understand implementation! I remember meeting you at WeCNLP, and I'm so happy to see you push out usable implementations of the state of the art in PyTorch for the community!

I have a question: The convert_tokens_to_ids method in the BertTokenizer that provides input to the BertEncoder uses an OrderedDict for the vocab attribute, which throws an error (e.g. KeyError: 'ketorolac') for any words not in the vocab. Can I create another vocab object that adds unseen words and use that in the tokenizer? Does the pretrained BertEncoder depend on the default id mapping?

It seems to me that, ideally, this repo would eventually incorporate character-level embeddings to deal with unseen words, but I don't know whether that is necessary for this use case.
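
For reference, a minimal sketch of the failure mode described above, assuming the pytorch-pretrained-bert API of that time ('ketorolac' is the word from the report; the surrounding sentence is made up):

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Whitespace-split words are looked up directly in the vocab OrderedDict,
# so any word missing from the WordPiece vocab raises a KeyError.
words = "the patient received ketorolac".split()
ids = tokenizer.convert_tokens_to_ids(words)  # KeyError: 'ketorolac'
```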

artemisart commented Nov 29, 2018

If you tokenize the input properly (call tokenize before convert_tokens_to_ids), it automatically falls back to subword/character-level(-like) embeddings.
You can add new words to the vocabulary, but you'll have to train the corresponding embeddings.
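
A minimal sketch of this, assuming bert-base-uncased and the pytorch-pretrained-bert API (the example sentence is made up):

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "The patient received ketorolac after surgery."

# tokenize() runs basic tokenization followed by WordPiece, so a word that is
# not in the vocab is split into smaller in-vocab pieces (down to single
# characters if necessary) rather than being looked up as-is.
tokens = tokenizer.tokenize(text)
print(tokens)  # the exact subword pieces depend on the vocabulary

# Every piece is now in the vocab, so the id lookup no longer raises KeyError.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
```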

@thomwolf (Member)

Hi @siddsach,
Thanks for your kind words!
@artemisart is right: BPE progressively falls back on character-level embeddings for unseen words.

@ilham-bintang

> If you tokenize the input properly (call tokenize before convert_tokens_to_ids), it automatically falls back to subword/character-level(-like) embeddings.
> You can add new words to the vocabulary, but you'll have to train the corresponding embeddings.

Hi, what do you mean by "tokenize the input properly (tokenize before convert_tokens)"?
Could you share a tokenization sample (before and after) or some sample code, if any? Thank you.
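
A small before/after sketch of what the quoted reply describes, assuming bert-base-uncased and the word from the original report (the exact subword pieces depend on the vocabulary):

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "the patient received ketorolac"

# Before: plain whitespace splitting keeps the unseen word intact,
# so convert_tokens_to_ids would fail on 'ketorolac'.
print(text.split())  # ['the', 'patient', 'received', 'ketorolac']

# After: tokenize() applies WordPiece, so the unseen word is broken into
# in-vocab subword pieces (continuation pieces are prefixed with '##').
tokens = tokenizer.tokenize(text)
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```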
