
Unseen Vocab #63

Closed · siddsach opened this issue Nov 28, 2018 · 3 comments
siddsach commented Nov 28, 2018

Thank you so much for this well-documented and easy-to-understand implementation! I remember meeting you at WeCNLP, and I'm so happy to see you push out usable implementations of the state of the art in PyTorch for the community!

I have a question: The convert_tokens_to_ids method in the BertTokenizer that provides input to the BertEncoder uses an OrderedDict for the vocab attribute, which throws an error (e.g. KeyError: 'ketorolac') for any words not in the vocab. Can I create another vocab object that adds unseen words and use that in the tokenizer? Does the pretrained BertEncoder depend on the default id mapping?

It seems to me that, ideally, this repo would eventually incorporate character-level embeddings to deal with unseen words, but I don't know whether that is necessary for this use case.
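
For reference, a minimal sketch of the failure mode described above, assuming the pytorch-pretrained-bert API of that time ('ketorolac' is the word from the report; the surrounding sentence is made up):

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Whitespace-split words are looked up directly in the vocab OrderedDict,
# so any word missing from the WordPiece vocab raises a KeyError.
words = "the patient received ketorolac".split()
ids = tokenizer.convert_tokens_to_ids(words)  # KeyError: 'ketorolac'
```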

artemisart commented Nov 29, 2018

If you tokenize the input properly (call tokenize before convert_tokens_to_ids), it automatically falls back to subword/character-level(-like) embeddings.
You can add new words to the vocabulary, but you'll have to train the corresponding embeddings.
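
A minimal sketch of this, assuming bert-base-uncased and the pytorch-pretrained-bert API (the example sentence is made up):

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "The patient received ketorolac after surgery."

# tokenize() runs basic tokenization followed by WordPiece, so a word that is
# not in the vocab is split into smaller in-vocab pieces (down to single
# characters if necessary) rather than being looked up as-is.
tokens = tokenizer.tokenize(text)
print(tokens)  # the exact subword pieces depend on the vocabulary

# Every piece is now in the vocab, so the id lookup no longer raises KeyError.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
```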

@thomwolf (Member)

Hi @siddsach,
Thanks for your kind words!
@artemisart is right: BPE progressively falls back on character-level embeddings for unseen words.

@ilham-bintang

> If you tokenize the input properly (call tokenize before convert_tokens_to_ids), it automatically falls back to subword/character-level(-like) embeddings.
> You can add new words to the vocabulary, but you'll have to train the corresponding embeddings.

Hi, what do you mean by "tokenize the input properly (tokenize before convert_tokens)"?
Could you share a tokenization sample (before and after) or some sample code, if any? Thank you.
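
A small before/after sketch of what the quoted reply describes, assuming bert-base-uncased and the word from the original report (the exact subword pieces depend on the vocabulary):

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "the patient received ketorolac"

# Before: plain whitespace splitting keeps the unseen word intact,
# so convert_tokens_to_ids would fail on 'ketorolac'.
print(text.split())  # ['the', 'patient', 'received', 'ketorolac']

# After: tokenize() applies WordPiece, so the unseen word is broken into
# in-vocab subword pieces (continuation pieces are prefixed with '##').
tokens = tokenizer.tokenize(text)
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```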
