Embedding issues #4

Closed
neerajvashistha opened this issue Jun 12, 2019 · 2 comments
Labels
bug Something isn't working

Comments

@neerajvashistha

Hi Edward,

I am building the latest version directly from GitHub and have encountered another issue.
What if there is no 'index' for a particular word? That was my first thought on seeing the error below.

  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/flow/sometimes.py", line 22, in augment
    augmented_text = aug.augment(augmented_text)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/base_augmenter.py", line 65, in augment
    return self.substitute(data)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/augmenter/word/word_embs_aug.py", line 57, in substitute
    candidate_words = self.model.predict(original_word, top_n=self.aug_n)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 48, in predict
    source_id = self.word2idx(word)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 26, in word2idx
    return self.w2i[word]
KeyError: 'thethe'

The error above is understandable, though it should be handled and is fixable. But consider the one below:

  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/flow/sometimes.py", line 22, in augment
    augmented_text = aug.augment(augmented_text)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/base_augmenter.py", line 65, in augment
    return self.substitute(data)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/augmenter/word/word_embs_aug.py", line 57, in substitute
    candidate_words = self.model.predict(original_word, top_n=self.aug_n)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 48, in predict
    source_id = self.word2idx(word)
  File "/home/projects/bot/bot_py3/faq/bot-ml/lib/python3.6/site-packages/nlpaug/model/word_embs/word_embeddings.py", line 26, in word2idx
    return self.w2i[word]
KeyError: 'How'

This should not happen, as there is definitely an embedding for 'How'.

@ricardopieper
Contributor

ricardopieper commented Jun 12, 2019

I've also seen this behavior. Perhaps it is due to cased vs. uncased embeddings.
For now I'm just lowercasing everything and retrying if anything goes wrong.
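
The lowercase-and-retry workaround described above could look roughly like the sketch below. It uses a plain dict in place of the model's `w2i` mapping, and `lookup_index` is a hypothetical helper, not part of nlpaug. It assumes the loaded embeddings are uncased, which the thread has not confirmed.

```python
from typing import Dict, Optional


def lookup_index(w2i: Dict[str, int], word: str) -> Optional[int]:
    """Return the vocabulary index for `word`, or None if it is unknown.

    Tries the word as given first, then its lowercased form, so that
    'How' can still match an uncased vocabulary entry 'how'.
    """
    for candidate in (word, word.lower()):
        if candidate in w2i:
            return w2i[candidate]
    return None  # truly out-of-vocabulary; the caller decides what to do


# Toy vocabulary standing in for the real word-to-index mapping (assumption).
w2i = {"how": 0, "are": 1, "you": 2}

print(lookup_index(w2i, "How"))     # 0
print(lookup_index(w2i, "thethe"))  # None
```

Returning `None` instead of raising `KeyError` lets the caller skip the word rather than abort the whole augmentation.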

makcedward added the bug label Jun 12, 2019
@makcedward
Owner

makcedward commented Jun 12, 2019

@neerajvashistha @ricardopieper
This happens with any of the augmenters based on traditional word embeddings (word2vec, GloVe, fastText). The root cause is that those words are out-of-vocabulary (OOV, i.e. unknown words).

The plan is to exclude OOV words during augmentation. In other words, OOV words will not be picked for augmentation. Although it is possible to compute a "most" similar word for an OOV word, I prefer to exclude them. The upcoming release will exclude OOV words.

Open to discussion if there is a better solution.
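
As a rough illustration of the "exclude OOV" idea, the sketch below filters token positions against the vocabulary before sampling which ones to augment. It is not the code from the referenced commit: `pick_augmentable_positions`, `max_words`, and the toy vocabulary are hypothetical names chosen for this example.

```python
import random
from typing import Dict, List, Optional


def pick_augmentable_positions(
    tokens: List[str],
    vocab: Dict[str, int],
    max_words: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Choose up to `max_words` token positions whose words are in `vocab`.

    OOV tokens are never selected, so a later embedding lookup on the
    chosen positions cannot raise KeyError.
    """
    rng = rng or random.Random()
    # Keep only positions whose token has an embedding.
    in_vocab = [i for i, tok in enumerate(tokens) if tok in vocab]
    # Never ask for more positions than are available.
    k = min(max_words, len(in_vocab))
    return sorted(rng.sample(in_vocab, k))


tokens = ["How", "are", "thethe", "you"]
vocab = {"are": 0, "you": 1}  # 'How' and 'thethe' are OOV here

# With max_words larger than the number of in-vocabulary tokens,
# every in-vocabulary position is returned.
print(pick_augmentable_positions(tokens, vocab, max_words=5))  # [1, 3]
```

If no token is in the vocabulary, the function returns an empty list and the text can be passed through unchanged.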

makcedward added a commit that referenced this issue Jun 13, 2019