
Using pre-trained embeddings for out of vocabulary words #6480

Closed

integrallyclosed opened this issue May 2, 2017 · 8 comments

Comments


integrallyclosed commented May 2, 2017

Hi,

I am training a text classification model with Embedding as the first layer. I am initializing this layer with pre-trained embeddings that were learned on a much larger dataset and consequently have a larger vocabulary. Once the model is trained, the Embedding layer only has weights corresponding to the vocabulary of the training data. At prediction time, when I encounter a word not in the training data, it gets assigned an out-of-vocabulary index and cannot use the weights from the pre-trained embeddings, even if the word exists there. Is there any way around this?
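For context, here is a minimal sketch of the setup being described, in Keras 2-style code; the layer sizes, input length, and matrix contents are placeholders rather than details from the original post:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 10000      # words seen in the training data (placeholder)
embedding_dim = 300     # dimension of the pre-trained vectors (placeholder)

# Row i holds the pre-trained vector for training word i; words without a
# pre-trained vector keep their zero (or random) initialization.
embedding_matrix = np.zeros((vocab_size + 1, embedding_dim))

model = Sequential()
model.add(Embedding(input_dim=vocab_size + 1,
                    output_dim=embedding_dim,
                    weights=[embedding_matrix],
                    input_length=100))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
```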


Xls1994 commented May 3, 2017

Hi,
The same word in the training and testing data should have the same index. If the testing data has some out-of-vocabulary words, you can add an "UNKNOWN" word to the vocabulary, give this special word a unique index, and randomly initialize its weights.
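A minimal sketch of that idea; the toy word_index, the dimensions, and the initialization range are just placeholders:

```python
import numpy as np

# word_index built from the training data; index 0 is reserved for padding.
word_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}
UNK_INDEX = len(word_index) + 1          # one extra index for unknown words

def texts_to_indices(tokens):
    """Map tokens to indices, sending out-of-vocabulary tokens to UNK_INDEX."""
    return [word_index.get(tok, UNK_INDEX) for tok in tokens]

# The UNK row (like any row without a pre-trained vector) is randomly initialized.
embedding_dim = 300
embedding_matrix = np.random.uniform(-0.05, 0.05, (UNK_INDEX + 1, embedding_dim))

print(texts_to_indices(['the', 'movie', 'was', 'terrible']))  # [1, 2, 3, 5]
```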


integrallyclosed commented May 3, 2017

Yes, I know that the same word in training and test will have the same index, and how unknown words are handled. My question is slightly different. Let's say I have a training set of 100k examples with a vocabulary size of 10k. I also have word vectors trained on a separate document corpus of size 10 million with a vocabulary of 500k, so I have pre-trained word vectors for an additional 490k words. At prediction time (on inputs that are not part of the original labeled dataset), the model will receive texts containing words that are not in the original training data but that may have pre-trained word vectors. It would be useful to leverage these pre-trained vectors, since some of these words might be synonyms of, or similar in meaning to, words in the training data, and the accuracy could be better than randomly initializing them and treating them all as unknown.


Xls1994 commented May 3, 2017

Oh, I see what you want to do for the unknown words: you want to use the pre-trained word vectors instead of randomly initializing them. But in Keras, the whole model is built before you load your data. In other words, the embedding layer has a fixed shape, and you cannot change it. If you already know which words appear only in the test data, you can add them to the embedding layer before you compile the model. But if you do not know which words those are, you may want to choose a more dynamic framework such as PyTorch.

@integrallyclosed

Figured out how to do this in Keras. Instead of fitting the tokenizer on the training data, I fit the tokenizer on the corpus from which the embeddings were generated. This saves all the pre-trained embeddings as part of the model.
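A rough sketch of that approach, assuming the pre-trained vectors are available as a gensim 3.x-style KeyedVectors file (path and format are placeholders). Fitting the tokenizer on the embedding vocabulary is used here as a stand-in for fitting on the full corpus, since it yields the same indexing:

```python
import numpy as np
from gensim.models import KeyedVectors
from keras.preprocessing.text import Tokenizer

# Load the pre-trained vectors (path and format are assumptions).
w2v = KeyedVectors.load_word2vec_format('pretrained_vectors.bin', binary=True)

# Fit the tokenizer on the embedding vocabulary instead of only the training
# texts, so every word that has a pre-trained vector receives an index.
tokenizer = Tokenizer(filters='', lower=False)
tokenizer.fit_on_texts([' '.join(w2v.index2word)])

# Build the (now much larger) embedding matrix from the indexed vocabulary.
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, w2v.vector_size))
for word, i in tokenizer.word_index.items():
    if word in w2v.vocab:
        embedding_matrix[i] = w2v[word]
```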

@aneesh-joshi

@integrallyclosed
Your approach might work here, but in real-world scenarios you won't have the test-set words.

@aneesh-joshi

A better solution would probably be to add all of the vectors in the pre-trained embedding to the Embedding layer. It will make the layer "heavier", but it should work.
What do you think, @integrallyclosed, @Xls1994?
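A sketch of what that could look like with a GloVe-style text file; the file name, dimensions, and the choice to freeze the layer are assumptions:

```python
import numpy as np
from keras.layers import Embedding

embedding_dim = 300
word_index = {}                        # index every word in the pre-trained file
vectors = [np.zeros(embedding_dim)]    # row 0 reserved for padding

with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        word_index[parts[0]] = len(vectors)
        vectors.append(np.asarray(parts[1:], dtype='float32'))

embedding_matrix = np.vstack(vectors)

# The layer now covers the full pre-trained vocabulary, which makes it
# "heavier" but lets words unseen at training time keep their vectors.
embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)
```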

@switchfootsid

Still, that won't work for real-world scenarios, i.e. when your corpus is domain-specific, with, say, "medical terms" or "fashion terms". One way I know of is to use FastText for subword/character-level embeddings (they can compose a vector for the OOV word).
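A small sketch of that idea with gensim's FastText (gensim 3.x-style API; the toy corpus and hyperparameters are placeholders):

```python
from gensim.models import FastText

# Toy domain-specific corpus; in practice this would be your own text.
sentences = [['the', 'patient', 'was', 'prescribed', 'ibuprofen'],
             ['the', 'dress', 'has', 'a', 'floral', 'pattern']]

# min_n / max_n set the character n-gram range used to build subword vectors.
model = FastText(sentences, size=100, window=3, min_count=1, min_n=3, max_n=6)

# A word never seen during training still gets a vector, composed from the
# vectors of its character n-grams.
oov_vector = model.wv['paracetamol']
print(oov_vector.shape)   # (100,)
```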

@datistiquo

> One way I know of is to use FastText for subword/character-level embeddings

How would you do that? I have the same issue as @integrallyclosed, and I also thought of adding the training data for my custom fastText embeddings to the embedding layer.

Can you fill in the value for the UNK index on the fly for each OOV token, @Xls1994? So you would have just one or two UNK indices, with the value coming from e.g. fastText subword embeddings?
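One possible reading of that idea, sketched below: reserve a few spare UNK rows in the embedding matrix and overwrite them with fastText subword vectors just before predicting. Everything here (model, ft_model, word_index, unk_slots, maxlen) is hypothetical, and this is only an interpretation of the question, not an established recipe:

```python
import numpy as np

def predict_with_oov(model, ft_model, word_index, tokens, unk_slots, maxlen):
    """Overwrite reserved embedding rows with fastText subword vectors for
    OOV tokens, then predict. `unk_slots` are row indices reserved for this
    purpose when the embedding matrix was built."""
    emb_layer = model.layers[0]              # assumes Embedding is the first layer
    matrix, = emb_layer.get_weights()
    spare = list(unk_slots)
    indices = []
    for tok in tokens:
        if tok in word_index:
            indices.append(word_index[tok])
        elif spare:
            slot = spare.pop()
            matrix[slot] = ft_model.wv[tok]  # subword vector for the OOV token
            indices.append(slot)
        else:
            indices.append(unk_slots[0])     # reuse the first slot if we run out
    emb_layer.set_weights([matrix])
    padded = np.zeros((1, maxlen), dtype='int32')
    padded[0, -len(indices):] = indices[:maxlen]
    return model.predict(padded)
```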
