
Using pre-trained embeddings for out of vocabulary words #6480

Closed

integrallyclosed opened this issue May 2, 2017 · 8 comments

Comments


integrallyclosed commented May 2, 2017

Hi,

I am training a text classification model with Embedding as the first layer. I am initializing this layer with pre-trained embeddings that were learned on a much larger dataset and consequently have a larger vocabulary. Once the model is trained, the Embedding layer only has weights corresponding to the vocabulary of the training data. At prediction time, when I encounter a word not in the training data, it gets assigned an out-of-vocabulary index and cannot use the weights from the pre-trained embeddings, even if the word exists there. Is there any way around this?
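For context, here is a minimal sketch of the setup being described, in Keras 2-style code; the layer sizes, input length, and matrix contents are placeholders rather than details from the original post:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 10000      # words seen in the training data (placeholder)
embedding_dim = 300     # dimension of the pre-trained vectors (placeholder)

# Row i holds the pre-trained vector for training word i; words without a
# pre-trained vector keep their zero (or random) initialization.
embedding_matrix = np.zeros((vocab_size + 1, embedding_dim))

model = Sequential()
model.add(Embedding(input_dim=vocab_size + 1,
                    output_dim=embedding_dim,
                    weights=[embedding_matrix],
                    input_length=100))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
```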


Xls1994 commented May 3, 2017

Hi,
The same word in the training and testing data should have the same index. If the testing data has some out-of-vocabulary words, you can add an "UNKNOWN" word to the vocabulary, give this special word a unique index, and randomly initialize its weights.
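A minimal sketch of that idea; the toy word_index, the dimensions, and the initialization range are just placeholders:

```python
import numpy as np

# word_index built from the training data; index 0 is reserved for padding.
word_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}
UNK_INDEX = len(word_index) + 1          # one extra index for unknown words

def texts_to_indices(tokens):
    """Map tokens to indices, sending out-of-vocabulary tokens to UNK_INDEX."""
    return [word_index.get(tok, UNK_INDEX) for tok in tokens]

# The UNK row (like any row without a pre-trained vector) is randomly initialized.
embedding_dim = 300
embedding_matrix = np.random.uniform(-0.05, 0.05, (UNK_INDEX + 1, embedding_dim))

print(texts_to_indices(['the', 'movie', 'was', 'terrible']))  # [1, 2, 3, 5]
```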


integrallyclosed commented May 3, 2017

Yes, I know that the same word in training and test will have the same index, and how unknown words are handled. My question is slightly different. Let's say I have a training set of 100k examples with a vocabulary size of 10k. I also have word vectors trained on a separate document corpus of size 10 million with a vocabulary of 500k, so I have pre-trained word vectors for an additional 490k words. At prediction time (on inputs that are not part of the original labeled dataset), the model will receive texts containing words that are not in the original training data but that may have pre-trained word vectors. It would be useful to leverage these pre-trained vectors, since some of these words might be synonyms of, or similar in meaning to, words in the training data, and the accuracy could be better than randomly initializing them and treating them all as unknown.


Xls1994 commented May 3, 2017

Oh, I see what you want to do for the unknown words: you want to use the pre-trained word vectors instead of randomly initializing them. But in Keras, the whole model is built before you load your data. In other words, the embedding layer has a fixed shape, and you cannot change it. If you already know which words appear only in the test data, you can add them to the embedding layer before you compile the model. But if you do not know which words those are, you may want to choose a more dynamic framework such as PyTorch.

@integrallyclosed

Figured out how to do this in Keras. Instead of fitting the tokenizer on the training data, I fit the tokenizer on the corpus from which the embeddings were generated. This saves all the pre-trained embeddings as part of the model.
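A rough sketch of that approach, assuming the pre-trained vectors are available as a gensim 3.x-style KeyedVectors file (path and format are placeholders). Fitting the tokenizer on the embedding vocabulary is used here as a stand-in for fitting on the full corpus, since it yields the same indexing:

```python
import numpy as np
from gensim.models import KeyedVectors
from keras.preprocessing.text import Tokenizer

# Load the pre-trained vectors (path and format are assumptions).
w2v = KeyedVectors.load_word2vec_format('pretrained_vectors.bin', binary=True)

# Fit the tokenizer on the embedding vocabulary instead of only the training
# texts, so every word that has a pre-trained vector receives an index.
tokenizer = Tokenizer(filters='', lower=False)
tokenizer.fit_on_texts([' '.join(w2v.index2word)])

# Build the (now much larger) embedding matrix from the indexed vocabulary.
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, w2v.vector_size))
for word, i in tokenizer.word_index.items():
    if word in w2v.vocab:
        embedding_matrix[i] = w2v[word]
```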

@aneesh-joshi

@integrallyclosed
Your approach might work here, but in real-world scenarios you won't have the test-set words.

@aneesh-joshi

A better solution would probably be to add all of the vectors in the pre-trained embedding to the Embedding layer. It will make the layer "heavier", but it should work.
What do you think, @integrallyclosed, @Xls1994?
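A sketch of what that could look like with a GloVe-style text file; the file name, dimensions, and the choice to freeze the layer are assumptions:

```python
import numpy as np
from keras.layers import Embedding

embedding_dim = 300
word_index = {}                        # index every word in the pre-trained file
vectors = [np.zeros(embedding_dim)]    # row 0 reserved for padding

with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        word_index[parts[0]] = len(vectors)
        vectors.append(np.asarray(parts[1:], dtype='float32'))

embedding_matrix = np.vstack(vectors)

# The layer now covers the full pre-trained vocabulary, which makes it
# "heavier" but lets words unseen at training time keep their vectors.
embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)
```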

@switchfootsid

Still, that won't work for real-world scenarios, i.e. when your corpus is domain-specific, with, say, "medical terms" or "fashion terms". One way I know of is to use FastText for subword/character-level embeddings (they can compose a vector for the OOV word).
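A small sketch of that idea with gensim's FastText (gensim 3.x-style API; the toy corpus and hyperparameters are placeholders):

```python
from gensim.models import FastText

# Toy domain-specific corpus; in practice this would be your own text.
sentences = [['the', 'patient', 'was', 'prescribed', 'ibuprofen'],
             ['the', 'dress', 'has', 'a', 'floral', 'pattern']]

# min_n / max_n set the character n-gram range used to build subword vectors.
model = FastText(sentences, size=100, window=3, min_count=1, min_n=3, max_n=6)

# A word never seen during training still gets a vector, composed from the
# vectors of its character n-grams.
oov_vector = model.wv['paracetamol']
print(oov_vector.shape)   # (100,)
```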

@datistiquo

> One way I know of is to use FastText for subword/character-level embeddings

How would you do that? I have the same issue as @integrallyclosed, and I also thought of adding the training data for my custom fastText embeddings to the embedding layer.

Can you fill in the value for the UNK index on the fly for each OOV token, @Xls1994? So you would have just one or two UNK indices, with the value coming from e.g. fastText subword embeddings?
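One possible reading of that idea, sketched below: reserve a few spare UNK rows in the embedding matrix and overwrite them with fastText subword vectors just before predicting. Everything here (model, ft_model, word_index, unk_slots, maxlen) is hypothetical, and this is only an interpretation of the question, not an established recipe:

```python
import numpy as np

def predict_with_oov(model, ft_model, word_index, tokens, unk_slots, maxlen):
    """Overwrite reserved embedding rows with fastText subword vectors for
    OOV tokens, then predict. `unk_slots` are row indices reserved for this
    purpose when the embedding matrix was built."""
    emb_layer = model.layers[0]              # assumes Embedding is the first layer
    matrix, = emb_layer.get_weights()
    spare = list(unk_slots)
    indices = []
    for tok in tokens:
        if tok in word_index:
            indices.append(word_index[tok])
        elif spare:
            slot = spare.pop()
            matrix[slot] = ft_model.wv[tok]  # subword vector for the OOV token
            indices.append(slot)
        else:
            indices.append(unk_slots[0])     # reuse the first slot if we run out
    emb_layer.set_weights([matrix])
    padded = np.zeros((1, maxlen), dtype='int32')
    padded[0, -len(indices):] = indices[:maxlen]
    return model.predict(padded)
```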
