
Update pretrained_word_embeddings.py #13073

Merged
merged 1 commit into keras-team:master from patch-1 on Jul 9, 2019

Conversation


@abhipn abhipn commented Jul 6, 2019

Summary

Hello,

Here's a small example showing why I think there's a mistake in the code.

Example Code
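
A minimal sketch of this kind of setup (the toy corpus and variable names are assumptions; MAX_NUM_WORDS = 3 and the 8-word vocabulary match the numbers used below):

import numpy as np   # used further below for the embedding matrix
from keras.preprocessing.text import Tokenizer

MAX_NUM_WORDS = 3

# Toy corpus with 8 unique words in total.
texts = ['the cat sat on the mat',
         'the dog chased the cat outside']

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)

print(len(tokenizer.word_index))            # 8 -- word_index always keeps every word
print(tokenizer.texts_to_sequences(texts))  # only indices 1 and 2 ever appear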

Now, even though we've specified

MAX_NUM_WORDS = 3

the tokenizer only keeps n - 1 words, i.e. 2. So we only have two words in our integer-coded corpus.

num_words = min(MAX_NUM_WORDS, len(tokenizer.word_index) + 1)      # i.e. min(3, 8 + 1) = 3
embedding_matrix = np.zeros((num_words, 10))
print(embedding_matrix.shape)   # (3, 10)

Then this is how we should create the embedding matrix:

for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:     # i.e. i >= 3; we don't want the word at index 3 included either
        continue
    embedding_vector = embeddings_index.get(word)   # embeddings_index: word -> pretrained vector
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Related Issues

None.

PR Overview

  • [n] This PR requires new unit tests [y/n] (make sure tests are included)
  • [n] This PR requires to update the documentation [y/n] (make sure the docs are up-to-date)
  • [n] This PR is backwards compatible [y/n]
  • [n] This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@fchollet fchollet merged commit 3bda552 into keras-team:master Jul 9, 2019
@abhipn abhipn deleted the patch-1 branch July 17, 2019 05:00

flow2k commented Sep 12, 2019

Great PR! Without it, we'd get errors at model-building time, because the Embedding layer would complain that the embedding matrix has one more row than its input_dim.

Note, as alluded to above, the MAX_NUM_WORDS parameter is the "effective vocabulary size" seen by the Embedding layer, and should be set to one larger than the number of words in the actual vocabulary.
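
For concreteness, a short sketch of how the corrected matrix would then be wired into the layer (the weights= keyword follows the common Keras 2 pattern; the exact arguments in the example script may differ):

from keras.layers import Embedding

# num_words and embedding_matrix come from the snippets above, so
# input_dim == embedding_matrix.shape[0] == MAX_NUM_WORDS and the shapes agree.
embedding_layer = Embedding(num_words,                    # input_dim
                            embedding_matrix.shape[1],    # output_dim (10 here)
                            weights=[embedding_matrix],
                            trainable=False)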

Does some pipeline need to be run for this PR's change to be reflected on the production page https://keras.io/examples/pretrained_word_embeddings/?
