
Update pretrained_word_embeddings.py #13073

Merged
merged 1 commit into keras-team:master from patch-1 on Jul 9, 2019

Conversation


@abhipn abhipn commented Jul 6, 2019

Summary

Hello,

Here's a small example showing why I think there's a mistake in the code.

Example Code
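
A minimal sketch of this kind of setup (the toy corpus and variable names are assumptions; MAX_NUM_WORDS = 3 and the 8-word vocabulary match the numbers used below):

import numpy as np   # used further below for the embedding matrix
from keras.preprocessing.text import Tokenizer

MAX_NUM_WORDS = 3

# Toy corpus with 8 unique words in total.
texts = ['the cat sat on the mat',
         'the dog chased the cat outside']

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)

print(len(tokenizer.word_index))            # 8 -- word_index always keeps every word
print(tokenizer.texts_to_sequences(texts))  # only indices 1 and 2 ever appear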

Now, even though we've specified

MAX_NUM_WORDS = 3

the tokenizer only keeps n - 1 words, i.e. 2. So we only have two words in our integer-coded corpus.

num_words = min(MAX_NUM_WORDS, len(tokenizer.word_index) + 1)      # i.e. min(3, 8 + 1) = 3
embedding_matrix = np.zeros((num_words, 10))
print(embedding_matrix.shape)   # (3, 10)

Then this is how we should create the embedding matrix:

for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:     # i.e. i >= 3; we don't want the word at index 3 included either
        continue
    embedding_vector = embeddings_index.get(word)   # embeddings_index: word -> pretrained vector
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Related Issues

None.

PR Overview

  • [n] This PR requires new unit tests [y/n] (make sure tests are included)
  • [n] This PR requires to update the documentation [y/n] (make sure the docs are up-to-date)
  • [n] This PR is backwards compatible [y/n]
  • [n] This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@fchollet fchollet merged commit 3bda552 into keras-team:master Jul 9, 2019
@abhipn abhipn deleted the patch-1 branch July 17, 2019 05:00

flow2k commented Sep 12, 2019

Great PR! Without it, we'd get errors at model-building time, because the Embedding layer would complain that the embedding matrix has one more row than its input_dim.

Note, as alluded to above, the MAX_NUM_WORDS parameter is the "effective vocabulary size" seen by the Embedding layer, and should be set to one larger than the number of words in the actual vocabulary.
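
For concreteness, a short sketch of how the corrected matrix would then be wired into the layer (the weights= keyword follows the common Keras 2 pattern; the exact arguments in the example script may differ):

from keras.layers import Embedding

# num_words and embedding_matrix come from the snippets above, so
# input_dim == embedding_matrix.shape[0] == MAX_NUM_WORDS and the shapes agree.
embedding_layer = Embedding(num_words,                    # input_dim
                            embedding_matrix.shape[1],    # output_dim (10 here)
                            weights=[embedding_matrix],
                            trainable=False)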

Does some pipeline need to be run for this PR's change to be reflected on the production page https://keras.io/examples/pretrained_word_embeddings/?
